Slide 1

Regression Analysis in the Literature

This class will conduct the regression analysis found in the following journal article:

Clifton P. Flynn, "Regional Differences in Attitudes Toward Corporal Punishment." Journal of Marriage & the Family. 56(2):314-324. 1994 May.

The data for this analysis is in the data set: CorporalPunishment.Sav. This file contains the reduced data set used in the article, not the entire sample from the 1988 General Social Survey. Specifically, the file includes only those subjects who were asked the question about corporal punishment.

The variables in this data set have been recoded from their original format on the GSS to the format used in the article. I believe that all of the values for the variables are identical to the data used in the article, with the exception of the variable for Conservative Protestant, which was not defined in complete detail in either the article or any of the references cited. I do believe that the codes are accurate for all but a half dozen cases. The numbers in our analyses will be close to the results obtained in the article, but not an exact match.


Slide 2

Stage 1: Definition Of The Research Problem

In the first stage, we state the research problem, specify the variables to be used in the analysis, and specify the method for conducting the analysis: standard multiple regression, hierarchical regression, or stepwise regression.


Relationship to be Analyzed

The question to be answered in the regression analyses is stated on page 316 of the article: to determine whether any regional differences in spanking attitudes remain after controlling for important social factors; and to assess whether the relationship between the control variables and attitudes toward corporal punishment is the same within each region.

Slide 3

Specifying the Dependent and Independent Variables

The dependent variable for this study was attitude toward spanking (SPANKING). This variable was measured using the following item: "Do you strongly agree, agree, disagree, or strongly disagree that it is sometimes necessary to discipline a child with a good, hard spanking?" Responses were coded from 1 to 4, with 4 being "strongly agree," so that a higher score indicated a more favorable attitude toward corporal punishment.

Region of the country served as the predictor independent variable (REGION).

The analysis included eight control independent variables:


• Income - measured using 20 income ranges, which were coded from lowest (1)--under $1,000--to highest (20)--$60,000 or over (INCOME86)

• Age (AGE)

• Number of children (CHILDS)

• Years of schooling completed (EDUC)

• Sex - coded as female and male (FEMALE)

• Race - coded as blacks and nonblacks (BLACK)

• Religion - coded as conservative Protestants versus all other respondents (CONSPROT)

• Native residence - coded as rural natives and nonrural natives. A rural native was defined as one who resided in a community of fewer than 50,000 people both at age 16 and at the time of the survey. (RURALNAT)

Slide 4

Method for including independent variables: standard, hierarchical, stepwise

While the structure of the question suggests a hierarchical regression, the author opts to do standard multiple regression. Both methods produce the same final model once all of the variables are entered, but the R² change statistic is not available in standard multiple regression. Since we are replicating the analysis, we will do standard multiple regression.


Slide 5

Stage 2: Develop The Analysis Plan: Sample Size Issues

In stage 2, we examine sample size and measurement issues.


Missing data analysis

To test for missing data, we run the SPSS script MissingDataCheck.SBS. This script tallies the number of missing variables per case and the number of missing cases per variable. It will filter out cases missing large numbers of variables, e.g. more than 50% of the variables included in the missing data check. The script also creates a pattern variable and a correlation matrix of dichotomous missing/valid variables. It does not do the t-tests or chi-square tests to compare missing and valid groupings on one variable to the pattern of values on other variables.
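The two tallies and the 50% filter described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the MissingDataCheck script itself; the three cases and their values are hypothetical, and `None` stands in for a missing value.

```python
# Hypothetical mini data set; None marks a missing value.
cases = [
    {"AGE": 34, "INCOME86": 12, "EDUC": 16},
    {"AGE": None, "INCOME86": None, "EDUC": 12},
    {"AGE": 51, "INCOME86": None, "EDUC": 10},
]
variables = ["AGE", "INCOME86", "EDUC"]

# Number of missing variables for each case.
missing_per_case = [sum(case[v] is None for v in variables) for case in cases]

# Number of cases missing each variable.
missing_per_variable = {
    v: sum(case[v] is None for case in cases) for v in variables
}

# Keep only cases missing no more than 50% of the checked variables.
threshold = 0.5 * len(variables)
retained = [c for c, m in zip(cases, missing_per_case) if m <= threshold]

print(missing_per_case)
print(missing_per_variable)
print(len(retained))
```

In this toy example the second case is missing two of three variables and is the only one filtered out.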

Slide 6

Run the MissingDataCheck Script

First, select the 'Run Script...' command from the Utilities menu.

Second, navigate to the SW388R7 folder where we downloaded the script files.

Third, highlight the "MissingDataCheck" script.

Fourth, click on the Run button.


Slide 7

Complete the 'Check for Missing Data' Dialog Box

First, move the following variables to the list box for testing variables: CHILDS 'NUMBER OF CHILDREN', AGE 'AGE OF RESPONDENT', EDUC 'HIGHEST YEAR OF SCHOOL COMPLETED', REGION 'REGION OF INTERVIEW', INCOME86 'TOTAL FAMILY INCOME', SPANKING 'FAVOR SPANKING TO DISCIPLINE CHILD', CONSPROT 'Conservative Protestant', BLACK 'Black', FEMALE 'Female', and RURALNAT 'Rural Native'.

Second, accept the defaults, and click on the OK button to produce the output for the missing data check.


Slide 8

The Number of Missing Cases for Each Variable


In the frequency table that lists each variable and the number of missing and valid cases, we see that the variable with the largest number of missing cases is Total Family Income, with 80 cases. The number of missing cases for the other variables is negligible.

Slide 9

Patterns of Missing Variables


Only one case had two missing variables, Age and Total Family Income. The most prevalent pattern concerns the Total Family Income variable, which we had already detected. There does not appear to be any pattern of missing variables about which we should be concerned.

Slide 10

The Correlation Matrix of Valid/Missing Dichotomous Variables


In the segment of the matrix shown below, there are not any correlations above 0.40. If we inspect the remainder of the matrix in the output, we will not find any moderate correlations. The correlation matrix does not support the presence of a missing data process.

Slide 11

Recoding for Missing Data

The authors of the article opt to do substitution for the missing cases. For the metric variables we can specify Mean Substitution on the Regression command to solve this problem. For the nonmetric variables, the authors employ modal substitution, which I take to mean substituting the largest category on each of the nonmetric variables that have missing data (Conservative Protestant and Rural Native). We can accomplish this modal substitution by running the following syntax commands:

RECODE consprot ruralnat (MISSING=0).
EXECUTE.
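As a rough illustration of what modal substitution does, the Python sketch below replaces each missing value with the most frequent observed category. The function name `modal_substitute` and the sample values are hypothetical; the RECODE command above hard-codes 0 on the assumption that 0 is the largest category for both variables.

```python
from collections import Counter


def modal_substitute(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical dummy-coded values with two missing entries.
consprot = [0, 1, 0, None, 0, 1, None]
filled = modal_substitute(consprot)
print(filled)
```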


Slide 12

Power to Detect Relationships

The table on page 165 of the text indicates that our sample size will show statistical significance for very small R² values, e.g. 2%, even at the conservative alpha value of 0.01. I extracted this from the sample size row of 1000 and the number of independent variables of 10.


Minimum Sample Size Requirement: 15-20 Cases Per Independent Variable

A sample size of 978 cases and 11 independent variables (the eight control variables plus the three dummy coded variables for region) produces a case to variable ratio of 89 to 1, far in excess of the guideline of 15-20 cases per independent variable.
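The ratio quoted above is simple arithmetic, and a short Python check makes it explicit. The variable names are illustrative only.

```python
n_cases = 978
n_predictors = 8 + 3  # eight control variables plus three region dummies

ratio = n_cases / n_predictors

# Compare against the upper end of the 15-20 cases-per-variable guideline.
meets_guideline = ratio >= 20

print(round(ratio, 1), meets_guideline)
```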

Slide 13

Stage 2: Develop The Analysis Plan: Measurement Issues

In this part of stage 2, we examine measurement issues.


Incorporating Nonmetric Data with Dummy Variables

The nonmetric variables for female, black, conservative protestant, and rural have already been dummy-coded. The original region variable contains nine categories. To see the codes for these nine categories, select the Variables... command from the Utilities menu and select the region variable.

The division for regions in the article uses four categories: Northeast, West, Midwest, and South. The dummy coding for four regions would normally require the creation of three new variables. However, we created the full set of four to support the individual regressions for each region. These variables have already been added to the data set.
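The collapse from nine region codes to four dummy variables could be sketched as follows. The grouping of the numeric codes is an assumption (codes 1-9 taken to run from New England through Pacific and grouped by Census region); it should be checked against the value labels shown by the Variables... command. NORTHEAS, MIDWEST, and WEST are the names used in this data set, while SOUTH is a hypothetical name for the fourth dummy.

```python
# Assumed mapping of the nine region codes to four regions (verify against
# the value labels in the data set before relying on it).
REGION_GROUP = {
    1: "NORTHEAS", 2: "NORTHEAS",
    3: "MIDWEST", 4: "MIDWEST",
    5: "SOUTH", 6: "SOUTH", 7: "SOUTH",
    8: "WEST", 9: "WEST",
}
DUMMIES = ["NORTHEAS", "MIDWEST", "SOUTH", "WEST"]


def dummy_code(region_code):
    """Return the full set of four 0/1 dummy variables for one case."""
    group = REGION_GROUP[region_code]
    return {name: int(name == group) for name in DUMMIES}


print(dummy_code(1))
print(dummy_code(6))
```

Only three of the four dummies enter a single regression; the omitted one serves as the reference category.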

Slide 14


Representing Curvilinear Effects with Polynomials

We do not have any evidence of curvilinear effects at this point in the analysis.

Representing Interaction or Moderator Effects

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.

Slide 15

Stage 3: Evaluate Underlying Assumptions

In this stage, we verify that all of the independent variables are metric or dummy-coded, and we test for normality of the metric variables, linearity of the relationships between the dependent and the independent variables, and homogeneity of variance for the nonmetric independent variables.


Metric Dependent Variable and Metric or Dummy-coded Independent Variables

All of the variables in the analysis are metric or have already been dummy coded.

Slide 16

Normality of metric variables

The null hypothesis in the K-S Lilliefors test of normality is that the data for the variable is normally distributed. The desirable outcome for this test is to fail to reject the null hypothesis. When the probability value in the Sig. column is less than 0.05, we conclude that the variable is not normally distributed.

If a variable is not normally distributed, we can try three transformations (logarithmic, square root, and inverse) to see if we can induce the distribution of cases to fit a normal distribution. If one or more of the transformations induces normality, we have the option of substituting the transformed variable in the analysis to see if it improves the strength of the relationship.
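The three transformations can be sketched in Python as shown here. This is not the script's own code: the function name, the use of base-10 logarithms, and the optional `offset` (added so that values of zero, such as a respondent with no children, remain defined) are all assumptions made for illustration.

```python
import math


def transformations(x, offset=0.0):
    """Return the log, square root, and inverse of (x + offset).

    The shifted value must be positive for all three to be defined.
    """
    shifted = x + offset
    return {
        "log": math.log10(shifted),
        "sqrt": math.sqrt(shifted),
        "inverse": 1.0 / shifted,
    }


# A value of 3 children, shifted by 1 so that 0 children would also be valid.
t = transformations(3, offset=1)
print(t)
```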

To test for normality, we will run the NormalityAssumptionAndTransformations.SBS script and test the metric variables: CHILDS 'NUMBER OF CHILDREN', AGE 'AGE OF RESPONDENT', EDUC 'HIGHEST YEAR OF SCHOOL COMPLETED', INCOME86 'TOTAL FAMILY INCOME', and SPANKING 'FAVOR SPANKING TO DISCIPLINE CHILD'.


Slide 17

Run the 'NormalityAssumptionAndTransformations' Script



Third, highlight the "NormalityAssumptionAndTransformations" script.



Slide 18

Complete the 'Test for Assumption of Normality' Dialog Box

First, move the metric variables that will be used in the regression to the 'Variable to Test: ' list box.

Second, click on the OK button to produce the output for the normality tests.


Slide 19

Output for the Tests of Normality

Explore the output for the tests of normality to confirm that all the metric variables failed the normality test and, furthermore, none of the transformations produced a variable that was normally distributed.


Slide 20

Linearity between metric independent variables and dependent variable

Another script, 'LinearityAssumptionAndTransformations' tests for linearity of relationships between the dependent variable and each of the independent variables. Since there is no simple score that indicates whether or not a relationship is linear or nonlinear, a scatterplot matrix is created for the dependent variable, the independent variable, and transformations of the independent variable. The user can visually inspect the scatterplot matrix for evidence of nonlinearity. If nonlinearity is evident, but not corrected by a transformation of the independent variable, transformation of the dependent variable is available. More detailed information is available by requesting a correlation matrix for the variables included in the scatterplot matrix. If the scatterplot matrix does not provide sufficient detail, individual scatterplots overlaid with fit lines can be requested.

When we run the script as described below, we do not find any nonlinear relationships between the dependent variable and the metric independent variables.


Slide 21

Run the 'LinearityAssumptionAndTransformations' Script



Third, highlight the "LinearityAssumptionAndTransformations" script.



Slide 22

Complete the 'Check for Linear Relationships' Dialog Box

First, move the variable that will be used for the dependent variable in the regression to the 'Dependent (Y) Variable: ' list box.

Second, move the metric variables that will be used as independent variables in the regression to the 'Independent (X) Variables: ' list box.

Third, click on the OK button to produce the output for the linearity tests.


Slide 23

The Scatterplot Matrix


The scatterplot matrices produced for this analysis have a different look because the dependent variable contains only four discrete categories, so all of the values will line up in four vertical columns.

The relationship can still be evaluated for linearity by determining whether or not the fit line crosses the vertical columns in an ascending or descending pattern from left to right, or whether the bars in the center are higher, or lower, than the surrounding bars.

Slide 24

Constant variance across categories of nonmetric independent variables

Another script, 'HomoscedasticityAssumptionAndTransformations' tests for homogeneity of variance across groups designated by nonmetric independent variables. The script uses the One-Way ANOVA procedure to produce a Levene test of the homogeneity of variance.

The null hypothesis of the Levene test is that the variances of all groups of the independent variable are equal. If the Sig value associated with the test is greater than the alpha level, we fail to reject the null hypothesis and conclude that the variance for all groups is equivalent.

If the Sig value associated with the test is less than the alpha level, we reject the null hypothesis and conclude that the variance of at least one group is different. If we fail the homogeneity of variance test, we can attempt to correct the problem by applying the transformations for normality to the dependent variable specified in the test. The script computes the transformations and applies the Levene test to the transformed variables. If one of the transformed variables corrects the problem, we can consider substituting it for the original form of the variable.
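For readers who want to see what the Levene statistic measures, here is a minimal Python sketch of its mean-centered form. It computes only the W statistic; the Sig value would come from comparing W to an F distribution with k - 1 and N - k degrees of freedom, which is not done here. The function name and the example groups are hypothetical.

```python
from statistics import mean


def levene_w(groups):
    """Levene W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean.
    z = [[abs(y - mean(g)) for y in g] for g in groups]
    z_bar_i = [mean(zi) for zi in z]
    z_bar = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zb - z_bar) ** 2 for zi, zb in zip(z, z_bar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, z_bar_i) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Two groups with identical spread, then two groups with different spread.
print(levene_w([[1, 2, 3, 4], [11, 12, 13, 14]]))
print(levene_w([[1, 2, 3, 4], [2, 4, 6, 8]]))
```

A W of zero means the groups have the same average absolute deviation; larger values point toward unequal variances.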


Slide 25

Run the 'HomoscedasticityAssumptionAndTransformations' Script



Third, highlight the "HomoscedasticityAssumptionAndTransformations" script.



Slide 26

Complete the 'Test for Assumption of Homogeneity of Variance' Dialog Box

First, move the dependent variable for the regression to the 'Dependent (Y) Variable: ' list box.

Second, move the nonmetric independent variables to the 'Nonmetric Independent (X) Variables: ' list box.

Third, click on the OK button to produce the output for the homogeneity of variance tests.


Slide 27

Output for the Test of Homogeneity of Variances

The output for the tests of homogeneity of variances indicates that the variables FEMALE 'Female' and NORTHEAS 'Northeast Region' do not pass the Levene test. An inverse transformation of the dependent variable would correct the problem for the female variable, while a log or square root transformation corrects the homogeneity problem for the northeast variable. Since no transformation would solve both problems and since we did not find a transformation that corrected the normality problem for the spanking variable, we will not consider a transformation any further.


Slide 28

Stage 4: Compute the Statistics And Test Model Fit: Computations

In this stage, we compute the actual statistics to be used in the analysis. Regression requires that we specify a variable selection method. The article specifies standard multiple regression.


Compute the Regression Model

The first task in this stage is to request the initial regression model and all of the statistical output we require for the analysis.

Slide 29

Request the Regression Analysis

Choose 'Regression | Linear...' from the Analyze menu.


Slide 30

Specify the Dependent and Independent Variables and the Variable Selection Method

First, in the Linear Regression dialog box, move the variable SPANKING 'FAVOR SPANKING TO DISCIPLINE CHILD' to the 'Dependent: ' text box.

Second, move the variables CHILDS 'NUMBER OF CHILDREN', AGE 'AGE OF RESPONDENT', EDUC 'HIGHEST YEAR OF SCHOOL COMPLETED', INCOME86 'TOTAL FAMILY INCOME', CONSPROT 'Conservative Protestant', BLACK 'Black', FEMALE 'Female', RURALNAT 'Rural Native', NORTHEAS 'Northeast Region', WEST 'West Region', and MIDWEST 'Midwest Region' to the 'Independent(s): ' list box.

Third, select 'Enter' from the 'Method: ' drop down menu to request a standard multiple regression.


Slide 31

Specify the Statistics Options

First, click on the 'Statistics...' button to access the 'Linear Regression: Statistics' dialog box.


Slide 32

Complete the Linear Regression Statistics Dialog

First, mark the 'Model fit' checkbox, the 'Descriptives' checkbox, and the 'Collinearity diagnostics' checkbox.

Second, on the 'Residuals' panel, mark the 'Durbin-Watson' checkbox, the 'Casewise diagnostics' checkbox, the 'Outliers' option button, and fill in a '3' in the 'Outliers outside ... standard deviations' text box.

Third, click on the 'Continue' button to close the dialog box.


Slide 33

Specify the Plots to Include in the Output

First, click on the 'Plots...' button to access the dialog box.


Slide 34

Complete the Linear Regression Plots dialog box

First, move the '*SRESID' variable name to the 'Y: ' text box for the residual scatter plot.

Second, move the '*ZPRED' variable name to the 'X: ' text box for the residual scatter plot.

Third, mark the 'Normal probability plot' checkbox on the 'Standardized Residual Plots' panel.

Fourth, mark the 'Produce all partial plots' checkbox.

Fifth, click on the Continue button.


Slide 35

Specify Diagnostic Statistics to Save to the Data Set

First, click on the 'Save...' button to access the dialog box.


Slide 36

Complete the Linear Regression Save dialog box.

First, mark the 'Mahalanobis' checkbox and the 'Cook's' checkbox on the 'Distances' panel.

Second, click on the Continue button.


Slide 37

Specify the Mean Substitution Option


Slide 38

Complete the Linear Regression: Options dialog box

First, mark the 'Replace with mean' option button on the 'Missing Values' panel.

Second, click on the Continue button to close the dialog.

Third, click on the OK button on the Linear Regression dialog box to complete the regression analysis request.


Slide 39

Stage 4: Compute the Statistics And Test Model Fit: Model Fit

In this stage, we examine the relationships between our independent variables and the dependent variable.

First, we look at the F test for R Square which is associated with the overall relationship between the dependent variable and the set of independent variables. The F statistic tests the hypothesis that there is no relationship between the dependent variable and the set of independent variables, i.e. the null hypothesis is: R² = 0. If we cannot reject this null hypothesis, then our analysis is concluded; there is no relationship between the dependent variable and the independent variables that we can interpret.

If we reject the null hypothesis and conclude that there is a relationship between the dependent variable and the set of independent variables, then we examine the table of coefficients to identify which independent variables have a statistically significant individual relationship with the dependent variable. For each independent variable in the analysis, a t-test is computed to test whether the slope of the regression line (B) between the independent variable and the dependent variable differs from zero. The null hypothesis is that the slope is zero, i.e. B = 0, implying that the independent variable has no relationship to scores on the dependent variable. This part of the analysis is more important in standard multiple regression, where we enter all of the independent variables into the regression at one time, and in hierarchical multiple regression, where we specify the order of entry of independent variables, than it is in stepwise multiple regression, where the computer picks the order of entry and stops adding variables when some statistical limit is reached. In stepwise regression, we would expect all of the individual variables that passed the statistical criterion for entry to have a significant individual relationship with the dependent variable.


Slide 40

Stage 4: Compute the Statistics And Test Model Fit: Model Fit

When we are determining which independent variables have a significant relationship with the dependent variable, we are often interested in the relative importance of the predictor variables in predicting the dependent variable. To answer this question, we will examine the Beta coefficients, the standardized version of the coefficients of the individual independent variables.
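A Beta coefficient is the B coefficient rescaled by the ratio of the standard deviation of the independent variable to that of the dependent variable. The Python sketch below shows this rescaling on hypothetical data; the function name and the numbers are illustrative and are not taken from the SPSS output.

```python
from statistics import stdev


def standardized_beta(b, x_values, y_values):
    """Rescale an unstandardized slope B into a standardized Beta."""
    return b * stdev(x_values) / stdev(y_values)


# Hypothetical data in which y is exactly 2 * x, so B = 2 and Beta = 1.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
beta = standardized_beta(2.0, x, y)
print(round(beta, 3))
```

Because Beta is unit-free, coefficients for variables measured on different scales (years of schooling, income ranges) can be compared directly.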


Slide 41

1. Significance Test of the Coefficient of Determination R Square


In the ANOVA table, we see that R² of 0.111 is associated with a statistically significant relationship (sig < 0.0001) between the dependent variable and the independent variables. We reject the null hypothesis that R² is zero and conclude that there is a relationship between the independent variables and the dependent variable.
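The F statistic behind this test can be approximated from R² alone using F = (R²/k) / ((1 - R²)/(n - k - 1)). The Python sketch below applies that formula with the values reported in this class (R² = 0.111, n = 978, k = 11). Because R² is rounded to three decimals, the result is only an approximation of the F value printed in the ANOVA table.

```python
def f_from_r_squared(r_squared, n_cases, n_predictors):
    """Overall F statistic for the null hypothesis that R-squared is zero."""
    df_model = n_predictors
    df_residual = n_cases - n_predictors - 1
    return (r_squared / df_model) / ((1 - r_squared) / df_residual)


f_stat = f_from_r_squared(0.111, 978, 11)
print(round(f_stat, 2))
```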

Slide 42

2. Significance Test of Individual Regression Coefficients


The interpretation of the coefficients would be identical to that found in the article in the first column on page 320. Looking at the Sig column for the tests of individual coefficients, we see that number of children, years of education, conservative Protestant, race (Black), rural native, and Northeast region all have significant individual relationships with the dependent variable of favoring spanking. (Note that the numbers in our table follow the same general pattern as the article, but are not identical.)

Slide 43

Stage 4: Compute the Statistics And Test Model Fit: Meeting Assumptions

Using output from the regression analysis to examine the conformity of the regression analysis to the regression assumptions is often referred to as "Residual Analysis" because it focuses on the component of the variance which our regression model cannot explain. Using the regression equation, we can estimate the value of the dependent variable for each case in our sample. This estimate will differ from the actual score for each case by an amount referred to as the residual. Residuals are a measure of the unexplained variance, or error, in the dependent variable that cannot be predicted by the regression equation.


Slide 44

Linearity and Constant Variance for the Dependent Variable - Residual Plot

The residual plot shows the pattern that is associated with a discrete dependent variable, i.e. if there are four possible values for the dependent variable, the residuals will fall in bands associated with these values. If the residuals are randomly distributed, the bands will tend to fill the graphic space both vertically and horizontally. The lack of data points in the upper righthand corner and lower lefthand corner is a result of the restricted range of the possible values of the dependent variable and is not indicative of a problem with underlying assumptions.

I cannot identify any violation of homoscedasticity or nonlinear trend in the plot of residuals. The tendency to have a larger number of points in the left half of the chart is associated with the normality problem of the dependent variable which we could not correct.


[Figure: Scatterplot. Dependent Variable: FAVOR SPANKING TO DISCIPLINE CHILD. X-axis: Regression Standardized Predicted Value. Y-axis: Regression Studentized Residual.]

Slide 45

Normal Distribution of Residuals - Normality Plot of Residuals


To check for meeting the assumption that the residuals or error terms are normally distributed, we look at the Normal p-p plot of Regression Standardized Residual as shown to the right.

Our criterion is the degree to which the plot for the actual values coincides with the green line of expected values. For this problem, the plot of residuals fits the expected pattern well enough to support a conclusion that the residuals are normally distributed. If a more exact computation is desired, we instruct SPSS to save the residuals in our data file and do a test of normality on the residual values using the Explore command.

Slide 46

Linearity of Independent Variables - Partial Plots


To verify the assumption of linearity for the metric independent variables, we examine the partial regression plots. The partial plots for the numeric variables, Age, Number of Children, Education level, and Income in this analysis do not show any nonlinear patterns. All show a weak linear relationship, like the partial plot for number of children shown below.

We ignore the partial plots for the dummy-coded variables as extraneous output since linearity is not an expectation for nonmetric variables.

Slide 47

Independence of Residuals - Durbin-Watson Statistic


The next assumption is that the residuals are not correlated serially from one observation to the next. This means the size of the residual for one case has no impact on the size of the residual for the next case. While this is particularly a problem for time-series data, SPSS provides a simple statistical measure for serial correlation for all regression problems. The Durbin-Watson Statistic is used to test for the presence of serial correlation among the residuals. Unfortunately, SPSS does not print the probability for accepting or rejecting the presence of serial correlation, though probability tables for the statistic are available in other texts.

The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are uncorrelated if the Durbin-Watson statistic is approximately 2. A value close to 0 indicates strong positive correlation, while a value close to 4 indicates strong negative correlation.

For our problem, the value of Durbin-Watson is 1.909, approximately equal to 2, indicating no serial correlation.
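The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below computes it for two hypothetical residual series to show how the statistic moves toward 0 or toward 4; the series are made up for illustration and are not the residuals from this analysis.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Residuals that repeat the same value: strong positive serial correlation.
print(durbin_watson([1, 1, 1, 1]))

# Residuals that alternate in sign: strong negative serial correlation.
print(durbin_watson([1, -1, 1, -1]))
```

With only four observations the alternating series reaches 3.0 rather than 4; the upper bound is approached as the series gets longer.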

Slide 48

Identifying Dependent Variable Outliers - Casewise Plot of Standardized Residuals


As shown in the following table of Residual Statistics, all standardized residuals (Std. Residual) fell within +/- 3 standard deviations. We do not have cases where the value of the dependent variable indicates an outlier.

Slide 49

Identifying Independent Variable Outliers - Mahalanobis Distance

To identify only the most extreme outliers, the level of significance used is set to 0.001. The critical value of Mahalanobis Distance for 11 independent variables (chi-square with 11 degrees of freedom) is 31.264. The table of residual statistics shows that the maximum Mahalanobis distance found for any case is 30.831, which falls just below the critical value. No case is flagged as an outlier on the independent variables, and in any event a single borderline case out of 978 would not materially affect the analysis.


Slide 50

Identifying Influential Cases - Cook's Distance

Cook's distance identifies cases that are influential or have a large effect on the regression solution and may be distorting the solution for the remaining cases in the analysis. While we cannot associate a probability with Cook's distance, we can identify problematic cases that have a score larger than the criterion computed using the formula: 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. For this problem, which has 978 subjects and 11 independent variables, the formula equates to: 4 / (978 - 11 - 1) = 0.004.
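The cutoff arithmetic, and the way it would be applied to saved distance scores, can be sketched in Python. The function name and the short list of distances are hypothetical; only n = 978 and k = 11 come from this analysis.

```python
def cooks_cutoff(n_cases, n_predictors):
    """Rule-of-thumb cutoff for Cook's distance: 4 / (n - k - 1)."""
    return 4 / (n_cases - n_predictors - 1)


cutoff = cooks_cutoff(978, 11)
print(round(cutoff, 4))

# Hypothetical Cook's distance scores for five cases.
distances = [0.0002, 0.0051, 0.0039, 0.0120, 0.0008]
influential = [d for d in distances if d > cutoff]
print(influential)
```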

To identify the influential cases with large Cook's distances, we sort the data set by the Cook's distance variable, 'coo_1' that SPSS created in the data set.


Slide 51

Sorting Cook's Distance Scores in Descending Order

First, we select the 'Sort Cases...' command from the Data menu.

Second, we move the 'Cook's Distance (coo_1)' variable to the 'Sort by: ' text box.

Third, mark the 'Descending' option in the 'Sort Order' panel.

Fourth, click on the OK button to commence sorting.


Slide 52

Cases with Large Cook's Distances

When the data set is sorted we see that there are 40 cases that have a Cook's distance larger than 0.004. Even though the standard for Cook's distance for this problem appears very small, we will run the regression without these cases to see what effect they are having on our analysis.


Slide 53

Removing the Cases with Large Cook's Distance

First, select 'Linear Regression' from the Dialog Recall drop down menu.

Second, move the 'Cook's Distance' coo_1 variable to the 'Selection Variable: ' text box.

Third, click on the 'Rule' button.

Fourth, select the 'less than' relationship from the drop down menu.

Fifth, type the value .004 into the 'Value' text box.

Sixth, click on the 'Continue' button and then the OK button to re-run the regression.


Slide 54

The Overall Strength of Relationship without the Influential Cases


Removing the influential cases had only a small impact on R², the overall fit of the model, which changed from 0.111 to 0.145. Using the interpretative guidelines for R², we would evaluate the relationship in both models as weak.

Slide 55

Significant Individual Coefficients without the Influential Cases


Without the influential cases, the same six independent variables had statistically significant individual relationships with the dependent variable.

Since neither the overall fit nor the tests of individual relationships were different without the cases identified as influential, I would opt to return to the full data set and continue the analysis. To return the full data set to the analysis, we clear the selection criterion.

Slide 56

Stage 5: Interpret The Findings - Regression Coefficients

Interpreting the regression coefficients enables us to make statements about the direction of the relationship between the dependent variable and each of the independent variables, the size of the contribution of each independent variable to the dependent variable, and the relative importance of the independent variables as predictors of the dependent variable.


Slide 57

8. Direction of relationship and contribution to dependent variable


The table of coefficients derived for the full data set is shown to the right.

To determine the direction of the relationship between each independent variable and the dependent variable, we look at the sign of the B coefficient. We will only discuss those variables which had a statistically significant relationship to the dependent variable.

The coefficient for number of children is positive, indicating that the more children one has, the more one favors spanking as discipline. Similarly, the coefficients for conservative Protestant, Black, and rural native were positive, indicating that respondents with these characteristics were more favorably disposed to corporal punishment. Two variables had negative coefficients: years of education and Northeast region. Persons with more education looked less favorably on corporal punishment. To interpret the dummy-coded northeast variable, we have to compare it to the reference category, South, which was not included in the independent variables. The negative coefficient for Northeast region means that persons in the northeast were less likely to favor corporal punishment than were persons in the south.

We will not discuss the magnitude of the coefficients since fractional contributions to an attitude scale on corporal punishment have no intuitive application.

Slide 58

Importance of Predictors


We can look at the Beta coefficients in the table to see which variables had a greater impact on attitude toward corporal punishment. The largest Beta coefficient in absolute value is associated with the regional variable Northeast (-.154), followed by the variable for education (-.148). We can say that these two variables had the greatest effect on differences in attitude toward corporal punishment.
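A Beta coefficient is the B coefficient rescaled into standard deviation units, which is what makes predictors measured on different scales comparable. The Python sketch below shows that conversion. The function name, the sample values, and the B coefficient are illustrative assumptions, not figures from the article or from the SPSS output.

```python
import statistics

def standardized_beta(b, x, y):
    """Convert an unstandardized B into a Beta coefficient.

    Beta = B * (standard deviation of x) / (standard deviation of y).
    """
    return b * statistics.stdev(x) / statistics.stdev(y)

# Made-up values for illustration only.
education = [8, 10, 12, 12, 14, 16, 18]
attitude = [3.5, 3.0, 3.0, 2.5, 2.5, 2.0, 1.5]
b_education = -0.18  # hypothetical unstandardized coefficient

beta_education = standardized_beta(b_education, education, attitude)
print(round(beta_education, 3))
```

Because every Beta is expressed in the same units, the absolute values can be ranked directly, which is how the comparison above is made.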

Slide 59

Stage 5: Interpret The Findings - Impact of Multicollinearity


Multicollinearity is a concern in our interpretation of the findings because it could lead us to conclude mistakenly that there is no relationship between the dependent variable and one of the independent variables. This can happen when a strong relationship between that independent variable and another independent variable in the analysis prevents it from demonstrating its relationship to the dependent variable.

SPSS supports the detection of this problem by providing Tolerance statistics and the Variance Inflation Factor or VIF statistic, which is the inverse of tolerance. To detect problems of multicollinearity, we look for tolerance values less than 0.10 or VIF greater than 10 (1/0.10=10).

Neither the tolerance values nor the VIF statistics indicate a problem of multicollinearity for any of the independent variables in the analysis.
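For readers who want to see where these numbers come from, here is a minimal Python sketch using NumPy. It assumes the usual definition: the tolerance of a predictor is 1 minus the R² obtained when that predictor is regressed on all the other predictors, and VIF is 1 divided by tolerance. The data are synthetic, and the function name is hypothetical; this is not the SPSS implementation.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, vif) arrays with one entry per column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tolerance = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus the remaining predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        tolerance[j] = ss_res / ss_tot  # equals 1 - R^2
    return tolerance, 1.0 / tolerance

# Synthetic predictors: x3 is nearly a copy of x1, so both are collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)

tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
flagged = [j for j in range(3) if tol[j] < 0.10 or vif[j] > 10]
print("tolerance:", tol.round(3))
print("VIF:", vif.round(1))
print("flagged columns:", flagged)
```

The two cutoffs are the same rule stated two ways, so a predictor flagged by one is always flagged by the other.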

Slide 60

Stage 6: Validate The Model

If we wish to generalize our findings to populations beyond our sample, we need to aggregate evidence that our regression results are not limited to our sample. Since we do not usually have the resources available to replicate and validate our study, we employ statistical procedures to assure ourselves that we do not have a solution that fits our data sample but is unlikely to generalize.


Slide 61

Interpreting Adjusted R Square


Our first indicator of generalizability is the adjusted R Square value. It is adjusted for the number of variables included in the regression equation and estimates how much R Square can be expected to shrink in the population because our solution is over-fitted to the data set by including too many independent variables.

For the problem we are analyzing, R Square = .111 and the Adjusted R Square = .101. This is a drop of .010, or roughly 9% of R Square. This large decrease should be expected, however, because we know that our regression analysis contains a number of independent variables that do not have a statistically significant individual relationship with the dependent variable.
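The size of the drop can be checked directly from the two reported values; the relative drop works out to roughly 9%. The short Python sketch below does that arithmetic and also shows the standard adjustment formula. The sample size and number of predictors passed to the formula are made-up placeholders, since those figures are not restated on this slide.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Values reported above for the full model.
r_square = 0.111
adjusted = 0.101
absolute_drop = r_square - adjusted
relative_drop = absolute_drop / r_square
print(f"absolute drop = {absolute_drop:.3f}, relative drop = {relative_drop:.1%}")

# Hypothetical illustration of the formula; n and k are NOT the study's values.
example = adjusted_r_square(0.50, n_cases=21, n_predictors=4)
print(example)
```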

Slide 62

Split-Sample Validation

A more elaborate strategy to validate our regression requires us to randomly divide our sample into two groups, a screening sample and a validation sample. The regression is computed for the screening sample and used to predict the values of the dependent variable in the validation sample. SPSS provides us with Multiple R statistics for both the screening and the validation sample. If the Multiple R value for the validation sample is close to the value for the screening sample, the model is validated. In the double cross-validation strategy, we reverse the designation of the screening and validation sample and re-run the analysis.

We can then compare the regression equations derived for both samples. If the two regression equations contain a very different set of variables, it indicates that the variables might have achieved significance because of the sample size and not because of the strength of the relationship. Our finding about these individual variables would be that their predictive utility is not generalizable.
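The logic of double cross-validation can be sketched outside SPSS as follows. This Python example uses NumPy and synthetic data in place of the survey variables, and it assumes that Multiple R for a sample is the correlation between observed values and the values predicted from the screening-sample equation. The helper names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients (intercept first) for predictors X and outcome y."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def multiple_r(coef, X, y):
    """Correlation between observed y and the values predicted from coef."""
    predicted = np.column_stack([np.ones(len(X)), X]) @ coef
    return np.corrcoef(predicted, y)[0, 1]

# Synthetic data standing in for the survey variables.
rng = np.random.default_rng(10199)
X = rng.normal(size=(400, 3))
y = 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=400)
split = (rng.uniform(size=400) > 0.50).astype(int)

results = {}
for screening in (0, 1):  # the second pass reverses the two roles
    screen = split == screening
    valid = ~screen
    coef = fit_ols(X[screen], y[screen])
    results[screening] = (
        multiple_r(coef, X[screen], y[screen]),  # R for screening sample
        multiple_r(coef, X[valid], y[valid]),    # R for validation sample
    )
    print(f"split = {screening}: R screening = {results[screening][0]:.3f}, "
          f"R validation = {results[screening][1]:.3f}")
```

If the validation R falls well below the screening R in either pass, that is a sign the equation is fitted too closely to the cases it was estimated on.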


Slide 63

Set the Starting Point for Random Number Generation

First, select the 'Random Number Seed...' command from the 'Transform' menu.

Second, click on the 'Set seed to: ' option to access the text box for the seed number.

Third, type '010199' in the 'Set seed to: ' text box.

Fourth, click on the OK button to complete this action.


Slide 64

Compute the Variable to Randomly Split the Sample into Two Halves

First, select the 'Compute...' command from the Transform menu.

Second, create a new variable named 'split' that has the values 1 and 0 to divide the sample into two parts. Type the name 'split' into the 'Target Variable: ' text box.

Third, type the formula 'uniform(1) > 0.50' in the 'Numeric Expression: ' text box. The uniform function will generate a random number between 0.0 and 1.0 for each case. If the generated random number is greater than 0.50, the numeric expression will result in a 1. If the generated random number is 0.50 or less, the numeric expression will produce a 0.

Fourth, we click on the OK button to compute the split variable.
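The same idea, a fixed seed followed by a uniform draw compared against 0.50, can be sketched in Python with the standard library. This is an analogy to the SPSS steps rather than a reproduction of them: Python's random number generator differs from the one in SPSS, so the same seed will not assign the same cases to each half. The number of cases is a placeholder.

```python
import random

random.seed(10199)  # plays the role of the seed typed into SPSS

n_cases = 10  # placeholder; use the number of cases in the data set

# Mirrors 'uniform(1) > 0.50': 1 if the draw exceeds 0.50, otherwise 0.
split = [int(random.random() > 0.50) for _ in range(n_cases)]
print(split)
```

Setting the seed first is what makes the split repeatable: running the lines again with the same seed yields the same assignment of cases.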


Slide 65

Specify the Cases to Include in the First Screening Sample

First, select 'Linear Regression' from the 'Dialog Recall' drop down menu.

Second, highlight the 'split' variable and click on the move button to put it into the 'Selection Variable: ' text box.

Third, after 'split=?' appears in the 'Selection Variable: ' text box, click on the 'Rule...' button to specify which cases to include in the screening sample.

Fourth, accept the default relationship of 'equal to' in the drop down menu and type a '0' in the 'Value: ' text box.

Fifth, click on the 'Continue' button to complete setting the rule and then ‘OK’ to run the regression.


Slide 66

Specify the Cases to Include in the Second Screening Sample


First, select 'Linear Regression' from the 'Dialog Recall' drop down menu.

Second, highlight 'split=0' in the 'Selection Variable: ' text box and click on the 'Rule...' button.

Third, replace the 0 in the 'Value: ' text box with a 1.

Fourth, click on the 'Continue' button to close the 'Set Rule' dialog box and then ‘OK’ to run the regression.


Slide 67

Summary Table for Validation Analysis

The abundance of output makes it difficult to interpret our validation analysis. I find it useful to create a table of key statistical evidence.

All of the values for R and R² support the presence of a weak relationship between the dependent variable and the independent variables.

Four variables appear in both the full model and the two validation models. We can conclude that the relationship between approval of corporal punishment and these four variables (number of children, education, black, and northeast region) is generalizable to the population from which the sample was drawn.

Thus, we have validated the presence of a relationship between attitude toward corporal punishment and the predictor variables of family size, education, race, and northeast region. We would characterize the relationship as weak, which often occurs in social science research.


                            Full Model    Split = 0    Split = 1
R for Learning Sample         0.333         0.372        0.316
R for Validation Sample       n/a           0.281        0.330
R²                            0.111         0.138        0.100
Adjusted R²                   0.101         0.119        0.078

Significant Coefficients (sig < 0.05)

Full Model: Number Of Children, Highest Year Of School Completed, Conservative Protestant, Black, Rural Native, Northeast Region

Split = 0: Number Of Children, Highest Year Of School Completed, Black, Rural Native, Northeast Region, Female

Split = 1: Number Of Children, Highest Year Of School Completed, Black, Northeast Region

Slide 68

Hierarchical Regression as an Alternative Method

When we wish to test the effect of a predictor variable after taking into account the effects of a set of control variables, we can test the model with a hierarchical regression. If we have the same variables in the regression model, our results will be the same no matter which method we used to select variables.

The reason hierarchical regression is useful in this type of analysis is that we can obtain the intuitively appealing R² change statistic which tells us the size of the contribution to R² made by the predictor variables after the control variables have been entered into the equation.
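The R² change statistic and its F test can be sketched as follows. This Python example uses NumPy and synthetic stand-ins for the control and predictor blocks; it is not the article's data. It assumes the usual F-change formula, in which the change in R² per added variable is divided by the unexplained variance per residual degree of freedom in the larger model. The function and variable names are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()

# Synthetic data: block 1 holds the controls, block 2 the added predictors.
rng = np.random.default_rng(1)
n = 500
controls = rng.normal(size=(n, 4))
predictors = rng.normal(size=(n, 3))
y = (controls @ np.array([0.3, -0.2, 0.1, 0.0])
     + predictors @ np.array([-0.25, 0.0, 0.0])
     + rng.normal(size=n))

r2_block1 = r_squared(controls, y)
r2_block2 = r_squared(np.column_stack([controls, predictors]), y)
r2_change = r2_block2 - r2_block1

m = predictors.shape[1]            # variables added in block 2
k_full = controls.shape[1] + m     # variables in the full model
f_change = (r2_change / m) / ((1.0 - r2_block2) / (n - k_full - 1))

print(f"R2 block 1 = {r2_block1:.3f}, R2 block 2 = {r2_block2:.3f}")
print(f"R2 change = {r2_change:.3f}, F change = {f_change:.2f}")
```

The F change value is compared against an F distribution with m and n - k_full - 1 degrees of freedom to obtain the significance level that SPSS reports.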


Requesting a Hierarchical Regression in SPSS

To request a hierarchical regression, we specify the independent variables in 'blocks' or groups specifying the order of entry. The variables in each block will be entered at the same time, followed by the variables in the second block, etc. We can specify different variable entry methods for each block, i.e. stepwise or direct entry, but typically we directly enter all variables in each block as we would in standard multiple regression.

Slide 69

Requesting the Hierarchical Regression


First, select 'Linear Regression' from the 'Dialog Recall' drop down menu.

Second, remove the 'split' variable from the 'Selection Variable: ' text box.

Third, move the regional variables 'Northeast Region', 'West Region', and 'Midwest Region' from the list box of 'Independent(s): ' back to the list of variables in the data set.


Slide 70

Requesting a Second Block for Variables

To request a second block, click on the Next button with some independent variables included in the 'Independent(s): ' list box.


Slide 71

Entering Variables in the Second Block

We know we are entering variables into the second block because of the label 'Block 2 of 2'.

First, highlight the three dichotomous regional variables.

Second, click on the move arrow button to add the highlighted variables to the list box of 'Independent(s): '.


Slide 72

Requesting the R² Change Statistic

First, click on the 'Statistics...' button to open the 'Linear Regression: Statistics' dialog box.

Second, mark the 'R squared change' check box.

Third, click on the 'Continue' button to close the 'Linear Regression: Statistics' dialog box and then ‘OK’ to run the regression.


Slide 73

The R² Change Statistic


The change in R² information is shown in the following Model Summary, which I transposed to fit on the page. The control variables in the first block account for 9.3% of the variance in the dependent variable. In addition, the region variables account for 1.8% more of the variance in the dependent variable.

The increase in R² by adding the regional information after the control variables (.018) is statistically significant at Sig < 0.0001. This finding strengthens the authors’ case about the effect of regions on the dependent variable.
