Fitting Linear Regression in SPSS and Output Interpretation


UNICEF Workshop on Global Study, 18th to 28th August 2008

Centre for Global Health, Population, Poverty and Policy (GHP3)

Tuesday 26th August

The aim of this workshop is to introduce you to fitting linear regression in SPSS. We
will be using the Demographic and Health Survey (DHS) from Ghana, although the
techniques shown are the same for all datasets. This worksheet and the data associated
with the workshop are all available on the course website.

At the end of this session you should be able to:

- fit a simple linear regression model in SPSS

- understand how to create dummy variables for use in linear regression with

categorical explanatory variables

- interpret output from linear regression analyses

1. Simple Linear Regression: Continuous Explanatory Variables

First of all, download the dataset from the course website at

www.southampton.ac.uk/socsci/ghp3/course/material.html to your desktop. The

dataset that will be used for this session is the same as for Computer Workshop 3. It

is a reduced version of the Ghana DHS 2003, with a line for each child aged under 5

years old in the selected households.

Open SPSS in the usual way and open up the dataset.

In the first part of the workshop we will be looking at the relationship between birth
weight and weight-for-age z-score. The hypothesis is that the lower the birth weight,
the lower the weight-for-age z-score against the reference population. We will start
with some data manipulation, followed by exploratory analyses, and then move on to the
simple linear regression.

It is always extremely important to get a feel for the data before you rush headlong

into some complicated statistical analysis.



1. Select Analyze | Descriptive Statistics | Explore. The following dialogue

box should appear.

2. Transfer the variable Wt/A Standard deviations to the Dependent List box
by clicking the right arrow next to the box, and then click on the OK button.

3. The output will appear in the right-hand pane of the Output Viewer window.

Scroll through this output carefully and note what SPSS has produced. The

default output will include the mean and standard deviation for your data, a

95% confidence interval for the population mean, a stem-and-leaf plot and a

boxplot. The stem-and-leaf plot is useful as it enables us to see whether the
distribution of our response variable (weight-for-age) is highly skewed or not;
in this case it is not! However, it is clear from the boxplot that there are
some strange values, with a score of about 1000.
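If you would like to see what Explore is calculating, the following is a minimal Python sketch (not SPSS syntax) of the mean, the standard deviation and an approximate 95% confidence interval for the mean. The scores are made-up illustrative values, not the Ghana DHS data, and the interval uses the normal approximation (1.96 standard errors) rather than the exact t-based interval that a statistics package would normally report.

```python
import math
import statistics

def describe(values):
    """Return the mean, sample SD and an approximate 95% CI for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample SD (divides by n - 1)
    half_width = 1.96 * sd / math.sqrt(n)  # normal approximation, an assumption
    return mean, sd, (mean - half_width, mean + half_width)

# Hypothetical weight-for-age scores; the last value mimics a strange outlier
scores = [-150, -90, -120, -60, -200, 9998]
mean, sd, ci = describe(scores)
print(f"mean={mean:.1f}, sd={sd:.1f}, ci=({ci[0]:.1f}, {ci[1]:.1f})")
```

A single extreme value drags the mean far away from the other five scores, which is why it pays to look at the boxplot before doing anything else.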

4. There are a number of children who have had their weight-for-age flagged.
This is because the values for weight-for-age for those children are outside
acceptable ranges; the measurement for height may have been incorrect.
These are coded as 9998, but are included in the analysis at the moment. We
need to change this (and while we are doing this we will change other
variables like this as well).





5. Go to Transform | Recode into Same Variables and recode values 9996 and

9998 into System-missing for Height-for-age, Weight-for-age, Weight-for-

height and birth weight. If you have forgotten how to recode variables please

ask.

6. Rerun the Explore command and study the results again. The results have

changed by a large amount.
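The effect of the recode in steps 5 and 6 can be sketched in Python as follows. This is only an illustration of the idea, not SPSS syntax: the scores are made up, and `None` stands in for SPSS's system-missing value.

```python
import statistics

FLAG_CODES = {9996, 9998}  # codes the worksheet recodes to system-missing

def recode_flags(values, flags=FLAG_CODES):
    """Replace flag codes with None, used here as a stand-in for system-missing."""
    return [None if v in flags else v for v in values]

def mean_ignoring_missing(values):
    """Mean of the non-missing values only."""
    valid = [v for v in values if v is not None]
    return statistics.mean(valid)

# Hypothetical scores, not the real data
raw = [-150, -90, 9998, -120, 9996, -60]
clean = recode_flags(raw)
print(statistics.mean(raw))          # distorted by the two flag codes
print(mean_ignoring_missing(clean))  # mean of the four valid scores
```

Once the flag codes are treated as missing, the summary statistics describe only the children with valid measurements, which is why the Explore results change so much.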

7. We can investigate the relationship between weight-for-age and birth weight
by looking at the correlation between the two variables. Correlation is usually
calculated between two continuous variables. A correlation of 1 indicates
perfect positive correlation: as one variable increases, the other increases
along an exact straight line. A correlation of -1 indicates perfect negative
correlation: as one variable increases, the other decreases along an exact
straight line. A correlation of 0 indicates no linear relationship between the
two variables.

- Go to Analyze | Correlate | Bivariate. The following box appears.





- Place Wt/A Standard deviations and Birth weight in the right-hand
Variables box, as shown above. Click OK. The following table is
produced in the output.

Correlations

                                                     Wt/A Standard   Birth weight
                                                     deviations      (kilos - 3 dec.)
Wt/A Standard deviations        Pearson Correlation  1.000           .109**
                                Sig. (2-tailed)                      .002
                                N                    3094            837
Birth weight (kilos - 3 dec.)   Pearson Correlation  .109**          1.000
                                Sig. (2-tailed)      .002
                                N                    837             974

**. Correlation is significant at the 0.01 level (2-tailed).

- The correlation between Weight-for-age Standard deviations and birth
weight is 0.109. This is not that high, but the p-value (in the Sig. (2-tailed)
row) is 0.002. This is below 0.05 (for a 5% test) and thus is significant at
the 5% level. Thus there is evidence of a linear relationship between the two
variables, even though it is weak. Also note that the number of children
included in this correlation is only 837. Many children do not have a recorded
birth weight, and some do not have a weight-for-age (the children without a
weight-for-age include those who have died between birth and the survey).
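For those curious about the arithmetic behind step 7, here is a minimal Python sketch of a Pearson correlation that uses only children with both values present, which is why N drops to 837 in the table. The data are invented, and the helper `t_from_r` uses the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2) for testing whether a correlation differs from 0; it is an illustration, not SPSS's own code.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

def t_from_r(r, n):
    """t statistic for testing whether a correlation differs from 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical birth weights (grams) and weight-for-age scores
birth_weight = [2500, 3000, None, 3500, 4000]
wfa_score = [-180, -120, -90, None, -60]
r, n = pearson_r(birth_weight, wfa_score)
print(f"r={r:.3f} from n={n} complete pairs")

# Plugging in the worksheet's rounded values: r = 0.109, N = 837
print(round(t_from_r(0.109, 837), 2))
```

With the worksheet's rounded values the t statistic comes out at a little over 3, well above 1.96, which is consistent with the small p-value in the table even though the correlation itself is weak.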

8. It is now time for the simple linear regression. Select Analyze | Regression |

Linear. The linear regression dialogue box appears (see next page).

9. Our dependent variable is Wt/A Standard deviations, so place this into the

dependent box. We are predicting weight-for-age using Birth Weight, so

place birth weight into the independent(s) box.

10. Click OK.





11. The following output is produced:

Variables Entered/Removed (b)

Model   Variables Entered                   Variables Removed   Method
1       Birth weight (kilos - 3 dec.) (a)   .                   Enter

a. All requested variables entered.
b. Dependent Variable: Wt/A Standard deviations

This table simply states the variables in the model and the selection method chosen.

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .109 (a)   .012       .011                120.660

a. Predictors: (Constant), Birth weight (kilos - 3 dec.)

The results indicate the correlation (0.109, as seen before) and the R Square. This
indicates how much variation is explained; in this case not much!

ANOVA (b)

Model            Sum of Squares   df    Mean Square   F       Sig.
1   Regression   145568.173       1     145568.173    9.999   .002 (a)
    Residual     1.216E7          835   14558.868
    Total        1.230E7          836

a. Predictors: (Constant), Birth weight (kilos - 3 dec.)
b. Dependent Variable: Wt/A Standard deviations

Do not worry about this box!
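Although you do not need the ANOVA box for this workshop, it may help to see how it links to the Model Summary. The short Python sketch below uses only the figures printed in the ANOVA table and the standard definitions R Square = regression sum of squares / total sum of squares and F = regression mean square / residual mean square. The residual sum of squares is rebuilt from the mean square and its degrees of freedom because the table shows it only in rounded scientific notation.

```python
# Values copied from the ANOVA table above
ss_regression = 145568.173
df_regression = 1
ms_residual = 14558.868
df_residual = 835

ss_residual = ms_residual * df_residual   # shown as 1.216E7 in the table
ss_total = ss_regression + ss_residual    # shown as 1.230E7 in the table

r_square = ss_regression / ss_total
f_stat = (ss_regression / df_regression) / ms_residual

print(round(r_square, 3))  # compare with R Square in the Model Summary
print(round(f_stat, 3))    # compare with F in the ANOVA table
```

The square root of R Square is the correlation seen earlier, so with a single explanatory variable the Model Summary adds nothing beyond what the correlation table already told us.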





Coefficients (a)

                                      Unstandardized           Standardized
                                      Coefficients             Coefficients
Model                                 B          Std. Error    Beta           t        Sig.
1   (Constant)                        -146.216   18.112                       -8.073   .000
    Birth weight (kilos - 3 dec.)     .017       .005          .109           3.162    .002

a. Dependent Variable: Wt/A Standard deviations

The final box, labelled Coefficients, gives the results of the analysis. Each of the
columns is explained below:

- Unstandardized Coefficients B: This shows the values of the numbers in the
linear regression equation.
o The constant term is -146.2, indicating that a child who weighs 0 g at birth
(impossible, but this is the theory) has a predicted weight-for-age score of
-146.2. The DHS stores these z-scores multiplied by 100, so this is about 1.46
standard deviations below the mean of the reference population.
o The coefficient for birth weight is 0.017. For every gram increase in birth
weight, the weight-for-age score increases by 0.017, so an extra kilogram of
birth weight corresponds to about 17 points (0.17 standard deviations).
- Unstandardized Coefficients Std. Error: This is the standard error for the
coefficient; it is used in the calculation of significance.
- Standardized Coefficients Beta: Do not worry about this!
- t: This is the t-test to see if the coefficients are significantly different from 0.
An absolute value over 1.96 indicates significance at the 5% level.
- Sig.: This is the p-value. If it is under 0.05 then the variable is significant at
the 5% level. The value we have here is 0.002, which is highly significant. There
is a significant relationship between birth weight and weight-for-age.
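To make the columns of the Coefficients box more concrete, the following Python sketch fits a simple linear regression by least squares and reports the same four quantities: the constant, the slope (B), its standard error and t = B / Std. Error. The data are invented for illustration. The `predict` helper then plugs a birth weight into the worksheet's fitted equation; its defaults are the B values from the table above, and treating birth weight as grams follows the worksheet's own interpretation.

```python
import math

def simple_ols(xs, ys):
    """Least-squares intercept and slope, with the slope's SE and t value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(residual_ss / (n - 2) / sxx)
    return intercept, slope, se_slope, slope / se_slope

# Hypothetical data: birth weight in grams and weight-for-age score
grams = [2400, 2800, 3100, 3300, 3600, 4000]
score = [-190, -150, -100, -130, -70, -60]
intercept, slope, se, t = simple_ols(grams, score)
print(f"constant={intercept:.1f}, B={slope:.4f}, SE={se:.4f}, t={t:.2f}")

def predict(birth_weight_g, b0=-146.216, b1=0.017):
    """Predicted score from the worksheet's fitted equation."""
    return b0 + b1 * birth_weight_g

print(round(predict(3000), 1))  # predicted score for a 3000 g baby
```

The prediction for a 3000 g baby is -146.216 + 0.017 x 3000, or roughly -95, which is far more meaningful than the constant on its own because no child weighs 0 g at birth.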

2. Simple Linear Regression: Categorical Explanatory Variables

1. The procedure for conducting linear regression when there are categorical

explanatory variables is slightly different, as you need to create dummy

variables, as explained earlier. If you do not do this, the results that you

obtain will not be valid. We will look at the relationship between wealth index

and weight-for-age standard deviations.





2. Firstly, do some exploratory analysis. One way to do this with categorical
variables is to calculate the mean weight-for-age score (Wt/A Standard
deviations) for each wealth quintile. To do this:

- Go to Analyze | Compare Means | Means
- Place Wt/A Standard deviations in the Dependent List
- Put Wealth index into the Independent List box
- Click OK. The following results should be produced:

Report

Wt/A Standard deviations

Wealth index   Mean      N      Std. Deviation
Poorest        -135.55   1031   127.879
Poorer         -113.66   694    122.574
Middle         -110.86   556    117.847
Richer         -94.47    425    112.391
Richest        -68.28    388    117.536
Total          -112.12   3094   123.417

- There are large differences in weight-for-age by wealth. The average for the

poorest quintile is -135.55, while for the richest it is -68.28. As wealth

increases, weight-for-age against the reference population also increases.
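The Compare Means step simply averages the score within each category. A minimal Python sketch of the same idea is shown below; the records are made up, the function name `means_by_group` is our own, and missing scores (shown as `None`) are skipped.

```python
from collections import defaultdict
import statistics

def means_by_group(groups, values):
    """Mean and count of the values within each group, skipping missing values."""
    buckets = defaultdict(list)
    for g, v in zip(groups, values):
        if v is not None:
            buckets[g].append(v)
    return {g: (statistics.mean(vs), len(vs)) for g, vs in buckets.items()}

# Hypothetical records: wealth quintile label and weight-for-age score
wealth = ["Poorest", "Poorest", "Middle", "Richest", "Richest", "Middle"]
score = [-140, -130, -110, -70, None, -90]
for group, (mean, n) in means_by_group(wealth, score).items():
    print(group, mean, n)
```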

3. We will now recreate this analysis by conducting linear regression. But first,
we will need to create dummy variables for the wealth index.

- Four new variables need to be created, as wealth has five categories
(remember that the number of dummy variables needed is one less than the
number of categories!)
- Go to Transform | Recode into Different Variables
- Place Wealth index into the central box. On the right-hand side, under
Output Variable, enter Poorest in the Name box and label it
Dummy variable for Poorest Wealth Quintile. Click Change.





- Click Continue and then OK. A new variable called poorest is created.

4. You now need to create three more dummy variables for the other categories of
wealth. To do this, go to Transform | Recode into Different Variables and
follow the process above for Poorer, Middle and Richer. Each time you will need
to recode a different value to be the dummy (for instance, for Middle, all those
with a 3 in the original dataset need to be recoded as a 1, and all other values
as a 0). Please ask if you are confused!

Alternatively, use SPSS syntax to do this automatically. A syntax file is included
on the website for you to use to create your dummy variables.
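The logic of the dummy variables can also be sketched in a few lines of Python. This is not the syntax file from the course website, just an illustration of the recoding rule. It assumes the wealth index is coded 1 = Poorest through 5 = Richest; the worksheet itself only confirms that Middle is coded 3.

```python
# Assumed coding of the wealth index: 1 = Poorest ... 5 = Richest.
# The worksheet confirms only that Middle is coded 3.
DUMMY_CODES = {"poorest": 1, "poorer": 2, "middle": 3, "richer": 4}

def make_dummies(wealth_index):
    """Return one 0/1 list per dummy; Richest (5) is the reference category."""
    dummies = {}
    for name, code in DUMMY_CODES.items():
        dummies[name] = [
            None if w is None else int(w == code) for w in wealth_index
        ]
    return dummies

wealth_index = [1, 3, 5, 2, 4, 3]
for name, column in make_dummies(wealth_index).items():
    print(name, column)
```

A child in the Richest quintile scores 0 on all four dummies, which is what makes Richest the category that the other quintiles are compared against.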

5. Now the linear regression can be run. Go to Analyze | Regression |
Linear. The regression from the previous analysis will still be there. The
Dependent variable remains the same, Wt/A Standard deviations, but the
Independent variables are now different.

Remove Birth weight from the Independent(s) box. Enter instead the four
dummy variables: Poorest, Poorer, Middle and Richer.





Click OK.

6. Four boxes are produced, as before. Below is the final box, labelled

Coefficients.

Coefficients

Model                                             B         Std. Error   Beta    t         Sig.
1   (Constant)                                    -68.284   6.173                -11.062   .000
    Dummy variable for poorest wealth quintile    -67.262   7.242        -.257   -9.288    .000
    Dummy variable for poorer wealth quintile     -45.381   7.707        -.153   -5.888    .000
    Dummy variable for middle wealth quintile     -42.576   8.043        -.132   -5.294    .000
    Dummy variable for richer wealth quintile     -26.189   8.537        -.073   -3.068    .002


You will see that all of the variables are highly significant! This is seen in the
final column, Sig., which shows the p-value. This indicates that each of the other
wealth quintiles differs significantly from the reference category, the Richest
quintile, which is represented by the Constant.

The value for the constant is -68.284, which is the same as the mean
weight-for-age score seen previously for the Richest quintile!

For the poorest quintile the average score is -68.284 - 67.262 = -135.546. The
same as before! For all the wealth quintiles the results mirror the results seen
before.
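The same check can be made for every quintile. The sketch below simply adds each dummy coefficient from the Coefficients table to the constant; the variable names are our own.

```python
constant = -68.284  # mean score for the reference group (Richest)
coefficients = {
    "Poorest": -67.262,
    "Poorer": -45.381,
    "Middle": -42.576,
    "Richer": -26.189,
}

# Each dummy coefficient is the difference from the reference group,
# so constant + coefficient gives that group's mean.
group_means = {name: constant + b for name, b in coefficients.items()}
group_means["Richest"] = constant

for name, mean in group_means.items():
    print(f"{name}: {mean:.3f}")
```

Each result agrees with the Compare Means table to within rounding, which confirms that a regression on dummy variables alone reproduces the group means.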

3. Multiple Linear Regression

You may be wondering why we bothered doing the regression on weight-for-age and

wealth when we can get the results simply using the Compare Means command.

The reason is to show the differences when more than one variable is added into the

model at the same time.

We have seen that birth weight and wealth are related to weight-for-age when the
simple bivariate analysis is conducted. But what happens if we analyse them together?





Birth weight is highly related to wealth: infants born to poorer households are likely
to be lighter than infants born to richer households. So is the relationship between
wealth and weight-for-age only due to the relationship with birth weight, with those
of a lighter birth weight likely to remain below the norm throughout childhood?

To test this we enter the variables into the model together.

1. Go to Analyze | Regression | Linear. The previous regression variables
will still be contained in the different boxes.

2. Click on Birth Weight and place it into the Independent(s) box, alongside
the wealth quintile dummy variables.

3. Click OK. The final table in the output is copied below.

Coefficients

Model                                             B          Std. Error   Beta    t        Sig.
1   (Constant)                                    -119.658   18.412               -6.499   .000
    Dummy variable for poorest wealth quintile    -83.830    14.066       -.220   -5.960   .000
    Dummy variable for poorer wealth quintile     -37.202    12.418       -.115   -2.996   .003
    Dummy variable for middle wealth quintile     -42.243    12.494       -.130   -3.381   .001
    Dummy variable for richer wealth quintile     -39.684    11.140       -.138   -3.562   .000
    Birth weight (kilos - 3 dec.)                 .018       .005         .120    3.491    .001


The results have changed! Partly this is because a different sample is being used
(only those with a birth weight AND a wealth quintile are included in the
analysis), but it is also due to having both variables in the model at the same
time.

All the variables are still significant in the model, and after taking account of
birth weight the difference between richest and poorest actually increases (from
-67.3 to -83.8). This shows that even though birth weight is significantly related
to weight-for-age, wealth still has a very large effect on weight-for-age after
the birth.
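One way to read the multiple regression output is to predict the score for two children with the same birth weight but different wealth. The sketch below uses the B values from the table above; the entry of 0 for Richest is our own addition to represent the reference category, and birth weight is taken to be in grams, as in the earlier interpretation.

```python
CONSTANT = -119.658
BIRTH_WEIGHT_B = 0.018   # change in score per gram of birth weight
WEALTH_B = {             # Richest is the reference category, so its value is 0
    "Poorest": -83.830,
    "Poorer": -37.202,
    "Middle": -42.243,
    "Richer": -39.684,
    "Richest": 0.0,
}

def predict(quintile, birth_weight_g):
    """Predicted weight-for-age score from the fitted multiple regression."""
    return CONSTANT + WEALTH_B[quintile] + BIRTH_WEIGHT_B * birth_weight_g

# Same birth weight, different wealth: the gap equals the dummy coefficient
poorest = predict("Poorest", 3000)
richest = predict("Richest", 3000)
print(round(poorest, 1), round(richest, 1), round(poorest - richest, 2))
```

Whatever birth weight you choose, the gap between the two predictions stays the same, because the model holds birth weight constant when comparing quintiles.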





4. The analysis can be extended to include other variables, such as Type of
Place of Residence, Educational Level and Place of Delivery.

However, all of these are categorical variables, so remember to recode
them into dummy variables first!
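If you are wondering how a model with several explanatory variables is fitted, the sketch below shows one textbook approach in plain Python: build the design matrix (a column of 1s for the constant plus one column per variable or dummy) and solve the normal equations. It is an illustration only, not what SPSS runs internally. The rural dummy, the data and the function names are all made up, and birth weight is entered in kilograms here to keep the numbers on a similar scale.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def multiple_ols(rows, y):
    """Least-squares coefficients; a constant column is added automatically."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical predictors for six children: [rural dummy, birth weight in kg]
rows = [[1, 2.5], [1, 3.0], [0, 3.2], [0, 3.8], [1, 3.4], [0, 2.9]]
y = [-160, -140, -90, -60, -120, -100]
constant, b_rural, b_weight = multiple_ols(rows, y)
print(f"constant={constant:.1f}, rural={b_rural:.1f}, weight={b_weight:.1f}")
```

Every extra categorical variable adds one column per dummy to the design matrix, which is why each one must be recoded into dummies before it goes into the model.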

Exercises

1. Conduct multiple linear regression on Weight-for-age Standard deviations,
including as explanatory variables birth weight, wealth index, urban/rural and
highest educational level of the parent.

2. Conduct multiple linear regression on Weight-for-Height, using the same

variables as in Exercise 1. Are there any obvious differences that you can see?

What is the relationship between wealth and weight-for-height after

controlling for the other variables?
