Multiple Regression in SPSS
GV917
Multiple Regression
Multiple regression involves more than one predictor variable. For example, in the turnout model:

$$Y_i = a + b_1 X_{i1} + b_2 X_{i2} + e_i$$

If

$$\hat{Y}_i = a + b_1 X_{i1} + b_2 X_{i2}$$

then

$$Y_i - \hat{Y}_i = e_i$$

Where
$Y_i$ is the observed value of Reported Turnout
$X_{i1}$ is the observed value of Actual Turnout
$X_{i2}$ is the Effective Number of Parties Index
$a$ is the intercept, and the $b_j$ are the slope coefficients: $b_1$ for the relationship between Reported Turnout and Actual Turnout, and $b_2$ for the relationship between Reported Turnout and the Effective Number of Parties
$\hat{Y}_i$ is the predicted value of Reported Turnout from the linear relationship with Actual Turnout and the Effective Number of Parties
$e_i$ is the residual or error term
Add an Effective Number of Parties Index to the Turnout Model
This measure was devised by Laakso and Taagepera (Comparative Political Studies 1979). It is designed to summarize the degree of fragmentation of the party system in a country. It is defined as:

$$\frac{1}{\sum (P_v)^2}$$

where $P_v$ is each party's proportion of the total vote.
Two Examples
Suppose there is a two-party system in a country and the votes are shared 60% to 40%. This is not a fragmented system, so that:

$$\frac{1}{\sum (P_v)^2} = \frac{1}{(0.60)^2 + (0.40)^2} = 1.92$$

Intuitively this means that the party system contains 1.92 'equally sized' parties.
But suppose in the country next door the vote is divided among four parties as follows: 35%, 30%, 20%, 15%. This is much more fragmented:

$$\frac{1}{\sum (P_v)^2} = \frac{1}{(0.35)^2 + (0.30)^2 + (0.20)^2 + (0.15)^2} = 3.64$$

In this case there are 3.64 'equally sized' parties.
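The two worked examples can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the original slides: the function name `effective_number_of_parties` is hypothetical, and rescaling the shares so that they sum to one (so that percentages can be passed directly) is an added assumption.

```python
def effective_number_of_parties(vote_shares):
    """Laakso-Taagepera index: 1 divided by the sum of squared vote proportions.

    Shares may be given as proportions (0.60, 0.40) or percentages (60, 40);
    they are rescaled to sum to one first (an assumption of this sketch).
    """
    total = sum(vote_shares)
    if total <= 0:
        raise ValueError("vote shares must sum to a positive number")
    proportions = [share / total for share in vote_shares]
    return 1.0 / sum(p * p for p in proportions)


two_party = effective_number_of_parties([0.60, 0.40])
four_party = effective_number_of_parties([35, 30, 20, 15])
print(round(two_party, 2), round(four_party, 2))
```

When every party has the same share, the index equals the actual number of parties, which is why the result is read as a count of 'equally sized' parties.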
Country          Reported Turnout   Actual Turnout   Effective No Parties
Austria          80.88              84.30            3.02
Belgium          78.71              90.60            8.84
Switzerland      54.14              43.20            5.87
Czech Republic   63.43              57.90            4.82
Germany          77.89              79.10            4.09
Denmark          88.33              87.10            4.69
Spain            71.40              68.70            3.12
Finland          71.43              65.30            6.03
France           62.84              60.30            5.22
Britain          67.18              59.40            3.33
Greece           83.37              75.00            2.64
Hungary          78.59              73.50            2.94
Ireland          75.57              62.60            4.13
Israel           71.15              67.80            7.05
Italy            84.28              81.40            6.32
Luxembourg       80.99              79.10            4.71
Netherlands      81.03              75.00            6.04
Norway           61.42              46.20            6.19
Poland           68.71              62.80            4.50
Portugal         81.38              80.10            3.03
Slovenia         74.10              70.40            5.15
Reported Turnout Regression with Two Predictors

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .928a .861 .846 3.45050
a. Predictors: (Constant), epartyv Effective No of Parties by Votes, ActualX Actual turnout (IDEA data)

ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 1330.069 2 665.034 55.857 .000a
Residual 214.307 18 11.906
Total 1544.376 20
a. Predictors: (Constant), epartyv Effective No of Parties by Votes, ActualX Actual turnout (IDEA data)
b. Dependent Variable: ReportedY Reported turnout (ESS data)

Coefficientsa
Model                                         B        Std. Error   Beta    t        Sig.
1 (Constant)                                  33.932   5.037                6.737    .000
ActualX Actual turnout (IDEA data)            .636     .061         .908    10.337   .000
epartyv Effective No of Parties by Votes      -.884    .486         -.160   -1.819   .086
(B and Std. Error are the Unstandardized Coefficients; Beta is the Standardized Coefficient.)
a. Dependent Variable: ReportedY Reported turnout (ESS data)
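To see how the coefficients are used, the fitted equation can be applied to one country. The sketch below plugs the unstandardized coefficients from the SPSS output (33.932, .636 and -.884) into the model and computes the predicted value and the residual for Austria; the function name is hypothetical and the coefficients are rounded to three decimals, so the result is approximate.

```python
# Unstandardized coefficients copied from the SPSS Coefficients table.
INTERCEPT = 33.932
B_ACTUAL = 0.636    # slope for Actual Turnout
B_PARTIES = -0.884  # slope for Effective Number of Parties


def predicted_reported_turnout(actual_turnout, effective_parties):
    """Predicted Reported Turnout from the two-predictor model."""
    return INTERCEPT + B_ACTUAL * actual_turnout + B_PARTIES * effective_parties


# Austria: reported 80.88, actual 84.30, effective number of parties 3.02.
austria_predicted = predicted_reported_turnout(84.30, 3.02)
austria_residual = 80.88 - austria_predicted  # e_i = Y_i - predicted Y_i
print(round(austria_predicted, 2), round(austria_residual, 2))
```

A negative residual means the model predicts a higher reported turnout for that country than was actually observed.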
Why this effect?
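Note that the coefficient on the effective number of parties is negative (-0.884), so the fragmentation of parties tends to reduce reported turnout, although with a significance level of 0.086 the effect is not significant at the 0.05 level. This effect has been attributed to information processing costs. If the average citizen has to choose among a lot of alternatives before voting, this raises the costs of voting and has the effect of reducing turnout.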
The parties effect is independent of the actual turnout effect – since in multiple regression we identify the effects of one predictor controlling for all other predictors.
In the Turnout model we are fitting a regression plane to a Three Dimensional Scattergram
How Does Controlling Work?
Step One: Regress Reported Turnout on the Effective Number of Parties:

$$Y_i = a + b_1 X_{i2} + v_i$$

Here $a$ and $b_1$ are the intercept and slope of this auxiliary regression, not those of the full model. Note that $v_i$ represents the variation in Reported Turnout NOT accounted for by the Effective Number of Parties. We have removed the number of parties as an influence on Reported Turnout.

Step Two: Regress Actual Turnout on the Effective Number of Parties:

$$X_{i1} = a + b_2 X_{i2} + u_i$$

Thus $u_i$ represents the variation in Actual Turnout NOT accounted for by the Effective Number of Parties. We have removed the number of parties as an influence on Actual Turnout.
Controlling in Multiple Regression
Step Three: In the Multiple Regression Model

$$Y_i = a + b_1 X_{i1} + b_2 X_{i2} + e_i$$

$b_1$, the effect of Actual Turnout on Reported Turnout, can be found by regressing the residuals $v_i$ on the residuals $u_i$, because both are independent of the Effective Number of Parties.
This is in effect what multiple regression does.
[Figure: diagram relating Actual Turnout, Effective Number of Parties and Reported Turnout]
Controlling in Regression
Coefficientsa
Model                                   B         Std. Error   Beta   t        Sig.
1 (Constant)                            -1E-014   .733                .000     1.000
resu Actual = f(Effective Parties)      .636      .060         .925   10.620   .000
(B and Std. Error are the Unstandardized Coefficients; Beta is the Standardized Coefficient.)
a. Dependent Variable: rese Reported = f (Effective Parties)
In this model we are regressing the residuals of Reported Turnout (vi) on the residuals of Actual Turnout (ui), where both sets of residuals come from regressions on the Effective Number of Parties. This produces the same regression coefficient (0.636) as in the earlier multivariate model.
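The three steps can be checked numerically. The sketch below is not SPSS output: it uses only the Python standard library, the helper names are hypothetical, and the data are transcribed from the country table earlier in these slides on the assumption that this table is the dataset behind the SPSS results. It estimates the slope on Actual Turnout once from the full two-predictor model and once by regressing residuals on residuals, then compares the two.

```python
# (reported turnout, actual turnout, effective number of parties), one row per country
DATA = [
    (80.88, 84.30, 3.02), (78.71, 90.60, 8.84), (54.14, 43.20, 5.87),
    (63.43, 57.90, 4.82), (77.89, 79.10, 4.09), (88.33, 87.10, 4.69),
    (71.40, 68.70, 3.12), (71.43, 65.30, 6.03), (62.84, 60.30, 5.22),
    (67.18, 59.40, 3.33), (83.37, 75.00, 2.64), (78.59, 73.50, 2.94),
    (75.57, 62.60, 4.13), (71.15, 67.80, 7.05), (84.28, 81.40, 6.32),
    (80.99, 79.10, 4.71), (81.03, 75.00, 6.04), (61.42, 46.20, 6.19),
    (68.71, 62.80, 4.50), (81.38, 80.10, 3.03), (74.10, 70.40, 5.15),
]


def mean(values):
    return sum(values) / len(values)


def cross_product(a, b):
    """Sum of products of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))


def simple_ols(y, x):
    """Intercept and slope of the least-squares line y = a + b*x."""
    slope = cross_product(x, y) / cross_product(x, x)
    return mean(y) - slope * mean(x), slope


def residuals(y, x):
    """Part of y that is not accounted for by a linear regression on x."""
    intercept, slope = simple_ols(y, x)
    return [yi - (intercept + slope * xi) for yi, xi in zip(y, x)]


def two_predictor_slopes(y, x1, x2):
    """Slopes b1, b2 of y = a + b1*x1 + b2*x2 from the centred normal equations."""
    s11, s22, s12 = cross_product(x1, x1), cross_product(x2, x2), cross_product(x1, x2)
    s1y, s2y = cross_product(x1, y), cross_product(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


reported = [row[0] for row in DATA]
actual = [row[1] for row in DATA]
parties = [row[2] for row in DATA]

b1_full, b2_full = two_predictor_slopes(reported, actual, parties)
v = residuals(reported, parties)    # Step One: Reported Turnout purged of parties
u = residuals(actual, parties)      # Step Two: Actual Turnout purged of parties
_, b1_partial = simple_ols(v, u)    # Step Three: regress v on u

print(b1_full, b1_partial)
```

The two printed slopes agree up to floating-point error for any dataset, which is the sense in which multiple regression "controls" for the other predictor.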
Another Look at ANOVA and the F test in Multiple Regression
The F test compares the Regression Mean Square with the Residual Mean Square. If it has a high value then the regression explains a lot more variation than is left unexplained.
If it has a low value then the regression explains very little variation.
The theoretical F distribution gives the probability that the F statistic will take on a value at least as large as the one observed if the Null Hypothesis (the regression explains nothing) is correct.
F Test in Multiple Regression

ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 1330.069 2 665.034 55.857 .000a
Residual 214.307 18 11.906
Total 1544.376 20
a. Predictors: (Constant), epartyv Effective No of Parties by Votes, ActualX Actual turnout (IDEA data)
b. Dependent Variable: ReportedY Reported turnout (ESS data)
$$\text{Regression Mean Square} = \frac{\text{Regression Sum of Squares}}{\text{Degrees of Freedom}} = \frac{1330.069}{2} = 665.03$$

$$\text{Residual Mean Square} = \frac{\text{Residual Sum of Squares}}{\text{Degrees of Freedom}} = \frac{214.307}{18} = 11.91$$

$$F = \frac{\text{Regression Mean Square}}{\text{Residual Mean Square}} = \frac{665.034}{11.906} = 55.86$$
What are Degrees of Freedom? They are usable bits of information
Total: If we had one observation we could not say anything about the total variation; we need more than one case. This is why the degrees of freedom, or usable bits of information, is n-1, or 20 (given 21 cases).
Residual: If we had two observations we could fit the regression line in a bivariate model, since two points determine a straight line, but there would be no residuals because the line would fit perfectly. In a three-variable model we are fitting a plane in three-dimensional space, and three observations would fit it perfectly. So residuals have n-3 degrees of freedom, or 18.
Since Total Variation = Explained Variation + Residual Variation, Explained Variation = Total Variation - Residual Variation. The explained variation therefore has (n-1) - (n-3) = 2 degrees of freedom.
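The degrees of freedom, mean squares and F ratio can be recomputed from the sums of squares in the ANOVA table. The following is a minimal sketch with hypothetical variable names; the adjusted R Square line uses the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), which is not spelled out in these slides.

```python
# Values copied from the SPSS ANOVA table for the turnout model.
ss_regression = 1330.069
ss_residual = 214.307
n_cases = 21
n_predictors = 2

df_total = n_cases - 1                     # 20
df_residual = n_cases - n_predictors - 1   # 18, i.e. n - 3
df_regression = df_total - df_residual     # 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * df_total / df_residual

print(round(ms_regression, 3), round(ms_residual, 3), round(f_ratio, 2))
print(round(r_squared, 3), round(adjusted_r_squared, 3))
```

The same sums of squares also give the R Square and Adjusted R Square reported in the Model Summary, so the ANOVA table and the Model Summary are two views of one decomposition.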
The F test
F = Regression Mean Square / Residual Mean Square follows an F distribution.
If we start by assuming that the regression explains nothing, the F ratio will still not be exactly zero, because by chance we might get a small positive value.
The F distribution maps the probability that a ratio of a given size will occur if the regression actually explains nothing.
The larger the value of F, the smaller the likelihood that it will occur by chance if the regression explains nothing.
In this case the probability of an F of 55.86 occurring due to chance is much smaller than 0.05, so we can say that the F statistic is significant at the 0.05 level.
The F Distribution – (named after Ronald Fisher)
Another Model – Explaining Happiness in the ESS 2002 Dataset
happy How happy are you
                               Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0 Extremely unhappy  247         .6        .6              .6
          1                    238         .6        .6              1.2
          2                    450         1.1       1.1             2.2
          3                    943         2.2       2.2             4.5
          4                    1149        2.7       2.7             7.2
          5                    4128        9.7       9.8             17.0
          6                    3349        7.9       7.9             24.9
          7                    7169        16.9      17.0            41.9
          8                    11859       28.0      28.1            70.1
          9                    7555        17.8      17.9            88.0
          10 Extremely happy   5069        12.0      12.0            100.0
          Total                42157       99.5      100.0
Missing   77 Refusal           29          .1
          88 Don't know        118         .3
          99 No answer         54          .1
          Total                201         .5
Total                          42358       100.0
Income Scale in the European Social Survey 2002

hinctnt Household's total net income, all sources
                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 J             713         1.7       2.1             2.1
          2 R             1752        4.1       5.3             7.4
          3 C             2762        6.5       8.3             15.7
          4 M             4722        11.1      14.2            29.9
          5 F             4736        11.2      14.2            44.2
          6 S             4113        9.7       12.4            56.5
          7 K             3738        8.8       11.2            67.8
          8 P             3136        7.4       9.4             77.2
          9 D             4719        11.1      14.2            91.4
          10 H            1978        4.7       5.9             97.4
          11 U            554         1.3       1.7             99.0
          12 N            326         .8        1.0             100.0
          Total           33248       78.5      100.0
Missing   77 Refusal      4876        11.5
          88 Don't know   3573        8.4
          99 No answer    660         1.6
          Total           9110        21.5
Total                     42358       100.0
Does Money Buy Happiness?
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .271a .073 .073 1.857
a. Predictors: (Constant), income
ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 9043.539 1 9043.539 2621.315 .000a
Residual 114347.222 33144 3.450
Total 123390.761 33145
a. Predictors: (Constant), income
b. Dependent Variable: happy How happy are you
Coefficientsa
Model          B       Std. Error   Beta   t         Sig.
1 (Constant)   6.150   .027                228.961   .000
income         .208    .004         .271   51.199    .000
(B and Std. Error are the Unstandardized Coefficients; Beta is the Standardized Coefficient.)
a. Dependent Variable: happy How happy are you
Is the Specification Correct?
Perhaps we should use a quadratic version of the income variable:

*Calculating Quadratic Functions in the ESS 2002.
compute income = hinctnt.
compute incomsq = hinctnt*hinctnt.

Here incomsq is the square of the hinctnt (household income) variable.
If we use incomsq in the model in addition to income, this captures a non-linear relationship between income and happiness: more income increases happiness, but at a declining rate of change.
Regression of Happiness on Income in the ESS 2002: Does Money Buy Happiness?

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .278a .077 .077 1.824
a. Predictors: (Constant), incomsq, hinctnt Household's total net income, all sources

ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 7993.407 2 3996.703 1200.938 .000a
Residual 95473.91 28688 3.328
Total 103467.3 28690
a. Predictors: (Constant), incomsq, hinctnt Household's total net income, all sources
b. Dependent Variable: happy How happy are you

Coefficientsa
Model                                                B       Std. Error   Beta    t         Sig.
1 (Constant)                                         5.221   .066                 79.038    .000
hinctnt Household's total net income, all sources    .545    .022         .703    24.839    .000
incomsq                                              -.027   .002         -.450   -15.875   .000
(B and Std. Error are the Unstandardized Coefficients; Beta is the Standardized Coefficient.)
a. Dependent Variable: happy How happy are you
Quadratic Relationship Between Two Variables
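The shape of the fitted quadratic can be explored by plugging the estimated coefficients (5.221, .545 and -.027) into the model. This is an illustrative sketch with hypothetical function names; because the coefficients are rounded to three decimals, the turning point it reports is only approximate.

```python
# Unstandardized coefficients copied from the quadratic income model.
INTERCEPT = 5.221
B_INCOME = 0.545
B_INCOME_SQ = -0.027


def predicted_happiness(income_band):
    """Predicted happiness (0-10 scale) for an income band on the 1-12 scale."""
    return INTERCEPT + B_INCOME * income_band + B_INCOME_SQ * income_band ** 2


def marginal_effect(income_band):
    """Change in predicted happiness per extra income band: b1 + 2*b2*income."""
    return B_INCOME + 2 * B_INCOME_SQ * income_band


# The curve stops rising where the marginal effect is zero: income = -b1 / (2*b2).
turning_point = -B_INCOME / (2 * B_INCOME_SQ)

for band in (1, 6, 12):
    print(band, round(predicted_happiness(band), 2), round(marginal_effect(band), 3))
print(round(turning_point, 2))
```

The marginal effect shrinks as income rises, which is the "declining rate of change" captured by the negative coefficient on incomsq. With these rounded coefficients the turning point falls near band 10, so the fitted curve is roughly flat across the top income bands.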
Suppose we want to use Occupational Status as a predictor in the Happiness model: we would have to create this variable.
This is done with the assistance of the variable ISCOCO. This is a classification of the many occupations which exist in Europe. For example:

iscoco   Occupation
100      Armed forces
1100     Legislators and senior officials
1110     Legislators, senior government officials
1140     Senior officials of special-interest org
1141     Senior officials of political-party org
1142     Senior officials of economic-interest org

To put this in a form which is usable in the regression model we recode it as follows:

recode iscoco (2000 thru 2470=6)(1000 thru 1319=5)(3000 thru 3480=4)(4000 thru 4223=3)(5000 thru 8340=2)(9000 thru 9330=1)(else=sysmis) into occup.
value labels occup
  1 'unskilled or semi-skilled manual workers'
  2 'skilled manual workers'
  3 'white collar clerical & administrative workers'
  4 'white collar technical workers'
  5 'middle managers'
  6 'professionals and senior managers'.
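The logic of the recode can be sketched outside SPSS as a lookup over code ranges. The Python below is an illustration only: the names `RECODE_RULES` and `recode_occupation` are hypothetical, and `None` stands in for SPSS system-missing.

```python
# (lowest iscoco code, highest iscoco code, occup value), as in the SPSS recode.
RECODE_RULES = [
    (2000, 2470, 6),  # professionals and senior managers
    (1000, 1319, 5),  # middle managers
    (3000, 3480, 4),  # white collar technical workers
    (4000, 4223, 3),  # white collar clerical & administrative workers
    (5000, 8340, 2),  # skilled manual workers
    (9000, 9330, 1),  # unskilled or semi-skilled manual workers
]


def recode_occupation(iscoco):
    """Map an ISCOCO code to the 1-6 occupational status scale, else None."""
    if iscoco is None:
        return None
    for low, high, status in RECODE_RULES:
        if low <= iscoco <= high:
            return status
    return None  # mirrors (else=sysmis)


print(recode_occupation(1141), recode_occupation(2100), recode_occupation(100))
```

Codes outside every range, such as 100 (Armed forces), end up missing, which is one reason the recoded variable has more missing cases than the original.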
The Recoded Occupational Status Variable in the ESS 2002 Data
occup
                                                              Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1.00 unskilled or semi-skilled manual workers       3805        9.0       10.8            10.8
          2.00 skilled manual workers                         14349       33.9      40.7            51.5
          3.00 white collar clerical & administrative workers 4033        9.5       11.4            62.9
          4.00 white collar technical workers                 5474        12.9      15.5            78.5
          5.00 middle managers                                2840        6.7       8.1             86.5
          6.00 professionals and senior managers              4752        11.2      13.5            100.0
          Total                                               35253       83.2      100.0
Missing   System                                              7105        16.8
Total                                                         42358       100.0
Suppose we want to add a gender variable, to see if women are happier than men
IF statements can be used to create new variables in SPSS. These are recodes which are carried out if certain conditions are met.
For example:

compute female=0.
if (gndr eq 2) female=1.

The first line creates a new variable consisting only of zeroes. The second changes this new variable to a score of 1 if the existing variable gndr has a score of 2.
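The same dummy coding can be sketched in Python. The function name is hypothetical, and the sketch makes one assumption that differs from the two SPSS lines: instead of starting every case at zero, it leaves any code other than 1 or 2 (such as 9, No answer) as missing.

```python
def female_dummy(gndr):
    """Return 1 for women (gndr == 2), 0 for men (gndr == 1), None otherwise."""
    if gndr == 2:
        return 1
    if gndr == 1:
        return 0
    return None  # assumption: other codes are treated as missing, not as 0


print([female_dummy(code) for code in (1, 2, 9)])
```

With a 0/1 variable of this kind, the regression coefficient on female is the difference in predicted happiness between women and men, holding the other predictors constant.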
If Statements in SPSS: gndr and Female

gndr Gender
                        Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 Male        20322       48.0      48.0            48.0
          2 Female      21981       51.9      52.0            100.0
          Total         42303       99.9      100.0
Missing   9 No answer   55          .1
Total                   42358       100.0

Female
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid     .00      20309       47.9      48.0            48.0
          1.00     21994       51.9      52.0            100.0
          Total    42303       99.9      100.0
Missing   System   55          .1
Total              42358       100.0
Revised Happiness Model
ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 9393.557 4 2348.389 700.984 .000a
Residual 95396.295 28475 3.350
Total 104789.852 28479
a. Predictors: (Constant), incomsq, female, occup, income
b. Dependent Variable: happy How happy are you
Coefficientsa
Model          B       Std. Error   Beta    t         Sig.
1 (Constant)   5.050   .063                 80.658    .000
female         .090    .022         .024    4.160     .000
occup          .035    .007         .029    4.818     .000
income         .565    .020         .741    27.645    .000
incomsq        -.029   .002         -.481   -17.942   .000
(B and Std. Error are the Unstandardized Coefficients; Beta is the Standardized Coefficient.)
a. Dependent Variable: happy How happy are you
Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .299a .090 .090 1.830
a. Predictors: (Constant), incomsq, female, occup, income
Conclusions
Multiple regression is a relatively simple extension of two-variable regression.
Unlike two-variable regression, in multiple regression we are controlling for the influence of additional variables when examining the relationship between an independent variable and the dependent variable. It is a bit like a statistical experiment.
The great majority of social science models are multivariate models, so we commonly use multiple regression.