Regression

Regression. A goal of science is prediction and explanation of phenomena In order to do so we must find events that are related in some way such that

A goal of science is prediction and explanation of phenomena

In order to do so we must find events that are related in some way such that knowledge about one will lead to knowledge about the other

In psychology we seek to understand the relationships among variables that serve as indicators of a vast amount of information about human nature, in order to better understand ourselves and why we do the things we do

While we could rely on our personal experience (an N of 1) to try to understand human behavior, a more scientific (and better) way to understand the relationship between variables is to assess their correlation

Two variables take on different values, but if they are related in some fashion they will covary

They may do so in a way in which their values tend to move in the same direction, or they may tend to move in opposite directions

The underlying statistic assessing this is covariance, which is at the heart of every statistical procedure you are likely to use inferentially

Covariance as a statistical construct is unbounded and thus difficult to interpret in its raw form

Correlation (Pearson’s r) is a measure of the direction and degree of a linear association between two variables

Correlation is the standardized covariance between two variables

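$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y}$$

$$r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}$$

$$-1 \le r \le 1$$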
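
The covariance and correlation formulas above can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library; the helper names (`covariance`, `pearson_r`, `pearson_r_from_z`) and the scores in `x` and `y` are made up for illustration and are not part of the original slides.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance: summed cross-products of deviations over n - 1."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)


def std_dev(values):
    """Sample standard deviation (a variable's covariance with itself is its variance)."""
    return math.sqrt(covariance(values, values))


def pearson_r(x, y):
    """Correlation as standardized covariance."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))


def pearson_r_from_z(x, y):
    """Correlation as the summed product of z-scores over n - 1."""
    mx, my, sx, sy = mean(x), mean(y), std_dev(x), std_dev(y)
    z_products = [((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)]
    return sum(z_products) / (len(x) - 1)


# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(round(covariance(x, y), 4))        # 1.5
print(round(pearson_r(x, y), 4))         # 0.7746
print(round(pearson_r_from_z(x, y), 4))  # 0.7746
```

Both versions of r agree, which illustrates that the correlation is simply the covariance of the standardized scores.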

Regression allows us to use the information about covariance to make predictions

Given a particular value of x, we can predict y with some level of accuracy

The basic model is that of a straight line (the general linear model)

Only one possible straight line can be drawn once the slope and Y intercept are specified

The formula for a straight line is: Y = bX + a

▪ Y = the calculated value for the variable on the vertical axis
▪ a = the intercept
▪ b = the slope of the line
▪ X = a value for the variable on the horizontal axis

Once this line is specified, we can calculate the corresponding value of Y for any value of X entered

In more general terms, Y = Xb + e, where the elements represent vectors and/or matrices (of the outcome, data, coefficients, and error, respectively), is the general linear model to which most of the techniques in psychological research adhere

Real data do not conform perfectly to a straight line

The best-fit straight line is the one that minimizes the amount of variation of the data points around the line

The most common method, though by no means the only acceptable one, derives a least squares regression line, which minimizes the sum of squared deviations of the data points from it

The equation for this line can be used to predict or estimate an individual’s score on Y on the basis of his or her score on X

$$\hat{Y} = bX + a$$

When the relations between variables are expressed in this manner, we call the relevant equation(s) mathematical models

The intercept and weight values are called the parameters of the model

While typical regression analysis by itself does not determine causal relations, the assumption indicated by such a model is that the variable on the left-hand side of the previous equation is being caused by the variable(s) on the right-hand side

The arrows explicitly go from the predictors to the outcome, not vice versa

[Figure: path diagram in which Variable X, Variable Y, and Variable Z each point to the Criterion; labels A, B, and C]

The process of obtaining the correct parameter values (assuming we are working with the right model) is called parameter estimation

Often, theories specify the form of the relationship rather than the specific values of the parameters

The parameters themselves, assuming the basic model is correct, are typically estimated from data. We refer to the estimation processes as “calibrating the model”

A method is required for choosing parameter values that will give us the best representation of the data possible

In estimating the parameters of our model, we are trying to find a set of parameters that minimizes the error variance.

With least-squares estimation, we want the following quantity to be as small as it possibly can be:

$$\frac{\sum (Y - \hat{Y})^2}{N}$$

Estimating the slope (the regression coefficient) requires first estimating the covariance

$$b = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}$$

Estimating the Y intercept

$$a = \bar{Y} - b\bar{X}$$

where $\bar{Y}$ and $\bar{X}$ are the means based on the sets of Y and X values respectively, and b is the estimated slope

These calculations ensure that the regression line passes through the point on the scatterplot defined by the two means
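
The slope and intercept estimates, b = cov(X, Y) / var(X) and a = Ȳ − bX̄, can be sketched in a few lines of Python. The function name `fit_line` and the scores in `x` and `y` are hypothetical, chosen only to illustrate the calculation.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(x)
    mx, my = mean(x), mean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    b = cov_xy / var_x   # slope: covariance divided by the variance of X
    a = my - b * mx      # intercept: forces the line through (mean X, mean Y)
    return a, b


# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
print(round(a, 4), round(b, 4))   # 2.2 0.6
```

Plugging the mean of X into the fitted line returns the mean of Y (2.2 + 0.6 × 3 = 4), which is the property noted above.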

Alternatively, the slope can be written as

$$b = r\frac{s_y}{s_x}$$

so by substituting we get

$$\hat{Y} = r\frac{s_y}{s_x}X + \bar{Y} - r\frac{s_y}{s_x}\bar{X}$$
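
The alternative slope formula follows from definitions already given. A short derivation, using only the facts that r = cov(X, Y) / (s_x s_y) and var(X) = s_x²:

```latex
\begin{aligned}
b &= \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}
   = \frac{r\, s_x s_y}{s_x^2}
   = r\,\frac{s_y}{s_x}
\end{aligned}
```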

Total variance = predicted variance + error variance

$$s^2_y = s^2_{\hat{y}} + s^2_{(y_i - \hat{y}_i)}$$

Total variability in the dependent variable (observed – mean) comes from two sources

▪ Variability predicted by the model, i.e. the variability in the dependent variable that is due to the independent variable: how far our predicted values are from the mean of Y
▪ Error or residual variability, i.e. variability not explained by the independent variable: the difference between the predicted values and the observed values

The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the regression of y on x

R² = variance of the predicted values ŷ divided by the variance of the observed values y

We can also show this graphically using a Venn diagram, showing r² as the proportion of variability shared by two variables (X and Y)

The larger the area of overlap, the greater the strength of the association between the two variables

$$s^2_{\hat{y}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{n - 1}$$

$$r^2 = \frac{s^2_{\hat{y}}}{s^2_y}$$

$$s^2_{\hat{y}} = r^2 s^2_y$$
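
The partition of the total variance into predicted and error variance, and the ratio that gives r², can be verified numerically. This is a minimal sketch with made-up scores; the variable names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Sample variance with n - 1 in the denominator."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least squares fit: slope from the cross-products, intercept from the means
mx, my = mean(x), mean(y)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

total_var = variance(y)
predicted_var = variance(predicted)
error_var = variance(residuals)
r_squared = predicted_var / total_var

print(round(total_var, 4), round(predicted_var, 4), round(error_var, 4))  # 1.5 0.9 0.6
print(round(r_squared, 4))                                                # 0.6
```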

How good a fit does our line represent?

The error associated with a prediction (of a Y value from a known X value) is a function of the deviations of Y about the predicted point

The standard error of estimate provides an assessment of the accuracy of prediction: it is the standard deviation of Y predicted from X

$$s_{Y \cdot X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{df_{residual}}}$$

In terms of R², we can see that the more variance we account for, the smaller our standard error of estimate will be

$$s_{Y \cdot X} = s_Y \sqrt{(1 - R^2)\frac{N - 1}{N - 2}}$$
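
The standard error of estimate can be computed either from the residual sum of squares or from R². The sketch below, which assumes made-up scores and hypothetical variable names, shows that the two routes give the same value.

```python
import math

# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mx = sum(x) / n
my = sum(y) / n

# Least squares slope and intercept
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = sum((yi - my) ** 2 for yi in y)

# Route 1: square root of residual sum of squares over residual df (N - 2)
see_from_residuals = math.sqrt(ss_residual / (n - 2))

# Route 2: standard deviation of Y scaled by the unexplained proportion of variance
r_squared = 1 - ss_residual / ss_total
s_y = math.sqrt(ss_total / (n - 1))
see_from_r2 = s_y * math.sqrt((1 - r_squared) * (n - 1) / (n - 2))

print(round(see_from_residuals, 4), round(see_from_r2, 4))   # 0.8944 0.8944
```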

▪ Intercept: value of Y if X is 0. Often not meaningful, particularly if it’s practically impossible to have an X of 0 (e.g. weight)
▪ Slope: amount of change in Y seen with a 1 unit change in X
▪ Standardized regression coefficient: amount of change in Y, in standard deviation units, seen with a 1 standard deviation unit change in X. In simple regression it is equivalent to the r for the two variables
▪ Standard error of estimate: gives a measure of the accuracy of prediction
▪ R²: proportion of variance explained by the model

The General Linear Model with Categorical Predictors

Regression can actually handle different types of predictors, and in the social sciences we are often interested in differences between groups

For now we will concern ourselves with the two independent groups case, e.g. gender, Republican vs. Democrat, etc.

There are different ways to code categorical data for regression, and in general, to represent a categorical variable you need k − 1 coded variables, where k is the number of categories/groups

Dummy coding involves using zeros and ones to identify group membership, and since we only have two groups, one group will be zero (the reference group) and the other 1

We will revisit coding with k > 2 after we’ve discussed multiple regression

Example

The thing to note at this point is that we have a simple bivariate correlation/simple regression setting

The correlation between group and the DV is .76

This is sometimes referred to as the point biserial correlation (rpb) because of the categorical variable

However, don’t be fooled: it is calculated exactly the same way as before, i.e. you treat that 0/1 grouping variable like any other variable when calculating the correlation coefficient

Group   DV
0       3
0       5
0       7
0       2
0       3
1       6
1       7
1       7
1       8
1       9
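
To see that the point biserial correlation really is an ordinary Pearson r, the calculation can be run directly on the ten scores from the example. This is a minimal sketch; the function name `pearson_r` is an assumption, not something taken from the slides.

```python
import math

# Example data from the slide: 0/1 group membership and DV scores
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
dv = [3, 5, 7, 2, 3, 6, 7, 7, 8, 9]


def pearson_r(x, y):
    """Ordinary Pearson correlation; the 0/1 variable gets no special treatment."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r_pb = pearson_r(group, dv)
print(round(r_pb, 2))       # 0.76
print(round(r_pb ** 2, 3))  # 0.577
```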

Graphical display

The R-square is .76² = .577

The regression equation is

$$\hat{Y} = 4 + 3.4X$$

Look closely at the descriptive output compared to the coefficients.

What do you see?

Coefficients (a)

              Unstandardized         Standardized
              Coefficients           Coefficients                     95% Confidence Interval for B
Model 1       B        Std. Error    Beta            t       Sig.     Lower Bound    Upper Bound
(Constant)    4.000    .728                          5.494   .001     2.321          5.679
group         3.400    1.030         .760            3.302   .011     1.026          5.774

a. Dependent Variable: dv

Descriptive Statistics (a)

group                         N    Mean      Std. Deviation
.00     dv                    5    4.0000    2.00000
        Valid N (listwise)    5
1.00    dv                    5    7.4000    1.14018
        Valid N (listwise)    5

a. No statistics are computed for one or more split files because there are no valid cases.

Note again our regression equation

Recall the definition for the slope and constant

First the constant: what does “when X = 0” mean here in this setting?

It means when we are in the 0 group

What is that value?

Y = 4, which is that group’s mean

The constant here is thus the reference group’s mean

$$\hat{Y} = 4 + 3.4X$$

Now think about the slope

What does a ‘1 unit change in X’ mean in this setting?

It means we go from one group to the other

Based on that coefficient, what does the slope represent in this case (i.e. can you derive that coefficient from the descriptive stats in some way)?

The coefficient is the difference between means

$$\hat{Y} = 4 + 3.4X$$

The regression line covers the values represented, i.e. 0 and 1, for the two groups

It passes through each of their means

Using least squares regression, the regression line always passes through the mean of X and Y

The constant (if we are using dummy coding) is the mean for the zero (reference) group

The coefficient is the difference between means
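
The two claims above (constant = reference group mean, coefficient = difference between means) can be confirmed by fitting the dummy-coded regression by hand on the example data. This is a minimal sketch; the variable names are hypothetical.

```python
# Example data from the slides: dummy-coded group and DV scores
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
dv = [3, 5, 7, 2, 3, 6, 7, 7, 8, 9]


def mean(values):
    return sum(values) / len(values)


# Least squares slope and intercept for dv regressed on the 0/1 group code
mx, my = mean(group), mean(dv)
b = sum((g - mx) * (y - my) for g, y in zip(group, dv)) / sum((g - mx) ** 2 for g in group)
a = my - b * mx

# Group means computed directly from the raw scores
mean_0 = mean([y for g, y in zip(group, dv) if g == 0])
mean_1 = mean([y for g, y in zip(group, dv) if g == 1])

print(round(a, 4), round(b, 4))            # 4.0 3.4
print(mean_0, round(mean_1 - mean_0, 4))   # 4.0 3.4
```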

Analysis of variance

Recall that in regression we are trying to account for the variance in the DV

That total variance reflects the sum of the squared deviations of values from the DV mean (sums of squares)

That breaks down into:

▪ Variance we account for: sums of squares predicted (or model, or regression)
▪ Variance we do not account for: sums of squares ‘error’ (observed – predicted)

What are our predicted values in this case?

We only have 2 values of X to plug in

We already know what Y is if X is zero, and so we’d predict the group mean of 4 for all zero values

$$\hat{Y} = 4 + 3.4(0) = 4$$

The only other value to plug in is 1 for the rest of the cases

In other words, for those in the 1 group, we’re predicting their respective mean

$$\hat{Y} = 4 + 3.4(1) = 7.4$$

So in order to get our model summary and F-statistic, we need:

▪ Total variance
▪ Predicted variance: predicted value minus the grand mean of the DV, just like it has always been. Note again how the predicted value for each case is that case’s group average for the DV
▪ Error variance: essentially each person’s score minus their group mean

Predicted SS = 5[(4 − 5.7)² + (7.4 − 5.7)²] = 28.9

Error SS = (3 − 4)² + (5 − 4)² + … + (9 − 7.4)² = 21.2

Total variance to be accounted for = (3 − 5.7)² + (5 − 5.7)² + … + (9 − 5.7)² = 50.1

Or just Predicted SS + Error SS = 50.1

Calculate R² from these values
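
The sums of squares above, and the R² they imply, can be reproduced from the example data. A minimal sketch follows; the variable names are hypothetical.

```python
# Example data from the slides: dummy-coded group and DV scores
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
dv = [3, 5, 7, 2, 3, 6, 7, 7, 8, 9]


def mean(values):
    return sum(values) / len(values)


grand_mean = mean(dv)
group_means = {g: mean([y for gi, y in zip(group, dv) if gi == g]) for g in set(group)}

# With a single dummy-coded predictor, each case's predicted value is its group mean
predicted = [group_means[g] for g in group]

ss_predicted = sum((p - grand_mean) ** 2 for p in predicted)
ss_error = sum((y - p) ** 2 for y, p in zip(dv, predicted))
ss_total = sum((y - grand_mean) ** 2 for y in dv)
r_squared = ss_predicted / ss_total

print(round(ss_predicted, 1), round(ss_error, 1), round(ss_total, 1))  # 28.9 21.2 50.1
print(round(r_squared, 3))                                             # 0.577
```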

Here is the summary table from our regression

The mean square is derived by dividing each sum of squares by its degrees of freedom: k − 1 for the regression, N − k for the error, and N − 1 for the total, where k is the number of groups and N is the total sample size

The ratio of the mean squares is the F-statistic

ANOVA (b)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    28.900           1    28.900        10.906   .011 (a)
Residual      21.200           8    2.650
Total         50.100           9

a. Predictors: (Constant), group
b. Dependent Variable: dv
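
The mean squares and the F-statistic in the summary table follow from simple division. The sketch below takes the sums of squares from the table as given; the variable names are hypothetical, and the p-value is not computed here.

```python
# Sums of squares taken from the summary table
ss_regression = 28.9
ss_residual = 21.2

n = 10  # total sample size
k = 2   # number of groups

df_regression = k - 1   # 1
df_residual = n - k     # 8

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares
f_statistic = ms_regression / ms_residual

print(round(ms_residual, 3), round(f_statistic, 3))   # 2.65 10.906
```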

Note the title of the summary table: ANOVA

It is an ANOVA summary table because you have in fact just conducted an analysis of variance, specifically for the two group situation

ANOVA, as the statistical procedure is called, is a special case of regression

Below is the output from the ANOVA procedure, as opposed to the regression output.

Tests of Between-Subjects Effects (Dependent Variable: dv)

Source            Type III Sum of Squares   df   Mean Square   F        Sig.   Partial Eta Squared
group             28.900                    1    28.900        10.906   .011   .577
Error             21.200                    8    2.650
Total             375.000                   10
Corrected Total   50.100                    9

Note the ‘partial eta-squared’

Eta-squared has the same interpretation as R-squared and, as one can see, is the R-squared from our regression

SPSS calls it partial because there is often more than one grouping variable, and we are interested in unique effects (i.e. we partial out the effects of other variables)

However, it is actually eta-squared here, as there is no other variable effect to partial out

The t-test is a special case of ANOVA

ANOVA can handle more than two groups, while the t-test is just for two

However, F = t² in the two group setting, and the p-value is exactly the same

Independent Samples Test: t-test for Equality of Means (dv, equal variances assumed)

                                                                          95% Confidence Interval of the Difference
t        df   Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower       Upper
-3.302   8    .011              -3.40000          1.02956                 -5.77418    -1.02582
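
The t-statistic for the two groups, and the F = t² relationship, can be checked by hand. This is a minimal sketch of the pooled-variance (equal variances assumed) t-test on the example scores; the helper names are hypothetical.

```python
import math

# Example scores from the slides, split by group
group0 = [3, 5, 7, 2, 3]
group1 = [6, 7, 7, 8, 9]


def mean(values):
    return sum(values) / len(values)


def sum_sq_dev(values):
    """Sum of squared deviations from the group's own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


n0, n1 = len(group0), len(group1)

# Pooled variance: combined within-group sums of squares over N - 2
pooled_var = (sum_sq_dev(group0) + sum_sq_dev(group1)) / (n0 + n1 - 2)
se_diff = math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Difference taken as group 0 minus group 1, hence the negative sign
t = (mean(group0) - mean(group1)) / se_diff

print(round(t, 3), round(se_diff, 5), round(t ** 2, 3))   # -3.302 1.02956 10.906
```

The pooled variance (2.65) is the residual mean square from the ANOVA table, and t² reproduces the F of 10.906.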

Compare to regression

The t, standard error, CI and p-value are the same (apart from sign, which depends only on which group is subtracted from which), and again the coefficient is the difference between means

Statistics is a language used for communicating research ideas and findings

We have various dialects with which to speak it, and of course we pick freely among the words available

Sometimes we prefer to do regression and talk about amount of variance to be accounted for

Sometimes we prefer to talk about mean differences and how large those are

In both cases we are interested in the effect size

Which tool we use reflects how we want to talk about our results

Let’s assume that we believe there is a linear relationship between X and Y.

Which set of parameter values will bring us closest to representing the data accurately?

We begin by picking some values, plugging them into the equation, and seeing how well the implied values correspond to the observed values

We can quantify what we mean by “how well” by examining the difference between the model-implied Y and the actual Y value

This difference between our observed value and the one predicted, $y - \hat{y}$, is often called error in prediction, or the residual

$$\hat{Y} = 2 - 2X$$

Let’s try a different value of b and see what happens

Now the implied values of Y are getting closer to the actual values of Y, but we’re still off by quite a bit

$$\hat{Y} = 2 - 1X$$

Things are getting better, but certainly things could improve

$$\hat{Y} = 2 + 0X$$

Ah, much better

$$\hat{Y} = 2 + 1X$$

Now that’s very nice

There is a perfect correspondence between the predicted values of Y and the actual values of Y

$$\hat{Y} = 2 + 2X$$
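
The trial-and-error search over slope values can be sketched as a short loop that scores each candidate line by its sum of squared residuals. The data here are made up so that they lie exactly on Y = 2 + 2X, and the candidate slopes (−2 through 2, with the intercept held at 2) are an assumption about the sequence shown in the slides; the function name `sum_squared_error` is hypothetical.

```python
# Made-up data lying exactly on the line Y = 2 + 2X (an assumption for illustration)
x = [0, 1, 2, 3, 4]
y = [2 + 2 * xi for xi in x]


def sum_squared_error(a, b, x, y):
    """Sum of squared residuals (observed minus model-implied Y) for the line a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a = 2  # intercept held fixed while the slope is varied
for b in [-2, -1, 0, 1, 2]:
    print(b, sum_squared_error(a, b, x, y))

# Prints:
# -2 480
# -1 270
# 0 120
# 1 30
# 2 0
```

The error shrinks with each step and reaches zero when the slope matches the one that generated the data. Least squares estimation finds that minimizing value directly rather than by trial and error.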