Linear regression: Brian Healy, PhD, BIO203


Page 1:

Linear regression

Brian Healy, PhD

BIO203

Page 2:

Previous classes

Hypothesis testing
– Parametric
– Nonparametric

Correlation

Page 3:

What are we doing today?

Linear regression
– Continuous outcome with continuous, dichotomous or categorical predictor
– Equation: E(Y|X=x) = β0 + β1x

Interpretation of coefficients

Connection between regression and
– correlation
– t-test
– ANOVA

Page 4:

Big picture

Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical and continuous predictors with a continuous outcome.

Extensions of linear regression allow
– Dichotomous outcomes: logistic regression
– Survival analysis: Cox proportional hazards regression
– Repeated measures

Amazingly, many of the analyses we have learned can be completed using linear regression.

Page 5:

Example

Yesterday, we investigated the association between age and BPF using a correlation coefficient.

Can we fit a line to this data?

[Scatter plot of BPF (roughly 0.75-0.95) vs. age (20-60)]

Page 6:

Quick math review

As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept.

One definition of m is that for every one-unit increase in x, there is an m-unit increase in y.

One definition of b is the value of y when x is equal to zero.

[Plot of the line y = 1.5x + 4 for x from 0 to 12]
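As a quick numerical check of these definitions using the plotted line: at x = 2, y = 1.5(2) + 4 = 7, and at x = 3, y = 1.5(3) + 4 = 8.5, an increase of 1.5 in y for a one-unit increase in x; at x = 0, y = 4, the y-intercept.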

Page 7:

Picture

Look at the data in this picture.

Does there seem to be a correlation (linear relationship) in the data?

Is the data perfectly linear?

Could we fit a line to this data?

[Scatter plot of y (0-25) vs. x (0-12)]

Page 8:

What is linear regression?

Linear regression tries to find the best line (curve) to fit the data.

The method of finding the best line (curve) is least squares, which minimizes the sum of the squared distances from the line for each of the points.

[Scatter plot of the data with the fitted line y = 1.5x + 4]

Page 9:

How do we find the best line?

Let's look at three candidate lines.

Which do you think is the best?

What is a way to determine the best line to use?

Page 10:

Residuals

The actual observations, y_i, may be slightly off the population line because of variability in the population. The equation is y_i = β0 + β1*x_i + ε_i, where ε_i is the deviation from the population line (see picture).

This is called the residual.

[Figure annotation: the residual is the distance from the line for patient 1]

Page 11:

Least squares

The method employed to find the best line is called least squares. This method finds the values of β0 and β1 that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the ε_i²:

Σ_{i=1}^n ε_i² = Σ_{i=1}^n (y_i - β0 - β1*x_i)²

Page 12:

Estimates of regression coefficients

Once we have solved the least squares equation, we obtain estimates for the β's, which we refer to as β0hat and β1hat.

The final least squares equation is yhat = β0hat + β1hat*x1, where yhat is the mean value of y for a value of x1.

β1hat = Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^n (x_i - x̄)²

β0hat = ȳ - β1hat*x̄

Page 13:

Assumptions of linear regression

Linearity
– Linear relationship between outcome and predictors
– E(Y|X=x) = β0 + β1*x1 + β2*x2
– E(Y|X=x) = β0 + β1*x1 + β2*x2² is still a linear regression equation because each of the β's is to the first power

Normality of the residuals
– The residuals, ε_i, are normally distributed, N(0, σ²)

Homoscedasticity of the residuals
– The residuals, ε_i, have the same variance

Independence
– All of the data points are independent
– Correlated data points can be taken into account using multivariate and longitudinal data methods

Page 14:

Linearity assumption

One of the assumptions of linear regression is that the relationship between the predictors and the outcome is linear.

We call this the population regression line: E(Y | X=x) = μ_y|x = β0 + β1*x

This equation says that the mean of y given a specific value of x is defined by the β coefficients.

The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before.

Page 15:

Normality and homoscedasticity assumption

Two other assumptions of linear regression are related to the ε_i's:
– Normality: the distribution of the residuals is normal.
– Homoscedasticity: the variance of y given x is the same for all values of x.

[Figure: the distribution of y-values at each value of x is normal with the same variance]

Page 16:

Example

Here is a regression equation for the comparison of age and BPF:

E(BPF | age) = β0 + β1*age

BPF_i = β0 + β1*age_i + ε_i

[Scatter plot of BPF (0.75-0.95) vs. age (20-60)]

Page 17:

Results

The estimated regression equation:

BPFhat = 0.957 - 0.0029*age

[Scatter plot of BPF vs. age (20-60) with the fitted values (predval) overlaid]
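A minimal Stata sketch of how a fit and plot like this could be produced, assuming the bpf and age variables shown in the output on the next page (the fitted values are saved under the name predval, matching the plot legend; the twoway line is one way the overlay might have been drawn):

. regress bpf age
. predict predval, xb
. twoway (scatter bpf age) (line predval age, sort)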

Page 18:

. regress bpf age

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =   13.48
       Model |  .022226034     1  .022226034           Prob > F      =  0.0010
    Residual |  .044524108    27  .001649041           R-squared     =  0.3330
-------------+------------------------------           Adj R-squared =  0.3083
       Total |  .066750142    28  .002383934           Root MSE      =  .04061

------------------------------------------------------------------------------
         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0028799   .0007845    -3.67   0.001    -.0044895   -.0012704
       _cons |    .957443    .035037    27.33   0.000      .885553    1.029333
------------------------------------------------------------------------------

[Annotations: _cons row = estimated intercept; age row = estimated slope]

Page 19:

Interpretation of regression coefficients

The final regression equation is

BPFhat = 0.957 - 0.0029*age

The coefficients mean
– the estimate of the mean BPF for a patient with an age of 0 is 0.957 (β0hat)
– an increase of one year in age leads to an estimated decrease of 0.0029 in mean BPF (β1hat)

Page 20:

Unanswered questions

Is the estimate of β1 (β1hat) significantly different than zero? In other words, is there a significant relationship between the predictor and the outcome?

Have the assumptions of regression been met?

Page 21:

Estimate of variance for the βhat's

In order to determine if there is a significant association, we need an estimate of the variance of β0hat and β1hat.

s_y|x is the residual variance in y after accounting for x (standard deviation from regression, root mean square error).

sehat(β1hat) = s_y|x / sqrt( Σ_{i=1}^n (x_i - x̄)² )

sehat(β0hat) = s_y|x * sqrt( 1/n + x̄² / Σ_{i=1}^n (x_i - x̄)² )

Page 22:

Test statistic

For both regression coefficients, we use a t-statistic to test any specific hypothesis.
– Each has n-2 degrees of freedom (this is the sample size minus the number of parameters estimated).

What is the usual null hypothesis for β1?

t = (β1hat - β1) / sehat(β1hat)

t = (β0hat - β0) / sehat(β0hat)
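As a quick check against the Stata output shown earlier, the t-statistic for age under the usual null hypothesis (β1 = 0) can be reproduced by hand from the reported coefficient and standard error:

. display -.0028799/.0007845

which returns roughly -3.67, matching the reported t value.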

Page 23:

Hypothesis test

1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = -3.67 (27 dof)
5) p-value = 0.0011
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant association between age and BPF

Page 24:

. regress bpf age

[Same regression output as on Page 18, with annotations pointing out the estimated intercept (_cons), the estimated slope (age), and the p-value for the slope (0.001)]

Page 25:

Comparison to correlation

In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation.

We get the same p-value!!
Our conclusion is exactly the same!!
There are other relationships we will see later.

Method               p-value
Correlation          0.0010
Linear regression    0.0010

Page 26:

Confidence interval for β1

As we have done previously, we can construct a confidence interval for the regression coefficients.

Since we are using a t-distribution, we do not automatically use 1.96. Rather, we use the cut-off from the t-distribution.

The interpretation of the confidence interval is the same as we have seen previously.

( β1hat - t_{1-α/2, dof}*sehat(β1hat),  β1hat + t_{1-α/2, dof}*sehat(β1hat) )
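As a worked check against the Stata output, a sketch using the reported slope and standard error (invttail gives the t cut-off for 27 degrees of freedom):

. display invttail(27, 0.025)
. display -.0028799 - invttail(27, 0.025)*.0007845
. display -.0028799 + invttail(27, 0.025)*.0007845

The cut-off is about 2.05, and the two limits come out very close to the reported interval (-.0044895, -.0012704).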

Page 27:

Intercept

STATA also provides a test statistic and p-value for the estimate of the intercept.

This is for H0: β0 = 0, which is often not a hypothesis of interest because it corresponds to testing whether the BPF is equal to zero at an age of 0.

Since BPF can't be 0 at age 0, this test is not really of interest.

We can center covariates to make this test important.
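A minimal sketch of centering in Stata, assuming the bpf and age variables from the earlier output (age_c is a made-up name for the centered covariate):

. summarize age
. gen age_c = age - r(mean)
. regress bpf age_c

The slope is unchanged, but the intercept now estimates the mean BPF at the average age in the sample, a more interpretable quantity.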

Page 28:

Prediction

Page 29:

Prediction

Beyond determining if there is a significant association, linear regression can also be used to make predictions.

Using the regression equation, we can predict the BPF for patients with specific age values.
– Ex. A patient with age = 40

The expected BPF for a patient of age 40 based on our experiment is 0.841.

BPFhat = 0.957 - 0.0029*40 = 0.841
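The same prediction can be computed in Stata after running the regression; a small sketch (the lincom line is optional and also returns a standard error and confidence interval for this predicted mean):

. display _b[_cons] + 40*_b[age]
. lincom _cons + 40*age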

Page 30:

Extrapolation

Can we predict the BPF for a patient with age 80? What assumption would we be making?

[Scatter plot of BPF vs. age (20-60) with the fitted values (predval) overlaid]

Page 31:

Confidence interval for prediction

We can place a confidence interval around our predicted mean value.

This corresponds to the plausible values for the mean BPF at a specific age.

To calculate a confidence interval for the predicted mean value, we need an estimate of the variability in the predicted mean:

sehat(yhat) = s_y|x * sqrt( 1/n + (x - x̄)² / Σ_{i=1}^n (x_i - x̄)² )

Page 32:

Confidence interval

Note that the standard error equation has a different magnitude based on the x value. In particular, the magnitude is the least when x = the mean of x.

Since the test statistic is based on the t-distribution, our confidence interval is

( yhat - t_{α/2, df}*sehat(yhat),  yhat + t_{α/2, df}*sehat(yhat) )

This confidence interval is rarely used for hypothesis testing.

Page 33:

[Figure: BPF (0.75-0.95) vs. age (20-60)]

Page 34:

Prediction interval

A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a sample size.

Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations. This is called a prediction interval.

What would you estimate as the value for a single new observation?

Do you think a prediction interval is narrower or wider?

Page 35:

Prediction interval

Confidence interval always tighter than prediction intervals.

The variability in the prediction of a single observation contains two types of variability:
– Variability of the estimate of the mean (confidence interval)
– Variability around the estimate of the mean (residual variability)

setilde(yhat) = sqrt( s_y|x² + sehat(yhat)² )

( yhat - t_{α/2, df}*setilde(yhat),  yhat + t_{α/2, df}*setilde(yhat) )
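Both standard errors are available in Stata after the regression; a minimal sketch, with se_mean and se_single as made-up variable names (stdp gives the standard error of the predicted mean, stdf the standard error for forecasting a single new observation):

. regress bpf age
. predict se_mean, stdp
. predict se_single, stdf

The stdf value folds the residual variability on top of stdp, which is why the prediction interval is wider than the confidence interval.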

Page 36:

[Figure: BPF (0.7-1.0) vs. age (20-60)]

Page 37:

Conclusions

Prediction interval is always wider than confidence interval.
– Common to find significant differences between groups but not be able to predict very accurately.
– To predict accurately for a single patient, we need limited overlap of the distributions; the benefit of an increased sample size in decreasing the standard error does not help here.

Page 38:

Model checking

Page 39:

How good is our model?

Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data.

R² = coefficient of determination = proportion of variance in the outcome explained by the model.
– When we have only one predictor, it is the proportion of the variance in y explained by x.

R² = (s_y² - s_y|x²) / s_y²

Page 40:

R²

What if all of the variability in y was explained by x?
– What would R² equal?
– What does this tell you about the correlation between x and y?
– What if the correlation between x and y is negative?

What if none of the variability in y is explained by x?
– What would R² equal?
– What is the correlation between x and y in this case?

Page 41:

r vs. R²

R² = (Pearson's correlation coefficient)² = r²

Since r is between -1 and 1, R² is always less than or equal to |r|.
– r = 0.1, R² = 0.01
– r = 0.5, R² = 0.25

Method    Estimate
r         -0.577
R²        0.333
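These values can be checked directly in Stata; a small sketch assuming the bpf and age variables from the earlier output (correlate stores the correlation in r(rho)):

. correlate bpf age
. display r(rho)^2

Squaring the returned correlation of about -0.577 reproduces the R-squared of 0.333 reported by regress.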

Page 42:

Evaluation of model

Linear regression required several assumptions:
– Linearity
– Homoscedasticity
– Normality
– Independence (usually from study design)

We must determine if the model assumptions were reasonable or if a different model may have been needed.

Statistical research has investigated relaxing each of these assumptions.

Page 43:

Scatter plot

A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see:
– Are there any outliers?
– Is the relationship between x and y approximately linear?
– Is the variance in the data approximately constant for all values of x?

Page 44:

Tests for the assumptions

There are several different ways to test the assumptions of linear regression:
– Graphical
– Statistical

Many of the tests use the residuals, which are the distances between the fitted line and the outcomes:

εhat_i = y_i - yhat_i = y_i - β0hat - β1hat*x_i

Page 45:

Residual plot

[Plot of residuals (-0.1 to 0.1) vs. fitted values (0.8-0.9)]

If the assumptions of linear regression are met, we will observe a random scatter of points.
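A minimal Stata sketch of how a residual-versus-fitted plot like this can be produced after the regression (rvfplot is the built-in shortcut; the predict lines build the same plot by hand, with fitted and res as made-up variable names):

. regress bpf age
. rvfplot
. predict fitted, xb
. predict res, residuals
. scatter res fitted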

Page 46:

Investigating linearity

Scatter plot of predictor vs. outcome.

What do you notice here?

One way to handle this is to transform the predictor to include a quadratic or other term.

[Figure: "Non-linear relationship", scatter plot of y (0-50) vs. x (0-12)]

Page 47:

Aging

Research has shown that the decrease in BPF in normal people is pretty slow up until age 65, and then there is a steeper drop.

[Scatter plot of BPF (0.65-0.85) vs. age (40-80)]

Page 48:

Fitted line

[Scatter plot of BPF (0.65-0.85) vs. age (40-80) with a straight fitted line]

Note how the majority of the values are above the fitted line in the middle and below the fitted line on the two ends.

Page 49:

What if we fit a line for this?

The residual plot shows a non-random scatter because the relationship is not really linear.

[Plot of residuals vs. fitted values (0.72-0.82)]

Page 50:

What can we do?

If the relationship between x and y is not linear, we can try a transformation of the values.

Possible transformations:
– Add a quadratic term
– Fit a spline. This is when there is a slope for a certain part of the curve and a different slope for the rest of the curve.

Page 51:

Adding a quadratic term

[Scatter plot of BPF (0.65-0.85) vs. age (40-80) with a fitted quadratic curve]
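A sketch of how a quadratic term could be added in Stata, assuming BPF and age variables for this aging example (age2 is a made-up name; the last line is an equivalent factor-variable shortcut):

. gen age2 = age^2
. regress bpf age age2
. regress bpf c.age##c.age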

Page 52:

Residual plot

[Plot of residuals (-0.05 to 0.05) vs. fitted values (0.7-0.8) from the quadratic model]

Page 53:

Checking linearity

A plot of residuals vs. the predictor is also used to detect departures from linearity.

These plots allow you to investigate each predictor separately, so this becomes important in multiple regression.

If linearity holds, we anticipate a random scatter of the residuals on both types of residual plot.

Page 54:

Homoscedasticity

The second assumption is equal variance across the values of the predictor.

The top plot shows the assumption is met, while the bottom plot shows that there is a greater amount of variance for larger fitted values.

Page 55:

Example

[Scatter plot of expression level (0-300000) vs. lipid number (1-6)]

Page 56:

Example

In this example, we can fit a linear regression model assuming that there is a linear increase in expression with lipid number, but here is the residual plot from this analysis.

What is wrong?

[Plot of residuals (-100000 to 200000) vs. fitted values (-50000 to 100000)]

Page 57:

Transform the y-value

Clearly, the residuals showed that we did not have equal variance.

What if we log-transform our y-value?

[Scatter plot of log expression level (6-14) vs. lipid number (1-6)]

Page 58:

New regression equation

By transforming the outcome variable we have changed our regression equation:
– Original: Expression_i = β0 + β1*lipid_i + ε_i
– New: ln(Expression_i) = β0 + β1*lipid_i + ε_i

What is the interpretation of β1 from the new regression model?
– For every one-unit increase in lipid number, there is a β1-unit increase in ln(Expression) on average.
– The interpretation has changed due to the transformation.
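A minimal Stata sketch of this transformation, with hypothetical variable names (the expression and lipid variables are not shown in the deck's output, so treat them as placeholders):

. gen lnexpr = ln(expression)
. regress lnexpr lipid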

Page 59:

Residual plot

On the log scale, the assumption of equal variance appears much more reasonable.

[Plot of residuals (-3 to 2) vs. fitted values (7-11)]

Page 60:

Checking homoscedasticity

If we do not appear to have equal variance, a transformation of the outcome variable can be used.
– Most common are the log transformation or the square root transformation.

Other approaches involving weighted least squares can also be used if a transformation does not work.

Page 61:

Normality

Regression requires that the residuals are normally distributed.

To test if the residuals are normal:
– Histogram of residuals
– Normal probability plot

[Histogram of residuals (roughly -0.1 to 0.05) with density on the vertical axis]

Several statistical tests for normality of residuals are also available.
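A sketch of these checks in Stata after the regression (res is a made-up name for the saved residuals; swilk runs a Shapiro-Wilk test as one example of a formal normality test):

. predict res, residuals
. histogram res, normal
. qnorm res
. swilk res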

Page 62:

What if normality does not hold?

Transformations of the outcome can often help.

Changing to another type of regression that does not require normality of the residuals:
– Logistic regression
– Poisson regression

Page 63:

Outliers

Investigating the residuals also provides information regarding outliers.

If a value is extreme in the vertical direction, the residual will be extreme as well.
– You will see this in lab.

If a value is extreme in the horizontal direction, this value can have too much importance (leverage).
– This is beyond the scope of this class.

Page 64:

Page 65:

Example

Another measure of disease burden in MS is the T2 lesion volume in the brain.
– Over the course of the disease, patients accumulate brain lesions that they do not recover from.

This is a measure of the disease burden in the brain.

Is there a significant linear relationship between T2 lesion volume and age?

Page 66:

[Scatter plot of lesion volume (0-30) vs. age (20-60)]

Page 67:

Linear model

Our initial linear model:
– LV_i = β0 + β1*age_i + ε_i
– What is the interpretation of β1?
– What is the interpretation of β0?

Using STATA, we get the following regression equation:

LVhat = 3.70 + 0.062*age

– Is there a significant relationship between age and lesion volume?

Page 68:

Hypothesis test

1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.99 (102 dof)
5) p-value = 0.32
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume

Page 69:

. regress lv_entry age

      Source |       SS       df       MS              Number of obs =     104
-------------+------------------------------           F(  1,   102) =    0.98
       Model |  33.1886601     1  33.1886601           Prob > F      =  0.3236
    Residual |  3440.84404   102  33.7337651           R-squared     =  0.0096
-------------+------------------------------           Adj R-squared = -0.0002
       Total |   3474.0327   103  33.7284729           Root MSE      =  5.8081

------------------------------------------------------------------------------
    lv_entry |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0623605   .0628706     0.99   0.324    -.0623429     .187064
       _cons |   3.699857   2.742369     1.35   0.180    -1.739618    9.139333
------------------------------------------------------------------------------

[Annotations: estimated coefficient and p-value for age]

Page 70:

[Plot of residuals (-10 to 30) vs. fitted values (5-7.5)]

Page 71:

Linear model

Our new linear model:
– ln(LV_i) = β0 + β1*age_i + ε_i
– What is the interpretation of β1?
– What is the interpretation of β0?

Using STATA, we get the following regression equation:

ln(LV)hat = 1.36 + 0.0034*age

– Is there a significant relationship between age and lesion volume?
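A sketch of how this model could be fit in Stata, assuming lnlv was created from the lv_entry variable shown in the earlier output:

. gen lnlv = ln(lv_entry)
. regress lnlv age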

Page 72:

Hypothesis test

1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.38 (102 dof)
5) p-value = 0.71
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume

Page 73:

. regress lnlv age

      Source |       SS       df       MS              Number of obs =     104
-------------+------------------------------           F(  1,   102) =    0.14
       Model |  .100352931     1  .100352931           Prob > F      =  0.7059
    Residual |  71.4750773   102  .700736052           R-squared     =  0.0014
-------------+------------------------------           Adj R-squared = -0.0084
       Total |  71.5754302   103   .69490709           Root MSE      =   .8371

------------------------------------------------------------------------------
        lnlv |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0034291   .0090613     0.38   0.706     -.014544    .0214022
       _cons |   1.355875   .3952489     3.43   0.001     .5719006    2.139849
------------------------------------------------------------------------------

[Annotations: estimated coefficient and p-value for age]

Page 74:

[Plot of residuals (-2 to 2) vs. fitted values (1.4-1.55)]

Page 75:

Histograms of residuals

[Two histograms of residuals: untransformed values and transformed values]

Page 76:

Conclusions for model checking

Checking model assumptions for linear regression is needed to ensure inferences are correct.
– If you have the wrong model, your inference will be wrong as well.

The majority of model checking is based on the residuals.

If the model fit is bad, you should use a different model.

Page 77:

Dichotomous predictors

Page 78:

Linear regression with dichotomous predictor

Linear regression can also be used for dichotomous predictors, like sex.

To do this, we use an indicator variable, which equals 1 for male and 0 for female. The resulting regression equation for BPF is

E(BPF | sex) = β0 + β1*sex

BPF_i = β0 + β1*sex_i + ε_i

Page 79:

Graph

[Scatter plot of BPF (0.75-0.95) vs. sex (0 = female, 1 = male)]

Page 80:

The regression equation can be rewritten as

BPF_i = β0 + ε_i          for females (sex = 0)
BPF_i = β0 + β1 + ε_i     for males (sex = 1)

The meaning of the coefficients in this case are
– β0 is the mean BPF when sex = 0, in the female group
– β0 + β1 is the mean BPF when sex = 1, in the male group

What is the interpretation of β1?
– For a one-unit increase in sex, there is a β1 increase in the mean of the BPF
– The difference in mean BPF between the males and females
Page 81: Linear regression Brian Healy, PhD BIO203. Previous classes Hypothesis testing Hypothesis testing –Parametric –Nonparametric Correlation Correlation

Interpretation of resultsInterpretation of results

The final regression equation is The final regression equation is

The meaning of the coefficients in this case The meaning of the coefficients in this case are are – 0.823 is the estimate of the mean BPF in the 0.823 is the estimate of the mean BPF in the

female groupfemale group– 0.037 is the estimate of the mean increase in 0.037 is the estimate of the mean increase in

BPF between the males and femalesBPF between the males and females– What is the estimated mean BPF in the males?What is the estimated mean BPF in the males?

How could we test if the difference between How could we test if the difference between the groups is statistically significant?the groups is statistically significant?

sexFBP *037.0823.0ˆ
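A quick worked answer to the bulleted question above: the estimated mean BPF in the males is 0.823 + 0.037 = 0.860, which matches the male group mean shown in the t-test output a few slides ahead.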

Page 82:

Hypothesis test

1) H0: There is no difference based on gender (β1 = 0)
2) Continuous outcome, dichotomous predictor
3) Linear regression
4) Test statistic: t = 1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females

Page 83:

. regress bpf sex

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =    3.33
       Model |  .007323547     1  .007323547           Prob > F      =  0.0792
    Residual |  .059426595    27  .002200985           R-squared     =  0.1097
-------------+------------------------------           Adj R-squared =  0.0767
       Total |  .066750142    28  .002383934           Root MSE      =  .04691

------------------------------------------------------------------------------
         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .0371364   .0203586     1.82   0.079     -.004636    .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865
------------------------------------------------------------------------------

[Annotations: the sex row gives the estimated difference between groups and its p-value]

Page 84:

[Plot of residuals (-0.1 to 0.1) vs. fitted values (0.82-0.86)]

Page 85:

T-test

As hopefully you remember, you could have tested this same null hypothesis using a two-sample t-test.

Linear regression makes an equal variance assumption, so let's use the same assumption for our t-test.

Page 86:

Hypothesis test

1) H0: There is no difference based on gender
2) Continuous outcome, dichotomous predictor
3) t-test
4) Test statistic: t = -1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females

Page 87:

. ttest bpf, by(sex)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    .8228636    .0096717    .0453645    .8027502    .8429771
       1 |       7         .86    .0196457    .0519775    .8119288    .9080712
---------+--------------------------------------------------------------------
combined |      29    .8318276    .0090667    .0488255    .8132553    .8503998
---------+--------------------------------------------------------------------
    diff |           -.0371364    .0203586               -.0789087     .004636
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -1.8241
Ho: diff = 0                                     degrees of freedom =       27

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0396         Pr(|T| > |t|) = 0.0792          Pr(T > t) = 0.9604

Page 88:

Amazing!!!

We get the same result using both approaches!!

Linear regression has the advantages of:
– Allowing multiple predictors (tomorrow)
– Accommodating continuous predictors (relationship to correlation)
– Accommodating categorical predictors (tomorrow)

Very flexible approach

Page 89:

Conclusion

Indicator variables can be used to represent dichotomous variables in a regression equation.

Interpretation of the coefficient for an indicator variable is the same as for a continuous variable.
– Provides a group comparison

Tomorrow we will see how to use regression to match ANOVA results.