
Linear Regression

Blythe Durbin-Johnson, Ph.D.

April 2017

We are video recording this seminar, so please hold questions until the end. Thanks!

When to Use Linear Regression

•Continuous outcome variable

•Continuous or categorical predictors

*Need at least one continuous predictor for the name “regression” to apply

When NOT to Use Linear Regression

•Binary outcomes

•Count outcomes

•Unordered categorical outcomes

•Ordered categorical outcomes with few (<7) levels

Generalized linear models and other special methods exist for these settings

Some Interchangeable Terms

•Outcome

•Response

•Dependent Variable

•Predictor

•Covariate

• Independent Variable

Simple Linear Regression

•Model outcome Y by one continuous predictor X:

Y = β₀ + β₁X + ε

• ε is a normally distributed (Gaussian) error term

Model Assumptions

•Normally distributed residuals ε

• Error variance is the same for all observations

• Y is linearly related to X

• Y observations are not correlated with each other

•X is treated as fixed, no distributional assumptions

•Covariates do not need to be normally distributed!

A Simple Linear Regression Example

Data from Lewis and Taylor (1967) via http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_reg_examples03.htm

Goal: Find the straight line that minimizes the sum of squared distances from the actual weights to the fitted line: the “least squares” fit
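Written out, the least squares fit chooses β̂₀ and β̂₁ to minimize Σ(Yᵢ - β₀ - β₁Xᵢ)². The familiar closed-form solution is

β̂₁ = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²    and    β̂₀ = Ȳ - β̂₁X̄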

A Simple Linear Regression Example—SAS Code

proc reg data = Children;
model Weight = Height;
run;

Children is a SAS dataset including variables Weight and Height

Simple Linear Regression Example—SAS Output

Parameter Estimates

Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1          -143.02692        32.27459    -4.43    0.0004
Height       1             3.89903         0.51609     7.55    <.0001

• Slope: how much weight increases for a 1 inch increase in height
• Intercept: estimated weight for a child of height 0 (not always interpretable…)
• Standard Error: S.E. of the slope and intercept
• t Value: parameter estimate divided by its S.E.
• Pr > |t|: P-values; weight increases significantly with height

Weight = -143.0 + 3.9*Height
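For example, plugging in a height of 60 inches gives a predicted weight of -143.0 + 3.9 × 60 ≈ 91 pounds, and each additional inch of height adds an estimated 3.9 pounds regardless of the starting height.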

Simple Linear Regression Example—SAS Output

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1      7193.24912   7193.24912    57.08  <.0001
Error            17      2142.48772    126.02869
Corrected Total  18      9335.73684

• Model SS: sum of squared differences between the model fit and the mean of Y
• Error SS: sum of squared differences between the model fit and the observed values of Y
• Corrected Total SS: sum of squared differences between the mean of Y and the observed values of Y
• Mean Square: sum of squares divided by df
• F Value: Mean Square(Model)/MSE
• Regression on X provides a significantly better fit to Y than the null (intercept-only) model
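As an arithmetic check: F = MS(Model)/MSE = 7193.24912/126.02869 ≈ 57.08, matching the output above.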

Simple Linear Regression Example—SAS Output

Root MSE 11.22625 R-Square 0.7705

Dependent Mean 100.02632 Adj R-Sq 0.7570

Coeff Var 11.22330

Percent of variance of Y explained by regression

Version of R-square adjusted for number of predictors in model

Mean of Y

Root MSE/mean of Y

Thoughts on R-Squared

• For our model, R-square is 0.7705: 77% of the variability in weight is explained by height
• Not a measure of goodness of fit of the model:
  • If the error variance is high, R-square will be low even with the “right” model
  • R-square can be high with the “wrong” model (e.g., Y isn’t linear in X)
  • See http://data.library.virginia.edu/is-r-squared-useless/
• Always gets higher when you add more predictors; adjusted R-square is intended to correct for this
• Take with a grain of salt
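Both values follow directly from the ANOVA table: R-square = SS(Model)/SS(Corrected Total) = 7193.24912/9335.73684 = 0.7705, and Adj R-Sq = 1 - (1 - R²)(n - 1)/(n - p - 1) = 1 - 0.2295 × 18/17 ≈ 0.757, with n = 19 observations and p = 1 predictor.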

Simple Linear Regression Example—SAS Output

[Fit Diagnostics for Weight: panels of residuals and studentized residuals vs. predicted value, studentized residuals vs. leverage, normal Q-Q plot of residuals, Weight vs. predicted value, Cook's D by observation, histogram of residuals, and a residual-fit spread plot. Summary box: Observations 19, Parameters 2, Error DF 17, MSE 126.03, R-Square 0.7705, Adj R-Square 0.757.]

• Residuals should form even band around 0

• Size of residuals shouldn’t change with predicted value

• Sign of residuals shouldn’t change with predicted value

[Plot: residuals vs. fitted values showing a systematic curved pattern]

Suggests Y and X have a nonlinear relationship

[Plot: residuals vs. fitted values]

Suggests data transformation

[Fit Diagnostics for Weight panel repeated, highlighting the normal Q-Q plot of the residuals.]

• Plot of model residuals versus quantiles of a normal distribution

• Deviations from diagonal line suggest departures from normality

[Normal Q-Q plot: sample quantiles of the residuals vs. theoretical normal quantiles, deviating from the diagonal]

Suggests data transformation may be needed

[Fit Diagnostics for Weight panel repeated, annotated below.]

• Studentized (scaled) residuals by predicted values (cutoff for outliers depends on n; use 3.5 for n = 19 with 1 predictor)
• Y by predicted values (should form an even band around the line)
• Histogram of residuals (look for skewness and other departures from normality)
• Cook’s distance > 4/n (= 0.21) may suggest influence (a cutoff of 1 is also used)
• Studentized residuals by leverage; leverage > 2(p + 1)/n (= 0.21) suggests an influential observation
• Residual-fit plot; see Cleveland, Visualizing Data (1993)
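If you prefer these diagnostics as printed tables rather than plots, the r and influence options on proc reg's model statement print them per observation. A minimal sketch on the same Children data:

proc reg data = Children;
/* r prints the residual analysis, including studentized residuals and Cook's D */
/* influence prints leverage (hat diagonal), DFFITS, and DFBETAS */
model Weight = Height / r influence;
run;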

Thoughts on Outliers

• An outlier is NOT a point that fails to support the study hypothesis

• Removing data can introduce biases

• Check for outlying values in X and Y before fitting model, not after

• Is there another model that fits better? Do you need a nonlinear model or data transformation?

• Was there an error in data collection?

• Robust regression is an alternative
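SAS implements robust regression in proc robustreg; a minimal sketch using M-estimation on the same data:

proc robustreg data = Children method = m;
/* M-estimation downweights observations with large residuals */
model Weight = Height;
run;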

Multiple Linear Regression

A Multiple Linear Regression Example—SAS Code

proc reg data = Children;
model Weight = Height Age;
run;

A Multiple Linear Regression Example—SAS Output

Parameter Estimates

Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1          -141.22376        33.38309    -4.23    0.0006
Height       1             3.59703         0.90546     3.97    0.0011
Age          1             1.27839         3.11010     0.41    0.6865

Adjusting for age, weight still increases significantly with height (P = 0.0011). Adjusting for height, weight is not significantly associated with age (P = 0.6865)
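In equation form, the fitted model is Weight = -141.2 + 3.60 × Height + 1.28 × Age, where each slope is interpreted holding the other predictor fixed.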

Categorical Variables

•Let’s try adding in gender, coded as “M” and “F”:

proc reg data = Children;

model Weight = Height Gender;

run;

ERROR: Variable Gender in list does not match type prescribed for this list.

Categorical Variables

• For proc reg, categorical variables have to be recoded as 0/1 variables:

data children;

set children;

if Gender = 'F' then numgen = 1;

else if Gender = 'M' then numgen = 0;

else call missing(numgen);

run;

Categorical Variables

• Let’s try fitting our model with height and gender again, with gender coded as 0/1:

proc reg data = Children;

model Weight = Height numgen;

run;

Categorical Variables

Parameter Estimates

Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1          -126.16869        34.63520    -3.64    0.0022
Height       1             3.67890         0.53917     6.82    <.0001
numgen       1            -6.62084         5.38870    -1.23    0.2370

Adjusting for gender, weight still increases significantly with height. Adjusting for height, mean weight does not differ significantly between genders.

Categorical Variables

• Can use proc glm to avoid recoding categorical variables
• Recommend this approach if a categorical variable has more than 2 levels

proc glm data = children;

class Gender;

model Weight = Height Gender;

run;

Proc glm output

Source  DF    Type I SS  Mean Square  F Value  Pr > F
Height   1  7193.249119  7193.249119    58.79  <.0001
Gender   1   184.714500   184.714500     1.51  0.2370

Source  DF  Type III SS  Mean Square  F Value  Pr > F
Height   1  5696.840666  5696.840666    46.56  <.0001
Gender   1   184.714500   184.714500     1.51  0.2370

• Type I SS are sequential
• Type III SS are nonsequential

Proc glm

• By default, proc glm only gives ANOVA tables

• Need to add estimate statement to get parameter estimates:

proc glm data = children;

class Gender;

model Weight = Height Gender;

estimate 'Height' height 1;     /* slope for Height */

estimate 'Gender' Gender 1 -1;  /* difference between Gender levels (F minus M) */

run;

Proc glm

Parameter  Estimate     Standard Error  t Value  Pr > |t|
Height      3.67890306      0.53916601     6.82    <.0001
Gender     -6.62084305      5.38869991    -1.23    0.2370

Same estimates as with proc reg
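An alternative sketch: proc glm's solution option on the model statement also prints parameter estimates, with one level of each class variable treated as the reference and set to zero:

proc glm data = children;
class Gender;
/* solution prints parameter estimates along with the ANOVA tables */
model Weight = Height Gender / solution;
run;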

Model Selection

• ‘Rule of thumb’ suggests a model should include no more than 1 covariate for every 10-15 observations
• What if you have more?
  • Pre-specify a smaller model based on literature, subject-matter knowledge, etc.
  • Select a smaller model in a data-driven fashion

Stepwise Methods

• Forward selection: Start with best single-variable model, add variables until no variable meets criteria to enter model

•Backward elimination: Start with full model, remove variables until no variable meets criteria to be removed

• Forward and backward selection: Variables can be added and removed at each step

Stepwise Methods

• Different criteria to enter and leave the model can be used:
  • P-value from ANOVA F-test
  • Mallows C(p): an estimate of mean square prediction error
  • Adjusted R-square
  • AIC (not implemented in proc reg; see the sketch below)
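AIC-based selection is available in proc glmselect; a minimal sketch on the example data introduced below:

proc glmselect data = nsqip_basechars;
/* stepwise selection, using AIC to decide which variable enters or leaves */
model logcreat = bmi logalbum steroid smoke age2 sex / selection = stepwise(select = aic);
run;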

Model Selection Example

Data Set Name        WORK.NSQIP_BASECHARS   Observations          1413
Member Type          DATA                   Variables             7
Engine               V9                     Indexes               0
Created              04/08/2017 14:24:17    Observation Length    56
Last Modified        04/08/2017 14:24:17    Deleted Observations  0
Protection                                  Compressed            NO
Data Set Type                               Sorted                NO
Label
Data Representation  WINDOWS_64
Encoding             wlatin1 Western (Windows)

Alphabetic List of Variables and Attributes

#  Variable  Type  Len
1  age2      Num   8
6  album     Num   8
7  bmi       Num   8
5  creat     Num   8
4  sex       Num   8
3  smoke     Num   8
2  steroid   Num   8

Model Selection Example

proc reg data = nsqip_basechars;

model logcreat = bmi logalbum steroid smoke age2 sex / selection = forward;

run;

Model Selection Example

Summary of Forward Selection

Step  Variable Entered  Number Vars In  Partial R-Square  Model R-Square     C(p)  F Value  Pr > F
   1  sex                            1            0.0842          0.0842  57.7965   129.68  <.0001
   2  age2                           2            0.0305          0.1147  11.0253    48.50  <.0001
   3  logalbum                       3            0.0059          0.1206   3.6103     9.42  0.0022
   4  smoke                          4            0.0013          0.1219   3.4948     2.12  0.1458

Model Selection Example

proc reg data = nsqip_basechars;

model logcreat = bmi logalbum steroid smoke age2 sex / selection = backward;

run;

Model Selection Example

Summary of Backward Elimination

Step  Variable Removed  Number Vars In  Partial R-Square  Model R-Square    C(p)  F Value  Pr > F
   1  steroid                        5            0.0001          0.1221  5.2208     0.22  0.6385
   2  bmi                            4            0.0002          0.1219  3.4948     0.27  0.6006
   3  smoke                          3            0.0013          0.1206  3.6103     2.12  0.1458

Variable   Parameter Estimate  Standard Error  Type II SS  F Value  Pr > F
Intercept            -0.41900         0.07742     2.17950    29.29  <.0001
logalbum              0.14193         0.04625     0.70071     9.42  0.0022
age2                  0.00345      0.00047777     3.88588    52.23  <.0001
sex                  -0.17584         0.01473    10.60565   142.54  <.0001

Caveats about stepwise model selection

• Test statistics of the final model don’t have the correct distributions
  • P-values will be incorrect
• Regression coefficients will be biased
• R-squared values will be too high
• Doesn’t handle multicollinearity well

See http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/ among many others

Multicollinearity

• Highly correlated covariates cause problems
  • Inflated standard errors
  • Sometimes can’t fit the model at all
• Two highly correlated variables might be significant on their own and very non-significant when included in a model together

Diagnosing Multicollinearity

proc reg data = nsqip_basechars;

model logcreat = logalbum smoke age2 sex / vif;

run;

Diagnosing Multicollinearity

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept   1            -0.39537         0.07907    -5.00    <.0001             0
logalbum    1             0.13626         0.04639     2.94    0.0034       1.01490
smoke       1            -0.02821         0.01939    -1.46    0.1458       1.06614
age2        1             0.00328      0.00049220     6.66    <.0001       1.07280
sex         1            -0.17541         0.01473   -11.91    <.0001       1.00267

Variance Inflation Factor (VIF): Rule of thumb suggests VIF > 10 means severe multicollinearity
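For reference, the VIF for predictor j is 1/(1 - R²ⱼ), where R²ⱼ is the R-square from regressing predictor j on all of the other predictors; VIF > 10 corresponds to R²ⱼ > 0.9. The values near 1 above indicate essentially no multicollinearity here.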

Other Issues

Data Transformations

• Skewed residuals

•Residuals with non-constant variance

• Large ‘outliers’ at one end of the data scale

•Data transformation may be the answer to these issues

Data Transformations

• Log transformation: Useful for lab data, other biological assay data

•Reciprocal transformation: Reduces skewness

• Logit transformation: Use with proportions bounded away from 0 and 1

•Box-Cox family of power transformations
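Box-Cox estimation is available in proc transreg; a minimal sketch (using the creat outcome from the example data) that searches a grid of power parameters:

proc transreg data = nsqip_basechars;
/* search lambda from -2 to 2 for the power transformation of creat */
model boxcox(creat / lambda = -2 to 2 by 0.25) = identity(logalbum);
run;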

Data Transformation Example

[Plot: residuals for creat vs. logalbum]

proc reg data = nsqip_basechars;

model creat = logalbum;

run;

Data Transformation Example

[Plot: residuals for logcreat vs. logalbum]

proc reg data = nsqip_basechars;

model logcreat = logalbum;

run;

Data Transformation Example

[Plot: residuals for recipcreat vs. logalbum]

proc reg data = nsqip_basechars;
model recipcreat = logalbum;
run;

Regression vs. Correlation

•Regression and correlation analysis are closely related

•Regression designates one variable as the outcome, correlation does not

•Regression slope = Pearson correlation × SD(Y)/SD(X)

•P-values from simple linear regression and from correlation test will be identical
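As a quick check, proc corr on the same two variables reports the Pearson correlation and its test; the p-value matches the slope’s p-value from the simple regression:

proc corr data = Children;
var Weight Height;
run;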

Thank you!

Help is Available

CTSC Biostatistics Office Hours

– Every Tuesday from 12 – 1:30 in Sacramento

– Sign-up through the CTSC Biostatistics Website

EHS Biostatistics Office Hours

– Every Monday from 2-4 in Davis

Request Biostatistics Consultations

– CTSC - www.ucdmc.ucdavis.edu/ctsc/

– MIND IDDRC - www.ucdmc.ucdavis.edu/mindinstitute/centers/iddrc/cores/bbrd.html

– Cancer Center and EHS Center
