
Statr session14, Jan 11


Page 1: Statr session14, Jan 11

Correlation and Regression Analysis: Learning Objectives

• Explain the purpose of regression analysis and the meaning of independent versus dependent variables.

• Compute the equation of a simple regression line from a sample of data, and interpret the slope and intercept of the equation.

• Estimate values of Y to forecast outcomes using the regression model.

• Understand residual analysis in testing the assumptions and in examining the fit underlying the regression line.

• Compute a standard error of the estimate and interpret its meaning.

• Compute a coefficient of determination and interpret it.

Page 2: Statr session14, Jan 11

Correlation

• Correlation is a measure of the degree of relatedness of variables.

• Coefficient of Correlation (r) - applicable only if both variables being analyzed have at least an interval level of data.

Page 3: Statr session14, Jan 11

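Three Degrees of Correlation

[Figure: three scatter plots showing negative correlation (r < 0), positive correlation (r > 0), and no correlation (r = 0)]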

Page 4: Statr session14, Jan 11

Degree of Correlation

• The term (r) is a measure of the linear correlation of two variables

– The number ranges from -1 to 0 to +1

– Positive correlation: as one variable increases, the other variable increases

– Negative correlation: as one variable increases, the other one decreases

– No correlation: the value of r is close to 0

– The closer r is to +1 or -1, the higher the correlation between the two variables

Page 5: Statr session14, Jan 11

Pearson Product-Moment Correlation Coefficient

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$$
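As a minimal sketch, the Pearson coefficient can be computed from its standard definition: the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the small data set below are illustrative assumptions, not part of the slides.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-deviations and squared deviations from the means.
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical illustrative data (not from the slides).
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]
r_demo = pearson_r(x_demo, y_demo)
print(f"r = {r_demo:.4f}")
```

A value near +1 or -1 indicates a strong linear relationship; the function assumes neither variable is constant, since a zero sum of squares would make the denominator zero.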

Page 6: Statr session14, Jan 11

Regression Analysis

• Regression analysis is the process of constructing a mathematical model or function that can be used to predict or determine one variable by another variable or variables.

Page 7: Statr session14, Jan 11

Simple Regression Analysis

• Bivariate (two variables) linear regression -- the most elementary regression model

– dependent variable, the variable to be predicted, usually called Y

– independent variable, the predictor or explanatory variable, usually called X

– Usually the first step in this analysis is to construct a scatter plot of the data

• Nonlinear relationships and regression models with more than one independent variable can be explored by using multiple regression models

Page 8: Statr session14, Jan 11

Regression Models

• Deterministic Regression Model -- produces an exact output:

$$y = \beta_0 + \beta_1 x$$

• Probabilistic Regression Model

$$y = \beta_0 + \beta_1 x + \epsilon$$

• $\beta_0$ and $\beta_1$ are population parameters

• $\beta_0$ and $\beta_1$ are estimated by sample statistics $b_0$ and $b_1$
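The difference between the two models can be sketched in a few lines: the deterministic model returns the same output for a given x every time, while the probabilistic model adds a random error term. The parameter values, the normal error distribution, and the function names here are assumptions chosen for illustration only.

```python
import random

def deterministic_y(x, beta0, beta1):
    """Exact output of the line: no error term."""
    return beta0 + beta1 * x

def probabilistic_y(x, beta0, beta1, sigma, rng):
    """Same line plus a random error term drawn from N(0, sigma)."""
    return beta0 + beta1 * x + rng.gauss(0.0, sigma)

# Hypothetical population parameters (not from the slides).
BETA0, BETA1, SIGMA = 1.5, 0.04, 0.2

rng = random.Random(0)  # seeded so repeated runs give the same draws
exact = deterministic_y(80, BETA0, BETA1)
noisy = [probabilistic_y(80, BETA0, BETA1, SIGMA, rng) for _ in range(5)]

print("deterministic:", exact)
print("probabilistic:", [round(v, 3) for v in noisy])
```

In practice the population parameters are unknown, which is why the sample statistics are used to estimate them.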

Page 9: Statr session14, Jan 11

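Equation of the Simple Regression Line

$$\hat{y} = b_0 + b_1 x$$

where $b_0$ is the sample intercept, $b_1$ is the sample slope, and $\hat{y}$ is the predicted value of y.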

Page 10: Statr session14, Jan 11

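A typical regression line

[Figure: a regression line plotted with X on the horizontal axis and Y on the vertical axis, marking the y-intercept $b_0$ and the angle θ associated with the slope]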

Page 11: Statr session14, Jan 11

Least Squares Analysis

• Least squares analysis is a process whereby a regression model is developed by producing the minimum sum of the squared error values

• The vertical distance from each point to the line is the error of the prediction.

• The least squares regression line is the regression line that results in the smallest sum of errors squared.

Page 12: Statr session14, Jan 11

Least Squares Analysis

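$$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}$$

$$b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n}$$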

Page 13: Statr session14, Jan 11

Least Squares Analysis

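$$SS_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$

$$SS_{XX} = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$

$$b_1 = \frac{SS_{XY}}{SS_{XX}}$$

$$b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n}$$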
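The computational formulas for the slope and intercept translate directly into code. The sketch below is one possible implementation; the function name `least_squares` and the toy data (points lying exactly on y = 1 + 2x) are assumptions made for illustration.

```python
def least_squares(x, y):
    """Return (b0, b1) for the simple regression line using sums of squares."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    ss_xy = sum_xy - sum_x * sum_y / n   # SS_XY
    ss_xx = sum_x2 - sum_x ** 2 / n      # SS_XX

    b1 = ss_xy / ss_xx                   # slope
    b0 = sum_y / n - b1 * sum_x / n      # intercept
    return b0, b1

# Hypothetical data lying exactly on y = 1 + 2x, so the fit should recover it.
x_line = [0, 1, 2, 3]
y_line = [1, 3, 5, 7]
b0_line, b1_line = least_squares(x_line, y_line)
print(b0_line, b1_line)
```

Because the toy points are perfectly linear, every residual is zero and the fitted coefficients match the generating line; real data will leave nonzero errors.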

Page 14: Statr session14, Jan 11

Solving for $b_0$ and $b_1$ of the Regression Line: Airline Cost Data

The airline cost data include the costs and associated number of passengers for twelve 500-mile commercial airline flights using Boeing 737s during the same season of the year.

Number of Passengers    Cost ($1,000)
        61                  4.280
        63                  4.080
        67                  4.420
        69                  4.170
        70                  4.480
        74                  4.300
        76                  4.820
        81                  4.700
        86                  5.110
        91                  5.130
        95                  5.640
        97                  5.560

Page 15: Statr session14, Jan 11

Solving for $b_0$ and $b_1$ of the Regression Line: Airline Cost Example (Part 1)

Number of Passengers    Cost ($1,000)
         x                   y             x²          xy
        61                  4.28          3,721      261.08
        63                  4.08          3,969      257.04
        67                  4.42          4,489      296.14
        69                  4.17          4,761      287.73
        70                  4.48          4,900      313.60
        74                  4.30          5,476      318.20
        76                  4.82          5,776      366.32
        81                  4.70          6,561      380.70
        86                  5.11          7,396      439.46
        91                  5.13          8,281      466.83
        95                  5.64          9,025      535.80
        97                  5.56          9,409      539.32

$\sum x = 930$, $\sum y = 56.69$, $\sum x^2 = 73{,}764$, $\sum xy = 4{,}462.22$

Page 16: Statr session14, Jan 11

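Solving for $b_0$ and $b_1$ of the Regression Line: Airline Cost Example (Part 2)

$$SS_{XY} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 4{,}462.22 - \frac{(930)(56.69)}{12} = 68.745$$

$$SS_{XX} = \sum X^2 - \frac{(\sum X)^2}{n} = 73{,}764 - \frac{(930)^2}{12} = 1689$$

$$b_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{68.745}{1689} = .0407$$

$$b_0 = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n} = \frac{56.69}{12} - (.0407)\frac{930}{12} = 1.57$$

$$\hat{Y} = 1.57 + .0407X$$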
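The airline calculation can be checked with a short script that rebuilds the column sums, the sums of squares, and the two coefficients from the raw data. The variable names and the `predict_cost` helper are illustrative assumptions; the 73-passenger forecast is a hypothetical input, not a value from the slides.

```python
# Airline cost data from the slides: passengers (x) and cost in $1,000 (y).
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

n = len(passengers)
sum_x = sum(passengers)
sum_y = sum(cost)
sum_x2 = sum(x * x for x in passengers)
sum_xy = sum(x * y for x, y in zip(passengers, cost))

ss_xy = sum_xy - sum_x * sum_y / n   # slides report 68.745
ss_xx = sum_x2 - sum_x ** 2 / n      # slides report 1689

b1 = ss_xy / ss_xx                   # slides report .0407
b0 = sum_y / n - b1 * sum_x / n      # slides report 1.57

def predict_cost(num_passengers):
    """Predicted cost ($1,000) from the fitted line (hypothetical helper)."""
    return b0 + b1 * num_passengers

print(f"SS_XY = {ss_xy:.3f}, SS_XX = {ss_xx:.0f}")
print(f"b1 = {b1:.4f}, b0 = {b0:.2f}")
print(f"predicted cost for 73 passengers: {predict_cost(73):.3f}")
```

The script keeps full precision for the coefficients, so forecasts may differ in the last decimal place from hand calculations that use the rounded values 1.57 and .0407.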

Page 17: Statr session14, Jan 11

Residual Analysis

• A residual is the difference between the actual value and the predicted value, i.e. $Y - \hat{Y}$.

• Reflects the error of the regression line at any given point.

Page 18: Statr session14, Jan 11

Residual Analysis: Airline Cost Example

Number of Passengers    Cost ($1,000)    Predicted Value    Residual
         X                   Y                  Ŷ             Y - Ŷ
        61                  4.28              4.053            .227
        63                  4.08              4.134           -.054
        67                  4.42              4.297            .123
        69                  4.17              4.378           -.208
        70                  4.48              4.419            .061
        74                  4.30              4.582           -.282
        76                  4.82              4.663            .157
        81                  4.70              4.867           -.167
        86                  5.11              5.070            .040
        91                  5.13              5.274           -.144
        95                  5.64              5.436            .204
        97                  5.56              5.518            .042

$\sum (Y - \hat{Y}) = -.001$
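The residual table can be reproduced with a few lines of code. This sketch uses the rounded coefficients 1.57 and .0407 reported on the slides so that the predicted values match the table; the variable names are illustrative.

```python
# Airline cost data from the slides: passengers (x) and cost in $1,000 (y).
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

# Rounded coefficients as reported on the slides.
B0, B1 = 1.57, 0.0407

predicted = [B0 + B1 * x for x in passengers]
residuals = [y - y_hat for y, y_hat in zip(cost, predicted)]

for x, y, y_hat, res in zip(passengers, cost, predicted, residuals):
    print(f"{x:3d}  {y:5.2f}  {y_hat:6.3f}  {res:7.3f}")
print(f"sum of residuals = {sum(residuals):.3f}")
```

The residuals sum to approximately zero rather than exactly zero only because the coefficients were rounded before the predictions were made.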

Page 19: Statr session14, Jan 11

Residual Analysis: Airline Cost Example

Outliers: Data points that lie apart from the rest of the points. They can produce large residuals and affect the regression line.

Page 20: Statr session14, Jan 11

Using Residuals to Test the Assumptions of the Regression Model

• The assumptions of the regression model

– The model is linear

– The error terms have constant variances

– The error terms are independent

– The error terms are normally distributed

Page 21: Statr session14, Jan 11

Using Residuals to Test the Assumptions of the Regression Model

• The assumption that the regression model is linear does not hold for the residual plot shown above

• In figure (a) below, the error variance is greater for smaller values of x and smaller for larger values of x; figure (b) below shows the reverse. Both are cases of heteroscedasticity.
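A residual plot is the usual way to spot non-constant error variance. As a crude numeric companion, and not a formal test, one can compare the variance of the residuals in the lower half of the x values with the variance in the upper half. The function name `variance_ratio_by_x` and the fan-shaped residuals below are hypothetical, chosen only to illustrate the idea.

```python
import statistics

def variance_ratio_by_x(x, residuals):
    """Residual variance in the upper half of x divided by that in the lower half.

    Needs at least four points; with an odd count the middle point is ignored.
    """
    pairs = sorted(zip(x, residuals))       # order residuals by x
    half = len(pairs) // 2
    lower = [res for _, res in pairs[:half]]
    upper = [res for _, res in pairs[-half:]]
    return statistics.variance(upper) / statistics.variance(lower)

# Hypothetical residuals whose spread grows with x (not from the slides).
x_fan = [1, 2, 3, 4, 5, 6, 7, 8]
res_fan = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
ratio_fan = variance_ratio_by_x(x_fan, res_fan)
print(f"upper/lower variance ratio = {ratio_fan:.1f}")
```

A ratio far from 1 in either direction suggests the spread of the errors changes with x; a ratio near 1 is consistent with constant variance but does not prove it.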

Page 22: Statr session14, Jan 11

Standard Error of the Estimate

• Residuals represent errors of estimation for individual points.

• A more useful measurement of error is the standard error of the estimate.

• The standard error of the estimate, denoted by $s_e$, is a standard deviation of the error of the regression model.

Page 23: Statr session14, Jan 11

Standard Error of the Estimate

Sum of Squares Error

$$SSE = \sum (Y - \hat{Y})^2 = \sum Y^2 - b_0 \sum Y - b_1 \sum XY$$

Standard Error of the Estimate

$$s_e = \sqrt{\frac{SSE}{n - 2}}$$

Page 24: Statr session14, Jan 11

Determining SSE for the Airline Cost Data Example

Number of Passengers    Cost ($1,000)    Residual
         X                   Y             Y - Ŷ      (Y - Ŷ)²
        61                  4.28            .227       .05153
        63                  4.08           -.054       .00292
        67                  4.42            .123       .01513
        69                  4.17           -.208       .04326
        70                  4.48            .061       .00372
        74                  4.30           -.282       .07952
        76                  4.82            .157       .02465
        81                  4.70           -.167       .02789
        86                  5.11            .040       .00160
        91                  5.13           -.144       .02074
        95                  5.64            .204       .04162
        97                  5.56            .042       .00176

$\sum (Y - \hat{Y}) = -.001$, $\sum (Y - \hat{Y})^2 = .31434$

Sum of squares of error = SSE = .31434
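The sum of squares of error and the standard error of the estimate can be computed from the raw data as sketched below. The script fits the line at full precision, so its SSE is expected to differ slightly in the later decimals from the .31434 on the slide, which is built from rounded predicted values. The standard error uses the usual divisor n - 2 for simple regression; variable names are illustrative.

```python
import math

# Airline cost data from the slides: passengers (x) and cost in $1,000 (y).
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

n = len(passengers)
sum_x, sum_y = sum(passengers), sum(cost)
sum_xy = sum(x * y for x, y in zip(passengers, cost))
sum_x2 = sum(x * x for x in passengers)
sum_y2 = sum(y * y for y in cost)

# Least squares coefficients at full precision.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

# SSE two ways: directly from the residuals, and from the shortcut formula.
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(passengers, cost))
sse_shortcut = sum_y2 - b0 * sum_y - b1 * sum_xy

# Standard error of the estimate.
se = math.sqrt(sse_direct / (n - 2))

print(f"SSE (direct)   = {sse_direct:.5f}")
print(f"SSE (shortcut) = {sse_shortcut:.5f}")
print(f"s_e            = {se:.4f}")
```

The shortcut formula is sensitive to rounding in the coefficients, which is one reason to carry full precision when using it.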

Page 25: Statr session14, Jan 11

Coefficient of Determination ($r^2$)

• The coefficient of determination is the proportion of variability of the dependent variable (y) accounted for or explained by the independent variable (x)

• The coefficient of determination ranges from 0 to 1.

• An $r^2$ of zero means that the predictor accounts for none of the variability of the dependent variable and that there is no regression prediction of y by x.

• An $r^2$ of 1 means perfect prediction of y by x and that 100% of the variability of y is accounted for by x.

Page 26: Statr session14, Jan 11

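Coefficient of Determination ($r^2$)

$$SS_{YY} = \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$

$$SS_{YY} = \text{explained variation} + \text{unexplained variation}$$

$$SS_{YY} = SSR + SSE$$

$$1 = \frac{SSR}{SS_{YY}} + \frac{SSE}{SS_{YY}}$$

$$r^2 = \frac{SSR}{SS_{YY}} = 1 - \frac{SSE}{SS_{YY}} = 1 - \frac{SSE}{\sum Y^2 - \frac{(\sum Y)^2}{n}}$$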

Page 27: Statr session14, Jan 11

Coefficient of Determination ($r^2$) for the Airline Cost Example

$$SSE = 0.31434$$

$$SS_{YY} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 270.9251 - \frac{(56.69)^2}{12} = 3.11209$$

$$r^2 = 1 - \frac{SSE}{SS_{YY}} = 1 - \frac{.31434}{3.11209} = .899$$

89.9% of the variability of the cost of flying a Boeing 737 is accounted for by the number of passengers.
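The decomposition of total variation into explained and unexplained parts can be verified numerically on the airline data. The sketch below fits the line at full precision, computes the total, regression, and error sums of squares, and derives the coefficient of determination both ways; variable names are illustrative.

```python
# Airline cost data from the slides: passengers (x) and cost in $1,000 (y).
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

n = len(passengers)
mean_x = sum(passengers) / n
mean_y = sum(cost) / n

# Least squares fit at full precision.
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(passengers, cost))
ss_xx = sum((x - mean_x) ** 2 for x in passengers)
b1 = ss_xy / ss_xx
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * x for x in passengers]

ss_yy = sum((y - mean_y) ** 2 for y in cost)                   # total variation
ssr = sum((y_hat - mean_y) ** 2 for y_hat in predicted)        # explained
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(cost, predicted))  # unexplained

r2 = 1 - sse / ss_yy

print(f"SS_YY = {ss_yy:.5f}, SSR = {ssr:.5f}, SSE = {sse:.5f}")
print(f"r^2 = {r2:.3f}  (also SSR / SS_YY = {ssr / ss_yy:.3f})")
```

The identity SS_YY = SSR + SSE holds for the least squares line specifically; it would not hold in general for an arbitrary line drawn through the data.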

Page 28: Statr session14, Jan 11

Relation between $r$ and $r^2$

• The coefficient of determination is the square of the coefficient of correlation

• $r^2$ is never negative

• $r$ may be positive or negative

• The researcher must examine the sign of the slope of the regression line to determine whether a positive or negative relationship exists between the variables.
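Recovering the correlation coefficient from the coefficient of determination therefore takes two inputs: the square root supplies the magnitude and the slope supplies the sign. A minimal sketch, with the hypothetical helper name `correlation_from_r2`:

```python
import math

def correlation_from_r2(r_squared, slope):
    """Return r with the magnitude sqrt(r^2) and the sign of the regression slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# Airline example: r^2 = .899 with a positive slope of .0407.
r_airline = correlation_from_r2(0.899, 0.0407)
print(f"r = {r_airline:.3f}")

# A hypothetical negative slope flips the sign of r.
r_negative = correlation_from_r2(0.25, -3.0)
print(f"r = {r_negative:.3f}")
```

For the airline data the slope is positive, so the correlation between passengers and cost is positive as well.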

Page 29: Statr session14, Jan 11

Exercise in R: Linear Regression

Open URL: www.openintro.org

Go to Labs in R and select 7 - Linear Regression