Correlation – Recap Correlation provides an estimate of how well change in ‘ x ’ causes change in ‘ y ’. The relationship has a magnitude (the r value)

Regression

Correlation – Recap

Correlation provides an estimate of how well change in ‘x’ causes change in ‘y’.

The relationship has a magnitude (the r value) and a direction (the sign on the r value).

The r value measures how close the untransformed data points are to a straight line.

Therefore, the r value is a very important statistic for regression analysis because it tells you how accurate your predicted values will be.

That is why we tested the correlation value so thoroughly because being able to predict the future is pretty cool.

Regression

Regression analysis is another method by which the relationship between dependent and independent variables can be estimated.

Unlike correlation, which just tells you the strength and direction of a relationship, regression tells you much more about each point and its place in the relationship.

Regression also tells you how you can use an ‘x’ value to predict a ‘y’ value using the mathematical expression of a trend line.

This is how you can predict the future.

Trend Lines

A trend line is a line drawn through a frequency distribution of paired values called a scatterplot.

The scatterplot shows the overall pattern of the points around the trend line.

TWO IMPORTANT POINTS:

The data values are the finest degree of resolution in your data.

The trend line is the coarsest degree.

Each shows you a different type of information.

Trend Lines

Trend lines can be straight and have their ‘straightness’ defined by their different angles: i.e. are they steep or shallow?

Trend lines can also be curvilinear and have their ‘curviness’ defined by polynomial, logistic, or exponential (log) functions.

Both types of lines also have the ‘fit’ of their data points to the line defined by their correlation coefficient.

Types of Trend Lines

Also called exponential functions

and defined by the exponent on

y=xx

Are labeled by their ‘degree’: Quadratic = 2, Cubic = 3, Quartic = 4, Quintic = 5

Takes many variants of the form:

Are the simplest linear expressions of

the form:

Linear Trend Lines

The other aspect of trend lines, apart from their shape, is whether they have one or more than one independent variable.

That is, are they bivariate or multivariate.

We have seen only bivariate trend lines so far: that is, lines having a y and one x.

For our discussion on regression we will stick with these bivariate linear trend lines.

How are linear trend lines created?

First, the line always passes through the arithmetic means of the x and the y variables.

Second, the trend line is always as close as it can possibly be to every data point.

Third, the difference between each data point and the line is as small as it can be when all points are considered.

This is done by minimising the sum of the squared differences.

This is why the Pearson formulation we shall use is called the “least squares” method.
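The three properties above can be checked numerically. The following is a minimal sketch, not the lecture's own procedure: the function names and the five data pairs are hypothetical, chosen only so the arithmetic is easy to follow. It fits a line with the usual closed-form least squares expressions, then confirms that the line passes through the two means and that nudging the slope or intercept only makes the sum of squared differences larger.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line through paired values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # this choice puts (x_bar, y_bar) on the line
    return slope, intercept

def sum_squared_differences(xs, ys, slope, intercept):
    """Sum of squared vertical differences between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical illustration data (not the lecture's dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

m, b = least_squares_line(xs, ys)
best = sum_squared_differences(xs, ys, m, b)

# Any nudge to the slope or the intercept gives a larger total.
nudged_slope = sum_squared_differences(xs, ys, m + 0.1, b)
nudged_intercept = sum_squared_differences(xs, ys, m, b + 0.1)

print(f"slope={m:.2f} intercept={b:.2f} sum of squares={best:.2f}")
print(f"line at mean x: {m * mean(xs) + b:.2f}, mean y: {mean(ys):.2f}")
```

For these five pairs the fitted line is y = 0.6x + 2.2, and its value at the mean of x equals the mean of y.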

An example: 42 pairs of grades.

Each student has a high school grade (the X or independent variable) and a 1st year university grade (the Y or dependent variable).

They are labelled such because the HS grade could influence the Uni grade but not the other way around – i.e. Y is dependent on X and not X on Y.

Student #   Best 6 HS Grade (X)   1st Yr Uni Grade (Y)   Student #   Best 6 HS Grade (X)   1st Yr Uni Grade (Y)
 1          91.83                 87.00                  22          73.67                 69.00
 2          90.83                 88.00                  23          73.17                 67.50
 3          84.67                 82.40                  24          72.83                 76.90
 4          83.83                 76.30                  25          72.83                 63.00
 5          83.67                 78.30                  26          72.67                 65.00
 6          83.33                 80.20                  27          72.33                 67.00
 7          82.00                 75.40                  28          72.00                 71.00
 8          81.17                 67.40                  29          71.50                 65.00
 9          81.00                 84.30                  30          71.17                 54.00
10          78.33                 76.80                  31          71.00                 63.50
11          77.83                 78.60                  32          70.67                 67.00
12          77.83                 67.90                  33          70.67                 67.90
13          77.67                 55.80                  34          70.50                 69.00
14          76.67                 76.00                  35          70.33                 65.40
15          76.50                 76.60                  36          70.17                 65.10
16          75.67                 42.70                  37          70.17                 64.70
17          74.83                 80.30                  38          70.00                 70.00
18          74.67                 82.90                  39          70.00                 70.00
19          74.67                 67.00                  40          69.67                 40.60
20          74.50                 71.00                  41          69.67                 61.00
21          74.33                 65.80                  42          69.33                 67.00

Mean of all X = 75.48   Mean of all Y = 69.76   r = 0.617   r2 = 0.38   SEE = 8.03

[Figure: scatterplot “High School Best Six OACs & 1st Year University Grades”. x-axis: HS best 6 %; y-axis: 1st year University CGPA. Trend line: y = 1.0862x - 12.22, R = 0.617, R2 = 0.381]

Means of the Regression Line

The regression line passes through the mean of x (75.48) and the mean of y (69.76).

The sum of the squared distances from each point to the line is as small as it can be when all points are considered. That is, the line cannot get any closer overall to the points.

[Figure: scatterplot of HS best 6 % (x) against 1st year University CGPA (y), with the mean of x and the mean of y marked on the trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381]

Prediction - Regression

A high school grade of 80% will predict a first year university grade of almost 75%.

But can we get a more accurate prediction than “almost”?

Yes, using this linear equation.

Regression for Real

Regression is a mathematical method which uses a linear equation by which one value (y) can be predicted by another value (x).

Furthermore, the predicted value can be given ‘margins of error’ - that is, x will predict y within ± some margin, in whatever units y is in.

The accuracy of the predicted value of y and the size of the margins of error will depend on how well the data points match a straight line.

AND THAT DEPENDS ON HOW HIGH YOUR r VALUE IS.

It also depends on how many pairs of data points you have – that is, your ‘n’.

Linear Regression – Equating the Line

$$\hat{y} = m x_1 + b$$

($\hat{y}$ is called ‘why’ hat.)

Where:

$\hat{y}$ is the predicted value of y for a given x
b is the intercept value
m is the slope of the line
$x_1$ is the given value in the x (independent) dataset from which you want to predict y.

This is the formula that Excel uses. You sometimes also see: y = a + bx


[Figure: scatterplot of HS best 6 % (x) against 1st year University CGPA (y) with trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381]

Prediction - Regression

This is the linear regression equation used to predict the value $\hat{y}$ in Excel, where $\hat{y} = mx + b$, with ‘m’ as the slope and ‘b’ as the intercept.


[Figure: scatterplot of HS best 6 % (x) against 1st year University CGPA (y) with trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381]

Prediction Reprise - “Almost” Regression Using The Line

A high school grade of 80% will predict a first year university grade of almost 75%.

But can we get a more accurate prediction than “almost”?

Yes, using this linear equation.

Predicting from an 80% Incoming HS Grade Using the Linear Regression Equation

$\hat{y} = m x_1 + b$

$\hat{y}$ = 1.0862x - 12.22

$\hat{y}$ = 1.0862 * 80% - 12.22

$\hat{y}$ = 74.7%

This is our “almost” 75%.
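The same arithmetic can be written as a small function. This is a sketch only: the slope and intercept are the values printed on the trend line in the slides, while the function name `predict_uni_grade` is a hypothetical label introduced here.

```python
SLOPE = 1.0862      # m, as printed on the slides' trend line
INTERCEPT = -12.22  # b, as printed on the slides' trend line

def predict_uni_grade(hs_grade):
    """Predicted 1st year university grade (%) for a given HS grade (%)."""
    return SLOPE * hs_grade + INTERCEPT

predicted = predict_uni_grade(80.0)
print(round(predicted, 1))  # 74.7
```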


Standard Error of the Estimate (SEE) of $\hat{y}$

The predicted $\hat{y}$ is not exact even if we have an exact x to start with because…

There is likely more than one y value for every x value, and…

The line is based on the correlation coefficient which was not a perfect 1.0 but an imperfect 0.617, and…

Our r2 of 0.38 only explains 38% of variability, and…

Our ‘n’ is only a sample of 42 pairs and not everyone in the population from which the sample of 42 pairs came.

HS Grade (X)   University grade (Y)   HS Grade (X)   University grade (Y)
91.83          87.00                  73.67          69.00
90.83          88.00                  73.17          67.50
84.67          82.40                  72.83          76.90
83.83          76.30                  72.83          63.00
83.67          78.30                  72.67          65.00
83.33          80.20                  72.33          67.00
82.00          75.40                  72.00          71.00
81.17          67.40                  71.50          65.00
81.00          84.30                  71.17          54.00
78.33          76.80                  71.00          63.50
77.83          78.60                  70.67          67.00
77.83          67.90                  70.67          67.90
77.67          55.80                  70.50          69.00
76.67          76.00                  70.33          65.40
76.50          76.60                  70.17          65.10
75.67          42.70                  70.17          64.70
74.83          80.30                  70.00          70.00
74.67          82.90                  70.00          70.00
74.67          67.00                  69.67          40.60
74.50          71.00                  69.67          61.00
74.33          65.80                  69.33          67.00

Mean of X = 75.48   Mean of Y = 69.76

r = 0.617   r2 = 0.38   SEE = 8.03

These two students (HS grade 72.83; first year grades 76.90 and 63.00) have the same HS grade but widely differing first year grades.

These two students (HS grade 70.17; first year grades 65.10 and 64.70) have the same HS grade and very close first year grades.

If you plug 70.17 into the equation you get a predicted first year value of 63.99, which is very close to the actual grades.

If you plug 72.83 into the equation you get a predicted first year value of 66.89, which is not very close to the actual grades.

This variability between the predicted grades and the actual grades is called the error of estimate, and it can be calculated as a statistic called the standard error of estimate (SEE).
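To make the error of estimate concrete for these four students, the sketch below subtracts each predicted grade from the actual grade. It uses the rounded slope and intercept printed on the slides, so the predictions can differ from the quoted 63.99 and 66.89 in the second decimal place; the variable names are hypothetical.

```python
SLOPE, INTERCEPT = 1.0862, -12.22  # rounded values printed on the slides

# (HS grade, actual 1st year grade) for the four students discussed above.
students = [(70.17, 65.10), (70.17, 64.70), (72.83, 76.90), (72.83, 63.00)]

errors = []
for hs_grade, actual in students:
    predicted = SLOPE * hs_grade + INTERCEPT
    error = actual - predicted  # positive: the student did better than predicted
    errors.append(error)
    print(f"HS {hs_grade:.2f}: predicted {predicted:.2f}, actual {actual:.2f}, error {error:+.2f}")
```

The two students at 70.17 land within about one percentage point of the line, while the two at 72.83 miss it by roughly +10 and -4 points.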


[Figure: scatterplot of HS best 6 % (x) against 1st year University CGPA (y) with trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381, and lines drawn from the data points to the trend line]

These lines represent the idea of variability of the data points from the line. The SEE is the square root of the sum of all the squared differences from the data points to the line, divided by n - 2.

Standard Error of Estimate of $\hat{y}$

Luckily we don’t have to calculate this by hand. Excel calculates it for you.

The result is the ± value on $\hat{y}$ in whatever units y was in (e.g. in this case, student CGPA in %).

Again, the SEE is similar to the standard deviation.

$$SEE = \sqrt{\frac{1}{n-2}\left[\sum (y-\bar{y})^2 - \frac{\left[\sum (x-\bar{x})(y-\bar{y})\right]^2}{\sum (x-\bar{x})^2}\right]}$$

Note the ‘2’ in n - 2. That’s because we have a pair of values and not just one value.

And remember that n-1 is the sample n.

Note the squared differences of x’s and y’s from the mean of all x’s and y’s.
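For readers who want to see the formula at work outside Excel, here is a minimal sketch under stated assumptions: the function names and the five data pairs are hypothetical, and the second function is simply the "square root of the summed squared differences from the line, divided by n - 2" view of the same statistic. The two routes should agree.

```python
from math import sqrt
from statistics import mean

def see_from_formula(xs, ys):
    """SEE computed term by term from the sums of squares in the formula."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

def see_from_residuals(xs, ys):
    """SEE computed from the squared differences between each y and the fitted line."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    squared = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return sqrt(squared / (n - 2))

# Hypothetical illustration data (not the lecture's dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

see_a = see_from_formula(xs, ys)
see_b = see_from_residuals(xs, ys)
print(f"{see_a:.4f} {see_b:.4f}")
```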

Standard Error of Estimate of $\hat{y}$

Now look again, …

$$SEE = \sqrt{\frac{1}{n-2}\left[\sum (y-\bar{y})^2 - \frac{\left[\sum (x-\bar{x})(y-\bar{y})\right]^2}{\sum (x-\bar{x})^2}\right]}$$

… and compare it to this:

$$s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}}$$

Which you should all recognize as the standard deviation formula.

Once again the usefulness of the arithmetic mean and the standard deviation is evident.

The Importance of s and $\bar{x}$

Stop and absorb this.

The importance of the arithmetic mean and the standard deviation cannot be overstated in statistics because the same rules that apply to the $\bar{x}$ and the s apply to whatever equation they appear in.

That is, the distribution is normal:

It has no extreme values.
It has no gaps.
It has no outliers.
It is not skewed (skewness).
It is not peaked (kurtosis).
It is not bi-modal (two peaks).
It is not poly-modal (many peaks).

Interpreting the SEE

The SEE for the example data is 8.03%.

This number is the ± value on $\hat{y}$ in whatever units y was in (e.g. student 1st year University CGPA %).

Since the SEE is similar to the standard deviation, then saying 1.96*SEE is the same as saying 1.96*s.

Thus you can say that the population value of $\hat{y}$ (labeled as upper case $\hat{Y}$) will, with 95% certainty, fall within ±1.96*SEE of $\hat{y}$, or…

±1.96*8.03% = ±15.74%

[Figure: scatterplot “High School Best 6 OACs & 1st Year University Grades”. x-axis: HS best six %; y-axis: 1st Year University CGPA. Trend line: y = 1.0862x - 12.22, R = 0.617, R2 = 0.3812, SEE = 8.03. Lines on either side of the trend line represent the average variability of the data points from the line.]

This average variability is calculated as the SEE and represents the average margin of error of any data point from the trend line.

Predicting the Value for x = 80% from the Linear Regression Equation

$\hat{y} = mx + b$

$\hat{y}$ = 1.0862x - 12.22

$\hat{y}$ = 1.0862 * 80% - 12.22

$\hat{y}$ = 74.7%

Predicting Margins of Error at 95% Confidence from the Linear Regression Equation

Predicted value $\hat{y}$ = 74.7%

Predicted margin (the SEE) at 95% = 1.96 * 8.03%

Predicted margin = ±15.74%

Range within which population value ($\hat{Y}$) falls with 95% certainty = 74.7% ± 15.74% = 58.96% to 90.44%, or roughly 59% to 90%.

The large range of the margin is a function of the relatively modest ‘r’ value (.617) and the small ‘n’ (42 pairs).
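The margin calculation above can be sketched in a few lines. The slope, intercept, SEE and the 1.96 multiplier are the values used in the slides; the function name `prediction_with_margin` is a hypothetical label. Because the code keeps the unrounded prediction (74.676 rather than 74.7), its range ends can differ from the hand calculation in the second decimal place.

```python
SLOPE, INTERCEPT = 1.0862, -12.22  # trend line printed on the slides
SEE = 8.03                         # standard error of estimate from the slides
Z_95 = 1.96                        # multiplier for 95% used in the slides

def prediction_with_margin(hs_grade):
    """Return (predicted grade, margin, lower bound, upper bound), all in %."""
    y_hat = SLOPE * hs_grade + INTERCEPT
    margin = Z_95 * SEE
    return y_hat, margin, y_hat - margin, y_hat + margin

y_hat, margin, lower, upper = prediction_with_margin(80.0)
print(f"{y_hat:.1f}% +/- {margin:.2f}% -> {lower:.1f}% to {upper:.1f}%")
```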

How might you reduce these margins?

Reducing Margins of Error

The good news is that theoretically they can be reduced by increasing the number of pairs or ‘n’. The bad news is that they likely cannot be reduced by very much. Why? Consider the following:


[Figure: scatterplot of HS best 6 % (x) against 1st year University CGPA (y) with trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381, and a subset of students circled in red]

If your sample were to have caught only the red circled students, then your r would be small and hence your SEE high.

But the larger the sample, the more likely you’ll approximate the distribution and the r seen on the graph, but no better.

Remember what happens when you increase sample size using √n.
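The √n effect can be illustrated with a short sketch. It assumes, as a simplification, that the margin shrinks in proportion to 1/√n and nothing else changes; the function name `relative_margin` and the baseline of 30 are choices made here for illustration.

```python
from math import sqrt

def relative_margin(n, baseline_n=30):
    """Margin at sample size n as a fraction of the margin at baseline_n,
    assuming the margin is proportional to 1 / sqrt(n)."""
    return sqrt(baseline_n / n)

sizes = [30, 60, 120, 240, 480]
fractions = [relative_margin(n) for n in sizes]
for n, fraction in zip(sizes, fractions):
    print(f"n={n:3d}: margin is {fraction:.3f} of the n=30 margin")
```

Under this assumption each doubling of n shrinks the margin by the same factor of about 0.71, so four doublings (30 to 480 pairs) are needed just to cut the margin to a quarter.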

Diminishing Returns on Sample Size

[Figure: change in confidence interval (CI %) for every doubling of ‘n’, starting at n = 30 (n = 30, 60, 120, 240, 480, …). x-axis: doublings of ‘n’; y-axis: change in CI %.]

We’ll look at this more closely later in the Sampling lecture.

Relationships Summary

Relationships measure the effect of one variable (the independent or x) on another (the dependent or y).

The direction and strength of the effect is given by the correlation coefficient (r) and its reliability by the Ser and Ser0.

The degree (in % terms) to which x causes change in y is given by the coefficient of determination (r2).

Using line equations, regression allows us to use the relationship measured by correlation to forecast values of y for given values of x.

Using the standard error (called SEE) allows us to put margins of error on the predicted values.

Residuals Analysis in Regression

Regression and Residuals

Linear equations express how well a straight line fits your data.

The actual regression line is calculated as the line that minimizes the squared distances of all points in the dataset from the line.

More precisely, it calculates a line where the sum of the squared differences of all values of y from $\hat{y}$, for any value of x, will be the smallest.

What Are Residuals?

A residual is the difference between an observed value of y and a predicted value of y:

Residual = observed value – predicted value

$e = y - \hat{y}$

Both the sum and the mean of a residuals set equal 0:

$\sum e = 0$, $\bar{e} = 0$
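The zero-sum property can be checked with a short sketch. As an assumption for illustration, it fits a least squares line to only the first five pairs of the lecture's grades table, so the fitted slope and intercept are not the lecture's 1.0862 and -12.22; the property holds for whatever line least squares produces.

```python
from statistics import mean

# First five (x, y) pairs of the lecture's grades table, used only as sample input.
xs = [91.83, 90.83, 84.67, 83.83, 83.67]
ys = [87.00, 88.00, 82.40, 76.30, 78.30]

# Least squares slope and intercept for these five pairs.
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# Residual = observed value - predicted value, for every pair.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)
residual_mean = mean(residuals)

# Both are zero up to floating-point rounding.
print(abs(residual_sum) < 1e-9, abs(residual_mean) < 1e-9)
```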

So what?

What Does A Residuals Analysis Tell You?

First: A residuals plot confirms whether a linear trend exists in your data.

Second: A residuals plot can also indicate whether another type of trend exists in your data.

If the trend in the residuals plot is…

Strong: the trend in the data is not linear but could be non-linear.
Weak/None: the trend in the data is linear, not non-linear.


[Figure: scatterplot of HS best 6 % (x) against 1st year University CGPA (y) with trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381, annotated with the callouts below]

Observed value of a ‘y’ for a given ‘x’.

Predicted value of y using line equation.

Another observed value of a ‘y’ for an identical given value of ‘x’.

Red lines (or their mathematical equivalents) are called residuals.

Best fit line will minimize the total of the squared ‘lengths’ of all the red lines.

How Residuals Are Calculated

Student sample #   High school G12 grade (x)   First year university grade (y)   $\hat{Y}$   Residuals (e)
 1                 91.83                       87.00                             85.41        1.59
 2                 90.83                       88.00                             84.58        3.42
 3                 84.67                       82.40                             79.47        2.93
 4                 83.83                       76.30                             78.77       -2.47
 5                 83.67                       78.30                             78.64       -0.34
 6                 83.33                       80.20                             78.35        1.85
 7                 82.00                       75.40                             77.25       -1.85
 8                 81.17                       67.40                             76.56       -9.16
 9                 81.00                       84.30                             76.42        7.88
10                 78.33                       76.80                             74.20        2.60
11                 77.83                       78.60                             73.79        4.81

The $\hat{Y}$ is predicted by the linear equation for each ‘x’.

The e is given by y - $\hat{Y}$.
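The residuals column can be reproduced directly from the observed and predicted columns. The sketch below copies the eleven y and predicted values from the table as given (it does not refit any line) and applies e = y - predicted; the variable names are hypothetical.

```python
# Observed y, predicted y and listed residuals, copied from the table above.
observed  = [87.00, 88.00, 82.40, 76.30, 78.30, 80.20, 75.40, 67.40, 84.30, 76.80, 78.60]
predicted = [85.41, 84.58, 79.47, 78.77, 78.64, 78.35, 77.25, 76.56, 76.42, 74.20, 73.79]
listed    = [1.59, 3.42, 2.93, -2.47, -0.34, 1.85, -1.85, -9.16, 7.88, 2.60, 4.81]

# Residual = observed value - predicted value.
residuals = [round(y - y_hat, 2) for y, y_hat in zip(observed, predicted)]

# Compare each computed residual with the value printed in the table.
matches = all(abs(e - shown) < 1e-6 for e, shown in zip(residuals, listed))
print(residuals)
print(matches)  # True
```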

[Figure: “HS GRADES AND 1ST YEAR CGPA”, scatterplot of y against x for HS grades and first year University. Linear trend line: f(x) = 0.7390x + 9.5919, R² = 0.2494. Linear pattern to the data and a moderate r.]

[Figure: scatterplot of residuals e against x for HS grades and first year University. Linear trend line: f(x) = 1.577E-06x − 0.0039, R² = 1.514E-12. No pattern to the data and an almost zero r.]

Residuals Summary

The stronger the linear trend, the weaker the pattern in the residuals.

The weaker the linear trend, the stronger the pattern in the residuals.

BUT!

Take a look at these non-linear trend lines and their residuals.

[Figure: “Income (X) and IM (Y) Raw Data”, scatterplot with a linear trend line, f(x) = −0.002432x + 62.73, R² = 0.3894, and a red logarithmic trend line, R² = 0.6800.]

Natural raw data scatter is strongly skewed. The r2, about 39%, is fair but the shape of these data clearly does not fit a linear trend line (they fit the red logarithmic line much better).

[Figure: “X Variable 1 Residual Plot”. x-axis: X Income; y-axis: Residuals; fitted curve R² = 0.4083.]

And the residuals plot for those same data shows another strong pattern (observe the r2), but it’s not linear, indicating that a straight line fit to these data is inadvisable.

Residuals Summary

If you calculate your residuals from a linear trend line and your data is not linear, then there will be a pattern in the residuals.

The poorer a linear trend line fits your data the stronger the residuals pattern will be.

If you calculate your residuals from a linear trend line and your data is linear, then there will be a weaker pattern in the residuals.

At r = ±1.0, r2 = 1.0 and there will be no pattern in the residuals because all points will fall on the line and there will be no residual values – that is, the values will all be zero.
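Both ends of this summary can be seen in a toy example. The sketch below is an illustration under stated assumptions: the data are hypothetical (a perfect parabola and a perfect straight line on the same x values) and `linear_residuals` is a helper name introduced here. A straight line fitted to the curved data leaves a clear U-shaped pattern in the residuals, while the perfectly linear data leave residuals of zero.

```python
from statistics import mean

def linear_residuals(xs, ys):
    """Residuals (y minus predicted y) from a least squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Hypothetical illustration data.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
curved = [x ** 2 for x in xs]          # non-linear: y = x^2
straight = [2.0 * x + 1.0 for x in xs] # perfectly linear: r = 1.0

curved_residuals = linear_residuals(xs, curved)
straight_residuals = linear_residuals(xs, straight)

print([round(e, 2) for e in curved_residuals])   # positive at both ends, negative in the middle
print([round(abs(e), 2) for e in straight_residuals])  # all zero
```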
