bartholomew-bond
Regression
Correlation – Recap
Correlation provides an estimate of how closely change in ‘x’ is associated with change in ‘y’.
The relationship has a magnitude (the r value) and a direction (the sign on the r value).
The r value measures how close the untransformed data points are to a straight line.
Therefore, the r value is a very important statistic for regression analysis because it tells you how
accurate your predicted values will be.
That is why we tested the correlation value so thoroughly because being able to predict the
future is pretty cool.
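As a reminder of what the r value measures, here is a minimal Python sketch that computes Pearson's r from deviations about the means. The function name `pearson_r` and the five toy pairs are assumptions made for illustration; they are not the course data.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data (not the course data set).
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(toy_x, toy_y)
print(f"r = {r:.3f}, r squared = {r * r:.2f}")
```

The sign of `r` gives the direction of the relationship and its size gives the magnitude; flipping the sign of every y value flips the sign of `r` but not its size.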
Regression
Regression analysis is another method by which the relationship between dependent and independent variables can be estimated.
Unlike correlation, which just tells you the strength and direction of a relationship, regression tells you much more about each point and its place in the
relationship.
Regression also tells you how you can use an ‘x’ value to predict a ‘y’ value using the mathematical expression of a trend line.
This is how you can predict the future.
Trend Lines
A trend line is a line drawn through a frequency distribution of paired values called a scatterplot.
The scatterplot shows the overall pattern of the points around the trend line.
TWO IMPORTANT POINTS:
The data values are the finest degree of resolution in your data.
The trend line is the coarsest degree.
Each shows you a different type of information.
Trend Lines
Trend lines can be straight and have their ‘straightness’ defined by their different angles: i.e. are they steep or shallow?
Trend lines can also be curvilinear and have their ‘curviness’ defined by polynomial, logistic, or
exponential (log) functions.
Both types of lines also have the ‘fit’ of their data points to the line defined by their correlation
coefficient.
Types of Trend Lines
Polynomial: also called exponential functions and defined by the exponent on x (y = x^n). They are labelled by their ‘degree’: Quadratic = 2, Cubic = 3, Quartic = 4, Quintic = 5.
Takes many variants of the form:
Linear: the simplest linear expressions, of the form y = mx + b.
Linear Trend Lines
The other aspect of trend lines, apart from their shape, is whether they have one or more than one independent variable.
That is, are they bivariate or multivariate?
We have seen only bivariate trend lines so far: that is, lines having a y and one x.
For our discussion on regression we will stick with these bivariate linear trend lines.
How are linear trend lines created?
First, the line always passes through the arithmetic means of the x and the y variables.
Second, the trend line is always as close as it can possibly be to every data point.
Third, the difference between each data point and the line is as small as it can be when all points
are considered.
This is done by minimising the sum of the squared differences.
This is why the Pearson formulation we shall use is called the “least squares” method.
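The properties above can be checked numerically. The following is a minimal Python sketch, not the Excel procedure itself: the helper names (`fit_line`, `sum_squared_differences`) and the five toy pairs are assumptions made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # Choosing the intercept this way forces the line through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


def sum_squared_differences(xs, ys, slope, intercept):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (not the 42 pairs of grades).
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_line(toy_x, toy_y)
best = sum_squared_differences(toy_x, toy_y, slope, intercept)

# A line with a slightly different slope fits worse.
nudged = sum_squared_differences(toy_x, toy_y, slope + 0.1, intercept)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"least-squares sum = {best:.2f}, nudged line sum = {nudged:.2f}")
```

Substituting the mean of x (3.0 here) into the fitted line returns the mean of y (4.0), and the nudged line gives a larger sum of squared differences than the fitted one.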
An example: 42 pairs of grades.
Each student has a high school grade (the X or independent variable)
and a 1st year university grade (the Y or
dependent variable).
They are labelled such because the HS grade
could influence the Uni grade but not the other
way around – i.e. Y is dependent on X and not
X on Y.
Student #   Best 6 HS Grade (X)   1st Yr Uni Grade (Y)   Student #   Best 6 HS Grade (X)   1st Yr Uni Grade (Y)
1           91.83                 87.00                  22          73.67                 69.00
2           90.83                 88.00                  23          73.17                 67.50
3           84.67                 82.40                  24          72.83                 76.90
4           83.83                 76.30                  25          72.83                 63.00
5           83.67                 78.30                  26          72.67                 65.00
6           83.33                 80.20                  27          72.33                 67.00
7           82.00                 75.40                  28          72.00                 71.00
8           81.17                 67.40                  29          71.50                 65.00
9           81.00                 84.30                  30          71.17                 54.00
10          78.33                 76.80                  31          71.00                 63.50
11          77.83                 78.60                  32          70.67                 67.00
12          77.83                 67.90                  33          70.67                 67.90
13          77.67                 55.80                  34          70.50                 69.00
14          76.67                 76.00                  35          70.33                 65.40
15          76.50                 76.60                  36          70.17                 65.10
16          75.67                 42.70                  37          70.17                 64.70
17          74.83                 80.30                  38          70.00                 70.00
18          74.67                 82.90                  39          70.00                 70.00
19          74.67                 67.00                  40          69.67                 40.60
20          74.50                 71.00                  41          69.67                 61.00
21          74.33                 65.80                  42          69.33                 67.00
Mean of all X = 75.48   Mean of all Y = 69.76   r = 0.617   r2 = 0.38   SEE = 8.03
[Figure: High School Best Six OACs & 1st Year University Grades. Scatterplot with trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381. x-axis: HS best 6 %; y-axis: 1st year University CGPA.]
Means of the Regression Line
The regression line passes through the mean of x (75.48) and the mean of y (69.76).
The sum of the squared distances from each point to the line is as small as it can be when all points are considered. That is, the line cannot get any closer overall to the points.
Prediction - Regression
A high school grade of 80% will predict a
first year university grade of almost 75%
But can we get a more accurate prediction than
“almost”?
Yes, using this linear equation.
Regression for Real
Regression is a mathematical method which uses a linear equation by which one value (y) can be predicted by another value (x).
Furthermore, the predicted value can be given ‘margins of error’ - that is, x will predict y within ± whatever
units y is in.
The accuracy of the predicted value of y and the size of the margins of error will depend on how well the data
points match a straight line.
AND THAT DEPENDS ON HOW HIGH YOUR r VALUE IS.
It also depends on how many pairs of data points you have – that is, your ‘n’.
Linear Regression – Equating the Line
$$\hat{y} = m x_1 + b$$
Where:
ŷ (called ‘WHY’ HAT) is the predicted value of y for a given x
b is the intercept value
m is the slope of the line
x_1 is the given value in the x (independent) dataset from which you want to predict y.
This is the formula that Excel uses.
You sometimes also see: y = a + bx
Prediction - Regression
[Figure: scatterplot of High School Best Six OACs & 1st Year University Grades. Trend line: y = 1.0862x - 12.22, R = 0.617, R2 = 0.381. x-axis: HS best 6 %; y-axis: 1st year University CGPA.]
This is the linear regression equation used to predict the value in Excel, where ŷ = mx + b, with ‘m’ as the slope and ‘b’ as the intercept.
Prediction Reprise - “Almost” Regression Using The Line
A high school grade of 80% will predict a
first year university grade of almost 75%
But can we get a more accurate prediction than “almost”?
Yes, using this linear equation.
Predicting an 80% Incoming HS Grade from the Linear Regression Equation
$$\hat{y} = m x_1 + b$$
ŷ = 1.0862x - 12.22
ŷ = 1.0862 * 80% - 12.22
ŷ = 74.7%
This is our “almost” 75%.
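The same substitution can be written as a short Python sketch. The slope and intercept are taken from the trend line equation on the slides; the function name `predict_university_grade` is an assumption made for illustration.

```python
SLOPE = 1.0862       # m, from the trend line equation on the slides
INTERCEPT = -12.22   # b, from the same equation


def predict_university_grade(hs_grade):
    """Predicted 1st year grade (y hat) for a given HS grade, both in %."""
    return SLOPE * hs_grade + INTERCEPT


predicted = predict_university_grade(80.0)
print(f"{predicted:.1f}%")  # prints "74.7%"
```

Any other HS grade within the range of the data can be passed in the same way.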
Standard Error of the Estimate (SEE) of ŷ
The predicted ŷ is not exact even if we have an exact x to start with because…
There is likely more than one y value for every x value, and…
The line is based on the correlation coefficient which was not a perfect 1.0 but an imperfect 0.617, and…
Our r2 of 0.38 only explains 38% of variability, and…
Our ‘n’ is only a sample of 42 pairs and not everyone in the population from which the sample of 42 pairs came.
HS Grade (X)   University grade (Y)
72.83          76.90
72.83          63.00
70.17          65.10
70.17          64.70
(Rows taken from the 42 pairs of grades listed earlier.)
Mean of X = 75.48   Mean of Y = 69.76
r = 0.617   r2 = 0.38   SEE = 8.03
The two students with an HS grade of 72.83 have the same HS grade but widely differing first year grades (76.90 and 63.00).
The two students with an HS grade of 70.17 have the same HS grade and very close first year grades (65.10 and 64.70).
If you plug 70.17 into the equation you get a predicted first year value of 63.99, which is very close to the actual grades.
If you plug 72.83 into the equation you get a predicted first year value of 66.89, which is not very close to the actual grades.
This variability between the predicted grades
and the actual grades is called the error of
estimate and it can be calculated as a statistic
called the standard error of estimate (SEE).
[Figure: the scatterplot of High School Best Six OACs & 1st Year University Grades (trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381), with lines drawn from the data points to the trend line.]
These lines represent the idea of variability of the data points from the line. The SEE is the square root of the average of all the squared differences from the data points to the line, with n-2 as the divisor.
Standard Error of Estimate of ŷ
Luckily we don’t have to calculate this by hand. Excel calculates it for you.
The result is the ± value on ŷ in whatever units y was in (e.g. in this case, student CGPA in %).
Again the SEE is similar to using the standard deviation.
$$SEE = \sqrt{\frac{1}{n-2}\left[\sum (y-\bar{y})^2 - \frac{\left[\sum (x-\bar{x})(y-\bar{y})\right]^2}{\sum (x-\bar{x})^2}\right]}$$
Note the ‘2’ in n-2. That’s because we have a pair of values and not just one value.
And remember that n-1 is used in place of n for a sample.
Note the squared differences of x’s and y’s from the mean of all x’s and y’s.
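For readers who want to see the SEE formula at work outside Excel, here is a minimal Python sketch built from the same deviation sums. The function name `standard_error_of_estimate` and the toy data are assumptions made for illustration; this is not Excel's own implementation.

```python
from math import sqrt
from statistics import mean


def standard_error_of_estimate(xs, ys):
    """SEE computed from squared deviations about the means of x and y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # n - 2 is the divisor that appears in the SEE formula.
    return sqrt((syy - sxy ** 2 / sxx) / (n - 2))


# Hypothetical toy data (not the 42 pairs of grades).
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

see = standard_error_of_estimate(toy_x, toy_y)
print(f"SEE = {see:.3f}")
```

The result is in the same units as y, which is what allows it to be used as a ± value on a prediction.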
Standard Error of Estimate of ŷ
Now look again, …
$$SEE = \sqrt{\frac{1}{n-2}\left[\sum (y-\bar{y})^2 - \frac{\left[\sum (x-\bar{x})(y-\bar{y})\right]^2}{\sum (x-\bar{x})^2}\right]}$$
… and compare it to this:
$$s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}}$$
Which you should all recognize as the standard deviation formula.
Once again the usefulness of the arithmetic mean and the standard deviation is evident.
The Importance of x̄ and s
Stop and absorb this.
The importance of the arithmetic mean and the standard deviation cannot be overstated in statistics because the same rules that apply to the x̄ and the s apply to whatever equation they appear in.
That is, the distribution is normal:
It has no extreme values.
It has no gaps.
It has no outliers.
It is not skewed (skewness).
It is not peaked (kurtosis).
It is not bi-modal (two peaks).
It is not poly-modal (many peaks).
Interpreting the SEE
The SEE for the example data is 8.03%.
This number is the ± value on ŷ in whatever units y was in (e.g. student 1st year University CGPA %).
Since the SEE is similar to the standard deviation, then saying 1.96*SEE is the same as saying 1.96*s.
Thus you can say that the population value of ŷ (labeled as upper case Ŷ) will, with 95% certainty, fall within ±1.96*SEE of the predicted value, or…
±1.96*8.03% = ±15.74%
[Figure: scatterplot of High School Best 6 OACs & 1st Year University Grades with trend line y = 1.0862x - 12.22 (R = 0.617, R2 = 0.3812, SEE = 8.03) and lines marking the variability around it. x-axis: HS best six %; y-axis: 1st Year University CGPA.]
These lines represent the average variability of the data points from the line.
This average variability is calculated as the SEE and represents the average margin of error of any data point from the trend line.
Predicting the Value x = 80% from the Linear Regression Equation
$$\hat{y} = m x + b$$
ŷ = 1.0862x - 12.22
ŷ = 1.0862 * 80% - 12.22
ŷ = 74.7%
Predicting Margins of Error at 95% Confidence from the Linear Regression Equation
Predicted value = 74.7%
Predicted margin (the SEE) at 95% = 1.96 * 8.03%
Predicted margin = ±15.74%
Range within which population value (Ŷ) falls with 95% certainty = 74.7% ± 15.74% = 59.0% to 90.4%.
The large range of the margin is a function of the relatively modest ‘r’ value (.617) and the small ‘n’ (42 pairs).
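The margin arithmetic can be reproduced with a few lines of Python. This is a minimal sketch using the predicted value and SEE quoted on the slides; the names `margin_of_error`, `PREDICTED`, `SEE` and `Z_95` are assumptions made for illustration. Note that 1.96 × 8.03 works out to 15.74, the figure given on the ‘Interpreting the SEE’ slide.

```python
PREDICTED = 74.7   # y hat for x = 80 %, as quoted on the slides
SEE = 8.03         # standard error of estimate, as quoted on the slides
Z_95 = 1.96        # multiplier for 95 % confidence


def margin_of_error(see, z=Z_95):
    """Margin of error on a predicted value: z multiplied by the SEE."""
    return z * see


margin = margin_of_error(SEE)
lower, upper = PREDICTED - margin, PREDICTED + margin
print(f"±{margin:.2f}%: {lower:.1f}% to {upper:.1f}%")
```

With these inputs the range runs from about 59.0% to about 90.4%.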
How might you reduce these margins?
Reducing Margins of Error
The good news is that theoretically they can be reduced by increasing the number of pairs or ‘n’. The bad news is that they likely cannot be reduced by very much. Why? Consider the following:
[Figure: the scatterplot of High School Best Six OACs & 1st Year University Grades (trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381), with a subset of students circled in red.]
If your sample were to have caught only the red circled
students, then your r would be small and hence your SEE high.
But the larger the sample, the more likely you’ll approximate the distribution and the r seen on the
graph, but no better.
Remember what happens when you increase sample size using √n.
Diminishing Returns on Sample Size
[Figure: change in CI for every doubling of ‘n’, starting at n = 30 (points labelled n=30, n=60, n=120, n=240, n=480). x-axis: doublings of ‘n’; y-axis: change in Confidence Interval (CI%).]
We’ll look at this more closely later in the Sampling lecture.
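The diminishing returns can be illustrated with a short Python sketch. It assumes the familiar z·s/√n form for the half-width of a 95% confidence interval around a mean, which may differ from the exact quantity plotted on the slide; the standard deviation of 10 and the names used are hypothetical.

```python
from math import sqrt

Z_95 = 1.96
ASSUMED_S = 10.0  # hypothetical standard deviation, for illustration only


def ci_half_width(s, n, z=Z_95):
    """Half-width of a 95 % confidence interval for a mean: z * s / sqrt(n)."""
    return z * s / sqrt(n)


sample_sizes = [30 * 2 ** k for k in range(5)]  # 30, 60, 120, 240, 480
widths = [ci_half_width(ASSUMED_S, n) for n in sample_sizes]

# Improvement gained by each successive doubling of n.
gains = [w_prev - w_next for w_prev, w_next in zip(widths, widths[1:])]

for n, w in zip(sample_sizes, widths):
    print(f"n = {n:3d}: ±{w:.2f}")
```

Each doubling multiplies the half-width by 1/√2 (about 0.71), so every additional doubling buys a smaller absolute improvement than the one before it.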
Relationships Summary
Relationships measure the effect of one variable (the independent or x) on another (the dependent or y).
The direction and strength of the effect is given by the correlation coefficient (r) and its reliability by the Ser and Ser0.
The degree (in % terms) to which change in x explains change in y is given by the coefficient of determination (r2).
Using line equations, regression allows us to use the relationship measured by correlation to forecast values of y
for given values of x.
Using the standard error (called SEE) allows us to put margins of error on the predicted values.
Residuals Analysis in Regression
Regression and Residuals
Linear equations express how well a straight line fits your data.
The actual regression line is calculated as the line that minimizes the squared distances of all points in the dataset from the line.
More precisely, it calculates a line where the sum of the squared differences of all values of y from ŷ, for any value of x, will be the smallest.
What Are Residuals?
A residual is the difference between an observed value of y and a predicted value of y:
Residual = observed value – predicted value
e = y - ŷ
Both the sum and the mean of a set of residuals equal 0:
Σe = 0, ē = 0
So what?
What Does A Residuals Analysis Tell You?
First:
A residuals plot confirms whether a linear trend exists in your data.
Second:
A residuals plot can also indicate whether another type of trend exists in your data.
If the trend in the residuals plot is…
Strong: the trend in the data is not linear but could be non-linear.
Weak/None: the trend in the data is linear, not non-linear.
[Figure: the scatterplot of High School Best Six OACs & 1st Year University Grades (trend line y = 1.0862x - 12.22, R = 0.617, R2 = 0.381), with red lines drawn from observed points to the trend line.]
Observed value of a ‘y’ for a given ‘x’.
Another observed value of a ‘y’… for an identical given value of ‘x’.
Predicted value of y… using line equation.
Red lines (or their mathematical equivalents) are called residuals.
Best fit line will minimize the total of the squared ‘lengths’ of all the red lines.
How Residuals Are Calculated
Student sample #   High school G12 grade (x)   First year university grade (y)   Predicted (Ŷ)   Residual (e)
1                  91.83                       87.00                             85.41           1.59
2                  90.83                       88.00                             84.58           3.42
3                  84.67                       82.40                             79.47           2.93
4                  83.83                       76.30                             78.77           -2.47
5                  83.67                       78.30                             78.64           -0.34
6                  83.33                       80.20                             78.35           1.85
7                  82.00                       75.40                             77.25           -1.85
8                  81.17                       67.40                             76.56           -9.16
9                  81.00                       84.30                             76.42           7.88
10                 78.33                       76.80                             74.20           2.60
11                 77.83                       78.60                             73.79           4.81
The Ŷ is predicted by the linear equation for each ‘x’.
The e is given by y - Ŷ.
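The calculation of residuals, and the fact that they sum to zero around a least-squares line, can be sketched in Python as follows. The helper names and the toy data are assumptions made for illustration; the fitted line here comes from the toy data, not from the grades table.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys, slope, intercept):
    """e = observed y minus predicted y, for every pair."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical toy data (not the grades table).
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_line(toy_x, toy_y)
e = residuals(toy_x, toy_y, slope, intercept)

print([round(value, 2) for value in e])
print(f"sum = {sum(e):.6f}, mean = {mean(e):.6f}")
```

Positive residuals are points above the line and negative residuals are points below it; up to floating-point rounding, they cancel exactly.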
Scatterplot of y against x for HS grades and first year University
[Figure: HS GRADES AND 1ST YEAR CGPA, with linear trend line f(x) = 0.738998422933728 x + 9.5919129758953, R² = 0.249369551739253.]
Linear pattern to the data and a moderate r.
Scatterplot of residuals e against x for HS grades and first year University
[Figure: residuals e plotted against x, with trend line f(x) = 1.57706627156973E-06 x − 0.00391297589516879, R² = 1.51356704947148E-12.]
No pattern to the data and an almost zero r.
Residuals Summary
The stronger the linear trend, the weaker the pattern in the residuals.
The weaker the linear trend, the stronger the pattern in the residuals.
BUT!
Take a look at these non-linear trend lines and their residuals.
Income (X) and IM (Y) Raw Data
[Figure: scatterplot of Income (X) against IM (Y), with a linear trend line f(x) = − 0.00243219927787781 x + 62.7303255135724 (R² = 0.389406588370063) and a red logarithmic trend line (R² = 0.679994888694155).]
Natural raw data scatter is strongly skewed. The r2, 38%, is fair but the shape of these data clearly does not fit a linear trend line (they fit the red logarithmic line much better).
[Figure: X Variable 1 Residual Plot. x-axis: X Income; y-axis: Residuals; R² = 0.408306988867007.]
And the residuals plot for those same data shows another strong pattern (observe the r2), but it’s not linear, indicating that a straight line fit to these data is inadvisable.
Residuals Summary
If you calculate your residuals from a linear trend line and your data is not linear, then there will be a pattern in the residuals.
The poorer a linear trend line fits your data the stronger the residuals pattern will be.
If you calculate your residuals from a linear trend line and your data is linear, then there will be a weaker pattern in the residuals.
At r = ±1.0, r2 = 1.0 and there will be no pattern in the residuals because all points will fall on the line and there will be no residual values; that is, the values will all be zero.
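The contrast between the two cases can be sketched in Python. The example below fits a straight line to deliberately curved data (y = x squared) and to perfectly linear data (y = 2x + 1); both data sets and all names are hypothetical and chosen only to make the residual patterns easy to see.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Residuals of each point from the least-squares line through the data."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


xs = [float(x) for x in range(11)]       # 0, 1, ..., 10

curved_y = [x ** 2 for x in xs]          # hypothetical non-linear data
linear_y = [2.0 * x + 1.0 for x in xs]   # hypothetical perfectly linear data

curved_residuals = residuals(xs, curved_y)
linear_residuals = residuals(xs, linear_y)

print([round(e, 1) for e in curved_residuals])
print([round(e, 1) for e in linear_residuals])
```

For the curved data the residuals are positive at both ends and negative in the middle, a clear U-shaped pattern; for the perfectly linear data (r = 1.0) every residual is zero.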