REGRESSION ANALYSIS
In the OUTCOME that you will commence in the second week back, you might be given data and asked to perform a REGRESSION ANALYSIS.
YOU NEED TO KNOW WHAT THIS MEANS
REGRESSION ANALYSIS is the process of fitting a linear model to a data set.
The aim is to determine the best linear model possible and to use
it to make predictions.
What do we mean by the “best possible linear model”?
The best possible linear model is the one in which:
a. The data is linear, or has been linearized by a data transformation; and
b. the linear model has the greatest possible value of r².
REMEMBER: the value of the coefficient of determination measures the predictive
power of our regression model.
r² AND PREDICTIVE POWER
If r² > 30%, then our model will have predictive power.
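For reference, the coefficient of determination for a least squares line can be written as below. The notation (bars for the means, sums over all data points) is the usual textbook notation and is an assumption here, since the slides only quote values read from the calculator.

```latex
r^2 = \frac{\left(\sum (x_i - \bar{x})(y_i - \bar{y})\right)^2}
           {\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}
```

Multiplying r² by 100 gives the percentage form used in these slides.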
STEP 1: Construct a scatterplot of the RAW (original) data and note:
a. Its shape
b. The value of the coefficient of determination
We are predicting LIFE EXPECTANCY from GDP, so:
FIRST: We must decide
Which is the INDEPENDENT (x) VARIABLE: GDP
Which is the DEPENDENT (y) VARIABLE: LIFE EXPECTANCY
GDP lifeex
950 58
1670 65
4250 68
11520 74
12280 73
4170 73
14300 75
5540 71
9830 72
1680 61
320 67
22260 66
550 50
930 66
940 64
2670 72
11220 74
1420 48
150 41
330 44
520 44
940 49
350 48
180 48
List A = gdp, List B = le (life expectancy)
CONCLUSION: Data is NON-LINEAR
From the Home screen, determine the value of r². Value of r² = 0.3665.
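If you want to check the calculator value away from the Home screen, the calculation can be sketched in Python. This is only an illustrative sketch: the helper name `r_squared` and the variable names are made up for this example, and the data are typed in from the table above.

```python
# Data typed in from the table above (List A = gdp, List B = le).
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def r_squared(x, y):
    """Coefficient of determination for the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


r2_raw = r_squared(gdp, le)
print(f"r^2 for the raw data: {r2_raw:.4f}")
```

The printed value should agree closely with the 0.3665 found on the calculator.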
STEP 2: We seek a Transformation to linearize the data.
CHECK THE CIRCLE OF TRANSFORMATIONS!!
Our scatterplot most closely resembles Quadrant 2!
[Diagram: the circle of transformations, divided into Quadrant 1, Quadrant 2, Quadrant 3 and Quadrant 4]
POTENTIALLY SUITABLE TRANSFORMATIONS are:
y²
log x
1/x
Try each of these transformations to determine which one effectively linearizes the data and gives the highest value for r².
STEP 3: In each case, obtain a RESIDUAL PLOT to confirm that the transformed data is linear.
gdp le lesqu
950 58 3364
1670 65 4225
4250 68 4624
11520 74 5476
12280 73 5329
4170 73 5329
14300 75 5625
5540 71 5041
9830 72 5184
1680 61 3721
320 67 4489
22260 66 4356
550 50 2500
930 66 4356
940 64 4096
2670 72 5184
11220 74 5476
1420 48 2304
150 41 1681
330 44 1936
520 44 1936
940 49 2401
350 48 2304
180 48 2304
List A = gdp (x variable)
List B = le (y variable)
List C = lesqu (y² transformed variable)
Y SQUARED TRANSFORMATION
THE TRANSFORMED DATA STILL APPEARS NON-LINEAR
Establish the value of r² in the HOME SCREEN: r² = 38.3%
CONFIRM WITH A RESIDUAL PLOT
Remember: to get the correct residual plot use the split screen view. Make sure that the scatterplot at the top has the correct transformed variable.
CONCLUSION: The residual plot shows a definite curved pattern, indicating that the transformed data is still not linear. The y² transformation has NOT succeeded in producing an effective linear model.
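The numbers behind a residual plot can be sketched in Python as well. This is a minimal sketch, not the calculator's own routine: `fit_line` is a made-up helper, and each residual is taken as the actual y² value minus the value predicted by the least squares line.

```python
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


lesqu = [y ** 2 for y in le]  # List C: the y-squared transformation
a, b = fit_line(gdp, lesqu)
residuals = [z - (a + b * x) for x, z in zip(gdp, lesqu)]

# r^2 = 1 - (sum of squared residuals) / (total sum of squares)
mean_z = sum(lesqu) / len(lesqu)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((z - mean_z) ** 2 for z in lesqu)
r2_ysq = 1 - ss_res / ss_tot

# List the residuals in order of increasing GDP, as they appear
# from left to right on a residual plot.
for x, r in sorted(zip(gdp, residuals)):
    print(f"{x:>6} {r:>9.1f}")
print(f"r^2 = {r2_ysq:.3f}")
```

Reading down the printed residuals in GDP order is a rough substitute for looking at the plot: a long run of residuals with the same sign suggests a curved pattern.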
NEXT STEP….
You guessed it!!
Now we try the next potential candidate transformation.
It was the log x transformation!
GDP lifeex logGDP
950 58 2.98
1670 65 3.22
4250 68 3.63
11520 74 4.06
12280 73 4.09
4170 73 3.62
14300 75 4.16
5540 71 3.74
9830 72 3.99
1680 61 3.23
320 67 2.51
22260 66 4.35
550 50 2.74
930 66 2.97
940 64 2.97
2670 72 3.43
11220 74 4.05
1420 48 3.15
150 41 2.18
330 44 2.52
520 44 2.72
940 49 2.97
350 48 2.54
180 48 2.26
(Delete the y² column, as we have discarded this transformation.)
List A = gdp, List B = le, List C = loggdp
r² = 66.0%
CONCLUSION: It appears that the log(GDP) transformation has successfully linearized the data! The scatterplot appears linear, and r² has increased.
NOTE THE VARIABLES ARE LISTED HERE SO YOU CAN CHECK
Now confirm this by creating a RESIDUAL PLOT for the log(x) transformation. Open a new graphing screen!!
CONCLUSION: The residual plot shows a random scattering of points with no pattern, indicating that the transformed data is linear. The value of r² has now increased to 66.0%. The log x transformation has succeeded in producing an effective linear model for the data, with significant predictive power.
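The log x column and its r² value can be checked with a short Python sketch. It assumes the calculator's log is the base 10 logarithm (the tabulated values are consistent with that), and the names `r_squared` and `loggdp` are chosen only for this example.

```python
import math

gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def r_squared(x, y):
    """Coefficient of determination for the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# List C: log (base 10) of GDP, rounded to 2 decimal places as in the table.
loggdp = [round(math.log10(x), 2) for x in gdp]
print(loggdp)

# Use the unrounded logs for the fit itself.
r2_log = r_squared([math.log10(x) for x in gdp], le)
print(f"r^2 for the log x transformation: {r2_log:.3f}")
```

Comparing the printed list with List C is a quick way to catch a typing error in the table.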
And now……
Yes you guessed it!
We need to check out the reciprocal (1/x) transformation, because...
maybe it will give a higher coefficient of determination than log x!
(here we go again)
List A, List B, List C = log(list1), List D = 1/list1
950 58 2.98 0.00105
1670 65 3.22 0.00060
4250 68 3.63 0.00024
11520 74 4.06 0.00009
12280 73 4.09 0.00008
4170 73 3.62 0.00024
14300 75 4.16 0.00007
5540 71 3.74 0.00018
9830 72 3.99 0.00010
1680 61 3.23 0.00060
320 67 2.51 0.00313
22260 66 4.35 0.00004
550 50 2.74 0.00182
930 66 2.97 0.00108
940 64 2.97 0.00106
2670 72 3.43 0.00037
11220 74 4.05 0.00009
1420 48 3.15 0.00070
150 41 2.18 0.00667
330 44 2.52 0.00303
520 44 2.72 0.00192
940 49 2.97 0.00106
350 48 2.54 0.00286
180 48 2.26 0.00556
Don’t delete log x column because we think this model was effective!
List A = gdp List B = le List C = loggdp List D = recgdp
[Scatterplot: Life expectancy against 1/GNP]
r² = 51.5%
CONCLUSION: The transformed data appears to be linear, but the value of the coefficient of determination is 51.5%, lower than for the loggdp transformation.
CONCLUSION: The residual plot shows a random scattering of points with no pattern, indicating that the 1/x transformation has made the data linear.
Remember to create a new graphing screen for the new transformation!!
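The reciprocal column (List D) and its r² can be reproduced the same way. Again this is just a sketch with made-up names (`recgdp`, `r_squared`), using the data typed in from the table above.

```python
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def r_squared(x, y):
    """Coefficient of determination for the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


recgdp = [1 / x for x in gdp]  # List D: the 1/x transformation
r2_rec = r_squared(recgdp, le)

print([f"{v:.5f}" for v in recgdp[:3]])
print(f"r^2 for the 1/x transformation: {r2_rec:.3f}")
```

Note that life expectancy falls as 1/x rises, so the fitted line has a negative slope here; r² is unaffected by the sign of the slope.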
OVERALL CONCLUSIONS
We have tested three transformations:
y² transformation: Ineffective (did not linearize the data)
log x transformation: Effective in linearizing the data, with r² = 66.0%
1/x transformation: Effective in linearizing the data, with r² = 51.5%
Based on this regression analysis, we conclude that the log(GDP) transformation provides the best model for making predictions from this data.
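The comparison of the three candidates can be summarised in one short Python sketch. The dictionary layout and names are assumptions made for this example. Remember that the y² transformation changes the y list, while log x and 1/x change the x list.

```python
import math

gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def r_squared(x, y):
    """Coefficient of determination for the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Each candidate is stored as (x list, y list) after transformation.
candidates = {
    "y squared": (gdp, [y ** 2 for y in le]),
    "log x": ([math.log10(x) for x in gdp], le),
    "1/x": ([1 / x for x in gdp], le),
}

results = {name: r_squared(x, y) for name, (x, y) in candidates.items()}
for name, r2 in results.items():
    print(f"{name:<10} r^2 = {100 * r2:.1f}%")

best = max(results, key=results.get)
print("Highest r^2:", best)
```

The highest r² on its own is not enough to choose a model: each candidate still needs its residual plot checked, as was done in the steps above.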
MAKING A PREDICTION
Use your linear regression model to predict the Life Expectancy in a country where the GNP is $8000.
gnp le loggnp (List3)
950 58 2.98
1670 65 3.22
4250 68 3.63
11520 74 4.06
12280 73 4.09
4170 73 3.62
14300 75 4.16
5540 71 3.74
9830 72 3.99
1680 61 3.23
320 67 2.51
22260 66 4.35
550 50 2.74
930 66 2.97
940 64 2.97
2670 72 3.43
11220 74 4.05
1420 48 3.15
150 41 2.18
330 44 2.52
520 44 2.72
940 49 2.97
350 48 2.54
180 48 2.26
Find the equation of the LEAST SQUARES REGRESSION line for the Log transformation
Regression (a + bx): XList = log(GNP), YList = le
a = 14.3, b = 14.5
Life Expectancy = 14.3 + 14.5 log(GNP)
Life Expectancy = 14.3 + 14.5 × log(8000)
= 70.9
Predicted Life Expectancy = 70.9 years
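The regression line and the prediction can be checked with the following Python sketch. It assumes a base 10 logarithm, and `fit_line` is a made-up helper rather than the calculator's Regression command. The prediction is computed twice: once with the rounded coefficients quoted above, and once with the unrounded fitted coefficients.

```python
import math

gnp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


loggnp = [math.log10(x) for x in gnp]
a, b = fit_line(loggnp, le)
print(f"Fitted line: Life Expectancy = {a:.3f} + {b:.3f} x log(GNP)")

# Prediction using the rounded coefficients quoted on the slide.
a_slide, b_slide = 14.3, 14.5
pred_slide = a_slide + b_slide * math.log10(8000)

# Prediction using the unrounded fitted coefficients.
pred_fit = a + b * math.log10(8000)

print(f"With a = 14.3, b = 14.5:        {pred_slide:.1f} years")
print(f"With unrounded coefficients:    {pred_fit:.1f} years")
```

Substituting the rounded values of a and b, as the slide does, gives 70.9 years. Keeping the unrounded coefficients gives roughly 70.7 years, so the two approaches can differ slightly in the first decimal place.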