REGRESSION ANALYSIS

In the OUTCOME that you will commence in the second week back, you might be given data and asked to perform a REGRESSION ANALYSIS.

YOU NEED TO KNOW WHAT THIS MEANS

REGRESSION ANALYSIS is the process of fitting a linear model to a data set.

The aim is to determine the best linear model possible and to use it to make predictions.

What do we mean by the “best possible linear model”?

The best possible linear model is the one in which:

a. The data is linear, or has been linearized by a data transformation; and

b. The linear model has the greatest possible value of r².

REMEMBER: the value of the coefficient of determination (r²) measures the predictive power of our regression model.

If r² > 30%, then our model will have PREDICTIVE POWER.
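For reference, the coefficient of determination can be written as below. This is the standard textbook definition rather than something stated on the slides; the symbols are assumptions for illustration, with \hat{y}_i the value predicted by the regression line and \bar{y} the mean of the observed y values.

```latex
r^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

Read this way, r² = 0.30 means the regression line accounts for 30% of the variation in y.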

STEP 1: Construct a scatterplot of the RAW (original) data and note:

a. Its shape

b. The value of the coefficient of determination

We are predicting LIFE EXPECTANCY from GDP, so FIRST we must decide:

Which is the INDEPENDENT (x) VARIABLE: GDP

Which is the DEPENDENT (y) VARIABLE: LIFE EXPECTANCY

GDP lifeex

950 58

1670 65

4250 68

11520 74

12280 73

4170 73

14300 75

5540 71

9830 72

1680 61

320 67

22260 66

550 50

930 66

940 64

2670 72

11220 74

1420 48

150 41

330 44

520 44

940 49

350 48

180 48

List A = gdp, List B = le

[Scatterplot of life expectancy against GDP]

CONCLUSION: Data is NON-LINEAR

From the Home screen, determine the value of r². Value of r² = 0.3665 (about 36.7%).
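If you want to check the calculator's value away from the Home screen, the sketch below computes r² for the raw data in plain Python. The function name `r_squared` is made up for this example and is not a calculator command; the two lists are the GDP and life expectancy columns from the table above.

```python
# Raw data from the table (List A = gdp, List B = le).
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def r_squared(xs, ys):
    """Coefficient of determination for the least squares line of ys on xs.

    Hypothetical helper for this example: for a straight-line fit,
    r^2 is the square of the correlation coefficient.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


r2_raw = r_squared(gdp, le)
print(f"r^2 for the raw data = {r2_raw:.4f}")
```

The printed value should agree with the Home screen result to four decimal places.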

STEP 2: We seek a Transformation to linearize the data.

CHECK THE CIRCLE OF TRANSFORMATIONS!!

Our scatterplot most closely resembles Quadrant 2!

[Diagram: the circle of transformations, divided into Quadrants 1 to 4]

POTENTIALLY SUITABLE TRANSFORMATIONS are:

y²

log x

1/x

Try each of these transformations to determine which one effectively linearizes the data and gives the highest value for r².
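As a rough sketch of what each transformation does to the lists, the Python below builds the three transformed columns. The variable names are chosen for this example only, and the log is assumed to be base 10, which matches the calculator values quoted in these slides (log 950 ≈ 2.98).

```python
import math

# Raw data (List A = gdp, List B = le).
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]

le_squared = [y ** 2 for y in le]        # y^2 transformation: changes the y list
log_gdp = [math.log10(x) for x in gdp]   # log x transformation: changes the x list
recip_gdp = [1 / x for x in gdp]         # 1/x transformation: changes the x list

# Show the first three rows so they can be compared with the calculator lists.
print("   gdp  le  le^2  log(gdp)  1/gdp")
for x, y, y2, lx, rx in list(zip(gdp, le, le_squared, log_gdp, recip_gdp))[:3]:
    print(f"{x:>6} {y:>3} {y2:>5} {lx:>9.2f} {rx:>8.5f}")
```

Note that a y² transformation replaces the y list (plot y² against x), while log x and 1/x replace the x list (plot y against the transformed x).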

STEP 3: In each case, obtain a RESIDUAL PLOT to confirm that the transformed data is linear.
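A residual is the actual y value minus the value predicted by the regression line, and a residual plot graphs these residuals against x. The sketch below shows that calculation in Python; `fit_line` and `residuals` are names invented for this example, and the raw lists are used purely as a demonstration (to check a transformation, pass in the transformed list instead).

```python
# Raw data (List A = gdp, List B = le).
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residuals(xs, ys):
    """Residual = actual y minus the y predicted by the fitted line."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


res = residuals(gdp, le)

# List the residuals in order of increasing x, as they appear left to right
# on a residual plot.
for x, e in sorted(zip(gdp, res)):
    print(f"{x:>6}  {e:+6.1f}")
```

If the signs of the residuals come in long runs as x increases (for example mostly negative, then mostly positive, then negative again), that is the curved pattern the residual plot would show; a random mix of signs suggests the linear model is appropriate.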

Y SQUARED TRANSFORMATION

List A: gdp (x variable)
List B: le (y variable)
List C: lesqu (y² transformed variable)

gdp     le   lesqu
950     58   3364
1670    65   4225
4250    68   4624
11520   74   5476
12280   73   5329
4170    73   5329
14300   75   5625
5540    71   5041
9830    72   5184
1680    61   3721
320     67   4489
22260   66   4356
550     50   2500
930     66   4356
940     64   4096
2670    72   5184
11220   74   5476
1420    48   2304
150     41   1681
330     44   1936
520     44   1936
940     49   2401
350     48   2304
180     48   2304

Establish the value of r² in HOMESCREEN: r² = 38.3%

THE TRANSFORMED DATA STILL APPEARS NON-LINEAR

CONFIRM WITH RESIDUAL PLOT

Remember: to get the correct residual plot use the split screen view. Make sure that the scatterplot at the top has the correct transformed variable.

CONCLUSION: The residual plot shows a definite curved pattern, indicating that the transformed data is still not linear. The y² transformation has NOT succeeded in producing an effective linear model.

NEXT STEP….

You guessed it!!

Now we try the next potential candidate transformation.

It was the log x transformation!

GDP     lifeex   logGDP
950     58       2.98
1670    65       3.22
4250    68       3.63
11520   74       4.06
12280   73       4.09
4170    73       3.62
14300   75       4.16
5540    71       3.74
9830    72       3.99
1680    61       3.23
320     67       2.51
22260   66       4.35
550     50       2.74
930     66       2.97
940     64       2.97
2670    72       3.43
11220   74       4.05
1420    48       3.15
150     41       2.18
330     44       2.52
520     44       2.72
940     49       2.97
350     48       2.54
180     48       2.26

(Delete the y² column, as we have discarded this transformation.)

List A = gdp, List B = le, List C = loggdp

r² = 66.0%

CONCLUSION: It appears that the log(GDP) transformation has successfully linearized the data! The scatterplot appears linear, and r² has increased.

NOTE THE VARIABLES ARE LISTED HERE SO YOU CAN CHECK

Now confirm this by creating a RESIDUAL PLOT for the log(x) transformation. Open a new graphing screen!!

CONCLUSION: The residual plot shows a random scattering of points with no pattern, indicating that the transformed data is linear. The value of r² has now increased to 66.0%. The log x transformation has succeeded in producing an effective linear model for the data with significant predictive power.

And now……

Yes you guessed it!

We need to check out the reciprocal x transformation, because …..

maybe it will give a higher coefficient of determination than the log x transformation!

(here we go again)

List A   List B   List C       List D
gdp      le       log(gdp)     1/gdp
950      58       2.98         0.00105
1670     65       3.22         0.00060
4250     68       3.63         0.00024
11520    74       4.06         0.00009
12280    73       4.09         0.00008
4170     73       3.62         0.00024
14300    75       4.16         0.00007
5540     71       3.74         0.00018
9830     72       3.99         0.00010
1680     61       3.23         0.00060
320      67       2.51         0.00313
22260    66       4.35         0.00004
550      50       2.74         0.00182
930      66       2.97         0.00108
940      64       2.97         0.00106
2670     72       3.43         0.00037
11220    74       4.05         0.00009
1420     48       3.15         0.00070
150      41       2.18         0.00667
330      44       2.52         0.00303
520      44       2.72         0.00192
940      49       2.97         0.00106
350      48       2.54         0.00286
180      48       2.26         0.00556

Don’t delete log x column because we think this model was effective!

List A = gdp List B = le List C = loggdp List D = recgdp

[Scatterplot of life expectancy against 1/GDP]

r² = 51.5%

CONCLUSION: The transformed data appears to be linear, but the value of the coefficient of determination is 51.5%, lower than for the loggdp transformation.


CONCLUSION: The residual plot shows a random scattering of points with no pattern, indicating that the 1/x transformation has made the data linear.

Remember to create a new graphing screen for the new transformation!!

OVERALL CONCLUSIONS

We have tested three transformations:

y² transformation: Ineffective (did not linearize the data)

log x transformation: Effective in linearizing the data, with r² = 66.0%

1/x transformation: Effective in linearizing the data, with r² = 51.5%

Based on this regression analysis, we conclude that the log(GDP) transformation provides the best model for making predictions from this data.
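The r² comparison can be reproduced with a short Python sketch. The helper `r_squared` and the dictionary labels are made up for this example, and log is assumed to be base 10 as on the calculator lists. Remember that r² alone does not show that a transformation has linearized the data; the residual plots are still needed for that.

```python
import math

# Raw data (List A = gdp, List B = le).
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def r_squared(xs, ys):
    """Coefficient of determination for the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# One entry per candidate transformation: (x list, y list) actually regressed.
models = {
    "y^2 vs x": r_squared(gdp, [y ** 2 for y in le]),
    "y vs log x": r_squared([math.log10(x) for x in gdp], le),
    "y vs 1/x": r_squared([1 / x for x in gdp], le),
}

for name, r2 in models.items():
    print(f"{name:<12} r^2 = {r2:.1%}")

best = max(models, key=models.get)
print("Highest r^2:", best)
```

The percentages printed may differ from the calculator values in the last decimal place because of rounding.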

MAKING A PREDICTION

Use your linear regression model to predict the Life Expectancy in a country where the GDP is $8000.

gdp     le   loggdp
950     58   2.98
1670    65   3.22
4250    68   3.63
11520   74   4.06
12280   73   4.09
4170    73   3.62
14300   75   4.16
5540    71   3.74
9830    72   3.99
1680    61   3.23
320     67   2.51
22260   66   4.35
550     50   2.74
930     66   2.97
940     64   2.97
2670    72   3.43
11220   74   4.05
1420    48   3.15
150     41   2.18
330     44   2.52
520     44   2.72
940     49   2.97
350     48   2.54
180     48   2.26

Find the equation of the LEAST SQUARES REGRESSION line for the log transformation.

Regression (a+bx): Xlist = loggdp, Ylist = le

a = 14.3

b = 14.5

Life Expectancy = 14.3 + 14.5 × log(GDP)

Life Expectancy = 14.3 + 14.5 × log(8000)

= 70.9

Predicted Life Expectancy = 70.9 years
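The same prediction can be sketched in Python as a cross-check. The function name `fit_line` is invented for this example, and log is assumed to be base 10, as on the calculator lists.

```python
import math

# Raw data (List A = gdp, List B = le).
gdp = [950, 1670, 4250, 11520, 12280, 4170, 14300, 5540, 9830, 1680, 320, 22260,
       550, 930, 940, 2670, 11220, 1420, 150, 330, 520, 940, 350, 180]
le = [58, 65, 68, 74, 73, 73, 75, 71, 72, 61, 67, 66,
      50, 66, 64, 72, 74, 48, 41, 44, 44, 49, 48, 48]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Regress life expectancy on log(GDP): Xlist = loggdp, Ylist = le.
log_gdp = [math.log10(x) for x in gdp]
a, b = fit_line(log_gdp, le)

# Substitute GDP = 8000 into the fitted equation.
predicted = a + b * math.log10(8000)

print(f"a = {a:.1f}, b = {b:.1f}")
print(f"Predicted life expectancy = {predicted:.1f} years")
```

Because this sketch keeps the unrounded values of a and b, its prediction may differ slightly in the first decimal place from the hand calculation, which uses a and b rounded to one decimal place.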
