Section C Group 9



Analysis of Presence of Multicollinearity (QAM-II)

    Submitted to Prof. Abhijit Bhattacharya

    SUBMITTED BY GROUP 5

    December 13, 2011

Amey Rambole ABM08009, Nitin Rawat PGP26104

Abhijit Talukdar PGP27134, Akash Joshi PGP27136

Manish Pushkar PGP27161, Suraj Somashekhar PGP27187

Sumeet Choudhary PGP27185


Problem Statement

The owner of Pizza Corner, Bangalore, would like to build a regression model consisting of six well-defined explanatory variables to predict the sales of pizzas. The six variables are:

    X1 : Number of delivery boys

X2 : Cost (in Rupees) of advertisements (000s)

    X3 : Number of outlets

    X4 : Varieties of pizzas

X5 : Competitors activities index

    X6 : Number of existing customers (000s)

Sales data for the past fifteen months on the above-listed variables is given below:

    SALES DATA FOR PIZZA

    Month Sales X1 X2 X3 X4 X5 X6

    1 81 15 20 35 17 4 70

    2 23 10 12 10 13 4 43

    3 18 7 11 14 14 3 31

    4 8 2 6 9 13 3 10

    5 16 4 10 11 12 4 17

    6 4 1 5 6 12 5 8

    7 29 4 14 15 15 2 39

    8 22 7 12 16 16 3 40

    9 15 5 10 18 15 4 30

    10 6 3 5 8 13 2 16

    11 45 13 17 20 14 2 30

12 11 2 9 10 12 3 20

    13 20 5 12 15 12 3 25

    14 60 12 18 30 15 4 50

    15 5 1 5 6 12 5 20


    Initial Regression Model

Y (Sales) was regressed on X1, X2, X3, X4, X5, and X6 in SPSS.

The following model was developed:

Ŷsales = 6.372 + .919X1 + .699X2 + 1.620X3 - 1.978X4 + .067X5 + .242X6
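This fit can be reproduced outside SPSS. Below is a minimal sketch, using Python with numpy and statsmodels (our tooling choice; the report itself used SPSS), that enters all six predictors at once, as the Enter method does, and should recover the coefficients above:

```python
import numpy as np
import statsmodels.api as sm

# Sales data for pizza: columns are Sales, X1..X6 (15 months, from the table above)
data = np.array([
    [81, 15, 20, 35, 17, 4, 70], [23, 10, 12, 10, 13, 4, 43],
    [18,  7, 11, 14, 14, 3, 31], [ 8,  2,  6,  9, 13, 3, 10],
    [16,  4, 10, 11, 12, 4, 17], [ 4,  1,  5,  6, 12, 5,  8],
    [29,  4, 14, 15, 15, 2, 39], [22,  7, 12, 16, 16, 3, 40],
    [15,  5, 10, 18, 15, 4, 30], [ 6,  3,  5,  8, 13, 2, 16],
    [45, 13, 17, 20, 14, 2, 30], [11,  2,  9, 10, 12, 3, 20],
    [20,  5, 12, 15, 12, 3, 25], [60, 12, 18, 30, 15, 4, 50],
    [ 5,  1,  5,  6, 12, 5, 20]], dtype=float)
y, X = data[:, 0], data[:, 1:]

# Regress Sales on X1..X6 with an intercept, as in the SPSS "Enter" method
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)    # expect roughly [6.372, .919, .699, 1.620, -1.978, .067, .242]
print(model.rsquared)  # expect roughly .953
```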

    SPSS Regression Output for initial model

    Descriptive Statistics

    Mean Std. Deviation N

    Sales 24.20 21.913 15

    X1 6.07 4.511 15

    X2 11.07 4.758 15

    X3 14.87 8.340 15

    X4 13.67 1.633 15

    X5 3.40 .986 15

    X6 29.93 16.529 15

This table depicts the mean, standard deviation, and sample size across 15 months for the dependent and independent variables.

Variables Entered/Removed(b)

    Model   Variables Entered           Variables Removed   Method
    1       X6, X5, X4, X1, X3, X2(a)   .                   Enter

    a. All requested variables entered.

    b. Dependent Variable: Sales

This table tells you the method that SPSS used to run the regression. "Enter" means that each independent variable was entered in the usual fashion. All requested variables were entered and no variable was removed.


    Correlations

    Sales X1 X2 X3 X4 X5 X6

    Pearson Correlation Sales 1.000 .902 .934 .953 .725 -.040 .880

    X1 .902 1.000 .905 .845 .672 -.103 .841

    X2 .934 .905 1.000 .904 .702 -.189 .867

    X3 .953 .845 .904 1.000 .794 -.036 .856

    X4 .725 .672 .702 .794 1.000 -.178 .819

    X5 -.040 -.103 -.189 -.036 -.178 1.000 .006

    X6 .880 .841 .867 .856 .819 .006 1.000

This table provides the pairwise correlation coefficients between the dependent and independent variables.

    Model Summary(b)

    Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
    1       .976(a)  .953       .918                6.260                        1.745

    a. Predictors: (Constant), X6, X5, X4, X1, X3, X2

    b. Dependent Variable: Sales

R is the correlation between the observed and predicted values of the dependent variable.


R-Square - 95.3% of the variation is explained by the model. This is the proportion of variance in the dependent variable (Sales) which can be explained by the independent variables (X6, X5, X4, X1, X3, and X2). This is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable.

Adjusted R-Square - 91.8% of the variation in Sales is explained by the model, adjusted for the number of independent variables and the sample size.

Std. Error of the Estimate - This is also referred to as the root mean squared error. It is the standard deviation of the error term and the square root of the Mean Square for the Residuals in the ANOVA table.

ANOVA(b)

    Model Sum of Squares Df Mean Square F Sig.

    1 Regression 6408.864 6 1068.144 27.254 .000a

    Residual 313.536 8 39.192

    Total 6722.400 14

    a. Predictors: (Constant), X6, X5, X4, X1, X3, X2

    b. Dependent Variable: Sales

Sum of Squares - These are the sums of squares associated with the three sources of variance: Total, Regression, and Residual. The Total variance is partitioned into the variance which can be explained by the independent variables (Regression) and the variance which is not explained by the independent variables (Residual).

    df - These are the degrees of freedom associated with the sources of variance. The total variance has 14 (N-1) degrees of freedom. The Regression degrees of freedom correspond to the number of coefficients estimated minus 1. Including the intercept, there are 7 coefficients, so the model has 7-1 = 6 degrees of freedom. The Error degrees of freedom are the total df minus the model df: 14 - 6 = 8.

    Mean Square - These are the Mean Squares, the Sum of Squares divided by their respective DF.


F and Sig. - This is the F-statistic and the p-value associated with it. The F-statistic is the Mean Square (Regression) divided by the Mean Square (Residual): 1068.144/39.192 = 27.254. The p-value is compared to some alpha level in testing the null hypothesis that all of the model coefficients are 0:

    H0: β1 = β2 = β3 = β4 = β5 = β6 = 0

    H1: Not all βj are 0

    In this model we reject H0 and conclude that not all βj are 0, as the p-value is .000.

    Sales depends significantly upon some of the predictors.
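As a quick check of this arithmetic, the F-statistic and its p-value can be recomputed from the two mean squares alone. A short sketch using scipy (our tooling choice, not part of the original SPSS analysis):

```python
from scipy import stats

ms_regression, ms_residual = 1068.144, 39.192  # from the ANOVA table above
F = ms_regression / ms_residual                # = 27.254
p_value = stats.f.sf(F, dfn=6, dfd=8)          # P(F(6, 8) > 27.254)
print(F, p_value)                              # p is far below .001; SPSS shows .000
```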

Coefficients(a)

    Model          B        Std. Error   Beta    t       Sig.   95% CI for B (Lower, Upper)   Tolerance   VIF
    1 (Constant)   6.372    32.586               .196    .850   (-68.773, 81.516)
      X1           .919     .910         .189    1.010   .342   (-1.179, 3.017)               .166        6.018
      X2           .699     1.303        .152    .537    .606   (-2.306, 3.704)               .073        13.733
      X3           1.620    .618         .617    2.621   .031   (.195, 3.046)                 .105        9.500
      X4           -1.978   2.310        -.147   -.856   .417   (-7.305, 3.349)               .197        5.083
      X5           .067     2.211        .003    .030    .977   (-5.032, 5.165)               .589        1.696
      X6           .242     .299         .182    .808    .442   (-.448, .931)                 .115        8.719

    a. Dependent Variable: Sales

B - These are the values for the regression equation for predicting the dependent variable from the independent variables. The interpretations made are:


o Holding other variables constant, if the number of delivery boys is increased by 1, then Sales will increase by 0.919.

    o Holding other variables constant, if the cost (in Rupees) of advertisements (000s) is increased by 1, then Sales will increase by 0.699.

    o Holding other variables constant, if the number of outlets is increased by 1, then Sales will increase by 1.620.

    o Holding other variables constant, if the varieties of pizzas is increased by 1, then Sales will decrease by 1.978.

    o Holding other variables constant, if the competitors activities index is increased by 1, then Sales will increase by 0.067.

    o Holding other variables constant, if the number of existing customers (000s) is increased by 1, then Sales will increase by 0.242.

Std. Error - These are the standard errors associated with the coefficients.

    Beta - These are the standardized coefficients. Standardizing the variables before running the regression puts all of the variables on the same scale, so you can compare the magnitudes of the coefficients to see which one has more of an effect. We can interpret these coefficients in the same way as we did the B values. Here the B for the constant is 0. Look for the regression coefficient having the highest magnitude; the corresponding regressor contributes the most. The standardized variables are:

    Standardized Yi = (Yi - mean(Y)) / sY

    Standardized X1i = (X1i - mean(X1)) / sX1, Standardized X2i = (X2i - mean(X2)) / sX2, and similarly for the rest.
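For example, the Beta for X3 can be recovered from its unstandardized coefficient and the standard deviations in the Descriptive Statistics table, since Beta_j = b_j * s_Xj / s_Y. A tiny sketch (our illustration):

```python
# Rescale the X3 slope by the standard deviations of X3 and Sales
b_x3, s_x3, s_sales = 1.620, 8.340, 21.913
beta_x3 = b_x3 * s_x3 / s_sales
print(round(beta_x3, 3))  # 0.617, matching the Beta column for X3
```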

t and Sig. - These are the t-statistics and their associated two-tailed p-values, used in testing whether a given coefficient is significantly different from zero. Using an alpha of 0.05:

    o The coefficient for X1 (0.919) is not significantly related to the dependent variable because its p-value is 0.342, which is greater than 0.05.

    o The coefficient for X2 (0.699) is not significantly related to the dependent variable because its p-value is 0.606, which is greater than 0.05.

    o The coefficient for X3 (1.620) is significantly related to the dependent variable because its p-value is 0.031, which is less than 0.05.



o The coefficient for X4 (-1.978) is not significantly related to the dependent variable because its p-value is 0.417, which is greater than 0.05.

    o The coefficient for X5 (0.067) is not significantly related to the dependent variable because its p-value is 0.977, which is greater than 0.05.

    o The coefficient for X6 (0.242) is not significantly related to the dependent variable because its p-value is 0.442, which is greater than 0.05.

Tolerance - The tolerance of a variable is defined as 1 minus the squared multiple correlation of this variable with all other independent variables in the regression equation. Therefore, the smaller the tolerance of a variable, the more redundant is its contribution to the regression (i.e., it is redundant with the contribution of the other independent variables). A small value of tolerance indicates multicollinearity.

    VIF - It is equal to 1/Tolerance. A high value of VIF (more than 10) indicates multicollinearity.
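Tolerance and VIF can also be recovered from the predictor correlation matrix printed earlier, since the diagonal of the inverse correlation matrix gives the VIFs. A numpy sketch (our illustration; the published correlations are rounded to three decimals, so the results will only approximately match SPSS's VIF column):

```python
import numpy as np

# Correlation matrix of X1..X6, copied from the Correlations table
R = np.array([
    [1.000,  .905,  .845,  .672, -.103,  .841],
    [ .905, 1.000,  .904,  .702, -.189,  .867],
    [ .845,  .904, 1.000,  .794, -.036,  .856],
    [ .672,  .702,  .794, 1.000, -.178,  .819],
    [-.103, -.189, -.036, -.178, 1.000,  .006],
    [ .841,  .867,  .856,  .819,  .006, 1.000]])
vif = np.diag(np.linalg.inv(R))  # VIF_j = 1 / (1 - R_j^2)
tolerance = 1.0 / vif
for name, v, t in zip(["X1", "X2", "X3", "X4", "X5", "X6"], vif, tolerance):
    print(name, round(v, 3), round(t, 3))  # X2's VIF should come out above 10
```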

Collinearity Diagnostics(a)

    Model   Dimension   Eigenvalue   Condition Index   Variance Proportions
                                                       (Constant)   X1    X2    X3    X4    X5    X6
    1       1           6.470        1.000             .00          .00   .00   .00   .00   .00   .00
            2           .388         4.082             .00          .04   .00   .01   .00   .03   .00
            3           .052         11.145            .01          .04   .02   .00   .01   .42   .03
            4           .044         12.101            .00          .64   .00   .12   .00   .00   .13
            5           .032         14.166            .00          .00   .00   .34   .00   .03   .38
            6           .012         23.582            .00          .27   .63   .14   .04   .09   .00
            7           .001         74.026            .99          .00   .35   .39   .95   .42   .46

    a. Dependent Variable: Sales

A small eigenvalue (λ) indicates the presence of multicollinearity. The Condition Index of a dimension is sqrt(maximum λ / λ of that dimension); a high value of CI indicates multicollinearity.
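These diagnostics can be reproduced by scaling the design matrix, including the constant column, to unit column length and taking the eigenvalues of X'X; this is the convention SPSS is generally understood to use, and it is consistent with the table above (the seven eigenvalues there sum to 7). A numpy sketch under that assumption:

```python
import numpy as np

# Same 15-month data as in the first sketch: Sales, X1..X6
data = np.array([
    [81,15,20,35,17,4,70], [23,10,12,10,13,4,43], [18,7,11,14,14,3,31],
    [8,2,6,9,13,3,10], [16,4,10,11,12,4,17], [4,1,5,6,12,5,8],
    [29,4,14,15,15,2,39], [22,7,12,16,16,3,40], [15,5,10,18,15,4,30],
    [6,3,5,8,13,2,16], [45,13,17,20,14,2,30], [11,2,9,10,12,3,20],
    [20,5,12,15,12,3,25], [60,12,18,30,15,4,50], [5,1,5,6,12,5,20]], dtype=float)

X = np.column_stack([np.ones(len(data)), data[:, 1:]])  # constant + X1..X6
Xs = X / np.linalg.norm(X, axis=0)                      # unit-length columns
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)[::-1]           # eigenvalues, largest first
cond_index = np.sqrt(eigvals[0] / eigvals)
print(eigvals.round(3))      # expect roughly 6.470 down to .001
print(cond_index.round(3))   # the largest is about 74, flagging multicollinearity
```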


1. Check for Multicollinearity

    Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

    The primary concern is that as the degree of multicollinearity increases, the regression model estimates of the

    coefficients become unstable and the standard errors for the coefficients can get wildly inflated.

VIF Test: Since the VIF value of variable X2 is greater than 10 (13.733) in the Coefficients table, there is presence of collinearity.

    Eigenvalue Test: The very small eigenvalue (.001) of the 7th dimension and the very high value of its corresponding Condition Index (74.026) in the Collinearity Diagnostics table indicate the presence of collinearity.

Pearson Correlation:

    Correlations

    Sales X1 X2 X3 X4 X5 X6

    Pearson

    Correlation

    Sales 1.000

    X1 .902 1.000

    X2 .934 .905 1.000

    X3 .953 .845 .904 1.000

    X4 .725 .672 .702 .794 1.000

    X5 -.040 -.103 -.189 -.036 -.178 1.000

    X6 .880 .841 .867 .856 .819 .006 1.000


If the absolute value of a Pearson correlation is greater than 0.9, collinearity is very likely to exist. In the table above, the inter-predictor correlations between X1 and X2 (.905) and between X2 and X3 (.904) exceed this threshold and indicate multicollinearity.

Statistical Test

    From the above correlations table we can see that the Pearson correlation between (X1, Sales) is .902 and between (X2, Sales) is .934.

    The corresponding p-values of X1 and X2 (from the t-tests in the Coefficients table) are .342 and .606 respectively.

    Even at a 10% level of significance, these high p-values do not suggest rejection of H0: βj = 0. So via the t-test we conclude that X1 and X2 are not significantly related to the dependent variable.

    So there is presence of collinearity, as the Pearson correlations and the t-tests present contradictory information.

    Variable X1 X2 X3 X4 X5 X6

    T statistic 1.010 .537 2.621 -.856 .030 .808

    Sig. (p value) .342 .606 .031 .417 .977 .442

2. SPSS Stepwise Regression Equation

    Ŷsales = -11.817 + 1.640X1 + 1.753X3


    3. SPSS Stepwise Regression Output

    Descriptive Statistics

    Mean Std. Deviation N

    Sales 24.2000 21.91281 15

    X1 6.0667 4.51136 15

    X2 11.0667 4.75795 15

    X3 14.8667 8.33981 15

    X4 13.6667 1.63299 15

    X5 3.4000 .98561 15

    X6 29.9333 16.52905 15

This table depicts the mean, standard deviation, and sample size across 15 months for the dependent and independent variables.

Variables Entered/Removed(a)

    Model   Variables Entered   Variables Removed   Method
    1       X3                  .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
    2       X1                  .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).

    a. Dependent Variable: Sales

Here we can see that in the first iteration X3 is entered, because it has the highest correlation (r) with Sales (check the Correlations table).


Variables are only added or removed if the Sig F Change value is significant. For this, SPSS performs regressions between Y and (X3, X1), Y and (X3, X2), and so on up to Y and (X3, X6).

    It then calculates

    F(1, n-3) = (SSR(X3, XI) - SSR(X3)) / MSE(X3, XI) for all (X3, XI) combinations

    and includes the XI which has the maximum F ratio.

    o This value can be obtained from the Excluded Variables table, which shows the Partial Correlation between each candidate for entry and the dependent variable.

    o Partial correlation is a measure of the relationship of the dependent variable to an independent variable, where the variance explained by previously entered independent variables has been removed from both.

    o From the Excluded Variables table we can see that X1 has the maximum partial correlation (.594) and the minimum p-value (.025). So X1 is the next variable to be entered, subject to whether the resulting model with Sales as the dependent variable and (X1, X3) as independent variables improves R-square.

    The process of adding more variables stops when all of the available variables have been included or when it is not possible to make a statistically significant improvement in R-square using any of the variables not yet included.
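SPSS's stepwise procedure can be imitated with an explicit loop. The sketch below is our own approximation (statsmodels has no built-in stepwise routine): it uses coefficient t-test p-values with entry at .05 and removal at .10 in place of SPSS's equivalent F criteria (F = t² for a single coefficient, so the decisions match). On this data it selects X3 and then X1, and the final lines verify the .594 partial correlation that drives the second entry:

```python
import numpy as np
import statsmodels.api as sm

# Same 15-month data as in the first sketch: Sales, X1..X6
data = np.array([
    [81,15,20,35,17,4,70], [23,10,12,10,13,4,43], [18,7,11,14,14,3,31],
    [8,2,6,9,13,3,10], [16,4,10,11,12,4,17], [4,1,5,6,12,5,8],
    [29,4,14,15,15,2,39], [22,7,12,16,16,3,40], [15,5,10,18,15,4,30],
    [6,3,5,8,13,2,16], [45,13,17,20,14,2,30], [11,2,9,10,12,3,20],
    [20,5,12,15,12,3,25], [60,12,18,30,15,4,50], [5,1,5,6,12,5,20]], dtype=float)
y, X = data[:, 0], data[:, 1:]
names = ["X1", "X2", "X3", "X4", "X5", "X6"]

selected = []
while True:
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    if not remaining:
        break
    # p-value each remaining candidate would enter with
    entry_p = {}
    for j in remaining:
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        entry_p[j] = fit.pvalues[-1]     # candidate is the last column
    best = min(entry_p, key=entry_p.get)
    if entry_p[best] >= 0.05:            # entry criterion
        break
    selected.append(best)
    # removal step: drop any entered variable whose p-value has risen above .10
    fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
    selected = [j for j, p in zip(selected, fit.pvalues[1:]) if p <= 0.10]

print([names[j] for j in selected])      # expect ['X3', 'X1']

# Partial correlation of X1 with Sales after removing X3's effect (expect ~.594)
res_y = y - sm.OLS(y, sm.add_constant(X[:, [2]])).fit().fittedvalues
res_x1 = X[:, 0] - sm.OLS(X[:, 0], sm.add_constant(X[:, [2]])).fit().fittedvalues
print(round(np.corrcoef(res_y, res_x1)[0, 1], 3))
```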


    Correlations

    Sales X1 X2 X3 X4 X5 X6

    Pearson Correlation Sales 1.000 .902 .934 .953 .725 -.040 .880

    X1 .902 1.000 .905 .845 .672 -.103 .841

    X2 .934 .905 1.000 .904 .702 -.189 .867

    X3 .953 .845 .904 1.000 .794 -.036 .856

    X4 .725 .672 .702 .794 1.000 -.178 .819

    X5 -.040 -.103 -.189 -.036 -.178 1.000 .006

    X6 .880 .841 .867 .856 .819 .006 1.000

This table provides the pairwise correlation coefficients between the dependent and independent variables. In stepwise regression this table is used to find the first variable to be entered, which is the one having the maximum correlation coefficient with Sales. In this case it is X3.

Model Summary(c)

    Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
    1       .953(a)  .908       .900                6.91277
    2       .970(b)  .940       .930                5.78925                      1.477

    a. Predictors: (Constant), X3

    b. Predictors: (Constant), X3, X1

    c. Dependent Variable: Sales


As we can see, the inclusion of X1 in the model in iteration 2 has increased the R-square value, so the model is able to explain 94% of the variability in Sales.

    The other parameters are explained earlier in this section.

ANOVA(c)

    Model Sum of Squares Df Mean Square F Sig.

    1 Regression 6101.176 1 6101.176 127.676 .000a

    Residual 621.224 13 47.786

    Total 6722.400 14

    2 Regression 6320.215 2 3160.108 94.288 .000b

    Residual 402.185 12 33.515

    Total 6722.400 14

    a. Predictors: (Constant), X3

b. Predictors: (Constant), X3, X1

    c. Dependent Variable: Sales

F and Sig. - This is the F-statistic and the p-value associated with it. The p-value is compared to some alpha level in testing the null hypothesis that all of the model coefficients are 0.

    H0: β1 = β2 = ... = βj = 0; H1: Not all βj are 0

    In both models we reject H0 and conclude that not all βj are 0, as the p-value is .000 in both cases.

    Sales depends significantly upon X3 in model 1, and upon X3 and/or X1 in model 2. To find which, look at the Coefficients table output.

    The other parameters are explained in section 3.


Coefficients(a)

    Model          B         Std. Error   Beta    t        Sig.   95% CI for B (Lower, Upper)   Tolerance   VIF
    1 (Constant)   -13.013   3.746                -3.474   .004   (-21.106, -4.921)
      X3           2.503     .222         .953    11.299   .000   (2.025, 2.982)                1.000       1.000
    2 (Constant)   -11.817   3.172                -3.726   .003   (-18.728, -4.906)
      X3           1.753     .347         .667    5.053    .000   (.997, 2.510)                 .286        3.498
      X1           1.640     .641         .338    2.556    .025   (.242, 3.038)                 .286        3.498

    a. Dependent Variable: Sales

t stat and Sig. for Model 2

    o The coefficient for X1 (1.640) is significantly related to the dependent variable because its p-value is 0.025, which is less than 0.05.

    o The coefficient for X3 (1.753) is significantly related to the dependent variable because its p-value is 0.000, which is less than 0.05.

    B values for Model 2

    o Holding other variables constant, if the number of delivery boys is increased by 1, then Sales will increase by 1.640.

    o Holding other variables constant, if the number of outlets is increased by 1, then Sales will increase by 1.753.

    Since the VIF values of variables X1 and X3 (3.498) are less than 10 in the Coefficients table, we can safely say there is no multicollinearity.

    The other parameters are explained in section 3.
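As a final check, one can refit Sales on X1 and X3 alone and recompute their VIFs, this time with statsmodels' variance_inflation_factor helper (our illustration); this should reproduce the coefficients of the stepwise equation and the 3.498 figure quoted above:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Same 15-month data as in the first sketch: Sales, X1..X6
data = np.array([
    [81,15,20,35,17,4,70], [23,10,12,10,13,4,43], [18,7,11,14,14,3,31],
    [8,2,6,9,13,3,10], [16,4,10,11,12,4,17], [4,1,5,6,12,5,8],
    [29,4,14,15,15,2,39], [22,7,12,16,16,3,40], [15,5,10,18,15,4,30],
    [6,3,5,8,13,2,16], [45,13,17,20,14,2,30], [11,2,9,10,12,3,20],
    [20,5,12,15,12,3,25], [60,12,18,30,15,4,50], [5,1,5,6,12,5,20]], dtype=float)

y = data[:, 0]
X = sm.add_constant(data[:, [1, 3]])  # constant, X1, X3
fit = sm.OLS(y, X).fit()
print(fit.params.round(3))            # expect roughly [-11.817, 1.640, 1.753]
# VIFs for X1 and X3 (index 0 is the constant, so start at 1)
print([round(variance_inflation_factor(X, i), 3) for i in (1, 2)])  # both ~3.498
```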


Excluded Variables(c)

    Model      Beta In    t       Sig.   Partial Correlation   Tolerance   VIF     Minimum Tolerance
    1   X1     .338(a)    2.556   .025   .594                  .286        3.498   .286
        X2     .400(a)    2.361   .036   .563                  .183        5.465   .183
        X4     -.085(a)   -.600   .560   -.171                 .370        2.703   .370
        X5     -.006(a)   -.064   .950   -.018                 .999        1.001   .999
        X6     .241(a)    1.560   .145   .411                  .267        3.740   .267
    2   X2     .226(b)    1.085   .301   .311                  .113        8.820   .113
        X4     -.087(b)   -.732   .480   -.215                 .370        2.703   .193
        X5     .019(b)    .257    .802   .077                  .981        1.020   .281
        X6     .113(b)    .736    .477   .217                  .219        4.569   .214

    a. Predictors in the Model: (Constant), X3

    b. Predictors in the Model: (Constant), X3, X1

    c. Dependent Variable: Sales

t test and Sig.

    o Model 1: X1 and X2 are eligible to enter after iteration 1, as both have p-values less than .05; but X1 has the higher partial correlation, so it enters in the second iteration. X4, X5, and X6 have high p-values (> .05), so H0: βj = 0 is not rejected and Sales is not significantly dependent on X4, X5, and X6.

    o Model 2:


X2, X4, X5, and X6 have high p-values (> .05), so H0: βj = 0 is not rejected and Sales is not significantly dependent on X2, X4, X5, and X6.

    So the final model has significant dependence only on X3 and X1.

Collinearity Diagnostics(a)

    Model   Dimension   Eigenvalue   Condition Index   Variance Proportions
                                                       (Constant)   X3    X1
    1       1           1.879        1.000             .06          .06
            2           .121         3.944             .94          .94
    2       1           2.761        1.000             .03          .01   .01
            2           .198         3.732             .71          .01   .16
            3           .040         8.273             .26          .98   .83

    a. Dependent Variable: Sales

In model 2 the eigenvalues are not close to 0 and the condition indices are less than 10, so there is little evidence of multicollinearity.