
MLR Project (Onion)



INDEX

TOPIC

1. Proposal

2. Preliminary Multiple Linear Regression Model Analysis

3. Exploration of Interaction Terms

4. Model Search

5. Model Selection

6. Final Multiple Linear Regression Model

7. Final Discussion

(i) Appendix 1

(ii) Appendix 2

(iii) Appendix 3

(iv) Appendix 4

(v) Appendix 5


1. Proposal

Description of the problem and the variables:

Our project focuses on the crime rate, a problem in any kind of society. We want to examine the possible relationship between the crime rate in 47 states of the USA (response variable, Y) and four factors (predictor variables) that we believe influence the response variable. These variables are:

X1 – unemployment rate of urban males per 1000
X2 – number of families per 1000 earning below one-half of the median income
X3 – state population size in hundred thousands
X4 – police expenditure per person by state and local government

Present and discuss the matrix scatter plot of the variables

Figure 1: Matrix Scatter plot


Response vs Predictors Scatter Plot

Crime rate (Y) vs unemployment rate (X1) in figure 1-a: There is no linear relationship between Y and X1; the trend is constant. There are no outliers.

Crime rate (Y) vs number of families earning below one-half the median income (X2) in figure 1-b: There is a linear relationship between Y and X2, with a negative correlation pattern. There are no outliers. Adding X2 to the model will be useful because it captures a linear trend not otherwise explained, helping to explain more variation.

Crime rate (Y) vs state population size (X3) in figure 1-c: There is an upward trend between Y and X3, so there is a linear relationship. There are no outliers. Adding X3 to the model will be useful because it captures a linear trend not otherwise explained, helping to explain more variation.

Crime rate (Y) vs police expenditure (X4) in figure 1-d: There is a positive correlation pattern between Y and X4, so there is a linear relationship. There are no outliers. Adding X4 to the model will be useful because it captures a linear trend not otherwise explained, helping to explain more variation.

Predictor vs Predictor Scatter Plot

Unemployment rate (X1) vs number of families earning below one-half the median income (X2) in figure 1-e: The points form a random cloud with no linear trend. The model is reasonable. There are no outliers.

Unemployment rate (X1) vs state population size (X3) in figure 1-f: There is an upward trend between X1 and X3, so there is a linear relationship. There are no outliers.

Unemployment rate (X1) vs police expenditure (X4) in figure 1-g: There is curvature in the pattern, so the model is not reasonable here. There are no outliers.

Number of families earning below one-half the median income (X2) vs state population size (X3) in figure 1-h: There is curvature in the pattern, so the model is not reasonable here. There are no outliers.

Number of families earning below one-half the median income (X2) vs police expenditure (X4) in figure 1-i: There is a downward (negative) trend between X2 and X4, so there is a linear relationship. There are no outliers.

State population size (X3) vs police expenditure (X4) in figure 1-j: There is an upward trend between X3 and X4, so there is a linear relationship. There is one possible x-outlier.

Check response-predictor and predictor-predictor pairwise correlation:

Before modelling the regression, it is important to check the response-predictor and predictor-predictor correlations so that we are aware of problems we might face while fitting the model. The table in figure 2 below shows the correlations between the response-predictor and predictor-predictor pairs.
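As a sketch of this kind of check, the snippet below builds a Pearson correlation matrix with NumPy and flags response-predictor pairs against the 0.7 threshold used in this report. The data here is a synthetic stand-in, not the project's actual dataset.

```python
import numpy as np

# Hypothetical stand-in data: response y (crime rate) and four predictors,
# with y constructed to correlate strongly with the X4 stand-in.
rng = np.random.default_rng(0)
n = 34
X = rng.normal(size=(n, 4))                 # X1..X4 stand-ins
y = 5.0 * X[:, 3] + rng.normal(size=n)

data = np.column_stack([y, X])
R = np.corrcoef(data, rowvar=False)         # 5x5 Pearson correlation matrix

# Flag response-predictor pairs with |r| above the 0.7 threshold
strong = [j for j in range(1, 5) if abs(R[0, j]) > 0.7]
```

With the stand-in data only the X4 column is flagged, mirroring the Y-X4 finding discussed below.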


Figure 2: Pearson Correlation Coefficients.

From figure 2, the Pearson correlation coefficient between Y and X1 is 0.03112, which is less than our threshold r (sample correlation, 0.7), implying no strong correlation. Similarly, the coefficients between Y and X2 (-0.17876) and between Y and X3 (0.37930) are below r, so these pairs are not strongly correlated either. However, the coefficient between Y and X4 (0.70735) exceeds r, implying that Y and X4 are strongly correlated, which is desirable for a predictor. Moreover, the coefficients between X1 and X2 (-0.03038), X1 and X3 (-0.07302), X1 and X4 (-0.00975), X2 and X3 (-0.19457), X2 and X4 (-0.63801), and X3 and X4 (0.57151) are all below r in absolute value, implying no strong correlations. Because all the predictor-predictor correlations are below r, we may not have a serious multicollinearity problem.

Discuss Potential Complications

Since the predictor-predictor correlations are all below r (our threshold of 0.7), we may not have a serious multicollinearity problem in our model. However, the scatter plots of unemployment rate (X1) vs police expenditure (X4) and of number of families earning below one-half the median income (X2) vs state population size (X3) show curved patterns, which may indicate a potential curvilinearity problem. Also, the plot of state population size (X3) vs police expenditure (X4) shows one possible x-outlier. We will therefore use residual plots to proceed with the analysis, checking for outliers and curved trends.

2. Preliminary Multiple Linear Regression Model Analysis

From the scatter plot correlation matrices it will be necessary to show the relationship between the crime rate in 47 states and the four predictor variables. The multiple linear regression form is given as:

Yi = β0 + β1Xi1 + … + βkXik + εi, where β0, β1, …, βk are the regression coefficients and εi is the random error term representing the vertical variation between the observed and fitted values of Y.

A preliminary model was fit from our collected data as follows:

crime rate = β0 + β1·unemployment_rate + β2·median_income + β3·state_population + β4·police_expenditure,

where β0, β1, β2, β3, β4 are the model parameters.
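As a minimal sketch of fitting this model form by ordinary least squares (the report uses SAS; the NumPy version below is illustrative and runs on synthetic data with assumed coefficient values):

```python
import numpy as np

# Minimal OLS sketch of the model form
#   y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + e
# on synthetic data; beta_true values are illustrative assumptions.
rng = np.random.default_rng(1)
n = 34
X = rng.normal(size=(n, 4))
beta_true = np.array([1.0, 0.5, -0.3, 0.2, 0.8])   # b0..b4
y = beta_true[0] + X @ beta_true[1:] + 0.01 * rng.normal(size=n)

A = np.column_stack([np.ones(n), X])               # add intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares estimates

fitted = A @ beta_hat
residuals = y - fitted                             # e_i = y_i - yhat_i
```

The residuals computed this way are the quantities examined in the residual plots below.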


FIGURE 3: RESIDUALS VS PREDICTORS PLOT

In order to check the fitness of the model form, a residual plot is necessary to check for curvature. From figure 3, it can be concluded that the model form is acceptable because the residual plot of each predictor variable does not show a curvilinear trend.

FIGURE 4: RESIDUAL VS PREDICTED VALUE PLOT

The residual plot of each predictor variable is also used to check the constant variance assumption, and from the figure above it can be concluded that the constant variance assumption is satisfied, since there is no funnel shape or curvature in the plot.

The modified Levene test (Brown-Forsythe test) is conducted to confirm the constant variance seen in the plot; SAS software is used to perform it. First, the data are divided into two groups using the median of the independent variable, which for our data was 79.5: one group below 79.5 and the other above it. To perform the modified Levene test, an F-test for equality of variances is performed first, to decide whether the t-test should use equal or unequal variances, followed by the t-test to check whether the variance is constant.

FIGURE 5: MODIFIED LEVENE TEST

Using 95% confidence (α = 0.05) and the p-value, we test the following hypotheses:

Null hypothesis H0: variances are equal
Alternative hypothesis H1: variances are not equal
Decision rule: Reject H0 if p-value < α

Looking at figure 5, from Equality of Variances we see that the p-value (0.0375) < α (0.05).
Decision: We reject H0.
Conclusion: Unequal variances are used for the t-test.

Thus, we conduct the two-sample t-test using unequal variances (the Satterthwaite method in figure 5). Using 95% confidence (α = 0.05) and the p-value, we test the following hypotheses:

Null hypothesis H0: means are equal
Alternative hypothesis H1: means are not equal
Decision rule: Reject H0 if p-value < α

Looking at figure 5, the p-value (0.2169) > α (0.05). Decision: We fail to reject H0.

Conclusion: The means are equal. Thus, we are 95% confident that the means of the absolute deviations are equal, i.e. the model has constant variance. The result is consistent with that of the residual plot in figure 4 above.
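The two-step procedure above (split at the median, then test the absolute deviations) can be sketched in Python with SciPy. This is an illustration on synthetic residuals, not the project's data; SAS's Folded F and pooled/Satterthwaite machinery is simplified here.

```python
import numpy as np
from scipy import stats

# Sketch of the modified Levene (Brown-Forsythe) procedure on synthetic
# residuals e, split into two groups at the median of a predictor x.
rng = np.random.default_rng(2)
x = rng.uniform(50, 110, size=34)
e = rng.normal(scale=0.12, size=34)      # residuals (constant variance here)

g1 = e[x <= np.median(x)]
g2 = e[x > np.median(x)]
d1 = np.abs(g1 - np.median(g1))          # absolute deviations, group 1
d2 = np.abs(g2 - np.median(g2))          # absolute deviations, group 2

# F ratio of the deviation variances (SAS uses this to pick pooled vs
# Satterthwaite for the t-test)
F = np.var(d1, ddof=1) / np.var(d2, ddof=1)

# Two-sample t-test on the deviations; a large p-value supports
# the constant variance assumption
t_stat, p_value = stats.ttest_ind(d1, d2, equal_var=True)
```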


FIGURE 6: NORMALITY PLOT

The above figure gives us the following information:

The plot is not straight: it has a shorter right tail and a shorter left tail, so normality is not okay.

Thus, from this information, we conclude that the normality assumption of the model is not satisfied.

Furthermore, the CORR procedure in SAS is used to test normality. Here we compare the observed coefficient of correlation between the residuals (e) and their expected values under normality (enrm) with its critical value. The observed coefficient of correlation (ρ̂) is obtained from figure 7, while the critical value (c) is obtained from statistical tables.

FIGURE 7: CORRELATION TEST FOR NORMALITY

Using 90% confidence (α = 0.1), we test the following hypotheses:

Null hypothesis H0: normality is OK vs alternative hypothesis H1: normality is violated
Decision rule: Reject H0 if ρ̂ < c(α, n)

To calculate c(0.1, 34) from statistical tables we interpolate, obtaining 0.9734. From figure 7, ρ̂ = 0.98859, so ρ̂ (0.98859) > c (0.9734).

Decision: We fail to reject H0

Conclusion: Normality is not violated.

This result is not consistent with the result from the NPP plot. When the NPP plot and the SAS normality test give different results, the result from the plot is considered more reliable. Thus, our overall conclusion is that normality is not satisfied.
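The correlation test described above can be sketched directly: compute expected normal scores for the residuals (Blom plotting positions are one common choice; the exact formula SAS uses is an assumption here) and correlate them with the residuals. Synthetic residuals stand in for the real ones.

```python
import numpy as np
from scipy import stats

# Correlation test for normality on illustrative residuals: correlate the
# residuals with their expected normal scores and compare with c(alpha, n).
rng = np.random.default_rng(3)
e = rng.normal(size=34)

n = len(e)
ranks = stats.rankdata(e)
# Blom plotting positions -> expected normal scores (enrm)
enrm = stats.norm.ppf((ranks - 0.375) / (n + 0.25))
rho = np.corrcoef(e, enrm)[0, 1]

# Fail to reject H0 (normality OK) when rho >= c(0.1, 34) = 0.9734
normality_ok = rho >= 0.9734
```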


Our data were taken across 47 states of the USA and have no time ordering, so we consider our residuals uncorrelated. Hence a time sequence plot to check the uncorrelated-errors assumption is not relevant for our model.

FIGURE 8: SAS OUTPUT OF T VALUE AND F VALUE

1. Checking Outliers

Leverage, influence and the Bonferroni outlier test are diagnostic measures used to check for outliers.

a) The Bonferroni outlier test checks for y-outliers by comparing the absolute values of the deleted studentized residuals to a cut-off. The hypotheses are H0: the data point is not y-outlying vs H1: the data point is y-outlying, with decision rule: reject H0 if |ti| > t(1 − α/2n; n − p − 1). With α = 0.05, t(0.9993; 28) = 3.54557 from figure 8. Since the absolute values of the deleted studentized residuals (Appendix 1) are all less than this Bonferroni cut-off, we fail to reject H0 and conclude that there are no y-outlying points in our data set.

b) X-outliers are checked by identifying data points whose leverage value (hii, the HatDiag column) exceeds the cut-off 2p/n = 0.29, where p = 5 is the number of parameters and n = 34 is the number of observations in our data set. From Appendix 1, observations 4 and 29 have leverage values greater than 0.29 and are therefore considered x-outliers.
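The two cut-offs used above are easy to reproduce: the Bonferroni t quantile and the 2p/n leverage rule. The sketch below computes both for n = 34, p = 5 and derives leverages from the hat matrix of a synthetic design matrix.

```python
import numpy as np
from scipy import stats

# Outlier cut-offs for n = 34 observations and p = 5 parameters.
n, p, alpha = 34, 5, 0.05

# Bonferroni cut-off for deleted studentized residuals: t(1 - a/2n; n-p-1)
t_cut = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)

# Leverage cut-off: 2p/n
h_cut = 2 * p / n

# Leverages are the diagonal of the hat matrix H = X (X'X)^-1 X';
# the design matrix here is an illustrative stand-in.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverages = np.diag(H)
x_outliers = np.where(leverages > h_cut)[0]
```

A useful sanity check: the leverages always sum to p, the number of parameters.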

2. Checking Influence

DFFITS, DFBETAS and Cook’s distance are used to check whether the outliers found in our data set influence the fitted values, i.e. whether an outlying point affects the fitted regression function. Since our data set has more than 30 observations, we flag |DFFITS| > 2√(p/n) = 0.7699. From Appendix 1, observation 4 has |DFFITS| = 1.4805, observation 11 has |DFFITS| = 0.8163 and observation 29 has |DFFITS| = 2.5746, all greater than the cut-off. Likewise, since our data set is larger than 30, we flag |DFBETAS| > 2/√n = 0.34299. From Appendix 1, observation 2 exceeds this cut-off only for its X3 DFBETAS value, so it is only slightly influential and can be ignored as an influential point. Furthermore, observations 4 (X3), 18 (X4), 22 (X2), 24 (X1), 26 (X3, X4), 29 (X3, X4) and 31 (X1) have |DFBETAS| above the cut-off for those coefficients only, so they too are only slightly influential and can be ignored. Using Cook’s distance, a data point is influential when Di > F(0.5; p, n − p), which is 0.89089 from figure 8. From our table, the Cook’s distance of x-outlying observation 4 is less than the cut-off of 0.89089, so it is not influential; x-outlying observation 29 has a Cook’s distance greater than 0.89089 and is therefore influential.
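The three influence cut-offs quoted above follow directly from the guideline formulas; the sketch below computes them for n = 34 and p = 5 (the raw formula values may differ from the report's quoted figures in the last digits).

```python
import numpy as np
from scipy import stats

# Influence cut-offs for n = 34 observations and p = 5 parameters.
n, p = 34, 5

dffits_cut = 2 * np.sqrt(p / n)           # |DFFITS| flag, 2*sqrt(5/34) ~ 0.767
dfbetas_cut = 2 / np.sqrt(n)              # |DFBETAS| flag, 2/sqrt(34) ~ 0.343
cooks_cut = stats.f.ppf(0.50, p, n - p)   # Cook's D flag: F(0.5; p, n - p)
```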

Thus, looking at the overall results, we do not have adequate reason to remove any outliers; no data point is unusual enough to exclude, so all points remain in the data. Our data set therefore still has 34 observations.


FIGURE 9: VARIANCE INFLATION DIAGNOSTICS

Variance Inflation factors are used to check multicollinearity between predictors by regressing predictors on each other. And the following are the guidelines of VIFs:

Mean VIF = (Σ VIFk) / (p − 1) >> 1: here the mean of the VIFs, 1.75869, is only slightly greater than 1.

Max(VIFk) > 10: here the maximum VIF is 2.59828, far below 10, and from figure 9 none of the VIFk exceed 5.

Therefore, we can conclude that we do not have serious multicollinearity problems, as was also noted earlier from the scatter plot correlation matrix observations.
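A VIF can be computed by regressing each predictor on the others and using VIFk = 1 / (1 − Rk²). The sketch below does this with NumPy on synthetic stand-in data with a mild built-in X3-X4 correlation.

```python
import numpy as np

# VIF sketch: regress each predictor on the remaining ones, then
# VIF_k = 1 / (1 - R_k^2). Data is a synthetic stand-in.
rng = np.random.default_rng(5)
n = 34
X = rng.normal(size=(n, 4))
X[:, 3] = 0.6 * X[:, 2] + 0.8 * rng.normal(size=n)   # mild X3-X4 correlation

def vif(X, k):
    others = np.delete(X, k, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # intercept + others
    coef, *_ = np.linalg.lstsq(A, X[:, k], rcond=None)
    resid = X[:, k] - A @ coef
    r2 = 1 - resid.var() / X[:, k].var()             # R^2 of this regression
    return 1.0 / (1.0 - r2)

vifs = [vif(X, k) for k in range(X.shape[1])]
mean_vif = np.mean(vifs)
```

With only mild correlation built in, every VIF stays well under the "serious multicollinearity" thresholds of 5 and 10.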

Transformation

In order to see the difference and to improve the model, we performed a log transformation of the previous model. The transformed multiple linear regression form is:

log(crime rate) = β0 + β1·unemployment_rate + β2·median_income + β3·state_population + β4·police_expenditure,

where β0, β1, β2, β3, β4 are the model parameters.

Now we need to verify the model assumptions in order to adopt the transformed model as our preliminary model.


FIGURE 10: TRANSFORMED MODEL OF RESIDUALS AGAINST PREDICTORS

FIGURE 11: TRANSFORMED MODEL OF RESIDUAL AGAINST PREDICTED VALUE

Since there is no curvature in the plots of residuals against predictors or of residuals against predicted values (figures 10 and 11), the transformed model form is appropriate. Because the patterns are randomly scattered, the constant variance assumption is also satisfied.

Method         Variances  DF      t Value  Pr > |t|
Pooled         Equal      32      0.07     0.9441
Satterthwaite  Unequal    28.709  0.07     0.9449

Equality of Variances
Method    Num DF  Den DF  F Value  Pr > F
Folded F  15      17      1.56     0.3731

FIGURE 12: TRANSFORMED MODIFIED LEVENE TEST

The constant variance assumption is checked by conducting a modified Levene test; the median value 1.935281 is used to divide the data into two groups. In conducting the modified Levene test, the equality of the group variances is tested first using the F-test. The hypothesis test is conducted at the 95% confidence level as follows:

Null hypothesis (H0): equal group variances vs

Alternative hypothesis (H1): unequal group variances.

Decision rule: Reject H0 if p<α.

Looking at figure 12, the p-value of 0.3731 > 0.05; thus we fail to reject H0 and conclude that the group variances are equal. Given this result, the two-sample t-test is conducted at the 95% confidence level as follows:

Null hypothesis (H0): constant variances vs

Alternative hypothesis (H1): constant variances violated.

Decision rule: Reject H0 if p<α.


It can also be seen from figure 12 that the p-value of 0.9441 > 0.05; thus we fail to reject H0 and conclude that the variances are constant, which is consistent with the constant variance assumption noticed in the residual plot.

FIGURE 13: TRANSFORMED MODEL OF NORMALITY PLOT

From the above figure, we can see the plot is straight and thus normality is ok.

Pearson Correlation Coefficients, N = 34

                      e        enrm
e (Residual)          1.00000  0.98290
enrm (Normal Scores)  0.98290  1.00000

FIGURE 14: TRANSFORMED MODEL CORRELATION TEST FOR NORMALITY

Using 90% confidence (α = 0.1), we test the following hypotheses:

Null hypothesis H0: normality is OK vs alternative hypothesis H1: normality is violated
Decision rule: Reject H0 if ρ̂ < c(α, n)

To calculate c(0.1, 34) from statistical tables we interpolate, obtaining 0.9734. From figure 14, ρ̂ = 0.98290, so ρ̂ (0.98290) > c (0.9734).

Decision: We fail to reject H0

Conclusion: Normality is okay. This result is consistent with the result from the NPP plot.

Also, using the Bonferroni outlier test for y-outliers in this case, the Bonferroni cut-off value is 3.5712, and since the absolute values of the deleted studentized residuals (Appendix 5) are all less than this cut-off, we conclude that there are no y-outlying points in our data set. The leverage cut-off value is 0.29, and from Appendix 5 no observation exceeds it, so there are no x-outliers in this case. Thus there is no need to check influence.

FIGURE 15: SAS OUTPUT OF PRELIMINARY MODEL

Preliminary model

The transformed model satisfies all the necessary model assumptions, so it will be considered our preliminary model. We fit the model with the equation obtained from the SAS output in figure 15:

log(crime rate) = 0.83627 + 0.00044715·unemployment_rate + 0.00260·median_income − 0.00067161·state_population + 0.00659·police_expenditure

Using the ANOVA table in figure 15, the information needed to explain and check the model form can be obtained. The F-value (12.04) and Pr > F (<0.0001) are used to conduct the F-test for the significance of the regression. Testing at significance level α = 0.01:

Null hypothesis (H0): all βk = 0 vs alternative (H1): at least one βk ≠ 0, where k = 1, 2, …, 4
Decision rule: Reject H0 if p-value < α.


Looking at figure 15, we see that 0.0001 < 0.01, so we reject H0 and conclude that the regression is significant.

Looking at figure 15, our regression model provides estimates of the crime rate in 47 states of the USA as a function of the four predictors. Because we are estimating from actual data, variability is a major concern, and the ANOVA table in figure 15 summarizes it in the Sum of Squares and Mean Square columns. The value 0.75249, at the intersection of the Model row and the Sum of Squares column, is the SSR (regression sum of squares); it represents the amount of variability explained by our model. The Model row of the Mean Square column gives the MSR (regression mean square), 0.18812. The value 0.45324 is the SSE (error/residual sum of squares), the amount of variability unexplained by our model. Below it, 1.20573 is the SSTO (total sum of squares), the total variability of the crime rate observations. At the intersection of the Error row and the Mean Square column is the MSE (error mean square), 0.01563, an unbiased estimator of the error variance; it describes the spread of the data points around the fitted model. Since the MSE of 0.01563 is low, the data points are not very spread out, as is noticeable in figure 15. The Root MSE is the square root of the MSE found in the ANOVA table; it estimates the error standard deviation.

R-square, the coefficient of determination, is also found in this table. Given as 0.6241, it represents the proportion of variability explained by our model; our regression model therefore accounts for only 62.41% of the variability in the crime rate observations, and adding predictor variables could raise the explained proportion. The adjusted R-square adjusts R-square for the number of predictors in the model; it does not measure fit directly but indicates how suitable the model is given its number of predictors, measuring the marginal contribution of each predictor. Since the adjusted R-square (0.5722) is not very close to the R-square (0.6241), the model may include too many predictors.

Based on the observations from figure 9 (the same VIF output appears in figure 15) and the variance inflation factor guidelines, none of the VIFs is greater than 5 and the average of the VIFs (1.758685) is not much bigger than 1, so we conclude that serious multicollinearity is not a problem.

From figure 15, t-tests at significance level 0.05 are used to check whether the predictors are significant in our model. Pr > |t| represents the p-value.

Null hypothesis (H0): βk = 0 vs alternative (H1): βk ≠ 0, for each k = 1, 2, …, 4
Decision rule: Reject H0 if p-value < α.

Looking at figure 15, X2 (p = 0.0015) and X4 (p < 0.0001) have p-values less than α (0.05), so we reject H0 for these predictors and conclude that only these two predictors are significant in this model, since β2 and β4 are not equal to zero.

3. Exploration of Interaction Terms

It is necessary to check for potential interaction effects because the effects of the predictors on the response may not be additive; they may depend on the levels of the other predictors. Interaction terms are obtained by multiplying predictors with each other and are then checked for their effect on the response variable.


FIGURE 16: PARTIAL REGRESSION PLOT X1X2

Here in figure 16, the points show no trend (random cloud), so we conclude not to add X1X2: with no particular trend, it makes no marginal contribution to our model.

FIGURE 17: PARTIAL REGRESSION PLOT X1X3

Similarly, in figure 17 the points show no trend (random cloud). This interaction term will not be useful, meaning X1X3 should not be added to the model: with no trend, it makes no marginal contribution.


FIGURE 18: PARTIAL REGRESSION PLOT X1X4

The points in figure 18 show a linear trend; therefore, it will be useful to add the interaction term X1X4 to the model, since it will make a marginal contribution.

FIGURE 19: PARTIAL REGRESSION PLOT X2X3

The points in figure 19 show a linear trend; therefore, it will be useful to add the interaction term X2X3 to the model, since it will make a marginal contribution.


FIGURE 20: PARTIAL REGRESSION PLOT X2X4

Since a trend is observed in figure 20, it will be useful to add the interaction term X2X4 to the model, since it will make a marginal contribution.

FIGURE 21: PARTIAL REGRESSION PLOT X3X4

Since a trend is observed in figure 21, it will be useful to add the interaction term X3X4 to the model, since it will make a marginal contribution.

Therefore, looking at all six interaction terms in figures 16 to 21, X1X4, X2X3, X2X4 and X3X4 are the potentially useful interaction terms that can be added to our model, since they show linear trends and significant effects on our response variable (crime rate).


Interaction terms are usually highly correlated with their original predictors, with correlation coefficients rxx > 0.7, as seen in figure 22 below.

FIGURE 22: CORRELATIONS BETWEEN PREDICTORS AND INTERACTION TERMS

It is seen from figure 22 that each of the six interaction terms is highly correlated with one of its original predictors. As a result of these high correlations, there might be serious multicollinearity problems when the interaction terms are added to the model. To reduce this effect, the interaction terms are standardized by centering the means to 0 and scaling the variances to 1.

FIGURE 23: CORRELATIONS BETWEEN PREDICTORS AND STANDARDIZED INTERACTION TERMS

Looking at figure 23 above, the high correlations of all six interaction terms have been reduced by standardizing, so the serious multicollinearity problem has been mitigated and we can now add the standardized interaction terms to the model.
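The centering step above can be sketched as follows: standardize each predictor to mean 0 and variance 1, then form the product. On synthetic stand-in data, this typically drops the correlation between the interaction term and its parent predictor sharply.

```python
import numpy as np

# Sketch of standardizing before forming an interaction term.
# x2/x4 are stand-ins for two predictors, not the project's data.
rng = np.random.default_rng(6)
n = 34
x2 = rng.uniform(120, 220, size=n)
x4 = rng.uniform(45, 110, size=n)

raw = x2 * x4                             # raw interaction term
z2 = (x2 - x2.mean()) / x2.std(ddof=1)    # standardize: mean 0, variance 1
z4 = (x4 - x4.mean()) / x4.std(ddof=1)
stdx2x4 = z2 * z4                         # standardized interaction term

r_raw = np.corrcoef(x2, raw)[0, 1]        # parent vs raw interaction
r_std = np.corrcoef(x2, stdx2x4)[0, 1]    # parent vs standardized interaction
```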


4. Model Search

The model search techniques used to obtain potential best models are backward deletion, best subsets and stepwise regression. We will find two potentially best models, whose predictors are significant at α = 0.1 and for which multicollinearity is not a serious problem.

Best Subsets Selection

This technique regresses crime rate on every available subset of the predictors using SAS software. We will find the two potentially best models based on the following criteria:

High R-square and low SSE
High adjusted R-square and low MSE
Low values of Mallow's Cp, such that Cp ≈ p, where p is the number of parameters
Low values of AIC (Akaike's Information Criterion) and SBC (Schwarz' Bayesian Criterion)
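A best-subsets pass can be sketched in a few lines: fit every subset, score each with R² and Mallow's Cp = SSE_p/MSE_full − (n − 2p). The data below is synthetic; the point is the enumeration and scoring, not the project's actual results.

```python
import numpy as np
from itertools import combinations

# Best-subsets sketch on synthetic data: enumerate every predictor subset
# and score it with R^2 and Mallow's Cp.
rng = np.random.default_rng(7)
n, K = 34, 4
X = rng.normal(size=(n, K))
y = 1.0 + 2.0 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=n)

def sse(cols):
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

mse_full = sse(range(K)) / (n - K - 1)       # MSE of the full model
ssto = float(((y - y.mean()) ** 2).sum())

results = []
for size in range(1, K + 1):
    for cols in combinations(range(K), size):
        p = size + 1                         # parameters incl. intercept
        s = sse(cols)
        cp = s / mse_full - (n - 2 * p)      # Mallow's Cp
        r2 = 1 - s / ssto
        results.append((cols, r2, cp))

best = max(results, key=lambda t: t[1])      # highest R^2 (the full model)
```

Note that Cp of the full model is exactly p by construction, which is why the Cp ≈ p rule is judged on the smaller subsets.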

FIGURE 24: BEST SUBSETS SELECTION – 1 VARIABLE MODELS

FIGURE 25: BEST SUBSETS SELECTION – 2 VARIABLE MODELS

FIGURE 26: BEST SUBSETS SELECTION – 3 VARIABLE MODELS (2ND BEST MODEL)

FIGURE 27: BEST SUBSETS SELECTION – 4 VARIABLE MODELS (1ST BEST MODEL)


FIGURE 28: BEST SUBSETS SELECTION – 5 VARIABLE MODELS

FIGURE 29: BEST SUBSETS SELECTION – 6 VARIABLE MODELS

The two potentially best models can be selected using the criteria mentioned above. Looking at the adjusted R-square in figures 28 and 29, it starts to decrease instead of continuing the increasing trend of the previous figures. This means we do not need to look at 7-variable models, because they would not help explain more variability; it does not make sense to consider models with more than 6 variables.

Now, applying the criteria above to figures 24-29, we examined the R-square, adjusted R-square and Cp values to identify the 1st and 2nd best models. We concluded that figure 27 shows the 1st best model because its R-square (0.7750) and adjusted R-square (0.7440) are higher than those of the previous models, and its Cp value of 5.3977 is comparatively low with p = 5, so Cp ≈ p is satisfied. Figure 26 shows the 2nd best model: its R-square (0.7411) and adjusted R-square (0.7152) are noticeably high and its Cp (7.8307) is relatively low with p = 4 in this case, roughly satisfying Cp ≈ p.

Step-wise Regression

This technique combines forward selection and backward deletion of predictor variables. We start with no predictors and identify those significant at α = 0.1, comparing the p-value of each t-test to α = 0.1 when adding or deleting a predictor. The predictor with the largest p-value is deleted when its p-value exceeds α. The potentially good model is reached when no predictor variable can be added to or deleted from the model.
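The forward half of this procedure can be sketched as below: at each step, fit the current model plus one candidate, and admit the candidate with the smallest t-test p-value if it beats α = 0.1 (the deletion half and SAS's exact entry/stay rules are omitted for brevity). Synthetic data stands in for the project's.

```python
import numpy as np
from scipy import stats

# Forward-selection sketch using t-test p-values with alpha = 0.1.
rng = np.random.default_rng(8)
n, K, alpha = 34, 4, 0.1
X = rng.normal(size=(n, K))
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)

def pvalues(cols):
    """Two-sided t-test p-values for each coefficient (intercept excluded)."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    df = n - A.shape[1]
    mse = resid @ resid / df
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    t = coef / se
    return 2 * stats.t.sf(np.abs(t), df)[1:]

selected = []
remaining = list(range(K))
while remaining:
    # p-value of each candidate when added to the current model
    pvals = [(pvalues(selected + [j])[-1], j) for j in remaining]
    best_p, best_j = min(pvals)
    if best_p >= alpha:
        break                      # no candidate is significant; stop
    selected.append(best_j)
    remaining.remove(best_j)
```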


FIGURE 30: STEP-WISE REGRESSION RESULTS

Using stepwise regression, the potentially best models from figure 30 are the 3- and 4-variable models. These two best models were also selected by the best subsets technique.

Backward Deletion

This technique is similar to stepwise regression. The difference is that we begin with the full set of predictors, regressing crime rate on all of them, and delete predictors until all remaining predictors are significant (p-value less than α).

FIGURE 31: BACKWARDS DELETION RESULTS


We observe from figure 31 that 7 predictors were removed from our full set of 9 predictors by the backward deletion technique. Thus the potentially best models are the 3- and 4-variable models, the same two selected by stepwise regression and the best subsets technique. We also concluded that multicollinearity is not a serious problem.

Thus, our two potentially best models, with significant predictors and no serious multicollinearity problem, are presented in figures 32 and 33 below.

R-Square = 0.7750

FIGURE 32: BEST MODEL 1

Best Model 1: Preliminary model:

Log Yi= 0.49799 + 0.00397X2 + 0.00837X4 + 0.08940stdx2x4 – 0.02166stdx3x4

Where Y is Crime rate, X2 is median income, X3 is state population and X4 is police expenditure.


R-Square = 0.7411

FIGURE 33: BEST MODEL 2

Best Model 2: Preliminary model:

Log Yi= 0.54640 + 0.00390X2 + 0.0078X4 + 0.09498stdx2x4

Where Y is Crime rate, X2 is median income, and X4 is police expenditure.

5. Model Selection

In selecting the final best model, the model assumptions must be verified and diagnostics must be checked for each model.

Best Model 1:

Figure 34: RESIDUAL PLOT - Best Model 1


Best model 1 contains the predictors X2, X4, stdx2x4 and stdx3x4, so we check the constant variance assumption by plotting the residuals against the predictor variables in figure 34. From the residual plots, we can see there is no curvature and no funnel shape; the variance is constant in the plots of X2, X4 and stdx2x4.

Figure 35: MODIFIED LEVENE TEST- BEST MODEL 1

A modified Levene test is used to confirm this observation, with dividing point 1.935281, testing at 95% confidence using the p-value.

Null hypothesis: H0 = variances are equalAlternative hypothesis: H1 = variances are not equalDecision rule: Reject H0 if p-value < α

Looking at the figure 35, from Equality of variances, we can see that p-value (0.0375) <α (0.05).Decision: We reject H0.Conclusion: Unequal variance is used for t-test.

Thus, we should conduct the two-sample t-test using unequal variances from the figure 35 above. Using 95% confidence (α = 0.05) and the p value, we test using the following hypothesis:

Null hypothesis: H0 = means are equal
Alternative hypothesis: H1 = means are not equal
Decision rule: Reject H0 if p-value < α

Looking at figure 35, we see that the p-value (0.2169) > α (0.05). Decision: we fail to reject H0.

Conclusion: the means are equal. Thus, at 95% confidence, the means of the absolute deviations are equal, i.e., the model has constant variance. This result is consistent with that of the residual plot in figure 34 above.
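The modified Levene procedure used above can be sketched in Python. This is an illustrative sketch only: the residuals below are simulated, not the project's data, and the folded-F step mimics SAS PROC TTEST's choice between the pooled and Satterthwaite t-tests.

```python
import numpy as np
from scipy import stats

def modified_levene(residuals, x, dividing_point, alpha=0.05):
    """Modified Levene (Brown-Forsythe) test for constant variance:
    split the residuals into two groups at the dividing point of x,
    then run a two-sample t-test on the absolute deviations from
    each group's median."""
    low = residuals[x <= dividing_point]
    high = residuals[x > dividing_point]
    d_low = np.abs(low - np.median(low))
    d_high = np.abs(high - np.median(high))
    # Folded F test (as in SAS PROC TTEST) decides pooled vs. Satterthwaite
    v1, v2 = np.var(d_low, ddof=1), np.var(d_high, ddof=1)
    if v1 >= v2:
        f_p = 2 * stats.f.sf(v1 / v2, len(d_low) - 1, len(d_high) - 1)
    else:
        f_p = 2 * stats.f.sf(v2 / v1, len(d_high) - 1, len(d_low) - 1)
    equal_var = f_p >= alpha
    t_stat, p_value = stats.ttest_ind(d_low, d_high, equal_var=equal_var)
    return p_value, equal_var

# Illustrative data only -- not the project's residuals
rng = np.random.default_rng(1)
x = rng.uniform(0, 4, 34)
resid = rng.normal(0, 0.1, 34)   # homoscedastic by construction
p_value, pooled = modified_levene(resid, x, dividing_point=1.935281)
print(p_value)
```

Failing to reject H0 (a large p-value) supports the constant-variance conclusion drawn from the residual plot.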


Figure 36: NORMAL PROBABILITY PLOT- Best Model 1

From the normal probability plot in figure 36, we see a shorter right tail and a shorter left tail, and we conclude that normality does not look satisfactory.

Figure 37: CORRELATION TEST FOR NORMALITY- Best Model 1

Using the correlation test to verify normality at 90% confidence (α = 0.1) we perform test using the following hypothesis:

Null hypothesis: H0 = normality is OK vs. Alternative hypothesis: H1 = normality is violated
Decision rule: Reject H0 if ρ < c(α, n)

To calculate c(0.1, 34) from the statistical tables, we interpolate and obtain 0.9734. From figure 37, ρ = 0.98859, so ρ (0.98859) > c (0.9734).

Decision: We fail to reject H0

Conclusion: Normality is not violated.

This result is not consistent with the result from the NPP. Thus we conclude that the normality assumption is violated, based on the more reliable normal probability plot.
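The correlation test above can be reproduced with the r value behind a normal probability plot. A sketch, using simulated residuals rather than the project's, with the report's critical value c(0.10, 34) = 0.9734:

```python
import numpy as np
from scipy import stats

def normality_corr(residuals):
    """Correlation between the ordered residuals and their expected
    normal scores -- the r value behind a normal probability plot."""
    (osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
    return r

# Illustrative residuals; the report's critical value is c(0.10, 34) = 0.9734
rng = np.random.default_rng(0)
resid = rng.normal(0, 0.1, 34)
rho = normality_corr(resid)
print(rho, rho > 0.9734)   # we fail to reject H0 when the comparison is True
```

Comparing this ρ against the tabulated c(α, n) gives the same decision rule used in the report.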


We check for outliers in our model using the leverage values for x-outliers and the Bonferroni procedure for y-outliers. The leverage cutoff value is 0.29, while the Bonferroni critical value is 3.54557 at α = 0.05. Our model has no outliers, since all values are smaller than the leverage cutoff and the Bonferroni value (Appendix 4). Thus there is no need to check for influence.
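Both cutoffs quoted above can be reproduced from n = 34 and p = 5 (intercept plus 4 predictors). This sketch assumes the common 2p/n leverage rule of thumb and the Bonferroni critical value t(1 − α/(2n); n − p − 1) for studentized deleted residuals:

```python
from scipy import stats

n, p, alpha = 34, 5, 0.05   # 34 observations; p = intercept + 4 predictors

# X-outliers: flag leverage h_ii above the common 2p/n rule of thumb
leverage_cutoff = 2 * p / n

# Y-outliers: Bonferroni critical value for studentized deleted residuals,
# t(1 - alpha/(2n); n - p - 1)
bonferroni_cutoff = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)

print(round(leverage_cutoff, 2))   # 0.29, as quoted in the report
print(bonferroni_cutoff)           # close to the report's 3.54557
```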

Best Model 2:

Figure 38: RESIDUAL PLOT - Best Model 2

From the residual plots in figure 38, we see no curvature and no funnel shape, and thus we can conclude there is constant variance.

Figure 39: MODIFIED LEVENE- Best Model 2

We use the modified Levene test to confirm this observation, with dividing point = 1.935281, testing at 95% confidence using the p-value.

Null hypothesis: H0 = variances are equal
Alternative hypothesis: H1 = variances are not equal
Decision rule: Reject H0 if p-value < α

Looking at figure 39, under Equality of Variances, we see that the p-value (0.3731) > α (0.05). Decision: we fail to reject H0. Conclusion: the equal-variance t-test is used.


Thus, we conduct the two-sample t-test using equal variances, that is, the pooled method, from figure 39 above. Using 95% confidence (α = 0.05) and the p-value, we test the following hypotheses:

Null hypothesis: H0 = means are equal
Alternative hypothesis: H1 = means are not equal
Decision rule: Reject H0 if p-value < α

Looking at figure 39, we see that the p-value (0.9449) > α (0.05). Decision: we fail to reject H0.

Conclusion: the means are equal. Thus, at 95% confidence, the means of the absolute deviations are equal and the model has constant variance. The result is consistent with that of the residual plot in figure 38 above.

Figure 40: NORMAL PROBABILITY PLOT - Best Model 2

From the normal probability plot in figure 40, we see a shorter right tail and a shorter left tail, and we conclude that normality does not look satisfactory.

Figure 41: CORRELATION TEST FOR NORMALITY- Best Model 2

Using the correlation test to verify normality at 90% confidence (α = 0.1) we perform test using the following hypothesis:

Null hypothesis: H0 = normality is OK vs. Alternative hypothesis: H1 = normality is violated
Decision rule: Reject H0 if ρ < c(α, n)

To calculate c(0.1, 34) from the statistical tables, we interpolate and obtain 0.9734. From figure 41, ρ = 0.98290, so ρ (0.98290) > c (0.9734).


Decision: We fail to reject H0

Conclusion: Normality is not violated.

This result is not consistent with the result from the NPP. Thus we conclude that the normality assumption is violated, based on the more reliable normal probability plot.

We check for outliers in our model using the leverage values for x-outliers and the Bonferroni procedure for y-outliers. The leverage cutoff value is 0.29, while the Bonferroni critical value is 3.56123 at α = 0.05. Our model has no outliers, since all values are smaller than the leverage cutoff and the Bonferroni value (Appendix 5). Thus there is no need to check for influence.

Comparison of the two best models

Selecting the overall best regression model depends on the results for each model and the diagnostics from the residuals. The comparison of the two best models, using the criteria for selecting the overall best model, is shown below:

Best Model 1 vs. Best Model 2

R-square and SSE: the rule is the higher the R-square and the lower the SSE, the better the model. Model 1: R-square = 0.7750, SSE = 0.27645. Model 2: R-square = 0.7411, SSE = 0.31219. By this criterion model 1 is better.

Adjusted R-square and MSE: the rule is the higher the adjusted R-square and the lower the MSE, the better the model. Model 1: adjusted R-square = 0.7440, MSE = 0.00953. Model 2: adjusted R-square = 0.7152, MSE = 0.01041. By this criterion model 1 is better.

Mallow's Cp: a low Cp close to p is preferred. Model 1: Cp = 5.3977 with p = 5, so Cp ≈ p. Model 2: Cp = 7.8307 with p = 4, which is not close to p. By this criterion model 1 is better.

AIC and SBC: the rule is the lower the AIC and SBC values, the better the model. Model 1: AIC = -154.2522, SBC = -146.62044. Model 2: AIC = -151.4769, SBC = -145.3714. Model 1 has the lower values, so by this criterion model 1 is better.

Hence, we conclude from the comparison above that "Best Model 1" is the best overall regression model for describing the crime rate in the 47 states of the USA.
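The criteria in the comparison can be recomputed from the reported SSE, n, and p. A sketch, assuming SAS PROC REG's formulas AIC = n·ln(SSE/n) + 2p and SBC = n·ln(SSE/n) + p·ln(n); small differences from the report's values for model 1 may reflect rounding of the quoted SSE:

```python
import math

def aic_sbc(sse, n, p):
    """SAS PROC REG information criteria: AIC = n*ln(SSE/n) + 2p,
    SBC = n*ln(SSE/n) + p*ln(n)."""
    base = n * math.log(sse / n)
    return base + 2 * p, base + p * math.log(n)

def adj_r2(r2, n, p):
    """Adjusted R-square: 1 - ((n - 1)/(n - p)) * (1 - R^2)."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

n = 34
# Best model 2 (3 predictors, p = 4) with its reported SSE
aic2, sbc2 = aic_sbc(0.31219, n, 4)
print(round(aic2, 4), round(sbc2, 4))   # ~ -151.4769 and -145.3714, as reported

# Best model 1 (4 predictors, p = 5): adjusted R-square from its R-square
print(round(adj_r2(0.7750, n, 5), 4))   # ~ 0.7440, as in the comparison table
```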


6. Final Multiple Linear Regression Model

As mentioned above, the general multiple linear regression form is Yi = β0 + β1Xi1 + … + βkXik + εi, and our selected model from the SAS output is:

Log(crime rate) = 0.49799 + 0.00397 (median income) + 0.00837 (police expenditure) + 0.08940 (std median income × police expenditure) − 0.02166 (std state population × police expenditure).

We noticed from the analysis that, of the four initial factors (unemployment rate, median income, state population, and police expenditure) considered in relation to the crime rate in the 47 states, only median income and police expenditure are used in predicting the crime rate. This model focuses on the relationship between the crime rate in the 47 states and low earnings and police expenditure. From the scatter-plot matrix, we observed a linear relationship between the crime rate and the number of families with low income, with a negative pattern, and also a linear relationship between the crime rate and police expenditure, with a positive correlation pattern and a linear trend. Both variables explained more variation when added.

R-Square = 0.7750

Figure 42: FINAL MODEL – SAS OUTPUT

We conducted a 95%-confidence F-test to assess the significance of our final model, testing the null hypothesis H0: β1 = β2 = β3 = β4 = 0 versus the alternative hypothesis H1: at least one βj ≠ 0. Decision rule: reject H0 when p < α, where p is the p-value Pr > F. From the ANOVA table in figure 42, the p-value < 0.0001, which is less than 0.05, so we reject the null hypothesis and conclude that our selected final model is significant.
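The overall F-test can be reproduced from the ANOVA quantities quoted for figure 42 (MSR = 0.23232, MSE = 0.00953, with 4 model and 29 error degrees of freedom). A quick check:

```python
from scipy import stats

# ANOVA quantities reported for the final model (figure 42)
msr, mse = 0.23232, 0.00953
df_model, df_error = 4, 29   # 4 coefficients tested; n - p = 34 - 5 = 29

f_stat = msr / mse
p_value = stats.f.sf(f_stat, df_model, df_error)

print(round(f_stat, 2))    # 24.38
print(p_value < 0.0001)    # True: reject H0; the model is significant
```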


The ANOVA table (figure 42) provides variability information in the Sum of Squares and Mean Square columns. The value 0.92927, at the intersection of the Model row and the Sum of Squares column, is the SSR (regression sum of squares); it represents the amount of variability that is explained by our model. The Model row and Mean Square column give the MSR (regression mean square), whose value is 0.23232. The value 0.27645 is the SSE (error/residual sum of squares), the amount of variability that is unexplained by our model. Below it, the value 1.20573 is the SSTO (total sum of squares), the total variability of the observations of the crime rate in the 47 states of the USA. At the intersection of the Error row and the Mean Square column is the MSE (error/residual mean square), given as 0.00953; this is an unbiased estimator of the error variance, as it provides information on the spread of the data points around the fitted model. Since the MSE of 0.00953 is low, we can say that our data points are not very spread out.

The R-square value of 0.7750 (the coefficient of determination) represents the proportion of variability that is explained by our model; thus our regression model accounts for 77.50% of the variability in the observations of the crime rate in the 47 states of the USA. The adjusted R-square adjusts the R-square for the number of predictors in the model; it does not determine the fit of the model, but informs us how suitable our model is given the number of predictors, thereby measuring the marginal contribution of each predictor. Since the adjusted R-square (0.7440) is close to the R-square, the number of predictors in this model is adequate. Thus, this model is valid for making predictions about the relationship between median income, police expenditure, and the crime rate.

Interpret inferences

Predictor (estimate): Lower CI limit, Upper CI limit
Median income (0.00397): 0.00213, 0.00571
Police expenditure (0.00837): 0.00603, 0.01071
Std median income × police expenditure (0.08940): 0.0266, 0.1522
Std state population × police expenditure (−0.02166): −0.05156, 0.00824

Figure 43: BONFERRONI SIMULTANEOUS 95% CONFIDENCE INTERVALS

From the values in figure 43, we are 95% confident that the coefficient of median income lies between 0.00213 and 0.00571, the coefficient of police expenditure lies between 0.00603 and 0.01071, and the coefficients of the standardized interaction terms lie in [0.0266, 0.1522] and [−0.05156, 0.00824], all simultaneously.
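Simultaneous Bonferroni intervals replace the ordinary t multiplier with t(1 − α/(2g); n − p), where g is the number of intervals held jointly. A sketch of the multiplier for this model, assuming g = 4 intervals and 29 error degrees of freedom:

```python
from scipy import stats

g, alpha, df_error = 4, 0.05, 29   # 4 simultaneous intervals; n - p = 29

# Bonferroni runs each interval at level 1 - alpha/g, so the multiplier
# is t(1 - alpha/(2g); df_error); each CI is estimate +/- multiplier * SE
bonf_multiplier = stats.t.ppf(1 - alpha / (2 * g), df_error)
single_t = stats.t.ppf(1 - alpha / 2, df_error)

print(bonf_multiplier > single_t)   # True: joint intervals are wider
```

The wider multiplier is the price of holding all four statements at 95% confidence simultaneously.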

We are particularly interested in studying the case xh where the median income variable is 139 and the police expenditure per person is 69, because this observation has the smallest crime rate per 1000 among the 47 states, namely 34.2 (refer to Appendix 2).

The SAS System

Xnew: (1, 139, 69)
Predicted: 2.12936
95% CI: (1.23423, 3.35637)
95% confidence band: (1.03128, 4.01023)
95% PI: (−2.2342, 7.00123)

Figure 44: INTERVAL ESTIMATION AND PREDICTION


Confidence Interval

Looking at figure 44, we are 95% confident that the average crime rate in the 47 states, when median income is 139 and police expenditure per person is 69, lies between 1.23423 and 3.35637.

Confidence Band Boundaries

Similarly, we can say with 95% confidence that the boundary values of the confidence band for the average crime rate in the 47 states, when median income is 139 and police expenditure per person is 69, lie between 1.03128 and 4.01023.

Prediction Interval

Also, we can conclude with 95% confidence that an actual (individual) observation of the crime rate in the 47 states, when median income is 139 and police expenditure per person is 69, will lie between −2.2342 and 7.00123.
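The prediction interval is wider than the confidence interval because it adds the variance of a single new observation: se_pred² = MSE·(1 + xh'(X'X)⁻¹xh) versus se_fit² = MSE·xh'(X'X)⁻¹xh. A generic numeric sketch with simulated stand-in data (not the project's), loosely mimicking the xh = (1, 139, 69) case:

```python
import numpy as np
from scipy import stats

# Simulated stand-in data (NOT the project's): intercept + two predictors
rng = np.random.default_rng(7)
n = 34
X = np.column_stack([np.ones(n),
                     rng.uniform(80, 160, n),    # "median income"-like column
                     rng.uniform(40, 120, n)])   # "police expenditure"-like column
y = X @ np.array([0.5, 0.004, 0.008]) + rng.normal(0, 0.1, n)

p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                    # least-squares coefficients
mse = np.sum((y - X @ b) ** 2) / (n - p)

xh = np.array([1.0, 139.0, 69.0])        # the case studied in the report
h = xh @ XtX_inv @ xh                    # x_h' (X'X)^-1 x_h
t_crit = stats.t.ppf(0.975, n - p)

se_fit = np.sqrt(mse * h)                # CI for the mean response at xh
se_pred = np.sqrt(mse * (1 + h))         # PI for a single new observation
ci = (xh @ b - t_crit * se_fit, xh @ b + t_crit * se_fit)
pi = (xh @ b - t_crit * se_pred, xh @ b + t_crit * se_pred)
print(pi[1] - pi[0] > ci[1] - ci[0])     # True: the PI is always wider
```

This mirrors the pattern in figure 44, where the PI (−2.2342, 7.00123) is much wider than the CI (1.23423, 3.35637).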

7. Final Discussion

Our project focused on the possible relationship between the crime rate in 47 states of the USA and four factors: the unemployment rate, the number of families per 1000 earning below one-half of the median income, the state population size, and police expenditure. We collected a total of 34 observations and, using multiple regression, found that police expenditure per person is the most useful predictor variable, as it shows the strongest relationship with the crime rate.

A preliminary model was fit from our collected data as follows: crime rate = β0 + β1 unemployment_rate + β2 median_income + β3 state_population + β4 police_expenditure, where β0, β1, β2, β3, β4 are the model parameters. We then checked constant variance and normality: the variances were constant, but the model did not satisfy normality. Moreover, we checked for outliers, leverage, influence, and variance inflation. We had no adequate reason to remove any outliers, because no data points were unusual. We also found no serious multicollinearity problem, since our maximum VIF value was 2.59828. We then applied a log transformation and re-checked the model assumptions.

We then explored interactions using partial regression. Looking at all six interaction terms in figures 16 through 21, the interaction terms X1X4, X2X3, X2X4, and X3X4 were the potentially useful terms to add to our model, since they show a linear trend and have significant effects on our response variable (crime rate).

We then obtained two potentially "best" models using the best-subsets, backward-deletion, and stepwise regression model-search techniques. The potentially best models were a 3-variable model and a 4-variable model.

After that, for each model we verified the model assumptions and checked the diagnostics. The first best model (the model with 4 variables) was chosen as the best overall model, as it has higher R-square and adjusted R-square values and lower SSE, MSE, AIC, and SBC values; its Cp value is also close to p.

We then presented our final model as: Log(crime rate) = 0.49799 + 0.00397 (median income) + 0.00837 (police expenditure) + 0.08940 (std median income × police expenditure) − 0.02166 (std state population × police expenditure). The R-square value of our overall best model was 0.7750, which represents the proportion of variability explained by the model; thus our regression model accounts for 77.50% of the variability in the observations of the crime rate in the 47 states of the USA.


For further analysis, we could add more predictor variables, such as the mean number of years of schooling or the labor-force participation rate, to see whether these variables help explain more of the variation in our data. We could also check their respective relationships with our response variable (the crime rate in the 47 states of the USA).


APPENDIX 1


APPENDIX 2


APPENDIX 3


APPENDIX 4


APPENDIX 5
