QUANT HOMEWORK TY MEDLER and MARIE HOFFMAN

Quant 1 - Homework #2 (Dataset “Earth.sav”)

Please provide all responses in this document and name the file using convention, “Last name, First name – homework #2.docx”. Please highlight the text of your answers. There are six (6) pages in this document. Ensure you scroll down. I put page breaks in between each of the numbered problems.

For this assignment, we will be using the file, “Earth.sav,” and the following variables. For all applicable analyses, exclude cases listwise.

- Average male life expectancy (lifeexpm) – dependent variable (DV)
- People living in cities (urban) – independent variable (IV)
- People who read (literacy) – independent variable (IV)
- Infant mortality (babymort) – independent variable (IV)
- Gross domestic product (gdp_cap) – independent variable (IV)
- AIDS cases (aids) – independent variable (IV)
- Daily calorie intake (calories) – independent variable (IV)

1. Using the steps from class and in SPSS, perform a correlation analysis between Average male life expectancy (lifeexpm) and Daily calorie intake (calories). Use an α = .05 significance level.

a. Paste a copy of the pertinent parts of your output from the correlation analysis below (only the pertinent parts and please make it neat and reasonably sized). I recommend you paste as JPEG images. Hint: I’m looking for two things. (4 points)


b. Are the two variables significantly correlated? If so, positively or negatively? Support your answer. (4 points)

The scatterplot supports a positive correlation because the points follow a diagonal line from lower left to upper right (steep slope), which suggests a strong relationship. The correlation coefficient is .765, which falls within the -1 to +1 range and is close to 1.00, so we conclude that the association is strong. The variables are significantly correlated because the p value is less than .001, which is below α = .05. There is a significant positive correlation between average male life expectancy and daily calorie intake.

2. Using the steps from class and in SPSS, perform a correlation analysis between Average male life expectancy (lifeexpm) and AIDS cases (aids). Use an α = .05 significance level.

a. Paste a copy of the pertinent parts of your output from the correlation analysis below. (4 points)


b. Are the two variables significantly correlated? If so, positively or negatively? Support your answer. (4 points)

Looking at the scatterplot, the two variables do not appear to be significantly correlated because the trend line is basically flat. The direction is still positive (although not significant), but the strength (the steepness of the slope) is almost nothing, so the relationship is very weak. Taking a closer look, the Pearson coefficient is .011, which is very low; the correlation is essentially zero. The p value is .909, which is not less than .05, so the variables are not significantly correlated.
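The decision rule applied in questions 1 and 2 can be sketched in Python. This is only an illustration under stated assumptions: the function name `describe_correlation` is hypothetical, and because the first p value is reported only as "less than .001", the value .001 is used as an upper bound.

```python
ALPHA = 0.05  # significance level given in the assignment


def describe_correlation(r, p, alpha=ALPHA):
    """Return (is_significant, direction) for a Pearson correlation."""
    is_significant = p < alpha
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return is_significant, direction


# Question 1: lifeexpm vs. calories (p reported as < .001; .001 is an assumed upper bound)
q1 = describe_correlation(r=0.765, p=0.001)
# Question 2: lifeexpm vs. aids
q2 = describe_correlation(r=0.011, p=0.909)

print(q1)
print(q2)
```

The sign of r gives the direction, while significance depends only on comparing p with α, which is why the AIDS correlation counts as positive in direction but not significant.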


3. Using the steps from class and in SPSS, perform a simple linear regression analysis to determine the impact of Daily calorie intake (calories) – independent variable on Average male life expectancy (lifeexpm) – dependent variable. Use an α = .05 significance level.

a. Paste a copy of the pertinent parts of your output from the simple linear regression analysis below (you may reference output from the previous questions instead of pasting it in, as appropriate). Hint: I’m looking for four things to provide you enough information to answer the questions below (see Session 9 slides). (4 points)

P value is less than .001: the null must go.


b. Is the overall simple linear regression model significant? Support your answer. (4 points)

Yes, the overall model is significant (has predictive value) because according to the ANOVA table the p value is less than .001, so we reject the null hypothesis (“there is nothing going on here”) that the model has no predictive value. P value is low so the null must go.

c. What is the value of the coefficient of determination and what does it mean as it relates to this problem? (4 points)

The coefficient of determination (R²) is .585, found in the Model Summary. This means 58.5% of the variation in average male life expectancy is explained by the variation in daily calorie intake.
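As a quick cross-check, in a simple linear regression with a single predictor the coefficient of determination equals the square of the Pearson correlation. A minimal sketch using the r value reported in question 1:

```python
r = 0.765  # Pearson correlation between lifeexpm and calories (question 1)

# With one predictor, R squared is simply r squared.
r_squared = r ** 2

print(round(r_squared, 3))
```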

d. Write out the regression equation model using the information from your analysis. (4 points)

General Form of the Regression Model

Y=26.37054+.013621(X)

Wow, I love statistics, thank you for giving us this opportunity. After completing some Kentucky Windage I identified the following:

- X Bar = 2753.88, which is the mean of the independent variable.
- Sx = 567.828, which is the standard deviation of the independent variable.
- Y Bar = 63.88, which is the mean of the dependent variable.
- Sy = 10.110, which is the standard deviation of the dependent variable.

b1 = slope of the regression line: r*Sy/Sx = .765*10.110/567.828 = .013621
b0 = intercept: Y Bar - b1*X Bar = 63.88 - .013621(2753.88) = 26.37054 (using the unrounded slope)
X=75
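The slope and intercept arithmetic above can be reproduced with a short Python sketch. The variable names are illustrative, and the mean of 2753.88 is taken from the summary statistics listed above.

```python
r = 0.765        # Pearson correlation between calories and lifeexpm
x_bar = 2753.88  # mean daily calorie intake
s_x = 567.828    # standard deviation of daily calorie intake
y_bar = 63.88    # mean male life expectancy
s_y = 10.110     # standard deviation of male life expectancy

# Slope: correlation rescaled by the ratio of the standard deviations.
b1 = r * s_y / s_x
# Intercept: the fitted line passes through the point (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.6f}")
print(f"b0 = {b0:.5f}")
```

Carrying the unrounded slope into the intercept step matters here: rounding b1 to .013621 first shifts b0 by about a thousandth of a year.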

e. What is the expected Average male life expectancy when the Daily caloric intake is 2,000 calories? (4 points)

Average male life expectancy = b0 + b1*daily calorie intake

$Y_i = b_0 + b_1 X_i + \varepsilon_i$

where
- $Y_i$ is the estimated value of the Y variable for a selected X value.
- $b_0$ is the Y-intercept. It is the estimated value of Y when X = 0.
- $b_1$ is the slope of the line, or the average change in Y for each change of one unit (either increase or decrease) in the independent variable X.
- $X_i$ is any value of the independent variable that is selected.
- $\varepsilon_i$ is the error.


26.37054 + (.013621*2000) = 26.37054 + 27.2420 = 53.6125

The average male life expectancy with a daily caloric intake of 2,000 calories is 53.6125 years.
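The same prediction can be checked by plugging the fitted equation into a small function. This is a sketch: `predict_life_expectancy` is a hypothetical helper name, and the coefficients are the rounded values from the regression equation in part d.

```python
B0 = 26.37054  # intercept from part d
B1 = 0.013621  # slope from part d (years per additional daily calorie)


def predict_life_expectancy(calories):
    """Predicted average male life expectancy for a given daily calorie intake."""
    return B0 + B1 * calories


prediction = predict_life_expectancy(2000)
print(round(prediction, 4))
```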

f. In the terms of our problem, interpret the unstandardized Beta coefficient for b1. (4 points)

As daily calorie intake increases by 1 unit (one calorie), average male life expectancy increases by .014 years. The unstandardized coefficient indicates the average change in the dependent variable associated with a one-unit change in the independent variable (in a multiple regression, statistically controlling for the other independent variables). We want to think about a change in the outcome associated with a unit change in the predictor.

g. In the terms of our problem, interpret the standardized Beta coefficient for b1. (4 points)

The standardized beta coefficient of .765 indicates that as daily calorie intake increases by one standard deviation, average male life expectancy increases by .765 standard deviations. In raw units, for every additional 567.8 calories consumed (one SD of calorie intake), male life expectancy increases by about 7.7 years (the SD of life expectancy, 10.110, multiplied by the standardized beta coefficient of .765).

The standardized coefficients, or beta coefficients, are estimates resulting from an analysis performed on variables that have been standardized so that they have variances of 1. This is useful for answering the question of which independent variables have a greater effect on the dependent variable in a multiple regression analysis when the variables are measured in different units. The standardized coefficient represents the change, in standard deviations, in the dependent variable that results from a change of one standard deviation in an independent variable. That should be at least 5 extra EC points.
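To make the conversion from standard deviations back to raw units concrete, here is a minimal sketch using the values reported in question 3. Variable names are illustrative only.

```python
beta = 0.765   # standardized coefficient for calories
s_y = 10.110   # SD of male life expectancy, in years
s_x = 567.828  # SD of daily calorie intake

# A one-SD increase in calories shifts life expectancy by beta SDs,
# which is beta * s_y when expressed in years.
change_in_years = beta * s_y

# Dividing by the SD of the predictor recovers the unstandardized slope.
implied_b1 = change_in_years / s_x

print(round(change_in_years, 3))
print(round(implied_b1, 6))
```

The second step shows why the standardized and unstandardized coefficients carry the same information in different units.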

4. Use the steps from class and in SPSS. We want to analyze the effect of several variables on Average male life expectancy (dependent variable). Use an α = .05 significance level. Use the following variables and perform the appropriate steps for a multiple regression analysis (see Session 10 slides):

- Average male life expectancy (lifeexpm) – dependent variable
- People living in cities (urban) – independent variable
- People who read (literacy) – independent variable
- Infant mortality (babymort) – independent variable
- Gross domestic product (gdp_cap) – independent variable
- AIDS cases (aids) – independent variable
- Daily calorie intake (calories)

a. Paste a copy of the pertinent parts of your output from the simple linear regression analysis below. (4 points) Hint: I’m looking for the following things:


Matrix Scatterplot

Scatterplot of the Standardized Predicted Value against the Standardized Residual


Correlation Matrix

Regression model output


Histogram of the Standardized Residuals


Normal P-P plot of Standardized Residual

b. Does each of the IVs appear to be linearly related to the DV? Which part(s) of the SPSS output provides that information? If there are any IVs that do not appear to be linearly related to the DV, which are they? (4 points)

No, not all of the independent variables appear to be linearly related to the dependent variable. I am looking particularly at the matrix scatterplot to answer this question. The independent variable AIDS cases does not appear to be closely related to the dependent variable, average male life expectancy. We can also look at the Correlation Matrix, which gives the significance level (p value) and the correlation (Pearson's r) for each pair. All variables are significantly related to average male life expectancy except AIDS cases, whose p value is above .05, so we fail to reject the null hypothesis for that variable.

c. Does the model meet the assumption of independent errors (residuals) and homoscedasticity? Which part(s) of the SPSS output provides that information?

Homoscedasticity basically means the residuals are spread out to about the same extent across the range of predicted values. To determine this I am going to look at the scatterplot of the standardized predicted value against the standardized residual (see above in part a of this question). The points are equally spread out, much like a shotgun blast, indicating that the model meets the assumption of independent errors and is in fact homoscedastic. Also confirming that the data are homoscedastic, the fit line is flat and fits nicely, with no uneven shapes. The nice-looking fit line tells me that the error variance is consistent across the values of the predictor variables. This graph looks like a model example, so we accept that the model meets the assumptions.

d. Is there any multicollinearity? Support your answer. Which part(s) of the SPSS output provides the best information to answer the multicollinearity question? (4 points)

Yes, there appears to be some multicollinearity in the data set, based on the correlation table and the VIF values in the Coefficients table. The VIF (variance inflation factor) needs to be less than 10 for a variable to be acceptably independent of the other variables. Of note, two variables (people who read and infant mortality) have VIF values of 6.628 and 9.729, which indicate that they are closely correlated with other predictors. The correlation between these two variables is also strong (-.744). Although other variables show small correlations in this model, none shows a VIF above 10, so none needs to be removed from the data set; still, people who read and infant mortality are more closely correlated and nearer the cutoff of 10 than the rest. These two variables also have a Tolerance lower than .2, which is another indicator of multicollinearity.

3) Difficulty assessing the importance of a predictor.
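Tolerance is the reciprocal of VIF, so the two rules of thumb quoted above (VIF below 10, Tolerance not below .2) can be checked together. This is a sketch only: the pairing of 6.628 with literacy and 9.729 with babymort is assumed from the order in which the answer lists them.

```python
VIF_CUTOFF = 10.0       # rule of thumb quoted in the answer
TOLERANCE_CUTOFF = 0.2  # rule of thumb quoted in the answer

# Assumed pairing of VIF values to variables (taken from the order in the text).
vif = {"literacy": 6.628, "babymort": 9.729}

# Tolerance is defined as 1 / VIF.
tolerance = {name: 1.0 / value for name, value in vif.items()}

for name in vif:
    over_vif = vif[name] >= VIF_CUTOFF
    low_tolerance = tolerance[name] < TOLERANCE_CUTOFF
    print(f"{name}: VIF={vif[name]:.3f}, tolerance={tolerance[name]:.3f}, "
          f"VIF over cutoff={over_vif}, tolerance below cutoff={low_tolerance}")
```

Both variables pass the VIF rule but fail the Tolerance rule, which matches the answer's reading that they are near, but not over, the removal threshold.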


e. Are the errors normally distributed? Support your answer. Which part(s) of the SPSS output provides that information? (4 points)

We looked at the P-P plot, the scatterplot, and the histogram, and they all appear to be normal (see part a of this question). Looking at your independent errors diagnostics slide, I have to say the errors are normally distributed: the points on the P-P plot fall close to the fit line and the histogram has a classic bell shape. Also, looking at the scatterplot, I don’t see indications of autocorrelation, which would be a bad thing. I am referencing in particular Fred’s Super Scatterplot. Did I mention it is almost 2100 and I’m still on #4??? That should be an extra 5 EC.

f. Is the overall multiple linear regression model significant? Support your answer. (4 points)

Yes, the overall model is significant because according to the ANOVA table the P value is less than .001, which means we can reject the null hypothesis that the model has no predictive value. P value is low so the null must go. It tests if our model is significant.

Most of our Sig. (p) values are less than .05, which shows that those predictors are making a significant contribution to the model. The smaller the value of Sig. (and the larger the value of t), the greater the contribution of that predictor.


g. Which of the independent variables significantly contribute to the predictive/explanatory value to the model? Which part(s) of the SPSS output provides that information? (4 points)

Standardized beta values tell us the importance of each predictor (the bigger the absolute value, the more important). In this case the independent variable that contributes most to the predictive/explanatory value is Infant Mortality, with a Standardized Coefficient of 1.024. This says that although the other variables (which all have Standardized Coefficients of less than .2) contribute to the predictability of the model, infant mortality carries more weight in the predictive/explanatory value of the model.


h. Of the independent variables that DO NOT significantly contribute to the predictive/explanatory value to the model, which has the highest p value (i.e., furthest from being significant) and what is the p value of that IV? (4 points)

Gross Domestic Product has the highest p value at .575.

5. Use the steps from class and in SPSS. Remove the non-significant predictor variables from the multiple linear regression model in number 4. above. Use an α = .05 significance level. Use the remaining predictor variables and perform the appropriate steps for a multiple regression analysis (assume that you have met all your assumptions – just run the multiple regression analysis).

We re-ran the multiple linear regression model with average male life expectancy as the dependent variable and kept only infant mortality and people who read as predictors, since they had Sig. levels less than or equal to .05.

a. Is the overall multiple linear regression model significant? Support your answer. (4 points)

Yes, the overall model is significant because according to the ANOVA table the P value is less than .001, which means we can reject the null hypothesis that the model has no predictive value. P value is low so the null must go. It tests if our model is significant.

b. What is the value of the coefficient of multiple determination and what does it mean as it relates to this problem? (4 points)


The coefficient of multiple determination (R²) is .884, found in the Model Summary. This means 88.4% of the variation in average male life expectancy can be explained by the variation in the independent variables, infant mortality and people who read.

c. Write out the regression equation model using the information from your analysis. (4 points)

- Y intercept (b0) is the constant found in the Coefficients table: 82.272
- b1 is people who read = -.075
- b2 is infant mortality = -.269

Y=82.272+ (-.075*X1) + (-.269*X2)

d. In the terms of our problem, interpret the unstandardized Beta coefficient for the predictor variable with the Beta coefficient with the smallest standard error. (4 points)

$Y_i = b_0 + b_1 X_i + \varepsilon_i$


Looking at the Coefficients table, under the Unstandardized Coefficients and Std. Error portion of the table, the predictor variable with the smallest standard error is infant mortality. With every one-unit increase in infant mortality, average male life expectancy decreases by .269 years (the unstandardized beta coefficient).

e. In the terms of our problem, interpret the standardized Beta coefficient for the same predictor variable as in d. immediately above. (4 points)

By increasing infant mortality by 1 standard deviation (38.2972 infant deaths per thousand live births), life expectancy will decrease by 10.314 years.

Standard deviation of life expectancy (9.351) multiplied by the standardized beta coefficient for infant mortality (-1.103) = -10.314, i.e., a decrease of 10.314 years.

The standardized coefficients, or beta coefficients, are estimates resulting from an analysis performed on variables that have been standardized so that they have variances of 1. This is useful for answering the question of which independent variables have a greater effect on the dependent variable in a multiple regression analysis when the variables are measured in different units. The standardized coefficient represents the change, in standard deviations, in the dependent variable that results from a change of one standard deviation in an independent variable.

f. Use 10 in place of the X values in the new model equation to predict the expected Average male life expectancy (note: this is like 3.e. above, but I can’t provide more information without giving you the answer to earlier questions)? (4 points)

Y=82.272+ (-.075*10) + (-.269*10) = 78.832 years
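The two-predictor equation can be evaluated the same way as the simple model. A minimal sketch, with `predict_life_expectancy` as a hypothetical helper name and coefficients taken from part c:

```python
B0 = 82.272           # constant
B_LITERACY = -0.075   # people who read
B_BABYMORT = -0.269   # infant mortality


def predict_life_expectancy(literacy, babymort):
    """Predicted average male life expectancy from the reduced model."""
    return B0 + B_LITERACY * literacy + B_BABYMORT * babymort


prediction = predict_life_expectancy(literacy=10, babymort=10)
print(round(prediction, 3))
```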