Lecture 5

Lecture 5 - Florida Atlantic University, math.fau.edu/bkhadka/IntroStats_files/Lecture_5.pdf

We start with the Lab 5 Follow-up.

Using the SAT.txt data, perform a multiple regression with Y = College GPA, but this time use only two input variables X: High School GPA and SAT. (Do not use the variable Letters, and note that the numbers are truncated or rounded.) Given the regression table, answer the questions:

Q1. If an incoming student has a High School GPA = 2 and SAT = 1500, what would be his predicted College GPA?
A1. -0.08813 + 2*0.407113 + 1500*0.001217 = 2.551

Q2. What is the approximate error of this prediction?
A2. Approximate error = +/- 0.587

Q3. If this student increases his High School GPA by 1 point, how much higher would his predicted College GPA be?
A3. This is a tricky one (students often make a mistake here): 1*0.407 = 0.407


Lab 5 Follow-up, SAT.txt data (continued)

Q4. If two students have identical High School GPAs but one has an SAT score 100 points higher than the other, how much higher will his predicted College GPA be?
A4. 100*0.0012 = 0.12

Q5. What is the interpretation of the coefficient 0.407?
A5. (Another tricky one.) For each additional point of High School GPA, with the SAT score held fixed, we expect a 0.407 point increase in College GPA.

Q6. How good is the fit and why?
A6. Decent for prediction, as R-square = 0.398 (refer to the table).
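The arithmetic behind the SAT answers can be checked with a few lines of Python. This is only a sketch: the function name `predict_college_gpa` is made up for illustration, and the coefficients are the rounded values quoted in the answers, so the last digit may differ slightly from the Excel output.

```python
# Coefficients quoted in the Lab 5 answers (rounded values from the regression table).
INTERCEPT = -0.08813
B_HS_GPA = 0.407113   # change in College GPA per point of High School GPA
B_SAT = 0.001217      # change in College GPA per SAT point
STD_ERROR = 0.587     # approximate error of a single prediction

def predict_college_gpa(hs_gpa, sat):
    """Plug the inputs into the fitted regression equation (hypothetical helper)."""
    return INTERCEPT + B_HS_GPA * hs_gpa + B_SAT * sat

base = predict_college_gpa(2, 1500)                    # Q1
gain_hs_gpa = predict_college_gpa(3, 1500) - base      # Q3: one extra GPA point
gain_sat = predict_college_gpa(2, 1600) - base         # Q4: 100 extra SAT points

print(f"Predicted College GPA: {base:.3f} +/- {STD_ERROR}")
print(f"Gain from +1 High School GPA: {gain_hs_gpa:.3f}")
print(f"Gain from +100 SAT points:    {gain_sat:.3f}")
```

Changing one input while the other stays fixed always changes the prediction by coefficient times change, which is why Q3 and Q4 do not depend on the student's starting values.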


More on the Lab 5 Follow-up

Using the Tornadoes.txt data, the task is to use the months of May and June to predict the tornado activity in August.

Q1. If May had 100 tornadoes and June had 200 tornadoes, what would be your prediction for the number of tornadoes in August?
A1. 34.885 - 0.033*100 + 0.177*200 = 67

Q2. What is the approximate error of this prediction? (Give an interval prediction as well.)
A2. +/- 26; the prediction is 67 +/- 26, that is, the interval [41, 93].

Q3. The intercept here has a value of 34.88. What is the interpretation of this number?
A3. When May and June both have ZERO tornadoes, the predicted number of tornadoes in August is 34.88.


More on the Lab 5 Follow-up, Tornadoes.txt data (continued)

Q4. What is the interpretation of the negative coefficient -0.03395?
A4. (This is a bit tricky.) The coefficient is negative, which means that for each observed tornado in May one expects a reduction of tornadoes in August. More precisely, for 100 observed tornadoes in May one expects a decrease of about 3 tornadoes in August (100*0.03395 = 3.395).
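As a sketch of how the point prediction, the interval prediction and the May effect fit together, the snippet below evaluates the regression equation with the coefficients quoted in the answers. The names `predict_august` and `may_effect_per_100` are hypothetical and used only for illustration.

```python
# Coefficients quoted in the tornado answers.
INTERCEPT = 34.885
B_MAY = -0.03395
B_JUNE = 0.177
STD_ERROR = 26  # approximate error of a prediction

def predict_august(may, june):
    """Predicted number of August tornadoes (hypothetical helper)."""
    return INTERCEPT + B_MAY * may + B_JUNE * june

pred = predict_august(may=100, june=200)
low, high = pred - STD_ERROR, pred + STD_ERROR   # interval prediction
may_effect_per_100 = B_MAY * 100                 # effect of 100 extra May tornadoes

print(f"Prediction: {pred:.0f}, interval: [{low:.0f}, {high:.0f}]")
print(f"Effect of 100 May tornadoes: {may_effect_per_100:.1f}")
```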


The following table was obtained with the Y-variable being the house price in $100,000 and the X-variables as presented. Answer the questions (it is OK to round off the values). Note: The table is from a modified house price data set.

Q1. If a house is 20 years old and has 1 bathroom, 1 fireplace and 5 rooms, what is the predicted price?
A1. -0.865 + 3.91 + 0.849 + 5*0.715 + 20*(-0.052) = 6.429, that is, $642,900.

Q2. Is it more expensive to have an extra bathroom or two extra rooms?
A2. (A bit tricky.) An extra bathroom costs about 3.91, that is, $391,000, while two extra rooms cost 2*0.715 = 1.43, that is, $143,000. So one extra bathroom is predicted to cost more.

Q3. Approximately how much older should a house be in order to decrease its value by about $52,800?
A3. (Even more tricky.) When X (the age of the house) increases by 10, the house price goes down by 10*(-0.0528) = -0.528, and since the units are $100,000, -0.528 corresponds to -$52,800. Thus the house should be 10 years older.


House price table (continued)

Q4. If two houses are identical but one has 1 fireplace while the other has no fireplace, what would be the difference in price?
A4. 1*0.849 = 0.849, that is, $84,900.

Q5. If one of the houses is 1000 square feet larger than the other, what is the predicted increase in price?
A5. A reasonable question, but not applicable in our case. We do not have the data to answer it.
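The house price answers all come from the same equation, with the extra step of converting from units of $100,000 to dollars. A minimal sketch follows, assuming the rounded coefficients used in the answer to Q1 (the age coefficient appears there as -0.052). The dictionary keys and the function name `predict_price` are illustrative, not part of the original table.

```python
# Coefficients as used in the answer to Q1 (Y is in units of $100,000).
INTERCEPT = -0.865
COEF = {"bathrooms": 3.91, "fireplaces": 0.849, "rooms": 0.715, "age": -0.052}
UNIT = 100_000  # dollars per unit of Y

def predict_price(bathrooms, fireplaces, rooms, age):
    """Return the predicted house price in dollars (hypothetical helper)."""
    y = (INTERCEPT
         + COEF["bathrooms"] * bathrooms
         + COEF["fireplaces"] * fireplaces
         + COEF["rooms"] * rooms
         + COEF["age"] * age)
    return y * UNIT

price = predict_price(bathrooms=1, fireplaces=1, rooms=5, age=20)   # Q1
extra_bathroom = COEF["bathrooms"] * UNIT                            # Q2
two_extra_rooms = 2 * COEF["rooms"] * UNIT                           # Q2
fireplace_gap = predict_price(1, 1, 5, 20) - predict_price(1, 0, 5, 20)  # Q4

print(f"Predicted price:  ${price:,.0f}")
print(f"Extra bathroom:   ${extra_bathroom:,.0f}")
print(f"Two extra rooms:  ${two_extra_rooms:,.0f}")
print(f"Fireplace effect: ${fireplace_gap:,.0f}")
```

There is no square-footage entry in `COEF`, which mirrors the answer to Q5: the model cannot say anything about a variable it was not fitted on.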


Summary Statistics

• Summary statistics are used to summarize a set of observations/data, in order to communicate the largest amount of information as simply as possible.

• Task 1

• Upload SAT data and open the data in Excel. Construct the Summary Statistic table for College GPA column.

• Instructions: Click on Data → Data Analysis → Descriptive Statistics. A window will open in which you have to select the Input Range (as usual, click there and then highlight the College GPA column). Finally, do not forget to check the Summary Statistics box. The resulting table is displayed on the right.

• Answer the questions:

• Q1. Which numbers describe the Center of the Data?

• A1. Mean and median: 1.9805 and 1.985

• Q2. How many subjects participated in this study?

• A2 Count=100, thus 100 subjects.

• Q3. What is the approximate variation of the College GPA data?

• A3. For this we use Standard Deviation=0.749.

• Q4. According to the rule of thumb, about 68% of the data lies within what interval?

• A4. About 68% of the data lies within ONE standard deviation of the mean: [1.9805 - 0.7492, 1.9805 + 0.7492] = [1.2313, 2.7297]

• Q5. According to the rule of thumb, 95% of the data lies within what interval?

• A5. About 95% of the data lies within TWO standard deviations of the mean: [1.9805 - 2*0.7492, 1.9805 + 2*0.7492] = [0.4821, 3.4789]
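The two rule-of-thumb intervals are simply the mean plus or minus one or two standard deviations. A small sketch, with the helper name `sd_interval` chosen here only for illustration:

```python
MEAN = 1.9805      # mean of College GPA from the summary statistics table
STD_DEV = 0.7492   # standard deviation of College GPA

def sd_interval(mean, std_dev, k):
    """Interval of k standard deviations around the mean (hypothetical helper)."""
    return (mean - k * std_dev, mean + k * std_dev)

one_sd = sd_interval(MEAN, STD_DEV, 1)
two_sd = sd_interval(MEAN, STD_DEV, 2)

print("Within one SD: ", [round(x, 4) for x in one_sd])
print("Within two SDs:", [round(x, 4) for x in two_sd])
```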

• Q6. What was the lowest College GPA recorded?

• A6. This would be the Minimum of the data file, 0.05.

• Q7. Which of the following quantities is/are NOT sensitive toward outliers: Standard Deviation, Median, Mean, Minimum, Maximum?

• A7. Median

• Q8. Which of the following quantities is/are sensitive toward outliers: Standard Deviation, Median, Mean, Minimum, Maximum?

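• A8. Mean & Standard Deviation. (Minimum and Maximum are sensitive too, since an outlier is usually itself the smallest or largest value.)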
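To see the sensitivity to outliers in numbers, one can compare the statistics of a small data set with and without one extreme value. The GPA values below are made up purely for illustration (they are not from the SAT data), and the outlier 40.0 imitates a typing error for 4.0.

```python
import statistics

# Made-up GPA values for illustration only (not from SAT.txt).
gpas = [1.2, 1.8, 2.0, 2.1, 2.9]
with_outlier = gpas + [40.0]  # e.g. a typo for 4.0

for name, func in [("mean", statistics.mean),
                   ("median", statistics.median),
                   ("stdev", statistics.stdev)]:
    before, after = func(gpas), func(with_outlier)
    print(f"{name:>6}: {before:6.2f} -> {after:6.2f}")

median_shift = abs(statistics.median(with_outlier) - statistics.median(gpas))
mean_shift = abs(statistics.mean(with_outlier) - statistics.mean(gpas))
```

The median barely moves, while the mean and the standard deviation change drastically.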


Task 2. Below are two tables. The first is the regression table, where the Y-variable is the Total Tornadoes number; the second is the descriptive statistics for the column Total Tornadoes.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9421
R Square            0.887552
Adjusted R Square   0.873135
Standard Error      87.18832
Observations        45

ANOVA
             df   SS         MS         F          Significance F
Regression    5   2340034    468006.7   61.56523   1.87E-17
Residual     39   296470.3   7601.802
Total        44   2636504

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%   Upper 95.0%
Intercept     136.2497      42.01536        3.242855  0.002427  51.26563   221.2338   51.26563355   221.2338
X Variable 1  0.273618      0.913936        0.299384  0.766237  -1.57499   2.122228   -1.574992398  2.122228
X Variable 2  0.675613      0.430507        1.569341  0.124647  -0.19517   1.546396   -0.195170574  1.546396
X Variable 3  0.93366       0.288627        3.234836  0.002482  0.349857   1.517462   0.349857276   1.517462
X Variable 4  0.974585      0.23006         4.236215  0.000134  0.509244   1.439926   0.5092439     1.439926
X Variable 5  2.067821      0.212358        9.737412  5.41E-12  1.638286   2.497357   1.638285744   2.497357

Total Tornadoes (descriptive statistics)
Mean                 760.0444
Standard Error       36.49065
Median               765
Mode                 #N/A
Standard Deviation   244.7867
Sample Variance      59920.54
Kurtosis             0.067799
Skewness             -0.17219
Range                1096
Minimum              201
Maximum              1297
Sum                  34202
Count                45

• Q1. If Variable 1 increases by 10 units and Variable 5 decreases by 1 unit, based on the above tables, how would the predicted Total Tornadoes number change?

• A1. The predicted number of tornadoes will increase by approximately 10*0.2736 - 1*2.0678 = 2.74 - 2.07 = 0.67.

• Q2. How many years were used in this study?

• A2. Count = 45.


• Q3. Which two numbers address the center of the Total Tornado data?

• A3. Mean and Median 760 & 765

• Q4. Once a prediction is made, we often provide a range where we believe the true value will lie. What number is used to construct this range?

• A4. Standard Error = 87.19 (the one from the regression table, not the Standard Error listed in the descriptive statistics).

Lecture: Enter the P-Value

• We have done fairly extensive analysis with SAT data below:

• 1-dim regression


• Clearly, each of the three variables (high school GPA, SAT score and letters) had a positive influence on the college GPA. In other words, if one would like to predict the performance of an incoming freshman, each of these three predictors would be relevant.


• We have seen the SAT data before and have done a fairly extensive analysis: 1-dim regression and then multiple regression using all three X variables at once.

• We investigated the role of the coefficients, the Standard Error and R-square in the table below.

• Now it is time to introduce the “real thing”.

Regression Statistics
Multiple R          0.63225
R Square            0.39974
Adjusted R Square   0.38098
Standard Error      0.58948
Observations        100

ANOVA
             df   SS        MS        F         Significance F
Regression    3   22.2144   7.40479   21.3098   1.2E-10
Residual     96   33.3583   0.34748
Total        99   55.5727

           Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept  -0.15326      0.32294         -0.47459  0.63616  -0.79429   0.48776    -0.79429     0.48776
GPA High   0.37635       0.11426         3.29377   0.00139  0.14954    0.60316    0.14954      0.60316
SAT        0.00123       0.0003          4.04636   0.00011  0.00063    0.00183    0.00063      0.00183
Letters    0.02268       0.05098         0.44495   0.65736  -0.07851   0.12388    -0.07851     0.12388
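Several entries of the regression table above are tied together by standard formulas, which gives a quick way to sanity-check a table. The sketch below recomputes R Square, Adjusted R Square, the Standard Error and one t Stat from other entries of the same table. The variable names are chosen here for illustration, and small differences in the last digit come from rounding in the printed table.

```python
import math

# Values copied from the regression table above.
ss_regression, ss_total = 22.2144, 55.5727
ms_residual = 0.34748
n_obs, n_inputs = 100, 3
coef_gpa, se_gpa = 0.37635, 0.11426   # "GPA High" row

r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_inputs - 1)
standard_error = math.sqrt(ms_residual)   # square root of the residual mean square
t_stat_gpa = coef_gpa / se_gpa            # coefficient divided by its standard error

print(f"R Square          {r_square:.5f}")
print(f"Adjusted R Square {adj_r_square:.5f}")
print(f"Standard Error    {standard_error:.5f}")
print(f"t Stat (GPA High) {t_stat_gpa:.4f}")
```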


Using the SAT.txt data, we perform multiple regression with Y = College GPA, while X contains all three remaining variables. The highlighted values are the main topic of this lecture.

The highlighted numbers are the famous P-values. The P-value is a measure of whether the outcome of a test is due to an actual effect or to mere random chance. The concept is rather difficult to grasp, and many professionals struggle to understand it even after extensive training. Thus, in the true spirit of this course, we will not theorize here but rather give you some easy-to-follow, rule-of-thumb instructions.

Regression Statistics
Multiple R          0.6322
R Square            0.3997
Adjusted R Square   0.381
Standard Error      0.5895
Observations        100

ANOVA
             df   SS       MS       F       Significance F
Regression    3   22.214   7.4048   21.31   1E-10
Residual     96   33.358   0.3475
Total        99   55.573

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept  -0.153        0.3229          -0.475   0.6362   -0.794     0.4878     -0.794       0.4878
High GPA   0.3764        0.1143          3.2938   0.0014   0.1495     0.6032     0.1495       0.6032
SAT        0.0012        0.0003          4.0464   0.0001   0.0006     0.0018     0.0006       0.0018
Letters    0.0227        0.051           0.4449   0.6574   -0.079     0.1239     -0.079       0.1239

P-Value Rule of thumb 1: The smaller the P-value, the better. How small? Typically less than 5% or 1% is acceptable. A P-value below 5% means that, if the variable had no real relation to Y, a trend at least as strong as the observed one would arise from randomness alone less than 5% of the time.


Recall: Another Caution!

• A large R-square does not guarantee that X and Y are indeed related. For this particular issue, Statistics offers a different number, one that tells us how easily the observed relation between X and Y could have arisen by chance alone.

• That number is the P-value.

• P-Value Rule of thumb 1: The smaller the P-value, the better. How small? Typically less than 5% or 1% is acceptable.

• A P-value below 5% means that, if X and Y were not related at all, a trend at least as strong as the observed one would show up by chance less than 5% of the time.

• In other words, randomness alone is an unlikely explanation, so we treat the outcome/trend as coming from an actual cause.


              Coefficients  Standard Error  t Stat    P-value
Intercept     0.822036      0.190689        4.310882  3.88E-05
HighSchl GPA  0.565491      0.087834        6.438194  4.49E-09

              Coefficients  Standard Error  t Stat    P-value
Intercept     1.070205      0.255958        4.181178  6.31E-05
Letters       0.175394      0.047408        3.699697  0.000356

              Coefficients  Standard Error  t Stat    P-value
Intercept     0.151892      0.307993        0.493169  0.622997
SAT           0.001802      0.000297        6.070509  2.42E-08


Applications

• Q: What does each dot represent?

• A: A patient.

• Q: What is the interpretation (if any) of the number 45.332?

• A: With a near-zero fat diet, the age of first cancer detection is approximately 45.332 years.

• Q: What is the interpretation of the number -0.0106?

• A: For each additional 100 g of daily fat consumption, the age of first cancer detection drops by 1.06 years.

• Q: Is this potentially good news or bad news for the fight against cancer?

• A: Potentially good news.

• Q: The P-value associated with the slope is 0.291. Is this good news or bad news?

• A: Bad news. The P-value is too high, the test is insignificant and the observed negative slope is likely a fluke.

• Q: Do our intuition and our hope agree with the data?

• A: No, we hope that the US economy will grow stronger and for a long time.

• Q: What did the P-value do here?

• A: It is fairly low (0.026), which means that the result is statistically significant and the observed negative trend is likely not a fluke. People might not like the bad news, but numbers do not lie. The negative trend is very likely correctly detected.

• Q: Does our intuition agree with the data?

• A: Yes, it is expected that a car with higher mileage will have more wear and tear and that its efficiency will decrease.

• Q: What did the P-value do here?

• A: It is very low (0.00003), so it confirmed our intuition. Downside: it is an expected result.


Using SAT.txt data, we perform multiple regression with variable Y= College GPA , while X contains all the remaining three variables.


Application: The P-values related to High School GPA and SAT are very small (0.0014 and 0.0001 respectively), but the P-value associated with Letters is 0.6574, which is too large. Remember, the acceptable cutoff points are either 5% (0.05) or 1% (0.01), both much smaller than the observed 0.65.

Conclusion: The P-value associated with Letters is too large, and the regression analysis indicates that we need not keep Letters as a predicting variable. Since in statistics one often hears “the smaller model is a better model”, we should redo the regression, but this time using only the High School GPA and SAT scores.


Using the Tornadoes.txt data, the task is to use the months of May and June to predict the tornado activity in August. Remember the question related to May and the negative coefficient:

Q. What is the interpretation of the negative coefficient -0.03395?
A. The coefficient is negative, which means that for each observed tornado in May one expects a reduction of tornadoes in August. More precisely, for 100 observed tornadoes in May one expects a decrease of about 3 tornadoes in August.

Observe the P-value related to May! It is 0.59, which is much larger than the acceptable cutoff point of 0.05. In comparison, the P-value associated with June is very good: 0.004. Conclusion: May should be discarded! Its P-value is too large, indicating that its predictive value is not significant.


Housing and the P-Values. Remember that the Y-variable is the house price in $100,000 and the X-variables are as presented.

One can see that, out of the 4 variables, #FirePlace has a P-value that is too large (0.16 > 0.05). Thus we can conclude that, in order to predict the house price, it would be advisable to drop the variable related to fireplaces and use the remaining three. Their P-values are acceptable:
Bathroom P-value = 0.00059 < 0.05
Rooms P-value = 0.038 < 0.05
Age P-value = 0.009 < 0.05

Q. The slope next to Age is negative but the P-value is very good. Is this something to be concerned about? What does it mean?
A. No, the negative coefficient is expected (the older the house, the lower the price, which makes sense).
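The P-value screening used for the SAT, tornado and housing tables is the same mechanical step each time: compare every variable's P-value with the chosen cutoff. A minimal sketch, in which the function name `split_by_p_value` is hypothetical and the P-values are the ones quoted in the text:

```python
def split_by_p_value(p_values, cutoff=0.05):
    """Split variable names into (keep, drop) lists by comparing P-values to a cutoff."""
    keep = [name for name, p in p_values.items() if p < cutoff]
    drop = [name for name, p in p_values.items() if p >= cutoff]
    return keep, drop

house = {"Bathroom": 0.00059, "Rooms": 0.038, "Age": 0.009, "FirePlace": 0.16}
sat = {"High GPA": 0.0014, "SAT": 0.0001, "Letters": 0.6574}

house_5pct = split_by_p_value(house)               # 5% cutoff
house_1pct = split_by_p_value(house, cutoff=0.01)  # stricter 1% cutoff
sat_5pct = split_by_p_value(sat)

print("House, 5% cutoff:", house_5pct)
print("House, 1% cutoff:", house_1pct)
print("SAT,   5% cutoff:", sat_5pct)
```

Note that the choice of cutoff matters: at the 1% level the variable Rooms (P-value 0.038) would no longer pass. After dropping variables, the regression has to be rerun, because the remaining coefficients and P-values change.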


Discussion

• A large P-value in the table does not mean that this particular variable has no influence. In fact, if one revisits the chart Letters vs GPA, one can see that the letters are definitely related to the GPA.

• So what happened?

• Well, most often the other variables capture the dependency better than the one with the bad P-value.

• In this case we do not need the variable “Letters” since its influence on GPA has been explained by High School GPA and SAT.

• Stat Jargon: Often in multiple regression we say that the influence of variables with high P-values is explained by other variables.

• Examples:

• 1) Letters (on their own) influence GPA, but put together with High School GPA and SAT, their influence diminishes.

• 2) Tornadoes: the influence of May has been explained by June (so we do not need May since we have June).

• 3) House prices: We do not need the variable “fireplaces” since the other variables explained its possible influence.

Multiple Regression serves as a selection model. One uses P-values to pick, amongst many variables, those that influence the output the most.


Important Numbers

P-Value: Tells us if a particular variable is significantly related to the Y-variable. It does not tell us if this relation is strong or not.

R-square: Tells us how strong the relation between the input and output variables is (the input could be multi-dimensional). It does not tell us if this relation is significant or not.

Coefficients: Tell us how much the predicted variable Y will change if a particular input variable increases by one unit.

Standard Error: Tells us what the approximate error of our prediction is.


Task 4. We use the same two tables as in Task 2: the regression table, where the Y-variable is the Total Tornado numbers, and the descriptive statistics for the column Total Tornado Numbers.

• Q1. If Variable 3 increases by 10 units and Variable 4 decreases by 10 units, based on these tables, how would the predicted Total Tornado number change?

• A1. The total number of tornadoes will change by approximately 9.3 - 9.7 = -0.4, that is, decrease by about 0.4.

• Q2. Which variable(s) should be excluded from the study and deemed insignificant?

• A2. Variables 1 & 2 should be excluded since their P-values (0.77 and 0.12) are too large.


• Q3. Which number describes the variation of the variable, Total Tornadoes, the best?

• A3. Standard Deviation = 244.79.

• Q4. Once a prediction is made, one number tells us if the overall connection between the predicting Variables 1,2,3,4,5 and Total Tornadoes is strong. What would be this number?

• A4. R-square = 0.89.

• Q5. Based on these tables , what was the lowest recorded Total Tornadoes number?

• A5. Minimum = 201.

• Q6. If you need to pick two variables and increase their value by one unit, which two would you pick so that the prediction changes by the least amount?

• A6. Variables 1 & 2, since their increase would yield the smallest change: a 0.27 + 0.68 = 0.95 increase in total tornadoes.

• Q7. If you need to pick two variables and increase their value by one unit, which two would you pick so that the prediction changes by the largest amount?

• A7. Variables 4 & 5, since their increase would yield the largest change: a 0.97 + 2.07 = 3.04 increase in total tornadoes.
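The Task 4 questions about changing several variables at once reduce to adding up coefficient times change. The sketch below does this for every pair of variables, using the coefficients from the tornado regression table. The names `predicted_change` and `pair_effect` are illustrative only.

```python
from itertools import combinations

# Coefficients of X Variable 1..5 from the tornado regression table.
COEF = {1: 0.273618, 2: 0.675613, 3: 0.93366, 4: 0.974585, 5: 2.067821}

def predicted_change(deltas):
    """Change in predicted Total Tornadoes for input changes given as {variable: delta}."""
    return sum(COEF[var] * delta for var, delta in deltas.items())

# Effect of raising each possible pair of variables by one unit.
# All coefficients are positive, so the smallest sum is also the smallest change.
pair_effect = {pair: predicted_change({pair[0]: 1, pair[1]: 1})
               for pair in combinations(COEF, 2)}
least = min(pair_effect, key=pair_effect.get)
most = max(pair_effect, key=pair_effect.get)
q1_change = predicted_change({3: 10, 4: -10})  # Task 4, Q1

print(f"Least change: variables {least}, +{pair_effect[least]:.2f}")
print(f"Most change:  variables {most}, +{pair_effect[most]:.2f}")
print(f"Q1 change:    {q1_change:.2f}")
```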

Lab 6 Follow-up

The task was: Use the US crime data and extract 4 columns of data for Florida: Murder, Robbery, Larceny and Burglary. Use multiple regression to predict the Murder count based on the other three variables.

Warning: While cutting and pasting the data, be sure that the three X-input columns (Robbery, Larceny and Burglary) are next to each other. This is needed in order to perform the regression in Excel.

Q1. What is the standard error and what would be a short (1-sentence) interpretation of this number?
A: 79.88. Interpretation: After predicting the number of murders we expect an error of about +/- 79.88.

Q2. What is the coefficient next to Burglary and what would be a short (1-sentence) interpretation of this number?
A: 0.0063. Interpretation: For each additional Burglary the predicted number of murders increases by about 0.006.

Q3. Suppose in the year 2001 we had 4000 Robberies, 100000 Larcenies and 100000 Burglaries. What would be your prediction for the murder count in 2001?
A: 342.5618 + 4000*(-0.00157) + 100000*0.006396 + 100000*(-0.00104) = 871.88


Lab 6 Follow-up

Use the Water Data and, via multiple regression, select the two variables that predict the Water usage level the best. Make another table with these two variables and answer the questions.

Hint: In the first regression table one of the three variables has too large a P-value (the temperature has P-value = 0.176, which is larger than 0.05). Thus we need to redo the regression using only the remaining two variables.

Q1. One of the coefficients should be negative. What is its value and what is the interpretation (1-sentence answer)?
A: The coefficient of persons, -21.56. Interpretation: For each additional person the predicted water usage drops by 21.56 gallons.

Q2. Which of the two variables has a better P-value and what is this P-value?
A: Production, 0.0026.

Q3. If the production is 10000, the number of people is 100 and the temperature is 70, what is the predicted Water usage in gallons?
A: (A bit tricky, since the variable temperature is not used.) 4600 + 10000*0.2 - 100*21.5 = 4450.

Q4. Based on the table, would you characterize the Regression fit and the prediction as Poor, Good, Very Good or Excellent?
A: Good, according to the breakdown of R-square:

from 0 to 0.2 (poor)

from 0.2 to 0.4 (decent)

from 0.4 to 0.6 (good)

from 0.6 to 0.85 (very good)

from 0.85 to 1 (excellent)
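The water prediction in A3 can be sketched in Python to show why the temperature of 70 plays no role: only the variables kept in the final model carry a coefficient. The intercept and coefficients are the rounded values used in A3; the function and dictionary names are assumptions made for this illustration.

```python
# Rounded values taken from answer A3 of the water example.
INTERCEPT = 4600.0
COEFFICIENTS = {"production": 0.2, "persons": -21.5}


def predict_water_usage(inputs):
    """Apply the two-variable model; inputs without a coefficient are ignored."""
    return INTERCEPT + sum(
        coef * inputs[name] for name, coef in COEFFICIENTS.items()
    )


# Temperature is supplied but not used, because it was dropped from the model.
usage = predict_water_usage({"production": 10000, "persons": 100, "temperature": 70})
print(round(usage, 2))
```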


Lab 6 Follow-up

Upload the CO2 Data and, via multiple regression, select the two variables that predict the CO2 level with the best P-values. Make another table with these two variables and answer the questions:

Q1. What are the two selected variables?
A: Traffic and Wind.

Q2. Which of these two variables has the better P-value and what is the value?
A: Traffic, P-value = 6.85E-12. (A bit tricky: some students are confused by the “E-12” notation.)

Q3. Based on the table, would you characterize the Regression fit and the prediction as Poor, Good, Very Good or Excellent?
A: Excellent (R-square is almost 1).

Q4. What is the interpretation of the number 0.1747?
A: For a one unit increase in wind the predicted CO2 increases by 0.1747 units. (Sometimes we do not know what the “units” are!)

Q5. What is the interpretation of the number 1.27?
A: When there are no (zero) units of Traffic and Wind, the CO2 level is predicted to be 1.27 units.

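Since the “E-12” notation in the Traffic P-value causes confusion, here is a short Python sketch of what it means: 6.85E-12 is 6.85 multiplied by 10 to the power -12. The variable names are chosen for this illustration only.

```python
# P-value for Traffic as printed by Excel in scientific ("E") notation.
p_value_text = "6.85E-12"

p_value = float(p_value_text)       # Python reads E-notation directly
as_decimal = f"{p_value:.14f}"      # the same number written out in full

print(as_decimal)                   # 11 zeros after the decimal point, then 685
print(p_value < 0.05)               # far below the usual 5% cutoff
```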


The two tables are related to the Y variable, DEALER’S COST. The first is a regression table with 4 variables predicting the DEALER’S COST, while the second table contains the summary statistics of the DEALER’S COST data. Answer the questions (choose the most appropriate answer):

• Q1. If Variable 3 increases by 10 units, according to the table what would be your prediction?

• A1. The DEALER’S COST will increase by 1731.4 units (10*173.1412).

• Q2. If all 4 variables increase by 1 unit, according to the table what would be your prediction?

• A2. The DEALER’S COST would increase by approximately 598 units (the sum of the four coefficients, 597.97).

• Q3. Which two numbers are designed to describe the center of the variable, DEALER’S COST? (numbers are rounded)

• A3. 27386 & 23792 (the mean and the median)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.858877
R Square             0.73767
Adjusted R Square    0.733088
Standard Error       7532.052
Observations         234

ANOVA
              df     SS          MS          F           Significance F
Regression    4      3.65E+10    9.13E+09    160.9866    2.45E-65
Residual      229    1.3E+10     56731804
Total         233    4.95E+10

                Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%    Lower 95.0%     Upper 95.0%
Intercept       208.9385        9717.719          0.021501    0.982865    -18938.6     19356.51     -18938.6334     19356.51
X Variable 1    -2838.32        1714.461          -1.65552    0.099189    -6216.46     539.8147     -6216.456565    539.8147
X Variable 2    3358.434        966.9237          3.473318    0.000615    1453.229     5263.638     1453.22915      5263.638
X Variable 3    173.1412        14.67861          11.79548    2.11E-25    144.2188     202.0636     144.2188475     202.0636
X Variable 4    -95.2884        60.25432          -1.58144    0.115158    -214.012     23.4353      -214.0121702    23.4353

DealerCost (descriptive statistics)
Mean                  27385.77
Standard Error        953.062
Median                23792
Mode                  14207
Standard Deviation    14579.05
Sample Variance       2.13E+08
Kurtosis              8.119432
Skewness              2.148112
Range                 109725
Minimum               9875
Maximum               119600
Sum                   6408271
Count                 234
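Answers A1 and A2 both rest on the same idea: the change in the prediction is the sum of each coefficient times the change in its variable. A minimal Python sketch, with the coefficients copied from the regression table and a helper name (`prediction_change`) invented for this illustration:

```python
# Coefficients copied from the DealerCost regression table (X Variable 1..4).
coefficients = [-2838.32, 3358.434, 173.1412, -95.2884]


def prediction_change(deltas):
    """Change in predicted DEALER'S COST when variable i changes by deltas[i]."""
    return sum(c * d for c, d in zip(coefficients, deltas))


change_q1 = prediction_change([0, 0, 10, 0])   # Variable 3 up by 10 units
change_q2 = prediction_change([1, 1, 1, 1])    # all four variables up by 1 unit

print(round(change_q1, 1))
print(round(change_q2, 1))
```

The intercept does not appear because it cancels when two predictions are subtracted.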


• Q4. Which number should be used in order to answer the question: Is Variable 2 related to the variable, DEALER’S COST?

• A4. 0.0006 (the P-value of Variable 2)

• Q5. Once a prediction is made, we often provide a range where we believe the true value will lie. What number is used to construct this range?

• A5. 7532 (the Standard Error of the regression)

• Q6. Which number describes the variation of the variable, DEALER’S COST, the best? (numbers are rounded)

• A6. 14579 (the Standard Deviation)
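The two tables each contain a line called “Standard Error”, and A5 and A6 depend on not mixing them up. The sketch below, in Python, recomputes a few entries of the descriptive-statistics table from the others. It suggests that the 953.062 in that table is the standard error of the mean (standard deviation divided by the square root of the count), which is a different quantity from the regression Standard Error of 7532.052 used in A5. The variable names are chosen for this illustration.

```python
import math

# Values copied from the DealerCost descriptive-statistics table.
count = 234
total = 6408271
std_dev = 14579.05
minimum, maximum = 9875, 119600

mean = total / count                              # table shows 27385.77
std_error_of_mean = std_dev / math.sqrt(count)    # table shows 953.062
value_range = maximum - minimum                   # table shows 109725
sample_variance = std_dev ** 2                    # table shows 2.13E+08

print(round(mean, 2), round(std_error_of_mean, 3), value_range)
print(f"{sample_variance:.2E}")
```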


One more example:

• Upload the Tornado data and focus on the tornado number files (not the tornado death related files). Use regression with the months January, February, March, April and May to predict the number of tornadoes in October. Next, pick the two variables with the best P-values and create a new table, this time using only these two variables to predict the number of tornadoes in October.

• Q: What are the two variables you picked before making the new table?

• Q: After getting the new table, choose the month with the larger coefficient and describe its interpretation.

• Q: How good is the fit in the new table, and why?

• Q: Which of the two variables has a better P-value in the new table?


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.559511
R Square             0.313053
Adjusted R Square    0.224982
Standard Error       12.93656
Observations         45

ANOVA
              df    SS        MS        F         Significance F
Regression    5     2974.4    594.88    3.5546    0.0096
Residual      39    6526.8    167.35
Total         44    9501.2

             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept    8.217164        6.1647            1.3329    0.1903     -4.252       20.686       -4.252         20.686
Jan          -0.119195       0.1899            -0.628    0.5339     -0.503       0.2649       -0.503         0.2649
Feb          0.26924         0.1285            2.0952    0.0427     0.0093       0.5292       0.0093         0.5292
Mar          -0.07443        0.0641            -1.161    0.2528     -0.204       0.0553       -0.204         0.0553
Apr          0.170439        0.0421            4.0482    0.0002     0.0853       0.2556       0.0853         0.2556
May          -0.01414        0.0326            -0.434    0.6667     -0.08        0.0518       -0.08          0.0518

Q: Which two variables have the best P-values that you would pick?


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.5146
R Square             0.2648
Adjusted R Square    0.2298
Standard Error       12.897
Observations         45

ANOVA
              df    SS        MS        F        Significance F
Regression    2     2515.8    1257.9    7.563    0.0016
Residual      42    6985.4    166.32
Total         44    9501.2

             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept    4.9464          5.3274            0.9285    0.3585     -5.805       15.698       -5.805         15.698
Feb          0.2095          0.1224            1.7116    0.0943     -0.038       0.4566       -0.038         0.4566
Apr          0.1393          0.0368            3.7862    0.0005     0.0651       0.2136       0.0651         0.2136

Q: After getting the new table, choose the month with the larger coefficient and describe its interpretation.

Q: How good is the fit in the new table, and why?

Q: Which of the two variables has a better P-value in the new table?


Q1: Which two variables have the best P-values that you would pick?
A1: February and April (P-values 0.0427 and 0.0002, the only two below 0.05).
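The selection step in A1 amounts to sorting the months by P-value and keeping the two smallest. A minimal Python sketch, with the P-values copied from the five-month regression table and a helper name (`best_predictors`) invented for this illustration:

```python
# P-values copied from the five-month regression table (Y = October tornadoes).
p_values = {"Jan": 0.5339, "Feb": 0.0427, "Mar": 0.2528, "Apr": 0.0002, "May": 0.6667}


def best_predictors(p_values, how_many=2):
    """Names of the variables with the smallest P-values, best first."""
    return sorted(p_values, key=p_values.get)[:how_many]


selected = best_predictors(p_values)
print(selected)
```

After this step the regression has to be run again with only the selected months, because the coefficients and P-values change once the other variables are removed.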


Q2: After getting the new table, choose the month with the larger coefficient and describe its interpretation.
A2: Feb = 0.2095. For 10 additional sightings of tornadoes in February there is an expected increase of about 2 tornadoes in October.

Q3: How good is the fit in the new table, and why?
A3: R^2 = 0.2648, which is decent according to the R^2 table.

Q4: Which of the two variables has a better P-value in the new table?
A4: April (0.0005, versus 0.0943 for February).
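The R^2 quoted in A3 is not an independent number: it can be recovered from the ANOVA block of the two-variable table, as can the Adjusted R Square, the F statistic and the Standard Error. The Python sketch below uses the standard textbook formulas with values copied from that table; the variable names are chosen for this illustration.

```python
# ANOVA values copied from the two-variable (Feb, Apr) regression table.
ss_regression, ss_residual, ss_total = 2515.8, 6985.4, 9501.2
df_regression, df_residual = 2, 42
n_observations = 45

# Share of the total variation explained by the regression.
r_square = ss_regression / ss_total
# Penalises R-square for the number of input variables used.
adjusted_r_square = 1 - (1 - r_square) * (n_observations - 1) / df_residual
# Ratio of the two mean squares (MS column).
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)
# Square root of the residual mean square.
standard_error = (ss_residual / df_residual) ** 0.5

print(round(r_square, 4), round(adjusted_r_square, 4))
print(round(f_statistic, 3), round(standard_error, 3))
```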


One dimensional regression table

Take the House price data and consider the two variables X = number of rooms and Y = number of bedrooms. Make a chart and create the appropriate fitted line.

DISCUSSION: The chart looks “good”; the slope is positive and the R-square is very good.

Q1: What is the interpretation of the number 0.5697 (the slope)?
A: For each additional room in a house one gets about 0.57 bedrooms.

Q2: What is the statistical interpretation of the number -0.5193 (the intercept)?
A: No reasonable interpretation: a house with 0 rooms would be predicted to have a negative number of bedrooms.

Q3: How good is the fit?
A: Excellent, since R-square is very high.

Q4: How certain are you that the number of rooms and bedrooms are related?
A: You cannot answer this question without the P-value!


To answer Q4, perform the Regression analysis and observe that the P-value is very low: it is less than 1%. Thus we can say with more than 99% certainty that there is a connection between the number of rooms and the number of bedrooms.

Warning: The P-value does not tell us how strong this connection is (R-square does that). It only tells us how certain we are that there is a connection.
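For readers who want to see where a fitted line and its R-square come from, here is a minimal least-squares sketch in plain Python. The data points are hypothetical, invented only for this illustration; they are NOT the House price data from the lab, so the slope and intercept differ from the 0.5697 and -0.5193 discussed above. The sketch does not compute the P-value, which needs the regression tool (or a statistics library).

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus R-square."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_square = 1 - ss_residual / ss_total
    return slope, intercept, r_square


# Hypothetical (rooms, bedrooms) pairs, made up for illustration only.
rooms = [4, 5, 6, 7, 8]
bedrooms = [2, 2, 3, 4, 4]

slope, intercept, r_square = fit_line(rooms, bedrooms)
print(round(slope, 4), round(intercept, 4), round(r_square, 4))
```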


One dimensional regression table

Let us try a similar analysis, but now use X = Age and Y = number of bedrooms. Make a chart and create the appropriate fitted line.

DISCUSSION: The chart does not look “good”: the slope is positive but almost zero, and the R-square is not very good.

Q1: What is the interpretation of the number 0.0055 (the slope)?
A: For each additional year of age a typical house has about 0.005 extra bedrooms. Put another way: older houses seem to have more bedrooms, about 0.005 more for each year of age.

Q2: What is the statistical interpretation of the number 3.085 (the intercept)?
A: A brand new house (0 years old) would have roughly 3 bedrooms (on average).

Q3: How good is the fit?
A: Very bad, since R-square is very low.

Q4: How certain are you that the age and bedrooms are related?
A: You cannot answer this question without the P-value!


To answer Q4, perform the Regression analysis and observe that the P-value is higher than 5%: it is 0.588, which is bigger than 0.05. Hence we cannot statistically determine whether the number of bedrooms and the age of the house are related. The statistical test is inconclusive.


One dimensional regression table

Let us try a similar analysis, but now use X = Age and Y = selling price (in $100000's). Make a chart and create the appropriate fitted line.

DISCUSSION: The chart looks fairly “good”, that is, just about OK.

Q: What does the slope of -0.073 tell us in more detail?
A: For each additional year, the house gets cheaper by 0.073*$100000 = $7300.

Q: Is this a reasonable conclusion?
A: Yes, it is reasonable: the older the house, the lower the selling price.

The question is, can one statistically substantiate such a claim: “The older the house, the lower the selling price”? We need to perform the Regression Analysis.


Interpretation: After performing the Regression analysis we see that the P-value is too high (it is 0.074, which is larger than the acceptable 0.05). Consequently we cannot statistically determine whether the age of the house influences the selling price.

Warning: In these types of situations students often (mistakenly) offer a stronger statement like: we can determine that age does not influence the price. No! You cannot determine this! All you can say is that we do not have enough evidence to detect the connection between age and price. In other words, it is still possible that the age influences the price, but given this data we cannot confirm such a statement; hence the test is inconclusive.
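The wording of the conclusion follows a simple rule that can be written down as a short Python sketch. The 0.05 cutoff is the one used in these slides; the function name and the exact return strings are assumptions made for this illustration. Note that the second branch says "inconclusive", never "no relation".

```python
ALPHA = 0.05  # the 5% cutoff used throughout these slides


def slope_test_conclusion(p_value, alpha=ALPHA):
    """Wording of the conclusion for the P-value of a slope."""
    if p_value < alpha:
        return "significant: the data show a relation between X and Y"
    return "inconclusive: not enough evidence to detect a relation"


# P-values quoted in the examples above.
examples = [
    ("Age vs bedrooms", 0.588),
    ("Age vs selling price", 0.074),
    ("April vs October tornadoes", 0.0005),
]
for label, p in examples:
    print(label, "->", slope_test_conclusion(p))
```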


To summarize (Regression):

• So far we have analyzed situations where one (or more) input variables X influenced the output variable Y. Example: The length of the sailboat (X) clearly influences the weight of the boat (Y). Regression analysis is designed to statistically describe this relation, to help us relate the variables X and Y. Along the way we have learned a few important quantities:

• P-value (related to the slope) tells us if there is a significant relation or not. For example, the P-value between the length of the sailboat and its weight is 6.41E-09, indicating at least 99.99999% certainty that there is a connection between the length of the sailboat and its weight.

• In multiple regression, P-values are used as a selection tool. Earlier example: We only needed High School GPA and SAT in order to predict the College GPA. In the multiple regression Letters had a high P-value and we discarded it.

• R (and R-square) tells us how strong the relation between X and Y is. Letters vs GPA have an R-square of only 0.122, and we can conclude that although we are almost 100% sure (due to the small P-value) that letters are related to GPA, this relation is very weak. On the other hand, the sailboat example produces a very good R-square = 0.85, and we can conclude that X (i.e. length) is very much related to Y (i.e. weight).

Important to remember: The P-value does not tell us much about the strength of the relation between X and Y. A small P-value only indicates that we are very certain that there is a relation between X and Y, not how strong this relationship is! Case in point: Letters vs GPA have an excellent P-value of 0.00035, but letters are not very good predictors of College GPA (i.e. the relationship between X and Y is weak).


• Coefficients are used to compute the actual prediction. For a sailboat that is X feet long, we can predict its weight to be 1.012X - 18.06 thousand pounds. Moreover, the coefficients can be used to determine individual contributions. In an earlier example we found that for one additional observed Robbery we expect to see 0.027 additional murders.

• Standard error is used to assist the prediction. Namely, a boat that is 40 feet long has a predicted weight of about 22.452 thousand pounds, or 22452 pounds, but it would be absurd to claim that every 40-foot-long sailboat will always have this exact weight. Instead, we use the Standard error to produce an interval approximation. In this case the error is 4.4 and the interval prediction is [22.452 - 4.4, 22.452 + 4.4] = about [18, 27] thousand pounds.
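The interval approximation from the last bullet can be sketched in Python as "prediction plus or minus one standard error", which is the rough rule used in these slides. The function name is an assumption made for this illustration; the two input numbers are the sailboat values quoted above.

```python
def interval_prediction(prediction, standard_error):
    """Rough interval from the slides: prediction +/- one standard error."""
    return prediction - standard_error, prediction + standard_error


# 40-foot sailboat: predicted weight 22.452 thousand pounds, standard error 4.4.
low, high = interval_prediction(22.452, 4.4)
print(round(low, 3), round(high, 3))   # roughly 18 and 27 thousand pounds
```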


Breakdown of R^2

from 0 to 0.2 (poor)
from 0.2 to 0.4 (decent)
from 0.4 to 0.6 (good)
from 0.6 to 0.85 (very good)
from 0.85 to 1 (excellent)
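The breakdown can be turned into a small lookup, sketched here in Python. The table does not say which label applies exactly at a boundary such as 0.2 or 0.85, so this sketch assumes a boundary value belongs to the higher category; that choice, like the function name, is an assumption of this illustration.

```python
def describe_fit(r_square):
    """Label an R-square value using the breakdown table from the slides."""
    if not 0 <= r_square <= 1:
        raise ValueError("R-square must lie between 0 and 1")
    if r_square < 0.2:
        return "poor"
    if r_square < 0.4:
        return "decent"
    if r_square < 0.6:
        return "good"
    if r_square < 0.85:
        return "very good"
    return "excellent"   # assumption: a value of exactly 0.85 counts as excellent


# R-square values taken from the regression tables in this lecture.
for value in (0.2648, 0.73767, 0.887552):
    print(value, "->", describe_fit(value))
```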