1
Chapter 17
• Linear regression is a procedure that identifies the relationship between independent variables and a dependent variable.
• This relationship helps reduce the unexplained variation in the dependent variable's behavior, thus providing better predictions of its future values.
2
The Simple linear regression model
• The model is: y = β0 + β1x + ε
3
The Simple linear regression model
• The model is: y = β0 + β1x + ε
• We try to estimate the deterministic part of it by developing the line with the best fit.
• Best fit is defined as the minimum sum of squared errors.
• An error is the difference between the line value and the actual value for a given x.
4
The Simple linear regression model
• The analysis yields a prediction equation of the form ŷ = b0 + b1x, where
– b0 is an estimate of β0
– b1 is an estimate of β1
– ŷ is the predicted value of y for a given x.
5
• Are the costs of welding machine breakdowns related to their age?
• From the data answer the following:
– Find the sample regression line.
– What is the coefficient of determination? Interpret.
– Are machine age and monthly repair costs linearly related?
– Is the fit good enough to use the model to predict the monthly repair costs of a 120-month-old machine?
– Make the prediction.
Problem 1
6
• Find the sample regression line.
• To calculate b0 and b1 by hand we can use:
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
b0 = ȳ - b1x̄
• Instead we'll use Excel to find the covariance of x and y, the variance of x, and the means of x and y, then use the following formulae.
Problem 1
7
Problem 1
b1 = cov(x, y) / s²x
b0 = ȳ - b1x̄

To calculate the covariance of x and y in Excel use the function COVAR(x, y) and multiply the result by n/(n-1).
To calculate s²x and s²y use Data Analysis Plus > Descriptive Statistics.
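The covariance/variance formulae above can be sketched in a few lines of Python (NumPy's `np.cov` and `np.var` with `ddof=1` give the sample statistics). The data below are made up for illustration; they are not the machine data set.

```python
import numpy as np

# Toy stand-in for the machine-age / repair-cost data (illustrative only)
x = np.array([100.0, 110, 120, 130, 140])   # age (months)
y = np.array([350.0, 380, 400, 430, 460])   # monthly repair cost ($)

# b1 = cov(x, y) / s_x^2 ; b0 = y-bar - b1 * x-bar
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
```

With this toy sample the fit is y = 80 + 2.7x; with the problem's data the same two lines reproduce the Excel-based calculation.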
8
Problem 1
• From Excel we get:
– Cov(age, cost) = 936.82
– Mean age (x̄) = 113.35; s²x = 378.77
– Mean cost (ȳ) = 395.21; s²y = 4094.79
– b1 = cov(x, y)/s²x = 936.82/378.77 = 2.4733
– b0 = ȳ - b1x̄ = 395.21 - 2.4733(113.35) = 114.86
The regression line: Cost = 114.86 + 2.4733 Age (months)
9
Problem 1
• Coefficient of determination:
r² = SSR/SST = [Cov(x, y)]² / (s²x s²y)
– In this case:
r² = (936.82)² / [(378.77)(4094.79)] = .5659
– 56.59% of the variation in costs is explained by the model (that is, by the age variation).
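The r² identity above can be checked the same way (toy data again, not the repair-cost sample):

```python
import numpy as np

# Illustrative data only
x = np.array([100.0, 110, 120, 130, 140])
y = np.array([350.0, 380, 400, 430, 460])

# r^2 = cov(x, y)^2 / (s_x^2 * s_y^2)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r2 = cov_xy ** 2 / (np.var(x, ddof=1) * np.var(y, ddof=1))
```

The same value comes out of squaring the sample correlation coefficient, which is what Excel's R Square reports.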
10
Problem 1
• Is there a linear relationship between monthly costs and machine age?
• If β1 ≠ 0 there is a linear relationship between x and y.
• Thus, we test the coefficient β1:
H0: β1 = 0
H1: β1 ≠ 0
11
Problem 1
We use a t-statistic: t = (b1 - β1) / s_b1

In this case:
– t = [2.47 - 0]/.5106 = 4.837
– The rejection region is t > tα/2 or t < -tα/2 (a two-tail test) with n-2 degrees of freedom. For α = .05, t.025,18 = 2.10.
– Since 4.837 > 2.10, the null hypothesis is rejected at the 5% significance level. There is a linear relationship between monthly cost and machine age.

Comment: s_b1 can be calculated by
s_b1 = sε / √[(n-1)s²x] = 43.3185 / √[(20-1)(378.77)] = .5106
But in this course we'll use the calculated value from the Excel run. See the printout in the next slide.
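The slope t-statistic and the s_b1 formula can be sketched end to end; the data are the same illustrative toy sample used earlier, so the numbers differ from the problem's.

```python
import numpy as np

# Illustrative data only (not the machine data set)
x = np.array([100.0, 110, 120, 130, 140])
y = np.array([350.0, 380, 400, 430, 460])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
s_eps = np.sqrt((resid ** 2).sum() / (n - 2))        # standard error of estimate
s_b1 = s_eps / np.sqrt((n - 1) * np.var(x, ddof=1))  # s_b1 formula from the slide
t = (b1 - 0) / s_b1                                  # test statistic for H0: beta1 = 0
```

With the real data this computation reproduces the t Stat column of the Excel printout.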
12
Problem 1

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.752234
R Square             0.565856
Adjusted R Square    0.541737
Standard Error       43.31848
Observations         20

ANOVA
            df    SS        MS        F         Significance F
Regression  1     44024.24  44024.24  23.46094  0.00013
Residual    18    33776.83  1876.491
Total       19    77801.07

           Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept  114.8525      58.68544        1.957086  0.06603  -8.44117   238.1461
Age        2.47334       0.510636        4.843649  0.00013  1.400533   3.546146

Here we perform the t-test for β1 using the p-value (read from the printout). The p-value < alpha.
13
Problem 1
• We need to forecast the expected cost for a 120-month-old machine.
• The equation provides a point prediction:
Cost = 114.86 + 2.4733(120) = $411.65
The prediction interval (use Data Analysis Plus, select the "Prediction Interval" option, observe the results under "Prediction Interval"): Lower Confidence Level = $318.12; Upper Confidence Level = $505.18.
• What's the prediction for the average monthly repair cost for all machines 120 months old?
To answer this question construct the confidence interval (notice, not the prediction interval!). Again, use Data Analysis Plus, select the "Prediction Interval" option, observe the results under "Interval Estimate of Expected Value".
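As a cross-check, the standard simple-regression prediction-interval formula, plugged with the summary numbers quoted in this problem, reproduces the Data Analysis Plus interval to within rounding. This is a sketch assuming the usual formula; the critical value t(.025, 18) ≈ 2.101.

```python
import math

# Summary numbers from this problem's slides
n, xbar, sx2 = 20, 113.35, 378.77
s_eps = 43.3185                 # standard error of estimate
b0, b1 = 114.86, 2.4733
t_crit = 2.101                  # t(.025, 18)
xg = 120                        # age of the machine to predict for

y_hat = b0 + b1 * xg
half = t_crit * s_eps * math.sqrt(1 + 1 / n + (xg - xbar) ** 2 / ((n - 1) * sx2))
lower, upper = y_hat - half, y_hat + half   # close to the $318-$505 interval above
```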
14
Chapter 18
• The multiple regression model allows more than one independent variable to explain the values of the dependent variable.
• We assess the model as before using:
– t-test for linear relationships between the independent variables and the dependent variable (tested one at a time)
– F-test for the overall usefulness of the model
– Coefficient of determination for the fit.
15
Problem 2
• When a company buys another company it is not unusual that some workers are terminated.
• A buyout contract between Laurier Comp and Western Comp required that Laurier provide a severance package to fired Western workers equivalent to the packages offered to Laurier workers.
• It is suggested that severance is determined by three factors: age, length of service, and pay.
• Bill Smith, a Western employee, is offered 5 weeks of severance pay when his employment is terminated.
• Based on the data provided by Laurier about severance offered to 50 of its employees in the past, answer the following questions:
16
• Determine the regression equation. Interpret the coefficients.
• Comment on how well the model fits the data.
• Do all the independent variables belong in the model?
• Does Laurier meet its obligation to Bill Smith?
The results and analysis appear in the Excel data file "Laurier".
Problem 2 - continued
17
Problem 3
• A linear regression model for life longevity:
– Insurance companies are interested in predicting the life longevity of their customers.
– Data for 100 deceased male customers were collected, and a regression model run.
– The model studied was:
Longevity = β0 + β1MotherAge + β2FatherAge + β3GrandM + β4GrandF + ε
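A least-squares fit of a multiple-regression model like this one can be sketched with `np.linalg.lstsq`. The data below are a toy construction with an exact linear response, not the insurance sample; the coefficient values are chosen only for the demo.

```python
import numpy as np

# Design matrix: intercept column, MotherAge, FatherAge (toy values)
X = np.array([[1.0, 70, 72],
              [1.0, 75, 68],
              [1.0, 80, 74],
              [1.0, 65, 80],
              [1.0, 78, 77]])
true_beta = np.array([3.0, 0.45, 0.41])   # illustrative coefficients
y = X @ true_beta                          # response built to fit exactly

# Least-squares estimates b0, b1, b2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the toy response is exactly linear, the estimates recover the coefficients exactly; with noisy data they would be the least-squares estimates Excel reports.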
18
Problem 3

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.860843
R Square             0.74105
Adjusted R Square    0.730147
Standard Error       2.664075
Observations         100

ANOVA
            df    SS        MS        F         Significance F
Regression  4     1929.517  482.3792  67.96663  4.86E-27
Residual    95    674.243   7.097295
Total       99    2603.76

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  3.243821      5.423412        0.598114  0.551187  -7.52301   14.01065
Mother     0.450858      0.054502        8.272401  8E-13     0.342659   0.559057
Father     0.411183      0.049788        8.258636  8.56E-13  0.312341   0.510026
Gmothers   0.016553      0.066107        0.250396  0.802822  -0.11469   0.147793
Gfathers   0.086858      0.065657        1.322917  0.189039  -0.04349   0.217203
The printout highlights the equation and the coefficient of determination.
19
Problem 3
Overall usefulness:
H0: all βi = 0
H1: At least one βi ≠ 0
Significance F = p-value = 4.86×10⁻²⁷. Reject H0. The model is useful.
20
Problem 3
Mother's age and father's age at death have strong linear relationships to an individual's age at death. Grandparents' ages at death are not good predictors of an individual's age at death.

The t-test for βi:
H0: βi = 0
H1: βi ≠ 0
t = (bi - βi)/s_bi
Rejection region: t > tα/2, n-k-1 or t < -tα/2, n-k-1
21
Qualitative Variables
• Dummy variables help include qualitative data in a regression model.
• If qualitative data can be categorized into n categories, n-1 dummy variables are needed to express all the categories.
• Dummy variables take on the values 0 or 1.
– Xi = 0 if the data point in question does not belong to category i.
– Xi = 1 if the data point in question belongs to category i.
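A minimal sketch of this 0/1 coding for a three-category variable; the function name and category labels are illustrative, with the first category serving as the base.

```python
# Code a three-category qualitative variable with n-1 = 2 dummy variables.
# The base category ("Excellent") is the one with all dummies at 0.
def condition_dummies(condition):
    average = 1 if condition == "Average" else 0
    poor = 1 if condition == "Poor" else 0
    return average, poor
```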
22
• In Problem 1 we studied the relationship between the age of welding machines and breakdown costs.
• This study was expanded. It now also includes lathe machines and stamping machines. See the data file. Code for machine type: 1=Welding; 2=Lathe; 3=Stamping.
• Answer the following:
– Develop a regression model.
– Interpret the coefficients.
– Can we conclude that welding machines cost more to repair than stamping machines?
– Predict the monthly cost to repair an 85-month-old lathe machine.
Problem 4
23
• First we need to prepare the input data:

Original data              Data with dummy variables
Repairs  Age  Machine      Repairs  Weld  Lathe
327.67   110  1            327.67   1     0
376.68   113  1            376.68   1     0
392.52   114  1            392.52   1     0
443.14   134  1            443.14   1     0
...                        ...
321.86   127  2            321.86   0     1
303.73   164  2            303.73   0     1
301.06   159  2            301.06   0     1
279.24   155  2            279.24   0     1
...                        ...
388.06   87   3            388.06   0     0
401.24   106  3            401.24   0     0
311.13   72   3            311.13   0     0
252.33   67   3            252.33   0     0
Problem 4
25
• Run the multiple regression.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.77057
R Square             0.593778
Adjusted R Square    0.572016
Standard Error       48.59141
Observations         60

ANOVA
            df    SS        MS        F        Significance F
Regression  3     193271.4  64423.79  27.2852  5.24E-11
Residual    56    132223    2361.126
Total       59    325494.4

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  119.2521      35.00037        3.407167  0.001222  49.13801   189.3663
Age        2.538233      0.402311        6.309124  4.75E-08  1.732307   3.344159
Welding    -11.7553      19.70184        -0.59666  0.553138  -51.2228   27.71215
Lathe      -199.374      30.71301        -6.49151  2.39E-08  -260.899   -137.848
Problem 4
26
Cost = 119.25 + 2.538Age - 11.755Weld - 199.37Lathe
Note the reference line (for the stamping machine): Cost = 119.25 + 2.538Age
Repair costs increase on average by $2.54 a month.

The monthly repair cost for a welding machine is $11.76 lower than for a stamping machine of the same age. However, this result is not significant (p-value = .55). There is insufficient evidence in the sample to support the hypothesis that there is any difference between the repair costs of welding machines and stamping machines.

The monthly repair cost for a lathe machine is $199.37 lower than for a stamping machine of the same age. This result is significant.
Problem 4
More Reviewquestions
27
Example 5
• To predict the asking price of a used Chevrolet Camaro, the following data were collected on the car's age and mileage. The data are stored in CAMARO1.
• Determine the regression equation and answer the additional questions stated later.
• Solution
– Run the regression tool from Excel > Data Analysis. Click to see the output next.
28
The regression equation
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.900996
R Square             0.811794
Adjusted R Square    0.798351
Standard Error       1966.976
Observations         31

ANOVA
            df    SS        MS        F         Significance F
Regression  2     4.67E+08  2.34E+08  60.38662  7E-11
Residual    28    1.08E+08  3868994
Total       30    5.76E+08

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  17499.1       940.3091        18.60994  2.66E-17  15572.96   19425.24
Age        -1131.64      335.3149        -3.37485  0.002179  -1818.5    -444.777
Mileage    -72.3086      26.34256        -2.74493  0.010448  -126.269   -18.3482
The regression equation: Price = 17499.1 - 1131.64Age - 72.31Mileage
Be careful about the interpretation of the intercept (17499). Do not argue that this is the price of a used car with no mileage when its age is "zero". Although such cars may exist (a car purchased and returned within a week with almost no mileage might need to be re-sold as a used car), such values of Age and Mileage were not covered by the sample range!
CAMARO1
29
The model usefulness
• Does the overall model contribute significantly to predicting the asking price of a used Chevrolet Camaro? Use .01 for the significance level.
Answer: Observe the Significance F. This is the p-value for the F-test of the hypotheses:
H0: β1 = β2 = 0
H1: At least one βi ≠ 0
Since the p-value is practically zero, it is smaller than alpha. The null hypothesis is rejected, and therefore at least one βi ≠ 0. The variable associated with this βi is linearly related to the price, and the model is useful, thus contributes to predicting the asking price.
CAMARO1
30
Model’s fit
• How well does the model fit the data? Would you expect the predictions to be accurate with this model?
• Solution
– Observing the coefficient of determination (R²), 81% of the variation in car prices is explained by this model. This is quite high, and we can expect accurate predictions.
31
Predicting ‘y’
• Predict the value of the asking price for a 5-year-old car with 70,000 miles on the odometer, with 95% confidence.
• Solution
– To obtain an interval estimate for the prediction of a single car's asking price when Age = 5 and Mileage = 70, we look for the prediction interval. From Data Analysis Plus we have {$2622.222, $10936.38}.
– The general form of the interval is ŷ(5, 70) ± Δ, where Δ is determined from the data. Specifically: ŷ(5, 70) = 17499.1 - 1131.64(5) - 72.31(70) = 6779.303, so the interval is 6779.303 ± Δ. For the Data Analysis Plus procedure go to the worksheet "Prediction Interval" in "CAMARO1".
32
Estimating the mean ‘y’
• Predict the value of the mean asking price for all 5-year-old cars with 70,000 miles on the odometer, with 95% confidence.
• Solution
– To obtain an interval estimate for the mean asking price of all cars for which Age = 5 and Mileage = 70, we look for the confidence interval. From Data Analysis Plus we have {$5756.028, $7802.577}. For details go to the worksheet "Prediction Interval" in "CAMARO1".
33
Testing linear relationship
• Do both variables (Age and Mileage, each in the presence of the other) serve as good predictors of asking price? Test at alpha = .025.
• Solution
– Perform a t-test for the coefficient of each variable. The hypotheses tested are: H0: βAge = 0 vs. H1: βAge ≠ 0, for which the p-value is .002; and H0: βMileage = 0 vs. H1: βMileage ≠ 0, for which the p-value is .0104. In both cases the null hypothesis is rejected; therefore, both have a linear relationship to the asking price at the 2.5% significance level.
34
Problem 5 - continued
• The previous model for the prediction of the asking price of a used Chevrolet Camaro is now extended by adding two new independent variables: car condition (Excellent, Average, Poor) and the type of seller who sells the car (Dealer, Individual). The data for this case are stored in CAMARO2 (see next slide).
• Develop the linear regression model for this case and answer several questions formulated next.
• Solution
– The two new variables describe the values of qualitative data (the state of a car and the type of the seller). Thus, they are dummy variables, taking on the values '0' and '1'.
35
Using dummy variables
• Solution - continued:
– There are three possible car condition values, so we need two dummy variables. Let us select the variables 'Average' and 'Poor'.
– In describing the values of the car condition, these variables are used as follows:

                                Average  Poor
• An "Excellent condition" car  0        0
• An "Average condition" car    1        0
• A "Poor condition" car        0        1

– In a similar manner we use one dummy variable to describe who sold the car. Let us define Dealer = 1 if the car was sold by a dealer, Dealer = 0 if sold by an individual.
CAMARO2
36
The linear regression equation

Regression Statistics
Multiple R           0.946816
R Square             0.896461
Adjusted R Square    0.875753
Standard Error       1543.988
Observations         31

ANOVA
            df    SS        MS        F         Significance F
Regression  5     5.16E+08  1.03E+08  43.29087  1.61E-11
Residual    25    59597467  2383899
Total       30    5.76E+08

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  17357.38      1280.324        13.55702  5.03E-13  14720.51   19994.25
Age        -1131.93      274.8257        -4.11873  0.000365  -1697.95   -565.919
Mileage    -33.242       23.57196        -1.41023  0.170792  -81.7893   15.30533
Average    -2556.44      915.1316        -2.79352  0.009858  -4441.18   -671.689
Poor       -3275.3       1112.001        -2.94541  0.006882  -5565.51   -985.09
Dealer     775.6425      913.1362        0.849427  0.403705  -1105      2656.28

The linear regression equation:
Price = 17357.38 - 1131.93Age - 33.242Mileage - 2556.44Avg - 3275.3Poor + 775.64Dealer
37
Interpreting the coefficients bi
• Interpret the coefficient estimates bi of each variable and test the strength of their predicting power.
• Solution
bAge = -1131.93. In this model, for each additional year the asking price drops by $1132, keeping the rest of the variables unchanged.

bMile = -33.24. In this model, for each additional 1000 miles the asking price drops by $33.24, keeping the rest of the variables unchanged.

bAvg = -2556.44. In this model, the asking price for a car whose condition is average is $2556.44 lower than the asking price for a car whose condition is excellent, keeping the rest of the variables unchanged.

bPoor = -3275.3. In this model, the asking price for a car whose condition is poor is $3275.3 lower than the asking price for a car whose condition is excellent, keeping the rest of the variables unchanged.

bDeal = 775.64. In this model, the asking price for a car sold by a dealer is $775.64 higher than that for a car sold by an individual, keeping the rest of the variables unchanged.
38
The role of the dummy variable coefficients
• Let us compare the asking price equations of two cars, with the same age, mileage, and condition, one sold by a dealer, the other one by an individual:
Price(Dealer) = b0 + b1Age + b2Mileage + b3Avg + b4Poor + b5(Dealer=1)
              = b0 + b1Age + b2Mileage + b3Avg + b4Poor + b5
Price(Individual) = b0 + b1Age + b2Mileage + b3Avg + b4Poor + b5(Dealer=0)
                  = b0 + b1Age + b2Mileage + b3Avg + b4Poor
• Conclusion: When the only difference between cars is the type of sellers who sell them, the base line equation was selected to be the Price(Individual) equation, and then b5 is the average difference in asking price between them.
39
• Let us compare the asking price equations of three cars, that differ in their overall condition but have the same age, mileage, and are sold by the same type of a seller:
Price(Excellent) = b0 + b1Age + b2Mileage + b3(Avg=0) + b4(Poor=0) + b5(Dealer)
                 = b0 + b1Age + b2Mileage + b5(Dealer)
Price(Avg) = b0 + b1Age + b2Mileage + b3(Avg=1) + b4(Poor=0) + b5(Dealer)
           = b0 + b1Age + b2Mileage + b5(Dealer) + b3
Price(Poor) = b0 + b1Age + b2Mileage + b3(Avg=0) + b4(Poor=1) + b5(Dealer)
            = b0 + b1Age + b2Mileage + b5(Dealer) + b4
• Conclusion: When the only difference between cars is the car condition, the base line equation was selected to be the Price(Excellent) equation, and then b3 and b4 are the average differences in asking price between an “excellent condition” car and the other two cars.
The role of the dummy variable coefficients
40
Prediction power of independent variable (are there linear relationships?)
• Testing the prediction power:
– Formulate the t-test for each βi. Observing the p-values we have:
• For Age the p-value = .00036. Age is a strong predictor.
• For Mileage the p-value = .17. Mileage is not a good predictor, not having a linear relationship with price.
• For Average the p-value = .0098. There is sufficient evidence to infer at the 1% significance level that the asking price of a car whose condition is average is different from the asking price of a car whose condition is excellent. In fact, the argument is even stronger. Since the t-statistic is negative (-2.79), the rejection region is at the left-hand tail of the distribution, so we have sufficient evidence to claim that βAverage < 0. This means the asking price of an "Average condition" car is on average $2556 lower than the asking price of an "Excellent condition" car.
41
• Testing the prediction power - continued:
• For Poor the p-value = .006. There is very strong evidence to believe that the asking price for a "Poor condition" car is different from the asking price for an "Excellent condition" car. Specifically, a "Poor condition" car sells for $3275.3 less than an "Excellent condition" car.
• For Dealer the p-value = .40. There is insufficient evidence to infer at the 2.5% significance level that on average the asking price for a car sold by a dealer is different from the asking price for a car sold by an individual.
Prediction power of independent variable (are there linear relationships?)
42
• Predict the asking price of the following car:
– 4 years old, 45,000 miles, average condition, sold by an individual.
Price = 17357 - 1131.9(4) - 33.242(45) - 2556.4(1) + 775.64(0) ≈ $8777

The variable "Average" is equal to 1 when the car is in average condition. The variable "Dealer" is equal to 0 when the car is sold by an individual.
Prediction power of independent variable (are there linear relationships?)
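The substitution is plain arithmetic; as a sketch, using the coefficients from the printout:

```python
# Dummy-coded car: Age = 4, Mileage = 45 (thousands), Average = 1, Dealer = 0
price = 17357 - 1131.9 * 4 - 33.242 * 45 - 2556.4 * 1 + 775.64 * 0
# price comes out just above $8777
```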
43
Chapter 16
• We test the hypothesis that a set of data belongs to a certain distribution:
– The multinomial distribution
– The normal distribution
• We also study whether two variables are dependent or not.
• We apply a tool called the chi-squared test.
44
The multinomial experiment
• The multinomial experiment is an extension of the binomial experiment.
• Characteristics:
– There are n independent trials.
– Each trial can result in one of k possible outcomes.
– There is a probability of a type i success (pi) in each trial.
• We test whether the sample gathered supports the hypothesis that p1, p2, ..., pk are equal to specified values. The test is called the goodness-of-fit test.
45
Problem 6
• To determine whether a single die is balanced, or fair, the die was rolled 600 times.
• Is there sufficient evidence at the 5% significance level to conclude that the die is not fair?
46
Problem 6
• The hypotheses:
– H0: p1 = p2 = ... = p6 = 1/6
H1: At least one pi is not 1/6.
– Build a rejection region:
χ² > χ²α, k-1, where k is the number of outcomes.
– In our case: χ² > χ².05, 5 = 11.07
47
Problem 6
– We calculate χ² as follows:
χ² = Σi=1..k (fi - ei)² / ei
– In our case: e1 = e2 = ... = e6 = 600(1/6) = 100
From the file we have: f1 = 114; f2 = 92; f3 = 84; f4 = 101; f5 = 107; f6 = 103
χ² = (114-100)²/100 + (92-100)²/100 + (84-100)²/100 + (101-100)²/100 + (107-100)²/100 + (103-100)²/100 = 5.75
Since 5.75 < 11.07, H0 is not rejected: there is insufficient evidence to conclude that the die is not fair.
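The statistic is easy to reproduce; a plain-Python sketch using the observed frequencies from this slide:

```python
# Chi-squared goodness-of-fit statistic for the die data
f = [114, 92, 84, 101, 107, 103]   # observed frequencies
e = [600 / 6] * 6                  # expected under H0: fair die
chi2 = sum((fi - ei) ** 2 / ei for fi, ei in zip(f, e))
# chi2 = 5.75, below the critical value, so H0 is not rejected
```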
48
Contingency table
• Here we test the relationship between two variables. Are they dependent?
• We build a contingency table (r rows by c columns) and a chi-squared statistic.

Expected frequency in cell ij:
eij = [f(Row i) × f(Column j)] / n

χ² = Σi=1..r Σj=1..c (fij - eij)² / eij

Rejection region: χ² > χ²α, (r-1)(c-1)
49
Problem 7
• Type of music vs. geographic location
– A group of 30-year-old people is interviewed to determine whether the type of music they prefer is somehow related to the geographic location of their residence.
– From the data presented, can we infer that music preference is affected by geographic location? Use α = .10.
• H0: Type of music and geographic location are independent.
• H1: Type of music and geographic location are dependent.
50
Problem 7 – contd.
e11 = (195)(428)/632 = 132.06; e12 = (195)(100)/632 = 30.85; e23 = (235)(65)/632 = 24.17
χ² = (140-132.06)²/132.06 + ... + (52-24.17)²/24.17 + ... = 64.92
χ².10,(3-1)(4-1) = 10.64; 64.92 > 10.64.
• Reject the null hypothesis. Type of music and geographic location are not independent.
            Rock  R & B  Country  Classical  Total
Northeast   140   32     5        18         195
South       134   41     52       8          235
West        154   27     8        13         202
Total       428   100    65       39         632
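The expected-frequency and statistic formulas can be applied directly to the observed counts in this table (a pure-Python sketch, no libraries):

```python
# Chi-squared test of independence for the music-preference data
obs = [[140, 32, 5, 18],
       [134, 41, 52, 8],
       [154, 27, 8, 13]]
row = [sum(r) for r in obs]                  # row totals
col = [sum(c) for c in zip(*obs)]            # column totals
n = sum(row)

# chi2 = sum over cells of (f_ij - e_ij)^2 / e_ij with e_ij = row_i * col_j / n
chi2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(len(obs)) for j in range(len(col)))
# chi2 ≈ 64.92, matching the Data Analysis Plus output below
```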
51
• Using Data Analysis Plus:

Contingency Table
        Rock  R & B  Country  Classical  TOTAL
        140   32     5        18         195
        134   41     52       8          235
        154   27     8        13         202
TOTAL   428   100    65       39         632

Test Statistic CHI-Squared = 64.9198
P-Value = 0
Problem 7 – contd.
52
Goodness-of-fit test for normality
• Hypothesize on μ and σ.
• Divide the Z interval into equal-size sub-intervals. [i.e. (-2, -1); (-1, 0); (0, 1); (1, 2)]
• Determine the corresponding probabilities covered by each subinterval. [i.e. p1 = P(Z < -2); p2 = P(-2 < Z < -1); ...]
• Translate the Z scores to the associated X values. [i.e. x1 = μ0 + (-2)σ0; x2 = μ0 + (-1)σ0; ...]
• Find the actual frequency for each subinterval. [i.e. f1 for the interval below x1; f2 for the interval (x1, x2); ...]
• Calculate the expected frequency for each interval: e1 = np1; e2 = np2; ...
• Build a chi-squared statistic and perform the test:
Rejection region: χ² > χ²α, k-1-L, where k is the number of sub-intervals and L is the number of parameters hypothesized.
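The probability and expected-frequency steps can be sketched as follows, assuming hypothesized μ0 = 0, σ0 = 1, the cut points from the slide, and an illustrative n:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

cuts = [-2, -1, 0, 1, 2]          # Z cut points from the slide
probs, prev = [], 0.0
for z in cuts:
    probs.append(phi(z) - prev)   # p_i = P(previous cut < Z < z)
    prev = phi(z)
probs.append(1 - prev)            # upper tail, Z > 2

n = 100                           # illustrative sample size
expected = [n * p for p in probs] # e_i = n * p_i
```

The observed frequencies f_i would then be counted from the data and combined with these e_i in the chi-squared statistic above.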