Ch24 Exercises

4/20/2007

24 Exercises

Mix and Match

Match the plot or test in the first column with the condition, test statistic or estimate in the second column.

1. Scatterplot of y on x1 a. Similar variances

2. Scatterplot of y on x2 b. F-statistic

3. Scatterplot of x2 on x1 c. Collinearity

4. Direct effect of x2 d. Nearly normal errors

5. Indirect effect of x2 e. Nonlinear effect

6. Scatterplot of y on ŷ f. Marginal slope of x1 on x2

7. Scatterplot of e on ŷ g. Partial slope for x2

8. Normal quantile plot h. Leveraged outlier

9. Test H0: β1 = β2 = 0 i. t-statistic

10. Test H0: β2 = 0 j. Calibration plot

True/False

11. The value of R² always goes up when a nontrivial explanatory variable is added to a regression model.

12. The value of s_e always goes down when an explanatory variable is added to a regression model.

13. A slope in a simple regression is known as a partial slope because it ignores the effects of other explanatory variables.

14. A partial slope estimates differences between average values of observations y that match on the other explanatory variables.

15. The partial slope for an explanatory variable has to be smaller in absolute value than its marginal slope.

16. If the confidence interval for the marginal slope of x1 includes zero, then the confidence interval for its partial slope includes zero as well.

17. The partial slope corresponds to the direct effect in a path diagram.

18. The indirect effect of an explanatory variable is the difference between the marginal and partial slopes.


19. If we reject H0: β1 = β2 = 0 using the F-test, then we should conclude that both slopes are different from zero.

20. If we reject H0: β2 = 0, then we can conclude that the increase in R² obtained by adding x2 to the regression is statistically significant.

21. The main use for a partial regression plot is finding nonlinear effects in multiple regression.

22. A calibration plot summarizes the overall fit of a regression model.

Think About It

23. An analyst became puzzled when analyzing the performance of franchises operated by a fast-food chain. The correlation between sales and the number of competitors within 3 miles was positive. When she regressed sales on the number of competitors and population density, however, she got a negative slope for the number of competitors. What’s going on?

24. In evaluating the performance of new hires, the human resources division found that candidates with higher scores on its qualifying exam performed better. In a multiple regression that also used the education of the new hire as an explanatory variable, the slope for test score was near zero. Explain this paradox for the manager of the HR division.

25. The human resources department at a firm developed a multiple regression to predict the success of candidates for available positions. Drawing records of new hires from five years ago, analysts regressed current annual salary on age at the time of hire and score on a personality test given to new hires. This path diagram summarizes the fitted model. Age is coded in years, the test is scored from 1 to 20, and annual salary is in thousands of dollars ($M).

[Path diagram: Age → Salary, 5 $M/year; Test Score → Salary, 2 $M/point; Age and Test Score are linked by marginal slopes of 5 points/year and 2 years/point.]

a) Write down the equation for the multiple regression model. If there’s something that you’re missing, denote it with the usual symbol.

b) Which is larger: the direct or indirect effect of test score?

c) Find the marginal slope of salary on test score.

d) If you were a new applicant and could take a special course that in a week’s time could raise your test score by 5 points, would the course be worth the $25,000 being charged? Which slope is relevant: marginal or partial?
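As a reminder of how the arrows in a two-predictor path diagram combine, the following sketch adds a direct effect to an indirect effect to obtain a marginal slope. The function name and the numbers are hypothetical and chosen only for illustration; they are not the values in the diagram for this exercise.

```python
def marginal_slope(direct_effect, slope_other_on_this, direct_effect_other):
    """Combine the paths of a two-predictor path diagram.

    direct_effect       : partial slope of y on this x (arrow from this x to y)
    slope_other_on_this : slope of the other x regressed on this x
    direct_effect_other : partial slope of y on the other x
    Returns (marginal slope, indirect effect).
    """
    # The indirect effect follows the path through the other explanatory variable.
    indirect_effect = slope_other_on_this * direct_effect_other
    return direct_effect + indirect_effect, indirect_effect


# Hypothetical values (assumptions for illustration only).
marginal, indirect = marginal_slope(
    direct_effect=3.0, slope_other_on_this=0.5, direct_effect_other=4.0
)
print(f"indirect effect = {indirect}, marginal slope = {marginal}")
```

The same function can be reused for the second explanatory variable by swapping the roles of the arguments.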



26. A marketing research analysis considered how two customer characteristics affect customers’ stated desire for a product. Potential customers in a focus group were shown prototypes of a new convenience product and asked to indicate how much they would like to buy such a product. Scores were obtained on a 0 to 100 rating scale. The marketing group measured the age (in years) and income (in thousands of dollars, $M) of each participant. The following path diagram summarizes the estimates in the associated multiple regression.

[Path diagram: Age → Product Rating, -0.1 points/year; Income → Product Rating, 0.2 points/$M; Age and Income are linked by marginal slopes of 2 $M/year and 0.7 years/$M.]

a) Does the path diagram provide enough information in order to compute the fitted values from the multiple regression?

b) What is the sign of the marginal correlation between age and product rating?

c) Describe the ideal customer for this product, at least as indicated by the summarized model.

27. This table, often called a correlation matrix, shows the pairwise correlations among 3 variables: two explanatory variables x1 and x2 and the response denoted by y. For example, corr(y, x2) = 0.2359.

| | y | x1 | x2 |
|---|---|---|---|
| y | 1.0000 | 0.8667 | 0.2359 |
| x1 | 0.8667 | 1.0000 | 0.0000 |
| x2 | 0.2359 | 0.0000 | 1.0000 |

a) Why does it make sense to put 1’s down the diagonal of this table?

b) Find the slope of the simple regression of y on x1, if you can. If you cannot, indicate what’s missing.

c) Do you think that the marginal slope for y on x2 will be similar to the partial slope for x2 in the multiple regression of y on x1 and x2?

28. The following 3 × 3 table, known as a correlation matrix, shows the pairwise correlations among 3 variables. The variables are the “expert” ratings assigned to wines by well-known connoisseurs (from 0 to 100), the year of the vintage (year in which the grapes were harvested), and the listed price on a web site. For example, corr(year, price) = 0.3222.

| | Rating | Year | Price |
|---|---|---|---|
| Rating | 1.0000 | 0.0966 | 0.7408 |
| Year | 0.0966 | 1.0000 | 0.3222 |
| Price | 0.7408 | 0.3222 | 1.0000 |



a) Would a multiple regression of the ratings on year and price explain more than half of the variation in the ratings?

b) If we regress the standardized value of the rating (that is, subtract the mean rating and divide by the SD of the rating) on the standardized price (subtract the mean price and divide by the SD of the price), what would be the slope?

c) Are the partial and marginal slopes for price identical? Explain.

29. An airline developed a regression model to predict revenue from flights that connect “feeder” cities to its hub airport. The response in the model is the revenue generated by flights operating to the feeder cities (in thousands of dollars per month), and the two explanatory variables are the air distance between the hub and feeder city (Distance, in miles) and the population of the feeder city (in thousands). The least squares regression equation based on data for 37 feeder locations last month is

Estimated Revenue = 87 + 0.3 Distance + 1.5 Population

with R² = 0.74 and s_e = 32.7.

a) The airline plans to expand its operations to add an additional feeder city. One possible city has population 100,000 and is 250 miles from the hub. A second possible city has population 75,000 and is 200 miles from the hub. Which would you recommend if the airline wants to increase total revenue?

b) What is the interpretation of the intercept in this equation?

c) What is the interpretation of the partial slope for Distance?

d) What is the interpretation of the partial slope for Population?
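When plugging values into a fitted equation like the one in this exercise, the units matter: Distance is in miles and Population is in thousands. The sketch below evaluates the stated equation for a hypothetical city (an assumption for illustration, not one of the cities in part a); the function name is also hypothetical.

```python
def estimated_revenue(distance_miles, population_thousands):
    """Fitted equation from the exercise; result is in thousands of dollars per month."""
    return 87 + 0.3 * distance_miles + 1.5 * population_thousands


# Hypothetical feeder city: 300 miles from the hub, 50,000 residents.
# The population must be converted to thousands before it enters the equation.
population = 50_000
revenue = estimated_revenue(distance_miles=300, population_thousands=population / 1000)
print(f"Estimated revenue: ${revenue:.0f} thousand per month")
```

Entering the population as 50,000 rather than 50 would inflate the estimate by a factor of about 300, which is a common slip with equations of this kind.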

30. A national motel chain has a model for the operating margin of its franchises. The operating margin is defined to be the ratio of net profit to total revenue (as a percentage). The company plans to use this model to help it identify profitable sites to locate new hotels. The response in the model is the operating margin, and the explanatory variables are the number of available hotel rooms currently within 3 miles of the site (Rooms) and the square feet of office space (Office, in thousands of square feet) near the site. The estimated regression based on sites of 100 motels operated by this chain is

Estimated Margin = 54 – 0.0073 Rooms + 0.0216 Office

with R² = 45% and s_e = 8.4.

a) Two possible sites are similar, except that one is near an office complex with 400,000 square feet, whereas the other is near 50,000 square feet of offices. Within a mile of the location near the office complex, a competing hotel has 2,250 rooms whereas competitors near the other location offer 300 rooms. Which site would you expect to generate a higher operating margin? How much higher?

b) What is the interpretation of the intercept in this equation?

c) What does it mean that the partial slope for Rooms is negative?



d) Interpret the partial slope for Office.

31. This table gives further details of the multiple regression estimated in Question 29. Assume that the MRM satisfies the conditions for using this model for inference.

| | Estimate | SE | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 87.3543 | 55.0459 | | |
| Distance | 0.3428 | 0.0925 | | |
| Population | 1.4789 | 0.2515 | | |

a) Fill in the column of t-statistics.

b) Estimate the column of p-values using the Empirical Rule. Only rough estimates are needed.

c) Does the addition of Distance to a simple regression using only population as an explanatory variable produce a statistically significant increase in R²?

d) If the population of a city were to grow by 10,000, on average what effect would this growth have on revenue? Give a range, rounded to presentation precision.
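The mechanics behind a table like this one can be sketched in a few lines: a t-statistic is the estimate divided by its standard error, and a rough two-sided p-value follows from a normal approximation. The helper names and the numbers below are hypothetical (they are not the entries of the table), and the normal approximation is an assumption that is reasonable only when the sample is not small.

```python
import math


def t_statistic(estimate, se, null_value=0.0):
    """Number of standard errors that separate the estimate from the null value."""
    return (estimate - null_value) / se


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value, P(|Z| > |t|)."""
    return math.erfc(abs(t) / math.sqrt(2))


def approx_95_interval(estimate, se):
    """Rough 95% interval: estimate plus or minus 2 standard errors."""
    return estimate - 2 * se, estimate + 2 * se


# Hypothetical slope estimate and standard error (assumptions for illustration).
t = t_statistic(estimate=2.0, se=0.8)
p = approx_two_sided_p(t)
low, high = approx_95_interval(2.0, 0.8)
print(f"t = {t:.2f}, approximate p-value = {p:.3f}, interval = ({low:.1f}, {high:.1f})")
```

Under the Empirical Rule, a t-statistic beyond about 2 in absolute value corresponds to a p-value below roughly 0.05, which is all the precision part b asks for.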

32. This table gives further details of the multiple regression estimated in Question 30. Assume that the MRM satisfies the conditions for using this model for inference.

| | Estimate | SE | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 53.9826 | 5.1777 | | |
| Rooms | -0.0073 | 0.0013 | | |
| Office | 0.0216 | 0.0176 | | |

a) Fill in the column of t-statistics.

b) Estimate the column of p-values using the Empirical Rule. Only rough estimates are needed.

c) Does the addition of Office to a simple regression using only Rooms as an explanatory variable produce a statistically significant increase in R²?

d) Does the addition of Rooms to a simple regression using only Office as an explanatory variable produce a statistically significant increase in R²?

33. Refer to the context of the airline in Question 29. Assume that the estimated model meets the conditions for using the MRM for inference.

a) Does the estimated multiple regression equation explain statistically significant variation in revenue among these feeder cities?

b) If this model is used to predict revenue for a feeder city, how accurate would you expect those predictions to be?

34. Refer to the context of the motel chain in Question 30. Assume that the estimated model meets the conditions for using the MRM for inference.

a) Does the estimated multiple regression equation explain statistically significant variation in operating margins among these hotels?

b) If this model is used to predict the operating margin for a site, how accurate would you expect the prediction to be?
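Questions 33 and 34 rely on the overall F-statistic, which can be computed from R², the number of cases n, and the number of explanatory variables k. The sketch below uses the standard formula with hypothetical inputs (assumptions for illustration, not the values from Questions 29 or 30); the function name is also hypothetical.

```python
def f_statistic(r_squared, n, k):
    """Overall F-statistic for H0: all k slopes are zero.

    r_squared : proportion of variation explained (between 0 and 1)
    n         : number of cases
    k         : number of explanatory variables
    """
    explained_per_slope = r_squared / k
    unexplained_per_df = (1 - r_squared) / (n - k - 1)
    return explained_per_slope / unexplained_per_df


# Hypothetical model: R² = 0.50 from n = 50 cases with k = 2 explanatory variables.
F = f_statistic(r_squared=0.50, n=50, k=2)
print(f"F = {F:.1f} on 2 and 47 degrees of freedom")
```

The resulting value is compared with an F distribution having k and n − k − 1 degrees of freedom; regression software reports the corresponding p-value.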



You Do It

35. Gold chains These data give the prices (in dollars) for gold link chains at the web site of a discount jeweler. The data include the length of the chain (in inches) and its width (in millimeters). All of the chains are 14 karat gold in a similar link style. Use the price as the response.

a) Examine the scatterplots of the response versus the two explanatory variables as well as the scatterplot between the explanatory variables. Do you notice any unusual features in the data? Do the relevant plots appear straight enough for multiple regression?

b) Find the correlation between each pair of variables. Which correlation is largest? Explain why this correlation is so much larger than the others.

c) Fit the multiple regression of price on length and width. Show a summary of the fitted model. (Save the diagnostics for “d”.)

d) Even though the equation fit in “c” has a large R² and both slopes are significantly different from zero, the estimated regression does not meet the conditions of the MRM. Explain why.

e*) You can “patch the model” by combining the two explanatory variables in a way that captures an important omitted variable. Do this, and see if the model improves. (Hint: Concentrate on identifying the obvious missing variable from this model. You can build a very good proxy for this variable using the given columns.)

f*) Summarize the fit of your improved model.

36. Convenience shopping (introduced in Chapter 19) These data describe the sales over time at a franchise outlet of a major US oil company. (The data file has values for two stations. For this exercise, use only the 283 cases for site 1.) Each row summarizes sales for one day. This particular station sells gas, and it also has a convenience store and a car wash. The response Sales gives the dollar sales of the convenience store. The explanatory variable Volume gives the number of gallons of gasoline sold and Washes gives the number of car washes sold at the station.

a) Examine scatterplots of the response versus the two explanatory variables as well as the scatterplot between the explanatory variables. Do you notice any unusual features in the data? Do the relevant plots appear straight enough for multiple regression?

b) Find the correlation between each pair of variables. Which correlation is largest? Explain why this correlation is so much larger than the others.

c) Fit the multiple regression of sales on volume and the number of car washes. Show a summary of the fitted model. (Save the diagnostics for “d”.)

d) Does the fitted model meet the conditions for using the MRM for inference?

e) Assume that the model meets the conditions for the MRM. Interpret carefully the estimated slope for the number of car washes. In your interpretation, include a range for the effect of this variable.



37. Download (introduced in Chapter 19) Before plunging into videoconferencing, a company tested its current internal computer network. The tests measured how rapidly data moved through its network given the current demand on the network. Eighty files ranging in size from 20 to 100 megabytes (MB) were transmitted over the network at various times of day, and the time to send each file was recorded. The time of day is given as the number of hours past 8 am on the day of the test. Use the transfer time as the response, with the file size and time of day as explanatory variables.


b) Do you think, before fitting the multiple regression, that the partial slope for the file size will be the same as in the shown equation? Explain.

c) Fit the multiple regression of the transfer time on the file size and the time of day. Summarize the estimates obtained for the fitted model.

d) Does the fit of this model meet the conditions of the MRM?

e) Compare the sizes of the t-statistics of the fitted model to the overall F-statistic. Do these tests agree with each other?

f) Compare the confidence interval for the marginal slope for file size to the confidence interval for the partial slope for file size. How are these different?

g) Does the path diagram for the multiple regression offer a suggestion for the differences noticed in the previous questions?

38. Production costs (introduced in Chapter 19) A manufacturer produces custom metal blanks that are used by its customers for computer-aided machining. The customer sends a design via computer, and the manufacturer replies with an estimated price per unit. This cost estimate determines a price for the customer. The data for the analysis were sampled from the accounting records of 195 orders that were filled during the previous 3 months. Formulate the regression model with y as the average cost per unit and x1 as the material cost per unit and x2 as the labor cost per unit.


b) Fit the indicated multiple regression and show a summary of the estimated features of the model.

c) Does the estimated model appear to meet the conditions for the use of the MRM?

d) Has the addition of labor hours per unit resulted in a model that explains statistically significantly more variation in the average cost per unit?

e) Interpret the estimated slope for labor hours. Include in your answer the 95% confidence interval for the estimate.



f) Does this model promise accurate predictions of the cost to fill orders based on the material and labor costs?

39. Home prices In order to help clients determine the price at which their house is likely to sell, a realtor gathered a sample of 150 purchase transactions in her area during a recent three month period. For the response in the model, use the price of the home (in thousands of dollars). As explanatory variables, use the number of square feet (also in thousands) and the number of bathrooms.




d) Does this estimated model explain statistically significant variation in the prices of homes?

e) Compare the marginal slope for the number of bathrooms to the partial slope. Explain why these are so different, and show confidence intervals for each.

f) A homeowner asked the realtor if she should spend $40,000 to convert a walk-in closet into a small bathroom in order to increase the sale price of her home. What does your analysis indicate?

40. Leases (introduced in Chapter 19) This data table gives annual costs of 223 commercial leases. All of these leases provide office space in a Midwestern city in the US. For the response, use the cost of the lease (in dollars per square foot). As explanatory variables, use the reciprocal of the number of square feet and the age of the property in which the office space is located (denoted as Age, in years).




d) Does this estimated model explain statistically significant variation in the costs per square foot of the leases?

e) Interpret the slope for the age of the building, including in your answer the confidence interval for this estimate.

f) Can you identify an embarrassing lurking variable? Might this lurking variable affect the coefficients of the explanatory variables?



41. R&D expenses (introduced in Chapter 19) This data table contains accounting and financial data that describe 504 companies operating in technology industries: software, systems design, and semiconductor manufacturing. The variables include the expenses on research and development (R&D), total assets of the company, and net sales. (All columns are reported in millions of dollars, so 1,000 = 1 billion.) Use the natural logs of all variables rather than the originals.




d) Does the fit of this model explain statistically significantly more variation in the log of spending on R&D than a model that uses log assets alone?

The multiple regression in “b” has all variables on a natural log scale, similar to the model in Question 42. The only difference is that these variables have been put on the scale of natural logs rather than common logs. To interpret the equation, note that the sum of natural logs is the log of the product, $\log_e x + \log_e y = \log_e (x y)$, and that $b \log_e x = \log_e x^b$. Hence, the equation

$$\log_e y = b_0 + b_1 \log_e x_1 + b_2 \log_e x_2$$

is equivalent to

$$y = e^{b_0} \, x_1^{b_1} \, x_2^{b_2}$$

The slopes in the log-log regression are exponents in an equation that describes y as the product of the explanatory variables raised to different powers. These powers are the partial elasticities of the response with respect to the predictors. (See Chapter 20 for a discussion of elasticities.)
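For readers who want the intermediate steps, the equivalence between the log-log equation and the product form can be written out using only the two log rules quoted in the text. This is a short derivation sketch in the same symbols, not an additional result.

```latex
\begin{align*}
\log_e y &= b_0 + b_1 \log_e x_1 + b_2 \log_e x_2 \\
         &= b_0 + \log_e x_1^{b_1} + \log_e x_2^{b_2}
            && \text{since } b \log_e x = \log_e x^{b} \\
         &= b_0 + \log_e\!\left( x_1^{b_1} x_2^{b_2} \right)
            && \text{since } \log_e x + \log_e y = \log_e (x y) \\
y        &= e^{b_0} \, x_1^{b_1} \, x_2^{b_2}
            && \text{exponentiating both sides}
\end{align*}
```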

e) Interpret the slope for net sales in the equation estimated by the fitted model in “b”. Include the confidence interval in your calculation.

f) The marginal elasticity of R&D spending with respect to net sales is about 0.79. Why is the partial elasticity in the multiple regression for net sales so much smaller? Is it really that much smaller?

42. Cars (introduced in Chapter 19) This data table gives various characteristics of 223 types of cars sold in the US during the 2003 and 2004 model years. Use the base price as the response and the horsepower of the engine (HP) and the weight of the car (given in thousands of pounds) as explanatory variables.

a) Examine the calibration plot and the plot of the residuals e on the fitted values ŷ for the multiple regression of base price on HP and weight. Do these plots reveal any problems in the fit of this model?



b) Revise the variables in the model to all be on the scale defined by log10. Has this common transformation fixed the problems identified in “a”? To answer this question, refit the model on the log scale and consider the calibration and residual plots for the revised model.

The multiple regression in “b” has all variables on a log scale. To interpret this model, recall that the sum of logs is the log of the product, $\log_{10} x + \log_{10} y = \log_{10} (x y)$, and that $b \log_{10} x = \log_{10} x^b$. Hence, an equation of the form

$$\log_{10} y = b_0 + b_1 \log_{10} x_1 + b_2 \log_{10} x_2$$

is equivalent to the product

$$y = 10^{b_0} \, x_1^{b_1} \, x_2^{b_2}$$

The slopes in the log-log regression are exponents in a model that estimates y as the product of the explanatory variables raised to different powers. These powers are the partial elasticities of the response with respect to the predictors. (See Chapter 20 for a discussion of elasticities.)
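A quick numerical check can make the elasticity interpretation concrete. The sketch below confirms that the log10 form and the product form give the same fitted value, then shows that a 1% increase in x1 (holding x2 fixed) changes y by roughly b1 percent. The coefficients and inputs are hypothetical (assumptions for illustration, not estimates from the cars data), as are the function names.

```python
import math


def predict_log_form(b0, b1, b2, x1, x2):
    """Fitted y from log10 y = b0 + b1 log10 x1 + b2 log10 x2."""
    return 10 ** (b0 + b1 * math.log10(x1) + b2 * math.log10(x2))


def predict_product_form(b0, b1, b2, x1, x2):
    """Fitted y from the equivalent product y = 10^b0 * x1^b1 * x2^b2."""
    return 10 ** b0 * x1 ** b1 * x2 ** b2


# Hypothetical coefficients and explanatory values.
b0, b1, b2 = 1.0, 0.8, 0.5
x1, x2 = 200.0, 3.5

y_log = predict_log_form(b0, b1, b2, x1, x2)
y_product = predict_product_form(b0, b1, b2, x1, x2)

# Percentage change in y when x1 rises by 1% and x2 is held fixed.
y_bumped = predict_product_form(b0, b1, b2, 1.01 * x1, x2)
pct_change = 100 * (y_bumped / y_product - 1)

print(f"log form: {y_log:.4f}, product form: {y_product:.4f}")
print(f"1% increase in x1 changes y by about {pct_change:.2f}% (b1 = {b1})")
```

The percentage change does not depend on the starting values of x1 and x2, which is what makes the exponent interpretable as an elasticity.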

c) Is the partial elasticity for weight equal to zero? Estimate the partial elasticity from the multiple regression of log10 price on log10 HP and log10 weight.

d) Compare the partial elasticity for weight (the slope for log10 weight in the multiple regression) to the marginal elasticity for price with respect to weight (the slope for log10 weight in a simple regression of log10 price on log10 weight). Are these estimates very different? Use confidence intervals to measure the size of any differences.

e) Does the path diagram for this model offer an explanation for the differences in the confidence intervals found in “d”? Explain.

f) Based on your analysis, describe the effect of weight on price. Does it have an effect? Are heavier cars more expensive, on average?

43. OECD (introduced in Chapter 19) An analyst at the UN is developing a model that describes GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) among developed countries. She is using national data for 29 countries from the 2005 report of the Organization for Economic Co-operation and Development (OECD). She started with the equation (estimated by least squares)

Estimated per capita GDP = $26,714 + 1,440 Trade Balance

The trade balance is measured as a percentage of GDP. Exporting countries tend to have large positive trade balances. Importers have negative balances. This equation explains only 37% of the variation in per capita GDP, so she added a second explanatory variable, the number of kilograms of municipal waste per person.


b) Do you think, before fitting the multiple regression, that the partial slope for trade balance will be the same as in the shown equation? Explain.



c) Fit the multiple regression that expands the one-predictor equation by adding the second explanatory variable to the model. Summarize the estimates obtained for the fitted model.


e) Draw the path diagram for this estimated model. Use it to explain why the estimated slope for the trade balance has become smaller than in the shown simple regression.

f) Give a confidence interval, to presentation precision, for the slope of the municipal waste variable. Does this interval imply that countries can increase their GDP by encouraging residents to produce more municipal waste?

44. Hiring (introduced in Chapter 19) A firm that operates a large, direct-to-consumer sales force would like to build a system to monitor the progress of new agents. The goal is to identify “superstar agents” as rapidly as possible, offer them incentives, and keep them with the company. A key task for agents is to open new accounts; an account is a new customer to the business. The response of interest is the profit to the firm (in dollars) of contracts sold by agents over their first year. These data summarize the early performance of 464 agents. Among the possible explanations of this performance are the number of new accounts developed by the agent during the first 3 months of work and the commission earned on early sales activity. An analyst at the firm began the modeling by estimating the simple regression

Estimated Log Profit = 8.98 + 0.28 Log Accounts

This equation explains 18% of the variation in log profits. Formulate the MRM with y given by the natural log of Profit from Sales and the explanatory variables given by the natural log of Number Accounts and the natural log of Early Commissions. (For discussion of models in logs, see Questions 41 and 42.) Notice that some cases have the value 0 for early commission. Rather than drop these (you cannot take the log of zero), fill in the zeros with a small positive constant. Here we’ll use $1 and continue by taking the log after the value has been filled in.
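The zero-handling step described above can be sketched as follows. This reads the instruction as replacing only the zero values with $1 before taking natural logs, which is an assumption about the intended procedure; the function name and the sample values are hypothetical and are not taken from the data file.

```python
import math


def log_with_zero_fill(values, fill=1.0):
    """Natural log of each value, after replacing exact zeros with `fill`."""
    return [math.log(fill if v == 0 else v) for v in values]


# Hypothetical early commissions in dollars; two agents earned nothing early on.
early_commissions = [0.0, 250.0, 0.0, 1200.0]
log_commissions = log_with_zero_fill(early_commissions)
print(log_commissions)
```

Because the log of 1 is 0, the filled-in cases land at zero on the log scale, well below the other cases, so they are worth watching for in the scatterplots.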

a) Examine scatterplots of the response versus the two explanatory variables as well as the scatterplot between the explanatory variables. Be sure to keep all of the variables on the scale of natural logs. Do you notice any unusual features in the data? Do the relevant plots appear straight enough for multiple regression?

b) Do you think, before fitting the multiple regression, that the partial elasticity for number of accounts will be the same as the marginal elasticity?

c) Fit the multiple regression that expands the one-predictor equation by adding the second explanatory variable to the model. Summarize the estimates obtained for the fitted model.


e) Does the confidence interval for the partial elasticity for the number of accounts indicate a large shift from the marginal elasticity?



f) Use a path diagram to illustrate why the marginal elasticity and partial elasticity are either similar or different.

g) Which would be more likely to succeed in raising the performance of new hires: a training program that increased the number of accounts by 5% but did not change early selling, or a program that raised both by 2.5%? Can you answer this question from the estimated model?

45. Promotion (introduced in Chapter 19) These data describe promotional spending by a pharmaceutical company for a cholesterol-lowering drug. The data cover 39 consecutive weeks and isolate the area around Boston. The variables in this collection are shares. Marketing research often describes the level of promotion in terms of voice. In place of the level of spending, voice is the share of advertising devoted to a specific product. The column Market Share is the ratio of sales of this product divided by total sales for such drugs in the Boston area. The column Detail Voice is the ratio of detailing for this drug to the amount of detailing for all cholesterol-lowering drugs in Boston. Detailing counts the number of promotional visits made by representatives of a pharmaceutical company to doctors’ offices. Similarly, Sample Voice is the share of samples in this market that are from this manufacturer. Formulate the MRM with y given by the Market Share and x given by Detail Voice and Sample Voice.




d) Does this estimated model explain statistically significant variation in the market share?

e) At a fixed level of sampling, do periods with increased detailing show significantly larger market share?

f) Does the fit of the multiple regression imply that the pharmaceutical company should no longer invest in detailing and only rely on sampling? Discuss briefly.

46. Apple (introduced in Chapter 19) This data table tracks monthly performance of stock in Apple Computer since its inception in 1980. The data includes 300 monthly returns on Apple Computer, as well as returns on the entire stock market, Treasury Bills (short term, 30-day loans to Uncle Sam), and inflation. (The column Market Return is the return on a value-weighted portfolio that purchases stock in proportion to the size of the company rather than one of each stock.) Formulate the model with Apple Return as the response and Market Return and IBM Return as explanatory variables.






d) The regression of returns on Apple on market returns estimates β for this stock to be about 1.5. Does the multiple regression suggest a different slope for the market?

e) Give a confidence interval for the slope of IBM returns and carefully interpret this estimate.

f) Does the addition of IBM returns improve the fit of the model with just market returns by a statistically significant amount? Does this imply that we’ve found an improved trading scheme?



4M Residual Car Values

When General Motors, Honda, Ford or Toyota lease a car, how do they decide what to charge? One answer, if you’ve got a lot of unpopular cars to move, is to charge whatever it takes to get the cars off the lot. A different answer considers the so-called residual value of the car at the end of the lease. The residual value of a leased car is the value of this car in the used-car market, a.k.a., the previously-owned car market.

How should we estimate the residual value of a car? The residual value certainly depends on how much the car was worth originally, such as the manufacturer’s list price. Let’s take this off the table and limit our attention to a particular type of car. Let’s also assume that we’re looking at cars that have not been dented or banged up in an accident.

What else matters?

Certainly the age of the car affects its residual value. Customers expect to pay less for older cars, on average. Older cars have smaller residual value. The term of the lease, say 2 or 3 years, has to cover the cost of the age of the car. Another factor that affects the residual value is the type of use. An older car that’s in great condition (the one that the little old man only drove to church on Sunday) might be worth more than a newer car that’s been heavily driven. It seems as though the cost of a lease ought to take both duration and use into account.

Honda was not about to reveal to us the model that they use for estimating the residual value of leased cars. Not even Ford. So, we’ll use data on the resale prices of used BMW cars gathered off of the web. We introduced these data in an example in Chapter 1. These 218 cars are late-model BMW cars in their popular 3-series.

Motivation

a) Why does a manufacturer need to estimate the residual value of a leased car in order to determine annual payments on the lease?

Method

b) Explain why we should use a multiple regression to estimate the effects of age and mileage simultaneously, rather than use two simple regression models.

c) The cars are all late-model cars of the same model. Explain why this similarity avoids some of the problems that we have when looking at a cross-section of different types of cars (such as nonlinear patterns or different variances).

d) Check scatterplots of the variables. Do the relationships appear straight enough for using multiple regression with these variables?

Mechanics

e) Fit the appropriate multiple regression model.

f) Does this model meet the conditions for using the MRM as the basis for inference?



g) Build confidence intervals for the partial effects of age and mileage.

Message

h) Summarize the results of your model. Recommend terms for leases that cover the costs of aging and mileage.

i) Do you have any caveats that should be pointed out with your recommended terms? For example, are there any evident lurking variables?
