
11 Unusual and Influential Data

As we have seen, linear statistical models—particularly linear regression analysis—make strong assumptions about the structure of data, assumptions that often do not hold in applications. The method of least squares, which is typically used to fit linear models to data, is very sensitive to the structure of the data and can be markedly influenced by one or a few unusual observations.

We could abandon linear models and least-squares estimation in favor of nonparametric regression and robust estimation.1 A less drastic response is also possible, however: We can adapt and extend the methods for examining and transforming data described in Chapters 3 and 4 to diagnose problems with a linear model that has been fit to data and—often—to suggest solutions.

I will pursue this strategy in this and the next two chapters:

• The current chapter deals with unusual and influential data.

• Chapter 12 takes up a variety of problems, including nonlinearity, nonconstant error variance, and non-normality.

• Collinearity is the subject of Chapter 13.

Taken together, the diagnostic and corrective methods described in these chapters greatly extend the practical application of linear models. These methods are often the difference between a crude, mechanical data analysis and a careful, nuanced analysis that accurately describes the data and therefore supports meaningful interpretation of them.

Another point worth making at the outset is that many problems can be anticipated and dealt with through careful examination of the data prior to building a regression model. Consequently, if you use the methods for examining and transforming data discussed in Chapters 3 and 4, you will be less likely to encounter the difficulties detailed in the current part of the text on "post-fit" linear-model diagnostics.

11.1 Outliers, Leverage, and Influence

Unusual data are problematic in linear models fit by least squares because they can unduly influence the results of the analysis and because their presence may be a signal that the model fails to capture important characteristics of the data. Some central distinctions are illustrated in Figure 11.1 for the simple-regression model $Y = \alpha + \beta X + \varepsilon$.

In simple regression, a regression outlier is an observation whose response-variable value is conditionally unusual given the value of the explanatory variable. In contrast, a univariate outlier is a value of Y or X that is unconditionally unusual; such a value may or may not be a regression outlier.

Regression outliers appear in Figure 11.1(a) and (b). In Figure 11.1(a), the outlying observation has an X-value that is at the center of the X-distribution; as a consequence, deleting the outlier has little impact on the least-squares fit, leaving the slope B unchanged and affecting the intercept A only slightly.

1Methods for nonparametric regression were introduced informally in Chapter 2 and will be described in more detail in Chapter 18. Robust regression is the subject of Chapter 19.


Figure 11.1 Leverage and influence in simple regression. In each graph, the solid line gives the least-squares regression for all the data, while the broken line gives the least-squares regression with the unusual data point (the black circle) omitted. (a) An outlier near the mean of X has low leverage and little influence on the regression coefficients. (b) An outlier far from the mean of X has high leverage and substantial influence on the regression coefficients. (c) A high-leverage observation in line with the rest of the data does not influence the regression coefficients. In panel (c), the two regression lines are separated slightly for visual effect but are, in fact, coincident.

In Figure 11.1(b), however, the outlier has an unusually large X-value, and thus its deletion markedly affects both the slope and the intercept.2 Because of its unusual X-value, the outlying last observation in Figure 11.1(b) exerts strong leverage on the regression coefficients, while the outlying middle observation in Figure 11.1(a) is at a low-leverage point. The combination of high leverage with a regression outlier therefore produces substantial influence on the regression coefficients. In Figure 11.1(c), the right-most observation has no influence on the regression coefficients even though it is a high-leverage point, because this observation is in line with the rest of the data—it is not a regression outlier.

The following heuristic formula helps to distinguish among the three concepts of influence, leverage, and discrepancy ("outlyingness"):

Influence on coefficients = Leverage × Discrepancy

2When, as here, an observation is far away from and out of line with the rest of the data, it is difficult to know what to make of it: Perhaps the relationship between Y and X in Figure 11.1(b) is nonlinear.
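The leverage-times-discrepancy logic is easy to see numerically. The following minimal sketch (Python, with simulated data rather than the data of Figure 11.1, so the numbers are illustrative only) appends a discrepant point once near the mean of X and once far from it, and compares the refitted coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_line(x, y):
    """Least-squares intercept and slope of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# "Good" data generated from Y = 1 + 2X + error
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + rng.normal(0, 1, 50)

# (a) a discrepant point near the mean of X: low leverage, little influence
x_a, y_a = np.append(x, 5.0), np.append(y, 1 + 2 * 5.0 + 15)

# (b) a discrepant point far from the mean of X: high leverage, large influence
x_b, y_b = np.append(x, 25.0), np.append(y, 1 + 2 * 25.0 - 30)

print("good data only:  ", ols_line(x, y))
print("with outlier (a):", ols_line(x_a, y_a))   # slope changes little
print("with outlier (b):", ols_line(x_b, y_b))   # slope and intercept shift markedly
```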



Figure 11.2 Regressions for Davis's data on reported and measured weight for women (F) and men (M). Panel (a) shows the least-squares linear regression line for each group (the solid line for men, the broken line for women) for the regression of reported on measured weight. The outlying observation has a large impact on the fitted line for women. Panel (b) shows the fitted regression lines for the regression of measured on reported weight; here, the outlying observation makes little difference to the fit, and the least-squares lines for men and women are nearly the same.

A simple and transparent example, with real data from Davis (1990), appears in Figure 11.2. These data record the measured and reported weight of 183 male and female subjects who engage in programs of regular physical exercise.3 Davis's data can be treated in two ways:

• We could regress reported weight (RW) on measured weight (MW), a dummy variable for sex (F, coded 1 for women and 0 for men), and an interaction regressor (formed as the product MW x F). This specification follows from the reasonable assumption that measured weight, and possibly sex, can affect reported weight. The results are as follows (with coefficient standard errors in parentheses):

$$\widehat{RW} = \underset{(3.28)}{1.36} + \underset{(0.043)}{0.990}\,MW + \underset{(3.9)}{40.0}\,F - \underset{(0.056)}{0.725}\,(MW \times F)$$

$$R^2 = 0.89 \qquad S_E = 4.66$$

Were these results taken seriously, we would conclude that men are unbiased reporters of their weights (because $A = 1.36 \approx 0$ and $B = 0.990 \approx 1$), while women tend to overreport their weights if they are relatively light and underreport if they are relatively heavy (the intercept for women is $1.36 + 40.0 = 41.4$ and the slope is $0.990 - 0.725 = 0.265$). Figure 11.2(a), however, makes it clear that the differential results for women and men are due to one female subject whose reported weight is about average (for women) but whose measured weight is extremely large.

3Davis's data were introduced in Chapter 2.


Recall that this subject's measured weight in kilograms and height in centimeters were erroneously switched. Correcting the data produces the regression

$$\widehat{RW} = \underset{(1.58)}{1.36} + \underset{(0.021)}{0.990}\,MW + \underset{(2.45)}{1.98}\,F - \underset{(0.0385)}{0.0567}\,(MW \times F)$$

$$R^2 = 0.97 \qquad S_E = 2.24$$

which suggests that both women and men are unbiased reporters of their weight.

• We could (as in our previous analysis of Davis's data) treat measured weight as the response variable, regressing it on reported weight, sex, and their interaction—reflecting a desire to use reported weight as a predictor of measured weight. For the uncorrected data,

$$\widehat{MW} = \underset{(5.92)}{1.79} + \underset{(0.076)}{0.969}\,RW + \underset{(9.30)}{2.07}\,F - \underset{(0.147)}{0.00953}\,(RW \times F)$$

$$R^2 = 0.70 \qquad S_E = 8.45$$

The outlier does not have much impact on the coefficients for this regression (both the dummy-variable coefficient and the interaction coefficient are small) precisely because the value of RW for the outlying observation is near the mean reported weight for women, $\overline{RW}$ [see Figure 11.2(b)]. There is, however, a marked effect on the multiple correlation and regression standard error: For the corrected data, $R^2 = 0.97$ and $S_E = 2.25$. (Both specifications are sketched in the code below.)
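A sketch of the two specifications follows, assuming Davis's data can be fetched from the Rdatasets repository via statsmodels; the column names repwt, weight, and sex are as distributed in the carData package, and the estimates may differ slightly from those above depending on the data version and the handling of missing reports:

```python
import statsmodels.formula.api as smf
from statsmodels.datasets import get_rdataset

# Assumption: the Davis data set (carData package) is available from the
# Rdatasets repository; substitute a local file if working offline.
davis = get_rdataset("Davis", "carData").data
davis["F"] = (davis["sex"] == "F").astype(int)   # dummy: 1 for women, 0 for men

# Reported weight regressed on measured weight, sex, and their interaction
m1 = smf.ols("repwt ~ weight * F", data=davis).fit()

# Measured weight regressed on reported weight, sex, and their interaction
m2 = smf.ols("weight ~ repwt * F", data=davis).fit()

print(m1.params, m1.rsquared)
print(m2.params, m2.rsquared)
```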

Unusual data are problematic in linear models fit by least squares because they can substantially influence the results of the analysis and because they may indicate that the model fails to capture important features of the data. It is useful to distinguish among high-leverage observations, regression outliers, and influential observations. Influence on the regression coefficients is the product of leverage and outlyingness.

11.2 Assessing Leverage: Hat-Values

The so-called hat-value $h_i$ is a common measure of leverage in regression. These values are so named because it is possible to express the fitted values $\hat{Y}_j$ ("Y-hat") in terms of the observed values $Y_i$:

$$\hat{Y}_j = h_{1j}Y_1 + h_{2j}Y_2 + \cdots + h_{jj}Y_j + \cdots + h_{nj}Y_n = \sum_{i=1}^{n} h_{ij}Y_i$$

Thus, the weight $h_{ij}$ captures the contribution of observation $Y_i$ to the fitted value $\hat{Y}_j$: If $h_{ij}$ is large, then the $i$th observation can have a considerable impact on the $j$th fitted value. It can be shown that $h_{ii} = \sum_{j=1}^{n} h_{ij}^2$, and so the hat-value $h_i \equiv h_{ii}$ summarizes the potential influence (the leverage) of $Y_i$ on all the fitted values. The hat-values are bounded between $1/n$ and 1 (i.e., $1/n \le h_i \le 1$), and the average hat-value is $\bar{h} = (k + 1)/n$ (where $k$ is the number of regressors in the model, excluding the constant).4

4For derivations of this and other properties of leverage, outlier, and influence diagnostics, see Section 11.8.
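In matrix form, the hat-values are the diagonal entries of the hat matrix $H = X(X'X)^{-1}X'$. A minimal numpy sketch with a simulated design matrix (purely illustrative) checks the bounds and the average hat-value:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # constant plus k regressors

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the hat-values h_i
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

print(h.min() >= 1 / n, h.max() <= 1)        # 1/n <= h_i <= 1
print(np.isclose(h.mean(), (k + 1) / n))     # average hat-value is (k+1)/n
```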


Figure 11.3 Elliptical contours of constant leverage (constant hat-values $h_i$) for $k = 2$ explanatory variables. Two high-leverage points appear, both represented by black circles. One point has unusually large values for each of $X_1$ and $X_2$, but the other is unusual only in combining a moderately large value of $X_2$ with a moderately small value of $X_1$. The centroid (point of means) is marked by the black square. (The contours of constant leverage are proportional to the standard data ellipse, introduced in Chapter 9.)

In simple-regression analysis, the hat-values measure distance from the mean of X:5

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n}(X_j - \bar{X})^2}$$

In multiple regression, $h_i$ measures distance from the centroid (point of means) of the Xs, taking into account the correlational and variational structure of the Xs, as illustrated for $k = 2$ explanatory variables in Figure 11.3. Multivariate outliers in the X-space are thus high-leverage observations. The response-variable values are not at all involved in determining leverage.

For Davis's regression of reported weight on measured weight, the largest hat-value by far belongs to the 12th subject, whose measured weight was wrongly recorded as 166 kg: $h_{12} = 0.714$. This quantity is many times the average hat-value, $\bar{h} = (3 + 1)/183 = 0.0219$.

Figure 11.4(a) shows an index plot of hat-values from Duncan's regression of the prestige of 45 occupations on their income and education levels (i.e., a scatterplot of hat-values vs. the observation indices).6 The horizontal lines in this graph are drawn at twice and three times the average hat-value, $\bar{h} = (2 + 1)/45 = 0.0667$.7 Figure 11.4(b) shows a scatterplot for the explanatory variables education and income: Railroad engineers and conductors have high leverage by virtue of their relatively high income for their moderately low level of education, while ministers have high leverage because their level of income is relatively low given their moderately high level of education.

5See Exercise 11.1. Note that the sum in the denominator is over the subscript j because the subscript i is already in use.

6Duncan's regression was introduced in Chapter 5.

7See Section 11.5 on numerical cutoffs for diagnostic statistics.
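The hat-values for Duncan's regression can be reproduced along the following lines, assuming the Duncan data are fetched from the Rdatasets repository (use a local copy of the data if that is not possible):

```python
import statsmodels.formula.api as smf
from statsmodels.datasets import get_rdataset

# Assumption: the Duncan data set (carData package) is available from the
# Rdatasets repository; substitute a local copy if working offline.
duncan = get_rdataset("Duncan", "carData").data

fit = smf.ols("prestige ~ income + education", data=duncan).fit()
h = fit.get_influence().hat_matrix_diag      # hat-values h_i

n, k = len(duncan), 2
hbar = (k + 1) / n                           # average hat-value
print(duncan[h > 2 * hbar])                  # observations beyond 2 * hbar
print(duncan[h > 3 * hbar])                  # observations beyond 3 * hbar
```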



Figure 11.4 (a) An index plot of hat-values for Duncan's occupational prestige regression, with horizontal lines at $2\bar{h}$ and $3\bar{h}$. (b) A scatterplot of education by income, with contours of constant leverage at $2\bar{h}$ and $3\bar{h}$ given by the broken lines. (Note that the ellipses extend beyond the boundaries of the graph.) The centroid is marked by the black square.

Observations with unusual combinations of explanatory-variable values have high leverage in a least-squares regression. The hat-values $h_i$ provide a measure of leverage. The average hat-value is $\bar{h} = (k + 1)/n$.

11.3 Detecting Outliers: Studentized Residuals

To identify an outlying observation, we need an index of the unusualness of Y given the Xs. Discrepant observations usually have large residuals, but it turns out that even if the errors $\varepsilon_i$ have equal variances (as assumed in the general linear model), the residuals $E_i$ do not:

$$V(E_i) = \sigma_\varepsilon^2(1 - h_i)$$

High-leverage observations, therefore, tend to have small residuals—an intuitively sensible result because these observations can pull the regression surface toward them.

Although we can form a standardized residual by calculating

$$E'_i = \frac{E_i}{S_E\sqrt{1 - h_i}}$$

this measure is slightly inconvenient because its numerator and denominator are not independent, preventing $E'_i$ from following a $t$-distribution: When $|E_i|$ is large, the standard error of the regression, $S_E = \sqrt{\sum E_j^2/(n - k - 1)}$, which contains $E_i^2$, tends to be large as well.

Suppose, however, that we refit the model deleting the $i$th observation, obtaining an estimate $S_{E(-i)}$ of $\sigma_\varepsilon$ that is based on the remaining $n - 1$ observations. Then the studentized residual

$$E^*_i = \frac{E_i}{S_{E(-i)}\sqrt{1 - h_i}} \tag{11.1}$$


has an independent numerator and denominator and follows a $t$-distribution with $n - k - 2$ degrees of freedom.

An alternative, but equivalent, procedure for defining the studentized residuals employs a "mean-shift" outlier model:

$$Y_j = \alpha + \beta_1 X_{j1} + \cdots + \beta_k X_{jk} + \gamma D_j + \varepsilon_j \tag{11.2}$$

where $D$ is a dummy regressor, set to 1 for observation $i$ and 0 for all other observations:

$$D_j = \begin{cases} 1 & \text{for } j = i \\ 0 & \text{otherwise} \end{cases}$$

Thus,

$$E(Y_i) = \alpha + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \gamma$$
$$E(Y_j) = \alpha + \beta_1 X_{j1} + \cdots + \beta_k X_{jk} \qquad \text{for } j \ne i$$

It would be natural to specify the model in Equation 11.2 if, before examining the data, we suspected that observation $i$ differed from the others. Then, to test $H_0\colon \gamma = 0$ (i.e., the null hypothesis that the $i$th observation is not an outlier), we can calculate $t_0 = \hat{\gamma}/\mathrm{SE}(\hat{\gamma})$. This test statistic is distributed as $t_{n-k-2}$ under $H_0$ and (it turns out) is the studentized residual $E^*_i$ of Equation 11.1.
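A quick numerical check of this equivalence, using simulated data and statsmodels (the t-statistic for the dummy regressor should equal the externally studentized residual for that observation):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 40
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ [1.0, 0.5, -0.5] + rng.normal(size=n)

i = 7                                  # an arbitrary observation to test
D = np.zeros(n); D[i] = 1.0            # mean-shift dummy regressor

fit = sm.OLS(y, X).fit()
shift = sm.OLS(y, np.column_stack([X, D])).fit()

t0 = shift.tvalues[-1]                 # t-statistic for the dummy's coefficient
e_star = fit.get_influence().resid_studentized_external[i]
print(np.isclose(t0, e_star))          # True: the two quantities coincide
```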

Hoaglin and Welsch (1978) arrive at the studentized residuals by successively omitting each observation, calculating its residual based on the regression coefficients obtained for the remaining sample, and dividing the resulting residual by its standard error. Finally, Beckman and Trussell (1974) demonstrate the following simple relationship between studentized and standardized residuals:

$$E^*_i = E'_i\sqrt{\frac{n - k - 2}{n - k - 1 - E_i'^2}} \tag{11.3}$$

If n is large, then the factor under the square root in Equation 11.3 is close to 1, and the distinction between standardized and studentized residuals essentially disappears.8 Moreover, for large n, the hat-values are generally small, and thus it is usually the case that

$$E^*_i \approx E'_i \approx \frac{E_i}{S_E}$$

Equation 11.3 also implies that $E^*_i$ is a monotone function of $E'_i$, and thus the rank order of the studentized and standardized residuals is the same.
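The relationship in Equation 11.3 can be verified numerically; the sketch below assumes that statsmodels' resid_studentized_internal and resid_studentized_external correspond to $E'_i$ and $E^*_i$, respectively:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k = 60, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=n)

infl = sm.OLS(y, X).fit().get_influence()
E_std = infl.resid_studentized_internal    # standardized residuals E'_i
E_stu = infl.resid_studentized_external    # studentized residuals  E*_i

# Equation 11.3: E*_i = E'_i * sqrt((n - k - 2) / (n - k - 1 - E'_i^2))
converted = E_std * np.sqrt((n - k - 2) / (n - k - 1 - E_std**2))
print(np.allclose(E_stu, converted))       # True
```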

11.3.1 Testing for Outliers in Linear Models

Because in most applications we do not suspect a particular observation in advance, but rather want to look for any outliers that may occur in the data, we can, in effect, refit the mean-shift model n times,9 once for each observation, producing studentized residuals $E^*_1, E^*_2, \ldots, E^*_n$.

8Here, as elsewhere in statistics, terminology is not wholly standard: $E^*_i$ is sometimes called a deleted studentized residual, an externally studentized residual, or even a standardized residual; likewise, $E'_i$ is sometimes called an internally studentized residual, or simply a studentized residual. It is therefore helpful, especially in small samples, to determine exactly what is being calculated by a computer program.

9It is not necessary literally to perform n auxiliary regressions. Equation 11.3, for example, permits the computation of studentized residuals with little effort.


Usually, our interest then focuses on the largest absolute $E^*_i$, denoted $E^*_{\max}$. Because we have picked the biggest of $n$ test statistics, however, it is not legitimate simply to use $t_{n-k-2}$ to find a $p$-value for $E^*_{\max}$: For example, even if our model is wholly adequate, and disregarding for the moment the dependence among the $E^*_i$s, we would expect to obtain about 5% of $E^*_i$s beyond $t_{.025} \approx \pm 2$, about 1% beyond $t_{.005} \approx \pm 2.6$, and so forth.

One solution to this problem of simultaneous inference is to perform a Bonferroni adjustment to the $p$-value for the largest absolute $E^*_i$.10 The Bonferroni test requires either a special $t$-table or, even more conveniently, a computer program that returns accurate $p$-values for values of $t$ far into the tail of the $t$-distribution. In the latter event, suppose that $p' = \Pr(t_{n-k-2} > E^*_{\max})$. Then the Bonferroni $p$-value for testing the statistical significance of $E^*_{\max}$ is $p = 2np'$. The factor 2 reflects the two-tail character of the test: We want to detect large negative as well as large positive outliers.

Beckman and Cook (1983) show that the Bonferroni adjustment is usually exact in testing the largest studentized residual. Note that a much larger $E^*_{\max}$ is required for a statistically significant result than would be the case for an ordinary individual $t$-test.

In Davis's regression of reported weight on measured weight, the largest studentized residual by far belongs to the incorrectly coded 12th observation, with $E^*_{12} = -24.3$. Here, $n - k - 2 = 183 - 3 - 2 = 178$, and $\Pr(t_{178} > 24.3) \approx 10^{-58}$. The Bonferroni $p$-value for the outlier test is thus $p \approx 2 \times 183 \times 10^{-58} \approx 4 \times 10^{-56}$, an unambiguous result.

Put alternatively, the 5% critical value for $E^*_{\max}$ in this regression is the value of $t_{178}$ with probability $.025/183 = 0.0001366$ to the right. That is, $E^*_{\max} = t_{178,\,.0001366} = 3.714$; this critical value contrasts with $t_{178,\,.025} = 1.973$, which would be appropriate for testing an individual studentized residual identified in advance of inspecting the data.

For Duncan's occupational prestige regression, the largest studentized residual belongs to ministers, with $E^*_{\text{minister}} = 3.135$. The associated Bonferroni $p$-value is $2 \times 45 \times \Pr(t_{45-2-2} > 3.135) = .143$, showing that it is not terribly unusual to observe a studentized residual this big in a sample of 45 observations.
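A sketch of the Bonferroni calculation with scipy appears below (the helper function name is mine); statsmodels' outlier_test() method on a fitted OLS results object reports equivalent Bonferroni-adjusted p-values:

```python
import numpy as np
from scipy import stats

def bonferroni_outlier_test(e_star, n, k):
    """Bonferroni p-value for the largest absolute studentized residual."""
    e_max = np.max(np.abs(e_star))
    p_prime = stats.t.sf(e_max, df=n - k - 2)    # one-tail p-value
    return e_max, min(1.0, 2 * n * p_prime)      # two-tail, Bonferroni-adjusted

# With the Duncan fit from the earlier sketch:
#   e_star = fit.get_influence().resid_studentized_external
#   print(bonferroni_outlier_test(e_star, n=45, k=2))   # roughly (3.135, 0.143)
# or, equivalently, inspect the Bonferroni column of fit.outlier_test()
```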

11.3.2 Anscombe's Insurance Analogy

Thus far, I have treated the identification (and, implicitly, the potential correction, removal, or accommodation) of outliers as a hypothesis-testing problem. Although this is by far the most common procedure in practice, a more reasonable (if subtle) general approach is to assess the potential costs and benefits for estimation of discarding an unusual observation.

Imagine, for the moment, that the observation with the largest $E^*_i$ is simply an unusual data point but one generated by the assumed statistical model:

$$Y_i = \alpha + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i$$

with independent errors $\varepsilon_i$ that are each distributed as $N(0, \sigma_\varepsilon^2)$. To discard an observation under these circumstances would decrease the efficiency of estimation, because when the model—including the assumption of normality—is correct, the least-squares estimators are maximally efficient among all unbiased estimators of the regression coefficients.

If, however, the observation in question does not belong with the rest (e.g., because the mean-shift model applies), then to eliminate it may make estimation more efficient. Anscombe (1960) developed this insight by drawing an analogy to insurance: To obtain protection against "bad" data, one purchases a policy of outlier rejection, a policy paid for by a small premium in efficiency when the policy inadvertently rejects "good" data.11

10See Appendix D on probability and estimation for a discussion of Bonferroni inequalities and their role in simultaneous inference. A graphical alternative to testing for outliers is to construct a quantile-comparison plot for the studentized residuals, comparing the sample distribution of these quantities with the $t$-distribution for $n - k - 2$ degrees of freedom. See the discussion of non-normality in the next chapter.



Let $q$ denote the desired premium, say 0.05—that is, a 5% increase in estimator mean-squared error if the model holds for all of the data. Let $z$ represent the unit-normal deviate corresponding to a tail probability of $q(n - k - 1)/n$. Following the procedure derived by Anscombe and Tukey (1963), compute $m = 1.4 + 0.85z$ and then find

$$E'_q = m\left(1 - \frac{m^2 - 2}{4(n - k - 1)}\right)\sqrt{\frac{n - k - 1}{n}} \tag{11.4}$$

The largest absolute standardized residual can be compared with $E'_q$ to determine whether the corresponding observation should be rejected as an outlier. This cutoff can be translated to the studentized-residual scale using Equation 11.3:

$$E^*_q = E'_q\sqrt{\frac{n - k - 2}{n - k - 1 - E_q'^2}} \tag{11.5}$$

In a real application, of course, we should inquire about discrepant observations rather than simply throwing them away.12

For example, for Davis's regression of reported on measured weight, n = 183 and k = 3; so, for the premium q = 0.05, we have

$$\frac{q(n - k - 1)}{n} = \frac{0.05(183 - 3 - 1)}{183} = 0.0489$$

From the quantile function of the standard-normal distribution, $z = 1.66$, from which $m = 1.4 + 0.85 \times 1.66 = 2.81$. Then, using Equation 11.4, $E'_q = 2.76$, and, using Equation 11.5, $E^*_q = 2.81$. Because $E^*_{\max} = |E^*_{12}| = 24.3$ is much larger than $E^*_q$, the 12th observation is identified as an outlier.

In Duncan's occupational prestige regression, n = 45 and k = 2. Thus, with premium q = 0.05,

$$\frac{q(n - k - 1)}{n} = \frac{0.05(45 - 2 - 1)}{45} = 0.0467$$

The corresponding unit-normal deviate is $z = 1.68$, yielding $m = 1.4 + 0.85 \times 1.68 = 2.83$, $E'_q = 2.63$, and $E^*_q = 2.85 < |E^*_{\text{minister}}| = 3.135$, suggesting that ministers be rejected as an outlier.
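The Anscombe-Tukey cutoff is easy to compute directly. The sketch below implements Equations 11.4 and 11.5 as stated above (the function name is mine); it reproduces the Davis and Duncan values to within the rounding of z used in the text:

```python
import math
from scipy.stats import norm

def anscombe_cutoff(n, k, q=0.05):
    """Anscombe-Tukey rejection cutoffs (E'_q, E*_q) from Equations 11.4 and 11.5."""
    z = norm.isf(q * (n - k - 1) / n)          # unit-normal deviate for the tail probability
    m = 1.4 + 0.85 * z
    e_std = m * (1 - (m**2 - 2) / (4 * (n - k - 1))) * math.sqrt((n - k - 1) / n)
    e_stu = e_std * math.sqrt((n - k - 2) / (n - k - 1 - e_std**2))
    return e_std, e_stu

print(anscombe_cutoff(183, 3))   # Davis example  (n = 183, k = 3)
print(anscombe_cutoff(45, 2))    # Duncan example (n = 45,  k = 2)
```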

A regression outlier is an observation with an unusual response-variable value given its combination of explanatory-variable values. The studentized residuals $E^*_i$ can be used to identify outliers, through graphical examination, a Bonferroni test for the largest absolute $E^*_i$, or Anscombe's insurance analogy. If the model is correct (and there are no true outliers), then each studentized residual follows a $t$-distribution with $n - k - 2$ degrees of freedom.

11An alternative is to employ a robust estimator, which is somewhat less efficient than least squares when the model is correct but much more efficient when outliers are present. See Chapter 19.

12See the discussion in Section 11.7.