
CLEMSON UNIVERSITY, S.C., DEPT. OF MATHEMATICAL SCIENCES
RIDGE ESTIMATION IN LINEAR REGRESSION
OCT 76     J. S. HAWKES     Contract N00014-75-C-0451
UNCLASSIFIED



RIDGE ESTIMATION IN LINEAR REGRESSION

TECHNICAL REPORT #232
OCTOBER 1976

DISTRIBUTION STATEMENT A

Approved for public release; Distribution Unlimited

THE AUTHOR'S WORK WAS SUPPORTED BY THE OFFICE OF NAVAL RESEARCH
UNDER CONTRACT N00014-75-C-0451


RIDGE ESTIMATION IN LINEAR REGRESSION

James S. Hawkes
Clemson University

ABSTRACT

Consider the linear regression model $Y = X\theta + \varepsilon$. Recently a class of estimators, variously known as ridge estimators, has been proposed as an alternative to the least squares estimators in the case of collinearity, that is, when $X'X$ is nearly singular. The ridge estimator is given by $(X'X + KI)^{-1} X'Y$, where $K$ is a constant to be determined. An optimal choice of the value of $K$ is not known. This paper examines the risk (mean squared error) of the ridge estimator under the constraint $\theta'\theta \le c$, where $c$ is a constant, and determines values of $K$ for which the risk is smaller than the risk of the least squares estimator.

Key words and phrases: Regression, Ridge Analysis, Multicollinearity, Mean Squared Error

AMS Classification: 62J05


The author's work was supported by the Office of Naval Research under
Contract N00014-75-C-0451.


RIDGE ESTIMATION IN LINEAR REGRESSION

James S. Hawkes
Department of Mathematical Sciences

Clemson University

1. Introduction. In applications of multiple linear regression, the explanatory variables under consideration are often interrelated. The relation is technically called multicollinearity or near multicollinearity. The ordinary least squares estimate of the regression coefficients tends to become "unstable" in the presence of multicollinearity. More precisely, the variance of some of

the regression coefficients becomes large. Hoerl (1959), (1962) and Hoerl and Kennard (1970a), (1970b) have suggested a class of estimators known as ridge estimators as an alternative to least squares estimation in the presence of multicollinearity. The new method of estimation is called ridge analysis.

Ridge analysis has drawn considerable interest in recent years. The technique has been developed and new results have been obtained by several authors, e.g., Hawkins (1975), Hemmerle (1975), Sidik (1975). Newhouse and Oman (1971) have conducted a series of Monte Carlo experiments to compare the performance of ridge analysis with that of least squares.

Ridge analysis is an ad hoc procedure which gives a biased estimator. We compare the ridge estimator with the least squares estimator with respect to the mean squared error. Hoerl and Kennard (1970a) have claimed in their paper that for a certain choice of a parameter ($K$) the ridge estimator is uniformly superior to the least squares estimator. This is not true. It appears that the limitation of any optimal property of the ridge estimator and its relation to other known estimators is not often clearly comprehended by many applied statisticians engaged in regression analysis. The object of this paper is to expose the essential features of ridge analysis.

In the following section we give a basis for the choice of the ridge estimator, exhibit its biased character, and compare it with an unbiased estimator. Furthermore, we obtain conditions under which the ridge estimator has smaller mean squared error than the least squares estimator.
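The instability referred to above can be exhibited numerically. The following sketch (Python with NumPy; the correlations, sample size, and seed are illustrative choices, not taken from this report) shows the total variance $\sigma^2\,\mathrm{trace}\,(X'X)^{-1}$ of the least squares estimator growing as two explanatory variables approach collinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 50, 1.0

for rho in (0.0, 0.9, 0.999):
    # Two explanatory variables with correlation rho: near collinearity as rho -> 1.
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    X = np.column_stack([x1, x2])
    # Total MSE of least squares: sigma^2 * trace((X'X)^{-1}),
    # i.e. sigma^2 * sum(1/lambda_i) in terms of the characteristic roots.
    lam = np.linalg.eigvalsh(X.T @ X)
    print(f"rho = {rho:5.3f}   smallest root = {lam.min():10.4f}   "
          f"MSE of LS = {sigma2 * np.sum(1.0 / lam):10.4f}")
```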

2. Ridge analysis. Consider the linear regression model

$$Y = X\theta + \varepsilon$$

where $Y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ design matrix, $\theta$ is a $p \times 1$ vector of unknown parameters, and $\varepsilon$ is an $n \times 1$ vector of observational errors. It is assumed that the components of $\varepsilon$ are uncorrelated and have a common variance equal to $\sigma^2$, say. Also, $E(\varepsilon) = 0$. Let prime denote the transpose of a vector or matrix. The least squares estimate of $\theta$ is obtained by minimizing $(Y - X\theta)'(Y - X\theta)$ with respect to $\theta$, and is given by

(2.1) $\hat\theta = (X'X)^{-1} X'Y.$

It is assumed that the columns of $X$ are linearly independent, and therefore the rank of $X'X$ is equal to $p$.

We have $E[\hat\theta] = \theta$. That is, $\hat\theta$ is an unbiased estimator of $\theta$. Let $\lambda_1, \ldots, \lambda_p$ denote the characteristic roots of $X'X$. The mean squared error of $\hat\theta$ ($\mathrm{MSE}\,\hat\theta$) is given by

(2.2) $E(\hat\theta - \theta)'(\hat\theta - \theta) = \sigma^2 \,\mathrm{trace}\,(X'X)^{-1} = \sigma^2 \sum_{i=1}^{p} \frac{1}{\lambda_i}.$

If the explanatory variables are nearly multicollinear then the matrix $X'X$ is ill conditioned, that is, one (or more) of the characteristic roots of $X'X$ is small. In that case $\mathrm{MSE}\,\hat\theta$ becomes large, as is seen from (2.2). We can avoid $\mathrm{MSE}\,\hat\theta$ becoming large by inflating the characteristic roots, that is, by substituting for $\hat\theta$ the estimator $\theta^*$ given by

(2.3) $\theta^* = (X'X + KI)^{-1} X'Y$

where $I$ is the $p \times p$ identity matrix and $K$ is a positive number. The estimator $\theta^*$ is the ridge estimator, proposed by Hoerl and Kennard as an alternative to the least squares estimator.
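In computation, each of (2.1) and (2.3) is a single linear solve. A minimal sketch (Python with NumPy assumed; the simulated data are illustrative only):

```python
import numpy as np

def least_squares(X, Y):
    # (2.1): theta_hat = (X'X)^{-1} X'Y, computed via a linear solve
    # rather than an explicit inverse for numerical stability.
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ridge(X, Y, K):
    # (2.3): theta_star = (X'X + K I)^{-1} X'Y, with K > 0.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + K * np.eye(p), X.T @ Y)

# Example on simulated data (arbitrary parameter values).
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
theta = np.array([1.0, -2.0, 0.5])
Y = X @ theta + rng.standard_normal(30)
print(least_squares(X, Y))
print(ridge(X, Y, K=0.5))
```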

We have

(2.4) $E\theta^* = (X'X + KI)^{-1}(X'X)\theta.$

Therefore, $\theta^*$ is a biased estimator of $\theta$ unless $K = 0$, in which case $\theta^* = \hat\theta$. Let $P$ be an orthogonal matrix diagonalizing $X'X$, that is,

$$P X'X P' = D$$

where $D$ is a diagonal matrix with the $i$th diagonal element equal to $\lambda_i$. Let $\alpha = (\alpha_1, \ldots, \alpha_p)' = P\theta$. The mean squared error of $\theta^*$ is given, after simplification, by

(2.5) $E(\theta^* - \theta)'(\theta^* - \theta) = \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + K)^2} + K^2 \sum_{i=1}^{p} \frac{\alpha_i^2}{(\lambda_i + K)^2}.$

From (2.2) and (2.5) we see that for any given $K > 0$,

$$\mathrm{MSE}\,\theta^* > \mathrm{MSE}\,\hat\theta$$

for sufficiently large values of $\theta'\theta$. Therefore, the ridge estimator can be compared to the least squares estimator only if $\theta$ is constrained. Suppose it is known a priori that $\theta'\theta \le c$, where $c$ is a positive number. This condition would be realized in many practical situations. Since $\theta'\theta = \alpha'\alpha$, we have $\alpha_i^2 \le c$, $i = 1, \ldots, p$. Hence from (2.5) we get

(2.6) $\mathrm{MSE}\,\theta^* \le \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + K)^2} + cK^2 \sum_{i=1}^{p} \frac{1}{(\lambda_i + K)^2}.$

Theorems 2.1, 2.2, and 2.3 below give certain results on the choice of $K$ in order that the ridge estimator has smaller mean squared error than the least squares estimator.

Theorem 2.1. If $\theta'\theta \le c$ then $\mathrm{MSE}\,\theta^* < \mathrm{MSE}\,\hat\theta$ for $0 < K < 2\sigma^2/c$.

Proof: From (2.6) we have, for $0 < K < 2\sigma^2/c$,

$$\mathrm{MSE}\,\theta^* \le \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + K)^2} + cK^2 \sum_{i=1}^{p} \frac{1}{(\lambda_i + K)^2} < \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i + 2K}{(\lambda_i + K)^2} \le \sigma^2 \sum_{i=1}^{p} \frac{1}{\lambda_i} = \mathrm{MSE}\,\hat\theta,$$

the last inequality holding because $\lambda_i(\lambda_i + 2K) \le (\lambda_i + K)^2$ for every $i$. $\square$
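Theorem 2.1 is easy to check numerically from the exact risk expressions (2.2) and (2.5). In the sketch below (Python with NumPy; the spectrum, $\alpha$, and constants are illustrative assumptions, not from the report), the ridge risk stays below the least squares risk throughout $0 < K < 2\sigma^2/c$:

```python
import numpy as np

def mse_ls(lam, sigma2):
    # (2.2): exact risk of the least squares estimator.
    return sigma2 * np.sum(1.0 / lam)

def mse_ridge(lam, alpha, sigma2, K):
    # (2.5): exact risk of the ridge estimator, with alpha = P theta.
    return (sigma2 * np.sum(lam / (lam + K)**2)
            + K**2 * np.sum(alpha**2 / (lam + K)**2))

# Illustrative spectrum with one small root (ill conditioned design).
lam = np.array([10.0, 1.0, 0.01])
alpha = np.array([0.5, -0.3, 0.2])     # alpha'alpha = theta'theta = 0.38
sigma2, c = 1.0, 0.38                  # so that theta'theta <= c holds
for frac in (0.0, 0.5, 0.99):
    K = frac * 2 * sigma2 / c          # K ranges over [0, 2 sigma^2 / c)
    print(f"K = {K:7.4f}   MSE ridge = {mse_ridge(lam, alpha, sigma2, K):9.4f}   "
          f"MSE LS = {mse_ls(lam, sigma2):9.4f}")
```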

Theorem 2.2. If $\theta'\theta \le c$ and $c \le \frac{\sigma^2}{p} \sum_{i=1}^{p} \frac{1}{\lambda_i}$, then $\mathrm{MSE}\,\theta^* \le \mathrm{MSE}\,\hat\theta$ for all $K > 0$.

Proof: Let $D(K)$ denote the quantity on the right hand side of (2.6). Differentiating $D(K)$ with respect to $K$ we have

(2.7) $\partial D(K)/\partial K = 2 \sum_{i=1}^{p} \frac{\lambda_i (cK - \sigma^2)}{(\lambda_i + K)^3}.$

The right hand side of (2.7) is equal to zero for $K = \sigma^2/c$, and is $<\,(>)\ 0$ for $K <\,(>)\ \sigma^2/c$. Therefore, $D(K)$ is first decreasing and then increasing as $K$ varies from $0$ to $\infty$. Now

$$D(0) = \sigma^2 \sum_{i=1}^{p} \frac{1}{\lambda_i} = \mathrm{MSE}\,\hat\theta, \qquad \lim_{K \to \infty} D(K) = pc.$$

Hence

$$D(K) \le \max(pc,\ \mathrm{MSE}\,\hat\theta) = \mathrm{MSE}\,\hat\theta \quad \text{for } c \le \frac{\sigma^2}{p} \sum_{i=1}^{p} \frac{1}{\lambda_i}.$$

Since $D(K)$ is an upper bound on the value of $\mathrm{MSE}\,\theta^*$, the theorem follows. $\square$
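The shape of $D(K)$ asserted in the proof, decreasing to its minimum at $K = \sigma^2/c$ and then increasing toward $pc$, can be confirmed numerically. A sketch with illustrative constants (Python with NumPy assumed):

```python
import numpy as np

lam = np.array([10.0, 1.0, 0.01])            # illustrative characteristic roots
sigma2, p = 1.0, 3
c = 0.9 * (sigma2 / p) * np.sum(1.0 / lam)   # satisfies the hypothesis of Theorem 2.2

def D(K):
    # Right hand side of (2.6): upper bound on the MSE of the ridge estimator.
    return (sigma2 * np.sum(lam / (lam + K)**2)
            + c * K**2 * np.sum(1.0 / (lam + K)**2))

mse_ls = sigma2 * np.sum(1.0 / lam)
for K in (0.0, sigma2 / c, 10.0, 1e6):
    print(f"K = {K:12.3f}   D(K) = {D(K):9.3f}   MSE LS = {mse_ls:9.3f}")
# D(0) = MSE LS; D dips at K = sigma^2/c and tends to p*c < MSE LS as K grows.
```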

From a Bayesian point of view, suppose that the components of $\theta$ are independently and identically distributed with mean $\xi$ and variance $\tau^2$.

Theorem 2.3. If the components of $\theta$ are independently and identically distributed with mean $\xi$ and variance $\tau^2$, then the average mean squared error of the ridge estimator is minimized for $K = \sigma^2/(\xi^2 + \tau^2)$.

Proof: Let $E$ denote expectation with respect to the given prior distribution of $\theta$. We have

(2.8) $E(\mathrm{MSE}\,\theta^*) = \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + K)^2} + K^2 (\xi^2 + \tau^2) \sum_{i=1}^{p} \frac{1}{(\lambda_i + K)^2}.$

As in the proof of Theorem 2.2 we find that the right hand side of (2.8) is minimized for $K = \sigma^2/(\xi^2 + \tau^2)$. $\square$
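A quick numerical check of Theorem 2.3 (Python with NumPy; the roots and prior moments below are illustrative, with $m = \xi^2 + \tau^2$ denoting the common prior second moment). The grid minimizer of (2.8) agrees with the theoretical value $K = \sigma^2/(\xi^2 + \tau^2)$:

```python
import numpy as np

lam = np.array([10.0, 1.0, 0.01])     # illustrative characteristic roots
sigma2, xi, tau2 = 1.0, 0.4, 0.5
m = xi**2 + tau2                      # common prior second moment
K_opt = sigma2 / m                    # Theorem 2.3: sigma^2 / (xi^2 + tau^2)

# Prior-averaged risk (2.8) evaluated on a fine grid of K values.
Ks = np.linspace(1e-3, 10.0, 20001)
vals = (sigma2 * np.sum(lam / (lam + Ks[:, None])**2, axis=1)
        + Ks**2 * m * np.sum(1.0 / (lam + Ks[:, None])**2, axis=1))
print(f"theoretical K = {K_opt:.3f}, grid minimizer = {Ks[np.argmin(vals)]:.3f}")
```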

We have seen above that the ridge estimator is preferred to the least squares estimator in certain situations when the parameter $\theta$ is constrained. The following theorem gives a basis for the choice of the ridge estimator under the given constraint.

Theorem 2.4. The value of $\theta$ minimizing $R(\theta) = (Y - X\theta)'(Y - X\theta)$, subject to $\theta'\theta \le c$, is equal to $\theta^*$ where $K$ is chosen such that $\theta^{*\prime}\theta^* = c$.

Proof: By direct computation we get

(2.9) $\theta^{*\prime}\theta^* = (PX'Y)'(D + KI)^{-2}(PX'Y).$

It is seen from (2.9) that $\theta^{*\prime}\theta^*$ is decreasing in $K$. Therefore, the value of $K$ given by $\theta^{*\prime}\theta^* = c$ is uniquely determined.

We have

(2.10) $R(\theta^*) = (Y - X\theta^*)'(Y - X\theta^*)$
$\qquad = (Y - X\hat\theta)'(Y - X\hat\theta) + (X'Y)'\big[(X'X + KI)^{-1} - (X'X)^{-1}\big]\, X'X\, \big[(X'X + KI)^{-1} - (X'X)^{-1}\big](X'Y)$
$\qquad = (Y - X\hat\theta)'(Y - X\hat\theta) + (PX'Y)'\, D^*\, (PX'Y),$

where $D^*$ is a $p \times p$ diagonal matrix whose $i$th diagonal element is equal to

$$\frac{K^2}{\lambda_i (K + \lambda_i)^2}.$$

From (2.10) we see that $R(\theta^*)$ is increasing in $K$.

Consider the problem of minimizing $R(\theta)$ with respect to $\theta$ under the constraint $\theta'\theta = c$. By the Lagrangian method the minimizing value of $\theta$ is given by

$$\lambda\theta - X'(Y - X\theta) = 0,$$

or

$$\theta = (X'X + \lambda I)^{-1} X'Y,$$

where $\lambda$ is determined such that $\theta'\theta = c$. That is, the minimizing value of $\theta$ is the ridge estimator $\theta^*$, where $K$ is determined such that $\theta^{*\prime}\theta^* = c$.

We have shown above that $\theta^{*\prime}\theta^*$ is decreasing in $K$ and that $R(\theta^*)$ is increasing in $K$. It follows that $\theta^*$, which minimizes $R(\theta)$ under the constraint $\theta'\theta = c$, minimizes $R(\theta)$ also under the constraint $\theta'\theta \le c$. $\square$
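Because $\theta^{*\prime}\theta^*$ is strictly decreasing in $K$ by (2.9), the value of $K$ in Theorem 2.4 can be located by bisection. A sketch (Python with NumPy; data simulated, and it is assumed that $\hat\theta'\hat\theta > c$ so that the constraint is active):

```python
import numpy as np

def ridge(X, Y, K):
    # (2.3): theta_star = (X'X + K I)^{-1} X'Y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + K * np.eye(p), X.T @ Y)

def ridge_K_for_constraint(X, Y, c, K_hi=1e6, tol=1e-10):
    # Find K > 0 with theta*'theta* = c by bisection, using the fact
    # (from (2.9)) that theta*'theta* is decreasing in K.
    lo, hi = 0.0, K_hi
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        t = ridge(X, Y, mid)
        if t @ t > c:
            lo = mid      # squared norm still too large: increase K
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(40)
c = 1.0
K = ridge_K_for_constraint(X, Y, c)
t = ridge(X, Y, K)
print(f"K = {K:.4f}, theta*'theta* = {t @ t:.4f}")   # close to c
```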

Remark: We have a comparison between least squares estimation and ridge estimation. The ridge estimator is given by minimizing $R(\theta)$ under a certain constraint on the value of $\theta'\theta$, whereas the least squares estimator is given by minimizing $R(\theta)$ without that constraint.

Throughout the foregoing discussion we have assumed that the quantity $K$ arising in the definition of the ridge estimator $\theta^*$ is a scalar constant. By letting $K$ depend on the observation $Y$ suitably, we might be able to obtain an estimator which has a smaller MSE than the least squares estimator for all values of $\theta$. Hoerl and Kennard (1970a) have suggested an iterative method of choosing such a value of $K$. However, they did not show that the final estimator has a smaller MSE than the least squares estimator. On the other hand (see Alam (1974)), any estimator of the form

$$\phi(Z)\,\hat\theta, \qquad Z = \frac{Y'X(X'X)^{-1}X'Y}{\sigma^2},$$

has smaller MSE than $\hat\theta$, where $\phi(Z)$ is a function such that $Z(1 - \phi(Z))$ is nondecreasing in $Z$ and $0 \le Z(1 - \phi(Z)) \le 2p - 4$, together with a condition on $\min_{1 \le i \le p} \lambda_i$ [illegible in source].

See also Sclove (1968) and Stein (1960).
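As an illustration of such data-dependent shrinkage, the sketch below uses the classical positive-part James-Stein rule in the orthonormal case $X'X = I_p$ with $\sigma^2$ known; this is a stand-in for exposition, not the estimator of Alam (1974). Taking $\phi(Z) = \max(0,\ 1 - (p-2)/Z)$ gives $Z(1 - \phi(Z)) = \min(Z,\ p-2)$, which is nondecreasing in $Z$ and bounded by $2p - 4$ for $p \ge 2$:

```python
import numpy as np

def stein_shrunk_ls(X, Y, sigma2):
    # Shrinkage phi(Z) * theta_hat with phi(Z) = max(0, 1 - (p-2)/Z),
    # Z = Y'X(X'X)^{-1}X'Y / sigma^2.  Then Z(1 - phi(Z)) = min(Z, p-2),
    # nondecreasing and <= 2p - 4 for p >= 2.
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Z = (Y @ X @ theta_hat) / sigma2          # = Y'X(X'X)^{-1}X'Y / sigma^2
    p = X.shape[1]
    phi = max(0.0, 1.0 - (p - 2) / Z)
    return phi * theta_hat

# Monte Carlo comparison in the orthonormal case (illustrative values).
rng = np.random.default_rng(3)
n, p, sigma2, reps = 50, 6, 1.0, 2000
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns: X'X = I
theta = 0.3 * np.ones(p)
mse_ls = mse_st = 0.0
for _ in range(reps):
    Y = X @ theta + rng.standard_normal(n)
    t_ls = np.linalg.solve(X.T @ X, X.T @ Y)
    t_st = stein_shrunk_ls(X, Y, sigma2)
    mse_ls += np.sum((t_ls - theta)**2) / reps
    mse_st += np.sum((t_st - theta)**2) / reps
print(f"MSE LS = {mse_ls:.3f}, MSE shrunk = {mse_st:.3f}")
```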


References

[1] Alam, K. (1974). Minimax estimators of the mean of a multivariate normal distribution. Tech. Report, Mathematical Sciences, Clemson University.

[2] Hawkins, D. M. (1975). Relations between ridge regression and eigenanalysis of the augmented correlation matrix. Technometrics, Vol. 17, pp. 477-480.

[3] Hemmerle, W. J. (1975). An explicit solution for generalized ridge regression. Technometrics, Vol. 17, pp. 309-314.

[4] Hoerl, A. E. (1959). Optimum solution of many variable equations. Chemical Engineering Progress, Vol. 55, pp. 69-78.

[5] Hoerl, A. E. (1962). Applications of ridge analysis to regression problems. Chemical Engineering Progress, Vol. 58, pp. 54-59.

[6] Hoerl, A. E. and Kennard, R. W. (1970a). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, Vol. 12, pp. 55-67.

[7] Hoerl, A. E. and Kennard, R. W. (1970b). Ridge regression: applications to nonorthogonal problems. Technometrics, Vol. 12, pp. 69-82.

[8] Newhouse, J. P. and Oman, S. D. (1971). An evaluation of ridge estimators. Rand Technical Report R-716-PR.

[9] Sclove, S. L. (1968). Improved estimators for coefficients in linear regression. J. Amer. Statist. Assoc., Vol. 63, pp. 597-606.

[10] Sidik, S. M. (1975). Comparison of some biased estimation methods (including ordinary subset regression) in the linear model. NASA Technical Report TN D-7932.

[11] Stein, C. M. (1960). Multiple regression. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, Olkin, I. (ed.), Stanford University Press, pp. 423-443.
