
Sloan School of Management

Massachusetts Institute of Technology

Cambridge 39, Massachusetts

December, 1964

Multicollinearity in Regression Analysis

The Problem Revisited

105-64

D. E. Farrar and R. R. Glauber*

This paper is a draft for private circulation and comment. It should not be cited, quoted, or reproduced without permission of the authors. The research has been supported by the Institute of Naval Studies, and by grants from the Ford Foundation to the Sloan School of Management, and the Harvard Business School.

*Sloan School of Management, and Graduate School of Business Administration, Harvard University, respectively.


CONTENTS

The Multicollinearity Problem

Nature and Effects

Estimation

Illustration

Specification

Historical Approaches

Econometric

Computer Programming

The Problem Revisited

Definition

Diagnosis

General

Specific

Illustration

Summary


MULTICOLLINEARITY IN REGRESSION ANALYSIS

THE PROBLEM REVISITED

To most economists the single equation least squares regression model, like an old friend, is tried and true. Its properties and limitations have been extensively studied, documented and are, for the most part, well known. Any good text in econometrics can lay out the assumptions on which the model is based and provide a reasonably coherent, perhaps even a lucid, discussion of problems that arise as particular assumptions are violated. A short bibliography of definitive papers on such classical problems as non-normality, heteroscedasticity, serial correlation, feedback, etc., completes the job.

As with most old friends, however, the longer one knows least squares, the more one learns about it. An admiration for its robustness under departures from many assumptions is sure to grow. The admiration must be tempered, however, by an appreciation of the model's sensitivity to certain other conditions. The requirement that independent variables be truly independent of one another is one of these.

Proper treatment of the model's classical problems ordinarily involves two separate stages, detection and correction. The Durbin-Watson test for serial correlation, combined with Cochrane and Orcutt's suggested first differencing procedure, is an obvious example.*

*J. Durbin and G. S. Watson, "Testing for Serial Correlation in Least Squares Regression," Biometrika, 37-38, 1950-51; D. Cochrane and G. H. Orcutt, "Application of Least Squares Regressions to Relationships Containing Auto-Correlated Error Terms," Journal of the American Statistical Association, 44, 1949.

Page 7:  · Sloan School of Management Massachu setts Institute of Technology Cambridge 39, Massachusetts December, 1964 Go p q j . Multicollinearity in Regression Analysis The Problem Revisi

Bartlett's test for variance heterogeneity, followed by a data transformation to restore homoscedasticity, is another.* No such "proper treatment" has been developed, however, for problems that arise as multicollinearity is encountered in regression analysis.

Our attention here will focus on what we consider to be the first step in a proper treatment of the problem: its detection, or diagnosis. Economists generally agree that the second step, correction, requires the generation of additional information.** Just how this information is to be obtained depends largely on the tastes of an investigator and on the specifics of a particular problem. It may involve additional primary data collection, the use of extraneous parameter estimates from secondary data sources, or the application of subjective information through constrained regression, or through Bayesian estimation procedures. Whatever its source, however, selectivity, and thereby efficiency, in generating the added information requires a systematic procedure for detecting its need, i.e., for detecting the existence, measuring the extent, and pinpointing the location and causes of multicollinearity within a set of independent variables. Measures are proposed here that, in our opinion, fill this need.

The paper's basic organization can be outlined briefly as follows. In the next section the multicollinearity problem's basic, formal nature is developed and illustrated. A discussion

*F. David and J. Neyman, "Extension of the Markoff Theorem on Least Squares," Statistical Research Memoirs, II, London, 1938.

**J. Johnston, Econometric Methods, McGraw-Hill, 1963, p. 207; J. Meyer and R. Glauber, Investment Decisions, Economic Forecasting and Public Policy, Division of Research, Graduate School of Business Administration, Harvard University, 1964, p. 181 ff.


of historical approaches to the problem follows. With this as background, an attempt is made to define multicollinearity in terms of departures from a hypothesized statistical condition, and to fashion a series of hierarchical measures, at each of three levels of detail, for its presence, severity, and location within a set of data. Tests are developed in terms of a generalized, multivariate normal, linear model. A pragmatic interpretation of resulting statistics as dimensionless measures of correspondence between hypothesized and sample properties, rather than in terms of classical probability levels, is advocated. A numerical example and a summary complete the exposition.


THE MULTICOLLINEARITY PROBLEM

NATURE AND EFFECTS

The purpose of regression analysis is to estimate the parameters of a dependency, not of an interdependency, relationship. We define first

Y, X as observed values, measured as standardized deviates, of the dependent and independent variables,

β as the true (structural) coefficients,

u as the true (unobserved) error term, with distributional properties specified by the general linear model,* and

σᵤ² as the underlying, population variance of u;

and presume that Y and X are related to one another through the linear form

Y = Xβ + u.

Least squares regression analysis leads to estimates

β̂ = (XᵀX)⁻¹XᵀY

with variance-covariance matrix

σᵤ²(XᵀX)⁻¹

*See for example, J. Johnston, op. cit., Ch. 4; or F. Graybill, An Introduction to Linear Statistical Models, McGraw-Hill.

that, in a variety of senses, best reproduces the hypothesized dependency relationship.
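As a purely numerical aside (ours, not the paper's), the estimator and its variance-covariance matrix can be sketched in a few lines of NumPy; the sample size, coefficients, and variable names below are illustrative assumptions:

```python
# Minimal sketch of least squares: beta_hat = (X'X)^-1 X'Y and the
# estimated variance-covariance matrix s^2 (X'X)^-1. Made-up data.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.standard_normal((n, k))              # standardized regressors
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_normal(n)   # Y = X beta + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # least squares estimates

resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)                 # estimate of sigma_u^2
cov_beta = s2 * XtX_inv                      # estimated var-cov matrix
print(beta_hat.round(2))
```

With regressors this close to orthogonal, the estimates land near the true coefficients; the interesting cases below are those where they do not.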

Multicollinearity, on the other hand, is viewed here as an interdependency condition. It is defined in terms of a lack of independence, or of the presence of interdependence, signified by high intercorrelations (XᵀX) within a set of variables, and under this view can exist quite apart from the nature, or even the existence, of a dependency relationship between X and a dependent variable Y. Multicollinearity is not important to the statistician for its own sake. Its significance, as contrasted with its definition, comes from the effect of interdependence in X on the dependency relationship whose parameters are desired. Multicollinearity constitutes a threat, and often a very serious threat, both to the proper specification and to the effective estimation of the type of structural relationships commonly sought through the use of regression techniques.

The single equation, least squares regression model is not well equipped to cope with interdependent explanatory variables. In its original and most simple form the problem is not even conceived. Values of X are presumed to be the pre-selected, controlled elements of a classical, laboratory experiment.* Least squares models are not limited, however, to simple, fixed variate or fully controlled experimental situations.** Partially controlled or completely uncontrolled experiments, in which X as well as Y is subject to random variation, and therefore also to multicollinearity, may provide the data on which perfectly legitimate regression analyses are based.

*Kendall puts his finger on the essence of the simple, fixed variate regression model when he remarks that standard multiple regression is not multivariate at all, but is univariate. Only Y is presumed to be generated by a process that includes stochastic elements. pp. 68-90.

**See Model 3, Graybill, op. cit., p. 104.


Though not limited to fully controlled, fixed variate experiments, the regression model, like any other analytical tool, is limited to good experiments if good results are to be insured. And a good experiment must provide as many dimensions of independent variation in the data it generates as there are in the hypothesis it handles. An n dimensional hypothesis implied by a regression equation containing n independent variables can be neither properly estimated nor properly tested, in its entirety, by data that contains fewer than n significant dimensions.
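The dimensionality requirement is easy to see mechanically (our illustration, not the paper's): with three regressors built from only two independent sources of variation, the data have rank two and (XᵀX) is singular.

```python
# Sketch: three regressors, but only two independent dimensions of
# variation. The third column is an exact linear combination, so the
# data cannot support a three-dimensional hypothesis.
import numpy as np

rng = np.random.default_rng(1)
n = 30
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
X = np.column_stack([z1, z2, 2.0 * z1 - z2])  # exact dependence

print(np.linalg.matrix_rank(X))   # 2, not 3
print(np.linalg.det(X.T @ X))     # numerically zero: X'X is singular
```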

In most cases an analyst may not be equally concerned about the structural integrity of each parameter in an equation. Indeed, it is often suggested that for certain forecasting applications structural integrity anywhere, although always nice to have, may be of secondary importance. This point will be considered later. For the moment it must suffice to emphasize that multicollinearity is both a symptom and a facet of poor experimental design. In a laboratory, poor design occurs only through the improper use of control. In life, where most economic experiments take place, control is minimal at best, and multicollinearity is an ever present and seriously disturbing fact of life.

Estimation

Difficulties associated with a multicollinear set of data depend, of course, on the severity with which the problem is encountered. As interdependence among explanatory variables X grows, the correlation matrix (XᵀX) approaches singularity, and elements of the inverse matrix (XᵀX)⁻¹ explode. In the limit, perfect linear dependence within an independent variable set leads to perfect singularity on the part of (XᵀX) and to a completely indeterminate set of parameter estimates. In a formal sense, diagonal elements of the inverse matrix (XᵀX)⁻¹ that correspond to linearly dependent members of X become infinite. Variances for the affected variables' regression coefficients, σᵤ²(XᵀX)⁻¹, accordingly, also become infinite.
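The explosion of the inverse can be made concrete (our sketch, with made-up correlation values): for two standardized regressors with correlation r, the diagonal of (XᵀX)⁻¹ is exactly 1/(1 - r²), which grows without bound as r approaches one.

```python
# Sketch: diagonal elements of the inverse correlation matrix explode
# as the correlation r between two standardized regressors approaches 1.
import numpy as np

for r in [0.0, 0.9, 0.99, 0.999]:
    XtX = np.array([[1.0, r], [r, 1.0]])   # correlation form of X'X
    d = np.diag(np.linalg.inv(XtX))[0]     # equals 1 / (1 - r**2)
    print(r, round(d, 2))
```

The printed diagonal is the factor by which each coefficient's sampling variance is inflated relative to the orthogonal case.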

The mathematics, in its brute and tactless way, tells us that explained variance can be allocated completely arbitrarily between linearly dependent members of a completely singular set of variables, and almost arbitrarily between members of an almost singular set. Alternatively, the large variances on regression coefficients produced by multicollinear independent variables indicate, quite properly, the low information content of observed data and, accordingly, the low quality of resulting parameter estimates. It emphasizes one's inability to distinguish the independent contribution to explained variance of an explanatory variable that exhibits little or no truly independent variation.

In many ways a person whose independent variable set is completely interdependent may be more fortunate than one whose data is almost so; for the former's inability to base his model on data that cannot support its information requirements will be discovered by a purely mechanical inability to invert the singular matrix (XᵀX), while the latter's problem, in most cases, will never be fully realized.

Difficulties encountered in the application of regression techniques to highly multicollinear independent variables can be discussed at great length, and in many ways. One can state that the parameter estimates obtained are highly sensitive to changes in model specification, to changes in sample coverage, or even to changes in the direction of minimization; but lacking a simple illustration, it is difficult to endow such statements with much meaning.

Illustrations, unfortunately, are plentiful in economic research. A simple Cobb-Douglas production function provides an almost classical example of the instability of least squares parameter estimates when derived from collinear data.


Illustration

The Cobb-Douglas production function* can be expressed most simply as

P = L^β₁ C^β₂ e^(a+u)

where

P is production or output,

L is labor input,

C is capital input,

β₁, β₂, and a are parameters to be estimated, and

u is an error or residual term.

Should β₁ + β₂ = 1, proportionate changes in inputs generate equal, proportionate changes in expected output; the production function is linear, homogeneous, and several desirable conditions of welfare economics are satisfied. Cobb and Douglas set out to test the homogeneity hypothesis empirically. Structural estimates, accordingly, are desired.

Twenty-four annual observations on aggregate employment, capital stock and output for the manufacturing sector of the U.S. economy, 1899 to 1922, are collected. β₂ is set equal to 1 - β₁, and the value of the labor coefficient, β₁ = .75, is estimated by constrained least squares regression analysis. By virtue of the constraint, β₂ = 1 - .75 = .25. Cobb and Douglas are satisfied that structural estimates of labor and capital coefficients for the manufacturing sector of the economy have been obtained.

*C. W. Cobb and P. H. Douglas, "A Theory of Production," American Economic Review, XVIII, Supplement, March 1928. See H. Mendershausen, "On the Significance of Professor Douglas' Production Function," Econometrica, 6, April 1938; and D. Durand, "Some Thoughts on Marginal Productivity with Special Reference to Professor Douglas' Analysis," Journal of Political Economy, 45, 1937.


Ten years later Mendershausen reproduces Cobb and Douglas' work and demonstrates, quite vividly, both the collinearity of their data and the sensitivity of results to sample coverage and direction of minimization. Using the same data we have attempted to reconstruct both Cobb and Douglas' original calculations and Mendershausen's replication of same. A demonstration of least squares' sensitivity to model specification, i.e., to the composition of an equation's independent variable set, is added. By being unable to reproduce several of Mendershausen's results, we inadvertently (and facetiously) add "sensitivity to computation error" to the table of pitfalls commonly associated with multicollinearity in regression analysis. Our own computations are summarized below.

Parameter estimates for the Cobb-Douglas model are linear in logarithms of the variables P, L, and C. Table 1 contains simple correlations between these variables, in addition to an arithmetic trend, t. Multicollinearity within the data is indicated by the high intercorrelations between all variables, a common occurrence when aggregative time series data are used.
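The pattern in Table 1 is easy to mimic with synthetic series (ours, not the Cobb-Douglas data): any set of aggregates dominated by a common trend will show near-unit intercorrelations.

```python
# Sketch: trending "annual" series standing in for log L, log C, log P.
# Because a common trend dominates the noise, every simple correlation
# in the matrix is near 1, as in Table 1. All numbers are made up.
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(24, dtype=float)               # 24 observations, as in the study
log_L = 0.02 * t + 0.01 * rng.standard_normal(24)
log_C = 0.05 * t + 0.01 * rng.standard_normal(24)
log_P = 0.03 * t + 0.01 * rng.standard_normal(24)

R = np.corrcoef([log_L, log_C, log_P, t])    # simple correlation matrix
print(R.round(3))
```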

The sensitivity of parameter estimates to virtually any change in the delicate balance obtained by a particular sample of observations, and a particular model specification, may be illustrated forcefully by the wildly fluctuating array of labor and capital coefficients, β₁ and β₂, summarized in Table 2.

Equations (a) and (b) replicate Cobb and Douglas' original, constrained estimates of labor and capital coefficients, with and without a trend to pick up the impact on productivity of technological change.* Cobb and Douglas comment on the increase in correlation that results from adding a trend to their production function,* but neglect to report the change in production coefficients that accompany the new specification (equation b, Table 2).

*The capital coefficient β₂ is constrained to equal 1 - β₁ by regressing the logarithm of P/C on the logarithm of L/C, with and without a term for trend, β₃t.

Table 1

Simple Correlations, Cobb-Douglas Production Function

[Correlation entries among Log P, Log L, Log C, and t not recovered. To four places, r_Ct = .99683.]
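The constrained regression just described (log(P/C) on log(L/C)) can be sketched as follows; the series are synthetic, generated with a labor coefficient of .75 by construction, so this illustrates only the mechanics, not the original data:

```python
# Sketch of the constrained Cobb-Douglas fit: regressing log(P/C) on
# log(L/C) imposes beta1 + beta2 = 1. Synthetic data generated with
# beta1 = 0.75; not the original Cobb-Douglas series.
import numpy as np

rng = np.random.default_rng(3)
n = 24
t = np.arange(n, dtype=float)
log_L = 0.02 * t + 0.05 * rng.standard_normal(n)
log_C = 0.05 * t + 0.05 * rng.standard_normal(n)
log_P = 0.75 * log_L + 0.25 * log_C + 0.02 * rng.standard_normal(n)

y = log_P - log_C                            # log(P/C)
x = log_L - log_C                            # log(L/C)
A = np.column_stack([np.ones(n), x])         # intercept plus slope
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]
b2 = 1.0 - b1                                # recovered by the constraint
print(round(b1, 2), round(b2, 2))
```

By construction the two estimates add to one; the constraint buys that certainty by assuming away the very homogeneity question the unconstrained fits below reopen.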

The sensitivity of coefficient estimates to changes in model specification is demonstrated even more strikingly, however, by equations (c) and (d), Table 2. Here estimates of the logarithm of production are based directly on logarithms of labor and capital, with and without a term for trend. β₁ and β₂, accordingly, are not constrained to add to unity. The relationship obtained in equation (c) is very nearly linear, homogeneous: β₁ + β₂ falls very close to unity. Cobb and Douglas, clearly, would be delighted by this specification. Unconstrained parameters closely resemble their own constrained estimates of labor and

*Cobb and Douglas, op. cit., p. 154.


capital coefficients, and the observed relationship between dependent and independent variables is strong, both individually (as indicated by t-ratios) and jointly (as measured by R², the coefficient of determination).

By adding a trend to the independent variable set, however, the whole system rotates wildly. The capital coefficient β₂ = .23 that makes equation (c) so reassuring becomes β₂ = … in equation (d). Labor and technological change, of course, pick up and divide between themselves much of capital's former explanatory power. Neither, however, assumes a value that, on a priori grounds, can be dismissed as outrageous; and, it must be noted, each variable's individual contribution to explained variance (measured by t-ratios) continues to be strong. Despite the fact that both trend and capital (and labor too, for that matter) carry large standard errors, no "danger light" is flashed by conventional statistical measures of goodness of fit. Indeed, R² and t-ratios for individual coefficients are sufficiently large in either equation to lull a great many econometricians into a wholly unwarranted sense of security.
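The rotation described above can be mimicked with synthetic, trend-dominated regressors (our construction; coefficients and noise levels are arbitrary). Adding a trend to an already collinear pair leaves the overall fit essentially untouched while degrading the precision attached to each coefficient:

```python
# Sketch: adding a trend to two trend-dominated regressors. R^2 stays
# high, but the "capital" coefficient's diagonal element of (X'X)^-1,
# and hence its sampling variance, grows. Synthetic data only.
import numpy as np

rng = np.random.default_rng(4)
n = 24
t = np.arange(n, dtype=float)
x1 = t + rng.standard_normal(n)          # stand-in for log labor
x2 = 2.0 * t + rng.standard_normal(n)    # stand-in for log capital
y = x1 + x2 + rng.standard_normal(n)

def fit(*cols):
    X = np.column_stack([np.ones(n), *cols])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1 - ((y - X @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    var_factor = np.diag(np.linalg.inv(X.T @ X))
    return b, r2, var_factor

b_a, r2_a, v_a = fit(x1, x2)        # without trend
b_b, r2_b, v_b = fit(x1, x2, t)     # with trend
print(round(r2_a, 3), round(r2_b, 3))   # both near 1
print(b_a[2], b_b[2])                   # "capital" coefficient shifts
print(v_a[2], v_b[2])                   # variance factor inflates
```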

Evidence of the instability of multicollinear regression estimates under changes in the direction of minimization is illustrated by equations (e) and (f), Table 2. Here capital and labor, respectively, serve as dependent variables for estimation purposes, after which the desired (labor and capital) coefficients are derived algebraically. Labor coefficients, β₁ = .05 and …, and capital coefficients, β₂ = .60 and .01, are derived from the same data that produces Cobb and Douglas' β₁ = .75, β₂ = .25 division of product between labor and capital.
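A two-variable sketch (ours, with arbitrary numbers) shows why the direction of minimization matters: minimizing errors in y and minimizing errors in x give slope estimates whose ratio is the squared correlation, so they coincide only when the correlation is perfect.

```python
# Sketch: direct regression (minimize errors in y) versus reverse
# regression (minimize errors in x, then invert the slope). The two
# slopes satisfy b_direct = r^2 * b_reverse, so they split apart as
# the correlation falls below one. Synthetic data, illustrative only.
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.standard_normal(n)
y = 0.8 * x + 0.6 * rng.standard_normal(n)   # correlation about .8

b_direct = (x @ y) / (x @ x)    # slope of y on x
b_reverse = (y @ y) / (x @ y)   # reciprocal of the slope of x on y
print(round(b_direct, 2), round(b_reverse, 2))
```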

Following Mendershausen's lead, equations (g) through (i), Table 2, illustrate the model's sensitivity to the few non-collinear observations in the sample. By omitting a single year (1908,


… satisfactory representation of postwar United States consumer behavior. Multicollinearity, unfortunately, contributes to difficulty in the specification as well as in the estimation of economic relationships.

Model specification ordinarily begins in the model builder's mind. From a combination of theory, prior information, and just plain hunch, variables are chosen to explain the behavior of a given dependent variable. The job, however, does not end with the first tentative specification. Before an equation is judged acceptable it must be tested on a body of empirical data. Should it be deficient in any of several respects, the specification, and thereby the model builder's "prior hypothesis," is modified and tried again. The process may go on for some time. Eventually discrepancies between prior and sample information are reduced to tolerable levels and an equation acceptable in both respects is produced.

In concept the process is sound. In practice, however, the econometrician's mind is more fertile than his data, and the process of modifying a hypothesis consists largely of paring down rather than of building up model complexity. Having little confidence in the validity of his prior information, the economist tends to yield too easily to a temptation to reduce his model's scope to that of his data.

Each sample, of course, covers only a limited range of experience. A relatively small number of forces are likely to be operative over, or during, the subset of reality on which a particular set of observations is based. As the number of variables extracted from the sample increases, each tends to measure different nuances of the same, few, basic factors that are present. The sample's basic information is simply spread more and more thinly over a larger and larger number of increasingly multicollinear independent variables.

However real the dependency relationship between Y and each member of a relatively large independent variable set X may be, the growth of interdependence within X as its size increases rapidly decreases the stability, and therefore the sample significance, of each independent variable's contribution to explained variance. As Liu points out, data limitations rather than theoretical limitations are largely responsible for a persistent tendency to underspecify, or to oversimplify, econometric models.* The increase in sample standard errors for multicollinear regression coefficients virtually assures a tendency for relevant variables to be discarded incorrectly from regression equations.
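The discard mechanism can be sketched on synthetic data (our design; the true coefficient on the second variable is 1.0 in both cases). As its collinearity with the first regressor rises, the standard error on its coefficient balloons and its t-ratio, the usual retention criterion, collapses:

```python
# Sketch: the same true coefficient (1.0 on x2) carries a comfortable
# t-ratio in an orthogonal design and a deflated one in a nearly
# collinear design, because the standard error on b2 balloons.
import numpy as np

rng = np.random.default_rng(6)
n = 40

def fit_x2(corr):
    x1 = rng.standard_normal(n)
    x2 = corr * x1 + np.sqrt(1.0 - corr**2) * rng.standard_normal(n)
    y = x1 + x2 + rng.standard_normal(n)          # true b2 = 1
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (n - 3)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))[2]
    return b[2], se

b_lo, se_lo = fit_x2(0.0)      # orthogonal design
b_hi, se_hi = fit_x2(0.999)    # nearly collinear design
print(round(b_lo, 2), round(se_lo, 2))
print(round(b_hi, 2), round(se_hi, 2))
```

A variable judged by its t-ratio alone would survive the first design and often be dropped, incorrectly, in the second.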

The econometrician, then, is in a box. Whether his goal is to estimate complex structural relationships in order to distinguish between alternative hypotheses, or to develop reliable forecasts, the number of variables required is likely to be large, and past experience demonstrates with depressing regularity that large numbers of economic variables from a single sample space are almost certain to be highly intercorrelated. Regardless of the particular application, then, the essence of the multicollinearity problem is the same. There exists a substantial difference between the amount of information required for the satisfactory estimation of a model and the information contained in the data at hand.

If the model is to be retained in all its complexity, solution of the multicollinearity problem requires an augmentation of existing data to include additional information. Parameter estimates for an n dimensional model, e.g., a

*T. C. Liu, "Underidentification, Structural Estimation and Forecasting," Econometrica, 28, October 1960; p. 856.


two dimensional production function, cannot properly be based on data that contains fewer significant dimensions. Neither can such data provide a basis for discriminating between alternative formulations of the model. Even for forecasting purposes the econometrician whose data is multicollinear is in an extremely exposed position. Successful forecasts with multicollinear variables require not only the perpetuation of a stable dependency relationship between Y and X, but also the perpetuation of stable interdependency relationships within X. The second condition, unfortunately, is met only in a context in which the forecasting problem is all but trivial.

The alternative of scaling down each model to fit the dimensionality of a given set of data appears equally unpromising. A set of substantially orthogonal independent variables can in general be specified only by discarding much of the prior theoretical information that a researcher brings to his problem. Time series analyses containing more than one or two independent variables would virtually disappear, and forecasting models too simple to provide reliable forecasts would become the order of the day. Consumption functions that include either income or liquid assets, but not both, provide an appropriate warning.

There is, perhaps, a middle ground. All the variables in a model are seldom of equal interest. Theoretical questions ordinarily focus on a relatively small portion of the independent variable set. Cobb and Douglas, for example, are interested only in the magnitude of labor and capital coefficients, not in the impact on output of technological change. Disputes concerning alternative consumption, investment, and cost of capital models similarly focus on the relevance of, at most, one or two disputed variables. Similarly, forecasting models rely for success mainly on the structural integrity of those variables whose behavior is expected to change. In each case certain variables are strategically important to a particular application while others are not.

Multicollinearity, then, constitutes a problem only if it undermines that portion of the independent variable set that is crucial to the analysis in question: labor and capital for Cobb and Douglas; income and liquid assets for postwar consumption forecasts. Should these variables be multicollinear, corrective action is necessary. New information must be obtained. Perhaps it can be extracted by stratifying or otherwise reworking existing data. Perhaps entirely new data is required. Wherever information is to be sought, however, insight into the pattern of interdependence that undermines present data is necessary if the new information is not to be similarly affected.

Current procedures and summary statistics do not provide effective indications of multicollinearity's presence in a set of data, let alone the insight into its location, pattern, and severity that is required if a remedy in the form of selective additions to information is to be obtained. The current paper attempts to provide appropriate "diagnostics" for this purpose.

Historical approaches to the problem will both facilitate exposition and complete the necessary background for the present approach to multicollinearity in regression analysis.


HISTORICAL APPROACHES

Historical approaches to multicollinearity may be organized in any of a number of ways. A very convenient organization reflects the tastes and backgrounds of two types of persons who have worked actively in the area. Econometricians tend to view the problem in a relatively abstract manner. Computer programmers, on the other hand, see multicollinearity as just one of a relatively large number of contingencies that must be anticipated and treated. Theoretical statisticians, drawing their training, experience and data from the controlled world of the laboratory experiment, are noticeably uninterested in the problem altogether.

Econometric

Econometricians typically view multicollinearity in a very matter-of-fact, if slightly schizophrenic, fashion. They point out, on the one hand, that least squares coefficient estimates,

β̂ = β + (XᵀX)⁻¹Xᵀu,

are "best linear unbiased," since the expectation of the last term is zero regardless of the degree of multicollinearity inherent in X, if the model is properly specified and feedback is absent. Rigorously demonstrated, this proposition is often a source of great comfort to the embattled practitioner. At times it may justify complacency.
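The unbiasedness claim is easy to verify by simulation (our sketch, with arbitrary numbers): even at a correlation of .99 between regressors, least squares estimates average out to the true coefficients; what multicollinearity destroys is their precision, not their expectation.

```python
# Sketch: Monte Carlo check that least squares stays unbiased under
# severe multicollinearity. Estimates center on the true beta = (1, 1)
# but scatter widely from sample to sample. Synthetic design only.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 30, 2000
beta = np.array([1.0, 1.0])

draws = np.empty((reps, 2))
for i in range(reps):
    x1 = rng.standard_normal(n)
    x2 = 0.99 * x1 + np.sqrt(1.0 - 0.99**2) * rng.standard_normal(n)
    X = np.column_stack([x1, x2])
    y = X @ beta + rng.standard_normal(n)
    draws[i] = np.linalg.lstsq(X, y, rcond=None)[0]

print(draws.mean(axis=0).round(2))   # near [1. 1.]: unbiased
print(draws.std(axis=0).round(2))    # large: imprecise
```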

On the other hand, we have seen that multicollinearity imparts a substantial bias toward incorrect model specification.*

*Liu, idem.


It has also been shown that poor specification undermines the "best linear unbiased" character of parameter estimates over multicollinear, independent variable sets.* Complacency, then, tends to be short lived, giving way alternatively to despair as the econometrician recognizes that non-experimental data, in general, is multicollinear and that in principle nothing can be done about it. Or, to use Jack Johnston's words, one is in the statistical position of not being able to make bricks without straw. Data that does not possess the information required by an equation cannot be expected to yield it. Admonitions that new data or additional a priori information are required to "break" the multicollinearity are hardly reassuring, for the gap between information on hand and information required is so often immense.

Together the combination of complacency and despair that characterizes traditional views tends virtually to paralyze efforts to deal with multicollinearity as a legitimate and difficult, yet tractable, econometric problem. There are, of course, exceptions. Two are discussed below.

Artificial Orthogonalization. The first is proposed by Kendall, and illustrated with data from a demand study by Stone. Employed correctly, the method is an example of a solution to the multicollinearity problem that proceeds by reducing a model's information requirements to the information

H. Theil, "Specification Errors and the Estimation of Economic Relationships," Review of the International Statistical Institute; H. Theil, Economic Forecasts and Policy, North Holland, 1962; p. 216.

J. Johnston, op. cit., p. 207.

J. Johnston, idem., and H. Theil, op. cit., p. 217.

M. G. Kendall, op. cit., pp. 70-75.

R. N. Stone, "The Analysis of Market Demand," Journal of the Royal Statistical Society, CVIII, III, 1945; pp. 286-382.


content of existing data. On the other hand, perverse applications lead to parameter estimates that are even less satisfactory than those based on the original set of data.

Given a set of interdependent explanatory variables X, and a hypothesized dependency relationship

Y = XB + u ,

Kendall proposes "[to throw] new light on certain old but unsolved problems; particularly (a) how many variables do we take? (b) how do we discard the unimportant ones? and (c) how do we get rid of multicollinearities in X?"*

His solution, briefly, runs as follows. Defining

X as a matrix of observations on n multicollinear explanatory variables,

F as a set of m ≤ n orthogonal components or common factors,

U as a matrix of n derived residual (or unique) components, and

A as the constructed set of (m x n) normalized factor loadings,

Kendall decomposes X into a set of statistically significant orthogonal common factors and residual components U such that

X = FA + U

exhausts the sample's observed variation. X-hat = FA, then, sum-

* Kendall, op. cit., p. 70.


return from factor to variable space by transforming estimators A and B* into estimates

B** = A^t B*

of the structural parameters

Y = XB + u

originally sought.

In the special (component analysis) case in which all m = n factors are obtained, and retained in the regression equation, "[nothing has been lost] by the transformation except the time spent on the arithmetical labor of finding it."† By the same token, however, nothing has been gained, for the Gauss-Markoff theorem insures that coefficient estimates B** are identical to the estimates B-hat that would be obtained by the direct application of least squares to the original, highly unstable, set of variables. Moreover, all m = n factors will be found significant and retained only in those instances in which the independent variable set, in fact, is not seriously multicollinear.

In general, therefore, Kendall's procedure derives n parameter estimates,

B** = A^t B* ,

from an m dimensional independent variable set,

Y = F B* + e* = (X - U) A^t B* + e* ,

† Kendall, op. cit., p. 70.
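Kendall's two-step procedure — regress Y on the retained factors, then carry the factor coefficients back into variable space through B** = A^t B* — can be sketched with principal components standing in for his common factors. This is a minimal illustration, not the paper's own computation: the function name, and the use of an eigen-decomposition of the correlation matrix as the factoring step, are assumptions of the sketch.

```python
import numpy as np

def component_regression(X, y, m):
    """Sketch of Kendall-style artificial orthogonalization: regress y
    on m principal components of X, then map the component coefficients
    back to the original variable space (B** = A'B* in the text)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize so Z'Z/N = R
    R = Z.T @ Z / len(Z)                          # sample correlation matrix
    eigval, eigvec = np.linalg.eigh(R)
    A = eigvec[:, np.argsort(eigval)[::-1][:m]]   # loadings of m largest roots
    F = Z @ A                                     # orthogonal factor scores
    b_star, *_ = np.linalg.lstsq(F, y, rcond=None)
    return A @ b_star                             # estimates back in X-space
```

As the text observes, when all m = n components are retained the result coincides with ordinary least squares on the standardized variables, so nothing is gained; the interesting (and risky) case is m < n.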


whose total information content is both lower and less well defined than for the original set of variables. The rank of X - U, clearly, is never greater, and usually is smaller, than the rank of X. Multicollinearity, therefore, is intensified rather than alleviated by the series of transformations. Indeed, by discarding the residual or perhaps the "unique" portion of an independent variable's variation, one is seriously in danger of throwing out the baby rather than the bath i.e., the independent rather than the redundant dimensions of information.

Kendall's approach is not without attractions. Should factors permit identification and use as variables in their own right, the transformation provides a somewhat defensible solution to the multicollinearity problem. The discrepancy between apparent and significant dimensions (in model and data, respectively) is eliminated by a meaningful reduction in the number of the model's parameters. Even where factors cannot be used directly, their derivation provides insight into the pattern of interdependence that undermines the structural stability of estimates based on the original set of variables.

The shortcoming of this approach lies in its prescriptions for handling those situations in which the data do not suggest a reformulation that reduces the model's information requirements i.e., where components cannot be interpreted directly as economic variables. In such circumstances, solution of the multicollinearity problem requires the application of additional information, rather than the further reduction of existing information. Methods that retain a model's full complexity while reducing the information content of existing data aggravate rather than alleviate the multicollinearity problem.

Rules of Thumb. A second and more pragmatic line of attack recognizes the need to live with poorly conditioned, non-experimental data, and seeks to develop rules of thumb by which "acceptable" departures from orthogonality may be distinguished from "harmful" degrees of multicollinearity.

The term "harmful multicollinearity" is generally defined only symptomatically as the cause of wrong signs or other symptoms of nonsense regressions. Such a practice's inadequacy may be illustrated, perhaps, by the ease with which the same argument can be used to explain right signs and sensible regressions from the same basic set of data. An operational definition of harmful multicollinearity, however inadequate it may be, is clearly preferable to the methodological sleight-of-hand that symptomatic definitions make possible.

The most simple, operational definition of unacceptable collinearity makes no pretense to theoretical validity. An admittedly arbitrary rule of thumb is established to constrain simple correlations between explanatory variables to less than, say, r = .8 or .9. The most obvious type of pairwise sample interdependence, of course, can be avoided in this fashion.

More elaborate and apparently sophisticated rules of thumb also exist. One, that has lingered in the background of econometrics for many years, has recently gained sufficient stature to be included in an elementary text. The rule holds, essentially, that "intercorrelation or multicollinearity is not necessarily a problem unless it is high relative to the over-all degree of multiple correlation."* Or, more specifically, if

r_ij is the (simple) correlation between two independent variables, and

R_y is the multiple correlation between dependent and independent variables,

* Klein, An Introduction to Econometrics, Prentice-Hall, p. 101.


multicollinearity is said to be "harmful" if

r_ij ≥ R_y .

By this criterion the Cobb-Douglas production function is not seriously collinear, for multiple correlation R_y = .98 is comfortably greater than the simple correlation between (logarithms of) labor and capital, r_12 = .91. Ironically, this is the application chosen by the textbook to illustrate the rule's validity.

Although its origin is unknown, the rule's intuitive appeal appears to rest on the geometrical concept of a triangle formed by the end points of three vectors (representing variables Y, X_1, and X_2, respectively) in N dimensional observation space (reduced to three dimensions in the figure). Y-hat is represented by the perpendicular (the least squares) reflection of Y onto the X_1, X_2 plane. Multiple correlation R_y is defined by the

[Figure: vectors Y, X_1, and X_2 plotted in observation space]

* Idem.


direction cosine between Y and Y-hat, while simple correlation r_12 is the direction cosine between X_1 and X_2. Should multiple correlation be greater than simple correlation, the triangle's base X_1 X_2 is greater than its height Y Y-hat, and the dependency relationship appears to be "stable."

Despite its intuitive appeal the concept may be attacked on at least two grounds. First, on extension to multiple dimensions it breaks down entirely. Complete multicollinearity i.e., perfect singularity within a set of explanatory variables is quite consistent with very small simple correlations between members of X. A set of dummy variables whose non-zero elements accidentally exhaust the sample space is an obvious, and an aggravatingly common, example. Second, the Cobb-Douglas production function provides a convincing counter-example. If this set of data is not "harmfully collinear," the term has no meaning.

The rule's conceptual appeal may be rescued from absurdities of the first type by extending the concept of simple correlation between independent variables to multiple correlation within the independent variable set. A variable, X_i, then, would be said to be "harmfully multicollinear" only if its multiple correlation with other members of the independent variable set were greater than the dependent variable's multiple correlation with the entire set.

The Cobb-Douglas counter example remains, however, to indicate that multicollinearity is basically an interdependency, not a dependency condition. Should (X^tX) be singular or virtually so, tight sample dependence between Y and X cannot assure the structural integrity of least squares parameter estimates.

Computer Programming

The development of large scale, high speed digital computers has had a well-recognized, virtually revolutionary impact on econometric applications. By bringing new persons into contact with the field the computer also is having a perceptible, if less dramatic, impact on econometric methodology. The phenomenon is not new. Technical specialists have called attention to matters of theoretical interest in the past; Professor Viner's famous draftsman, Mr. Wong, is a notable example.* More recently, the computer programmer's approach to singularity in regression analysis has begun to shape the econometrician's view of the problem as well.

Specifically, the numerical estimation of parameters for a standard regression equation requires the inversion of a matrix of variance-covariance or correlation coefficients for the independent variable set. Estimates of both slope coefficients,

B-hat = (X^tX)^{-1} X^t Y ,

and variances,

var(B-hat) = σ_u² (X^tX)^{-1} ,

require the operation. Should the independent variable set X be perfectly multicollinear, (X^tX), of course, is singular, and a determinate solution does not exist.

The programmer, accordingly, is required to build checks for non-singularity into standard regression routines. The test most commonly used relies on the property that the determinant of a singular matrix is zero. Defining a small, positive test value, δ > 0, a solution is attempted only if the determinant

|X^tX| > δ ;

otherwise, computations are halted and a premature exit is called.
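The programmer's guard just described can be sketched as follows. The function name, the choice of delta, and the use of the correlation (rather than covariance) matrix are illustrative assumptions, not features of any particular regression routine.

```python
import numpy as np

def determinant_check(X, delta=1e-8):
    """Guard against singularity: proceed with the normal-equations
    solution only if the determinant of the sample correlation matrix
    of the independent variables exceeds a small positive value delta."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    R = Z.T @ Z / len(Z)                  # sample correlation matrix
    det = np.linalg.det(R)
    if det <= delta:                      # "premature exit": matrix judged singular
        raise np.linalg.LinAlgError(f"singular correlation matrix, |R| = {det:g}")
    return det
```

Appending an exact linear combination of existing columns drives the determinant to zero and triggers the exit.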

* J. Viner, "Cost Curves and Supply Curves," Zeitschrift für Nationalökonomie, III, 1931. Reprinted in A.E.A. Readings in Price Theory, Irwin, 1952.


Checks for singularity may be kept internal to a computer program, well out of the user's sight. Recently, however, the determinant has tended to join B-hat coefficients, t-ratios, F-tests and other summary statistics as routine elements of printed output. Remembering that the determinant |X^tX| is based on a normalized, correlation matrix, its position on the scale

0 ≤ |X^tX| ≤ 1

yields at least heuristic insight into the degree of interdependence within the independent variable set. As X approaches singularity, of course, |X^tX| approaches zero. Conversely, |X^tX| close to one implies a nearly orthogonal independent variable set. Unfortunately, the gradient between extremes is not well defined. As an ordinal measure of the relative orthogonality of similar sets of independent variables, however, the statistic has attracted a certain amount of well-deserved attention and use.

A single, overall measure of the degree of interdependence within an independent variable set, although useful in its own right, provides little information on which corrective action can be based. Near singularity may result from strong, sample pairwise correlation between independent variables, or from a more subtle and complex linkage between several members of the set. The problem's cure, of course, depends on the nature of the interaction. The determinant per se, unfortunately, gives no information about that interaction.

In at least one case an attempt has been made to localize multicollinearity by building directly into a multiple regression program an index of each explanatory variable's dependence on other members of the independent variable set.* Recalling

* A. E. Beaton and R. R. Glauber, Statistical Laboratory Ultimate Regression Package, Harvard Statistical Laboratory, 1962.


diagonal element r^{ii}, and thereby the variance

var(B-hat_i) = σ_u² r^{ii} ,

explodes, pinpointing not only the existence, but also the location of singularity within an independent variable set.

As originally conceived, diagonal elements of the inverse correlation matrix were checked internally by computer programs only to identify completely singular independent variables. More recently they have joined other statistics as standard elements of regression output.* Even though the spectrum

1 ≤ r^{ii} < ∞

is little explored, diagonal elements, by their size, give heuristic insight into the relative severity, as well as the location, of redundancies within an independent variable set.

Armed with such basic (albeit crude) diagnostics, the investigator may begin to deal with the multicollinearity problem. First, of course, the determinant |X^tX| alerts him to its existence. Next, diagonal elements r^{ii} give sufficient insight into the problem's location and therefore into its cause to suggest the selective additions of information that are required for stable, least squares, parameter estimates.

* Idem.


THE PROBLEM REVISITED

Many persons, clearly, have examined one or more aspects of the multicollinearity problem. Each, however, has focused on one facet to the exclusion of others. Few have attempted to synthesize, or even to distinguish between, either multicollinearity's nature and effects, or its diagnosis and cure. Klein, for example, defines the problem in terms that include both nature and effects; while Kendall attempts to produce a solution without concern for the problem's nature, effects, or diagnosis. Those who do concern themselves with a definition of multicollinearity tend to think of the problem in terms of a discrete condition that either exists or does not exist, rather than as a continuous phenomenon whose severity may be measured.

A good deal of confusion and some inconsistency emerges from this picture. Cohesion requires, first of all, a clear distinction between multicollinearity's nature and effects, and, second, a definition in terms of the former on which diagnosis, and subsequent correction, can be based.

DEFINITION

Econometric problems are ordinarily defined in terms of statistically significant discrepancies between the properties of hypothesized and sample variates. Non-normality, heteroscedasticity and autocorrelation, for example, are defined in terms of differences between the behavior of hypothesized and observed residuals. Such definitions lead directly to the development of test statistics on which detection, and an evaluation of the problem's nature and severity, can be based. Once an investigator is alerted to a problem's existence and character, of course, corrective action ordinarily constitutes a separate and often quite straight-forward step.


Such a definition would seem to be both possible and desirable for multicollinearity.

Let us define the multicollinearity problem, therefore, in terms of departures from parental orthogonality in an independent variable set. Such a definition has at least two advantages.

First, it distinguishes clearly between the problem's essential nature which consists of a lack of independence, or the presence of interdependence, in an independent variable set, X and the symptoms or effects on the dependency relationship between Y and X that it produces.

Second, parental orthogonality lends itself easily to formulation as a statistical hypothesis and, as such, leads directly to the development of test statistics, adjusted for numbers of variables and observations in X, against which the problem's severity can be calibrated. Developed in sufficient detail, such statistics may provide a great deal of insight into the location and pattern, as well as the severity, of interdependence that undermines the experimental quality of a given set of data.

DIAGNOSIS

Once a definition is in hand, multicollinearity ceases to be so inscrutable. Add a set of distributional properties and hypotheses of parental orthogonality can be developed and tested in a variety of ways, at several levels of detail. Statistics whose distributions are known (and tabulated) under appropriate assumptions, of course, must be obtained. Their values for a particular sample provide probabilistic measures of the extent of correspondence or non-correspondence between hypothesized and sample characteristics; in this case, between hypothesized and sample orthogonality.

To derive test stati sti cs with known di stribution s,

specif ic as sumptions are required about the nature of the p op u

lation that generate s sample va lue s of X . Becau se exi sting di s

tribution theory i s based almost entirely on as sumptions that

X i s multivariate normal , it is convenient to retain the assump

tion here as well . Common versions of least square s regre s sion

models , and te sts of signif icance ba sed thereon,also are ba sed

on multivariate normality . Questions of dependence and inter

dependence ih regres sion analysi s,therefore

,may be examined

within the same stati stical framework .

Should the as sumption prove unne cessarity s evere,it s

probabili stic implications can be relaxed informally . For

formal purpose s , however,multivariate normality's strength and

convenience i s e s sential,and underlie s everything that f ollows .

General

The heuristic relationship between orthogonality and the determinant of a matrix of sample first order correlation coefficients,

0 ≤ |X^tX| ≤ 1 ,

has been discussed under computer programming approaches to singularity, above. Should it be possible to attach distributional properties under an assumption of parental orthogonality to the determinant |X^tX|, or to a convenient transformation of |X^tX|, the resulting statistic could provide a useful first measure of the presence and severity of multicollinearity within an independent variable set.


Presuming X to be multivariate normal, such properties are close at hand. As shown by Wishart, sample variances and covariances are jointly distributed according to the frequency function that now bears his name.* Working from the Wishart distribution, Wilks, in an analytical tour de force, is able to derive the moments and distribution (in open form) of the determinant of sample covariance matrices.** Employing the additional assumption of parental orthogonality, he then obtains the moments and distribution of determinants for sample correlation matrices |X^tX| as well. Specifically, the kth moment of |X^tX| is shown to be

(2)   E(|X^tX|^k) = [Γ((N-1)/2) / Γ((N-1)/2 + k)]^{n-1} · ∏_{i=2}^{n} Γ((N-i)/2 + k) / Γ((N-i)/2) ,

where, as before, N is sample size and n the number of variables.

In theory, one ought to be able to derive the frequency function for |X^tX| from (2), and in open form it is indeed possible. For n > 2, however, explicit solutions for the distribution of |X^tX| have not been obtained.

Bartlett, however, by comparing the lower moments of (2) with those of the Chi Square distribution, obtains a transformation of |X^tX|,

(3)   χ²(v) = - [N - 1 - (1/6)(2n + 5)] log_e |X^tX| ,

* Wishart, J., "The Generalized Product Moment Distribution in Samples from a Multivariate Normal Population," Biometrika, 1928.

** Wilks, S., "Certain Generalizations in the Analysis of Variance," Biometrika, 24, 1932; p. 477.

*** Wilks, S., op. cit., p. 492.


that is distributed approximately as Chi Square with v = (1/2) n(n-1) degrees of freedom.

In this light the determinant of intercorrelations within an independent variable set takes on new meaning. No longer is interpretation limited to extremes of the range

0 ≤ |X^tX| ≤ 1 .

By transforming |X^tX| into an approximate Chi Square statistic, a meaningful scale is provided against which departures from hypothesized orthogonality, and hence the gradient between singularity and orthogonality, can be calibrated. Should one accept the multivariate normality assumption, of course, probability levels provide a cardinal measure of the extent to which X is interdependent. Even without such a scale, transformation to a variable whose distribution is known, even approximately by standardizing for sample size and number of variables offers a generalized, ordinal measure of the extent to which quite different sets of independent variables are undermined by multicollinearity.
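The transformation (3) is easy to compute. The sketch below standardizes the columns of X, forms the correlation determinant, and returns the approximate Chi Square statistic with its v = (1/2)n(n-1) degrees of freedom; the function name is an assumption for illustration.

```python
import numpy as np

def bartlett_statistic(X):
    """Approximate Chi Square test of parental orthogonality:
    chi2 = -[N - 1 - (1/6)(2n + 5)] * log|X'X|, where |X'X| is the
    determinant of the sample correlation matrix (eq. 3)."""
    N, n = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    R = Z.T @ Z / N                              # sample correlation matrix
    chi2 = -(N - 1 - (2 * n + 5) / 6) * np.log(np.linalg.det(R))
    return chi2, n * (n - 1) // 2                # statistic, degrees of freedom
```

A nearly orthogonal set yields a small statistic; a nearly singular set drives |X^tX| toward zero and the statistic toward infinity.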

Specific

Determining that a set of explanatory variables departs substantially from internal orthogonality is the first, but only the logical first, step in an analysis of multicollinearity as defined here. If information is to be applied efficiently to alleviate the problem, localization measures are required to accurately specify the variables most severely undermined by interdependence.

To find the basis for one such measure we return to notions developed both by computer programmers and by econometricians. As indicated earlier, both use diagonal elements of the inverse correlation matrix, r^{ii}, in some form, as measures of the extent to which particular explanatory variables are affected by multicollinearity.

Intuition suggests that our definition of hypothesized parental orthogonality be tested through this statistic.

Elements of the necessary statistical theory are developed by Wilks, who obtains the distribution of numerous determinantal ratios of variables from a multivariate normal distribution.* Specifically, for the matrix (X^tX), defining h principal minors [X^tX]_i, for i = 1 ... h, such that no two contain the same diagonal element r_ii but every r_ii enters one principal minor, Wilks considers the ratio

Z = |X^tX| / ∏_{i=1}^{h} |[X^tX]_i| .

For any arbitrary set of h principal minors he then obtains both the moments and distribution of Z. For the special case of interest here (employing the notation of p. 29 above), let us form principal minors such that h = 2, |X_1^t X_1| = r_11 = 1, and |X_2^t X_2|; then

Z = |X^tX| / (|X_1^t X_1| · |X_2^t X_2|) = 1 / r^{11} .

* S. S. Wilks, op. cit., esp. pp. 480-2, 491-2.


which can be recognized as the F-distribution with v_1 and v_2 degrees of freedom.*

The transformation

(9)   w = (r^{ii} - 1) (N - n)/(n - 1) ,

then, can be seen to be distributed as F with N-n and n-1 degrees of freedom. Defining R_i² as the squared multiple correlation between X_i and the other members of X, this result can be most easily understood by recalling that

r^{ii} = 1 / (1 - R_i²) .

Therefore, (r^{ii} - 1) equals R_i²/(1 - R_i²), and w (as defined in (6) and (9) above), except for a term involving degrees of freedom, is the ratio of explained to unexplained variance; it is not surprising, then, to see w distributed as F.
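The statistic (9) can be computed for every variable at once from the diagonal of the inverse correlation matrix. The sketch below is an illustration under the standardizations used throughout; the function name is assumed.

```python
import numpy as np

def variable_f_statistics(X):
    """w_i = (r^ii - 1)(N - n)/(n - 1) for each X_i, where r^ii is the
    i-th diagonal element of the inverse correlation matrix; under
    multivariate normality each w_i is distributed as F(N - n, n - 1)."""
    N, n = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    r_inv = np.linalg.inv(Z.T @ Z / N)    # (X'X)^-1 in correlation form
    return (np.diag(r_inv) - 1) * (N - n) / (n - 1)
```

Since r^{ii} = 1/(1 - R_i²), each w_i is just the familiar regression F ratio for X_i regressed on the remaining members of X.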

As regards the distribution of w, the same considerations discussed in the preceding section are relevant. If X is jointly normal, (9) is distributed exactly as F, and its magnitude therefore provides a cardinal measure of the extent to which individual variables in an independent variable set are affected by multicollinearity. If normality cannot be assumed, (9) still provides an ordinal measure, adjusted for degrees of freedom, of X_i's dependence on other variables in X.

Having established which variables in X are substantially

* F. Graybill, op. cit., p. 31.

multicollinear, it generally proves useful to determine in greater detail the pattern of interdependence between affected members of the independent variable set. An example, perhaps, will illustrate the information's importance. Suppose (9) is large only for X_1, X_2, X_3, and X_4, indicating these variables to be significantly multicollinear, but only with each other, the remaining variables in X being essentially uncorrelated both with each other and with X_1, ..., X_4. Suppose further that all four variables, X_1, ..., X_4, are substantially intercorrelated with each of the others. If well-determined estimates are desired for this subset of variables, additional information must be obtained on at least three of the four.

Alternatively, suppose that X_1 and X_2 are highly correlated, X_3 and X_4 also are highly correlated, but all other intercorrelations among the four, and with other members of X, are small. In this case, additional information must be obtained only for two variables X_1 or X_2, and X_3 or X_4. Clearly, then, the efficient solution of multicollinearity problems requires detailed information about the pattern as well as the existence, severity, and location of intercorrelations within a subset of interdependent variables.

To gain insight into the pattern of interdependence in X, a straightforward transformation of off-diagonal elements of the inverse correlation matrix (X^tX)^{-1} is both effective and convenient. Its development may be summarized briefly, as follows.

Consider a partition of the independent variable set such that variables X_i and X_j constitute X(1) and the remaining n-2 variables X(2). The corresponding matrix of zero order correlation coefficients, then, is partitioned such that

X^tX = [ R_11  R_12 ]
       [ R_21  R_22 ]

where R_11, containing variables X_i and X_j, is of dimension 2 x 2 and R_22 is (n-2) x (n-2). Elements of the inverse correlation matrix (X^tX)^{-1} corresponding to X(1), then, can be expressed without loss of generality as*

[r^{ij}] = (R_11 - R_12 R_22^{-1} R_21)^{-1} .

Before inversion the single off-diagonal element of

(R_11 - R_12 R_22^{-1} R_21)

may be recognized as the partial covariance of X_i and X_j, holding constant X(2), the other members of the independent variable set. On normalizing i.e., dividing by square roots of corresponding diagonal elements in the usual fashion, partial correlation coefficients between X_i and X_j can be obtained.

For the special case considered here, where X(1) contains only 2 variables and R_11, accordingly, is 2 x 2, it can also be shown that corresponding normalized off-diagonal elements of (R_11 - R_12 R_22^{-1} R_21) and its inverse (R_11 - R_12 R_22^{-1} R_21)^{-1} differ

* G. Hadley, Linear Algebra, Addison Wesley, 1961; pp. 107, 108.

** Analysis, Wiley, 1958.


from one another only by sign. It follows, therefore, that, by a change of sign, normalized off-diagonal elements of the inverse correlation matrix (X'X)^-1 yield partial correlations among members of the independent variable set. That is, defining r_ij. as the coefficient of partial correlation between X_i and X_j, other members of X held constant, and r^ij as elements of (X'X)^-1 above, it follows that

    r_ij.  =  - r^ij / (r^ii r^jj)^(1/2) .
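The relation r_ij. = -r^ij / (r^ii r^jj)^(1/2) is easy to verify numerically. A minimal sketch in Python (numpy only; the data, sample size, and induced collinearity are invented for illustration, not taken from the paper) checks the transformation against the direct residual definition of partial correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 4
X = rng.standard_normal((N, n))
X[:, 1] += 0.8 * X[:, 0]          # make X_1 and X_2 collinear

# zero-order correlation matrix of the standardized set and its inverse
R = np.corrcoef(X, rowvar=False)
Rinv = np.linalg.inv(R)

def partial_from_inverse(Rinv, i, j):
    """The paper's transformation: r_ij. = -r^ij / sqrt(r^ii r^jj)."""
    return -Rinv[i, j] / np.sqrt(Rinv[i, i] * Rinv[j, j])

def partial_direct(X, i, j):
    """Correlate residuals of X_i, X_j after regression on the other n-2 variables."""
    others = [k for k in range(X.shape[1]) if k not in (i, j)]
    Z = np.column_stack([np.ones(len(X)), X[:, others]])
    ri = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
    rj = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    return np.corrcoef(ri, rj)[0, 1]

a = partial_from_inverse(Rinv, 0, 1)
b = partial_direct(X, 0, 1)
assert abs(a - b) < 1e-8          # the two routes agree
```

The agreement is exact up to rounding, which is the point of the derivation: the partial correlations come free with (X'X)^-1, with no further regressions needed.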

Distributional properties under a hypothesis of parental orthogonality, of course, are required to tie up the bundle. Carrying forward the assumption of multivariate normality, such properties are close at hand. In a manner exactly analogous to the simple (zero order) correlation coefficient, the statistic

    t_ij  =  r_ij. (N - n)^(1/2) / (1 - r_ij.^2)^(1/2)

may be shown to be distributed as Student's t with v = N - n degrees of freedom.
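This null distribution can be checked by simulation. A small sketch (Python with numpy; the sample size, set size, and replication count are arbitrary choices) draws X repeatedly from an orthogonal normal parent and compares the empirical moments of t_ij with those of Student's t with N - n degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, reps = 60, 5, 2000

def partial_t(X):
    """t_ij = r_ij. * sqrt(N - n) / sqrt(1 - r_ij.^2), for the (1,2) pair."""
    Rinv = np.linalg.inv(np.corrcoef(X, rowvar=False))
    r = -Rinv[0, 1] / np.sqrt(Rinv[0, 0] * Rinv[1, 1])
    return r * np.sqrt(X.shape[0] - X.shape[1]) / np.sqrt(1 - r**2)

# orthogonal parent population: n independent standard normal variables
ts = np.array([partial_t(rng.standard_normal((N, n))) for _ in range(reps)])

# Student's t with v degrees of freedom has mean 0 and variance v / (v - 2)
v = N - n
assert abs(ts.mean()) < 0.1
assert abs(ts.var() - v / (v - 2)) < 0.15
```

Under an orthogonal parent the simulated statistic matches the claimed mean and variance, supporting the use of t_ij as a test of parental orthogonality for each pair.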


An exact, cardinal interpretation of interdependence between X_i and X_j as members of X, of course, requires exact satisfaction of multivariate normal distributional properties. As with the determinant and diagonal elements of (X'X)^-1 that precede it, however, off-diagonal elements transformed to r_ij. or t_ij provide useful ordinal measures of collinearity even in the absence of such rigid assumptions.

Graybill, op. cit., pp. 215, 208.


Illustration

A three-stage hierarchy of increasingly detailed tests for the presence, location, and pattern of interdependence within an independent variable set X has been proposed. In order, the stages are:

1. Test for the presence and severity of multicollinearity anywhere in X, based on the approximate distribution (3) of determinants of sample correlation matrices, |X'X|, from an orthogonal parent population.

2. Test for the dependence of particular variables on other members of X, based on the exact distribution, under parental orthogonality, of diagonal elements of the inverse correlation matrix (X'X)^-1.

3. Examine the pattern of interdependence among X through the distribution, under parental independence, of off-diagonal elements of the inverse correlation matrix (X'X)^-1.

In many ways such an analysis, based entirely on statistics that are routinely generated during standard regression computations, may serve as a substitute for the formal, thorough (and time-consuming) factor analysis of an independent variable set. It provides the insight required to detect, and if present to identify, multicollinearity in X. Accordingly, it may serve as a starting point from which the additional information required for stable least squares estimation can be sought. An illustration, perhaps, will help to clarify the procedure's mechanics and purpose, both of which are quite straightforward.
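The three stages can be sketched as one short routine. In the sketch below (Python with numpy; the data are invented, and the stage-1 statistic uses a Bartlett-type chi-square approximation assumed here to correspond to the distribution the text labels (3)), every diagnostic is computed from the correlation matrix and its inverse alone:

```python
import numpy as np

def farrar_glauber(X):
    """Three-stage collinearity diagnostics for a data matrix X of shape (N, n)."""
    N, n = X.shape
    R = np.corrcoef(X, rowvar=False)     # zero-order correlation matrix (X'X)
    Rinv = np.linalg.inv(R)

    # Stage 1: chi-square on the determinant |X'X|; large values reject
    # parental orthogonality (n(n-1)/2 degrees of freedom).
    chi2 = -(N - 1 - (2 * n + 5) / 6) * np.log(np.linalg.det(R))

    # Stage 2: F statistic for each variable's dependence on the rest,
    # from the diagonal elements r^ii; (n-1, N-n) degrees of freedom.
    F = (np.diag(Rinv) - 1) * (N - n) / (n - 1)

    # Stage 3: t statistics for the pairwise partial correlations,
    # t_ij = r_ij. sqrt(N-n) / sqrt(1 - r_ij.^2); N-n degrees of freedom.
    d = np.sqrt(np.diag(Rinv))
    r_part = -Rinv / np.outer(d, d)
    np.fill_diagonal(r_part, 1.0)
    denom = 1 - r_part**2
    np.fill_diagonal(denom, 1.0)         # dummy value; diagonal t is meaningless
    t = r_part * np.sqrt(N - n) / np.sqrt(denom)
    np.fill_diagonal(t, 0.0)
    return chi2, F, r_part, t

# invented example: x2 is nearly a copy of x1, x3 is unrelated noise
rng = np.random.default_rng(1)
N = 150
x1 = rng.standard_normal(N)
x2 = x1 + 0.1 * rng.standard_normal(N)
x3 = rng.standard_normal(N)
chi2, F, r_part, t = farrar_glauber(np.column_stack([x1, x2, x3]))
# chi2 is large; F[0] and F[1] dwarf F[2]; t[0, 1] is the one big off-diagonal
```

The three return values answer the three questions in order: is there multicollinearity anywhere, which variables carry it, and between which pairs does it run.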

In a series of statistical cost analyses for the U.S. Navy, an attempt has been made to measure the effect on maintenance cost of such factors as ship age, size, intensity of usage (measured by fuel consumption), time between successive overhauls, and such discrete, qualitative characteristics as propulsion mode (steam, diesel, nuclear), complexity (radar picket, guided missile), and conversion under a recent (Fleet Rehabilitation and Modernization, FRAM) program. Equations have been specified and estimated on various samples from the Atlantic Fleet destroyer force that relate logarithms of repair costs to logarithms of age, displacement, overhaul cycle and fuel consumption, and to discrete (0,1) dummy variables for diesel propulsion, radar picket, and FRAM conversion.


Stability under changes in specification, direction of minimization, and sample coverage has been examined heuristically by comparing regression coefficients, determinants of correlation matrices |X'X|, and diagonal elements of (X'X)^-1, from different equations. The sensitivity of certain parameters under such changes (fuel consumption) and the stability of others (age, overhaul cycle) have been noted in the past.

By performing an explicit analysis for interdependence in X, such information could have been obtained more quickly, directly, and in greater detail. Consider, for example, the seven variable equation summarized in Table 3. Multiple correlation and associated F statistics, with t-ratios for the relationship between dependent and independent variables, show dependence between Y and X to be substantial.

D. E. Farrar and R. E. Apple, "Some Factors that Affect the Overhaul Cost of Ships," Naval Research Logistics Quarterly, 10, 4, 1963; and "Economic Considerations in Establishing an Overhaul Cycle for Ships," Naval Engineers Journal, 77, 6, 1964.


parameter estimates from secondary data sources, or the direct application of subjective information may be necessary.

In any case, efficient corrective action requires selectivity, and selectivity requires information about the nature of the problem to be handled. The procedure outlined here provides such information. It produces detailed diagnostics that can support the selective acquisition of information required for effective treatment of the multicollinearity problem.



SUMMARY

A point of view, as well as a collection of techniques, is advocated here. The techniques, in this case a series of diagnostics, can be formulated and illustrated explicitly. The spirit in which they are developed, however, is more difficult to convey. Given a point of view, techniques that support it may be replaced quite easily; the converse is seldom true. An effort will be made, therefore, to summarize our approach to multicollinearity and to contrast it with alternative views of the problem.

Multicollinearity, as defined here, is a statistical rather than a mathematical condition. As such, one thinks, and speaks, in terms of the problem's severity rather than of its existence or non-existence.

As viewed here, multicollinearity is a property of the independent variable set alone. No account whatever is taken of the extent, or even the existence, of dependence between Y and X. It is true, of course, that the effect on estimation and specification of interdependence in X, reflected by variances of estimated regression coefficients, also depends partly on the strength of dependence between Y and X. In order to treat the problem, however, it is important to distinguish between nature and effects, and to develop diagnostics based on the former. In our view an independent variable set X is not less multicollinear if related to one dependent variable than if related to another, even though its effects may be more serious in one case than in the other.

Of multicollinearity's effects on the structural integrity of estimated econometric models, estimation instability and structural misspecification, the latter, in our view, is the more serious. Sensitivity of parameter estimates to changes in specification, sample coverage, etc., is reflected at least partially in standard deviations of estimated regression coefficients. No indication at all exists, however, of the bias imparted to coefficient estimates by incorrectly omitting a relevant, yet multicollinear, variable from an independent variable set.

Historical approache s to multicollinearity are almost

unanimous in presuming the problem's solution to lie in deciding

which variable s to keep and which to drop from an independent

variable set . Thought that the gap between a model's information

requirements and data's information content can be reduced by

increas ing available information,as well as by reducing mode l

complexity , i s seldom cons idered .


A major aim of the present approach, on the other hand, is to provide sufficiently detailed insight into the location and pattern of interdependence among a set of independent variables that strategic additions of information become not only a theoretical possibility but also a practically feasible solution for the multicollinearity problem.

Selectivity, however, is emphasized. This is not a counsel of perfection. The purpose of regression analysis is to estimate the structure of a dependent variable Y's dependence on a pre-selected set of independent variables X, not to select an orthogonal independent variable set.

H. Theil, op. cit., p. 217; and J. Johnston, op. cit., p. 207, are notable exceptions.

Indeed, should a completely orthogonal set of economic variables appear in the literature one would suspect it to be either too small to explain properly a moderately complex dependent variable, or to have been chosen with internal orthogonality rather than relevance to the dependent variable in mind.



Structural integrity over an entire set, admittedly, requires both complete specification and internal orthogonality. One cannot obtain reliable estimates for an entire n dimensional structure, or distinguish between competing n dimensional hypotheses, with fewer than n significant dimensions of independent variation. Yet all variables are seldom equally important. Only one, or at most two or three, strategically important variables are ordinarily present in a regression equation. With complete specification and detailed insight into the location and pattern of interdependence in X, structural instability within the critical subset can be evaluated and, if necessary, corrected. Multicollinearity among non-critical variables can be tolerated. Should critical variables also be affected, additional information to provide coefficient estimates, either for strategic variables directly or for those members of the set on which they are principally dependent, is required. Detailed diagnostics for the pattern of interdependence that undermines the experimental quality of X permit such information to be developed and applied both frugally and effectively.

Insight into the pattern of interdependence that affects an independent variable set can be provided in many ways. The entire field of factor analysis, for example, is designed to handle such problems. Advantages of the measures proposed here are twofold. The first is pragmatic: while factor analysis involves extensive separate computations, the present set of measures relies entirely on transformations of statistics, such as the determinant |X'X| and elements of the inverse correlation matrix (X'X)^-1, that are generated routinely during standard regression computations.

The second is symmetry: questions of dependence and interdependence in regression analysis are handled in the same conceptual and statistical framework. Variables that are internal to a set X for one purpose are viewed as external to it for another. In this vein, tests of interdependence are approached as successive tests of each independent variable's dependence on other members of the set. The conceptual and computational apparatus of regression analysis, accordingly, is used to provide a quick and simple, yet serviceable, substitute for the factor analysis of an independent variable set.
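This symmetry can be made concrete. A standard identity, consistent with the text though not spelled out in it, says the diagonal element r^ii of (X'X)^-1 equals 1/(1 - R_i^2), where R_i^2 comes from regressing X_i on the other members of the set; each diagonal element is thus a dependence regression in disguise. A quick check (Python with numpy; data invented):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 120, 4
X = rng.standard_normal((N, n))
X[:, 2] += 0.6 * X[:, 0] - 0.5 * X[:, 3]   # X_3 depends on X_1 and X_4

R = np.corrcoef(X, rowvar=False)
Rinv = np.linalg.inv(R)

# Regress each X_i on the remaining members; R_i^2 recovers r^ii exactly.
for i in range(n):
    others = [k for k in range(n) if k != i]
    Z = np.column_stack([np.ones(N), X[:, others]])
    resid = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
    R2 = 1 - resid.var() / X[:, i].var()
    assert abs(Rinv[i, i] - 1 / (1 - R2)) < 1e-8
```

The dependent variable X_3 shows up as an inflated diagonal element, while the others stay near 1, which is exactly the stage-2 diagnostic at work.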

It would be pleasant to conclude on a note of triumph that the problem has been solved and that no further "revisits" are necessary. Such a feeling, clearly, would be misleading. Diagnosis, although a necessary first step, does not insure a cure. No miraculous "instant orthogonalization" can be offered.

We do, however, close on a note of optimism. The diagnostics described here offer the econometrician a place to begin. In combination with a spirit of selectivity in obtaining and applying additional information, multicollinearity may return from the realm of impossible to that of difficult, but tractable, econometric problems.