Fordham University, DigitalResearch@Fordham

Psychology Faculty Publications, Psychology

4-2009

Applications of Multiple Regression in Psychological Research

Razia Azen
David Budescu

Follow this and additional works at: https://fordham.bepress.com/psych_facultypubs
Part of the Psychology Commons

This Article is brought to you for free and open access by the Psychology at DigitalResearch@Fordham. It has been accepted for inclusion in Psychology Faculty Publications by an authorized administrator of DigitalResearch@Fordham. For more information, please contact [email protected].

Recommended Citation
Azen, Razia and Budescu, David, "Applications of Multiple Regression in Psychological Research" (2009). Psychology Faculty Publications. 86. https://fordham.bepress.com/psych_facultypubs/86

Applications of Multiple Regression in Psychological Research



[12:36 18/4/2009 5283-Millsap-Ch13.tex] Job No: 5283 Millsap: The SAGE Handbook of Quantitative Methods in Psychology Page: 283 283–310

PART IV

Data Analysis


13 Applications of Multiple Regression in Psychological Research

Razia Azen and David Budescu

THE REGRESSION MODEL

History and introduction

The regression model was conceptualized in the late nineteenth century by Sir Francis Galton, who was studying how characteristics are inherited from one generation to the next (e.g., Stanton, 2001; Stigler, 1997). Galton's goal was to model and predict the characteristics of offspring based on the characteristics of their parents. The term 'regression' came from the observation that extreme values (or outliers) in one generation produced offspring that were closer to the mean in the next generation; hence, 'regression to the mean' occurred (the original terminology used was regression to 'mediocrity'). Galton also recognized that previous generations (older than the parents) could influence the characteristics of the offspring as well, and this led him to conceptualize the multiple-regression model. His colleague, Karl Pearson, formalized the mathematics of regression models (e.g., Stanton, 2001).

The multiple-regression (MR) model involves one criterion (also referred to as response, predicted, outcome or dependent) variable, Y, and p predictor (also referred to as independent)1 variables, X1, X2, . . ., Xp. The MR model expresses Yi, the observed value of the criterion for the ith case, as a linear composite of the predictors and a residual term:

Yi = β0 + β1X1i + β2X2i + . . . + βpXpi + εi (1)

Here, X1i, X2i, . . . , Xpi are the values observed on the p predictors for the ith case, and the various βs (β0, β1, β2, . . . , βp) are the (unknown) regression coefficients associated with the various predictors. The first coefficient, β0, is an intercept term (or a coefficient associated with a predictor that takes on the value X0 = 1 for all observations).

If all the variables (response and predictors) are standardized to have zero mean and unit variance, the model can be re-expressed as:

Zy = ∑ β*i Zxi + e (2)

where Z refers to a standardized variable and the sum runs over the p predictors, i = 1, . . . , p. For obvious reasons, the β*i are referred to as standardized coefficients (by definition, β*0 = 0). It is easy to show that the standardized coefficients can be obtained by multiplying their raw counterparts by the ratio of the standard deviations of the respective predictor and the response:

β*i = βi (Sxi / Sy) (3)

This definition is not universally accepted. Some statisticians (e.g., Neter et al., 1996) prefer to define standardized coefficients as the values obtained by fitting the models after applying the, so-called, correlation transformation.2 Although the values of the coefficients are the same (the numerator and denominator are divided by the same constant), their standard errors are not! Furthermore, Bring (1994) challenged these (closely related) definitions and suggested that a more appropriate way of calculating the standardized coefficients should use the partial standard deviations of the predictors.
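Equation (3) is easy to verify numerically. The following sketch uses simulated data (the values and variable names are ours, purely for illustration): the raw coefficients, rescaled by Sxi/Sy, match the slopes obtained by regressing z-scores on z-scores.

```python
import numpy as np

# Hypothetical simulated data (illustration only): two predictors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(size=n)

# Raw (unstandardized) coefficients, with the constant predictor X0 = 1.
Xd = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(Xd, y, rcond=None)[0]

# Equation (3): beta*_i = b_i * S_xi / S_y.
beta_std = b[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Cross-check: regress standardized Y on standardized Xs
# (no intercept needed, since all z-scores have mean 0).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_z = np.linalg.lstsq(Z, zy, rcond=None)[0]
```

The two sets of slopes agree exactly (up to floating-point precision), which is the algebraic identity behind Equation (3).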

The predicted value of Yi, denoted Ŷi, represents the best guess (expected value) of the criterion given the observed combination of the p predictor values for the ith case, and is written as:

Ŷi = β0 + β1Xi1 + β2Xi2 + . . . + βpXip (4)

The residual, εi, is the difference between the observed and predicted values of Y associated with the ith case:

εi = Yi − Ŷi (5)

The ideal situation of perfect deterministic prediction implies εi = 0 for all cases. Otherwise, the residuals are assumed to be random variables with 0 mean and unknown variance (which is estimated in the process of model fitting). The residuals provide a measure of the goodness or accuracy of the model's predictions, as smaller (larger) residuals indicate more accurate (inaccurate) predictions. The residuals, εi, are sometimes labeled 'errors', but we find this terminology potentially misleading since it connotes that (some of) the variables are subject to measurement error. In fact, the statistical model assumes implicitly that all measurements are perfectly reliable, and the residuals are due to sampling variance, reflecting the fact that the relationships between Y and the various Xs are probabilistic in nature (e.g., in Galton's studies, not all boys born to fathers who are 180 cm tall, and mothers who are 164 cm tall, have the same height). The more complex structural equation models (SEM) combine the statistical MR model with measurement models that incorporate the imperfection of the measurement procedures for all the variables involved. These models are beyond the scope of this chapter, but are covered in Part V (e.g., Chapter 21) of this book.

This chapter covers the wide variety of MR applications in behavioral research. Specifically, we will discuss the measurement, sampling and statistical assumptions of the model, estimation of its parameters, interpretation of its various results, and evaluation of its fit. We illustrate the key results with several numerical examples.

Applications of the multiple-regression model

The regression model can be used in one of two general ways, referred to by some (e.g., Pedhazur, 1997) as explanation and prediction. The distinction between these approaches is akin to the distinction between confirmatory and exploratory analyses.

The explanatory/confirmatory use of the model seeks to confirm (or refute) certain theoretical expectations and predictions derived from a particular model (typically developed independently of the data at hand). The ultimate goal is to understand the specific process by which the criterion of interest is produced by the (theoretically determined) predictors. The explanatory regression model is used to confirm the predictions of the theory in the context of a properly formulated model, by using standard statistical tools. It can verify that those predictors that are specified by theory are, indeed, significant and others, which are considered irrelevant by the theory, are not. Similarly, it could test whether the predictor that is postulated by the theory to be the most important in predicting Y reproduces by itself the highest amount of variance, produces the most accurate predictions, and so forth.

Consider, for example, specific theories related to the process and variables that determine an individual's IQ. Some theories may stress the hereditary nature of IQ and others may highlight the environmental components. Thus, they would have different predictions about the significance and/or relative importance of various predictors in the model. An explanatory application of the model would estimate the model's parameters (e.g., regression coefficients), proceed to test the significance of the relevant predictors, and/or compare the quality and accuracy of the predictions of the various theories.

The predictive/exploratory analysis can also be guided, at least in part, by theory, but it is more open ended and flexible and, in particular, relies on the data to direct the analysis. Such an analysis seeks to identify the set of predictors that best predicts the outcome, regardless of whether the model is the 'correct' explanatory mechanism by which the outcome is produced. Of course, it is reassuring if the model makes sense theoretically, but this is not a must. For example, to predict IQ one would start with a large set of predictors that are theoretically viable and include the predictors that optimally predict the observed values of IQ in the regression model. Thus, the decision in this case is data driven and more exploratory in nature than the explanatory approach. While the model selected using a prediction approach should yield highly accurate predictions, the components of the model are not considered to be any more 'correct' than those for other potential predictors that would yield the same prediction accuracy.

To borrow an example from Pedhazur (1997), if one is trying to predict the weather, a purely predictive approach is concerned with the accuracy of the prediction regardless of whether the predictors are the true scientific 'causes' of the observed weather conditions. The explanatory approach, on the other hand, would require a model that can also provide a scientific explanation of how the predictors produce the observed weather conditions. Therefore, while the predictive approach may provide accurate predictions of the outcome, the explanatory approach also provides true knowledge about the processes that underlie and actually produce the outcome.

THE VARIOUS FORMS OF THE MODEL

Measurement level

Typically, all the variables (the response and predictors) are assumed to be quantitative in nature; that is, measured on scales that have well-defined and meaningful units (interval or higher). This assumption justifies the typical interpretation of the regression coefficients as conditional slopes: the expected change in the value of the response variable per unit change of the target predictor, conditional on holding the values of the other (p − 1) predictors fixed.

This assumption is also critical for one of the key properties of the MR model. MR is a compensatory model in the sense that the same response value can be predicted by multiple combinations of the predictors. In particular, high values on some predictors can compensate for low values on others. The implied trade-off between predictors is captured by their regression coefficients. Imagine, for example, that we can predict freshmen Grade Point Average (GPA) in a certain college from their two Scholastic Assessment Test (SAT) scores (quantitative and verbal), according to the following equation (in standardized scores): Predicted GPA = 1.0*SAT-Q + 0.5*SAT-V. One can make up for low SAT-Q (or SAT-V) scores by appropriately higher SAT-V (or SAT-Q) scores. More precisely, one can compensate for a unit disadvantage in SAT-V by a ½-point increase in SAT-Q, and a one-unit disadvantage in SAT-Q can be offset by a 2-point increase in SAT-V.

Table 13.1 Two possible ways to code a categorical predictor with C = 4 categories

Type of school       'Dummy' coding       'Effect' coding
                     D11   D12   D13      D21   D22   D23
Public school         1     0     0       −1    −1    −1
Private school        0     1     0       −1    −1     1
Parochial school      0     0     1       −1     1    −1
Other schools         0     0     0        1    −1    −1
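The compensatory trade-off in the hypothetical GPA equation above can be checked with a few lines of arithmetic (the equation is the made-up example from the text; the code simply evaluates it):

```python
# Hypothetical standardized prediction equation from the text:
# predicted GPA = 1.0*SAT-Q + 0.5*SAT-V.
def predicted_gpa(sat_q, sat_v):
    return 1.0 * sat_q + 0.5 * sat_v

base = predicted_gpa(0.0, 0.0)

# A one-unit drop in SAT-V is offset by a half-unit rise in SAT-Q ...
offset_v = predicted_gpa(0.5, -1.0)

# ... and a one-unit drop in SAT-Q by a two-unit rise in SAT-V.
offset_q = predicted_gpa(-1.0, 2.0)
```

All three combinations yield the same predicted GPA, which is exactly what 'compensatory' means here.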

Despite the measurement-level restriction of the model, it is possible to include lower-level (e.g., categorical) predictors in a MR equation through a series of appropriate transformations. Imagine that we wanted to consider 'type of high school attended' as a potential predictor of freshmen GPA, where C = 4 mutually exclusive and exhaustive categories make up the school types as shown in Table 13.1. We can define (C − 1) = 3 binary3 variables to fully represent the school-type classification. As the variables take on only two values, they are consistent with the measurement-level constraint of the MR model (recall that an interval scale has two free parameters: its origin and its unit). The choice of values for these variables is arbitrary (all possible assignments are linearly related), and while it is convenient to use the values 0,1 or the values −1,1, the only technical constraint is that the (C − 1) variables be linearly independent [see, for example, Chapter 8 in Cohen et al. (2003) for a good discussion of coding schemes for categorical variables].

The columns in Table 13.1 represent two alternative coding schemes (the first labeled 'Dummy coding' and the second labeled 'Effect coding'). Note that both schemes distinguish between the various types of schools (each school has a unique pattern of (C − 1) values). Although the two sets of variables are different (the dummy coding set uses the 'other' schools as the baseline, and the effect coding set compares all schools to the 'public' benchmark), and would yield different regression coefficients, their joint effect is identical! In fact, the test of the hypothesis that the 'type of school' is a useful predictor of freshmen GPA would be invariant across all possible definitions of the (C − 1) linearly independent variables.
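The invariance of the joint effect across coding schemes can be demonstrated with a small simulation. The data below are made up, and the effect-coding variant used (baseline rows coded −1 on all three indicators) is one common scheme, not necessarily the exact one in Table 13.1:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: school type (4 categories) and a GPA-like response.
school = rng.integers(0, 4, size=120)
gpa = np.array([2.8, 3.1, 3.0, 2.9])[school] + rng.normal(scale=0.3, size=120)

def r2(D):
    """R^2 from regressing gpa on an intercept plus the coding matrix D."""
    X = np.column_stack([np.ones(len(gpa)), D])
    yhat = X @ np.linalg.lstsq(X, gpa, rcond=None)[0]
    return 1 - ((gpa - yhat) ** 2).sum() / ((gpa - gpa.mean()) ** 2).sum()

# Dummy coding: three indicators, category 3 ('other') as the baseline.
dummy = np.column_stack([(school == k).astype(float) for k in range(3)])

# One common effect coding: same indicators, baseline rows coded -1.
effect = dummy.copy()
effect[school == 3] = -1.0

# Individual coefficients differ across the two schemes,
# but the joint fit (R^2) is identical.
```

Both coding matrices span the same column space once the intercept is included, which is why any set of (C − 1) linearly independent codes yields the same joint test.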

Derivative predictors

Typically, the p predictors are measured independently from each other by distinct instruments (various scales, different tests, multiple raters, etc.). In some cases, researchers supplement these predictors by additional measures that are derived by specific transformations and/or combinations of the measured variables. For example, polynomial-regression models include successive powers (quadratic, cubic, quartic, etc.) of some of the original predictors, and are designed to capture non-linear4 trends in the relationship between the response and the relevant predictors. Interactive-regression models include variables that are derived by multiplying two, or more, of the measured variables (or some simple transformations of these variables) in an attempt to capture the joint (i.e., above and beyond the additive) effects of the target variables. Computationally, these models do not require any special treatment as one can treat the product X1X2 or the quadratic term X1² as additional variables, but we mention several subtle interpretational issues.

The first issue is, simply, that the standard interpretation of regression coefficients as conditional slopes breaks down in these models. Clearly, in a polynomial model one cannot increase X by one unit while keeping X², X³, etc., fixed, nor vice-versa. Similarly, in an interactive model that includes product terms, say involving variables Xi and Xj, their respective coefficients will no longer reflect the 'net' effects of these variables, as these effects depend on, and cannot be separated from, the values of the other variables. For example, in the simplest interaction model, Y = β0 + β1X1 + β2X2 + β12X1X2 + ε, when X1 is increased by one unit, Y increases by (β1 + β12X2)! Of course, this applies to the coefficient of the product as well (i.e., one cannot change the product XiXj by a unit while keeping Xi and Xj fixed!).

The second issue is more technical. Successive powers of a variable (X1, X1², X1³) and products involving this variable (X1X2, X1X3) are highly correlated (e.g., Bradley and Srivastava, 1979; Budescu, 1980). As such, they suffer from many of the problems associated with collinearity in that, conceptually, highly correlated predictors indicate redundancy in the model and, statistically, highly correlated predictors affect estimation accuracy. Fortunately, it is possible to reduce some of these problems by properly re-scaling all the variables. Simply centering the variables, by subtracting their respective means, reduces correlations between these predictors considerably and facilitates interpretation. The intuition is quite simple: if the values are all non-negative (say X1 = income and X2 = years of education), all correlations between the original variables and their powers or their products are going to be extremely high. By subtracting the means, about half of the values of X1 and X2 (but not of their squares) become negative, so the correlations drop substantially. Centering does not affect the overall fit of the model, but facilitates interpretation. Thus, we strongly recommend that one always center (or, without any loss of generality, standardize) all variables in polynomial and/or interactive models.
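A quick simulation illustrates the point; the predictor and its range are made-up values, and any non-negative variable would behave similarly:

```python
import numpy as np

rng = np.random.default_rng(2)
# A non-negative predictor, e.g. years of education (hypothetical values).
x = rng.uniform(8, 20, size=1000)

# Correlation between the raw variable and its square is extreme ...
r_raw = np.corrcoef(x, x ** 2)[0, 1]

# ... but drops sharply once the variable is centered.
xc = x - x.mean()
r_centered = np.corrcoef(xc, xc ** 2)[0, 1]
```

For an (approximately) symmetric predictor the correlation between the centered variable and its square is near zero, while the raw correlation is close to 1.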

The last point relates to the distinction we made earlier between confirmatory and exploratory designs. In exploratory work all predictors are treated as 'exchangeable' (or 'symmetric') in the sense that there is no prior ordering among them, while in confirmatory work the theory induces a particular hierarchical structure. Polynomial and interactive models are, essentially, always hierarchical. Parsimony dictates that the basic form of the predictors be part of the model before considering higher-order and/or interactive effects. In fact, when the predictors are continuous, interpretation of such higher-order and/or interactive terms when their lower-order (or component) variables are excluded from the model is meaningless and, potentially, misleading. When a qualitative (or discrete) variable is used as a predictor (e.g., gender), its interaction with a continuous predictor is used to model the slopes (of the continuous predictor) separately for each level of the qualitative variable. In such models, the interaction term may be interpretable without necessarily including the qualitative predictor (that captures differences in the intercepts) by itself, though we would contend that in most research problems it is more meaningful to retain lower-level terms in the model when their interaction is included. Finally, there is no consensus in the literature on the question of precedence for single-variable polynomial terms and interactive terms (for details, see Aiken and West, 1991; Cortina, 1993; Ganzach, 1997; Lubinski and Humphreys, 1990), but we tend to favor the view espoused in the discussion by Ganzach (1997) that suggests that quadratic terms should precede interactive terms.

Sampling designs

The classical regression model assumes that the values of the predictors are 'fixed'; that is, chosen by various design considerations (including, possibly, plain convenience) rather than sampled randomly. Random samples (of equal, or unequal, size) of the response variable are then obtained for each relevant combination of the p predictors. Thus, the data consist of a (possibly large) collection of distributions of Y, conditional on particular predetermined combinations of X1, . . ., Xp. The Xs (predictors) are not random variables, and their distributions in the sample are not necessarily expected to match their underlying distributions in the population. Thus, no distributional assumptions are made about the predictors. The unknown parameters to be estimated in the model are the p regression coefficients and the variance of the residuals.

Alternatively, in the 'random' design the researcher randomly samples cases from the population of interest, where a 'case' consists of a vector of all p predictors and the response variable. Thus, we observe a joint distribution of the (p + 1) random variables, and it makes sense to make assumptions about its nature (typically, that it is normal). The unknown parameters to be estimated are the (p + 1)(p + 2)/2 entries in the variance-covariance matrix of the variables (which determine the regression coefficients), and the variance of the residuals.5

Consider again the hypothetical freshmen GPA prediction problem described earlier. The researcher could approach this problem by randomly selecting a fixed number (say n = 30) of men and of women for each of 16 predetermined combinations of SAT-V and SAT-Q; say, all combinations of SAT-V and of SAT-Q from 450 to 750 in increments of 100 points [(450,450); (450,550); . . .; (750,750)] and record their freshmen GPA, or simply take one random sample of about 1000 students and record their gender, SAT scores, and GPA. The former is a fixed design and the latter is a random one. In both cases one would have the same variables and, subject to minor subtle differences, be able to address the same questions (e.g., Sampson, 1974). In general, the results of a fixed design can be generalized only to the values of X included in the study, while the results of a random design can be generalized to the entire population of X values that is represented (by a random sample of values) in the study.

In both fixed and random designs it is customary to assume that all observations in the sample(s) are mutually independent. There are two noticeable exceptions to this assumption. In multistage-sampling designs, observations at the lower levels that are nested within the same higher-order clusters are, typically, positively correlated with each other, reflecting geographical and socio-economic proximity and other sources of commonality. Consider a national sample of 13-year-old students, such as in the National Assessment of Educational Progress (NAEP), that is obtained by randomly sampling: (1) school districts in various geographical regions; (2) schools within districts; (3) classrooms within schools; and finally (4) students within classrooms. The results of the students selected for testing cannot be treated as statistically independent since some of these students share many characteristics (definitely more than they share with other students in classrooms in other schools and districts in the national sample). Hierarchical-linear models (HLM) (e.g., Bryk and Raudenbush, 1992), which are the topic of Chapter 15, are extensions of the standard MR model that can handle these dependencies efficiently.

Another extension of MR, common in business, economics and many natural sciences, involves applications in which key observations constitute time series (e.g., the daily closing value of a stock over one year, or the amount of annual precipitation at a particular location over the last 200 years), and the residuals of the various observations are serially correlated (or autocorrelated). Time series are relatively rare, but not unheard of, in behavioral sciences (e.g., sequences of interactions within dyads of participants such as spouses, or players involved in a series of Prisoner's Dilemma games; Budescu, 1985). Analysis of time series is beyond the scope of our chapter, but it is discussed in Chapter 26 of this book, and the classical book by Box and Jenkins (1976) is a good primary source for this topic, with special emphasis on the variety of time-dependent processes.

STATISTICAL ASSUMPTIONS AND ESTIMATION

The parameters of the MR model can be estimated either by least-squares (LS) or maximum-likelihood (ML) procedures. In this section we briefly review the key results (details can be found in such standard textbooks as Draper and Smith, 1998; Graybill, 1976; Neter et al., 1996).


Consider the standard (fixed) MR model first. Let y be a vector (n rows by 1 column) including the values of the response variable for all n observations in the sample, and let X be a matrix (n rows by p + 1 columns) including the values of the fixed p predictors (including a constant predictor, X0). LS estimation requires a minimal set of assumptions:

Assumption 1

The response is a linear function of the predictors. Thus, the model can be re-written in matrix notation as:

y = Xβ + ε (6)

where β is a vector (p + 1 rows by 1 column) consisting of the (unknown) regression coefficients and ε is a vector (n rows by 1 column) including the residuals of all n observations in the sample.

Assumption 2

The residuals are independent and identically distributed random variables.

Assumption 3

The residuals have 0 means and equal variances: E(ε) = 0, Σ(ε) = σ²I. The assumption that all residuals have equal variance is referred to as homoskedasticity, and it implies that all conditional distributions of the response (i.e., for each possible combination of the predictors) have equal variances.

The LS method seeks estimates of the regression coefficients, b, such that the sum of the squared residuals (ε′ε) is minimized. Under Assumptions 1–3, it is possible to show that b = (X′X)⁻¹X′y, and the famous Gauss–Markov theorem shows that these coefficients are BLUE: Best (meaning with the smallest variance) Linear (meaning linear functions of the ys) Unbiased Estimators. The other parameter, σ², is estimated by the mean square residual or the mean square error (MSE):

MSE = s² = ∑i (Yi − Ŷi)² / (n − p − 1) (7)
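These results translate directly into code. The sketch below uses simulated data and `numpy.linalg` in place of a statistics package; the variable names and true coefficients are ours:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 2
# Design matrix with the constant predictor X0 = 1 in the first column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# LS estimates from the normal equations: b = (X'X)^{-1} X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Equation (7): MSE = sum of squared residuals / (n - p - 1).
resid = y - X @ b
mse = (resid ** 2).sum() / (n - p - 1)
```

In practice one would use a numerically stabler routine (e.g., `numpy.linalg.lstsq`, which factorizes X rather than forming X′X), but the normal-equations form makes the algebra explicit.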

To derive ML estimates (i.e., find those parameter values that, conditional upon the distributional assumptions, are the most likely to have generated the observed data), we need one additional assumption:

Assumption 4

The residuals are normally distributed: ε ∼ N(0, σ²I). The ML estimates of the regression coefficients are identical to the LS estimates, but σ² is estimated by the regular (biased) variance of the residuals in the sample, s² = ∑i (Yi − Ŷi)²/n. Typically, the LS (unbiased) estimate is used.

Under the random model, Assumptions 1, 3, and 4 are replaced by Assumption 5.

Assumption 5

The response and the p predictors have a joint (p + 1)-variate normal distribution with a positive semi-definite covariance matrix, Σ.

The ML estimate of Σ, based on the sample covariance matrix, S, is easily obtained (e.g., Johnson and Wichern, 2002; Timm, 2002). Taking advantage of the standard results for conditional multivariate normal distributions, we obtain the desired vector of estimates, b, from E(Y|X1, . . . , Xp). Computationally, the results are identical to the LS and ML estimates for the fixed case. Note, however, that we did not assume a linear model a priori (Assumption 1). The linearity follows directly from the properties of the multivariate normal distribution. This provides another interpretation of the MR model as the collection of all conditional expectations in the space defined by the p predictors.

The assumptions that residuals are normally distributed and homoskedastic are clearly unrealistic when the response variable is dichotomous (e.g., success/failure, or below/above a certain threshold) or categorical (e.g., color of a product). Special regression models have been developed for these cases. They preserve the basic form of the MR model but employ different distributional assumptions that are more appropriate for these cases, and are consistent with their constraints. Their key feature is assuming a probabilistic model (typically Normal or Logistic) relating the predictors to the binary/categorical response. These models are discussed in Chapter 14 of this book, and Agresti (1996) and Neter et al. (1996) are good sources for further details on these probit- and logistic-regression models.

Beyond parameter estimation

Once the regression coefficients are estimated, it is relatively simple to estimate:

• Their variances and covariances: Var(b) = s²(X′X)⁻¹.

• The predicted values (that have the same mean as the actual response values): Ŷ = Xb.

• The residuals (that have a mean of 0 and are uncorrelated with the predicted values): e = Y − Xb.

Finally, it is possible to show that the total variation of the response variable, Y, as measured by the sum of squares of the observed values (SST) around their mean, can be decomposed into two orthogonal components associated with: (1) the fitted regression model, measured by the sum of squares of the predicted values (SSR) around the mean response (Ȳ); and (2) the residuals (SSE), measured by the sum of squares of the residuals. The same decomposition holds for the degrees of freedom, and it is customary to represent these components in an analysis of variance (ANOVA) table as shown in Table 13.2.
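This decomposition can be verified numerically. The sketch below fits an LS regression to simulated data (the data and coefficients are ours) and checks that the two components add up to the total:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

sst = ((y - y.mean()) ** 2).sum()      # total variation
ssr = ((yhat - y.mean()) ** 2).sum()   # fitted-model component
sse = ((y - yhat) ** 2).sum()          # residual component

# The components are orthogonal, so SST = SSR + SSE, and
# R^2 = SSR/SST equals the squared correlation of y with yhat.
r2 = ssr / sst
```

The orthogonality of the predicted values and the residuals is what makes the sums of squares (and the degrees of freedom) additive.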

Note (from Table 13.2) that SST and SSR are similar in nature and calculated around the same mean. We take advantage of this similarity to calculate R², which is the standard measure of goodness of fit of the model:

R² = SSR/SST = 1 − SSE/SST (8)

R² measures what proportion of the total variance of Y is reproduced by the model. It is bounded from below by 0 (when the predictors cannot predict Y), and from above by 1 (when prediction is perfect). It is possible to show that R² is also the squared correlation between the observed and predicted values of Y. It is 0 when Y and Ŷ are uncorrelated, and it is 1 when Y and Ŷ are perfectly correlated. In the context of the fixed model, R² is referred to as 'the coefficient of determination', and in the random model it is called 'the squared multiple correlation'. Interestingly, the sample R² is a biased estimate of its corresponding population value, ρ². Olkin and Pratt (1958) present an approximate unbiased estimate of ρ² for the multivariate-normal case with n observations and p predictors (see also Alf and Graf, 2002):

ρ̂² = R² − [(p − 2)/(n − p − 1)](1 − R²) − [2(n − 3)/((n − p − 1)(n − p + 1))](1 − R²)² (9)
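Equation (9) translates directly into a short function (the function name is ours):

```python
def adjusted_rho_squared(r2, n, p):
    """Olkin-Pratt approximate unbiased estimate of rho^2 (Equation 9)."""
    shrink1 = (p - 2) / (n - p - 1) * (1 - r2)
    shrink2 = 2 * (n - 3) / ((n - p - 1) * (n - p + 1)) * (1 - r2) ** 2
    return r2 - shrink1 - shrink2

# With a perfect fit there is nothing to shrink; otherwise the
# estimate is pulled below the sample R^2 (here n = 50, p = 3).
exact = adjusted_rho_squared(1.0, 50, 3)
shrunk = adjusted_rho_squared(0.5, 50, 3)
```

Both correction terms vanish as n grows, so the adjustment matters mainly in small samples with many predictors.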

Alternatively, computer intensive proceduressuch as the bootstrap can also be usedto estimate the population value of ρ2

Table 13.2 ANOVA table

Source Sums of squares Degrees of freedom Mean square

Model predictions SSR = ∑i

(Yi − Y )2 p SSR/p

Residuals SSE = ∑i

(Yi − Yi )2 n − p − 1 SSE/(n − p − 1)

Total SST = ∑i

(Yi − Y )2 n − 1 SST/(n − 1)


for non-normal populations. The bootstrap requires randomly drawing n observations from the original sample, with replacement. The bootstrap sample contains n observations but it is not identical to the original sample due to the random sampling process (i.e., in the bootstrap sample some of the original observations appear more than once and some not at all). The value of R2 can then be estimated by fitting the regression model to the bootstrap sample. This process is repeated a large number of times, resulting in a large number of estimates of R2. These R2 values are then averaged to obtain a bootstrap estimate of the population value. Detailed information on this method is available from sources such as Diaconis and Efron (1983), Mooney and Duval (1993), and Stine (1989), and it is also discussed in Chapter 16 of this book.
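The resampling loop described above can be sketched in a few lines. This is a hedged illustration on simulated data (function names are ours); in practice one would apply it to the actual sample at hand:

```python
import numpy as np

def r_squared(D, y):
    """R^2 of an OLS fit of y on design matrix D (D already includes the intercept column)."""
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ b
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def bootstrap_r2(X, y, n_boot=1000, seed=0):
    """Average the R^2 values from n_boot bootstrap resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    D = np.column_stack([np.ones(n), X])
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # draw n rows with replacement
        estimates.append(r_squared(D[idx], y[idx]))
    return float(np.mean(estimates))

# Hypothetical simulated data:
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = X @ np.array([0.6, 0.0, -0.4]) + rng.normal(size=150)
estimate = bootstrap_r2(X, y, n_boot=500)
```

Each resample refits the model from scratch, so the spread of the per-resample R2 values can also serve as a rough standard error.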

SPECIFIC APPLICATIONS

Throughout the remainder of this chapter, we will use a real data set to demonstrate various concepts and methods, so we introduce it here. The data set is discussed in detail by Suh et al. (1998), who obtained normative beliefs and emotional experience as well as satisfaction-with-life judgments from thousands of college students in over 40 countries. Variables were mostly self-reported and included some demographic measures (e.g., sex); emotional experience, measured using positive and negative affects as well as an affect balance score; subjective global life satisfaction; domain-specific life satisfaction; and values (or norms) for life satisfaction.

Multiple regression as a confirmatory model: comparing competing (nested) models

The most common way of using MR as a confirmatory tool is to test the significance of one, several, or all of the regression coefficients that are predicted to be important (or, at least, relevant) by the theory being tested. Any test of parameters amounts to a comparison of two models, one that includes the parameters in question (and is referred to as the ‘full’ model), and one that excludes them (referred to as the ‘reduced’ or ‘restricted’ model). The standard tests require that the models be ‘nested’; that is, that the reduced model contains a strict subset of the variables in the full model.

For example, the full model might contain five predictors, X1 − X5. Suppose we want to test the prediction (presumably derived from our theory) that X1, X2, and X5 are the critical predictors and that, in their presence, X3 and X4 are not significant, or do not contribute to the prediction of Y. Thus, we wish to test H0: β3 = β4 = 0 | X1, X2, X5 (we use the ‘conditioning’ notation to remind us of the other variables in the model). If this hypothesis holds, the reduced model contains only the predictors X1, X2, and X5 (because it restricts the coefficients of X3 and X4, which were part of the full model, to be zero). The model-comparison procedure tests whether the predictors X3 and X4, jointly, contribute (or do not contribute) to the explanatory and predictive power of the model. On the one hand, if the full model fits the data significantly better than the reduced model, this provides evidence for the contribution of X3 and X4 and indicates that their inclusion is advantageous. On the other hand, if the full model does not fit the data significantly better than the reduced one, this provides evidence that X3 and X4 are not necessary, as predicted by the null hypothesis in this example.

Imagine that an alternative theory postulates that only X1, X3, and X4 are of interest, with X1 being a key variable. These three variables make up the ‘full’ model. Suppose we want to test, again, that X3 and X4 are not significant in the presence of X1. Thus, we wish to test H0: β3 = β4 = 0 | X1. If this hypothesis holds, the reduced model contains only X1 (because it restricts the coefficients of X3 and X4, which were part of the full model, to be zero). The model-comparison procedure tests whether X3 and X4, jointly, contribute (or do not contribute) to the explanatory and predictive power of the model. The key point is that although in both cases we test


parameters associated with the same variables (X3 and X4), we are not comparing the same models. The hypothesis being tested is not the same in a substantive sense, and the results of the statistical tests are not identical. Probably the most important lesson for the researcher is that a statistical test provides the machinery for comparing competing models, and its results are meaningful only when they are interpreted in the context of these models. Conclusions of the type ‘X3 is significant (because β3 ≠ 0)’ or ‘X4 is not significant (because β4 = 0)’ without specifying the nature of the models being compared are meaningless, and potentially misleading.

The general hypotheses tested by the model comparison procedure are:

• H0: In the full model, the βs of the predictors in the subset being tested are all zero. The restricted model is better.

• H1: In the full model, the βs of the predictors in the subset being tested are not all zero. The full model is better.

More formally, if the full model contains a total of p predictors, and the predictors are arranged such that the subset to be tested consists of predictors q + 1 through p, then the full model is:

Y = β0 + β1X1 + … + βqXq + βq+1Xq+1 + … + βpXp + ε (10)

and the hypotheses are:

H0: βq+1 = … = βp = 0
H1: βq+1, …, βp are not all zero.

The statistical test compares the fit of the two competing models based on their SSE values (as defined in Table 13.2) using the test statistic:

F = [(SSER − SSEF)/(dfR − dfF)] / [SSEF/dfF] ∼ F(dfR−dfF, dfF) (11)

where SSER and SSEF are the residual (or error) sums of squares for the reduced and full models, respectively, and dfR and dfF are their respective degrees of freedom. If the null hypothesis is true, the F ratio follows the specified F distribution.

This test statistic compares the difference in lack of fit between the two models relative to the lack of fit in the full model, while taking the degrees of freedom into account. On the one hand, if the null hypothesis is true, the restriction on the relevant βs should not significantly impact the SSE, so the numerator, and the test statistic, will be relatively small. On the other hand, if the null hypothesis is false, the full model should perform significantly better (i.e., obtain a substantially smaller SSE) than the reduced model, so the difference between SSER and SSEF will be relatively large, leading to a rejection of the null hypothesis.

The test statistic can be written in terms of R2 values (rather than SSEs) as:

F = [(R2F − R2R)/(dfR − dfF)] / [(1 − R2F)/dfF] = dfF(R2F − R2R) / [(dfR − dfF)(1 − R2F)] ∼ F(dfR−dfF, dfF) (12)

where it is perhaps clearer that the test statistic compares the difference in the fit of the two models relative to the lack of fit of the full model (again taking degrees of freedom into account).6
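The R2 form of the comparison can be sketched with SciPy supplying the p-value. The function name is ours, and dfF = n − p − 1 and dfR = n − q − 1 as above:

```python
from scipy.stats import f as f_dist

def nested_f_test(r2_full, r2_reduced, df_full, df_reduced):
    """Compare nested models via the R^2 form of the F statistic (Equation 12)."""
    num_df = df_reduced - df_full                  # number of restricted coefficients
    F = (df_full * (r2_full - r2_reduced)) / (num_df * (1 - r2_full))
    return F, f_dist.sf(F, num_df, df_full)

# The rounded R^2 values from Example 1 below (models 5 vs. 1) approximately
# reproduce the tabled F(4, 408) = 14.87:
F, p = nested_f_test(0.572, 0.510, 408, 412)
```

The small discrepancy from the tabled value reflects only the rounding of the R2 inputs, not the formula.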

The tests for comparing the fit of the models (and, practically, all other tests in MR) are invariant under linear transformations of the predictors and/or the response variable. If, for example, the price of Bordeaux wines (Y) is predicted from the temperatures and amounts of precipitation in the fall and the spring of the year the grapes were picked (Ashenfelter et al., 1995), the tests would not be affected by whether the wines are priced in US$ or in Euros, the temperatures are in Celsius or Fahrenheit, precipitation is measured in inches or centimeters, and so on (note, however, that the regression coefficients would vary as a function of the units used).


Most tests associated with MR are special cases of this model comparison approach. For example, to test the significance of any single predictor in the presence of the other p − 1 predictors, the F-test above is equivalent to the t-test of β for that predictor (i.e., t2 = F) that is printed by all statistical packages. Further, the omnibus F-test of the full model’s R2 (or overall fit), which is also printed by all statistical packages, is equivalent to testing the full model (including all p predictors) against a reduced model that includes only the intercept.7

One obvious and common use of this test occurs when a set of (C − 1) predictors represents a qualitative predictor, such as the ‘type of school’ variable described in the section ‘Measurement level’ and Table 13.1, above. All (C − 1) predictors (e.g., D11, D12, D13) are considered jointly, and to test whether school type is a significant predictor one would need to compare a model that includes all (C − 1) binary variables to a model that restricts their associated (C − 1) coefficients to be zero. The result would be invariant across the choice of binary variables (e.g., the D1i or the D2i set in Table 13.1). The individual t-tests of the coefficients in the context of the full model are more difficult to interpret and depend on the coding scheme. For example, the test of the single coefficient associated with D11 (see Table 13.1) in the ‘Measurement level’ section tests the contribution of the distinction between public and ‘other’ schools in the presence of all the other predictors, including the distinctions between private and ‘other’ schools (D12), and between parochial and ‘other’ schools (D13).

For continuous-predictor variables, common uses of this procedure involve situations in which a set of variables is included in the model in a hierarchical fashion and the order of inclusion reflects theoretical (or statistical ‘control’) considerations. For example, if certain demographic variables are known to affect the outcome, one could compare a reduced model that contains this set of variables only, to a full model that contains an additional set of variables (in addition to the control set). If the null hypothesis is rejected in this case, then the variables in the additional set significantly contribute to prediction of the criterion over and above the variance already accounted for by the control set.

To conduct this test using statistical software, one simply needs to fit both the full and reduced models, obtain their SSE or R2 values, and use either of the formulae above to compare the models. The major statistical software programs (e.g., SAS, SPSS) also have the capability to provide the test statistic and p-value of this F-test in the output. Examples 1 and 2 (below) illustrate applications of this test.

Example 1: a confirmatory application

Using the Suh et al. (1998) data, we demonstrate the prediction of global life satisfaction using the American sample only (n = 420). We model global life satisfaction (the criterion) using domain-specific life-satisfaction variables (namely, satisfaction with one’s health, finances, family, nation, housing, self, and food) as predictors, and then test whether the addition of variables measuring values for life-satisfaction domains (namely, values on overall life satisfaction, money, humility, love, and happiness) or variables measuring affect (namely, the frequency and intensity of experiencing positive and negative affects) contributes significantly to the model.

The results of the analysis are presented in Table 13.3. While the R2 values for all models are shown, we focus only on those models that include the domain-satisfaction (DS) variables. The model containing the seven DS variables as predictors results in an R2 value of .510. When the five values measures (V) are added as predictors, the R2 increases to .513, indicating an R2 change (ΔR2) of .003 from the base model. The F-test of this change indicates that it is not significant (F5,407 = 0.47, p > .05) and, therefore, the addition of the values measures (V) does not contribute to the prediction of global life satisfaction over and above the initial contribution of the DS variables. On the other hand, adding the four affect


measures (A) to the model that includes the DS variables increases the R2 to .572. This increase (ΔR2 = .062) is indeed significant (F4,408 = 14.87, p < .05), indicating that the affect measures contribute significantly to predicting global life satisfaction over and above the initial contribution of the DS variables. Not surprisingly, the model that contains the DS variables and both the affect and values measures also performs significantly better than the model that contains only the DS variables, but this is clearly due to the effect of the affect measures on predicting global life satisfaction. In fact, the difference in fit between models 5 and 7 (ΔR2 = .002) is not significant. In conclusion, once we use the DS variables to predict global life satisfaction, the addition of affect measures significantly improves or contributes to the fit of the model whereas the addition of values measures does not, so we favor model 5.

Example 2: another confirmatory application

In this example, we predict global life satisfaction from the frequency of negative and positive affect (NA and PA, for short) and examine the potential moderating effects of respondent’s sex. The results of the analyses for data from the United States are presented in Table 13.4. The two affect frequency measures (model 1) together account for about 42% of the total variability in global life satisfaction (R2 = .416). Adding gender as a predictor (model 2, which would allow the

Table 13.3 Example 1. Predicting global life satisfaction for the USA sample (n = 420)

Variables in the model            dfM, dfE^a   R2    ΔR2    F for ΔR2   p-value for ΔR2
1. Domain satisfaction (DS) only   7, 412      .51    –
2. Values (V) only                 5, 414      .03    –
3. Affect (A) only                 4, 415      .43    –
4. DS + V                         12, 407      .51   .003     0.47      .798
5. DS + A                         11, 408      .57   .062    14.87      < .0001
6. A + V                           9, 410      .43    –
7. DS + V + A                     16, 403      .57   .064     6.74      < .0001

^a dfM, model (regression) degrees of freedom = p; dfE, error (residual) degrees of freedom = n − p − 1.

Table 13.4 Example 2. Predicting global life satisfaction for the USA sample (n = 438)

Model                             Variables       B      dfM, dfE^a   R2    ΔR2    F for ΔR2   p-value for ΔR2
1. Common intercept and           PA              .49    2, 435       .42    –
   common slope                   NA             −.28
2. Gender-specific intercepts     PA              .51    3, 434       .42   .005    3.56       .060
   and common slope               NA             −.26
                                  Gender         −.07
3. Common intercept and           PA              .62    4, 433       .42   .002    0.83       .437
   gender-specific slopes         NA             −.15
                                  PA × gender    −.14
                                  NA × gender    −.13
4. Gender-specific intercepts     PA              .63    5, 432       .42   .007    1.65       .178
   and slopes                     NA             −.15
                                  Gender         −.07
                                  PA × gender    −.13
                                  NA × gender    −.12

^a dfM, model (regression) degrees of freedom = p; dfE, error (residual) degrees of freedom = n − p − 1.


regression intercepts to vary depending on one’s sex) increases R2 to .421, but this is not a significant improvement over the fit of the initial model (F1,434 = 3.56, p > .05). The same is true for models containing the interaction of gender with the affect frequency measures (which would allow the regression slopes to vary depending on one’s sex). Therefore, the prediction of global life satisfaction from affect frequency does not depend on the sex of the individual.

MR as an exploratory tool: model selection and measures of model (mis)fit

In this section, we cover some of the basic issues involved in using MR as an exploratory tool designed to identify the ‘best’ set of predictors in the absence of a specific theory. Typically, we have a very large number of potential predictors that are inter-correlated, and we believe that we can find a much smaller subset that would be useful in predicting Y. To fix ideas, consider the standard methodology that was used to develop some of the most widely used vocational interest inventories (e.g., Anastasi, 1982): A large number of items is administered to a large number of respondents who work in a particular field (say, physicians) and who report various levels of satisfaction (and possibly success) in their chosen profession. A scale of interest in medicine is constructed by identifying those items that best predict the level of satisfaction of the various physicians. The key point is that the items are chosen based solely on their predictive efficacy, and not on any theoretical considerations.

The challenge of the techniques reviewed in this section is to balance the two effects of adding more variables to the model: better fit and higher complexity. Although there is general agreement that the most parsimonious solution is one that achieves the best fit relative to the model’s complexity, there are many ways of quantifying fit and complexity as well as accounting for their trade-off. Once a measure of model fit is selected, it can be used as a selection tool that allows one to compare many competing models (involving different subsets of predictors) and choose the best ones.

The most common measure of model fit is R2. We saw in the last section that R2 was a key component in the model comparison tests, but it can also be used as a descriptive measure to select models. However, both the sample size and the number of predictors in the model affect its value. As a simplistic example, consider a simple regression (p = 1) and a sample of n = 2. In this case, the scatter-plot of X and Y contains just two points, which can be joined by a straight line and produce a seemingly perfect relationship where R2 = 1, regardless of the true relation between X and Y in the population. The same pattern holds for p = 2 and n = 3, p = 3 and n = 4, and so forth. In fact, when the value of ρ2 in the population is 0, the expected value of the sample R2 is p/(n − 1), which is greater than zero (Pedhazur, 1997). As we showed earlier, the sample R2 value is biased, and always overestimates its population value, ρ2. The bias is especially high when the sample size is small relative to the number of predictors. Thus, there is a danger of serious over-fit in cases with many predictors and small samples (see Birnbaum’s satire, Sue Doe Nihm, 1976). The sample size needs to be substantially larger than the number of predictors for the sample R2 to provide a good estimate of its population value. To correct for the model’s complexity (number of predictors) relative to the sample size, the adjusted R2 measure, R2adj, is used:

R2adj = 1 − [(n − 1)/(n − p − 1)](1 − R2) = 1 − [(n − 1)/(n − p − 1)](SSE/SST) (13)

Note that the adjustment uses the Sums of Squares (SS) divided by their degrees of freedom (i.e., the Mean Square, MS). For any fixed sample size, n, the adjusted R2 will typically be smaller than the value of R2 by a factor that is directly related to the number of predictors in the model. Therefore, if two


models fit to the same data set produce the same R2 values, but one has more predictors than the other, the model with fewer predictors has a larger R2adj value and is considered to be superior since it achieves the same fit (and accuracy of predictions) with fewer predictors.

Another descriptive measure of model fit that accounts for the number of predictors is Mallows’ Cp criterion (Mallows, 1973). If a total of p predictors are available, then for a subset model containing k predictors Mallows’ criterion is:

Ck = SSEk/MSE − [n − 2(k + 1)] (14)

where SSEk is the error sum of squares for the subset model and MSE is the mean square error for the full model containing all p predictors. Mallows’ criterion is concerned with identifying an unbiased model. If there is no bias in the predicted values of the model (i.e., E(Ŷi) = μi), the expected value of Ck is approximately k + 1 (Mallows, 1973; Neter et al., 1996). Thus, for the full (p-predictor) model, Cp = p + 1, and the fit of any (k-predictor) subset model is evaluated by comparing its Ck value to k + 1, where a small difference indicates good fit (i.e., no bias). Biased models result in Ck values greater than k, so we typically seek models with Ck values that are both small and close to k.

A measure of fit that is based on the ML (or information) function of a regression model with p predictors is Akaike’s information criterion (AIC), developed by Akaike (1970, 1973):

AIC = n ln(SSE/n) + 2(p + 1) (15)

where SSE is the error sum of squares for the model in question, and smaller values of AIC indicate better fit (AIC is a measure of loss of information in fitting the model, which we wish to minimize). All other things being equal, as SSE (and its logarithm) decreases, indicating better fit, AIC decreases. As with other measures of fit, AIC increases as the model’s complexity (p) increases.

Schwarz’s Bayesian Criterion (Schwarz, 1978), also known as the Bayesian information criterion (BIC), is a slight variation on AIC with a more severe penalty for the number of predictors:

BIC = n ln(SSE/n) + (p + 1) ln(n) (16)

In general, BIC penalizes the fit more severely than AIC for the number of predictors in a model. Therefore, when several competing models are fit to the same data, the BIC measure is likely to select a model with fewer predictors than the AIC measure.
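The fit measures in Equations (13)–(16) can be computed side by side from a model's SSE; a sketch (the function name is ours):

```python
import math

def fit_measures(sse, sst, n, p, mse_full=None):
    """Adjusted R^2, AIC, BIC, and (optionally) Mallows' C_k for a p-predictor model.

    mse_full is the mean square error of the full model containing all candidate
    predictors; C_k is returned only when it is supplied.
    """
    r2 = 1 - sse / sst
    out = {
        "r2": r2,
        "r2_adj": 1 - (n - 1) / (n - p - 1) * (1 - r2),        # Equation 13
        "aic": n * math.log(sse / n) + 2 * (p + 1),            # Equation 15
        "bic": n * math.log(sse / n) + (p + 1) * math.log(n),  # Equation 16
    }
    if mse_full is not None:
        out["ck"] = sse / mse_full - (n - 2 * (p + 1))         # Equation 14
    return out
```

Because BIC's penalty, (p + 1)ln(n), exceeds AIC's, 2(p + 1), whenever n > e^2 ≈ 7.4, BIC favors smaller models in all but tiny samples.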

The measures discussed above do not provide an exhaustive list (see, for example, Miller, 1990; Burnham and Anderson, 2002) but are arguably the most commonly encountered measures of fit in the social sciences. The various measures can be calculated for each of the feasible 2^p distinct subset models that can be generated from p predictors, and they all account in one way or another for the number of predictors in each subset model. This approach is referred to as ‘all subsets regression’, and can be used to identify the ‘best’ model (by whatever measure). There are two approaches for identifying the ‘best’ model: it can be done by conditioning on level of complexity (i.e., considering the single best predictor, the best pair, the best triple, etc.) and choosing the best subset model within a given level of complexity; alternatively, all models can simply be rank ordered regardless of complexity to choose the best ones.8 The final selection is typically based on simple numerical and/or visual comparisons and, typically, does not involve significance tests.
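An ‘all subsets’ search is direct to sketch for moderate p. This hedged illustration uses simulated data and our own function names; for the chapter's 17 predictors, the same loop would simply run over the 2^17 subsets:

```python
from itertools import combinations
import numpy as np

def all_subsets(X, y, names):
    """Rank every non-empty subset of predictors by adjusted R^2 (Equation 13)."""
    n, p = X.shape
    sst = np.sum((y - y.mean()) ** 2)
    ranked = []
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            D = np.column_stack([np.ones(n), X[:, cols]])
            b, *_ = np.linalg.lstsq(D, y, rcond=None)
            resid = y - D @ b
            r2 = 1 - resid @ resid / sst
            r2_adj = 1 - (n - 1) / (n - k - 1) * (1 - r2)
            ranked.append((r2_adj, [names[c] for c in cols]))
    return sorted(ranked, reverse=True)     # best-fitting subsets first

# Hypothetical data in which only X1 and X3 actually matter:
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)
best_adj_r2, best_vars = all_subsets(X, y, ["X1", "X2", "X3", "X4"])[0]
```

Sorting by any of the other measures (Ck, AIC, BIC) requires only swapping the ranking key; the enumeration is identical.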

An alternative approach relies on a family of automated computer algorithms – forward selection, backward elimination, and stepwise regression – that were developed before the computations involved in the ‘all subsets’ approach were feasible. These techniques involve convenient shortcuts and rely heavily on significance tests as a decision tool to include new variables in the model, exclude predictors from the model, and stop the search (for algorithmic details see, for example,


Draper and Smith, 1998; Neter et al., 1996). These techniques share several drawbacks, and are inferior to the more modern approach of all subsets regression. The first problem is that not all possible models are considered, so there is no guarantee that the final model selected is necessarily the ‘best’ according to any criterion. The second major problem is that they use a very large number of tests without any adjustment for test multiplicity.9 We recommend use of the stepwise procedure (the most flexible of the three) only in cases where the number of predictors is extremely large (in the hundreds) as a preliminary step to identify a manageable subset of variables to be examined later by the ‘all subsets’ method. Example 3 illustrates an application of model selection procedures.

Example 3: an exploratory application

In the life-satisfaction data set, we have a total of 17 potential predictors of global life satisfaction (seven domain-specific satisfaction variables, five values measures, four affect measures, and sex). In the absence of theory, one may wish to fit all 2^17 = 131,072 distinct subset models possible and explore which model(s) might provide the best fit in predicting global life satisfaction. As with most exploratory analyses, this can shed some light on potential theories for predicting life satisfaction that can subsequently be confirmed with additional data. To illustrate this procedure, Tables 13.5 and 13.6 show the top models (based on fit) that can be formed from the 17 predictors available and various model-fit measures for these models.

Table 13.5 shows the model that fits best for each level of complexity. For example, the single best predictor is satisfaction with self, the best pair of predictors contains satisfaction with self and frequency of positive affects, and the best triple of predictors also adds satisfaction with family. The table also lists the various fit measures for these selected models. Note that R2adj favors a model with nine predictors (highest value), while AIC and BIC reach their desired minimal values for the models with eight and seven predictors, respectively.

Table 13.6 shows the top five models using two model-selection criteria (adjusted R2 and Cp), with values rounded to 2 decimal places. The model that produces the highest adjusted

Table 13.5 Example 3. Best-fitting models for predicting global life satisfaction for data from US (n = 420), for various levels of complexity

Size of model (p)   R2     Adj. R2   Cp       AIC       BIC       Variables in model
1                   0.39   0.39      159.66   1397.54   1398.31   X7
2                   0.48   0.47       81.68   1336.71   1337.75   X7 X13
3                   0.53   0.52       36.22   1296.44   1297.93   X3 X7 X13
4                   0.54   0.54       25.89   1286.73   1288.36   X3 X4 X7 X13
5                   0.55   0.55       15.27   1276.39   1278.29   X3 X4 X7 X13 X14
6                   0.56   0.56        8.88   1269.97   1272.14   X2 X3 X4 X7 X13 X14
7                   0.57   0.56        2.87   1263.79   1266.30   X2 X3 X4 X5 X7 X13 X14
8                   0.57   0.56        2.86   1263.70   1266.37   X2 X3 X4 X5 X7 X13 X14 X17
9                   0.57   0.56        3.74   1264.54   1267.34   X2 X3 X4 X5 X7 X9 X13 X14 X17
10                  0.57   0.56        4.77   1265.53   1268.47   X2 X3 X4 X5 X7 X9 X13 X14 X15 X17
11                  0.58   0.56        6.19   1266.93   1269.99   X2 X3 X4 X5 X6 X7 X9 X13 X14 X15 X17
12                  0.58   0.56        8.08   1268.81   1271.97   X2 X3 X4 X5 X6 X7 X9 X12 X13 X14 X15 X17
13                  0.58   0.56       10.04   1270.77   1274.02   X2 X3 X4 X5 X6 X7 X9 X12 X13 X14 X15 X16 X17
14                  0.58   0.56       12.02   1272.76   1276.09   X2 X3 X4 X5 X6 X7 X8 X9 X12 X13 X14 X15 X16 X17
15                  0.58   0.56       14.01   1274.74   1278.17   X1 X2 X3 X4 X5 X6 X7 X8 X9 X12 X13 X14 X15 X16 X17
16                  0.58   0.56       16.00   1276.73   1280.25   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X12 X13 X14 X15 X16 X17

Satisfaction domains are: health (X1), finances (X2), family (X3), housing (X4), food (X5), country (X6), self (X7); value domains are: life satisfaction (X8), money (X9), humility (X10), love (X11), happiness (X12); affect measures are: positive frequency (X13), negative frequency (X14), positive intensity (X15), negative intensity (X16); and sex (X17).


Table 13.6 Example 3. The five best-fitting models for predicting global life satisfaction for US sample (n = 420) based on two selection criteria

Size of model (p)   Adj. R2   R2     Cp     AIC       BIC       Variables in model

Using adjusted R2
9                   0.56      0.57   3.73   1264.54   1267.34   X2 X3 X4 X5 X7 X9 X13 X14 X17
10                  0.56      0.57   4.78   1265.53   1268.47   X2 X3 X4 X5 X7 X9 X13 X14 X15 X17
8                   0.56      0.57   2.86   1263.70   1266.37   X2 X3 X4 X5 X7 X13 X14 X17
9                   0.56      0.57   3.92   1264.74   1267.53   X2 X3 X4 X5 X7 X13 X14 X15 X17
10                  0.56      0.57   5.13   1265.91   1268.82   X2 X3 X4 X5 X6 X7 X9 X13 X14 X17

Using Cp
8                   0.56      0.57   2.86   1263.70   1266.37   X2 X3 X4 X5 X7 X13 X14 X17
7                   0.56      0.57   2.87   1263.79   1266.30   X2 X3 X4 X5 X7 X13 X14
8                   0.56      0.57   3.66   1264.54   1267.17   X2 X3 X4 X5 X7 X13 X14 X15
9                   0.56      0.57   3.73   1264.54   1267.34   X2 X3 X4 X5 X7 X9 X13 X14 X17
8                   0.56      0.57   3.91   1264.80   1267.42   X2 X3 X4 X5 X7 X9 X13 X14

Satisfaction domains are: health (X1), finances (X2), family (X3), housing (X4), food (X5), country (X6), self (X7); value domains are: life satisfaction (X8), money (X9), humility (X10), love (X11), happiness (X12); affect measures are: positive frequency (X13), negative frequency (X14), positive intensity (X15), negative intensity (X16); and sex (X17).

R2 contains nine predictors, and includes some variables from each set (i.e., domain-specific satisfaction variables, some affect measures, a value measure, and sex). The model that produces the best fit using the Cp criterion contains the same variables except for the value measure, so it selects an eight-predictor model as best fitting.

By perusing the output from such exploratory procedures, one may begin to discern informative patterns. Certain variables commonly appear as predictors in most of the top models and certain variables never appear in such models. This may then guide the development of some theories regarding which variables appear to be responsible for satisfaction with life, and which do not. To find support for these theories, confirmatory analyses may be conducted using additional data.

Multiple regression as a predictive tool: interval estimation of predictions and cross-validation

Once a particular model is selected (by any of the methods described above), it can be used as a predictive tool for specific values of the criterion. For each unique combination of values of the p predictors (and intercept term X0 = 1), xh = {1, Xh1, …, Xhp}, the expected value of Y is given by:

E(Yh) = β0 + β1Xh1 + β2Xh2 + … + βpXhp = xh′β (17)

where xh is a (p + 1) × 1 vector containing the predictor values and β is a (p + 1) × 1 vector containing the regression coefficients (including the intercept). Therefore, E(Yh) indicates the expected value of the criterion (i.e., its value in the population) given this xh predictor vector. In a particular sample, the regression parameters are replaced by their sample estimates and the predicted value of Y is given by:

Ŷh = b0 + b1Xh1 + b2Xh2 + … + bpXhp = xh′b (18)

where b is a (p + 1) × 1 vector containing the regression coefficient estimates. Therefore, Ŷh is the expected value of all Y values associated with the predictor vector xh and is an unbiased estimate of E(Yh).10 The variance of Ŷh is estimated by pre- and post-multiplying the variance of b by the xh vector, so:

s2{Ŷh} = xh′ s2{b} xh = MSE (xh′(X′X)−1xh) (19)

where s2{b} = (MSE)(X′X)−1, MSE is the mean square error of the model, and X


is the n × (p + 1) data matrix containing all n observations on all p predictors and an intercept term (e.g., Neter et al., 1996). Therefore, an interval estimate for E(Yh) is given by:

Ŷh ± t(1 − α/2; n − p − 1) s{Ŷh} (20)

where (1 − α) is the desired level of confidence of the interval (e.g., for a 95% confidence interval, 1 − α = .95 so α = .05). The width of the confidence interval is a quadratic function of the distance between the mean values of the predictors in the sample (contained in the vector x̄) and the target xh. In other words, predictions are most accurate when xh = x̄, and their accuracy decreases (quadratically) as one moves away from this point, highlighting the dangers of extreme extrapolations.

In addition to predictions for the meanY value given xh, one may be interestedin predicting a single new observation.This amounts to randomly selecting oneobservation with predictor values xh. Thepoint prediction is identical, but this selectioninduces a higher level of uncertainty. Thevariance of the predicted value based ona single new observation is s2{Yh(new)} =MSE+xh’s2{b}xh = MSE(1+xh’(X’X)−1xh),resulting in the confidence interval Yh ± t(1 − α/2; n − p − 1)s{Yh(new)}.
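As an illustration, the point prediction and both interval estimates above can be computed directly from these formulas. The following sketch (our own; it assumes numpy and scipy are available, and all function and variable names are ours, not the authors') implements equations 18–20 and the single-observation interval:

```python
import numpy as np
from scipy import stats

def interval_estimates(X, y, x_h, alpha=0.05):
    """Point prediction (Eq. 18), confidence interval for E(Y_h) (Eqs. 19-20),
    and prediction interval for a single new observation at x_h.
    X is the n x (p+1) design matrix (first column of ones); x_h includes
    the leading 1 for the intercept."""
    n, k = X.shape                                 # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                          # OLS estimates
    resid = y - X @ b
    mse = resid @ resid / (n - k)                  # error df = n - p - 1
    y_hat = x_h @ b                                # Eq. 18
    se_mean = np.sqrt(mse * (x_h @ XtX_inv @ x_h))       # s{Y_h}, Eq. 19
    se_new = np.sqrt(mse * (1 + x_h @ XtX_inv @ x_h))    # s{Y_h(new)}
    t_crit = stats.t.ppf(1 - alpha / 2, n - k)
    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_new, y_hat + t_crit * se_new)
    return y_hat, ci, pi
```

Note that the prediction interval for a new observation is always wider than the confidence interval for the mean, since its variance adds the full MSE term.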

Beyond the concern with predictions of specific values, one may seek ways to quantify the quality of the model’s predictive validity as a whole. Intuitively, one may think that R2, the coefficient of determination, provides such a measure. Because the parameter estimates are based on data from a particular sample, the R2 value is affected by the sample’s idiosyncrasies. It is maximal for the sample at hand, but not for others. Therefore, if another random sample is obtained from the same population, and the model from the original sample is used to predict values in this new sample, the new R2 value would be lower than in the original sample, a phenomenon sometimes referred to as ‘shrinkage’. The degree of shrinkage depends on p, n, and R2. In general, the smaller the shrinkage the higher the confidence that the model generalizes well to other samples.

The procedure used to evaluate how well the regression model – developed using one sample – generalizes to other samples is called cross-validation. The simplest cross-validation calls for a division of the available sample into two smaller random samples (e.g., halves). One sample (or half), sometimes referred to as the screening or training sample, is used to estimate the model parameters and obtain the (optimal) fit of the model. The second sample, sometimes referred to as the validation or prediction sample, is used to obtain predicted values based on the parameter estimates computed previously (using the training sample). The R2 obtained when applying the parameters estimated in the training sample to the observations in the validation sample is the cross-validated (and typically ‘shrunk’) R2. Ideally, the values of statistics such as the MSE, b, R2, and so on from the validation sample should be relatively close to their counterparts in the training sample.
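A minimal sketch of this split-sample procedure (the half-split and all names are our own illustration, assuming numpy):

```python
import numpy as np

def cross_validated_r2(X_train, y_train, X_valid, y_valid):
    """Fit OLS on the training (screening) sample, then compute R^2 in the
    validation sample using the training-sample coefficients; the drop from
    the training-sample R^2 is the shrinkage."""
    b = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    pred = X_valid @ b
    ss_res = np.sum((y_valid - pred) ** 2)
    ss_tot = np.sum((y_valid - np.mean(y_valid)) ** 2)
    return 1.0 - ss_res / ss_tot
```

Both X matrices are assumed to carry a leading column of ones for the intercept.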

Variations on this cross-validation procedure involve splitting the data set into more than two parts, each time leaving out one part and using the remaining data as a training sample and the left-out part as the validation sample. In the extreme, the left-out data includes a single observation, such that the training data set consists of all but one observation, and the accuracy of the estimated model is evaluated by obtaining the prediction for a single observation at a time. This is also known as the ‘jack-knifing’ procedure (Mosteller and Tukey, 1977).
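In code, the extreme leave-one-out (jack-knife) variant can be sketched as follows (our own illustration, assuming numpy):

```python
import numpy as np

def loo_predictions(X, y):
    """Leave-one-out ('jack-knife') predictions: for each observation i, fit
    OLS on the remaining n - 1 cases and predict the held-out case."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # training set: all cases but i
        b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        preds[i] = X[i] @ b
    return preds
```

Comparing these held-out predictions with the observed y gives an honest (out-of-sample) assessment of the model's accuracy.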

Alternatively, and especially if the sample is too small to allow for data splitting, one can predict statistically the degree of shrinkage. Pedhazur (1997) discusses formulae [attributed both to Stein (1960) and Herzberg (1969)] that have been shown, in simulation studies, to accurately estimate the cross-validation coefficient (the R2 in the validation sample) without actually carrying out the cross-validation process (Cotter and Raju, 1982). For a model with fixed predictors, the estimate of the cross-validation coefficient, R2CV, is:

R2CV = 1 − [(n − 1)/n] [(n + p + 1)/(n − p − 1)] (1 − R2) (21)

where R2 is the squared multiple correlation coefficient of the full sample and n is the sample size of the full sample. For a model with random predictors the estimate is:

R2CV = 1 − [(n − 1)/(n − p − 1)] [(n − 2)/(n − p − 2)] [(n + 1)/n] (1 − R2) (22)
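Both estimates are simple arithmetic and can be checked directly (a sketch; function names are ours):

```python
def r2_cv_fixed(r2, n, p):
    """Estimated cross-validation coefficient for fixed predictors (Eq. 21)."""
    return 1 - ((n - 1) / n) * ((n + p + 1) / (n - p - 1)) * (1 - r2)

def r2_cv_random(r2, n, p):
    """Estimated cross-validation coefficient for random predictors (Eq. 22)."""
    return 1 - (((n - 1) / (n - p - 1)) * ((n - 2) / (n - p - 2))
                * ((n + 1) / n) * (1 - r2))
```

For example, with R2 = .33, n = 100 and p = 4, both estimates fall below .33, with the random-predictor formula shrinking slightly more than the fixed-predictor formula.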

ADDITIONAL TOPICS IN MULTIPLE REGRESSION

In this section we discuss two additional topics that affect the interpretation of multiple regression (MR) results: collinearity among predictors and relative importance of predictors.

Collinearity

Collinearity occurs when some of the predictors in a dataset are linear combinations of other predictors. If one predictor can be perfectly predicted from a linear combination of other predictors, the X’X matrix used in estimating model parameters is singular and, therefore, there is no unique set of unbiased estimates of the parameters. From a practical perspective collinearity implies redundancy in the information provided by the set of predictors and indicates that not all predictors are needed (i.e., the model is mis-specified). Collinearity analysis is typically related to a confirmatory approach, as it can be used to identify mis-specification of the model. However, some researchers may also use it in an exploratory manner to screen and exclude some predictors from the analysis.

Typically, collinearity is not perfect. However, when one predictor is almost perfectly predictable from the others (say, with an R2 greater than 0.9), a situation described as near collinearity, most researchers would agree that this is an unacceptable level of redundancy. Statistically, (near) collinearity can make the regression coefficients take arbitrarily high values and switch sign in an unpredictable fashion, as well as inflate their standard errors. While the correlation matrix of the predictors can be inspected for unacceptably high bivariate correlations, patterns of collinearity due to more complex linear combinations of the variables may not be detected by simple inspection. Therefore, several measures have been proposed for detecting collinearity.

The ‘global’ approach to measuring collinearity involves examining global measures of the system (i.e., the p predictors). Under perfect collinearity the X’X matrix is of deficient rank (< p) and at least one of its eigenvalues is 0. Thus, small (i.e., near-zero) eigenvalues indicate near-singularity and are diagnostic of near-collinearity. A popular measure is the condition number, defined as the square root of the ratio of the largest eigenvalue of X’X (standardized to unit variances) to the smallest eigenvalue.11 Some guidelines (not necessarily agreed upon) suggest that for this measure values in the range of 30–100 indicate moderate collinearity and values over 100 indicate strong collinearity (Belsley, 1991). Examination of the coefficients of the eigenvector associated with very small eigenvalues can help identify the linear combinations that induce the dependency among the predictors.
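The condition number can be computed from the eigenvalues of the predictors' correlation matrix (a sketch assuming numpy; the near-duplicate-column example in the test is ours, not from the chapter):

```python
import numpy as np

def condition_number(X):
    """Square root of the ratio of the largest to the smallest eigenvalue of
    the predictors' correlation matrix (i.e., X'X after standardizing the
    columns to unit variance)."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(R)       # ascending; real since R is symmetric
    return float(np.sqrt(eigvals[-1] / eigvals[0]))
```

For uncorrelated predictors the eigenvalues are all near 1 and the condition number is near its minimum of 1; a nearly redundant predictor drives the smallest eigenvalue toward 0 and the condition number up sharply.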

The ‘local’ approach to measurement of collinearity involves identification of highly predictable predictors. Let R2i be the squared multiple correlation obtained in performing a MR in which Xi is predicted from the other (p − 1) predictor variables. Two measures based on R2i are often reported. One is the variance inflation factor (VIF), defined as: VIFi = 1/(1 − R2i). As the name indicates, it represents the inflation in the variance of bi due to correlations among the predictors, where the baseline is the case of uncorrelated variables (when R2i = 0 and VIFi = 1).


High values of VIFi indicate that Xi may be a linear combination of the other predictors. Because VIF values are unbounded, they are sometimes rescaled into measures of ‘tolerance’, defined as 1/VIFi = (1 − R2i). Tolerance, therefore, varies from a minimum of 0 to a maximum of 1. There is no clear agreement on what values of VIF or tolerance are considered to be indicative of severe collinearity problems. To some extent, this is a subjective decision that involves deciding what value of R2 (among the predictors!) would be considered unacceptably high. The measures can also be used in a relative fashion to identify the variables with the lowest tolerance (highest VIF).
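The VIF for each predictor can be computed by regressing it on the remaining predictors (a sketch assuming numpy; names are ours):

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_i = 1 / (1 - R_i^2), where R_i^2 is
    the squared multiple correlation from regressing X_i on the other
    predictors (with an intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        Z = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        b = np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
        resid = X[:, i] - Z @ b
        r2_i = 1 - resid @ resid / np.sum((X[:, i] - X[:, i].mean()) ** 2)
        out[i] = 1 / (1 - r2_i)
    return out
```

Tolerance is then simply `1 / vif(X)`.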

Relative importance

A question that often comes up in the interpretation of MR models relates to the determination of the relative importance of predictors. Researchers are typically interested in: (1) ranking the predictors from the most to the least important; (2) scaling them, by assigning values on an interval scale that reflects their importance; and possibly (3) relating these measures to the model’s overall goodness of fit (Budescu, 1993). Despite objections (see Pratt, 1987), many researchers have proposed ways to measure the relative importance of predictors [see Budescu (1993) and Kruskal and Majors (1989) for partial reviews]. A surprising conclusion of the reviews of the predictor importance literature is that there is no universally accepted definition or a generally accepted measure of importance. In fact, some of the measures proposed are not explicitly related to any specific definition.

Many researchers, incorrectly, equate a predictor’s relative importance with the magnitude of its standardized regression coefficient. Criticisms of this misleading interpretation are published periodically in the professional literature in various disciplines. For some examples the reader should consult: Greenland et al. (1986) in epidemiology; King (1986) in political science; Budescu (1993) and Darlington (1968) in psychology; Bring (1994), Kruskal and Majors (1989), and Mosteller and Tukey (1977) in statistics. In the next section we review briefly the major shortcomings of the standardized coefficients as measures of importance. We conclude that, unless the predictors are uncorrelated, standardized coefficients cannot be said to isolate and measure a net, direct, or unique effect of the corresponding target predictors.

In addition to importance measures that are based on standardized regression coefficients, other measures proposed in the relative importance literature are typically based on correlations (e.g., Hedges and Olkin, 1981; Kruskal, 1987; Lindeman et al., 1980; Mayeske et al., 1969; Mood, 1969, 1971; Newton and Spurrell, 1967a, 1967b; Pedhazur, 1975), on a combination of the coefficients and correlations (Courville and Thompson, 2001; Darlington, 1968; Dunlap and Landis, 1998; Pratt, 1987; Thomas and Zumbo, 1996; Thomas et al., 1998), or on information (Soofi and Retzer, 2000; Soofi et al., 2000; Theil, 1987; Theil and Chung, 1988). We discuss two alternative and, in our view, superior methods of meaningfully evaluating relative importance: dominance analysis (DA; Azen and Budescu, 2003; Budescu, 1993) in the confirmatory context, and criticality analysis (CA; Azen et al., 2001) in the exploratory context.

Why standardized coefficients are not measures of relative importance

The usual interpretation of the coefficients is the rate of change in Y per unit change in Xi (or slope) when all other variables are fixed (held constant). We have already discussed the fact that this interpretation is inadequate for polynomial and/or interactive models where the predictors are functionally related. We recommend centering (or, alternatively, standardizing) the predictors in polynomial and/or interactive models (see the section ‘Derivative predictors’, above) to reduce collinearity. This, however, does not affect their interpretation. In this section, we explain why standardized coefficients should not be interpreted as measures of relative importance.

Consider now the case of distinct predictors. The definition does not specify at what values to fix the other predictors, and it is natural to infer that this rate of change is invariant across all choices of the values of the other predictors, Xj (j ≠ i). This is indeed true if all p predictors are mutually uncorrelated and/or the p predictors and the response have a (p + 1)-variate normal distribution, but not necessarily in other cases [see Lawrance (1976) for a proof and discussion]. The intuition is quite simple: the higher the correlations between the predictors, the closer this case comes to the situation where the predictors are functionally related, in the sense that changing one variable implies changes in the others. And conditioning on various levels of any one predictor focuses on different subsets of the target population. Thus, the interpretation of the standardized (or raw) coefficient as a fixed rate of change for the case of distinct predictors is contingent on strict assumptions about the distribution of, and intercorrelations between, the p predictors.

The next issue is how to interpret the sign of the coefficient: can one seriously talk about negative importance? Or, should one interpret importance as we interpret correlations, that is, by distinguishing between the absolute magnitude of the coefficient (a measure of overall importance) and its sign (an indication of the direction of the effect)? It turns out that neither approach captures faithfully the behavior of the coefficients. This determination follows directly from the elegant analysis of suppressor variables by Tzelgov and his colleagues (e.g., Tzelgov and Henik, 1991; Tzelgov and Stern, 1978). They show examples of cases where all the predictors are positively inter-correlated and their correlations with the response have identical signs, but in each case one of the regression coefficients changes sign! Similarly, in the case of (almost) perfect collinearity (rX1X2 = 0.99) where the two predictors correlate with the response almost identically (ryX1 = 0.61 and ryX2 = 0.60), one of the coefficients is positive (b*1 = .804) and the second is negative (b*2 = −.196). Clearly, the sign of the regression coefficient is unrelated to any sensible definition of the predictors’ importance.
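The numerical example above is easy to reproduce: with standardized variables, the coefficients are obtained by solving Rxx b* = rxy (a sketch assuming numpy):

```python
import numpy as np

# Predictor intercorrelation and validities from the example in the text
Rxx = np.array([[1.00, 0.99],
                [0.99, 1.00]])
rxy = np.array([0.61, 0.60])

# Standardized regression coefficients: b* = Rxx^{-1} rxy
b_star = np.linalg.solve(Rxx, rxy)
```

Despite nearly identical validities (.61 and .60), the solution gives b*1 ≈ .804 and b*2 ≈ −.196, reproducing the sign reversal discussed above.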

Bring (1994) shows that, contrary to the implication of the interpretation of the regression coefficient as a measure of importance, the magnitude of the standardized coefficients does not reflect the effect of the corresponding predictors on the model’s goodness of fit. Finally, it is well known that the ranking of the predictors based on their standardized coefficients is model dependent and is not necessarily preserved in all subset models.

Relative importance and dominance analysis

In the previous section we have argued that interpreting standardized coefficients as measures of importance can lead to paradoxical situations that defy common sense and natural intuitions about importance. This was done without actually proposing a clear definition of this elusive concept. In this section we propose a definition and describe a methodology, dominance analysis (DA), that is more suitable for the determination of relative importance in linear models. Budescu (1993) suggested that the importance of any predictor should: (1) be defined in terms of the variable’s effect on the model’s fit; (2) be based on direct and meaningful comparisons of the target predictor with all the other predictors; (3) reflect the variable’s contribution to the fit of the full model under consideration, as well as to all its possible subsets; and (4) recognize indeterminate situations in which it is impossible to rank (some of the) predictors in terms of their importance. He also proposed that relative importance be derived and inferred from the relationship between all p(p − 1)/2 distinct pairs of predictors. Predictor Xi dominates (completely) predictor Xj (for short, XiDXj) if Xi contributes more to the model’s fit in all sub-models that include neither Xi nor Xj. In other words, XiDXj if in all the instances (with p predictors, there are 2^(p−2) such cases) where one has the option of including only one of the two variables in a model, considerations of goodness of fit would never favor Xj (i.e., Xi always contributes at least as much as Xj to the model’s fit). For example, in a model with p = 4 predictors, X1 would be considered to dominate (completely) X2 if its contribution to the fit of the model would be higher in the following 2^(4−2) = 4 cases: (i) as a single predictor; (ii) in addition to X3 alone; (iii) in addition to X4 alone; and (iv) in addition to X3 and X4 as a set. This is an explicit, precise, and quite stringent definition of importance that is consistent with most researchers’ intuitions and expectations about it. DA can be applied meaningfully in any substantive domain and for any type of model (distinct predictors, polynomial, interactive, etc.), and it is free from the various interpretational problems that plague the (standardized) coefficients.

Note that if p ≥ 3, complete dominance involves more than one comparison among each pair of variables, so it is possible that neither predictor dominates the other. Consequently, in some cases it may be impossible to rank-order all p predictors, although in most cases it is possible to establish a partial order. To address this situation Azen and Budescu (2003) have developed two weaker versions of dominance: conditional and general. Conditional dominance relies on the mean contribution of Xi (specifically, its squared semi-partial correlations) in all models of a given size. Finally, general dominance relies on Cxi, the average of these mean size-specific contributions of Xi across all model sizes (see also Lindeman et al., 1980). An interesting property of these measures is that they add up to the (full) model’s fit:

R2 = Σ(i = 1 to p) Cxi (23)

In other words, the Cxi (i = 1, . . . , p) decompose or distribute the model’s global fit across all p predictors. Table 13.7 illustrates this approach with a model where global satisfaction with life is predicted by p = 4

Table 13.7 Dominance analysis example (with p = 4 predictors)

Subset model              R2     Additional contribution of
                                 X1     X2     X3     X4
Null and k = 0 average   .000    .054   .128   .229   .100
X1                       .054     --    .103   .206   .081
X2                       .128    .030    --    .162   .058
X3                       .229    .032   .061    --    .039
X4                       .100    .035   .086   .168    --
k = 1 average                    .032   .083   .179   .059
X1X2                     .157     --     --    .153   .049
X1X3                     .260     --    .050    --    .031
X1X4                     .135     --    .072   .157    --
X2X3                     .290    .020    --     --    .025
X2X4                     .186    .021    --    .129    --
X3X4                     .268    .024   .047    --     --
k = 2 average                    .022   .056   .146   .035
X1X2X3                   .310     --     --     --    .020
X1X2X4                   .207     --     --    .124    --
X1X3X4                   .291     --    .039    --     --
X2X3X4                   .314    .016    --     --     --
k = 3 average                    .016   .039   .124   .020
X1X2X3X4                 .330
Overall average                  .031   .076   .169   .054

X1, health; X2, finances; X3, family; X4, housing.


domain-specific variables.12 The results indicate that satisfaction with family (X3) completely dominates each of the other three predictors (its contribution to the model’s fit is greater than that of any of the other predictors in each of the rows of the table where they can be compared), satisfaction with finances (X2) completely dominates the remaining two predictors, and satisfaction with housing (X4) dominates satisfaction with health (X1) as a predictor of overall satisfaction with life. The overall model’s fit is R2 = 0.33 and can be distributed among the four variables as .03 to health, .08 to finances, .17 to family, and .05 to housing.
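The general dominance measures Cxi can be computed by brute force over all 2^p subset models. The sketch below is our own small-p illustration (assuming numpy; it is not the authors' SAS macro), averaging each predictor's mean R2 contribution within each model size and then across sizes:

```python
from itertools import combinations
import numpy as np

def general_dominance(X, y):
    """General dominance weights C_xi: for each predictor, the mean R^2
    contribution within each model size, averaged across sizes. The weights
    sum to the full model's R^2 (Eq. 23)."""
    n, p = X.shape
    ss_tot = np.sum((y - y.mean()) ** 2)

    def r2(cols):
        if not cols:
            return 0.0
        Z = np.column_stack([np.ones(n), X[:, list(cols)]])
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ b
        return 1.0 - resid @ resid / ss_tot

    fits = {s: r2(s) for k in range(p + 1) for s in combinations(range(p), k)}
    weights = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        size_means = []
        for k in range(p):            # size of the model before adding X_i
            contribs = [fits[tuple(sorted(s + (i,)))] - fits[s]
                        for s in combinations(others, k)]
            size_means.append(np.mean(contribs))
        weights[i] = np.mean(size_means)
    return weights
```

The weights reproduce the bottom row of a table like Table 13.7 and sum to the full-model R2, as in Equation 23.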

Relative importance and criticality analysis

Dominance analysis is particularly relevant and useful for cases that involve relatively few predictors (p < 10), where there is interest in a complete ranking of their contribution to the model’s overall fit and a complete understanding of the relationships between the predictors (Azen and Budescu, 2003, describe a slight variation on the main theme that allows one to perform ‘constrained DA’ that includes certain groups of variables). Thus, DA is most appropriate for the confirmatory applications of MR. Azen et al. (2001) developed an alternative approach, criticality analysis (CA), that was motivated by, and consistent with, the logic of the exploratory applications of MR. In a nutshell, for CA one uses a large number, B, of bootstrapped re-samples of size n (taken with replacement from the original sample of n observations). In each re-sample one can invoke one’s favorite selection method (e.g., adjusted R2, AIC, BIC, etc.) to pick the ‘best-fitting model’ (BFM). This produces a distribution of BFMs across the B bootstrap samples. Next, one can use this distribution to calculate, for each variable, the fraction of BFMs in which it was included. This measure varies from 0 (the target variable was never included in any of the BFMs) to 1 (the target variable was included in all of the BFMs). This index can be interpreted as the probability of BFM mis-specification when the target variable is omitted, so it measures how critical the target variable is to the identification of the ‘best’ model. Hence we refer to it as the predictor’s criticality.

Table 13.8 illustrates this approach with the life satisfaction data using nine variables [five domain satisfaction (DS) variables and four values]. We ran B = 1000 re-samples and used adjusted R2 and AIC to identify the BFMs. The top panel presents the frequency distribution of the best-fitting models,13 and the bottom panel calculates the criticality of the nine variables.

The results indicate that satisfaction with family, finances and housing are essential (highly critical), as they are included in a high percentage of BFMs under both selection criteria (i.e., AIC and adjusted R2). Satisfaction with family, for example, appears in every BFM (criticality is 1.0) using both selection criteria. However, predictors such as value on humility and love have relatively low criticality values, as these predictors were included in well under half of the BFMs. In general, the domain-satisfaction predictors resulted in much higher criticalities than the value-related predictors in predicting overall satisfaction with life.
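The bootstrap procedure can be sketched as follows. This is our own minimal illustration (assuming numpy), using adjusted R2 as the selection criterion and an exhaustive all-subsets search, which is feasible only for small p:

```python
from itertools import combinations
import numpy as np

def criticality(X, y, B=200, seed=0):
    """Bootstrap criticality sketch: in each of B re-samples (drawn with
    replacement), find the best-fitting subset of predictors by adjusted R^2
    over all non-empty subsets, then return the fraction of best-fitting
    models that contain each predictor."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    subsets = [s for k in range(1, p + 1) for s in combinations(range(p), k)]
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # bootstrap re-sample
        Xb, yb = X[idx], y[idx]
        ss_tot = np.sum((yb - yb.mean()) ** 2)
        best, best_fit = None, -np.inf
        for s in subsets:
            Z = np.column_stack([np.ones(n), Xb[:, list(s)]])
            b = np.linalg.lstsq(Z, yb, rcond=None)[0]
            resid = yb - Z @ b
            r2 = 1 - resid @ resid / ss_tot
            adj = 1 - (1 - r2) * (n - 1) / (n - len(s) - 1)
            if adj > best_fit:
                best, best_fit = s, adj
        counts[list(best)] += 1
    return counts / B
```

Any other selection criterion (AIC, BIC) or search strategy can be substituted for the adjusted-R2 all-subsets search used here.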

FINAL REMARKS

We reviewed briefly and in a relatively non-technical fashion the major MR results and emphasized various interpretational issues while highlighting how they relate to different applications of the model. Given its long history and wide-spread use, there are many excellent books that cover these applications, as well as many others that we did not touch on, in a more comprehensive fashion and at various levels of technical detail. Some of our favorites, several of which we cited repeatedly in this chapter, are Chatterjee and Hadi (2000), Cohen et al. (2003), Draper and Smith (1998), Graybill (1976), Johnson and Wichern (2002), Kleinbaum et al. (2007), Mosteller and Tukey (1977), Neter et al. (1996), Pedhazur (1997), and Timm (2002).


Table 13.8 Distribution (percentage) of best-fitting models and predictor criticality in 1000 re-samples

(a) Best-fitting models

Best-fitting model              AIC     Adjusted R2
X1 X5 X6 X7 X8 X9               16.4    11.6
X5 X6 X7 X8 X9                  15.0     6.2
X1 X4 X5 X6 X7 X8 X9             7.1    10.9
X5 X6 X7 X9                      6.5     --
X3 X5 X6 X7 X8 X9                6.5     5.5
X1 X3 X5 X6 X7 X8 X9             5.4     9.3
X1 X3 X4 X5 X6 X7 X8 X9          --      6.5
X1 X2 X5 X6 X7 X8 X9             --      6.4
Other models                    43.1    43.6

(b) Predictor criticality

Predictor                        AIC     Adjusted R2
X1: value on money               .466    .621
X2: value on humility            .180    .348
X3: value on love                .259    .417
X4: value on happiness           .261    .421
X5: satisfaction with health     .929    .960
X6: satisfaction with finances   .995   1.000
X7: satisfaction with family    1.000   1.000
X8: satisfaction with country    .823    .929
X9: satisfaction with housing    .966    .984

AIC, Akaike’s information criterion.

ACKNOWLEDGMENTS

We are grateful to Drs Hans Friedrich Koehn and Albert Maydeu-Olivares for useful comments on an earlier version of the chapter.

NOTES

1 We prefer to reserve the terms independent/dependent variables for randomized experimental designs, and will use the criterion/predictors terminology throughout the chapter.

2 This amounts to dividing the standardized variables (Zy, Zxi) by √(n − 1). The name is due to the fact that under this transformation the parameter estimates can be obtained directly from the correlation matrix among the response and the predictors.

3 Various sources refer to such variables as ‘categorical’, ‘indicator’ or ‘dummy’ variables.

4 This set of transformations is part of the more general family of power transformations X’ = X^λ. Polynomials are defined by natural (positive integer) exponents, but the general family includes all real values and includes other ‘standard’ transformations such as square root (λ = 0.5), reciprocal (λ = −1), logarithmic (λ = 0, by definition), etc. It is often used to optimize the fit of the model. In particular, the well-known Box–Cox procedure (Box and Cox, 1964) provides a convenient way to find the ‘best’ exponent.

5 A ‘mixed’ design is a combination of the fixed and random designs where some of the predictors are fixed and the others are random variables.

6 This form also highlights nicely the decomposition of the test statistic as a product of the ‘size of the effect’ and the ‘size of the study’ (e.g., Maxwell and Delaney, 2004).

7 Note that this omnibus test is not the same as the F-test for lack of fit, which tests whether the model satisfies the linearity assumption. Details on the lack of fit test can be found, for example, in Neter et al. (1996).

8 The two approaches don’t necessarily lead to identical solutions.

9 The tests in these procedures select the highest (or the lowest) test statistic from a large number of tests without adjusting for the potential capitalization on chance inherent in such a process.

10 Strictly speaking, the predictions are unbiased only if we fit the correct model. Although it is impossible to actually establish this fact, researchers routinely make this assumption.

11 The condition number is closely related to the internal correlation (Joe and Mendoza, 1989), the upper bound of all the simple, multiple and canonical correlations that can be defined among the p predictors.

12 We only use four variables so that we can show the results of the complete analysis in a relatively small table. A SAS macro that can analyze up to p = 10 predictors can be downloaded from http://www.uwm.edu/∼azen/damacro.html

13 In the interest of space we only present those models that were selected as best at least 50 times (5%) by one of the two criteria.

REFERENCES

Agresti, A. (1996) An introduction to Categorical DataAnalysis. Wiley: New York.

Aiken, L.S. and West, S.G. (1991) Multiple Regression:Testing and interpreting interactions. Thousand Oaks,CA: Sage.

Akaike, H. (1970) ‘Statistical predictor identification’,Annals of the Institute of Statistical Mathematics, 22:203–217.

Akaike, H. (1973) ‘Information theory and an extensionof the maximum likelihood principle’, in Petrov, B.N.and Csaki, F. (eds.), 2nd International symposiumon information theory. Budapest: Akailseoniai-Kiudo.pp. 267–281.

Alf, E.F. and Graf, R.G. (2002) ‘A new maximum likeli-hood estimator of the squared multiple correlation’,Journal of Behavioral and Educational Statistics, 27:223–235.

Anastasi, A. (1982) Psychological testing (5th edn.).New York: MacMillan Publishing.

Ashenfelter, O., Ashmore, D., and Lalonde, R. (1995)‘Bordeaux wine vintage quality and the weather’,Chance, 8: 7–14.

Azen, R. and Budescu, D.V. (2003) ‘The dominanceanalysis approach for comparing predictors in multipleregression’, Psychological Methods, 8: 129–148.

Azen, R., Budescu, D.V., and Reiser, B. (2001) ‘Criticalityof predictors in multiple regression’, British Journalof Mathematical and Statistical Psychology, 54:201–225.

Belsley, D. A. (1991) Conditioning Diagnostics:Collinearity and Weak Data in Regression. New York:Wiley.

Box, G.E.P. and Cox, D.R. (1964) ‘An analysis oftransformations’, Journal of Royal Statistical Society,Series B, 26: 211–246.

Box, G.E.P. and Jenkins, G. (1976) Time Series Analysis:Forecasting and Control. San Francisco: Holden-Day.

Bradley, R.A. and Srivastava, S.S. (1979) ‘Correlationin polynomial regression’, The American Statistician,33: 11–14.

Bring, J. (1994) ‘How to standardize regressioncoefficients’, The American Statistician, 48: 209–213.

Bryk, A.S. and Raudenbush, S.W. (1992) HierarchicalLinear Models: Applications and Data AnalysisMethods. Newbury Park, CA: Sage.

Budescu, D.V. (1980) ‘A note on polynomialregression’, Multivariate Behavioral Research, 15:497–508.

Budescu, D.V. (1985) ‘Analysis of dichotomous variablesin the presence of serial dependence’, PsychologicalBulletin, 97: 547–561.

Budescu, D.V. (1993) ‘Dominance analysis: A newapproach to the problem of relative importanceof predictors in multiple regression’, PsychologicalBulletin, 114: 542–551.

Burnham, K.P and Anderson, D.R. (2002) ModelSelection in Multimodel Inference: A PracticalInformational-Theoretical Approach (2nd edn.). NewYork: Springer.

Chatterjee, S., Hadi, A., and Price, B. (2000) RegressionAnalysis by Example (3rd edn.). Wiley.

Cohen, J., Cohen, P., West, S., and Aiken, L. (2003)Applied Multiple Regression/Correlation Analysis forthe Behavioral Sciences (3rd edn.). Hillsdale, NJ:Lawrence Erlbaum Associates.

Cortina, J.M. (1993) ‘Interaction, nonlinearity, andmulticollinearity: implications for multiple regression’,Journal of Management, 19: 915–922.

Cotter, K.L. and Raju, N.S. (1982) ‘An evaluation offormula-based population squared cross-validityestimates and factor score estimates in prediction’,Educational and Psychological Measurement, 42:493–519.

Courville, T. and Thompson, B. (2001) ‘Use of structurecoefficients in published multiple regression articles:0 is not enough’, Educational and PsychologicalMeasurement, 61: 229–248.

Darlington, R,B. (1968) ‘Multiple regression inpsychological research and practice’, PsychologicalBulletin, 69: 161–182.

Diaconis, P. and Efron, B. (1983) ‘Computer-intensivemethods in statistics’, Scientific American, 248:116–130.

Draper, N.R. and Smith, H. (1998) Applied RegressionAnalysis (3rd edn.). New York: Wiley.

Dunlap, W.P. and Landis, R.S. (1998) ‘Interpretationsof multiple regression borrowed from factor analysisand canonical correlation’, Journal of GeneralPsychology, 125: 397–407.

Ganzach, Y. (1997) ‘Misleading interaction andcurvilinear terms’, Psychological Methods, 2:235–247.

Graybill, F.A. (1976) Theory and Application of theLinear Model. North Scituate, MA: Duxbury Press.

Page 28: Applications of Multiple Regression in Psychological Research

[12:36 18/4/2009 5283-Millsap-Ch13.tex] Job No: 5283 Millsap: The SAGE Handbook of Quantitative Methods in Psychology Page: 309 283–310

APPLICATIONS OF MULTIPLE REGRESSION IN PSYCHOLOGICAL RESEARCH 309

Greenland, S., Schelessman, J.J., and Criqui, M.H. (1986)‘The fallacy of employing standardized regressioncoefficients and correlations as measures of effect’,American Journal of Epidemiology, 123: 203–208.

Hedges, L.V. and Olkin, I. (1981) ‘The asymptoticdistribution of commonality components’,Psychometrika, 46: 331–336.

Herzberg, P.A. (1969) ‘The parameters ofcross-validation’, Psychometrika, 34: 1–68.

Joe, G.W. and Mendoza, J.L. (1989) ‘The internalcorrelation: its applications in statistics andpsychometrics’, Journal of Educational Statistics, 14:211–226.

Johnson, R.A. and Wichern, D.W. (2002) AppliedMultivariate Statistical Analysis (5th edn.). UpperSaddle River, NJ: Prentice Hall.

King, G. (1986) 'How not to lie with statistics: avoiding common mistakes in quantitative political science', American Journal of Political Science, 30: 666–687.

Kleinbaum, D.G., Kupper, L.L., Muller, K.E., and Nizam, A. (2007) Applied Regression Analysis and Other Multivariable Techniques (3rd edn.). North Scituate, MA: Duxbury Press.

Kruskal, W. (1987) 'Relative importance by averaging over orderings', The American Statistician, 41: 6–10.

Kruskal, W. and Majors, R. (1989) 'Concepts of relative importance in recent scientific literature', The American Statistician, 43: 2–6.

Lawrance, A.J. (1976) 'On conditional and partial correlation', The American Statistician, 30: 146–149.

Lindeman, R.H., Merenda, P.F., and Gold, R.Z. (1980) Introduction to Bivariate and Multivariate Analysis. Glenview, IL: Scott, Foresman.

Lubinski, D. and Humphreys, L.G. (1990) 'Assessing spurious moderator effects: illustrated substantively with the hypothesized (synergistic) relation between spatial visualization and mathematical ability', Psychological Bulletin, 107: 385–393.

Mallows, C.L. (1973) 'Some comments on Cp', Technometrics, 15: 661–675.

Maxwell, S.E. and Delaney, H.D. (2004) Designing Experiments and Analyzing Data: A Model Comparison Perspective (2nd edn.). Mahwah, NJ: Lawrence Erlbaum Associates.

Mayeske, G.W., Wisler, C.E., Beaton, A.E., Weinfeld, F.D., Cohen, W.M., Okada, T. et al. (1969) A Study of Our Nation's Schools. Washington, DC: US Department of Health, Education, and Welfare, Office of Education.

Miller, A. (1990) Subset Selection in Regression. London: Chapman and Hall.

Mood, A.M. (1969) 'Macro-analysis of the American educational system', Operations Research, 17: 770–784.

Mood, A.M. (1971) 'Partitioning variance in multiple regression analyses as a tool for developing learning models', American Educational Research Journal, 8: 191–202.

Mooney, C.Z. and Duval, R.D. (1993) Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park, CA: Sage Publications.

Mosteller, F. and Tukey, J.W. (1977) Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.

Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W. (1996) Applied Linear Statistical Models (4th edn.). Chicago: Irwin.

Newton, R.G. and Spurrell, D.J. (1967a) 'A development of multiple regression for the analysis of routine data', Applied Statistics, 16: 51–64.

Newton, R.G. and Spurrell, D.J. (1967b) 'Examples of the use of elements for clarifying regression analysis', Applied Statistics, 16: 165–172.

Olkin, I. and Pratt, J.W. (1958) 'Unbiased estimation of certain correlation coefficients', The Annals of Mathematical Statistics, 29: 201–211.

Pedhazur, E.J. (1975) 'Analytic methods in studies of educational effects', in Kerlinger, F.N. (ed.), Review of Research in Education 3. Itasca, IL: Peacock.

Pedhazur, E.J. (1997) Multiple Regression in Behavioral Research: Explanation and Prediction (3rd edn.). Orlando, FL: Harcourt Brace.

Pratt, J.W. (1987) 'Dividing the indivisible: using simple symmetry to partition variance explained', in Pukkila, T. and Puntanen, S. (eds.), Proceedings of the Second Tampere Conference in Statistics. University of Tampere, Finland. pp. 245–260.

Sampson, A.R. (1974) 'A tale of two regressions', Journal of the American Statistical Association, 69: 682–689.

Schwarz, G. (1978) 'Estimating the dimension of a model', Annals of Statistics, 6: 461–464.

Soofi, E.S. and Retzer, J.J. (2002) 'Information indices: unification and applications', Journal of Econometrics, 107: 17–40.

Soofi, E.S., Retzer, J.J., and Yasai-Ardekani, M. (2000) 'A framework for measuring the importance of variables with applications to management research and decision models', Decision Sciences, 31: 595–625.

Stanton, J.M. (2001) 'Galton, Pearson, and the peas: a brief history of linear regression for statistics instructors', Journal of Statistics Education, 9. Online. Available: http://www.amstat.org/publications/jse/v9n3/stanton.html

Stein, C. (1960) 'Multiple regression', in Olkin, I., Ghurye, S.G., Hoeffding, W., Madow, W.G., and Mann, H.B. (eds.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford, CA: Stanford University Press. pp. 424–443.

Stigler, S.M. (1997) 'Regression towards the mean, historically considered', Statistical Methods in Medical Research, 6: 103–114.

Stine, R. (1989) 'An introduction to bootstrap methods', Sociological Methods and Research, 18: 243–291.

Suh, E., Diener, E., Oishi, S., and Triandis, H.C. (1998) 'The shifting basis of life satisfaction judgments across cultures: emotions versus norms', Journal of Personality and Social Psychology, 74: 482–493.

Sue Doe Nihm (Pseudonym) (1976) 'Polynomial law of sensation', American Psychologist, 31: 808–809 (a satire by Michael Birnbaum).

Theil, H. (1987) 'How many bits of information does an independent variable yield in a multiple regression?', Statistics and Probability Letters, 6: 107–108.

Theil, H. and Chung, C-F. (1988) 'Information-theoretic measures of fit for univariate and multivariate linear regressions', The American Statistician, 42: 249–252.

Thomas, D.R., Hughes, E., and Zumbo, B.D. (1998) 'On variable importance in linear regression', Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, 45: 253–275.

Thomas, D.R. and Zumbo, B.D. (1996) 'Using a measure of variable importance to investigate the standardization of discriminant coefficients', Journal of Educational and Behavioral Statistics, 21: 110–130.

Timm, N.H. (2002) Applied Multivariate Analysis. New York: Springer-Verlag.

Tzelgov, J. and Henik, A. (1991) 'Suppression situations in psychological research: definitions, implications and applications', Psychological Bulletin, 109: 524–536.

Tzelgov, J. and Stern, I. (1978) 'Relationships between variables in three-variable linear regression and the concept of suppressor', Educational and Psychological Measurement, 38: 325–335.