
Beyond Alpha: An Empirical Examination of the Effects of Different Sources of Measurement Error on Reliability Estimates for Measures of Individual Differences Constructs

Frank L. Schmidt and Huy Le
University of Iowa

Remus Ilies
University of Florida

On the basis of an empirical study of measures of constructs from the cognitive domain, the personality domain, and the domain of affective traits, the authors of this study examine the implications of transient measurement error for the measurement of frequently studied individual differences variables. The authors clarify relevant reliability concepts as they relate to transient error and present a procedure for estimating the coefficient of equivalence and stability (L. J. Cronbach, 1947), the only classical reliability coefficient that assesses all 3 major sources of measurement error (random response, transient, and specific factor errors). The authors conclude that transient error exists in all 3 trait domains and is especially large in the domain of affective traits. Their findings indicate that the nearly universal use of the coefficient of equivalence (Cronbach’s alpha; L. J. Cronbach, 1951), which fails to assess transient error, leads to overestimates of reliability and undercorrections for biases due to measurement error.

Frank L. Schmidt and Huy Le, Department of Management and Organizations, Henry B. Tippie College of Business, University of Iowa; Remus Ilies, Department of Management, Warrington College of Business Administration, University of Florida.

Correspondence concerning this article should be addressed to Frank L. Schmidt, Department of Management and Organizations, Henry B. Tippie College of Business, University of Iowa, Iowa City, Iowa 52242. E-mail: [email protected]

Psychological Methods, 2003, Vol. 8, No. 2, 206–224. Copyright 2003 by the American Psychological Association, Inc. DOI: 10.1037/1082-989X.8.2.206

Psychological measurement and measurement theory are fundamental tools that make progress in psychological research possible. Estimating relationships between constructs and testing hypotheses are essential to the advancement of psychological theories. However, research cannot estimate relationships between constructs directly. Constructs are measured, and it is relationships between scores on the construct measures that are directly estimated. The process of measurement always contains error that biases estimates of relationships between constructs (there is no such thing as a perfectly reliable measure). Because of measurement error, the relationships between specific measures (observed relationships) underestimate the relationships between the constructs (true score relationships). To estimate true score relationships—the relationships of substantive interest—the observed relationships are often adjusted for the effect of measurement error; for these true score relationship estimates to be accurate, the adjustments need to be precise.

An important class of measurement errors that is often ignored when estimating the reliability of measures is transient error (Becker, 2000; Thorndike, 1951). If transient errors indeed exist, ignoring them in the estimation of the reliability of a specific measure leads to overestimates of reliability. Perhaps the most important consequence of overestimating reliability is the underestimation of the relationships between psychological constructs due to undercorrecting for the bias due to measurement error. This may lead to undesirable consequences, potentially hindering scientific progress (Schmidt, 1999; Schmidt & Hunter, 1999). Transient errors are defined as longitudinal variations in responses to measures that are produced by random variations in respondents’ psychological states across time. Thus, transient errors are randomly distributed across time (occasions). Consequently, the score variance produced by these random variations is not relevant to the construct that is measured and should be partialed out when estimating true score relationships.

Measurement experts have long been aware of transient error. Thorndike (1951), for example, defined transient measurement error over 50 years ago and called for empirical research to calibrate its magnitude. However, to our knowledge, Becker (2000) provided the first empirical examination of the effects of transient error on the reliability of measures of psychological constructs. His study showed that the effect of transient error, while negligible in some measures (0.02% of total score variance for a measure of verbal aggression), can be substantial in others (10.01% of total score variance for a measure of hostility). Measurement error variance of such magnitude can substantially attenuate observed relationships between constructs and lead to erroneous conclusions in substantive research unless it is appropriately corrected for. Becker’s findings indicate the desirability of more extensive examination of the effects of transient error in measures of psychological constructs.

The research presented here investigates the implications of transient error in the measurement of important individual differences variables. In an attempt to cover a significant part of the individual differences area, we examined measures of important constructs from (a) the cognitive domain (cognitive ability), (b) the broad personality domain (the Big Five factors of personality, generalized self-efficacy, and self-esteem), and (c) the domain of affective traits (positive affectivity [PA] and negative affectivity [NA]). Becker’s (2000) study focused on the four subscales of the Buss–Perry Aggression Questionnaire (Buss & Perry, 1992). In this study, we examine 10 constructs more central to research in differential psychology.

To clarify the impact of transient error processes, we first review the classical measurement methods used to correct for biases introduced by measurement error when estimating relationships between constructs, and we develop a method to improve the accuracy of this correction process. We then apply this method to study the impact of transient errors on the measurement of individual differences constructs from the categories noted above, which is the substantive purpose of this study.

In classical measurement theory, the fundamental formula for the observed correlation in the population between two measures (x and y) is

ρ_xy = ρ_xtyt (ρ_xx ρ_yy)^{1/2},  (1)

where ρ_xy is the observed correlation, ρ_xtyt is the correlation between the true scores of (the constructs underlying) measures x and y, and ρ_xx and ρ_yy are the reliabilities of x and y, respectively. This is called the attenuation formula because it shows how measurement error in the x and y measures reduces the observed correlation (ρ_xy) below the true score correlation (ρ_xtyt). Solving this equation for ρ_xtyt yields the disattenuation formula:

ρ_xtyt = ρ_xy / (ρ_xx ρ_yy)^{1/2}.  (2)

With an infinite sample size (i.e., in the population), this formula is perfectly accurate (so long as the appropriate type of reliability estimate is used). In the smaller samples used in research, there are sampling errors in the estimates of ρ_xy, ρ_xx, and ρ_yy, and therefore there is also sampling error in the estimate of ρ_xtyt. Because of this, a circumflex is used in Equation 3 below to indicate that ρ_xtyt is estimated. For the same reason, r_xy, r_xx, and r_yy are used instead of ρ_xy, ρ_xx, and ρ_yy, respectively. The disattenuation formula therefore can be rewritten as follows:

ρ̂_xtyt = r_xy / (r_xx r_yy)^{1/2}.  (3)

Thus, ρ̂_xtyt is the estimated correlation between the construct underlying measure x and the construct underlying measure y; this estimate approximates the population value (ρ_xtyt) when the sample size is reasonably large.

As Equation 3 shows, for the estimate of the true score correlation between the constructs measured by x and y (ρ̂_xtyt in Equation 3) to be unbiased, the estimates of the reliabilities of measures x and y need to be accurate. To accurately estimate the reliabilities of construct measures, one must understand and account for the processes that produce measurement error.
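To make the correction concrete, here is a minimal Python sketch of the disattenuation formula in Equation 3; the function name and the sample values are ours, chosen for illustration.

```python
def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimate the true score correlation (Equation 3):
    r_xy divided by the square root of the product of the reliabilities."""
    return r_xy / (r_xx * r_yy) ** 0.5

# Hypothetical example: observed r = .30, reliabilities .80 and .70.
print(round(disattenuate(0.30, 0.80, 0.70), 3))  # 0.401
```

If the reliabilities fed into the denominator are overestimated (e.g., because transient error is ignored), the correction is too small and the true score correlation is underestimated, which is the central concern of this article.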

Three major error processes are relevant to all psychological measurement: random response error, transient error, and specific factor error. (A fourth error process, disagreement between raters, is relevant only to measurement that uses multiple raters of the constructs that are measured.) Depending on which sources of error variance are taken into account when computing reliability estimates, the magnitude of the reliability estimates can vary substantially, which can lead to erroneous interpretations of observed relationships between psychological constructs (Schmidt, 1999; Schmidt & Hunter, 1999; Thorndike, 1951).

Schmidt and Hunter (1999) observed that measurement error is typically not defined substantively in psychological research. These authors asserted that unless the psychological processes that produce measurement error are described, “one is left with the impression that measurement error springs from hidden and unknown sources and its nature is mysterious” (Schmidt & Hunter, 1999, p. 192). Thus, following Schmidt and Hunter, we first describe the substantive nature of the three main types of error variance sources. Next we examine the bias (in estimating the reliability of measures) associated with selective consideration of these sources. Finally, we propose a method of computing reliability estimates that takes into consideration all main sources of measurement error, and we present an empirical study of the effect of the different types of error variance sources on measures of important individual differences constructs.

It has long been known that many sources of variance influence the observed variance of measures in addition to the constructs they are meant to measure (Cronbach, 1947; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Thorndike, 1951). Variance due to these sources is measurement error variance, and its biasing effects must be removed in the process of estimating the true relationships between constructs. Although the multifaceted nature of measurement error has occasionally been examined in classical measurement theory (e.g., Cronbach, 1947), it is the central focus of generalizability theory (Cronbach et al., 1972). Although there is potentially a large number of measurement error sources (termed facets in generalizability theory; Cronbach et al., 1972), only a few sources significantly contribute to the observed variance of self-report measures (Le & Schmidt, 2001). These sources are random response error, transient error, and specific factor error.

Random Response Error

Random response error is caused by momentary variations in attention, mental efficiency, distractions, and so forth within a given occasion. It is specific to a moment when subjects respond to an item of a measure. As such, a subject may, for example, provide different answers to the same item if it appears in two places in a questionnaire. These variations can be thought of as noise in the human central nervous system (Schmidt & Hunter, 1999; Thorndike, 1951). Random response error is reduced by summing item scores. Thus, other things being equal, the larger the number of items in a specific scale, the greater the extent to which random response error is reduced.

Transient Error

Whereas random response error occurs across items (i.e., discrete time moments) within the same occasion, transient error occurs across occasions. Transient errors are produced by longitudinal variations in respondents’ mood, feelings, or the efficiency of the information processing mechanisms used to answer questionnaires (Becker, 2000; Cronbach, 1947; DeShon, 1998; Schmidt & Hunter, 1999). Thus, these occasion-characteristic variations influence the responses to all items that are answered on that specific occasion. For example, respondents’ affective state (mood) will influence their responses to various satisfaction questionnaires, in that respondents in a more pleasurable affective state on that occasion will be more likely to rate their satisfaction with various aspects of their lives or jobs higher. If those responses are used to examine cross-sectional relationships between satisfaction constructs and other constructs, the variations in the satisfaction scores across occasions should be treated as measurement error (i.e., transient errors). Under the conceptualization of generalizability theory (Cronbach et al., 1972), transient error is the interaction between persons (measurement objects) and occasions.

Transient error can be defined only when the construct to be measured is assumed to be substantively stable across the time period over which transient error is assessed. Individual differences constructs such as intelligence or various personality traits can safely be assumed to be substantively stable across at least short periods of time, their score variation across occasions being caused by transient error. For constructs that exhibit systematic variations even across short time periods, such as mood, changes in the magnitude of the measurement cannot be considered the result of transient errors, because those changes have substantive (theoretically significant) causes that can and should be studied and understood (Watson, 2000).

If transient error indeed has a nonnegligible impact on the measurement of individual differences (i.e., it reduces the reliability of such measures), this impact needs to be estimated, and relationships among constructs need to be corrected for the downwardly biasing effect of transient error on these relationships. That is, observed relationships should be corrected for unreliability using reliability estimates that take into account transient error. To assess the impact of transient error on the measurement of psychological constructs, one estimates the amount of transient error in the scores of a measure by correlating responses across occasions. As noted, much too often transient error is ignored when estimating the reliability of measures (Becker, 2000).


Specific Factor Error

This type of measurement error arises from the subjects’ idiosyncratic responses to some aspect of the measurement situation (Schmidt & Hunter, 1999). At the item level, specific factor errors are produced by respondent-specific interpretations of the wording of questionnaire items. In that sense, specific factor errors correspond to the interaction between respondents and items. Specific factors are not part of the constructs being measured because they do not correlate with scores on other items measuring that construct (i.e., they are item specific); thus they are measurement errors. Specific factor errors tend to cancel each other out across different items, but for a specific item they replicate across occasions. At the scale level, different scales that were created to measure the same construct may also contain specific factor errors. Scale-specific factor error can be controlled only by using multiple scales to measure the same construct, thus defining the construct being measured by what is measured and shared by all scales (Le & Schmidt, 2001; Schmidt & Hunter, 1999). That is, the construct is defined as the factor that the scales have in common.

Calibrating Measurement Error Processes With Reliability Coefficients

By definition, reliability is the ratio of true score variance to observed score variance. The basic conceptual difference between methods of computing reliability lies in the ways in which they define and estimate true score variance. Estimates of true score variance based on different methods may include variance components due to measurement errors in addition to those due to the construct of interest. To put it another way, different methods implicitly assess the extent to which different error processes (i.e., random response, transient, and specific factor) produce variation in test scores that is not relevant to the underlying trait to be measured (i.e., measurement error). The magnitude of an error process is assessed by a reliability estimate only when that error process is allowed to reduce the size of the reliability estimate; the magnitude of the reduction is a measure of the impact of the error process that produced it. It follows that some methods of computing reliability are more appropriate than others because they better model the various error processes. These issues have in fact long been known and discussed by measurement researchers (e.g., Cronbach, 1947; Thorndike, 1951). Nevertheless, we reexamine the topic here to clearly define the problem and then offer a solution. Specifically, we discuss below how popular estimates of reliability assess different sources of measurement error.

Internal Consistency Reliability

This method of computing reliability is probably the one most frequently used in psychological research, and computations of internal consistency are included in all standard statistical analysis programs. Internal consistency reliability assesses specific factor error and random response error. Specific factor errors and random response errors are independently distributed across items and tend to cancel each other out when item scores are summated.

The most popular index of the internal consistency of a measure is coefficient alpha (Cronbach, 1951). Cronbach (1951) showed that coefficient alpha is the average of all the split-half reliabilities. A split-half reliability is the correlation between two halves of a scale, adjusted to reflect the reliability of the full-length scale. As such, coefficient alpha is conceptually equivalent to the correlation between two parallel forms of a scale administered on the same occasion (Cronbach, 1951). Parallel forms of a scale are different sets of items having identical loadings on the construct to be measured but different specific factors (further definition and discussion of parallel forms is provided in a subsequent section). When parallel forms of a scale administered on the same occasion are correlated, measurement error variance due to specific factor error and random response error reduces the correlation between the parallel forms. Thus, in practice, internal consistency reliability can be estimated in three different ways: (a) by the correlation between parallel forms of a scale; (b) by splitting a scale into halves, computing the correlation between the resulting subscales, and correcting this estimate for the reduced number of items with the Spearman–Brown prophecy formula; and (c) by using formulas specific to the measurement situation, such as those for Cronbach’s alpha (when items are scored on a continuum; Cronbach, 1951) or the Kuder–Richardson 20 (when items are scored dichotomously). Because the most direct method for computing internal consistency is to correlate parallel forms of the scale administered on one occasion, this class of reliability coefficient estimates is referred to as the coefficient of equivalence (CE; Anastasi, 1988; Cronbach, 1947). Feldt and Brennan (1989) provided a formula illustrating the variance components included as true score variance by the CE (Equation 102, p. 135). The CE assesses the magnitude of measurement error produced by specific factor and random response error but not by transient error processes.
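As an illustration of routes (b) and (c), the following Python sketch (numpy only; the data-generating step is a hypothetical example, not the study’s data) computes coefficient alpha and an odd–even split-half estimate stepped up with the Spearman–Brown formula.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_subjects x n_items) score matrix."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def split_half_spearman_brown(items: np.ndarray) -> float:
    """Correlate odd- and even-item half scores, then step the correlation
    up to full scale length with the Spearman-Brown prophecy formula."""
    half1 = items[:, 0::2].sum(axis=1)
    half2 = items[:, 1::2].sum(axis=1)
    r = np.corrcoef(half1, half2)[0, 1]
    return 2 * r / (1 + r)

# Hypothetical data: 167 subjects, 8 items loading on one construct.
rng = np.random.default_rng(0)
true_scores = rng.normal(size=(167, 1))
items = true_scores + rng.normal(size=(167, 8))  # add item-level error
print(cronbach_alpha(items), split_half_spearman_brown(items))
```

Both quantities are CEs: scores from a single occasion cannot register transient error, which is why coefficient alpha is unaffected by it.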

Test–Retest Reliability

Test–retest reliability, as its name implies, is computed by correlating the scores on the same form of a measure across two different occasions. The test–retest reliability coefficient is alternatively referred to as the coefficient of stability (CS; Cronbach, 1947) because it estimates the extent to which scores on a specific measure are stable across time (occasions). The CS assesses the magnitude of transient error and random response error, but it does not assess specific factor error, as shown by Feldt and Brennan (1989; Equation 100, p. 135).

Coefficient of Equivalence and Stability

The coefficient of equivalence assesses specific factor error and random response error but not transient error, whereas the CS assesses transient and random response error but not specific factor error. The only type of reliability coefficient that estimates the magnitude of the measurement error produced by all three error processes is the coefficient of equivalence and stability (CES; Cronbach, 1947). The CES is computed by correlating two parallel forms of a measure, each administered on a different occasion. The use of parallel forms administered on one occasion assesses only specific factor error and random response error, and the administration of the same measure across two different occasions assesses only transient error and random response error. Hence, the CES is the ideal reliability estimate because its magnitude is appropriately reduced by all three sources of measurement error. Supporting this argument, Feldt and Brennan (1989) provided a formula for the CES showing that its estimated true score variance includes only variance due to the construct of interest (Equation 101, p. 135). The other types of reliability coefficients selectively take into account only some sources of measurement error, and thus they overestimate the reliability of the scale.

Computing the Coefficient of Equivalence and Stability

As already noted, to compute the CES of a measure for a specific sample, two parallel forms of the measure, each administered on a different occasion, are required. The correlation between the scores on these parallel forms is reduced by all three main sources of measurement error, and it equals the CES. However, in many cases, parallel forms of the same measure may not be available. For such cases, we present here an alternative method for estimating the CES of a scale from the CES of its half scales, which are obtained either by (a) administering the scale on two distinct occasions and then splitting it post hoc to form parallel half scales or (b) splitting the scale into parallel halves and then administering each half on a separate occasion. The former approach (i.e., post hoc division) provides multiple estimates of the CE and CES coefficients of the half scales, which can be averaged to reduce the effect of sampling error, thereby yielding more accurate estimates of the coefficients of interest. However, administering the same scale on two occasions separated by a short interval can result in measurement reactivity (i.e., subjects’ memory of item content could affect their responses on the second occasion; Stanley, 1971; Thorndike, 1951). This problem is especially likely for measures of cognitive constructs. In such a situation, it may be advisable to adopt the latter approach: The scale should be split into halves first, with each half administered on only one occasion.

After estimating the half-scale CES, one needs to compute the CES for the full scale. Using the Spearman–Brown prophecy formula to adjust for the increased (doubled) number of items in the full scale appears to be the most direct approach. However, Feldt and Brennan (1989) cautioned that the Spearman–Brown formula does not apply when there are multiple sources of measurement error involved (Feldt & Brennan, 1989, p. 134), which is the case here. We demonstrate below that use of the Spearman–Brown formula is not accurate in this case.

Our derivations assume that all the statistics are obtained from a very large sample, so there is no sampling error; in other words, they are population parameters. We further assume that there exist parallel forms for the scale in question and that these forms can be further split into parallel subscales. Parallel forms are defined under classical measurement theory as scales that (a) have the same true score and (b) have the same measurement error variance (Traub, 1994). It can readily be inferred from this definition that parallel forms have the same standard deviation and reliability coefficient (ratio of true score variance to observed score variance). Parallel forms thus defined are strictly parallel, as contrasted with randomly parallel forms, which are scales having the same number of items randomly sampled from a hypothetical infinite item domain of the construct in question (Nunnally & Bernstein, 1994; Stanley, 1971). Randomly parallel forms may have different true score and observed score variances due to sampling error resulting from the sampling of items.

Let A and B be two parallel forms of a scale that were both administered at two times: Time 1 and Time 2. 1A symbolizes Form A administered at Time 1; 2A symbolizes Form A administered at Time 2; 1B symbolizes Form B administered at Time 1; and 2B symbolizes Form B administered at Time 2. With this notation, the CES can be written as

CES = ρ(1A,2B) = ρ(2A,1B).  (4)

If we split A to create two parallel half scales, a1 and a2, and split B to create two parallel half scales, b1 and b2, we then have four parallel half scales, a1, a2, b1, and b2, and

A = a1 + a2,  (5)

B = b1 + b2.  (6)

From Equations 4, 5, and 6, with the prefixes 1 and 2 denoting the time when a half scale is administered, we have

CES = ρ(1a1 + 1a2, 2b1 + 2b2) = ρ(2a1 + 2a2, 1b1 + 1b2).  (7)

Because a1, a2, b1, and b2 are parallel scales, their standard deviations are the same. Further, the standard deviation should be invariant across Times 1 and 2. Symbolizing this value as σ, we have

σ(1a1) = σ(1a2) = σ(1b1) = σ(1b2) = σ(2a1) = σ(2a2) = σ(2b1) = σ(2b2) = σ.  (8)

Further, let the CE of the half scales be symbolized as ce; this coefficient can be written as

ce = ρ(1a1,1a2) = ρ(2a1,2a2) = ρ(1b1,1b2) = ρ(2b1,2b2).  (9)

Let the CES of the half scales be symbolized as ces; by definition, this coefficient is the correlation between any two different half scales across times:

ces = ρ(1a1,2a2) = ρ(1a1,2b1) = ρ(1a1,2b2)
    = ρ(1a2,2a1) = ρ(1a2,2b1) = ρ(1a2,2b2)
    = ρ(1b1,2a1) = ρ(1b1,2a2) = ρ(1b1,2b2)
    = ρ(1b2,2a1) = ρ(1b2,2a2) = ρ(1b2,2b1).  (10)

From Equation 7 above and the definition of the correlation coefficient, we have

CES = ρ(1a1 + 1a2, 2b1 + 2b2)
    = Cov(1a1 + 1a2, 2b1 + 2b2) / {[Var(1a1 + 1a2)]^{1/2} [Var(2b1 + 2b2)]^{1/2}}.  (11)

Using Equations 8 and 10 above, we can rewrite the numerator of the right side of Equation 11 as follows:

Cov(1a1,2b1) + Cov(1a1,2b2) + Cov(1a2,2b1) + Cov(1a2,2b2)
    = ρ(1a1,2b1)σ(1a1)σ(2b1) + ρ(1a1,2b2)σ(1a1)σ(2b2)
    + ρ(1a2,2b1)σ(1a2)σ(2b1) + ρ(1a2,2b2)σ(1a2)σ(2b2)
    = 4σ²ces.  (12)

Using Equations 8 and 9 above, we can rewrite the denominator of the right side of Equation 11 as

[σ²(1a1) + σ²(1a2) + 2ρ(1a1,1a2)σ(1a1)σ(1a2)]^{1/2} × [σ²(2b1) + σ²(2b2) + 2ρ(2b1,2b2)σ(2b1)σ(2b2)]^{1/2}
    = 2σ²(1 + ce).  (13)

Finally, from Equations 12 and 13, we have

CES = 2ces / (1 + ce).  (14)

Equation 14 gives a simple formula for computing the CES of a full scale that was split into halves from the CES of the half scales. It can be inferred from the above derivations that this equation yields the exact value of the CES when (a) all statistics are obtained from a very large sample and (b) strictly parallel half scales can be formed by appropriately splitting the full scale. To the extent that those assumptions do not hold in practice (i.e., there is a limited sample size and there are departures from strict parallelism of the half scales), the estimate of the CES is affected by sampling error in both subjects and items (Feldt, 1965; Lord, 1955). Nevertheless, the formula presented here provides a convenient way to obtain an unbiased and asymptotically accurate estimate of the CES for any measure with real samples (i.e., limited sample sizes) and in the absence of perfectly parallel forms. When half scales are formed by post hoc splitting of a full scale as described above, sampling error can be reduced by averaging statistics of subscales to obtain more accurate estimates of population values. If the scales are split into random halves (instead of strictly parallel halves), the resulting reliability coefficient estimates the random CES (Nunnally & Bernstein, 1994).
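In code, the computation is brief. The following Python sketch (function and argument names are ours) estimates the full-scale CES from four half-scale score vectors via Equations 9, 10, and 14, for the post hoc splitting design in which both halves are administered on both occasions.

```python
import numpy as np

def ces_full_scale(h1_t1, h2_t1, h1_t2, h2_t2) -> float:
    """Full-scale CES via Equation 14.
    h1_t1, h2_t1: half-scale score vectors at Time 1;
    h1_t2, h2_t2: the same half scales administered at Time 2."""
    r = lambda a, b: np.corrcoef(a, b)[0, 1]
    # ce: average same-occasion correlation between the halves (Equation 9).
    ce = (r(h1_t1, h2_t1) + r(h1_t2, h2_t2)) / 2
    # ces: average cross-occasion correlation between different halves (Equation 10).
    ces = (r(h1_t1, h2_t2) + r(h2_t1, h1_t2)) / 2
    return 2 * ces / (1 + ce)
```

Averaging the two available estimates of each coefficient follows the recommendation above for reducing sampling error under post hoc splitting.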

We note here that the use of the Spearman–Brown prophecy formula is not appropriate for adjusting the CES for the length of the full scale (though it would be appropriate for adjusting the CE). If the Spearman–Brown formula were erroneously applied here to estimate the CES from ces, the erroneous estimate of the adjusted CES would be

CES = 2ces / (1 + ces).  (15)

Comparing Equations 14 and 15 reveals that adjusting the CES with the Spearman–Brown formula results in overestimation of the full-scale CES (because ces is always smaller than ce).
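A worked example with hypothetical values makes the size of the bias concrete: if ce = .80 and ces = .70, Equation 14 gives CES = 2(.70)/(1 + .80) = .78, whereas the erroneous Spearman–Brown adjustment in Equation 15 gives 2(.70)/(1 + .70) = .82, overstating the full-scale CES by about .05.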

Extending the Formula for Estimating the CES

Equation 14 provides a formula for computing the CES of a full scale from the CE and CES of its half scales. To derive the equation, we assumed that (a) a scale can be split into parallel half scales and (b) the psychometric properties of the half scales (standard deviations and ce) remain unchanged across occasions. Arguably, these assumptions are not strong and are likely to be met in practice. Nevertheless, as an anonymous reviewer pointed out, the assumptions can be further relaxed. When measurement reactivity is not a problem and thus the same scale can be administered at both times (i.e., when the post hoc splitting approach mentioned earlier is adopted), the following equation can be used to estimate the CES of the scale:¹

CES = 2[Cov(1a1,2a2) + Cov(2a1,1a2)] / {[Var(1A)]^{1/2} [Var(2A)]^{1/2}},  (16)

where Cov(1a1,2a2) is the covariance between the half scale a1 administered at Time 1 (1a1) and the half scale a2 administered at Time 2 (2a2); Cov(2a1,1a2) is the covariance between the half scale a1 administered at Time 2 (2a1) and the half scale a2 administered at Time 1 (1a2); Var(1A) is the variance of the full scale A administered at Time 1; and Var(2A) is the variance of the full scale A administered at Time 2.

Equation 16 assumes only that scale A can be divided into two parallel forms, a1 and a2. In practice, however, there are scales with an odd number of items; hence it is not possible to divide them into parallel half scales. The assumption underlying Equation 16 can then be further relaxed to cover this situation. A scale can now be simply divided into two subscales, and then the following equation can be used to estimate the CES of the full scale from information on the subscales:

CES = [Cov(1a1,2a2) + Cov(2a1,1a2)] / {2p1p2[Var(1A)]^{1/2} [Var(2A)]^{1/2}},  (17a)

where Cov(1a1,2a2) is the covariance between the subscale a1 administered at Time 1 (1a1) and the subscale a2 administered at Time 2 (2a2); Cov(2a1,1a2) is the covariance between the subscale a1 administered at Time 2 (2a1) and the subscale a2 administered at Time 1 (1a2); Var(1A) is the variance of the full scale A administered at Time 1; Var(2A) is the variance of the full scale A administered at Time 2; p1 is the ratio of the number of items in subscale a1 to that in the full scale; and p2 is the ratio of the number of items in subscale a2 to that in the full scale.

Equations 16 and 17a can be applied only when the same scale is used at both times (the post hoc splitting approach). In situations where it is impossible to apply this approach because of time constraints or concern about measurement reactivity, a modification can be introduced into Equation 17a to enable estimating the CES for scales with an odd number of items:

CES = Cov(1a1,2a2) / {p1p2[Vâr(1A)]^{1/2} [Vâr(2A)]^{1/2}},  (17b)

where Cov(1a1,2a2) is the covariance between the subscale a1 administered at Time 1 (1a1) and the subscale a2 administered at Time 2 (2a2); p1 and p2 are the proportions of total items in the subscales a1 and a2; and Vâr(1A) and Vâr(2A) are the estimated variances of the full scale A.

As seen from the above, we need to estimate the variances of the full scale A when it is administered at Times 1 and 2 (these are not directly available because only one subscale is administered at each time in this situation). These values can be obtained from the following equations (Appendix A provides derivations of these equations):

Var(1A) = Var(1a1)[p1 − (p1 − 1)ce1] / p1²,  (18a)

Var(2A) = Var(2a2)[p2 − (p2 − 1)ce2] / p2²,  (18b)

where ce1 and ce2 are the coefficients of equivalence of subscale a1 and subscale a2, respectively.

¹ We are grateful to an anonymous reviewer for suggesting the way to relax the assumptions underlying Equation 14. The reviewer also provided Equations 16 and 17 and their derivations. Derivation details are available from Frank L. Schmidt upon request.

From Equation 16 to Equation 17b above, we have gradually relaxed the assumptions required for estimating the CES. This is achieved at the cost of simplicity: Those equations are more complicated than Equation 14. Using Equations 16 and 17a also requires an additional assumption, that is, that measurement reactivity is not a problem. Equation 17b further relaxes this assumption by allowing the CES to be estimated from subscales administered at two different times. All these equations, especially Equation 17b, can be used in most practical situations where the assumptions underlying Equation 14 are not likely to hold.²

² From our available data (not from this study), all these equations yield very similar results, even when the assumptions for Equation 14 are violated (i.e., when standard deviations of a scale are different across testing occasions). In other words, Equation 14 appears to be quite robust to violations of its assumptions.
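The most general case, Equation 17b with the variance estimates from Equations 18a and 18b, is also straightforward to implement. A Python sketch follows (numpy only; argument names are ours).

```python
import numpy as np

def ces_two_subscales(a1_t1, a2_t2, ce1, ce2, p1, p2) -> float:
    """CES via Equation 17b when only subscale a1 is given at Time 1
    and only subscale a2 at Time 2.
    ce1, ce2: coefficients of equivalence (alpha) of the two subscales;
    p1, p2: proportions of the full scale's items in each subscale."""
    cov = np.cov(a1_t1, a2_t2)[0, 1]
    # Estimated full-scale variances at each time (Equations 18a and 18b).
    var_1A = np.var(a1_t1, ddof=1) * (p1 - (p1 - 1) * ce1) / p1 ** 2
    var_2A = np.var(a2_t2, ddof=1) * (p2 - (p2 - 1) * ce2) / p2 ** 2
    return cov / (p1 * p2 * np.sqrt(var_1A * var_2A))
```

For strictly parallel halves (p1 = p2 = .5 and equal subscale variances), this expression reduces algebraically to the same value as Equation 14.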

The Impact of Transient Error on the Measurement of Individual Differences Constructs

As shown above, the only type of reliability coefficient that takes into consideration all the main sources of measurement error is the CES. The most widely used reliability coefficient, Cronbach’s alpha, does not take into account transient error. Thus, Cronbach’s alpha (i.e., the CE) is a potential overestimate of the reliability of a measure (if transient error processes indeed affect the measured scores). By computing the CES with the method detailed in the preceding section, we can estimate the extent to which the CE overestimates reliability: Subtracting the CES from the CE yields the proportion of observed variance produced by transient error processes (Feldt & Brennan, 1989; Le & Schmidt, 2001). This approach was previously adopted by Becker (2000) to estimate the effect of transient error.
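For example, using values reported later in Table 2, the three positive affectivity measures have an average CE of .86 and an average CES of .73, so transient error is estimated to account for .86 − .73 = .13 of their observed score variance.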

For constructs that exhibit trait-like characteristics, the processes that produce temporal variations in the measurements of a specific construct should be treated as measurement error processes, but for state-like constructs, temporal variations of the measurements cannot be treated as measurement error. Extending the trait–state dichotomy along a continuum, it follows that, except for constructs that can be placed at the ends of this continuum (such as intelligence, which can be placed at the trait end, or emotions, which are at the state end), the trait versus state distinction depends—at least for the purpose of defining transient error—on the specific (theoretical) definition of the construct that is measured. For example, the Positive and Negative Affect Schedule (PANAS; Watson, Clark, & Tellegen, 1988) scales can be used to assess both temporary mood-state (affect) and stable-trait (affectivity) constructs,³ the difference being only in the instructions given to the respondents (momentary instructions vs. general or long-term instructions). Furthermore, empirical evidence shows that average levels of daily state affect are substantially correlated with trait affectivity measures (r = .64 and r = .53 for Positive Affect [PA] and Negative Affect [NA], respectively; Watson & Clark, 1994); thus, scores from momentary affect ratings (i.e., mood states) can be considered indicators of trait-level affectivity if such measures are averaged across multiple occasions. The larger the number of occasions that are averaged over, the more closely the average approaches a trait measure.

³ In the literature, affective states (moods) and affective traits (personality) are referred to as affect and affectivity, respectively. We adopt this terminology in this article.

A particularly illustrative example of how the same measures can be used to construct scores whose position on the trait–state continuum varies from state to trait was presented by Watson (2000). He presented test–retest correlations for scores that were constructed by aggregating daily mood measures across a number of days increasing from 1 to 14. The test–retest correlation for the composite scores increased monotonically with the number of days entering the composite, from .44 to .85 for Positive Affect and from .37 to .77 for Negative Affect (Watson, 2000, p. 147, Table 5.1). That is, the more closely the measure approximated a trait measure, the more stable it became.

Transient measurement error processes were expected to affect the measurement of all constructs included in this study, but the impact on the measurement of the different traits may not be equal. Conley (1984) estimated the “annual stabilities” (i.e., average 1-year test–retest correlations between measures, corrected for measurement error using coefficient alpha) of measures of intelligence, personality, and self-opinion (i.e., self-esteem) to be .99, .98, and .94, respectively. Those findings may actually reflect differences in (a) the stabilities of the constructs (as argued by Conley), (b) the susceptibilities of their measures to transient error, or (c) both. Schmidt and Hunter (1996, 1999), on the basis of initial empirical evidence, hypothesized that transient error is lower in the cognitive domain than in the noncognitive domain. In accordance with Schmidt and Hunter’s (1996, 1999) hypothesis, we expected transient error to be smallest in the cognitive domain. Within the personality domain, we expected transient error to influence affectivity traits to a larger extent than broad personality traits. This expectation is suggested by the relatively lower test–retest correlations often observed for measures of affectivity traits when compared with those of other dispositional measures (Judge & Bretz, 1993). Even when compared with the Big Five traits to which they are frequently related (i.e., Extraversion and Neuroticism, respectively; e.g., Brief, 1998; Watson, 2000), positive affectivity and negative affectivity display relatively smaller stability coefficients over moderate time periods (2.5–3 years; Vaidya, Gray, Haig, & Watson, 2001). Vaidya et al. (2001) found the average test–retest correlation for positive affectivity and negative affectivity to be .54, whereas the average test–retest correlation for Extraversion and Neuroticism was .63. However, as mentioned above, with such a relatively long interval, the lower test–retest correlations could be due to the lower stability of the constructs, or larger transient error, or both. It is not clear from the current research evidence which source (instability of the constructs or transient error) is responsible for the low test–retest correlations often observed. Our study attempts to disentangle the effects of these sources by examining only the effect of transient error. We accomplished this by using a short test–retest interval (one week), during which we can be fairly certain that the constructs in question are stable.

From a theoretical perspective, people may not be able to entirely separate self-assessments of trait affectivity from current affective states (Schmidt, 1999). That is, mood states may, to some extent, influence self-ratings of trait affectivity. This influence will translate into increased levels of transient error in the measurement of trait-level affectivity. We expected the impact of mood states on self-reports of trait affectivity to be largest for the PANAS measures because these scales use mood descriptors to assess affectivity; consequently, the amount of transient error should be largest for the PANAS among the measures examined in the study.

Method

Procedure

To investigate these hypotheses, we conducted a study at a midwestern university. Two hundred thirty-five students enrolled in an introductory management course participated in the study. Participation was voluntary and was rewarded with course credits. A total of 123 women (52%) and 112 men (48%) were in the initial sample. The average age of the participants was 21.5 years. All participants were requested to complete two laboratory sessions separated by an interval of approximately 1 week. A total of 18 sessions, each having about 20 participants, was arranged. The sessions spanned 3 weeks. Measures were organized into two distinct questionnaires. Each questionnaire included one of a pair of parallel forms of the measures for each construct studied (described below) in this project. In each session, participants were given one questionnaire to answer. Each participant received a different questionnaire in each of his or her two sessions. Sixty-eight of the initial participants failed to attend the second session, so the final sample consisted of 167 participants (83 men and 84 women; average age = 21.3 years). There was no difference between the participants who dropped out and those in the final sample in terms of age and gender. Because it is possible that the dropout participants differed from those in the final sample in some other characteristic that could influence the results (e.g., those who dropped out might have been lower in conscientiousness, which could have resulted in some range restriction in this variable in the final sample), we carried out further analyses to examine the potential bias due to sample attrition. Mean scores of the participants who did not complete the study were compared with those in the final sample on all the available measures. No statistically significant differences were found.

Measures

As mentioned above, we measured constructs from the cognitive domain, the broad personality domain, and the affective traits domain. More specifically, we measured general mental ability (GMA), the Big Five personality traits, self-esteem, self-efficacy, positive affectivity, and negative affectivity. Table 1 presents the measures used in the study. As shown therein, we used the Wonderlic Personnel Test (Wonderlic Personnel Test, Inc., 1998) to measure GMA. This test is popular among employers as a valid personnel selection tool. It has also been used in research to operationalize the GMA construct (e.g., Barrick, Stewart, Neubert, & Mount, 1998; Hansen, 1989; Mackaman, 1982). Available empirical evidence shows that the test is highly correlated with many other popular tests of GMA, such as the Wechsler Adult Intelligence Scale–Revised (Wechsler, 1981; .75–.96), the Otis–Lennon (Otis & Lennon, 1979; .87–.99), and the General Aptitude Test Battery (U.S. Department of Labor, 1970; .74; Wonderlic Personnel Test, 1998).

For the Big Five personality constructs, we used two inventories: the Personal Characteristics Inventory (PCI; Barrick & Mount, 1995) and the International Personality Item Pool Big Five scale (IPIP; Goldberg, 1997). The respective manuals of these scales show that they are highly correlated with other measures of the Big Five personality constructs popularly used in research and practice, that is, the Neuroticism–Extraversion–Openness Personality Inventory—Revised (NEO-PI–R; Costa & McCrae, 1985), the Hogan Personality Inventory (Hogan, 1986), and Goldberg’s markers (Goldberg, 1992). There have also been empirical studies using these scales in the literature (e.g., Barrick, Mount, & Strauss, 1994; Barrick et al., 1998; Heaven & Bucci, 2001).

Two scales were used for the self-esteem construct: the Rosenberg Self-Esteem Scale (Rosenberg, 1965) and the Texas Social Behavior Inventory (Helmreich & Stapp, 1974). According to a recent review (Blascovich & Tomaka, 1991), these scales are among the most popularly used measures of self-esteem in the literature. Because generalized self-efficacy is a relatively new construct, there are not many scales available. We chose to use the scale developed by Sherer et al. (1982), which appears to be one of the most widely used measures of this construct.

The PANAS (Watson et al., 1988) is probably the most popular measure of affective traits (Price, 1997), so we used it in our study. To obtain additional estimates of transient error for the affective constructs, we included two other measures of trait affectivity: Diener and Emmons’s (1984) Affect-Adjective Scale and the Multidimensional Personality Index (Watson & Tellegen, 1985).

As shown in Table 1, only two of the measures used have parallel forms available: the Wonderlic Personnel Test (Wonderlic Personnel Test, 1998) for the construct of GMA and the Texas Social Behavior Inventory (Helmreich & Stapp, 1974) for the construct of self-esteem. In the case of the other measures, we split them into halves to form parallel half scales. Efforts were made to create half scales that were as strictly parallel as possible (cf. Becker, 2000; Clause, Mullins, Nee, Pulakos, & Schmitt, 1998; Cronbach, 1943): Items in the measures were allocated to the halves on the basis of (a) item content, (b) their loadings on the respective factors (from previous empirical studies providing this information), and (c) other relevant statistics (i.e., standard deviations, intercorrelations with other items) when available. For each measure of this type, a different half scale was included in each questionnaire to be administered on the corresponding occasion. Because of time constraints and concerns about measurement reactivity, we did not administer the full scales on both occasions and apply post hoc division of the scales, as discussed in the earlier section.

Analyses

Two types of reliability coefficients were computed: The CE was computed with Cronbach’s (1951) formula, and the CES was computed using the method detailed earlier. Specifically, for the measures with parallel forms (the Wonderlic Personnel Test and the Texas Social Behavior Inventory), the CE value was estimated by averaging the coefficient alphas of the parallel forms; the CES values were the correlations between the parallel forms administered on different occasions. For the measures without parallel forms, their half-scale coefficients of equivalence (ce) were estimated by averaging the coefficient alphas of the parallel half scales of the same measures; their half-scale coefficients of equivalence and stability (ces) were the correlations between the parallel half scales of the same measures administered on two different occasions. The Spearman–Brown formula was then used to estimate the CE of the full scales from those of the half scales (ce). The CES of the full scales was estimated from those of the half scales (ces) by use of Equation 14, derived in the previous section.

Table 1
Measures Used in the Study

Construct                              Measure
--------------------------------------------------------------
Cognitive
  General mental ability               Wonderlic
Broad personality
  Big Five (Conscientiousness,         PCI^a
    Extraversion, Agreeableness,       IPIP Big Five^a
    Neuroticism, and Openness)
  Self-esteem                          TSBI, Forms A and B
                                       Rosenberg^a
  Generalized self-efficacy            GSE Scale^a
Affective traits
  Positive and negative affectivity    PANAS^a
                                       DE scale^a
                                       MPI^a

Note. Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Item Pool (Goldberg, 1997); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (Rosenberg, 1965); GSE = Sherer’s Generalized Self-Efficacy Scale (Sherer et al., 1982); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); DE = Diener and Emmons’s (1984) Affect-Adjective Scale; MPI = Multidimensional Personality Index (Watson & Tellegen, 1985).
^a These measures have no parallel forms. We used the method described in the text to calculate coefficient of equivalence and stability reliability values.

Next, we estimated the ratio of transient error variance to observed score variance for the measures of each construct by subtracting the CES from the CE; this ratio or proportion is hereafter referred to as TEV. When available, the TEVs for measures of the same construct were averaged to obtain the best estimate of the effect of transient error on measures of that construct. Finally, we examined the effect of transient error on measures of different constructs by comparing the proportions of TEV across measures and constructs.

As discussed earlier, when the sample size is limited and the half scales are likely to be randomly parallel instead of strictly parallel, the values of the CES estimated by Equation 14 could be affected by sampling error in both subjects and items. It is therefore important to have information about the distribution of the CES estimates, as well as the distributions of the TEV and CE estimates. Unfortunately, the nonlinearity of Equation 14 and the complexity of the sampling distributions of its components (e.g., ce is estimated by averaging two coefficient alphas of the half scales) render analytical derivation of a formula for estimating the standard error of the CES difficult. Here the Monte Carlo simulation technique provides a tool for empirically examining the distributions of interest (cf. Mooney, 1997). We simulated data based on the results (i.e., estimates of CE, CES, and TEV) obtained from the previous analysis. One thousand data sets were generated for each measure included in the study under the same conditions (see Appendix B for further details of the simulation procedure). For each data set, the same analysis procedure described in the previous section was carried out to estimate the CE, CES, and TEV. The standard deviations of the distributions of these estimates provide the standard errors of interest.
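The study’s full simulation procedure is given in its Appendix B (not reproduced here); the following is only a minimal Python sketch of the general approach, under our own simplified data-generating assumptions (normal true scores plus occasion-level transient error, half-scale-specific factor error, and random response error, with variances chosen for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)

def ces_standard_error(n=167, n_reps=1000,
                       v_true=.70, v_trans=.10, v_spec=.10, v_rand=.10):
    """Empirical sampling distribution of the Equation 14 CES estimate."""
    estimates = []
    for _ in range(n_reps):
        t = rng.normal(0, np.sqrt(v_true), n)      # stable true scores
        s1 = rng.normal(0, np.sqrt(v_spec), n)     # specific factor, half 1
        s2 = rng.normal(0, np.sqrt(v_spec), n)     # specific factor, half 2
        occ1 = rng.normal(0, np.sqrt(v_trans), n)  # transient error, Time 1
        occ2 = rng.normal(0, np.sqrt(v_trans), n)  # transient error, Time 2
        admin = lambda occ, s: t + occ + s + rng.normal(0, np.sqrt(v_rand), n)
        h1_t1, h2_t1 = admin(occ1, s1), admin(occ1, s2)
        h1_t2, h2_t2 = admin(occ2, s1), admin(occ2, s2)
        r = lambda a, b: np.corrcoef(a, b)[0, 1]
        ce = (r(h1_t1, h2_t1) + r(h1_t2, h2_t2)) / 2   # Equation 9
        ces = (r(h1_t1, h2_t2) + r(h2_t1, h1_t2)) / 2  # Equation 10
        estimates.append(2 * ces / (1 + ce))           # Equation 14
    return np.mean(estimates), np.std(estimates, ddof=1)

print(ces_standard_error())  # (mean CES estimate, empirical standard error)
```

The standard deviation of the 1,000 estimates serves as the empirical standard error, paralleling how the standard errors described above were obtained.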

Results

Table 2 shows the results of the study. The standard deviations given are for full-length scales. Where the observed data were for half-length scales, the standard deviations were computed using the methods presented in Appendix A. The next two columns present the CE and CES values, and the following column gives the estimate of the TEV. The last column indicates the percentage by which the CE (coefficient alpha) overestimates the actual reliability. Overall, the estimates of the CES are smaller than the estimates of the CE (coefficient alpha), indicating that transient error exists in scores obtained with the measures used in this study. For GMA, the CE overestimated the reliability of the measure by 6.70%; for the Big Five personality measures, the extent of the overestimation varied between 0 and 14.90% across the ten measures (two measures of each of the Big Five constructs). Of the measures of the Big Five traits, the Neuroticism measures contained the largest amount of transient error (TEV = .09 across the two measures), which is consistent with the affective nature of this trait (Watson, 2000). The estimated value of the TEV is negative for the measures of Openness to Experience and for the PCI measure of Extraversion. Because a proportion of variance cannot be negative by definition, negative values must be attributed to sampling error, so we considered the TEV of these measures to be zero. Hunter and Schmidt (1990) provided a detailed discussion of the problem of negative variance estimates, a common phenomenon in meta-analysis due to second-order sampling error. Negative variance estimates also occur in analysis of variance. When a variance is estimated as the difference between two other variance estimates, the obtained value can be negative because of sampling error in the two estimates. For example, if the population value of the variance in question is zero, the probability that it is estimated to be negative will be about 50% (because about half of its sampling distribution is less than zero). The practice of considering negative variance estimates as due to sampling error and treating them as zero is also the rule in generalizability theory research (e.g., Cronbach et al., 1972; Shavelson, Webb, & Rowley, 1989).

For the generalized self-efficacy construct, the CE of the Sherer et al. (1982) measure of generalized self-efficacy overestimated reliability by 6.00%. The overestimation was 6.30% for both measures of self-esteem.

Table 2
Estimates of Reliability Coefficients and Proportions of Transient Error Variances

Construct and measure    SD(a)    CE     CES    TEV               % Overestimate(b)
GMA
  Wonderlic               5.36    .79    .74    .05                     6.7
Conscientiousness
  PCI                    10.09    .88    .78    .10                    13.0
  IPIP                   18.08    .93    .89    .04                     4.5
  Average                         .91    .84    .07                     8.3
Extraversion
  PCI                     8.43    .81    .83    .00 (−.02)(c)           0.0
  IPIP                   16.90    .92    .88    .04                     4.5
  Average                         .87    .86    .02                     2.2
Agreeableness
  PCI                     6.93    .85    .80    .05                     6.3
  IPIP                   14.82    .88    .87    .01                     1.1
  Average                         .87    .84    .03                     3.6
Neuroticism
  PCI                     8.14    .85    .74    .11                    14.9
  IPIP                   19.69    .93    .86    .07                     8.1
  Average                         .89    .80    .09                    11.3
Openness
  PCI                     6.17    .81    .87    .00 (−.05)(c)           0.0
  IPIP                   15.16    .88    .90    .00 (−.02)(c)           0.0
  Average                         .85    .89    .00 (−.04)(c)           0.0
GSE
  Sherer's GSE           28.83    .89    .83    .05                     6.0
Self-Esteem
  TSBI                    8.31    .85    .80    .05                     6.3
  Rosenberg                       .84    .79    .05                     6.3
  Average                         .85    .80    .05                     6.3
Positive affectivity
  PANAS                   4.96    .82    .63    .19                    30.2
  MPI                     7.94    .91    .81    .09                    11.1
  DE scale                3.72    .86    .74    .12                    16.2
  Average                         .86    .73    .13                    17.8
Negative affectivity
  PANAS                   5.78    .83    .78    .05                     6.4
  MPI                     9.33    .90    .80    .09                    11.2
  DE scale                4.17    .90    .69    .21                    30.4
  Average                         .88    .76    .11                    14.5

Note. N = 167. Blank cells indicate that data are not applicable. CE = coefficient of equivalence (alpha); CES = coefficient of equivalence and stability; TEV = transient error variance over observed score variance; GMA = general mental ability; Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Item Pool (Goldberg, 1997); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (1965); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); MPI = Multidimensional Personality Index (Watson & Tellegen, 1985); DE = Diener and Emmons's (1984) Affect-Adjective Scale.
(a) Full-scale standard deviation. (b) Percentage by which the CES is overestimated by the CE. (c) The estimated ratio of transient error variance to observed score variance is negative for these measures. Because a variance proportion cannot be negative, zero is the appropriate estimate of its population value.



Against our expectation, measures of GMA and broad personality traits appear to have comparable TEV (.05 for GMA and an average of .04 across seven broad personality traits; Table 2). Hence, the comparison of the amount of transient error in measures of GMA with the amounts of transient error present in measures of broad personality constructs does not support Schmidt and Hunter's (1996, 1999) hypothesis that transitory error phenomena affect measures of ability to a lesser extent than measures of broad personality factors.4

The findings indicate that the proportion of the transient error component is quite large for measures of affectivity, averaging (across the three affectivity inventories) .13 for positive affectivity and .11 for negative affectivity. These results show that the CE overestimates the reliability of PA and NA measures, on average, by about 18% and 15%, respectively. Consistent with our prediction, the proportion of transient error in measures of PA is largest for the PANAS measure (.19). However, among measures of NA, the PANAS contained the smallest TEV (.05), contrary to our expectation.

Table 3 shows the standard errors of the CE, CES, and TEV. These values were obtained from the Monte Carlo simulation, except for those of the averaged estimates, that is, the averaged CE, CES, or TEV for constructs that have more than one measure. Standard errors of these averaged estimates were calculated by the usual formula for averaged statistics.5 As can be seen in the table, the standard errors for the CE estimates (ranging from .008 for IPIP Conscientiousness to .023 for PCI Openness) are consistently smaller than those of the CES and TEV. Standard errors of the CES (ranging from .022 to .065) in turn are slightly but consistently larger than those of the TEV (from .019 to .062). In general, scales with larger numbers of items have smaller standard errors. This finding is expected, as the estimates are affected by sampling errors in both subjects and items. Of special interest are the standard errors of the TEV estimates. These values are small, but because the TEV estimates are also small, the standard errors are somewhat large relative to the TEV estimates; this is due to the modest sample size of the study. Nevertheless, the fact that the 90% confidence intervals of the majority of the TEV estimates (12 of the 20 measures included in the study) do not cover the zero point indicates the existence of transient error in those measures.
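As an illustration of this confidence-interval check (our arithmetic, assuming a normal approximation; values for the Wonderlic from Tables 2 and 3):

.05 \pm 1.645 \times .029 = (.002,\ .098),

an interval that excludes zero.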

Discussion and Conclusion

This study replicates and expands on the findings presented by Becker (2000). The primary implication of these findings is that the nearly universal use of the CE as the reliability estimate for measures of important and widely used psychological constructs, such as those studied in this project, leads to overestimates of scale reliability. As shown in this article, the overestimation of reliability occurs because the CE does not capture transient error. As Becker noted, few empirical investigations of the effect of transient errors go beyond computer simulations. The present study contributes to the measurement literature by empirically calibrating the effect of transient error on the measurement of important constructs from the individual differences area and by formulating a simple method of computing the CES, the reliability coefficient that takes into account random response, transient, and specific factor error processes.


4 This general conclusion remained essentially unchanged even after adjusting for the fact that, as might be expected, our student sample was less heterogeneous on Wonderlic scores than the general population (SDs = 5.36 and 7.50, respectively, where the 7.50 is from the Wonderlic Test Manual). We adjusted both the CE and the CES using the following formula:

r_{xx_A} = 1 - u_x^2 (1 - r_{xx_B}),

where u_x = sample SD/population SD, r_{xx_B} = the sample reliability (from Table 2), and r_{xx_A} = the reliability in the general (more heterogeneous) population (Nunnally & Bernstein, 1994). This adjustment reduced the TEV estimate for the Wonderlic from .05 to .03, which is still not appreciably different from the average figure of .04 for the personality measures. It is to be expected that college students will be somewhat range restricted on GMA; as is usually the case, no such differences in variability were found for the noncognitive measures.
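In code, this adjustment is a one-line function. A minimal sketch, with our own (hypothetical) names, that reproduces the footnote's figures:

```python
# Sketch of the heterogeneity adjustment in Footnote 4; names are ours.
def adjust_reliability(r_sample: float, sd_sample: float, sd_pop: float) -> float:
    """r_xxA = 1 - u_x**2 * (1 - r_xxB), with u_x = sample SD / population SD."""
    u = sd_sample / sd_pop
    return 1.0 - u**2 * (1.0 - r_sample)

# Wonderlic (SDs 5.36 and 7.50; CE = .79, CES = .74 from Table 2):
ce_adj = adjust_reliability(0.79, 5.36, 7.50)    # ~.89
ces_adj = adjust_reliability(0.74, 5.36, 7.50)   # ~.87
adjusted_tev = ce_adj - ces_adj                  # ~.03, as reported above
```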

5 The following formula was used to calculate the standard errors of averaged estimates (CE, CES, or TEV):

SE_m = \frac{1}{k} \left( \sum_{i=1}^{k} SE_i^2 \right)^{1/2},

where SE_m is the standard error of the averaged estimate, SE_i is the standard error of an individual estimate, and k is the number of estimates.
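Equivalently, in code (a sketch with our own names; the example uses the Table 3 Neuroticism CES standard errors):

```python
# Footnote 5 formula as code: SE of the mean of k independent estimates.
import math

def se_of_average(se_values) -> float:
    """SE_m = sqrt(sum of SE_i**2) / k."""
    k = len(se_values)
    return math.sqrt(sum(se**2 for se in se_values)) / k

se_of_average([0.048, 0.027])  # ~.028, matching the Table 3 average for CES
```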


Table 3
Standard Errors for the Reliability Coefficients and Transient Error Variance Estimates

Construct and measure    CE    SE_CE(a)   CES   SE_CES(b)   TEV   SE_TEV(c)
GMA
  Wonderlic             .79     .020      .74     .035      .05     .029
Conscientiousness
  PCI                   .88     .013      .78     .042      .10     .039
  IPIP                  .93     .008      .89     .022      .04     .019
  Average               .91     .008(d)   .84     .024(d)   .07     .022(d)
Extraversion
  PCI                   .81     .020      .83     .039      .00     .038
  IPIP                  .92     .009      .88     .025      .04     .022
  Average               .86     .011(d)   .86     .023(d)   .02     .022(d)
Agreeableness
  PCI                   .85     .017      .80     .041      .05     .039
  IPIP                  .88     .014      .87     .030      .01     .027
  Average               .87     .011(d)   .84     .025(d)   .03     .024(d)
Neuroticism
  PCI                   .85     .016      .74     .048      .11     .045
  IPIP                  .93     .008      .86     .027      .07     .025
  Average               .89     .009(d)   .80     .028(d)   .09     .026(d)
Openness
  PCI                   .81     .023      .87     .045      .00     .045
  IPIP                  .88     .013      .90     .028      .00     .027
  Average               .85     .013(d)   .89     .027(d)   .00     .026(d)
GSE
  Sherer's GSE          .89     .013      .83     .035      .05     .032
Self-Esteem
  TSBI                  .85     .015      .80     .029      .05     .023
  Rosenberg             .84     .019      .79     .044      .05     .043
  Average               .85     .012(d)   .80     .026(d)   .05     .024(d)
Positive affectivity
  PANAS                 .82     .022      .63     .065      .19     .062
  MPI                   .91     .011      .81     .033      .09     .031
  DE scale              .86     .018      .74     .045      .12     .044
  Average               .86     .010(d)   .73     .029(d)   .13     .027(d)
Negative affectivity
  PANAS                 .83     .021      .78     .047      .05     .046
  MPI                   .90     .011      .80     .036      .09     .033
  DE scale              .90     .012      .69     .049      .21     .046
  Average               .88     .009(d)   .76     .026(d)   .11     .024(d)

Note. All standard errors were obtained from the Monte Carlo simulation, except those of the averaged estimates (see Footnote 5). CE = coefficient of equivalence (alpha); CES = coefficient of equivalence and stability; TEV = transient error variance over observed score variance; GMA = general mental ability; Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Item Pool (Goldberg, 1997); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (1965); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); MPI = Multidimensional Personality Index (Watson & Tellegen, 1985); DE = Diener and Emmons's (1984) Affect-Adjective Scale.
(a) Standard error of the CE estimates. (b) Standard error of the CES estimates. (c) Standard error of the TEV estimates. (d) Standard error of the averaged estimates.



With the exception of the personality trait of Openness to Experience, transient error affected the measurement of all constructs measured in this study.6 Results showed that the proportion of transient error in measures of GMA is comparable to the transient error proportion in measures of broad personality traits, which was inconsistent with our expectation based on Schmidt and Hunter's (1996) speculation that transient error is smaller in magnitude in the cognitive domain than in the noncognitive domain. These findings, however, are consistent with the arguments presented by Conley (1984). It may be that, in the long term, personality indeed changes more than intelligence does, thus producing decreased test–retest stability over long periods for personality constructs, whereas in the short term transient error affects cognitive and noncognitive measures similarly. Future research might examine this hypothesis.

6 Our estimates of TEV (and those of Becker, 2000) are actually slight underestimates. Strictly speaking, coefficient alpha estimates the random CE rather than the strict parallelism CE (Cronbach, 1951; Nunnally & Bernstein, 1994). Hence coefficient alpha can be expected to be slightly smaller than the strictly parallel CE, and therefore the quantity (alpha − CES) slightly underestimates TEV. This difference, however, is small and is reduced somewhat by the fact that the measures used in computing the CES may have fallen somewhat short of strict parallelism.

As expected, the overestimation of scale reliability in the affectivity domain was substantial and the largest among the construct domains sampled in this study. For positive affectivity, we found that the CE overestimates the reliability of the measures examined in the study by about 18% on average. For the PANAS, the most popular measure of affectivity, the overestimate is higher, at about 30%. For negative affectivity, the amount of TEV was also substantial, leading to an overestimation of the reliability of NA scores by about 15%. The extent of bias resulting from using the CE as the reliability estimate for measures of GMA and broad personality factors is smaller but potentially consequential. Our estimate of the proportion of TEV in the measures of self-esteem (.05 across the two measures; Table 2) is very similar in magnitude to the .04 estimate obtained by Becker (2000).

As noted earlier, we tried to analyze widely used measures of the constructs of cognitive ability, personality, and affectivity in this study. Nevertheless, it is obviously impossible to include all the popular measures of these constructs used in the literature. To generalize the findings confidently, researchers need more studies similar to this one, both studies using different measures of these constructs and studies of other important constructs in psychology and social science.

Just as with other statistical estimates, the estimates of TEV obtained here are susceptible to sampling error. As discussed earlier, sampling fluctuation obviously accounts for the negative estimates of transient error for the measures of Openness to Experience. Results of our simulation showed that standard errors for the estimates were relatively large, as expected given the modest sample size of the current study. Consequently, specific figures obtained here should be interpreted cautiously, pending future replications. The problem of sampling error can be adequately addressed only by application of meta-analysis when sufficient studies become available (Hunter & Schmidt, 1990). Accordingly, we encourage further studies examining the effects of transient error.

Recent research (e.g., Becker, 2000; Schmidt & Hunter, 1999) suggests that transient error has a potentially large impact on the measurement of psychological constructs. This potential impact should be taken into account when estimating the reliability of measures and when correcting observed relationships for biases introduced by measurement error (DeShon, 1998; Schmidt & Hunter, 1999). Our study provides some initial empirical estimates of the magnitude of the problem for measures of widely used constructs. The finding that transient error indeed has nontrivial effects calls for a more comprehensive treatment of measurement errors in empirical research. One obvious step toward minimizing transient error in the measures used in research is to repeatedly administer measures to the same group of subjects at appropriate time intervals and then average (or sum) the scores across all the testing occasions. TEV in the scores thus obtained will be reduced in proportion to the number of testing occasions (Feldt & Brennan, 1989); for example, averaging scores from two occasions halves the TEV, and averaging across three occasions cuts it to a third. However, a large number of testing occasions may be needed to virtually eradicate the effect of transient error, and most research projects probably lack the resources to do this. We therefore need a more general approach to correcting for the biasing effects of measurement errors in general and of transient error in particular. Studies like this one, specifically designed to estimate the CES of measures of important psychological constructs, can provide the needed reliability estimates to be subsequently used to correct for measurement error in observed correlations between measures.7

The cumulative development of such a "database" of CES estimates for measures of widely used psychological constructs can thus enable substantive researchers to obtain unbiased estimates of construct-level relationships among their variables of interest and to minimize the research resources needed. Rothstein (1990) presented a precedent for this general approach: She provided large-sample, meta-analytically derived figures for the interrater reliability of supervisory ratings of overall job performance, figures that have subsequently been used to correct for measurement error in ratings in many published studies. A similar CES database could serve the same purpose for widely used measures of individual differences constructs.

7 As is well known, reliability varies with sample heterogeneity: Reliability is higher in samples (and populations) in which the scale standard deviation (SD_x) is larger. However, given the reliability and the SD_x in one sample, a simple formula (see Footnote 4) is available for computing the corresponding reliability in another sample with a different SD_x (Nunnally & Bernstein, 1994).

References

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Barrick, M. R., & Mount, M. K. (1995). The Personal Characteristics Inventory manual. Unpublished manuscript, University of Iowa, Iowa City.

Barrick, M. R., Mount, M. K., & Strauss, J. P. (1994). Antecedents of involuntary turnover due to a reduction in force. Personnel Psychology, 47, 515–535.

Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality to work-team processes and team effectiveness. Journal of Applied Psychology, 83, 377–391.

Becker, G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370–379.

Blascovich, J., & Tomaka, J. (1991). Measures of self-esteem. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 115–160). San Diego, CA: Academic Press.

Brief, A. P. (1998). Attitudes in and around organizations. Thousand Oaks, CA: Sage.

Buss, A. H., & Perry, M. (1992). The Aggression Questionnaire. Journal of Personality and Social Psychology, 63, 452–459.

Clause, C. S., Mullins, M. E., Nee, M. T., Pulakos, E., & Schmitt, N. (1998). Parallel test form development: A procedure for alternate predictors and an example. Personnel Psychology, 51, 193–208.

Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality, and self-opinion. Personality and Individual Differences, 5, 11–25.

Costa, P. T., & McCrae, R. R. (1985). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.

Cronbach, L. J. (1943). On estimates of test reliability. Journal of Educational Psychology, 34, 485–494.

Cronbach, L. J. (1947). Test reliability: Its meaning and determination. Psychometrika, 12, 1–16.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

DeShon, R. P. (1998). A cautionary note on measurement error corrections in structural equation models. Psychological Methods, 3, 412–423.

Diener, E., & Emmons, R. A. (1984). The independence of positive and negative affect. Journal of Personality and Social Psychology, 47, 1105–1117.

Feldt, L. S. (1965). The approximate sampling distribution of Kuder–Richardson reliability coefficient twenty. Psychometrika, 30, 357–370.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.

Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4, 26–42.

Goldberg, L. R. (1997). A broad-bandwidth, public-domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg, the Netherlands: Tilburg University Press.

Hansen, C. P. (1989). A causal model of the relationship among accidents, biodata, personality, and cognitive factors. Journal of Applied Psychology, 74, 81–90.

Heaven, P. C. L., & Bucci, S. (2001). Right-wing authoritarianism, social dominance orientation and personality: An analysis using the IPIP measure. European Journal of Personality, 15, 49–56.


Helmreich, R., & Stapp, J. (1974). Short forms of the Texas Social Behavior Inventory (TSBI): An objective measure of self-esteem. Bulletin of the Psychonomic Society, 4, 473–475.

Hogan, R. (1986). Hogan Personality Inventory manual. Minneapolis, MN: National Computer Systems.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

Judge, T. A., & Bretz, R. D. (1993). Report on an alternative measure of affective disposition. Educational and Psychological Measurement, 53, 1095–1104.

Le, H., & Schmidt, F. L. (2001, August). The multi-faceted nature of measurement error and its implications for measurement error corrections. Paper presented at the 109th Annual Convention of the American Psychological Association, San Francisco, CA.

Lord, F. M. (1955). Sampling fluctuations resulting from the sampling of test items. Psychometrika, 20, 1–22.

Mackaman, S. L. (1982). Performance evaluation tests for environmental research: Wonderlic Personnel Test. Psychological Reports, 51, 635–644.

Mooney, C. Z. (1997). Monte Carlo simulation. Newbury Park, CA: Sage.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Otis, A. S., & Lennon, R. T. (1979). Otis–Lennon School Ability Test. New York: The Psychological Corporation.

Price, J. L. (1997). Handbook of organizational measurement. International Journal of Manpower, 18, 303–558.

Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.

Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322–327.

Schmidt, F. L. (1999, March 25). Measurement error and cumulative knowledge in psychological research. Invited address presented at Purdue University, West Lafayette, IN.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.

Schmidt, F. L., & Hunter, J. E. (1999). Theory testing and measurement error. Intelligence, 27, 183–198.

Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44, 922–932.

Sherer, M., Maddux, J. E., Mercandante, B., Prentice-Dunn, S., Jacobs, B., & Rogers, R. W. (1982). The self-efficacy scale. Psychological Reports, 76, 707–710.

Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement (pp. 356–442). Washington, DC: American Council on Education.

Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560–620). Washington, DC: American Council on Education.

Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage.

U.S. Department of Labor. (1970). Manual for the USES General Aptitude Test Battery, Section III: Development. Washington, DC: Manpower Administration, U.S. Department of Labor.

Vaidya, J. G., Gray, E. K., Haig, J., & Watson, D. (2001). On the differential stability of traits: Comparing trait affect and Big Five scales. Manuscript submitted for publication.

Watson, D. (2000). Mood and temperament. New York: Guilford Press.

Watson, D., & Clark, L. A. (1994). The PANAS-X: Manual for the Positive and Negative Affect Schedule—expanded form. Iowa City: University of Iowa.

Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54, 1063–1070.

Watson, D., & Tellegen, A. (1985). Toward a consensual structure of mood. Psychological Bulletin, 98, 219–235.

Wechsler, D. (1981). Manual for the Wechsler Adult Intelligence Scale—Revised. New York: The Psychological Corporation.

Wonderlic Personnel Test. (1998). Wonderlic Personnel Test & Scholastic Level Exam user's manual. Northfield, IL: Wonderlic and Associates.


Appendix A

Estimating the Standard Deviation of a Full Scale From Its Subscale Standard Deviation

We need a general method that enables us to estimate the standard deviation of a full scale from the values for subscales with any number of items. We derive such a method here.

Our derivations assume that all values come from an unlimited sample size; that is, they are population parameters. We further adopt the item domain model (Nunnally & Bernstein, 1994), which assumes that all items of a scale are randomly sampled from a hypothetical pool of an infinite number of items.

Consider Scale A1 with k_1 items. Assuming that we have the standard deviation of A1 (σ_A1) and its coefficient alpha (α_1), we need to derive a formula to estimate the standard deviation of Scale A2, which has k_2 items from the same item domain. Following the item domain model, the items of Scales A1 and A2 are randomly sampled from the same hypothetical item domain.

The variance of A1 can be written as a function of the average variance and intercorrelation of a single item as follows:

\sigma_{A_1}^2 = k_1 \sigma^2 + k_1 (k_1 - 1) \sigma^2 \bar{\rho},    (A1)

where σ is the average standard deviation of one item and ρ̄ is the average intercorrelation among all the items included in the scale.

Similarly, the variance of A2 can be obtained from the average variance and intercorrelation of a single item:

\sigma_{A_2}^2 = k_2 \sigma^2 + k_2 (k_2 - 1) \sigma^2 \bar{\rho}.    (A2)

Factoring Equations A1 and A2 for k_1σ² and k_2σ², respectively, we obtain

\sigma_{A_1}^2 = k_1 \sigma^2 [1 + (k_1 - 1) \bar{\rho}]    (A3)

and

\sigma_{A_2}^2 = k_2 \sigma^2 [1 + (k_2 - 1) \bar{\rho}].    (A4)

Solving Equation A3 for σ², we have

\sigma^2 = \sigma_{A_1}^2 / [k_1 + k_1 (k_1 - 1) \bar{\rho}].    (A5)

The average intercorrelation among all items (ρ̄) can be computed from the coefficient alpha of Scale A1 (α_1) by the reversed Spearman–Brown formula:

\bar{\rho} = \alpha_1 / [k_1 - (k_1 - 1) \alpha_1].    (A6)

Substituting the value of ρ̄ from Equation A6 into Equation A5 and simplifying the result, we have

\sigma^2 = \sigma_{A_1}^2 [k_1 - (k_1 - 1) \alpha_1] / k_1^2.    (A7)

Now substituting the value of ρ̄ from Equation A6 into Equation A4 and simplifying, we obtain

\sigma_{A_2}^2 = k_2 \sigma^2 [k_1 - \alpha_1 (k_1 - k_2)] / [k_1 - (k_1 - 1) \alpha_1].    (A8)

Next, substituting the value of σ² from Equation A7 into Equation A8 and simplifying, we have the equation of interest:

\sigma_{A_2}^2 = (k_2 / k_1^2) \, \sigma_{A_1}^2 \, [k_1 - (k_1 - k_2) \alpha_1].    (A9)

Taking the square root of both sides of Equation A9 yields the formula used to compute the standard deviation of a scale having k_2 items from the standard deviation and coefficient alpha of a subscale having k_1 items:

\sigma_{A_2} = \sigma_{A_1} \{ k_2 [k_1 - (k_1 - k_2) \alpha_1] \}^{1/2} / k_1.    (A10)

As can be seen from the derivations above, this formula can be used to compute the standard deviation of one scale from the standard deviation and coefficient alpha of any other scale if the two scales sample items from the same item domain.

In practice, the values of σ_A1 and α_1 are estimated and therefore subject to sampling error. Consequently, the estimate of σ_A2 is also subject to sampling error. Using circumflexes to denote estimated values, we can rewrite Equation A10 as follows:

\hat{\sigma}_{A_2} = \hat{\sigma}_{A_1} \{ k_2 [k_1 - (k_1 - k_2) \hat{\alpha}_1] \}^{1/2} / k_1.    (A10′)

To reduce the impact of sampling error, we should average the values of the subscales. Equation A10′ was used in our study to estimate the standard deviations of the full scales of PCI Extraversion, PCI Openness, Sherer's GSE, and MPI Positive Affectivity and Negative Affectivity.

If we let p_1 denote the ratio of the number of items in subscale A1 to the number in the full scale A2 (i.e., p_1 = k_1/k_2), Equation A9 can alternatively be written as follows:

\sigma_{A_2}^2 = \sigma_{A_1}^2 [p_1 - (p_1 - 1) \alpha_1] / p_1^2.    (A11)

Equation A11 is the same as Equations 18a and 18b in the text.


Appendix B

Simulation Procedure to Estimate Standard Errors of CE, CES, and TEV

Data were simulated for each subject on each test item. A subject's score on an item of a test includes four components:

x_{pij} = t_{pi} + s_{pi} + o_{pj} + e_{pij},    (B1)

where x_{pij} is the observed score of subject p on item i on occasion j; t_{pi} is the true score of subject p on item i; s_{pi} is the specific factor error of item i (the interaction between subject p and item i); o_{pj} is the transient error on occasion j (the interaction between subject p and occasion j); and e_{pij} is the random response error of subject p on item i on occasion j. The components on the right side of Equation B1 can be determined on the basis of information (reliability coefficients) about the scale that includes item i. Specifically, consider a subject's score on a scale X that has k items; this score is the sum of his or her scores across the k items. From Equation B1, we have

X_{pj} = \sum_{i=1}^{k} (t_{pi} + s_{pi} + o_{pj} + e_{pij}).    (B2)

The variance (across subjects) of scale X is then

\mathrm{Var}(X) = \mathrm{Var}\left(\sum_{i=1}^{k} t_{pi}\right) + \mathrm{Var}\left(\sum_{i=1}^{k} s_{pi}\right) + \mathrm{Var}\left(\sum_{i=1}^{k} o_{pj}\right) + \mathrm{Var}\left(\sum_{i=1}^{k} e_{pij}\right).    (B3)

Because s_{pi} and e_{pij} are not correlated across items (i.e., Cov(s_i, s_i′) = 0 and Cov(e_ij, e_i′j) = 0 for all i and i′), whereas the true score and transient error components are shared across the k items within an occasion, Equation B3 can be expanded as

\mathrm{Var}(X) = k^2 \mathrm{Var}(t) + k \mathrm{Var}(s) + k^2 \mathrm{Var}(o) + k \mathrm{Var}(e),    (B4)

where Var(t) is the true score variance of an item, Var(s) is the variance of specific factor error of an item, Var(o) is the variance of transient error of an item, and Var(e) is the variance of random response error of an item.

By definition, the components on the right side of Equation B4 are

true score variance (TV) = k^2 \mathrm{Var}(t),    (B5)

specific factor error variance (SEV) = k \mathrm{Var}(s),    (B6)

transient error variance (TEV) = k^2 \mathrm{Var}(o),    (B7)

and

random response error variance (REV) = k \mathrm{Var}(e).    (B8)

Standardizing X (i.e., setting Var(X) = 1), we have

TV = CES,    (B9)

TEV = CE − CES,    (B10)

and

SEV + REV = 1 − CE,    (B11)

where CES is the coefficient of equivalence and stability of scale X and CE is the coefficient of equivalence of scale X.

From Equations B5 to B11, the variance components for an item can be determined from the reliability coefficients (CE and CES) of the scale in which it is included:

\mathrm{Var}(t) = \mathrm{CES} / k^2,    (B12)

\mathrm{Var}(o) = (\mathrm{CE} - \mathrm{CES}) / k^2,    (B13)

\mathrm{Var}(s) = \mathrm{SEV} / k,    (B14)

and

\mathrm{Var}(e) = (1 - \mathrm{CE} - \mathrm{SEV}) / k.    (B15)

The value of SEV in Equation B14 can be any positive number smaller than 1 − CE (we tested different values of SEV within that range and obtained virtually the same results). The reliability coefficients (CE and CES) and the number of scale items (k) can be obtained for each measure included in the study (Table 2).

With the values of Var(t), Var(o), Var(s), and Var(e) estimated from Equations B12 to B15, we can simulate a score for each subject on each item on each occasion. Specifically, the following equation was used:

x_{pij} = z_1 [\mathrm{Var}(t)]^{1/2} + z_2 [\mathrm{Var}(s)]^{1/2} + z_3 [\mathrm{Var}(o)]^{1/2} + z_4 [\mathrm{Var}(e)]^{1/2},    (B16)

where z_1, z_2, z_3, and z_4 were randomly drawn from four independent standard normal distributions. In total, scores for p subjects (p = 167) on k items (k = the number of items of the scale examined) on two occasions were simulated. For a given subject, the same values of z_1 and z_3 were used across items, and the same value of z_2 (for each item) was used across occasions.

Two parallel half scales were then created from the simulated data, and the relevant statistics (the coefficient alphas of, and the correlation between, the half scales) were computed. The program used these statistics to calculate the CE, CES, and TEV for each scale included in the study, following the method described in the text. One thousand data sets were simulated, and the standard deviations of the distributions of the estimated CE, CES, and TEV were recorded. These standard deviations are the standard errors of interest.
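A minimal re-implementation sketch of the data-generation step (Equation B16) follows; the half-scale construction and the CES computation from the text are omitted, and the array layout, names, and the SEV value are our assumptions, not the authors' SAS program. As a sanity check, coefficient alpha computed on one occasion of a large simulated sample should approach the input CE, since the shared variance among items within an occasion is TV + TEV = CE.

```python
# Sketch of the Equation B16 data generator; shapes and names are ours.
import numpy as np

def simulate_scores(n_subjects, k, ce, ces, sev=0.05, rng=None):
    """Simulate subject x item x occasion scores under the B1 decomposition.
    sev may be any positive value below 1 - ce (see Equation B14)."""
    rng = np.random.default_rng() if rng is None else rng
    var_t = ces / k**2               # Equation B12
    var_o = (ce - ces) / k**2        # Equation B13
    var_s = sev / k                  # Equation B14
    var_e = (1 - ce - sev) / k       # Equation B15
    z1 = rng.standard_normal((n_subjects, 1, 1))  # true score: constant per subject
    z2 = rng.standard_normal((n_subjects, k, 1))  # specific error: constant across occasions
    z3 = rng.standard_normal((n_subjects, 1, 2))  # transient error: constant across items
    z4 = rng.standard_normal((n_subjects, k, 2))  # random response error: unique throughout
    return (z1 * var_t**0.5 + z2 * var_s**0.5     # Equation B16
            + z3 * var_o**0.5 + z4 * var_e**0.5)

def coefficient_alpha(scores):
    """Coefficient alpha for a subjects x items score matrix."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

x = simulate_scores(10000, k=20, ce=0.88, ces=0.78)
print(coefficient_alpha(x[:, :, 0]))  # ~ .88, i.e., approximately CE
```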

A computer program (in SAS 8.01) was written to implement the simulation procedures described above. The program is available from Frank L. Schmidt upon request.

Received September 5, 2001
Revision received October 7, 2002
Accepted January 13, 2003
