
Scale Development Research: A Content Analysis and Recommendations for Best Practices

Roger L. Worthington
University of Missouri–Columbia

Tiffany A. Whittaker
University of Texas at Austin

The authors conducted a content analysis on new scale development articles appearing in the Journal of Counseling Psychology during 10 years (1995 to 2004). The authors analyze and discuss characteristics of the exploratory and confirmatory factor analysis procedures in these scale development studies with respect to sample characteristics, factorability, extraction methods, rotation methods, item deletion or retention, factor retention, and model fit indexes. The authors uncovered a variety of specific practices that were at variance with the current literature on factor analysis or structural equation modeling. They make recommendations for best practices in scale development research in counseling psychology using exploratory and confirmatory factor analysis.

Counseling psychology has a rich tradition producing psychometrically sound instruments for applications in research, training, and practice. Many areas of scholarly inquiry in counseling psychology continue to be ripe for scale development research. In a special issue of the Journal of Counseling Psychology (JCP) on quantitative research methods, Dawis (1987) provided an overview of scale development techniques, Tinsley and Tinsley (1987) discussed the use of factor analysis, and Fassinger (1987) presented an overview of structural equation modeling (SEM). Although these articles continue to be cited in counseling psychology research, recent advances require updated information and a comprehensive overview of all three topics. More recently, Quintana and Maxwell (1999) and Martens (2005) provided comprehensive updates of SEM, but their focus was not specifically on its use in scale development research (see also Martens & Hasse, 2006 [this issue]; Weston & Gore, 2006 [TCP, special issue, part 1]).

The purpose of this article is threefold: (a) to provide an overview of thesteps taken in the scale development process using exploratory factor analy-sis (EFA) and confirmatory factor analysis (CFA), (b) to assess current prac-

The authors contributed equally to the writing of this article. We would like to thank JeffreyAndreas Tan for his assistance with the content analysis. Address correspondence to Roger L.Worthington, Department of Educational, School, and Counseling Psychology, University ofMissouri, Columbia, MO 65211; e-mail: [email protected]

THE COUNSELING PSYCHOLOGIST, Vol. 34 No. 6, November 2006 806-838DOI: 10.1177/0011000006288127© 2006 by the Society of Counseling Psychology

806

use or unauthorized distribution.© 2006 Division 17 of Counseling Psychologist Association. All rights reserved. Not for commercial

by Plamen Dimitrov on July 11, 2007 http://tcp.sagepub.comDownloaded from

Page 3: Scale Development Research

tices by reporting the results of a 10-year content analysis of scale develop-ment research in counseling psychology, and (c) to provide a set of recom-mendations for best practices in using EFA and CFA in scale development(for more on factor analysis, see Kahn, 2006 [TCP, special issue, part 1]). Weassume the reader has basic knowledge of psychometrics, including principlesof reliability (Helms, Henze, Sass, & Mifsud, 2006 [TCP, special issue, part1]), validity (Hoyt, Warbasse, & Chu, 2006 [this issue]), and multivariate sta-tistics (Sherry, 2006 [TCP, special issue, part 1]). We begin with an overviewof EFA and CFA, followed by a discussion of the procedure we used in con-ducting our content analysis. We then embed the findings of our contentanalysis within more detailed discussions of EFA and CFA, identifying poten-tial problems and highlighting best practices. We conclude with an integrativediscussion of best practices and findings from the content analysis.

OVERVIEW OF EFA AND CFA

Factor analysis is a technique used to identify or confirm a smaller number of factors or latent constructs from a large number of observed variables (or items). There are two main categories of factor analysis: (a) exploratory and (b) confirmatory (Kahn, 2006 [TCP, special issue, part 1]). Although researchers may use factor analysis for a range of purposes, one of the most prevalent uses of factor-analytic techniques is to support the validity of newly developed tests or scales—that is, does the newly developed test or scale measure the intended construct(s)? More specifically, the application of factor analysis to a set of items may help researchers answer the following questions: How many factors or constructs underlie the set of items? What are the defining features or dimensions of the factors or constructs that underlie the set of items (Tabachnick & Fidell, 2001)?

EFA assesses the construct validity during the initial development of an instrument. After developing an initial set of items, researchers apply EFA to examine the underlying dimensionality of the item set. Thus, they can group a large item set into meaningful subsets that measure different factors. The primary reason for using EFA is that it allows items to be related to any of the factors underlying examinee responses. As a result, the developer can easily identify items that do not measure an intended factor or that simultaneously measure multiple factors, in which case they could be poor indicators of the desired construct and eliminated from further consideration.

When used for scale development, EFA becomes a combination of qualitative and quantitative methods, which can be either confusing or enlivening for researchers. We have found that novices (and some who are not novices) hope to have the statistical program produce the ultimate solution that will provide them with a set of empirically determined, indisputable dimensions or factors. However, effectively using EFA procedures requires researchers to use inductive reasoning, while patiently and subtly adjusting and readjusting their approach to produce the most meaningful results. Therefore, the process of scale development using EFA can become a relatively dynamic process of examination and revision, followed by more examination and revision, ultimately leading to a tentative rather than a definitive outcome.

The most current approach in conducting CFA is to use SEM. Prior to analyzing the data, a researcher must indicate (a) how many factors are present in an instrument, (b) which items are related to each factor, and (c) whether the factors are correlated or uncorrelated (issues that are revealed during the process of EFA). Because the items are generally constrained to load on only one factor in CFA, it is generally intended not to explore whether a given item measures no factors, one factor, or multiple factors but instead to evaluate or confirm the extent to which the researcher's measurement model is replicated in the sample data. Thus, it is critical to have prior knowledge of the expected relationships between items and factors before conducting CFA—hence the term confirmatory. SEM is a powerful confirmatory technique because it allows the researcher greater control over the form of constraints placed on items and factors when analyzing a hypothesized model. Furthermore, as we discuss later, researchers can also use SEM to examine competing models to assess the extent to which one hypothesized model fits the data better than an alternative model. In our discussion, we provide information about the basic concepts and procedures necessary to use SEM in scale development research. For more advanced discussions of SEM, we refer readers to several existing books and articles (e.g., Bollen, 1989; Kline, 2005; Martens, 2005; Martens & Hasse, 2006; Quintana & Maxwell, 1999; Thompson, 2004).
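To make the mechanics concrete, the sketch below shows how a hypothesized measurement model of the kind described above might be specified and fit in Python. The semopy package, the lavaan-style model syntax, the data file name, and the item and factor names are all our illustrative assumptions rather than anything prescribed in this article; any SEM program follows the same logic of declaring which items load on which factors and whether the factors correlate.

```python
import pandas as pd
import semopy

# Hypothetical two-factor measurement model: items 1-4 load on F1, items 5-8 on F2,
# and the factors are allowed to correlate (as an oblique EFA solution would suggest).
model_desc = """
F1 =~ item1 + item2 + item3 + item4
F2 =~ item5 + item6 + item7 + item8
F1 ~~ F2
"""

data = pd.read_csv("new_sample.csv")   # columns item1 ... item8 (hypothetical file)
model = semopy.Model(model_desc)
model.fit(data)

print(model.inspect())            # loadings, factor covariance, residual variances
print(semopy.calc_stats(model))   # chi-square, CFI, TLI, RMSEA, and related fit indexes
```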

CONTENT-ANALYSIS PROCEDURE

To provide context for our discussion of scale development best practices, we conducted a content analysis of scale development articles in counseling psychology that reflect common practices. In this section, we provide an overview of the article-selection process used in our content analysis. We then integrate the findings of our content analysis into the remainder of the article as we review the literature and recommend best practices for scale development.

We reviewed scale development articles published in JCP in the 10 years between 1995 and 2004, inclusive (see appendix for a list of articles). We based our selection of articles on two central criteria: We included (a) only new scale development research articles (i.e., we excluded articles investigating only the reliability, validity, or revisions of existing scales) and (b) only articles that reported results from EFA and CFA. A paid graduate student assistant reviewed the tables of contents for each issue of JCP published during the specified time frame. We instructed the graduate student to err on the side of being overly inclusive, which resulted in the identification of 38 articles that used EFA and CFA to examine the psychometric properties of measurement instruments. The first author reviewed these articles and eliminated 15 that did not meet the selection criteria, resulting in 23 articles for our sample. Next, the first author and second author independently evaluated the 23 articles to identify and quantify the EFA and CFA characteristics. The only discrepancies in the independent evaluations of the articles were because of clerical errors in recording descriptive information (as opposed to disagreement in classification), which we jointly checked and verified.

We were interested in a number of characteristics of the studies. For studies reporting EFA procedures, we were interested in the following: (a) sample characteristics, (b) criteria for assessing the factorability of the correlation matrix, (c) extraction methods, (d) criteria for determining rotation method, (e) rotation methods, (f) criteria for factor retention, (g) criteria for item deletion, and (h) purposes and criteria for optimizing scale length (see Table 1). For studies reporting CFA procedures, we were interested in the following: (a) using SEM versus alternative methods as a confirmatory approach, (b) sample-size criteria, (c) fit indexes, (d) fit-index criteria, (e) cross-validation indexes, and (f) model-modification issues (see Table 2).

THE PROCESS OF SCALE DEVELOPMENT RESEARCH

There are various strategies used in scale construction, often described using somewhat differing labels for similar approaches. Brown (1983) summarized three primary strategies: logical, empirical, and homogeneous. Friedenberg (1995) identified a slightly different set of categories: logical-content or rational, theoretical, and empirical, in which the latter contains criterion group and factor analysis methods. The rational or logical approach simply uses the scale developer's judgments to identify or construct items that are obviously related to the characteristic being measured. The theoretical approach uses psychological theory to determine the content of the scale items. Both the theoretical and the rational or logical approaches are no longer popular methods in scale development. The more rigorous empirical approach uses statistical analyses of item responses as the basis for item selection based on (a) predictive utility for a criterion group (e.g., depressives) or (b) homogenous item groupings. The method described in this article is an empirical approach that employs factor analysis to form homogenous item groupings.


TABLE 1: Characteristics of Exploratory Factor Analyses Used in Scale Development Studies Published in the Journal of Counseling Psychology (1995 to 2004)

Characteristic Frequency

Sample characteristics
  Convenience sample 5
  Purposeful sample of target group 10
  Convenience and purposeful sampling 6
Criteria used to assess factorability of correlation matrix
  Absolute sample size 1
  Item intercorrelations 1
  Participants per item ratio 3
  Bartlett's test of sphericity 5
  Kaiser-Meyer-Olkin test of sample adequacy 7
  Unspecified 11
Extraction method
  Principal-components analysis 9
  Common-factors analysis
    Principal-axis factoring 6
    Maximum likelihood 3
    Unspecified 1
  Combination principal-components analysis and common-factors analysis 1
  Unspecified 1
Criteria for determining rotation method
  Subscale intercorrelations 2
  Theory 3
  Both 1
  Other 3
  Unspecified 12
Rotation method
  Orthogonal
    Varimax 8
    Unspecified 1
  Oblique
    Promax 1
    Oblimin 3
    Unspecified 4
  Both orthogonal and oblique 3
  Unspecified 1
Criteria for item deletion or retention
  Loadings 16
  Cross-loadings 13
  Communalities 0
  Item analysis 1
  Other 3
  Unspecified 2
  No items were deleted 2
Criteria for factor retention
  Eigenvalues 18
  Scree plot 17
  Minimum proportion of variance accounted for by factor 2
  Number of items per factor 4
  Simple structure 5
  Conceptual interpretability 15
  Other 3
  Unspecified 2
Optimizing scale length
  None attempted 15
  Purpose
    Reduce total scale length 2
    Limit total items per factor 3
    Balance items per factor 2
  Criteria
    Redundant items 1
    Conceptually unrelated items 1
    Statistical invariance 1
    Cross-loadings 1
    Dropped items with lowest loadings 4
    Item content 2

NOTE: Values in each category may not sum to equal the total number of studies because some studies may have reported more than one criterion or approach.


A number of authors have recommended similar sequences of steps to be taken prior to using factor-analytic techniques (e.g., Anastasi, 1988; Dawis, 1987; DeVellis, 2003). We review these preliminary steps in the following section because, as is the case in most scientific endeavors, early mistakes in scale development often lead to problems later in the process. Once we have described all the steps in some detail, we address the extent to which the studies in our content analysis incorporated the steps in their designs.

Although there is little variation between models proposed by different authors, we rely primarily on DeVellis (2003) as the most current resource. Thus, the following description is only one of several similar models available and does not reflect a unitary best practice. DeVellis (2003) recommends the following steps in constructing new instruments: (a) Determine clearly what you want to measure, (b) generate an item pool, (c) determine the format of the measure, (d) have experts review the initial item pool, (e) consider inclusion of validation items, (f) administer items to a development sample, (g) evaluate the items, and (h) optimize scale length.


TABLE 2: Characteristics of Confirmatory Factor Analyses Used in Scale Development Studies Published in the Journal of Counseling Psychology (1995 to 2004)

Characteristic Frequency

SEM versus FA as a confirmatory approach
  SEM used 14
  FA used 2
Typical SEM approaches
  Single-model approach 2
  Competing-models approach 8
    Nested models compared 4
    Nonnested or equivalent models compared 4
Sample-size criteria (SEM only)
  Participants per parameter 1
  Unspecified 13
Overall model fit
  Chi-square 12
  Chi-square and df ratio 6
Incremental fit indexes reported
  CFI 8
  PCFI 1
  IFI 2
  NFI 4
  NNFI/TLI 7
  RNI 1
Absolute fit indexes reported
  GFI 10
  AGFI 6
  RMSEA 6
  RMSEA with confidence intervals 1
  RMR 4
  SRMR 1
  Hoelter N 1
Predictive fit indexes reported
  AIC 2
  CAIC 1
  ECVI 2
  BIC 1
Fit index criteria
  Recommended cutoff 11
  Unspecified 3
Model modification
  Lagrange multiplier 3
  Wald statistic 0
  Item parceling 2

NOTE: Values in each category may not sum to equal the total number of studies because some studies may have reported more than one criterion or approach. AGFI = Adjusted Goodness-of-Fit Index; AIC = Akaike's Information Criterion; BIC = Bayesian Information Criterion; CAIC = Consistent Akaike's Information Criterion; CFI = Comparative Fit Index; ECVI = Expected Cross-Validation Index; FA = Common-Factors Analysis; GFI = Goodness-of-Fit Index; IFI = Incremental Fit Index; NFI = Normed Fit Index; NNFI/TLI = Nonnormed Fit Index or Tucker-Lewis Index; PCFI = Parsimony Comparative Fit Index; RMR = Root Mean-Square Residual; RMSEA = Root Mean-Square Error of Approximation; RNI = Relative Noncentrality Index; SEM = Structural Equation Modeling; SRMR = Standardized Root Mean-Square Residual.


In scale development, the first step is to define your construct clearly and concretely, using both existing theory and research to provide a sound conceptual foundation. This is sometimes more difficult than it may initially appear because it requires researchers to distinctly define the attributes of abstract constructs. Nothing is more difficult to measure than an ill-defined construct because it leads to the inclusion of items that may be only peripherally related to the construct of interest or to the exclusion of items that are important components of the content domain.

The next step is to generate a pool of items designed to tap the construct. Ultimately, the objective is to arrive at a set of items that clearly represent the construct of interest so that factor-analytic, data-reduction techniques yield a stable set of underlying factors that accurately reflect the construct. Items that are poorly worded or not central to a clearly articulated construct will introduce potential sources of error variance, reducing the strength of correlations among items, and will diminish the overall objectives of scale development (see Quintana & Minami, 2006 [this issue], on dealing with measurement error in meta-analyses). In general, researchers should write items so that they are clear, concise, readable, distinct, and reflect the scale's purpose (e.g., produce responses that can be scored in a meaningful way in relation to the construct definition). DeVellis (2003) and Anastasi (1988) offer a host of recommendations for generating quality items and choosing a response format that are beyond the scope of this article. It suffices to say that the researcher should not take the quality of the item pool lightly, and a carefully planned approach to item generation is a critical beginning to scale development research.

Having the items reviewed by one or more groups of knowledgeable people (experts) to assess item quality on a number of different dimensions is another critical step in the process. At a minimum, expert review should involve an analysis of content validity (e.g., the extent to which a set of items reflects the content domain). Experts can also evaluate items for clarity, conciseness, grammar, reading level, face validity, and redundancy. Finally, it is also helpful at this stage for experts to offer suggestions for adding new items and length of administration.

Although it is possible to include additional scales for participants to complete that may provide information about convergent and discriminant validity, we recommend that researchers limit such efforts at this stage of development. We recommend this for two reasons. First, it is wise to keep the total questionnaire length as short as possible and directly related to the study's central purpose. The longer the questionnaire, the less likely potential participants will be to volunteer for the study or to complete all the items (Converse & Presser, 1986). Scale development studies sometimes include as many as 3 to 4 times the number of items that will eventually end up on the instrument, making inclusion of additional scales prohibitive. Second, there are several ways that items from other measures may interact with items designed for the new instrument to affect participant responses and, thus, to interfere in the scale development process. In particular, it would be very difficult, if not impossible, to control for order effects of different measures while testing the initial factor structure for the new scale. Randomly administering existing measures with the other instruments might contaminate participants' responses on the items for the new scale, but administering the new items first to avoid contamination eliminates an important procedure commonly used when researchers use multiple self-report scales concurrently within a single study. Thus, we believe that it is important to avoid influencing item responses during the initial phase of scale development by limiting the use of additional measures. Although ultimately a matter of researcher judgment, assessing the convergent and discriminant validity (e.g., correlation with other measures) is an important step that we believe should occur later in the process of scale development.

Of the 23 studies in our content analysis, 14 reported a construct or scale definition that guided item generation, and all but 2 studies indicated that item generation was based on prior theoretical and empirical literature in the field. Occasionally, however, we found that articles provided only sparse details in the introductory material articulating the theoretical foundations for the research. The studies in our review used various item-generation approaches. All the approaches involved some form of rational item generation, with the primary variations involving the combination of rational and empirical approaches. Although the extensiveness and specific approaches of the procedures varied widely, only a few studies (n = 2) did not include (or failed to report) expert review of item sets prior to conducting EFA or CFA. Finally, our content analysis showed three typical patterns with respect to the inclusion of validity items during administration to the initial development sample: (a) administering only the scale items (no validity items being included), (b) assessing only social desirability along with the scale items, or (c) administering numerous other scales along with the scale items to provide additional evidence of convergent and discriminant validity.

THE ORDERING OF EFA AND CFA IN NEW SCALE DEVELOPMENT RESEARCH

Researchers typically use CFA after an instrument has already been assessed using EFA, and they want to know if the factor structure produced by EFA fits the data from a new sample. An alternative, less typical approach is to perform CFA to confirm a theoretically driven item set without the prior use of EFA. However, Byrne (2001) stated that "the application of CFA procedures to assessment instruments that are still in the initial stages of development represents a serious misuse of this analytic strategy" (p. 99). Furthermore, reporting the findings of a single CFA is of little advantage over conducting a single EFA. Specifically, research has shown that exploratory methods (i.e., principal-axis and maximum-likelihood factor analysis) are able to recover the correct factor model satisfactorily a majority of the time (Gerbing & Hamilton, 1996). In addition, a key validity issue is the replication of the hypothesized factor structure using a new sample. Thus, rather than produce a CFA that would ultimately need to be followed by a second CFA, the most logical approach would be to conduct an EFA followed by a CFA in all cases. Thus, when developing new scales, researchers should conduct an EFA first, followed by CFA. Regardless of how effectively the researcher believes item generation has reproduced the theorized latent variables, we believe that the initial validation of an instrument should involve empirically appraising the underlying factor structure (i.e., EFA).

Of the 23 new scale development articles we reviewed, a significant majority conducted EFA followed by CFA (n = 10) or only EFA without CFA (n = 8). One article reported using SEM following EFA, but the procedure was inconsistent with CFA. Two smaller subsets of articles reported only CFA (n = 2) or conducted CFA followed by EFA (n = 2). In the two studies in which EFA followed CFA, researchers had produced theoretically derived instruments that they believed required only a confirmation of the hypothesized factor structure (which proved wrong in both cases). As a result, when the hypothesized factor structure did not fit the data using SEM, the researchers reverted to EFA (using the same sample) as a means of uncovering the underlying factor structure—a somewhat questionable procedure that could have been avoided if they had relied on EFA in the first place. The studies that successfully used only CFA included one that reported only a single CFA and another that reported two consecutive CFAs (in which the second replicated the findings of the first).

EFA

Development sample characteristics. Representativeness in scale development research does not follow conventional wisdom—that is, it is not necessary to closely represent any clearly identified population as long as those who would score high and those who would score low are well represented (Gorsuch, 1997). Furthermore, one reason many scholars have consistently advocated for large samples in scale development research (see further on) is that scale variance attributable to specific participants tends to be cancelled by random effects as sample size increases (Tabachnick & Fidell, 2001). Nevertheless, samples that do not adequately represent the population of interest affect factor-structure stability and generalizability. When all participants are drawn from a particular source sharing certain characteristics (e.g., age, education, socioeconomic status, and racial and ethnic group), even large samples will not sufficiently control for the systematic variance produced by these characteristics. Thus, it is advisable to ensure the appropriateness of the development sample to the degree possible before conducting an EFA.

An important caveat with respect to sample characteristics is that in counseling psychology research, there are many potential populations whose members may be difficult to identify or from whom it is particularly difficult to solicit participation (e.g., lesbian, gay, bisexual, and transgender individuals, and persons with disabilities). Under circumstances where a researcher believes that the sample characteristics might be at variance from unknown population characteristics, she or he may be forced to adjust to these unknowns and simply move forward with a sample that is adequate but not ideal (Worthington & Navarro, 2003).

In the studies we reviewed for the content analysis, some form of purposeful sampling from a specific target population was the most common approach, followed by a combination of convenience and purposeful sampling. Only about 25% of the studies used convenience sampling, most often with undergraduate student participants. Three of the studies we reviewed used split samples (i.e., a large sample split into two groups for separate analyses).


Sample size. Sample size is an issue that has received considerable discussion in the literature. There are two central risks with using too few participants: (a) Patterns of covariation may not be stable, because chance can substantially influence correlations among items when the ratio of participants to items is relatively low; and (b) the development sample may not adequately represent the intended population (DeVellis, 2003). Comrey (1973) has been cited often as classifying a variety of sample sizes from very poor (N = 50) to excellent (N = 1,000) based solely on the number of participants in a sample and as recommending at least 300 cases for factor analysis. Gorsuch (1983) has also proposed guidelines for minimum ratios of participants to items (5:1 or 10:1), which have been widely cited in counseling psychology research. However, other authors have pointed out that these general guidelines may be misleading (MacCallum, Widaman, Zhang, & Hong, 1999; Tabachnick & Fidell, 2001; Velicer & Fava, 1998).

In general, there is some agreement that larger sample sizes are likelyto result in more stable correlations among variables and will result ingreater replicability of EFA outcomes. Velicer and Fava (1998) producedevidence indicating that any ratio less than a minimum of three partici-pants per item is inadequate, and there is additional evidence that factorsaturation (the number of items per factor) and item communalities are themost important determinants of adequate sample size (Guadagnoli &Velicer, 1988; MacCallum et al., 1999). Thus, we offer four overarchingguidelines: (a) Sample sizes of at least 300 are generally sufficient in mostcases, (b) sample sizes of 150 to 200 are likely to be adequate with datasets containing communalities higher than .50 or with 10:1 items per fac-tor with factor loadings at approximately |.4|, (c) smaller samples sizesmay be adequate if all communalities are .60 or greater or with at least 4:1items per factor and factor loadings greater than |.6|, and (d) samples sizesless than 100 or with fewer than 3:1 participant-to-item ratios are gener-ally inadequate (Reise, Waller, & Comrey, 2000; Thompson, 2004). Notethat this requires researchers to set a minimum sample size at the outsetand to evaluate the need for additional data collection based on the out-comes of an initial EFA.
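These four guidelines amount to a simple decision rule, which the hypothetical helper below expresses in Python. The thresholds are taken directly from points (a) through (d); the function name, arguments, and verdict strings are only an illustration and are not a substitute for inspecting communalities and loadings from an initial EFA.

```python
def sample_size_adequacy(n, n_items, n_factors, min_communality=None, min_loading=None):
    """Rough EFA sample-size check based on the four guidelines above."""
    if n < 100 or n / n_items < 3:
        return "generally inadequate (N < 100 or fewer than 3:1 cases per item)"
    if n >= 300:
        return "generally sufficient (N >= 300)"
    items_per_factor = n_items / n_factors
    if min_communality is not None and min_communality >= .60:
        return "may be adequate (all communalities .60 or greater)"
    if items_per_factor >= 4 and min_loading is not None and abs(min_loading) > .6:
        return "may be adequate (at least 4:1 items per factor with loadings > |.6|)"
    if n >= 150 and (
        (min_communality is not None and min_communality > .50)
        or (items_per_factor >= 10 and min_loading is not None and abs(min_loading) >= .4)
    ):
        return "likely adequate (N of 150-200 with high communalities or strong saturation)"
    return "uncertain: collect more data or re-evaluate after an initial EFA"


# Hypothetical usage with made-up study characteristics.
print(sample_size_adequacy(n=220, n_items=40, n_factors=4, min_communality=.55))
```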

In our content analysis, absolute magnitude of sample sizes and participant-per-item ratios were virtually the only references made with respect to sample size, and both varied widely. Absolute sample sizes varied from 84 to 411 (M = 258.95; SD = 100.80). Participant-per-item ratios varied from 2:1 to 35:1 (the modal ratio was 3:1). The authors addressed no other sample-size criteria when discussing the adequacy of their sample sizes.

Factorability of the correlation matrix. Although many people are familiar with the previously described standards regarding sample size, the factorability of a data set also has been related to the sizes of correlations in the matrix. Researchers can use Bartlett's (1950) test of sphericity to estimate the probability that correlations in a matrix are 0. However, it is highly susceptible to the influence of sample size and likely to be significant for large samples with relatively small correlations (Tabachnick & Fidell, 2001). Thus, we recommend using this test only if there are fewer than about 5 cases per variable, but this becomes moot with samples containing fewer than three cases per variable (see earlier). In studies with cases-per-item ratios higher than 5:1, we recommend that researchers provide additional evidence for scale factorability.

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is also useful for evaluating factorability. This measure of sampling adequacy accounts for the relationship of partial correlations to the sum of squared correlations. Thus, it indicates the extent to which a correlation matrix actually contains factors or simply chance correlations between a small subset of variables. Tabachnick and Fidell (2001) suggested that values of .60 and higher are required for good factor analysis.
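Both screening statistics can be computed before extraction. The sketch below assumes the third-party Python package factor_analyzer and a hypothetical data file with one column per item; it simply illustrates the two checks described above.

```python
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical development sample: one row per respondent, one column per item.
items = pd.read_csv("development_sample.csv")

chi_square, p_value = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_overall = calculate_kmo(items)

print(f"Bartlett's test of sphericity: chi-square = {chi_square:.2f}, p = {p_value:.4f}")
print(f"Overall KMO = {kmo_overall:.2f}")  # values of .60 or higher suggested above
```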

In our content analysis of scale development articles in JCP, the largest number of studies (n = 11) did not report using any criteria to assess the factorability of the correlation matrix. Although some studies (n = 5) reported using Bartlett's test of sphericity, only one of those studies contained a cases-to-items ratio small enough to provide useful information on the basis of Bartlett's test. Although other studies had cases-to-items ratios less than 5:1, they did not report using Bartlett's test to assess scale factorability. Only 7 of the articles reported the value of KMO as a precursor to completing factor analysis, and a few articles (n = 3) used the participants-per-item ratio as the sole criterion.

Extraction methods. There are a variety of factor-extraction methods based on a number of statistical theories, but the two most commonly known and studied are principal-components analysis (PCA) and common-factors analysis (FA). There has been a protracted debate over the preferred use of PCA versus FA (e.g., principal-axis factoring, maximum-likelihood factoring) as exploratory procedures, which has yet to be resolved (Gorsuch, 2003). We do not intend to examine this debate in detail (see Multivariate Behavioral Research, 1990, Volume 25, Issue 1, for an extensive discussion of the pros and cons of both). However, it is important for researchers to understand the distinct purposes of each technique. The purpose of PCA is to reduce the number of items while retaining as much of the original item variance as possible. The purpose of FA is to understand the latent factors or constructs that account for the shared variance among items. Thus, the purpose of FA is more closely aligned with the development of new scales. In addition, although it has been shown that PCA and FA often produce similar results (Velicer & Jackson, 1990; Velicer, Peacock, & Jackson, 1982), there are several conditions under which FA has been shown to be superior to PCA (Gorsuch, 1990; Tucker, Koopman, & Linn, 1969; Widaman, 1993). Finally, compared with PCA, the outcomes of FA should more effectively generalize to CFA (Floyd & Widaman, 1995). Thus, although there may be other appropriate uses for PCA, we recommend FA for the development of new scales.

An example of the use of FA versus PCA in a simulated data set might illustrate the differences between these two approaches. Imagine that a researcher at a public university is interested in measuring campus climate for diversity. The researcher created 12 items to measure three different aspects of campus climate (each using 4 items): (a) general comfort or safety, (b) openness to diversity, and (c) perceptions of the learning environment. In a sample of 500 respondents, correlations among the 12 variables indicated that one item from each subset did not correlate with any other items on the scale (e.g., no higher than r = .12 for any bivariate pair containing these items). In FA, the three uncorrelated items appropriately drop out of the solution because of low factor loadings (loadings < .23), resulting in a three-factor solution (each factor retaining 3 items). In PCA, the three uncorrelated items load together on a fourth factor (loadings > .45). This example demonstrates that under certain conditions, PCA may overestimate factor loadings and result in erroneous decisions about the number of factors or items to retain.
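A comparison in the spirit of this example can be simulated directly. The sketch below generates a hypothetical 12-item data set (three latent dimensions with three substantive items each, plus three pure-noise items) and extracts four dimensions with both approaches. The packages (NumPy, pandas, factor_analyzer, scikit-learn) and every generating value are our assumptions for illustration, so the resulting loadings will not match the numbers quoted above exactly.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500

# Three latent dimensions, three substantive items each, plus three noise items.
latent = rng.normal(size=(n, 3))
good_items = [latent[:, [k]] + 0.6 * rng.normal(size=(n, 3)) for k in range(3)]
noise_items = rng.normal(size=(n, 3))
items = pd.DataFrame(np.hstack(good_items + [noise_items]))   # 500 x 12 matrix

# Common-factors analysis (principal-factor extraction, oblique rotation).
fa = FactorAnalyzer(n_factors=4, method='principal', rotation='oblimin')
fa.fit(items)
print(np.round(fa.loadings_, 2))   # noise items should show uniformly weak loadings

# Principal-components analysis for comparison.
pca = PCA(n_components=4).fit(items)
pca_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(np.round(pca_loadings, 2))   # noise items tend to pick up a component of their own
```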

We should also make clear that there are several techniques of FA, including principal-axis factoring, maximum likelihood, image factoring, alpha factoring, and unweighted and generalized least squares. Gerbing and Hamilton (1996) have shown that principal-axis factoring and maximum-likelihood approaches are relatively equal in their capacities to extract the correct model when the model is known in the population. However, Gorsuch (1997) points out that maximum-likelihood extractions result in occasional problems that do not occur with principal-axis factoring. Prior to the current use of SEM as a CFA technique, maximum-likelihood extraction had some advantages over other FA procedures as a confirmatory technique (Tabachnick & Fidell, 2001). For further discussion of less commonly used approaches, see Tabachnick and Fidell (2001).

Among the studies in our content analysis, most used some form of FA (n = 10), but a similar number used PCA (n = 9). One study used a combination of PCA and FA, and another did not report an extraction method. (Note: 2 of the 23 studies used only CFA and are not included in the figures reported earlier.) A cursory examination of the publication dates indicates that the majority of studies using PCA were published prior to the majority of those using FA, suggesting a trend away from PCA in favor of FA.

Criteria for determining rotation method. FA rotation methods include two basic types: orthogonal and oblique. Researchers use orthogonal rotations when the set of factors underlying a given item set are assumed or known to be uncorrelated. Researchers use oblique rotations when the factors are assumed or known to be correlated. A discussion of the statistical properties of the various types of orthogonal and oblique rotation methods is beyond the scope of this article (we refer readers to Gorsuch [1983] and Thompson [2004] for such discussions). In practice, researchers can determine whether to use an orthogonal versus oblique rotation during the initial FA based on either theory or data. However, if they discover that the factors appear to be correlated in the data when theory has suggested them to be uncorrelated, it is still most appropriate to rely on the data-based approach and to use an oblique rotation. Although, in some cases, both procedures might produce the same factor structure with the same data, using an orthogonal rotation with correlated factors tends to overestimate loadings (e.g., they will have higher values than with an oblique rotation; Loehlin, 1998). Thus, researchers may retain or reject some items inappropriately, and the factor structure may be more difficult to replicate during CFA.
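One way to run this data-based check is to fit the oblique solution first and inspect how strongly the factors correlate before deciding whether an orthogonal solution is defensible. The sketch below again assumes the factor_analyzer package and the hypothetical items matrix from the earlier examples, and it approximates the factor intercorrelations from estimated factor scores rather than from the rotation's own factor correlation matrix.

```python
import numpy as np
from factor_analyzer import FactorAnalyzer

# Oblique solution first: let the factors correlate and see how large the correlations are.
fa_oblique = FactorAnalyzer(n_factors=3, rotation='oblimin')
fa_oblique.fit(items)
factor_scores = fa_oblique.transform(items)          # estimated factor scores
print(np.round(np.corrcoef(factor_scores.T), 2))     # approximate factor intercorrelations

# If those correlations are trivial, an orthogonal (varimax) rotation is defensible;
# otherwise, report the oblique solution and its loadings.
fa_varimax = FactorAnalyzer(n_factors=3, rotation='varimax')
fa_varimax.fit(items)
print(np.round(fa_varimax.loadings_, 2))
```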

Our content analysis showed that relatively few of the studies in our review reported an adequate rationale for selecting an orthogonal or oblique rotation method, with only 2 using subscale intercorrelations, 3 using theory, and 1 using both. Twelve studies did not specify the criteria used to select a rotation method, and 3 studies actually reported criteria irrelevant to the task (e.g., although the factors were correlated, the orthogonal solution matched the prior expectations for the factor solution). Also, 8 studies used orthogonal rotations despite reporting moderate to high correlations among factors, and 4 studies did not provide factor intercorrelations.

Criteria for factor retention. Researchers can use numerous criteria to estimate the number of factors for a given item set. The most widely known approaches were recommended by Kaiser (1958) and Cattell (1966) on the basis of eigenvalues, which may help determine the importance of a factor and indicate the amount of variance in the entire set of items accounted for by a given factor (for a more detailed explanation of eigenvalues, see Gorsuch, 1983). The iterative process of factor analysis produces successively less useful information with each new factor extracted in a set because each factor extracted after the first is based on the residual of the previous factor's extraction. The eigenvalues produced will be successively smaller with each new factor extracted (accounting for smaller and smaller proportions of variance) until virtually meaningless values result. Thus, Kaiser (1958) believed that eigenvalues less than 1.0 reflect potentially unstable factors. Cattell (1966) used the relative values of eigenvalues to estimate the correct number of factors to examine during factor analysis—a procedure known as the scree test. Using the scree plot, a researcher examines the descending values of eigenvalues to locate a break in the size of eigenvalues, after which the remaining values tend to level off horizontally.
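Both criteria start from the eigenvalues of the item correlation matrix, which are straightforward to compute. The short NumPy sketch below (with items again standing in for a hypothetical respondents-by-items matrix) lists the eigenvalues one would plot for a scree test and counts how many exceed Kaiser's 1.0 cutoff.

```python
import numpy as np

def scree_eigenvalues(data):
    """Eigenvalues of the item correlation matrix, sorted for a scree plot."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

eigs = scree_eigenvalues(items)
print(np.round(eigs, 2))                                   # plot these to locate the break
print("Eigenvalues greater than 1.0:", int(np.sum(eigs > 1.0)))
```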

Parallel analysis (Horn, 1965) is another procedure for deciding how many factors to retain. Generally, when using parallel analysis, researchers randomly order the participants' item scores and conduct a factor analysis on both the original data set and the randomly ordered scores. Researchers determine the number of factors to retain by comparing the eigenvalues determined in the original data set and in the randomly ordered data set. They retain a factor if the original eigenvalue is larger than the eigenvalue from the random data. This has been shown to work reasonably well when using FA (Humphreys & Montanelli, 1975) as well as PCA (Zwick & Velicer, 1986). Parallel analysis is not readily available in commonly used statistical software, but programs are available that conduct parallel analysis when using principal-axis factor analysis and PCA (see O'Connor, 2000).
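Where no packaged routine is available, the permutation version described above is short enough to write directly. The sketch below implements it on correlation-matrix eigenvalues and uses a 95th-percentile comparison against the random data, a common variant; the function and its defaults are our own illustration rather than O'Connor's (2000) programs.

```python
import numpy as np

def parallel_analysis(data, n_permutations=100, percentile=95, seed=0):
    """Permutation-based parallel analysis on correlation-matrix eigenvalues.

    Each item's scores are shuffled independently, which destroys inter-item
    correlations while preserving item distributions; a factor is retained when
    its observed eigenvalue exceeds the chosen percentile of the permuted ones.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    permuted = np.empty((n_permutations, data.shape[1]))
    for i in range(n_permutations):
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        permuted[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(shuffled, rowvar=False)))[::-1]
    threshold = np.percentile(permuted, percentile, axis=0)
    return int(np.sum(observed > threshold))

print("Parallel analysis suggests retaining", parallel_analysis(items), "factors")
```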

Approximating simple structure is another way to evaluate factor retention during EFA. According to McDonald (1985), the term simple structure has two radically different meanings that are often confused. A factor pattern has simple structure (a) if several items load strongly on only one factor and (b) if items have a zero correlation to other factors in the solution. SEM constrains the relationships between items and factors to produce simple structure as defined earlier (which will become important later). McDonald (1985) differentiates this from what he prefers to call approximate simple structure, often reported in counseling psychology research as if it were simple structure, which substitutes the word small (undefined) for the word zero (definitive) in the primary definition. Researchers can estimate approximate simple structure by using rotation methods during FA. In EFA, efforts to produce factor solutions with approximate simple structure are central to decisions about the final number of factors and about the retention and deletion of items in a given solution. If factors share items that cross-load too highly on more than one factor (e.g., > .32), the items are considered complex because they reflect the influence of more than one factor. Approximating simple structure can be achieved through item or factor deletion or both. SEM approaches to CFA assume simple structure, and very closely approximating simple structure during EFA will likely improve the subsequent results of CFA using SEM.

The larger the number of items on a factor, the more confidence one has that it will be a reliable factor in future studies. Thus, with a few minor caveats, some authors have recommended against retaining factors with fewer than three items (Tabachnick & Fidell, 2001). It is possible to retain a factor with only two items if the items are highly correlated (i.e., r > .70) and relatively uncorrelated with other variables. Under these conditions, it may be appropriate to consider other criteria (e.g., interpretability) in deciding whether to retain the factor or to discard it. Nevertheless, it may be best to revisit item-generation procedures to produce additional items intended to load on the factor (which would require a new EFA before moving on to the CFA).

Conceptual interpretability is the definitive factor-retention criterion. In the end, researchers should retain a factor only if they can interpret it in a meaningful way, no matter how solid the evidence for its retention based on the empirical criteria earlier described. EFA is ultimately a combination of empirical and subjective approaches to data analysis because the job is not complete until the solution makes sense. (Note that this is not necessarily true for the criterion-group method of scale development.) At this stage, the researcher should conduct an analysis of the items within each factor to assess the extent to which the items make sense as a group. Although uncommon, it may be useful to submit the item-factor combinations to a small group of experts for external interpretation to avoid a situation in which a factor makes sense to the researcher eager for a viable scale but not to anybody else.

In our content analysis of JCP articles, it appeared that numerous researchers encountered problems reconciling their EFA findings with their conceptual interpretation of the factor solution and occasionally engaged in rationalizations that led to questionable practices. For example, researchers in one study selected a factor solution that fit their preconceived conceptualization of the scale although some of the factors were very highly intercorrelated (e.g., the data indicated fewer factors than the authors adopted). When a researcher desires a specific factor structure that is not adequately reproduced during EFA, the recommended practice would be (a) to adopt the factor solution supported by the data and engage in meaningful interpretation based on those findings and (b) to return to item generation and go back through earlier steps in the scale development process (including EFA). There were a few articles in our content analysis that inappropriately moved forward with CFA after making revisions that were not assessed by EFA.

Criteria for item deletion or retention. Although, on rare occasions, a researcher may retain all the initial items submitted to EFA, item deletion is a very common and expected part of the process. Researchers most often use the values of the item loadings and cross-loadings on the factors to determine whether items should be deleted or retained. Inevitably, this process is intertwined with the process of determining the number of factors that will be retained (described earlier). For example, in some instances, a researcher might be evaluating the relative value of several different factor solutions (e.g., 2, 3, or 4 factors). As such, deleting items before establishing the final number of factors could actually reduce the number of factors retained. On the other hand, unnecessarily retaining items that fail to contribute meaningfully to any of the potential factor solutions will make it more difficult to make a final decision about the number of factors to retain. Thus, the process we recommend is designed to retain potentially meaningful items early in the process and to optimize scale length only after the factor solution is clear.

Most researchers begin EFA with a substantially larger number of items than they ultimately plan to retain. However, there is considerable variation among studies in the proportion of items in the initial pool that are planned for deletion. We recommend that researchers wait until the last step in EFA to trim unnecessary items and focus primarily on empirical scale development procedures at this stage in the process so as not to confuse the purposes of these two similar activities (e.g., item deletion). Thus, researchers should base decisions about whether to retain or delete items at this stage on their contribution to the factor solution rather than on the final length of the scale.

Most researchers use some guideline for a lower limit on item factor loadings and cross-loadings to determine whether to retain or delete items, but the criteria for determining the magnitude of loadings and cross-loadings have been described as a matter of researcher preference (Tabachnick & Fidell, 2001). Larger, more frequent cross-loadings will contribute to factor intercorrelations (requiring oblique rotation) and lesser approximations of simple structure (described earlier). Thus, to the degree possible, researchers should attempt to set their minimum values for factor loadings as high as possible and the absolute magnitude for cross-loadings as low as possible (without compromising scale length or factor structure), which will result in fewer cross-loadings of lower magnitudes and better approximations of simple structure. For example, researchers should delete items with factor loadings less than .32, or with cross-loadings within .15 of the item's highest factor loading. In addition, they should also delete items that contain absolute loadings higher than a certain value (e.g., .32) on two or more factors. However, we urge researchers to use caution when using cross-loadings as a criterion for item deletion until establishing the final factor solution because an item with a relatively high cross-loading could be retained if the factor on which it is cross-loaded is deleted or collapsed into another existing factor.
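These numeric rules are easy to apply mechanically to a rotated pattern matrix. The hypothetical helper below (names and structure are ours) flags, rather than automatically deletes, items against the cutoffs just described; it expects a loadings matrix such as the loadings_ attribute of the fitted factor_analyzer model from the earlier sketches.

```python
import numpy as np

def flag_items(loadings, min_loading=.32, min_gap=.15, cross_cutoff=.32):
    """Flag items against the loading and cross-loading rules described above."""
    flags = {}
    for i, row in enumerate(np.abs(np.asarray(loadings, dtype=float))):
        ordered = np.sort(row)[::-1]
        reasons = []
        if ordered[0] < min_loading:
            reasons.append("highest loading below .32")
        if len(ordered) > 1 and ordered[0] - ordered[1] < min_gap:
            reasons.append("cross-loading within .15 of the highest loading")
        if np.sum(row >= cross_cutoff) >= 2:
            reasons.append("loadings of .32 or higher on two or more factors")
        if reasons:
            flags[i] = reasons
    return flags

print(flag_items(fa.loadings_))   # item index -> list of reasons the item was flagged
```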

Item communalities after rotation can be a useful guide for item deletion as well. Remember that high item communalities are important for determining the factorability of a data set, but they can also be useful in evaluating specific items for deletion or retention because a communality reflects the proportion of item variance accounted for by the factors; it is the squared multiple correlation of the item as predicted from the set of factors in the solution (Tabachnick & Fidell, 2001). Thus, items with low communalities (e.g., less than .40) are not highly correlated with one or more of the factors in the solution.
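Most factor-analysis routines report communalities directly. Continuing the earlier factor_analyzer sketches (an assumption on our part, not the authors' software), flagging low-communality items takes only a few lines.

```python
# Communalities for each item from the fitted FactorAnalyzer object used above;
# values below about .40 mark items that share little variance with the factors.
communalities = fa.get_communalities()
low = [i for i, h2 in enumerate(communalities) if h2 < .40]
print("Items with communalities below .40:", low)
```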


In our content analysis, the most common criteria for item-deletion decisions were absolute values of item loadings and cross-loadings, which were often used in combination. None of the studies we reviewed reported using item communalities as a criterion for deletion, and one study used item-analysis procedures (e.g., contribution to internal consistency reliability). There were no items deleted in two studies, and two others did not specify the criteria for item deletion.

Optimizing scale length. Once the items have been evaluated, it is useful to assess the trade-off between length and reliability to optimize scale length. Longer scales of relatively highly correlated items are generally more reliable, but Converse and Presser (1986) recommended that questionnaires take no longer than 50 minutes to complete. In our experience, scales that take longer than about 15 to 30 minutes might become problematic, depending on the respondents, the intended use of the scale, and the respondents' motivation regarding the purpose of the administration. Thus, scale developers may find it useful to examine the length of each subscale to determine whether it is a reasonable trade-off to sacrifice a small degree of internal consistency to shorten its length. Some statistical packages (e.g., SPSS) allow researchers to compare all the items on a given subscale to identify those that contribute the least to internal consistency, making item deletion with the goal of optimizing scale length relatively easy. Generally, when a factor contains more than the desired number of items, the researcher will have the option of deleting items that (a) have the lowest factor loadings, (b) have the highest cross-loadings, (c) contribute the least to the internal consistency of the scale scores, and (d) have low conceptual consistency with other items on the factor. The researcher should avoid scale-length optimization that degrades the quality of the factor structure, factor intercorrelations, item communalities, factor loadings, or cross-loadings. Ultimately, researchers must conduct a final EFA to ensure that the factor solution does not change after deleting items.
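The internal-consistency side of this trade-off can also be examined outside SPSS. The sketch below computes Cronbach's alpha and an "alpha if item deleted" table with plain NumPy; subscale_items is a hypothetical respondents-by-items array for a single subscale, and the helpers illustrate the standard formula rather than reproduce any package's output.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def alpha_if_item_deleted(scores):
    """Alpha recomputed with each item removed, mirroring the familiar SPSS column."""
    scores = np.asarray(scores, dtype=float)
    return {j: round(cronbach_alpha(np.delete(scores, j, axis=1)), 3)
            for j in range(scores.shape[1])}

print(round(cronbach_alpha(subscale_items), 3))
print(alpha_if_item_deleted(subscale_items))
```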

CFA

SEM versus FA. SEM has become a widely used tool in explaining theoretical models within the social and behavioral sciences (see Martens, 2005; Martens & Hasse, 2006; Quintana & Maxwell, 1999; Weston & Gore, 2006). CFA is one of the most popular uses of SEM. CFA is most commonly used during the scale development process to help support the validity of a scale following an EFA. In the past, a number of published studies have used FA or PCA procedures as confirmatory approaches (Gerbing & Hamilton, 1996). With the increasing availability of computer software, however, most researchers use SEM as the preferred approach for CFA.

In our content analysis, 14 of the studies used SEM as the confirmatory approach. In comparison, 2 studies used PCA as a confirmatory approach (these appeared before SEM was widely applied in counseling psychology research).

Typical SEM approaches. Once a researcher obtains a theoretically meaningful factor structure via EFA, the logical next step is to specify the resulting factor solution in the SEM confirmatory procedure; that is, if the researcher obtains a three-factor oblique factor structure in the EFA, specifying the same correlated three-factor model using SEM and finding good fit of the model to the data in a new sample will help support the reliability of the factor structure and the validity of the scale. Another approach is to compare competing theoretically plausible models (e.g., different numbers of factors, inclusion or exclusion of specific paths). Thus, the researcher can compare the factor structure uncovered in the EFA with alternative models to evaluate which model best fits the data. The hypothesized model's fitting the data better than alternative models is further evidence of construct validity. If an alternative model fits the data better than the hypothesized model, the investigator is obligated to explain how discrepancies between models affect construct validity and then to conduct another study to further validate the newly adopted model (or start over).

Testing nested or hierarchically related models is another typical SEM approach. A model is nested if it is a subset of another model to which it is compared. For example, suppose a researcher conducted a study on an eight-item, course-evaluation survey in which four items assess satisfaction with the readings and homework assigned in the course and the remaining four items assess satisfaction with the professor's sensitivity to diversity, resulting in a two-factor correlated model. However, one could assume that the eight items on the survey assess overall satisfaction with the course, resulting in a one-factor model. If this one-factor model was compared with the correlated two-factor model, the one-factor (restricted) model would be nested within the two-factor (unrestricted) model because the correlation between the two factors in the two-factor model would be set to a value of 1.0 to form the one-factor model. When comparing nested models, researchers use a chi-square difference test to examine whether a significant loss in fit occurs when going from the unrestricted model to the nested (restricted) model (for the statistical formula, see Kline, 2005).
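The chi-square difference test itself is simple to compute once each model's chi-square and degrees of freedom are known; the following sketch uses hypothetical values rather than results from any study reviewed here.

```python
from scipy.stats import chi2

def chi_square_difference(chi2_restricted: float, df_restricted: int,
                          chi2_unrestricted: float, df_unrestricted: int) -> tuple:
    """Chi-square difference test for nested models.

    The restricted (e.g., one-factor) model is nested in the unrestricted
    (e.g., correlated two-factor) model; a significant result indicates a
    significant loss of fit under the restriction.
    """
    delta_chi2 = chi2_restricted - chi2_unrestricted
    delta_df = df_restricted - df_unrestricted
    p_value = chi2.sf(delta_chi2, delta_df)
    return delta_chi2, delta_df, p_value

# Hypothetical example: one-factor model chi2 = 310.4 (df = 20) versus
# two-factor model chi2 = 52.7 (df = 19) gives delta chi2 = 257.7 on 1 df, p < .001,
# so collapsing the two factors into one significantly worsens fit.
print(chi_square_difference(310.4, 20, 52.7, 19))
```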

When structural equation models are not nested (i.e., one model is not a subset of another model), the chi-square difference test is an inappropriate method to assess model fit differences because neither of the two models can serve as a baseline comparison model. Still, there are instances when researchers compare nonhierarchically related models in terms of model fit, such as when testing different theoretical models posited to support the data. In this case, researchers may use fit indices to select among competing models. It is becoming more and more common to compare nonnested models using predictive fit indices (discussed further on), which indicate how well a model will cross-validate in future samples.
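As one illustration of a predictive fit index, a common variant described by Kline (2005) computes the AIC from the model chi-square and the number of freely estimated parameters; software packages differ in the exact formula they report, so the sketch below (with hypothetical values) shows the idea rather than any particular program's output.

```python
import math

def aic_from_chi2(chi2_model: float, n_free_params: int) -> float:
    """One common variant: AIC = model chi-square + 2q, where q is the
    number of freely estimated parameters."""
    return chi2_model + 2 * n_free_params

def bic_from_chi2(chi2_model: float, n_free_params: int, n_obs: int) -> float:
    """An analogous BIC variant: model chi-square + q * ln(N)."""
    return chi2_model + n_free_params * math.log(n_obs)

# Lower values favor the model expected to cross-validate better; for example,
# compare aic_from_chi2(52.7, 21) for model A with aic_from_chi2(61.3, 17) for model B.
```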

Some competing models may be equivalent models; that is, although their parameter configurations appear different, the models are mathematically equivalent and will yield the same chi-square test statistics and goodness-of-fit indices (MacCallum, Wegener, Uchino, & Fabrigar, 1993). Thus, theory should play the strongest role in selecting the appropriate model when comparing equivalent models.

Another SEM approach that may support the construct validity of a scale is called multiple-group analysis. In multiple-group analysis, the same structural equation model may be applied to the data for two or more distinct groups (e.g., male and female) to simultaneously test for invariance (model equivalency) across the two groups by constraining different sets of model parameters to be equal in both groups (for more on conducting multiple-group analysis, see Bentler, 1995; Bollen, 1989; Byrne, 2001).

Of the 10 studies in the content analysis using a confirmatory SEM approach, 2 used the single-model approach wherein the model produced by the EFA was specified in a CFA, and 8 performed model comparisons. Of these 8 studies, 4 evaluated nested models, but only 3 of the 4 used the chi-square difference test when selecting among the nested models. The other 4 studies used fit indices to select among nonnested competing models; of these, 2 used predictive fit indices when selecting among the set of competing models. Researchers compared equivalent and nonequivalent models in 2 of the studies in the content analysis. One of these studies selected a nonequivalent model over 2 equivalent models based on higher values of the fit indices. In the second study, the authors relied on theory when selecting between 2 equivalent models.

Sample-size considerations. The statistical theory underlying SEM is asymptotic, which assumes that large sample sizes are necessary to provide stable parameter estimates (Bentler, 1995). Thus, some researchers have suggested that SEM analyses should not be performed on sample sizes smaller than 200, whereas others recommend minimum sample sizes between 100 and 200 participants (Kline, 2005). Another recommendation is that there should be between 5 and 10 participants per observed variable (Grimm & Yarnold, 1995); yet another guideline is that there should be between 5 and 10 participants per parameter to be estimated (Bentler & Chou, 1987). The findings are mixed in terms of which criterion is best because it depends on various model characteristics, including the number of indicator variables per factor (Marsh, Hau, Balla, & Grayson, 1998), estimation method (Fan, Thompson, & Wang, 1999), nonnormality of the data (West, Finch, & Curran, 1995), as well as the strength of the relationships among indicator variables and latent factors (Velicer & Fava, 1998). However, because there is a clear relationship between sample size and model complexity, we recommend that the researcher account for the number of parameters to be estimated when considering sample size. Given ideal conditions (e.g., enough indicators per factor, high factor loadings, and normally distributed data), we recommend Bentler and Chou's (1987) guideline of at least a 5:1 ratio of participants to number of parameters, with a ratio of 10:1 being optimal. In addition, we do not recommend using SEM on sample sizes smaller than 100 participants.
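The following sketch shows how the participants-per-parameter guideline might be checked for a simple CFA in which each item loads on a single factor and the first loading per factor is fixed for identification; the parameter count and the example numbers are hypothetical.

```python
def cfa_free_parameters(items_per_factor: list, correlated_factors: bool = True) -> int:
    """Rough count of free parameters in a simple CFA where each item loads on
    one factor and one marker loading per factor is fixed to 1: free loadings
    + error variances + factor variances (+ factor covariances if correlated)."""
    m = len(items_per_factor)
    p = sum(items_per_factor)
    free_loadings = p - m                 # one marker loading fixed per factor
    error_variances = p
    factor_variances = m
    factor_covariances = m * (m - 1) // 2 if correlated_factors else 0
    return free_loadings + error_variances + factor_variances + factor_covariances

# Hypothetical example: three correlated factors with 6 items each gives
# 15 + 18 + 3 + 3 = 39 parameters, so at least 5 x 39 = 195 participants,
# with roughly 390 needed for the 10:1 ratio.
q = cfa_free_parameters([6, 6, 6])
print(q, 5 * q, 10 * q)
```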

Only one study in our content analysis reported using one of the earlier described criteria (5 to 10 participants per indicator) to establish an adequate sample size. The remainder of the studies did not specify whether they used particular criteria to evaluate the adequacy of the sample size to conduct SEM. However, we assessed the sample sizes for all the studies included in the content analysis and determined that the remaining studies met the 5:1 ratio of participants to parameters.

TABLE 3: Incremental, Absolute, and Predictive Fit Indices Used in Structural Equation Modeling

Incremental fit indices
  Normed Fit Index (NFI): Bentler & Bonett (1980)
  Incremental Fit Index (IFI): Bollen (1989)
  Nonnormed Fit Index (NNFI) or Tucker-Lewis Index (TLI): Tucker & Lewis (1973)
  Comparative Fit Index (CFI): Bentler (1990)
  Parsimony Comparative Fit Index (PCFI): Mulaik et al. (1989)
  Relative Noncentrality Index (RNI): McDonald & Marsh (1990)

Absolute fit indices
  Chi-square/df ratio: Marsh, Balla, & McDonald (1988)
  Goodness-of-Fit Index (GFI): Jöreskog & Sörbom (1984)
  Adjusted Goodness-of-Fit Index (AGFI): Jöreskog & Sörbom (1984)
  McDonald's Fit Index (MFI) or McDonald's Centrality Index (MCI): McDonald (1989)
  Gamma hat: Steiger (1989)
  Hoelter N: Hoelter (1983)
  Root Mean Square Residual (RMR): Jöreskog & Sörbom (1981)
  Standardized Root Mean Square Residual (SRMR): Bentler (1995)
  Root Mean Square Error of Approximation (RMSEA): Steiger & Lind (1980)

Predictive fit indices
  Akaike's Information Criterion (AIC): Akaike (1987)
  Consistent AIC (CAIC): Bozdogan (1987)
  Bayesian Information Criterion (BIC): Schwarz (1978)
  Expected Cross-Validation Index (ECVI): Browne & Cudeck (1992)

Overall model fit. Researchers typically use a chi-square test statistic as a test of overall model fit in SEM. The chi-square test, however, is often criticized for its sensitivity to sample size (Bentler & Bonett, 1980; Hu & Bentler, 1999). The sample-size dependency of the chi-square test statistic has led to the proposal of numerous alternative fit indices that evaluate model fit, supplementing the chi-square test statistic. These fit indices may be classified as incremental, absolute, or predictive fit indices (Kline, 2005).

Incremental fit indices measure the improvement in a model's fit to the data by comparing a specific structural equation model to a baseline structural equation model. The typical baseline comparison model is the null (or independence) model, in which all the variables are independent of each other or uncorrelated (Bentler & Bonett, 1980). Absolute fit indices measure how well a structural equation model explains the relationships found in the sample data. Predictive fit indices (or information criteria) measure how well the structural equation model would fit in other samples from the same population (see Table 3 for examples of incremental, absolute, and predictive fit indices).

We should note that there are various recommendations about reporting these indices as well as suggested cutoff values for each of these fit indices (e.g., see Hu & Bentler, 1999; Kline, 2005). Researchers have commonly interpreted incremental fit index, goodness-of-fit index, adjusted goodness-of-fit index, and McDonald's Fit Index (MFI) values greater than .90 as an acceptable cutoff (Bentler & Bonett, 1980). More recently, however, SEM researchers have advocated .95 as a more desirable level (e.g., Hu & Bentler, 1999). Values for the standardized root mean square residual (SRMR) less than .10 are generally indicative of acceptable model fit. Values for the root mean square error of approximation (RMSEA) at or less than .05 indicate close model fit, which is customarily considered acceptable. However, debate continues concerning the use of these indices and the cutoff values when fitting structural equation models (e.g., see Marsh, Hau, & Wen, 2004). One reason for this debate is that the findings are mixed in terms of which index is best, and their performance depends on various study characteristics, including the number of variables (Kenny & McCoach, 2003), estimation method (Fan et al., 1999; Hu & Bentler, 1998), model misspecification (Hu & Bentler, 1999), and sample size (Marsh, Balla, & Hau, 1996). Researchers should bear in mind that suggested cutoff criteria are general guidelines and are not necessarily definitive rules.
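For reference, the RMSEA and CFI point estimates can be computed directly from the model and baseline (null) chi-squares using their standard formulas; the 90% confidence interval for the RMSEA requires the noncentral chi-square distribution and is usually taken from SEM software. The values below are hypothetical.

```python
import math

def rmsea(chi2_model: float, df_model: int, n_obs: int) -> float:
    """Point estimate of the RMSEA (Steiger & Lind, 1980)."""
    num = max(chi2_model - df_model, 0.0)
    return math.sqrt(num / (df_model * (n_obs - 1)))

def cfi(chi2_model: float, df_model: int,
        chi2_baseline: float, df_baseline: int) -> float:
    """Comparative Fit Index (Bentler, 1990), comparing the target model
    with the null (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    return 1.0 - d_model / d_baseline if d_baseline > 0 else 1.0

# Hypothetical values: model chi2 = 52.7 (df = 19), null chi2 = 880.2 (df = 28), N = 400
# yield RMSEA of about .067 and CFI of about .96.
print(round(rmsea(52.7, 19, 400), 3), round(cfi(52.7, 19, 880.2, 28), 3))
```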

According to Kline (2005), a minimum collection of these types of fit indices to report would consist of (a) the chi-square test statistic with corresponding degrees of freedom and level of significance, (b) the RMSEA (Steiger & Lind, 1980) with its corresponding 90% confidence interval, (c) the Comparative Fit Index (CFI; Bentler, 1990), and (d) the SRMR (Bentler, 1995). Hu and Bentler (1999) recommended using a two-index combination approach when reporting findings in SEM. More specifically, they recommended using the SRMR accompanied by one of the following indices: Nonnormed Fit Index, Incremental Fit Index, Relative Noncentrality Index, CFI, Gamma hat, MFI, or RMSEA. Although there is evidence that Hu and Bentler's (1999) joint criteria help minimize the possibility of rejecting the right model, there is also evidence that misspecified (incorrect) models could be considered acceptable when using the proposed cutoff criteria (Marsh et al., 2004). Thus, we adopt Kline's (2005) recommendation with respect to the minimum fit indices to report. In addition, because structural equation models are only approximations of the truth, we further recommend that researchers compare competing theoretically plausible models whenever possible and report predictive fit indices (see Table 3) to help ensure that the model will cross-validate in subsequent samples. Finally, and most important, researchers should always base their selections of the appropriate model on relevant theory.

In our content analysis, 12 of the 14 studies using SEM reported the chi-square statistic. All 14 studies reported at least two fit indices. We list the most commonly reported fit indices in these studies in Table 2. Although 7 articles reported the RMSEA, only 1 of these reported its corresponding 90% confidence interval (regarding confidence intervals around the RMSEA, see Quintana & Maxwell, 1999; for more on confidence intervals, see Henson, 2006 [TCP, special issue, part 1]). All but 3 studies assessed model fit using various suggested cutoff criteria (e.g., Bentler, 1990, 1992; Byrne, 2001; Comrey & Lee, 1992; Hu & Bentler, 1999; Kline, 2005; Quintana & Maxwell, 1999). Several of the studies were published after the seminal Hu and Bentler (1999) cutoff-criteria article yet referred to the less stringent cutoff criteria suggested by previous researchers (e.g., .90 for incremental fit indices). Only 3 of the 8 studies in the content analysis comparing competing models (nested or nonnested) reported predictive fit indices.

Model modification. When structural equation models do not demonstrate good fit, researchers often modify (respecify) and subsequently retest models (MacCallum, Roznowski, & Necowitz, 1992). This turns the confirmatory approach back into an exploratory one, but that is of less consequence than not knowing the reasons behind poor model fit. Modification indices are sometimes used to either add or drop parameters in the process of model respecification. For example, the Lagrange Multiplier Modification index estimates the decrease in the chi-square test statistic that would occur if a parameter were to be freely estimated. More specifically, it indicates which parameters could be added to improve model fit by significantly decreasing the chi-square test statistic of overall fit. In contrast, the Wald statistic estimates the increase in the chi-square test statistic that would occur if a parameter were fixed to 0, which is essentially the same as dropping a nonsignificant parameter from the model (Kline, 2005). Researchers have examined the performance of these indices in terms of helping the researcher arrive at the correct structural equation model and have shown these indices to be inaccurate under certain conditions (e.g., Chou & Bentler, 2002; MacCallum, 1986). Thus, applied researchers should be cautious about the accuracy of respecified models when modifications are made using the Lagrange Multiplier and Wald statistics. In the end, theory should guide model respecification, and respecified models should be tested using new samples.
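Because a Lagrange-type modification index approximates the 1-df drop in the model chi-square that freeing a parameter would produce, one simple statistical screen is to compare it with the corresponding critical value, as in the hypothetical sketch below; theory remains the final arbiter.

```python
from scipy.stats import chi2

def worth_freeing(mod_index: float, alpha: float = 0.05) -> bool:
    """Compare a 1-df modification index (the approximate chi-square drop if
    the parameter were freed) with the chi-square critical value. A True
    result is only a statistical screen; theory decides whether to respecify."""
    return mod_index > chi2.ppf(1 - alpha, df=1)

# Hypothetical: an index of 11.3 exceeds the 1-df critical value of 3.84,
# so freeing that parameter would significantly reduce the model chi-square.
print(worth_freeing(11.3))
```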

Researchers may also modify models in terms of the unit of analysis used, such as item parcels. Parceling means either summing or averaging two or more items together to create parcels (sometimes referred to as bundles). These parcels are then used as the unit of analysis in SEM instead of the individual items. It is crucial, however, that researchers in the scale development process do not use item parceling, because item parcels can hide the true relationships among items in the scale (Cattell, 1974). In addition, model misspecification may be hidden when using item parceling (Bandalos & Finney, 2001).
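For readers unfamiliar with the practice, the sketch below (with hypothetical data and names) simply shows what parceling does, which also makes clear why item-level information is lost; again, this approach is not recommended in scale development research.

```python
import pandas as pd

# Hypothetical item-level responses for one factor (respondents x items)
items = pd.DataFrame({
    "item1": [4, 2, 5], "item2": [3, 2, 4],
    "item3": [5, 1, 4], "item4": [4, 3, 5],
})

# Parceling averages (or sums) subsets of items; the parcels, not the items,
# would then serve as the indicators in the SEM, which is why item-level
# misfit can be hidden from the analysis.
parcels = pd.DataFrame({
    "parcel_a": items[["item1", "item3"]].mean(axis=1),
    "parcel_b": items[["item2", "item4"]].mean(axis=1),
})
print(parcels)
```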

The data-driven methods for model respecification in SEM are more appropriate for fine-tuning a model than they are for large-scale respecification of severely misspecified initial models, because multiple misspecification errors interact with each other, making respecification more difficult (Gerbing & Hamilton, 1996). For similar reasons, Gorsuch (1997) suggested that it is possible to use FA procedures as an appropriate alternative to adjusting the confirmatory model when finding misspecification, but this does not imply reversing the typical order of FA prior to SEM in scale development research. Finally, we highly recommend cross-validation of respecified structural equation models to establish predictive validity (MacCallum et al., 1992). Thus, another sample of data should be collected and the respecified model tested in a confirmatory approach.

Of the 14 studies conducting SEM, three examined modification indices (e.g., the Lagrange Multiplier) to assess whether they should add parameters to the model to significantly improve the fit. In two of these three studies, the authors implemented modifications and retested the models. These two studies allowed the errors to covary, and one study also allowed the factors to covary. Neither of the two studies that modified the original structural equation model cross-validated the respecified model in a separate sample. Researchers in two of the studies in the content analysis used item parceling to avoid estimating a large number of parameters and to reduce random error, an approach we do not recommend.

CONCLUSIONS

In this article, we have examined common practices in counseling psychology scale development research using EFA and CFA techniques. We conducted a content analysis of new scale development articles in JCP during 10 years (1995 to 2004) to assess current practices in scale development. We used data from our content analysis to provide information about the typical procedures used in counseling psychology scale development research, and we compared these practices to the current literature on EFA and CFA to make recommendations about best practices (which we summarize further on).

We found that counseling psychology scale development research employed a wide range of procedures. Although we did not conduct a formal trend analysis, our impressions were that the content-analysis data indicated that counseling psychology scale development research became increasingly rigorous and sophisticated during the evaluation period, especially through the declining use of PCA procedures and the increased employment of SEM as a confirmatory procedure. However, we also found a variety of practices that seemed at odds with the current literature on EFA and SEM, which indicated a need for even more rigor and standardization.

Specifically, we found the use of the following new scale development practices to be problematic: (a) employing SEM prior to using EFA, (b) using criteria that varied widely (or were not reported) with respect to determining the adequacy of the sample for both EFA and SEM, (c) failing to report an adequate rationale for selecting orthogonal versus oblique rotation methods, (d) using orthogonal rotation methods during EFA despite clear evidence that the factors were moderately to highly correlated, (e) using inappropriate rationales or ignoring contrary data when identifying and reporting the final factor solution during EFA (e.g., ignoring high factor intercorrelations to retain a preferred factor structure), (f) using questionable criteria as the basis for decisions about item deletion or retention, (g) failing to consider the extent to which the final factor solution achieved adequate approximation of simple structure, (h) making revisions to item content or adding or deleting items between the conclusion of EFA and the initiation of SEM, (i) using criteria and fit indices that varied widely to determine overall model fit during SEM, (j) failing to report confidence intervals when using the RMSEA, (k) using item parcels (bundles) in scale development, and (l) failing to engage in additional cross-validation following model misspecification and modification during SEM.

We offer a number of caveats for the earlier described critique of scale development practices. First, some of these recommendations do not transfer directly to other approaches to empirical scale development (e.g., criterion group) and should be understood as primarily referring to the homogeneous item-grouping approach. Second, it is important to note that EFA is intended to be a flexible statistical procedure to produce the most interpretable solution, which can lead to acceptable variations in practice. Thus, some researchers may disagree on how stringently to use criteria to constrain the process of EFA, and we acknowledge that the subjective and interpretive aspects of scale development may justify variations that arise in specific contexts. Finally, the current literature on both EFA and SEM continues to contain debates and conflicting recommendations that may be at variance with our conclusions. We provide recommendations for best practices here to increase standardization and rigor rather than as a resolution of those ongoing debates, and we expect continued data-driven improvements in best practices.

RECOMMENDED BEST PRACTICES

1. Always provide a definition of the construct that the scale is intended to measure.

2. Use expert review of items prior to submitting them to EFA.

3. In general, EFA should precede CFA.

4. When using EFA, set a preestablished minimum sample size (≥ 100) and then evaluate the need for additional data collection on the basis of an initial EFA using communalities, factor saturation, and factor-loading criteria: (a) sample sizes of 150 to 200 are likely to be adequate with data sets containing communalities higher than .50 or with 10:1 items per factor with factor loadings at approximately |.4|, and (b) smaller sample sizes may be adequate if communalities are all .60 or greater or with at least 4:1 items per factor and factor loadings greater than |.6|.

5. Verify the factorability of the data via a significant Bartlett's test of sphericity (when the participants-to-items ratio is between 3:1 and 5:1), the KMO measure of sampling adequacy (values greater than .60), or both (a minimal sketch of both statistics appears after this list).

6. Recognize and understand the basic differences between PCA and FA extraction methods. For the purpose of scale development, FA is generally preferred over PCA in most instances.

7. Even when theory suggests that factors will be uncorrelated, it is good practice to use an oblique rotation when factors are correlated in the data. Consider using an oblique rotation in the first run of an EFA with each factor solution to empirically establish whether factors might be correlated.

8. Establish in advance which criteria to use for factor retention and item deletion or retention (e.g., delete items with factor loadings less than .32 or with cross-loadings within .15 of an item's highest factor loading; approximate simple structure; parallel analysis; delete factors with fewer than three items unless the items are highly correlated, for example, r > .70).

9. Avoid allowing the influence of preconceived biases (e.g., how the researcher wants the final solution to look) to override important statistical findings when making judgments. Consider using independent judges to assist in decision making if it seems difficult to disentangle researcher bias from conceptual interpretation of EFA results.

10. If conducting scale-length optimization, it is essential to rerun the EFA to ensure that item elimination did not result in changes to factor structure, factor intercorrelations, item communalities, factor loadings, or cross-loadings, so that all of the originally established criteria for these outcomes are still met.

11. Avoid making changes to the scale produced by the final EFA prior to conducting a CFA (e.g., adding new items, deleting items, changing item content, altering the rating scale). If you feel that the outcomes of the EFA are unsatisfactory or that changes to the scale are necessary, it is most appropriate to conduct a new EFA on the revised scale before moving to CFA.

12. Competing-models approaches in SEM seem to be gaining favor in the literature over single-model approaches, indicating that researchers should consider evaluating theoretically plausible competing models, whether nested, nonnested, or equivalent.

13. When using SEM, use model complexity as the central indicator to establish the minimum sample size required before conducting CFA; we recommend a minimum of 5 cases per parameter to be estimated.

14. At a minimum, report the following SEM fit indices: (a) the chi-square with corresponding degrees of freedom and level of significance, (b) the RMSEA with corresponding 90% confidence intervals, (c) the CFI, and (d) the SRMR.

15. When comparing competing models with SEM, add an appropriate predictive fit index to the standard set described earlier (see Table 3).

16. Data-driven methods for model respecification in SEM are more appropriate for fine-tuning than for large-scale respecification of severely misspecified models.

17. The Lagrange Multiplier Modification index may be used for respecifications in which parameters are being added to the model; the Wald statistic may be used for decisions about eliminating parameters from the model. In the end, however, theory should accompany modification procedures using these modification indices.

18. We recommend against item parceling (bundling) in SEM for scale development research because item parcels can hide (a) the true relationships among items in the scale and (b) model misspecification (which runs contrary to the underlying purposes of CFA).

19. Clearly report all of the decisions, rationales, and procedures when using EFA and SEM in scale development research.
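To accompany recommendation 5, the sketch below computes Bartlett's test of sphericity and the overall KMO measure of sampling adequacy directly from a correlation matrix using their standard formulas; it assumes complete data and a nonsingular correlation matrix, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R: np.ndarray, n_obs: int):
    """Bartlett's (1950) test that the correlation matrix is an identity matrix."""
    p = R.shape[0]
    statistic = -(n_obs - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return statistic, df, chi2.sf(statistic, df)

def kmo(R: np.ndarray) -> float:
    """Kaiser-Meyer-Olkin measure of sampling adequacy (overall value);
    values above .60 are generally considered acceptable."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d                       # anti-image (partial) correlations
    off = ~np.eye(R.shape[0], dtype=bool)    # off-diagonal mask
    r2 = np.sum(R[off] ** 2)
    q2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + q2)

# R = np.corrcoef(item_data, rowvar=False)   # item_data: respondents x items
# print(bartlett_sphericity(R, n_obs=len(item_data)), kmo(R))
```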

APPENDIX
Journal of Counseling Psychology Scale Development Articles Reference List (1995 to 2004)

Barber, J. P., Foltz, C., & Weinryb, R. M. (1998). The central relationship questionnaire: Initial report. Journal of Counseling Psychology, 45, 131-142.
Dillon, F. R., & Worthington, R. L. (2003). The Lesbian, Gay, and Bisexual Affirmative Counseling Self-Efficacy Inventory (LGB-CSI): Development, validation, and training implications. Journal of Counseling Psychology, 50, 235-251.
Heppner, P. P., Cooper, C., Mulholland, A., & Wei, M. (2001). A brief, multidimensional, problem-solving psychotherapy outcome measure. Journal of Counseling Psychology, 48, 330-343.
Hill, C. E., & Kellems, I. S. (2002). Development and use of the helping skills measure to assess client perceptions of the effects of training and of helping skills in sessions. Journal of Counseling Psychology, 49, 264-272.
Inman, A. G., Ladany, N., Constantine, M. G., & Morano, C. K. (2001). Development and preliminary validation of the Cultural Values Conflict Scale for South Asian women. Journal of Counseling Psychology, 48, 17-27.
Kim, B. K., Atkinson, D. R., & Yang, P. H. (1999). The Asian Values Scale: Development, factor analysis, validation, and reliability. Journal of Counseling Psychology, 46, 342-352.
Kivlighan, D. M., Multon, K. D., & Brossart, D. F. (1996). Helpful impacts in group counseling: Development of a multidimensional rating system. Journal of Counseling Psychology, 43, 347-355.
Lee, R. M., Choe, J., Kim, G., & Ngo, V. (2000). Construction of the Asian American Family Conflicts Scale. Journal of Counseling Psychology, 47, 211-222.
Lehrman-Waterman, D., & Ladany, N. (2001). Development and validation of the evaluation process within supervision inventory. Journal of Counseling Psychology, 48, 168-177.
Lent, R. W., Hill, C. E., & Hoffman, M. A. (2003). Development and validation of the Counselor Activity Self-Efficacy scales. Journal of Counseling Psychology, 50, 97-108.
Liang, C. T. H., Li, L. C., & Kim, B. S. K. (2004). The Asian American Racism-Related Stress Inventory: Development, factor analysis, reliability, and validity. Journal of Counseling Psychology, 51, 103-114.
Mallinckrodt, B., Gantt, D. L., & Coble, H. M. (1995). Attachment patterns in the psychotherapy relationship: Development of the client attachment to therapist scale. Journal of Counseling Psychology, 42, 307-317.
Miville, M. L., Gelso, C. J., Pannu, R., Liu, W., Touradji, P., Holloway, P., & Fuertes, J. (1999). Appreciating similarities and valuing differences: The Miville-Guzman Universality-Diversity Scale. Journal of Counseling Psychology, 46, 291-307.
Mohr, J. J., & Rochlen, A. B. (1999). Measuring attitudes regarding bisexuality in lesbian, gay male, and heterosexual populations. Journal of Counseling Psychology, 46, 353-369.
Neville, H. A., Lilly, R. L., Duran, G., Lee, R. M., & Browne, L. (2000). Construction and initial validation of the Color-Blind Racial Attitudes Scale (CoBRAS). Journal of Counseling Psychology, 47, 59-70.
O'Brien, K. M., Heppner, M. J., Flores, L. Y., & Bikos, L. H. (1997). The Career Counseling Self-Efficacy Scale: Instrument development and training applications. Journal of Counseling Psychology, 44, 20-31.
Phillips, J. C., Szymanski, D. M., Ozegovic, J. J., & Briggs-Phillips, M. (2004). Preliminary examination and measurement of the internship research training environment. Journal of Counseling Psychology, 51, 240-248.
Rochlen, A. B., Mohr, J. J., & Hargrove, B. K. (1999). Development of the attitudes toward career counseling scale. Journal of Counseling Psychology, 46, 196-206.
Schlosser, L. Z., & Gelso, C. J. (2001). Measuring the working alliance in advisor-advisee relationships in graduate school. Journal of Counseling Psychology, 48, 157-167.
Skowron, E. A., & Friedlander, M. L. (1998). The differentiation of self inventory: Development and initial validation. Journal of Counseling Psychology, 45, 235-246.
Spanierman, L. B., & Heppner, M. J. (2004). Psychosocial Costs of Racism to Whites Scale (PCRW): Construction and initial validation. Journal of Counseling Psychology, 51, 249-262.
Utsey, S. O., & Ponterotto, J. G. (1996). Development and validation of the index of race-related stress. Journal of Counseling Psychology, 43, 490-501.
Wang, Y., Davidson, M. M., Yakushko, O. F., Savoy, H. B., Tan, J. A., & Bleier, J. K. (2003). The Scale of Ethnocultural Empathy: Development, validation, and reliability. Journal of Counseling Psychology, 50, 221-234.

REFERENCES

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Bandalos, D. J., & Finney, S. J. (2001). Item parceling issues in structural equation modeling. In G. A. Marcoulides & R. E. Schumacker (Eds.), New developments and techniques in structural equation modeling (pp. 269-296). Mahwah, NJ: Lawrence Erlbaum.
Bartlett, M. S. (1950). Tests of significance in factor analysis. British Journal of Psychology, 3, 77-85.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238-246.
Bentler, P. M. (1992). On the fit of models to covariances and methodology to the Bulletin. Psychological Bulletin, 112, 400-404.
Bentler, P. M. (1995). EQS: Structural equations program manual. Encino, CA: Multivariate Software.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.
Bentler, P. M., & Chou, C.-P. (1987). Practical issues in structural modeling. Sociological Methods & Research, 16, 78-117.
Bollen, K. A. (1989). A new incremental fit index for general structural equation models. Sociological Methods & Research, 17, 303-316.
Bozdogan, H. (1987). Model selection and Akaike's information criteria (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.
Brown, F. G. (1983). Principles of educational and psychological testing (3rd ed.). New York: Holt, Rinehart, & Winston.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21, 230-258.
Byrne, B. M. (2001). Structural equation modeling with AMOS: Basic concepts, applications and programming. Mahwah, NJ: Lawrence Erlbaum.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.
Cattell, R. B. (1974). Radial item parcel factoring vs. item factoring in defining personality structure in questionnaires: Theory and experimental checks. Australian Journal of Psychology, 26, 103-119.
Chou, C., & Bentler, P. M. (2002). Model modification in structural equation modeling by imposing constraints. Computational Statistics and Data Analysis, 41, 271-287.
Comrey, A. L. (1973). A first course in factor analysis. New York: Academic Press.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Converse, J. M., & Presser, S. (1986). Survey questions: Handcrafting the standardized questionnaire. Newbury Park, CA: Sage.
Dawis, R. V. (1987). Scale construction. Journal of Counseling Psychology, 34, 481-489.
DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA: Sage.
Fan, X., Thompson, B., & Wang, L. (1999). Effects of sample size, estimation methods, and model specification on structural equation modeling fit indexes. Structural Equation Modeling, 6, 56-83.
Fassinger, R. E. (1987). Use of structural equation modeling in counseling psychology research. Journal of Counseling Psychology, 34, 425-436.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286-299.
Friedenberg, L. (1995). Psychological testing: Design, analysis, and use. Boston, MA: Allyn and Bacon.
Gerbing, D. W., & Hamilton, J. G. (1996). Viability of exploratory factor analysis as a precursor to confirmatory factor analysis. Structural Equation Modeling, 3, 62-72.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gorsuch, R. L. (1990). Common factor analysis versus principal components analysis: Some well and little known facts. Multivariate Behavioral Research, 25, 33-39.
Gorsuch, R. L. (1997). Exploratory factor analysis: Its role in item analysis. Journal of Personality Assessment, 68, 532-560.
Gorsuch, R. L. (2003). Factor analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology: Research methods in psychology (Vol. 2, pp. 143-164). Hoboken, NJ: John Wiley.
Grimm, L. G., & Yarnold, P. R. (1995). Reading and understanding multivariate statistics. Washington, DC: American Psychological Association.
Guadagnoli, E., & Velicer, W. F. (1988). The relationship of sample size to the stability of component patterns. Psychological Bulletin, 103, 265-275.
Helms, J. E., Henze, K. T., Sass, T. L., & Mifsud, V. A. (2006). Treating Cronbach's alpha reliability as data in nonpsychometric substantive applied research. The Counseling Psychologist, 34, 630-660.
Henson, R. K. (2006). Effect-size measures and meta-analytic thinking in counseling psychology research. The Counseling Psychologist, 34, 601-629.
Hoelter, J. W. (1983). The analysis of covariance structures: Goodness-of-fit indices. Sociological Methods & Research, 11, 325-344.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185.
Hoyt, W. T., Warbasse, R. E., & Chu, E. Y. (2006). Construct validation in counseling psychology research. The Counseling Psychologist, 34, 769-805.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424-453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Humphreys, L. G., & Montanelli, R. G. (1975). An investigation of the parallel analysis criterion for determining the number of common factors. Multivariate Behavioral Research, 10, 193-205.
Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relations by the method of maximum likelihood. Chicago: International Educational Services.
Jöreskog, K. G., & Sörbom, D. (1984). LISREL 6: A guide to the program and applications. Chicago: SPSS.
Kahn, J. H. (2006). Factor analysis in counseling psychology research, training, and practice: Principles, advances, and applications. The Counseling Psychologist, 34, 684-718.
Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23, 187-200.
Kenny, D. A., & McCoach, D. B. (2003). Effect of the number of variables on measures of fit in structural equation modeling. Structural Equation Modeling, 10, 333-351.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford.
Loehlin, J. C. (1998). Latent variable models: An introduction to factor, path, and structural analysis (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
MacCallum, R. C. (1986). Specification searches in covariance structure modeling. Psychological Bulletin, 107, 247-255.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covariance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111, 490-504.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114, 185-199.
MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4, 84-99.
Marsh, H. W., Balla, J. R., & Hau, K. T. (1996). An evaluation of incremental fit indices: A clarification of mathematical and empirical properties. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 315-353). Mahwah, NJ: Lawrence Erlbaum.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103, 391-410.
Marsh, H. W., Hau, K.-T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? The number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral Research, 33, 181-220.
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11, 320-341.
Martens, M. P. (2005). The use of structural equation modeling in counseling psychology research. The Counseling Psychologist, 33, 269-298.
Martens, M. P., & Hasse, R. F. (2006). Advanced applications of structural equation modeling in counseling psychology research. The Counseling Psychologist, 34, 878-911.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum.
McDonald, R. P. (1989). An index of goodness-of-fit based on noncentrality. Journal of Classification, 6, 97-103.
McDonald, R. P., & Marsh, H. W. (1990). Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 107, 247-255.
Mulaik, S. A., James, L. R., Van Alstine, J., Bennett, N., Lind, S., & Stilwell, C. D. (1989). Evaluation of goodness-of-fit indices for structural equation models. Psychological Bulletin, 105, 430-445.
O'Connor, B. P. (2000). SPSS and SAS programs for determining the number of components using parallel analysis and Velicer's MAP test. Behavior Research Methods, Instruments, and Computers, 32, 396-402.
Quintana, S. M., & Maxwell, S. E. (1999). Implications of recent developments in structural equation modeling for counseling psychology. The Counseling Psychologist, 27, 485-527.
Quintana, S. M., & Minami, T. (2006). Guidelines for meta-analyses of counseling psychology research. The Counseling Psychologist, 34, 839-876.
Reise, S. P., Waller, N. G., & Comrey, A. L. (2000). Factor analysis and scale revision. Psychological Assessment, 12, 287-297.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
Sherry, A. (2006). Discriminant analysis in counseling psychology research. The Counseling Psychologist, 34, 661-683.
Steiger, J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: SYSTAT.
Steiger, J. H., & Lind, J. C. (1980, May). Statistically based tests for the number of common factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). New York: Harper & Row.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association.
Tinsley, H. E. A., & Tinsley, D. J. (1987). Uses of factor analysis in counseling psychology research. Journal of Counseling Psychology, 34, 414-424.
Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor analytic research procedures by means of simulated correlation matrices. Psychometrika, 34, 421-459.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.
Velicer, W. F., & Fava, J. L. (1998). Effects of variable and subject sampling on factor pattern recovery. Psychological Methods, 3, 231-251.
Velicer, W. F., & Jackson, D. N. (1990). Component analysis versus common factor analysis: Some issues in selecting an appropriate procedure. Multivariate Behavioral Research, 25, 1-28.
Velicer, W. F., Peacock, A. C., & Jackson, D. N. (1982). A comparison of component and factor patterns: A Monte Carlo approach. Multivariate Behavioral Research, 17, 371-388.
West, S. G., Finch, J. F., & Curran, P. J. (1995). Structural equation models with nonnormal variables: Problems and remedies. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 56-75). Thousand Oaks, CA: Sage.
Weston, R., & Gore, P. A., Jr. (2006). SEM 101: A brief guide to structural equation modeling. The Counseling Psychologist, 34, 719-751.
Widaman, K. F. (1993). Common factor analysis versus principal components analysis: Differential bias in representing model parameters? Multivariate Behavioral Research, 28, 263-311.
Worthington, R. L., & Navarro, R. L. (2003). Pathways to the future: Analyzing the contents of a content analysis. The Counseling Psychologist, 31, 85-92.
Zwick, W. R., & Velicer, W. F. (1986). Factors influencing five rules for determining the number of components to retain. Psychological Bulletin, 99, 432-442.
