
Measurement Validity: A Shared Standard for Qualitative and Quantitative Research

ROBERT ADCOCK and DAVID COLLIER
University of California, Berkeley

American Political Science Review, Vol. 95, No. 3, September 2001

Scholars routinely make claims that presuppose the validity of the observations and measurements that operationalize their concepts. Yet, despite recent advances in political science methods, surprisingly little attention has been devoted to measurement validity. We address this gap by exploring four themes. First, we seek to establish a shared framework that allows quantitative and qualitative scholars to assess more effectively, and communicate about, issues of valid measurement. Second, we underscore the need to draw a clear distinction between measurement issues and disputes about concepts. Third, we discuss the contextual specificity of measurement claims, exploring a variety of measurement strategies that seek to combine generality and validity by devoting greater attention to context. Fourth, we address the proliferation of terms for alternative measurement validation procedures and offer an account of the three main types of validation most relevant to political scientists.

Researchers routinely make complex choices about linking concepts to observations, that is, about connecting ideas with facts. These choices raise the basic question of measurement validity: Do the observations meaningfully capture the ideas contained in the concepts? We will explore the meaning of this question as well as procedures for answering it. In the process we seek to formulate a methodological standard that can be applied in both qualitative and quantitative research.

Measurement validity is specifically concerned with whether operationalization and the scoring of cases adequately reflect the concept the researcher seeks to measure. This is one aspect of the broader set of analytic tasks that King, Keohane, and Verba (1994, chap. 2) call descriptive inference, which also encompasses, for example, inferences from samples to populations. Measurement validity is distinct from the validity of causal inference (chap. 3), which Cook and Campbell (1979) further differentiate into internal and external validity.1 Although measurement validity is interconnected with causal inference, it stands as an important methodological topic in its own right.

New attention to measurement validity is overdue in political science. While there has been an ongoing concern with applying various tools of measurement validation (Berry et al. 1998; Bollen 1993; Elkins 2000; Hill, Hanna, and Shafqat 1997; Schrodt and Gerner 1994), no major statement on this topic has appeared since Zeller and Carmines (1980) and Bollen (1989). Although King, Keohane, and Verba (1994, 25, 152-5) cover many topics with remarkable thoroughness, they devote only brief attention to measurement validity. New thinking about measurement, such as the idea of measurement as theory testing (Jacoby 1991, 1999), has not been framed in terms of validity.

Four important problems in political science research can be addressed through renewed attention to measurement validity. The first is the challenge of establishing shared standards for quantitative and qualitative scholars, a topic that has been widely discussed (King, Keohane, and Verba 1994; see also Brady and Collier 2001; George and Bennett n.d.). We believe the skepticism with which qualitative and quantitative researchers sometimes view each other's measurement tools does not arise from irreconcilable methodological differences. Indeed, substantial progress can be made in formulating shared standards for assessing measurement validity. The literature on this topic has focused almost entirely on quantitative research, however, rather than on integrating the two traditions. We propose a framework that yields standards for measurement validation, and we illustrate how these apply to both approaches. Many of our quantitative and qualitative examples are drawn from recent comparative work on democracy, a literature in which both groups of researchers have addressed similar issues. This literature provides an opportunity to identify parallel concerns about validity as well as differences in specific practices.

A second problem concerns the relation between measurement validity and disputes about the meaning of concepts. The clarification and refinement of concepts is a fundamental task in political science, and carefully developed concepts are, in turn, a major prerequisite for meaningful discussions of measurement validity. Yet, we argue that disputes about concepts involve different issues from disputes about measurement validity. Our framework seeks to make this distinction clear, and we illustrate both types of disputes.


Robert Adcock ([email protected]) is a Ph.D. candidate, Department of Political Science, and David Collier ([email protected]) is Professor of Political Science, University of California, Berkeley, CA 94720-1950.

Among the many colleagues who have provided helpful comments on this article, we especially thank Christopher Achen, Kenneth Bollen, Henry Brady, Edward Carmines, Rubette Cowan, Paul Dosh, Zachary Elkins, John Gerring, Kenneth Greene, Ernst Haas, Edward Haertel, Peter Houtzager, Diana Kapiszewski, Gary King, Marcus Kurtz, James Mahoney, Sebastian Mazzuca, Doug McAdam, Gerardo Munck, Charles Ragin, Sally Roever, Eric Schickler, Jason Seawright, Jeff Sluyter, Richard Snyder, Ruth Stanley, Laura Stoker, and three anonymous reviewers. The usual caveats apply. Robert Adcock's work on this project was supported by a National Science Foundation Graduate Fellowship.

1 These involve, respectively, the validity of causal inferences about the cases being studied, and the generalizability of causal inferences to a broader set of cases (Cook and Campbell 1979, 50-9, 70-80).


A third problem concerns the contextual specificity of measurement validity, an issue that arises when a measure that is valid in one context is invalid in another. We explore several responses to this problem that seek a middle ground between a universalizing tendency, which is inattentive to contextual differences, and a particularizing approach, which is skeptical about the feasibility of constructing measures that transcend specific contexts. The responses we explore seek to incorporate sensitivity to context as a strategy for establishing equivalence across diverse settings.

A fourth problem concerns the frequently confusing language used to discuss alternative procedures for measurement validation. These procedures have often been framed in terms of different types of validity, among which content, criterion, convergent, and construct validity are the best known. Numerous other labels for alternative types have also been coined, and we have found 37 different adjectives that have been attached to the noun "validity" by scholars wrestling with issues of conceptualization and measurement.2

The situation sometimes becomes further confused, given contrasting views on the interrelations among different types of validation. For example, in recent validation studies in political science, one valuable analysis (Hill, Hanna, and Shafqat 1997) treats convergent validation as providing evidence for construct validation, whereas another (Berry et al. 1998) treats these as distinct types. In the psychometrics tradition (i.e., in the literature on psychological and educational testing) such problems have spurred a theoretically productive reconceptualization. This literature has emphasized that the various procedures for assessing measurement validity must be seen, not as establishing multiple independent types of validity, but rather as providing different types of evidence for validity. In light of this reconceptualization, we differentiate between validity and validation. We use "validity" to refer only to the overall idea of measurement validity, and we discuss alternative procedures for assessing validity as different "types of validation." In the final part of this article we offer an overview of three main types of validation, seeking to emphasize how procedures associated with each can be applied by both quantitative and qualitative researchers.

In the first section of this article we introduce a framework for discussing conceptualization, measurement, and validity. We then situate questions of validity in relation to broader concerns about the meaning of concepts. Next, we address contextual specificity and equivalence, followed by a review of the evolving discussion of types of validation. Finally, we focus on three specific types of validation that merit central attention in political science: content, convergent/discriminant, and nomological/construct validation.

    OVERVIEW OF MEASUREMENT VALIDITY

Measurement validity should be understood in relation to issues that arise in moving between concepts and observations.

    Levels and Tasks

We depict the relationship between concepts and observations in terms of four levels, as shown in Figure 1. At the broadest level is the background concept, which encompasses the constellation of potentially diverse meanings associated with a given concept. Next is the systematized concept, the specific formulation of a concept adopted by a particular researcher or group of researchers. It is usually formulated in terms of an explicit definition. At the third level are indicators, which are also routinely called measures. This level includes any systematic scoring procedure, ranging from simple measures to complex aggregated indexes. It encompasses not only quantitative indicators but also the classification procedures employed in qualitative research. At the fourth level are scores for cases, which include both numerical scores and the results of qualitative classification.

Downward and upward movement in Figure 1 can be understood as a series of research tasks. On the left-hand side, conceptualization is the movement from the background concept to the systematized concept. Operationalization moves from the systematized concept to indicators, and the scoring of cases applies indicators to produce scores. Moving up on the right-hand side, indicators may be refined in light of scores, and systematized concepts may be fine-tuned in light of knowledge about scores and indicators. Insights derived from these levels may lead to revisiting the background concept, which may include assessing alternative formulations of the theory in which a particular systematized concept is embedded. Finally, to define a key overarching term, measurement involves the interaction among levels 2 to 4.
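For readers who think in code, the following schematic sketch (ours, not the authors'; all names and example content are invented) renders the four levels and the downward tasks of conceptualization, operationalization, and scoring in Python.

```python
# Schematic sketch of the four levels in Figure 1 (hypothetical illustration;
# the names and example content are invented, not the authors').
from dataclasses import dataclass
from typing import Dict

# Level 1: background concept, a constellation of potentially diverse meanings
background_concept = {"democracy": ["procedural minimum",
                                    "expanded procedural minimum",
                                    "social/economic democracy"]}

@dataclass
class SystematizedConcept:
    """Level 2: the explicit formulation adopted by a researcher."""
    name: str
    definition: str

# Conceptualization: choosing one formulation from the background concept
concept = SystematizedConcept(
    name="democracy (procedural minimum)",
    definition="contested elections with broad suffrage and civil liberties")

# Level 3: an indicator, i.e., any systematic scoring or classification procedure
def indicator(case: Dict[str, bool]) -> int:
    """Operationalization: returns 1 (democracy) or 0 (nondemocracy)."""
    return int(case["contested_elections"] and case["broad_suffrage"])

# Level 4: scoring cases by applying the indicator (case data invented)
cases = {"Country A": {"contested_elections": True, "broad_suffrage": True},
         "Country B": {"contested_elections": True, "broad_suffrage": False}}
scores = {name: indicator(data) for name, data in cases.items()}
print(scores)  # {'Country A': 1, 'Country B': 0}
```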

    Defining Measurement Validity

Valid measurement is achieved when scores (including the results of qualitative classification) meaningfully capture the ideas contained in the corresponding concept. This definition parallels that of Bollen (1989, 184), who treats validity as concerned with whether a variable measures what it is supposed to measure. King, Keohane, and Verba (1994, 25) give essentially the same definition.

If the idea of measurement validity is to do serious methodological work, however, its focus must be further specified, as emphasized by Bollen (1989, 197). Our specification involves both ends of the connection between concepts and scores shown in Figure 1. At the concept end, our basic point (explored in detail below) is that measurement validation should focus on the relation between observations and the systematized concept; any potential disputes about the background concept should be set aside as an important but separate issue. With regard to scores, an obvious but crucial point must be stressed: Scores are never examined in isolation; rather, they are interpreted and given meaning in relation to the systematized concept.

2 We have found the following adjectives attached to "validity" in discussions of conceptualization and measurement: a priori, apparent, assumption, common-sense, conceptual, concurrent, congruent, consensual, consequential, construct, content, convergent, criterion-related, curricular, definitional, differential, discriminant, empirical, face, factorial, incremental, instrumental, intrinsic, linguistic, logical, nomological, postdictive, practical, pragmatic, predictive, rational, response, sampling, status, substantive, theoretical, and trait. A parallel proliferation of adjectives, in relation to the concept of democracy, is discussed in Collier and Levitsky 1997.



In sum, measurement is valid when the scores (level 4 in Figure 1), derived from a given indicator (level 3), can meaningfully be interpreted in terms of the systematized concept (level 2) that the indicator seeks to operationalize. It would be cumbersome to refer repeatedly to all these elements, but the appropriate focus of measurement validation is on the conjunction of these components.

    Measurement Error, Reliability, and Validity

Validity is often discussed in connection with measurement error and reliability. Measurement error may be systematic, in which case it is called bias, or random. Random error, which occurs when repeated applications of a given measurement procedure yield inconsistent results, is conventionally labeled a problem of reliability. Methodologists offer two accounts of the relation between reliability and validity. (1) Validity is sometimes understood as exclusively involving bias, that is, error that takes a consistent direction or form. From this perspective, validity involves systematic error, whereas reliability involves random error (Carmines and Zeller 1979, 14-5; see also Babbie 2001, 144-5). Therefore, unreliable scores may still be correct on average and in this sense valid. (2) Alternatively, some scholars hesitate to view scores as valid if they contain large amounts of random error. They believe validity requires the absence of both types of error. Therefore, they view reliability as a necessary but not sufficient condition of measurement validity (Kirk and Miller 1986, 20; Shively 1998, 45).

    FIGURE 1. Conceptualization and Measurement: Levels and Tasks



Our goal is not to adjudicate between these accounts but to state them clearly and to specify our own focus, namely, the systematic error that arises when the links among systematized concepts, indicators, and scores are poorly developed. This involves validity in the first sense stated above. Of course, the random error that routinely arises in scoring cases is also important, but it is not our primary concern.
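The contrast between systematic and random error, and thus between the two accounts above, can be illustrated with a small simulation (ours, with invented quantities): repeated applications of an indicator average away random error, so unreliable scores can still be correct on average, while bias survives averaging.

```python
# Illustrative simulation of systematic vs. random measurement error
# (hypothetical sketch; all quantities are invented).
import random

random.seed(0)
true_scores = [2.0, 5.0, 8.0]  # true values on the systematized concept
BIAS = 1.5                     # systematic error: consistent direction (validity problem)
NOISE_SD = 1.0                 # random error: inconsistent results (reliability problem)

def measure(true_value: float) -> float:
    return true_value + BIAS + random.gauss(0.0, NOISE_SD)

# Repeated applications of the same measurement procedure
repetitions = [[measure(t) for t in true_scores] for _ in range(10_000)]
means = [sum(rep[i] for rep in repetitions) / len(repetitions)
         for i in range(len(true_scores))]

print([round(m, 2) for m in means])  # ~[3.5, 6.5, 9.5]: noise washes out, bias remains
```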

A final point should be emphasized. Because error is a pervasive threat to measurement, it is essential to view the interpretations of scores in relation to systematized concepts as falsifiable claims (Messick 1989, 13-4). Scholars should treat these claims just as they would any causal hypothesis, that is, as tentative statements that require supporting evidence. Validity assessment is the search for this evidence.

MEASUREMENT VALIDITY AND CHOICES ABOUT CONCEPTS

A growing body of work considers the systematic analysis of concepts an important component of political science methodology.3 How should we understand the relation between issues of measurement validity and broader choices about concepts, which are a central focus of this literature?

Conceptual Choices: Forming the Systematized Concept

We view systematized concepts as the point of departure for assessing measurement validity. How do scholars form such concepts? Because background concepts routinely include a variety of meanings, the formation of systematized concepts often involves choosing among them. The number of feasible options varies greatly. At one extreme are concepts such as triangle, which are routinely understood in terms of a single conceptual systematization; at the other extreme are contested concepts (Gallie 1956), such as democracy. A careful examination of diverse meanings helps clarify the options, but ultimately choices must be made. These choices are deeply intertwined with issues of theory, as emphasized in Kaplan's (1964, 53) paradox of conceptualization: "Proper concepts are needed to formulate a good theory, but we need a good theory to arrive at the proper concepts. . . . The paradox is resolved by a process of approximation: the better our concepts, the better the theory we can formulate with them, and in turn, the better the concepts available for the next, improved theory." Various examples of this intertwining are explored in recent analyses of important concepts, such as Laitin's (2000) treatment of language community and Kurtz's (2000) discussion of peasant. Fearon and Laitin's (2000) analysis of ethnic conflict, in which they begin with their hypothesis and ask what operationalization is needed to capture the conceptions of ethnic group and ethnic conflict entailed in this hypothesis, further illustrates the interaction of theory and concepts.

In dealing with the choices that arise in establishing the systematized concept, researchers must avoid three common traps. First, they should not misconstrue the flexibility inherent in these choices as suggesting that everything is up for grabs. This is rarely, if ever, the case. In any field of inquiry, scholars commonly associate a matrix of potential meanings with the background concept. This matrix limits the range of plausible options, and the researcher who strays outside it runs the risk of being dismissed or misunderstood. We do not mean to imply that the background concept is entirely fixed. It evolves over time, as new understandings are developed and old ones are revised or fall from use. At a given time, however, the background concept usually provides a relatively stable matrix. It is essential to recognize that a real choice is being made, but it is no less essential to recognize that this is a limited choice.

Second, scholars should avoid claiming too much in defending their choice of a given systematized concept. It is not productive to treat other options as self-evidently ruled out by the background concept. For example, in the controversy over whether democracy versus nondemocracy should be treated as a dichotomy or in terms of gradations, there is too much reliance on claims that the background concept of democracy inherently rules out one approach or the other (Collier and Adcock 1999, 546-50). It is more productive to recognize that scholars routinely emphasize different aspects of a background concept in developing systematized concepts, each of which is potentially plausible. Rather than make sweeping claims about what the background concept really means, scholars should present specific arguments, linked to the goals and context of their research, that justify their particular choices.

A third problem occurs when scholars stop short of providing a fleshed-out account of their systematized concepts. This requires not just a one-sentence definition, but a broader specification of the meaning and entailments of the systematized concept. Within the psychometrics literature, Shepard (1993, 417) summarizes what is required: both an internal model of interrelated dimensions or subdomains of the systematized concept, and an external model depicting its relationship to other [concepts]. An example is Bollen's (1990, 9-12; see also Bollen 1980) treatment of political democracy, which distinguishes the two dimensions of political rights and political liberties, clarifies these by contrasting them with the dimensions developed by Dahl, and explores the relation between them. Bollen further specifies political democracy through contrasts with the concepts of stability and social or economic democracy. In the language of Sartori (1984, 51-4), this involves clarifying the semantic field.

3 Examples of earlier work in this tradition are Sartori 1970, 1984 and Sartori, Riggs, and Teune 1975. More recent studies include Collier and Levitsky 1997; Collier and Mahon 1993; Gerring 1997, 1999, 2001; Gould 1999; Kurtz 2000; Levitsky 1998; Schaffer 1998. Important work in political theory includes Bevir 1999; Freeden 1996; Gallie 1956; Pitkin 1967, 1987.



One consequence of this effort to provide a fleshed-out account may be the recognition that the concept needs to be disaggregated. What begins as a consideration of the internal dimensions or components of a single concept may become a discussion of multiple concepts. In democratic theory an important example is the discussion of majority rule and minority rights, which are variously treated as components of a single overall concept of democracy, as dimensions to be analyzed separately, or as the basis for forming distinct subtypes of democracy (Dahl 1956; Lijphart 1984; Schmitter and Karl 1992). This kind of refinement may result from new conceptual and theoretical arguments or from empirical findings of the sort that are the focus of the convergent/discriminant validation procedures discussed below.

Measurement Validity and the Systematized Versus Background Concept

We stated earlier that the systematized concept, rather than the background concept, should be the focus in measurement validation. Consider an example. A researcher may ask: Is it appropriate that Mexico, prior to the year 2000 (when the previously dominant party handed over power after losing the presidential election), be assigned a score of 5 out of 10 on an indicator of democracy? Does this score really capture how democratic Mexico was compared to other countries? Such a question remains underspecified until we know whether democratic refers to a particular systematized concept of democracy, or whether this researcher is concerned more broadly with the background concept of democracy. Scholars who question Mexico's score should distinguish two issues: (1) a concern about measurement, that is, whether the indicator employed produces scores that can be interpreted as adequately capturing the systematized concept used in a given study; and (2) a conceptual concern, that is, whether the systematized concept employed in creating the indicator is appropriate vis-a-vis the background concept of democracy.

We believe validation should focus on the first issue, whereas the second is outside the realm of measurement validity. This distinction seems especially appropriate in view of the large number of contested concepts in political science. The more complex and contested the background concept, the more important it is to distinguish issues of measurement from fundamental conceptual disputes. To pose the question of validity we need a specific conceptual referent against which to assess the adequacy of a given measure. A systematized concept provides that referent. By contrast, if analysts seek to establish measurement validity in relation to a background concept with multiple competing meanings, they may find a different answer to the validity question for each meaning.

By restricting the focus of measurement validation to the systematized concept, we do not suggest that political scientists should ignore basic conceptual issues. Rather, arguments about the background concept and those about validity can be addressed adequately only when each is engaged on its own terms, rather than conflated into one overly broad issue. Consider Schumpeter's (1947, chap. 21) procedural definition of democracy. This definition explicitly rules out elements of the background concept, such as the concern with substantive policy outcomes, that had been central to what he calls the classical theory of democracy. Although Schumpeter's conceptualization has been very influential in political science, some scholars (Harding and Petras 1988; Mouffe 1992) have called for a revised conception that encompasses other concerns, such as social and economic outcomes. This important debate exemplifies the kind of conceptual dispute that should be placed outside the realm of measurement validity.

Recognizing that a given conceptual choice does not involve an issue of measurement validity should not preclude considered arguments about this choice. An example is the argument that minimal definitions can facilitate causal assessment (Alvarez et al. 1996, 4; Karl 1990, 1-2; Linz 1975, 181-2; Sartori 1975, 34). For instance, in the debate about a procedural definition of democracy, a pragmatic argument can be made that if analysts wish to study the causal relationship between democracy and socioeconomic equality, then the latter must be excluded from the systematization of the former. The point is that such arguments can effectively justify certain conceptual choices, but they involve issues that are different from the concerns of measurement validation.

Fine-Tuning the Systematized Concept with Friendly Amendments

We define measurement validity as concerned with the relation among scores, indicators, and the systematized concept, but we do not rule out the introduction of new conceptual ideas during the validation process. Key here is the back-and-forth, iterative nature of research emphasized in Figure 1. Preliminary empirical work may help in the initial formulation of concepts. Later, even after conceptualization appears complete, the application of a proposed indicator may produce unexpected observations that lead scholars to modify their systematized concepts. These "friendly amendments" occur when a scholar, out of a concern with validity, engages in further conceptual work to suggest refinements or make explicit earlier implicit assumptions. These amendments are friendly because they do not fundamentally challenge a systematized concept but instead push analysts to capture more adequately the ideas contained in it.

A friendly amendment is illustrated by the emergence of the "expanded procedural minimum" definition of democracy (Collier and Levitsky 1997, 442-4). Scholars noted that, despite free or relatively free elections, some civilian governments in Central and South America to varying degrees lacked effective power to govern. A basic concern was the persistence of "reserved domains" of military power over which elected governments had little authority (Valenzuela 1992, 70). Because procedural definitions of democracy did not explicitly address this issue, measures based upon them could result in a high democracy score for these countries, but it appeared invalid to view them as democratic. Some scholars therefore amended their systematized concept of democracy to add the differentiating attribute that the elected government must to a reasonable degree have the power to rule (Karl 1990, 2; Loveman 1994, 108-13; Valenzuela 1992, 70). Debate persists over the scoring of specific cases (Rabkin 1992, 165), but this innovation is widely accepted among scholars in the procedural tradition (Huntington 1991, 10; Mainwaring, Brinks, and Pérez-Liñán 2001; Markoff 1996, 102-4). As a result of this friendly amendment, analysts did a better job of capturing, for these new cases, the underlying idea of procedural minimum democracy.



VALIDITY, CONTEXTUAL SPECIFICITY, AND EQUIVALENCE

Contextual specificity is a fundamental concern that arises when differences in context potentially threaten the validity of measurement. This is a central topic in psychometrics, the field that has produced the most innovative work on validity theory. This literature emphasizes that the same score on an indicator may have different meanings in different contexts (Moss 1992, 236-8; see also Messick 1989, 15). Hence, the validation of an interpretation of scores generated in one context does not imply that the same interpretation is valid for scores generated in another context. In political science, this concern with context can arise when scholars are making comparisons across different world regions or distinct historical periods. It can also arise in comparisons within a national (or other) unit, given that different subunits, regions, or subgroups may constitute very different political, social, or cultural contexts.

The potential difficulty that context poses for valid measurement, and the related task of establishing measurement equivalence across diverse units, deserve more attention in political science. In a period when the quest for generality is a powerful impulse in the social sciences, scholars such as Elster (1999, chap. 1) have strongly challenged the plausibility of seeking general, law-like explanations of political phenomena. A parallel constraint on the generality of findings may be imposed by the contextual specificity of measurement validity. We are not arguing that the quest for generality be abandoned. Rather, we believe greater sensitivity to context may help scholars develop measures that can be validly applied across diverse contexts. This goal requires concerted attention to the issue of equivalence.

    Contextual Specificity in Political Research

Contextual specificity affects many areas of political science. It has long been a problem in cross-national survey research (Sicinski 1970; Verba 1971; Verba, Nie, and Kim 1978, 32-40; Verba et al. 1987, Appendix). An example concerning features of national context is Cain and Ferejohn's (1981) discussion of how the differing structure of party systems in the United States and Great Britain should be taken into account when comparing party identification. Context is also a concern for survey researchers working within a single nation, who wrestle with the dilemma of inter-personally incomparable responses (Brady 1985). For example, scholars debate whether a given survey item has the same meaning for different population subgroups, which could be defined, for example, by region, gender, class, or race. One specific concern is whether population subgroups differ systematically in their "response style" (also called "response sets"). Some groups may be more disposed to give extreme answers, and others may tend toward moderate answers (Greenleaf 1992). Bachman and O'Malley (1984) show that response style varies consistently with race. They argue that apparently important differences across racial groups may in part reflect only a different manner of answering questions. Contextual specificity also can be a problem in survey comparisons over time, as Baumgartner and Walker (1990) point out in discussing group membership in the United States.

The issue of contextual specificity of course also arises in macro-level research in international and comparative studies (Bollen, Entwisle, and Anderson 1993, 345). Examples from the field of comparative politics are discussed below. In international relations, attention to context, and particularly a concern with historicizing the concept of structure, is central to constructivism (Ruggie 1998, 875). Constructivists argue that modern international relations rest upon "constitutive rules" that differ fundamentally from those of both medieval Christendom and the classical Greek world (p. 873). Although they recognize that sovereignty is an organizing principle applicable across diverse settings, the constructivists emphasize that the meaning and behavioral implications of this principle vary from one historical context to another (Reus-Smit 1997, 567). On the other side of this debate, neorealists such as Fischer (1993, 493) offer a general warning: "If pushed to an extreme, the claim to context dependency threatens to make impossible the collective pursuit of empirical knowledge." He also offers specific historical support for the basic neorealist position that the behavior of actors in international politics follows consistent patterns. Fischer (1992, 463, 465) concludes that the structural logic of action under anarchy has the character of an "objective law," which is grounded in an "unchanging essence of human nature."

The recurring tension in social research between particularizing and universalizing tendencies reflects in part contrasting degrees of concern with contextual specificity. The approaches to establishing equivalence discussed below point to the option of a middle ground. These approaches recognize that contextual differences are important, but they seek to combine this insight with the quest for general knowledge.



The lessons for political science are clear. Any empirical assessment of measurement validity is necessarily based on a particular set of cases, and validity claims should be made, at least initially, with reference to this specific set. To the extent that the set is heterogeneous in ways that may affect measurement validity, it is essential to (1) assess the implications for establishing equivalence across these diverse contexts and, if necessary, (2) adopt context-sensitive measures. Extension to additional cases requires similar procedures.

Establishing Equivalence: Context-Specific Domains of Observation

One important means of establishing equivalence across diverse contexts is careful reasoning, in the initial stages of operationalization, about the specific domains to which a systematized concept applies. Well before thinking about particular scoring procedures, scholars may need to make context-sensitive choices regarding the parts of the broader polity, economy, or society to which they will apply their concept. Equivalent observations may require, in different contexts, a focus on what at a concrete level might be seen as distinct types of phenomena.

Some time ago, Verba (1967) called attention to the importance of context-specific domains of observation. In comparative research on political opposition in stable democracies, a standard focus is on political parties and legislative politics, but Verba (pp. 122-3) notes that this may overlook an analytically equivalent form of opposition that crystallizes, in some countries, in the domain of interest group politics. Skocpol (1992, 6) makes a parallel argument in questioning the claim that the United States was a "welfare laggard" because social provision was not launched on a large scale until the New Deal. This claim is based on the absence of standard welfare programs of the kind that emerged earlier in Europe but fails to recognize the distinctive forms of social provision in the United States, such as veterans' benefits and support for mothers and children. Skocpol argues that the welfare laggard characterization resulted from looking in the wrong place, that is, in the wrong domain of policy.

Locke and Thelen (1995, 1998) have extended this approach in their discussion of "contextualized comparison." They argue that scholars who study national responses to external pressure for economic decentralization and flexibilization routinely focus on the points at which conflict emerges over this economic transformation. Yet, these "sticking points" may be located in different parts of the economic and political system. With regard to labor politics in different countries, such conflicts may arise over wage equity, hours of employment, workforce reduction, or shop-floor reorganization. These different domains of labor relations must be examined in order to gather analytically equivalent observations that adequately tap the concept of sticking point. Scholars who look only at wage conflicts run the risk of omitting, for some national contexts, domains of conflict that are highly relevant to the concept they seek to measure.

By allowing the empirical domain to which a systematized concept is applied to vary across the units being compared, analysts may take a productive step toward establishing equivalence among diverse contexts. This practice must be carefully justified, but under some circumstances it can make an important contribution to valid measurement.

Establishing Equivalence: Context-Specific Indicators and Adjusted Common Indicators

Two other ways of establishing equivalence involve careful work at the level of indicators. We will discuss context-specific indicators,4 and what we call adjusted common indicators. In this second approach, the same indicator is applied to all cases but is weighted to compensate for contextual differences.

An example of context-specific indicators is found in Nie, Powell, and Prewitt's (1969, 377) five-country study of political participation. For all the countries, they analyze four relatively standard attributes of participation. Regarding a fifth attribute, membership in a political party, they observe that in four of the countries party membership has a roughly equivalent meaning, but in the United States it has a different form and meaning. The authors conclude that involvement in U.S. electoral campaigns reflects an equivalent form of political participation. Nie, Powell, and Prewitt thus focus on a context-specific domain of observation (the procedure just discussed above) by shifting their attention, for the U.S. context, from party membership to campaign participation. They then take the further step of incorporating within their overall index of political participation context-specific indicators that for each case generate a score for what they see as the appropriate domain. Specifically, the overall index for the United States includes a measure of campaign participation rather than party membership.
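A schematic reconstruction may clarify the logic (this is our hypothetical sketch, not Nie, Powell, and Prewitt's actual coding scheme): the index sums four standard components everywhere and adds a fifth, context-specific component, party membership in most countries but campaign activity in the United States.

```python
# Hypothetical sketch of a context-specific indicator within an index
# (invented items and coding; not the authors' actual procedure).

STANDARD_ITEMS = ["voting", "contacting_officials",
                  "community_activity", "political_discussion"]

def participation_index(respondent: dict, country: str) -> int:
    score = sum(respondent[item] for item in STANDARD_ITEMS)
    # Fifth component varies by context to preserve equivalence:
    if country == "United States":
        score += respondent["campaign_activity"]  # functional equivalent in the U.S.
    else:
        score += respondent["party_membership"]
    return score

respondent = {"voting": 1, "contacting_officials": 0, "community_activity": 1,
              "political_discussion": 1, "campaign_activity": 1, "party_membership": 0}
print(participation_index(respondent, "United States"))  # 4
print(participation_index(respondent, "Norway"))         # 3
```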

A different example of context-specific indicators is found in comparative-historical research, in the effort to establish a meaningful threshold for the onset of democracy in the nineteenth and early twentieth century, as opposed to the late twentieth century. This effort in turn lays a foundation for the comparative analysis of transitions to democracy. One problem in establishing equivalence across these two eras lies in the fact that the plausible agenda of "full" democratization has changed dramatically over time. Full by the standards of an earlier period is incomplete by later standards. For example, by the late twentieth century, universal suffrage and the protection of civil rights for the entire national population had come to be considered essential features of democracy, but in the nineteenth century they were not (Huntington 1991, 7, 16). Yet, if the more recent standard is applied to the earlier period, cases are eliminated that have long been considered classic examples of nascent democratization in Europe. One solution is to compare regimes with respect to a systematized concept of full democratization that is operationalized according to the norms of the respective periods (Collier 1999, chap. 1; Russett 1993, 15; see also Johnson 1999, 118). Thus, a different scoring procedure, a context-specific indicator, is employed in each period in order to produce scores that are comparable with respect to this systematized concept.5

4 This approach was originally used by Przeworski and Teune (1970, chap. 6), who employed the label "system-specific indicator."



Adjusted common indicators are another way to establish equivalence. An example is found in Moene and Wallerstein's (2000) quantitative study of social policy in advanced industrial societies, which focuses specifically on public social expenditures for individuals outside the labor force. One component of their measure is public spending on health care. The authors argue that in the United States such health care expenditures largely target those who are not members of the labor force. By contrast, in other countries health expenditures are allocated without respect to employment status. Because U.S. policy is distinctive, the authors multiply health care expenditures in the other countries by a coefficient that lowers their scores on this variable. Their scores are thereby made roughly equivalent, as part of a measure of public expenditures on individuals outside the labor force, to the U.S. score. A parallel effort to establish equivalence in the analysis of economic indicators is provided by Zeitsch, Lawrence, and Salerian (1994, 169), who use an adjustment technique in estimating total factor productivity to take account of the different operating environments, and hence the different context, of the industries they compare. Expressing indicators in per-capita terms is also an example of adjusting indicators in light of context. Overall, this practice is used to address both very specific problems of equivalence, as with the Moene and Wallerstein example, as well as more familiar concerns, such as standardizing by population.
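The arithmetic of such an adjustment is simple, as the following sketch shows (our hypothetical rendering; the coefficient and spending figures are invented, not Moene and Wallerstein's data): a common indicator is used for all cases, but non-U.S. scores are deflated so they count only spending plausibly directed at those outside the labor force.

```python
# Hypothetical sketch of an adjusted common indicator
# (coefficient and spending figures invented for illustration).

health_spending = {"United States": 6.0, "Sweden": 8.0, "Germany": 7.5}  # % of GDP

# Assumed share of non-U.S. health spending reaching those outside the labor force
ADJUSTMENT_COEFFICIENT = 0.4

def adjusted_health_score(country: str) -> float:
    raw = health_spending[country]
    if country == "United States":
        return raw  # U.S. spending already targets those outside the labor force
    return raw * ADJUSTMENT_COEFFICIENT  # deflate to restore equivalence

for country in health_spending:
    print(country, round(adjusted_health_score(country), 2))
# United States 6.0 / Sweden 3.2 / Germany 3.0
```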

Context-specific indicators and adjusted common indicators are not always a step forward, and some scholars have self-consciously avoided them. The use of such indicators should match the analytic goal of the researcher. For example, many who study democratization in the late twentieth century deliberately adopt a minimum definition of democracy in order to concentrate on a limited set of formal procedures. They do this out of a conviction that these formal procedures are important, even though they may have different meanings in particular settings. Even a scholar such as O'Donnell (1993, 1355), who has devoted great attention to contextualizing the meaning of democracy, insists on the importance of also retaining a minimal definition of political democracy that focuses on basic formal procedures. Thus, for certain purposes, it can be analytically productive to adopt a standard definition that ignores nuances of context and apply the same indicator to all cases.

In conclusion, we note that although Przeworski and Teune's (1970) and Verba's arguments about equivalence are well known, issues of contextual specificity and equivalence have not received adequate attention in political science. We have identified three tools for addressing these issues: context-specific domains of observation, context-specific indicators, and adjusted common indicators. We encourage their wider use, and we also advocate greater attention to justifying their use. Claims about the appropriateness of contextual adjustments should not simply be asserted; their validity needs to be carefully defended. Later, we explore three types of validation that may be fruitfully applied in assessing proposals for context-sensitive measurement. In particular, content validation, which focuses on whether operationalization captures the ideas contained in the systematized concept, is central to determining whether and how measurement needs to be adjusted in particular contexts.

ALTERNATIVE PERSPECTIVES ON TYPES OF VALIDATION

Discussions of measurement validity are confounded by the proliferation of different types of validation, and by an even greater number of labels for them. In this section we review the emergence of a unified conception of measurement validity in the field of psychometrics, propose revisions in the terminology for talking about validity, and examine the important treatments of validation in political analysis offered by Carmines and Zeller, and by Bollen.

    Evolving Understandings of Validity

In the psychometric tradition, current thinking about measurement validity developed in two phases. In the first phase, scholars wrote about types of validity in a way that often led researchers to treat each type as if it independently established a distinct form of validity. In discussing this literature we follow its terminology by referring to "types of validity." As noted above, in the rest of this article we refer instead to "types of validation."

The first pivotal development in the emergence of a unified approach occurred in the 1950s and 1960s, when a threefold typology of content, criterion, and construct validity was officially established in reaction to the confusion generated by the earlier proliferation of types.6 Other labels continued to appear in other disciplines, but this typology became an orthodoxy in psychology.

5 A well-known example of applying different standards for democracy in making comparisons across international regions is Lipset (1959, 73-4).

6 The second of these is often called criterion-related validity. Regarding these official standards, see American Psychological Association 1954, 1966; Angoff 1988, 25; Messick 1989, 16-7; Shultz, Riggs, and Kottke 1998, 267-9. The 1954 standards initially presented four types of validity, which became the threefold typology in 1966 when predictive and concurrent validity were combined as criterion-related validity.


A recurring metaphor in that field characterized the three types as something of a "holy trinity" representing "three different roads to psychometric salvation" (Guion 1980, 386). These types may be briefly defined as follows.

Content validity assesses the degree to which an indicator represents the universe of content entailed in the systematized concept being measured.

Criterion validity assesses whether the scores produced by an indicator are empirically associated with scores for other variables, called criterion variables, which are considered direct measures of the phenomenon of concern.

Construct validity has had a range of meanings. One central focus has been on assessing whether a given indicator is empirically associated with other indicators in a way that conforms to theoretical expectations about their interrelationship.

These labels remain very influential and are still the centerpiece in some discussions of measurement validity, as in the latest edition of Babbie's (2001, 143-4) widely used methods textbook for undergraduates.

The second phase grew out of increasing dissatisfaction with the trinity and led to a "unitarian" approach (Shultz, Riggs, and Kottke 1998, 269-71). A basic problem identified by Guion (1980, 386) and others was that the threefold typology was too often taken to mean that any one type was sufficient to establish validity (Angoff 1988, 25). Scholars increasingly argued that the different types should be subsumed under a single concept. Hence, to continue with the prior metaphor, the earlier trinity came to be seen "in a monotheistic mode as the three aspects of a unitary psychometric divinity" (p. 25).

Much of the second phase involved a reconceptualization of construct validity and its relation to content and criterion validity. A central argument was that the latter two may each be necessary to establish validity, but neither is sufficient. They should be understood as part of a larger process of validation that integrates multiple sources of evidence and requires the combination of logical argument and empirical evidence (Shepard 1993, 406). Alongside this development, a reconceptualization of construct validity led to a more comprehensive and theory-based view that subsumed other more limited perspectives (Shultz, Riggs, and Kottke 1998, 270). This broader understanding of construct validity as the overarching goal of a single, integrated process of measurement validation is widely endorsed by psychometricians. Moss (1995, 6) states there is "a close to universal consensus among validity theorists that content- and criterion-related evidence of validity are simply two of many types of evidence that support construct validity."

Thus, in the psychometric literature (e.g., Messick 1980, 1015), the term construct validity has become essentially a synonym for what we call measurement validity. We have adopted "measurement validity" as the name for the overall topic of this article, in part because in political science the label "construct validity" commonly refers to specific procedures rather than to the general idea of valid measurement. These specific procedures generally do not encompass content validation and have in common the practice of assessing measurement validity by taking as a point of reference established conceptual and/or theoretical relationships.

We find it helpful to group these procedures into two types according to the kind of theoretical or conceptual relationship that serves as the point of reference. Specifically, these types are based on the heuristic distinction between description and explanation.7 First, some procedures rely on descriptive expectations concerning whether given attributes are understood as facets of the same phenomenon. This is the focus of what we label convergent/discriminant validation. Second, other procedures rely on relatively well-established explanatory causal relations as a baseline against which measurement validity is assessed. In labeling this second group of procedures we draw on Campbell's (1960, 547) helpful term, "nomological validation," which evokes the idea of assessment in relation to well-established causal hypotheses or law-like relationships. This second type is often called construct validity in political research (Berry et al. 1998; Elkins 2000).8 Out of deference to this usage, in the headings and summary statements below we will refer to nomological/construct validation.

    Types of Validation in Political Analysis

A baseline for the revised discussion of validation presented below is provided in work by Carmines and Zeller, and by Bollen. Carmines and Zeller (1979, 26; Zeller and Carmines 1980, 78-80) argue that content validation and criterion validation are of limited utility in fields such as political science. While recognizing that content validation is important in psychology and education, they argue that evaluating it "has proved to be exceedingly difficult" with respect to measures of the more abstract phenomena that tend to characterize the social sciences (Carmines and Zeller 1979, 22). For criterion validation, these authors emphasize that in many social sciences, few criterion variables are available that can serve as "real" measures of the phenomena under investigation, against which scholars can evaluate alternative measures (pp. 19-20). Hence, for many purposes it is simply not a relevant procedure. Although Carmines and Zeller call for the use of multiple sources of evidence, their emphasis on the limitations of the first two types of validation leads them to give a predominant role to nomological/construct validation.

In relation to Carmines and Zeller, Bollen (1989, 185-6, 190-4) adds convergent/discriminant validation to their three types and emphasizes content validation, which he sees as both viable and fundamental. He also raises general concerns about correlation-based approaches to convergent and nomological/construct validation, and he offers an alternative approach based on structural equation modeling with latent variables (pp. 192-206). Bollen shares the concern of Carmines and Zeller that, for most social research, "true" measures do not exist against which criterion validation can be carried out, so he likewise sees this as a less relevant type (p. 188).

7 Description and explanation are of course intertwined, but we find this distinction invaluable for exploring contrasts among validation procedures. While these procedures do not always fit in sharply bounded categories, many do indeed focus on either descriptive or explanatory relations and hence are productively differentiated by our typology.

8 See also the main examples of construct validation presented in the major statements by Carmines and Zeller 1979, 23, and Bollen 1989, 189-90.



These valuable contributions can be extended in several respects. First, with reference to Carmines and Zeller's critique of content validation, we recognize that this procedure is harder to use if concepts are abstract and complex. Moreover, it often does not lend itself to the kind of systematic, quantitative analysis routinely applied in some other kinds of validation. Yet, like Bollen (1989, 185-6, 194), we are convinced it is possible to lay a secure foundation for content validation that will make it a viable, and indeed essential, procedure. Our discussion of this task below derives from our distinction between the background and the systematized concept.

Second, we share the conviction of Carmines and Zeller that nomological/construct validation is important, yet given our emphasis on content and convergent/discriminant validation, we do not privilege it to the degree they do. Our discussion will seek to clarify some aspects of how this procedure actually works and will address the skeptical reaction of many scholars to it.

Third, we have a twofold response to the critique of criterion validation as irrelevant to most forms of social research. On the one hand, in some domains criterion validation is important, and this must be recognized. For example, the literature on response validity in survey research seeks to evaluate individual responses to questions, such as whether a person voted in a particular election, by comparing them to official voting records (Anderson and Silver 1986; Clausen 1968; Katosh and Traugott 1980). Similarly, in panel studies it is possible to evaluate the adequacy of recall (i.e., whether respondents remember their own earlier opinions, dispositions, and behavior) through comparison with responses in earlier studies (Niemi, Katz, and Newman 1980). On the other hand, this is not one of the most generally applicable types of validation, and we favor treating it as one subtype within the broader category of convergent validation. As discussed below, convergent validation compares a given indicator with one or more other indicators of the concept, indicators in which the analyst may or may not have a higher level of confidence. Even if these other indicators are as fallible as the indicator being evaluated, the comparison provides greater leverage than does looking only at one of them in isolation. To the extent that a well-established, direct measure of the phenomenon under study is available, convergent validation is essentially the same as criterion validation.
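In its simplest quantitative form, this comparison is just a correlation between two indicators of the same systematized concept, as in the sketch below (ours; the scores are invented). A strong association provides convergent evidence; a weak one flags a potential validity problem in one or both indicators.

```python
# Minimal sketch of convergent validation: correlate two fallible indicators
# of the same concept (invented scores; requires Python 3.10+ for correlation).
from statistics import correlation

indicator_a = [1, 3, 4, 6, 8, 9]  # proposed indicator, scores for six cases
indicator_b = [2, 2, 5, 5, 7, 9]  # alternative indicator of the same concept

r = correlation(indicator_a, indicator_b)
print(round(r, 2))  # high r = convergent evidence; low r = potential validity problem
```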

Finally, in contrast both to Carmines and Zeller and to Bollen, we will discuss the application of the different types of validation in qualitative as well as quantitative research, using examples drawn from both traditions. Furthermore, we will employ crucial distinctions introduced above, including the differentiation of levels presented in Figure 1, as well as the contrast between specific procedures for validation and the overall idea of measurement validity.

THREE TYPES OF MEASUREMENT VALIDATION: QUALITATIVE AND QUANTITATIVE EXAMPLES

We now discuss various procedures, both qualitative and quantitative, for assessing measurement validity. We organize our presentation in terms of a threefold typology: content, convergent/discriminant, and nomological/construct validation. The goal is to explicate each of these types by posing a basic question that, in all three cases, can be addressed by both qualitative and quantitative scholars. Two caveats should be introduced. First, while we discuss correlation-based approaches to validity assessment, this article is not intended to provide a detailed or exhaustive account of relevant statistical tests. Second, we recognize that no rigid boundaries exist among alternative procedures, given that one occasionally shades off into another. Our typology is a heuristic device that shows how validation procedures can be grouped in terms of basic questions, and thereby helps bring into focus parallels and contrasts in the approaches to validation adopted by qualitative and quantitative researchers.

    Content Validation

Basic Question. In the framework of Figure 1, does a given indicator (level 3) adequately capture the full content of the systematized concept (level 2)? This adequacy of content is assessed through two further questions. First, are key elements omitted from the indicator? Second, are inappropriate elements included in the indicator?9 An examination of the scores (level 4) of specific cases may help answer these questions about the fit between levels 2 and 3.
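The two questions have a natural set-theoretic form. In this sketch (our illustration; the component lists are invented), the systematized concept is a set of required elements, the indicator is the set of elements it actually taps, and the omitted and inappropriate elements fall out as set differences.

```python
# Schematic sketch of the two content-validation questions
# (component lists invented for illustration).

concept_elements = {"contested_elections", "universal_suffrage", "civil_liberties"}
indicator_elements = {"contested_elections", "male_suffrage", "turnout"}

omitted = concept_elements - indicator_elements        # key elements left out
inappropriate = indicator_elements - concept_elements  # elements that do not belong

print("Omitted:", sorted(omitted))              # ['civil_liberties', 'universal_suffrage']
print("Inappropriate:", sorted(inappropriate))  # ['male_suffrage', 'turnout']
```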

Discussion. In contrast to the other types considered, content validation is distinctive in its focus on conceptual issues, specifically, on what we have just called adequacy of content. Indeed, it developed historically as a corrective to forms of validation that focused solely on the statistical analysis of scores, and in so doing overlooked important threats to measurement validity (Sireci 1998, 83-7).

Because content validation involves conceptual reasoning, it is imperative to maintain the distinction we made between issues of validation and questions concerning the background concept. If content validation is to be useful, then there must be some ground of conceptual agreement about the phenomena being investigated (Bollen 1989, 186; Cronbach and Meehl 1955, 282). Without it, a well-focused validation question may rapidly become entangled in a broader dispute over the concept. Such agreement can be provided if the systematized concept is taken as given, so attention can be focused on whether a particular indicator adequately captures its content.

9 Some readers may think of these questions as raising issues of face validity. We have found so many different definitions of face validity that we prefer not to use this label.

Examples of Content Validation. Within the psychometric tradition (Angoff 1988, 27-8; Shultz, Riggs, and Kottke 1998, 267-8), content validation is understood as focusing on the relationship between the indicator (level 3) and the systematized concept (level 2), without reference to the scores of specific cases (level 4). We will first present examples from political science that adopt this focus. We will then turn to a somewhat different, case-oriented procedure (Ragin 1987, chap. 3), identified with qualitative research, in which the examination of scores for specific cases plays a central role in content validation.

Two examples from political research illustrate, respectively, the problems of omission of key elements from the indicator and inclusion of inappropriate elements. Paxton's (2000) article on democracy focuses on the first problem. Her analysis is particularly salient for scholars in the qualitative tradition, given its focus on choices about the dichotomous classification of cases. Paxton contrasts the systematized concepts of democracy offered by several prominent scholars (Bollen, Gurr, Huntington, Lipset, Muller, and Rueschemeyer, Stephens, and Stephens) with the actual content of the indicators they propose. She takes their systematized concepts as given, which establishes common conceptual ground. She observes that these scholars include universal suffrage in what is in effect their systematized concept of democracy, but the indicators they employ in operationalizing the concept consider only male suffrage. Paxton thus focuses on the problem that an important component of the systematized concept is omitted from the indicator.

The debate on Vanhanen's (1979, 1990) quantitative indicator of democracy illustrates the alternative problem, that the indicator incorporates elements that correspond to a concept other than the systematized concept of concern. Vanhanen seeks to capture the idea of political competition that is part of his systematized concept of democracy by including, as a component of his scale, the percentage of votes won by parties other than the largest party. Bollen (1990, 13, 15) and Coppedge (1997, 6) both question this measure of democracy, arguing that it incorporates elements drawn from a distinct concept, the structure of the party system.

Case-Oriented Content Validation. Researchers engaged in the qualitative classification of cases routinely carry out a somewhat different procedure for content validation, based on the relation between conceptual meaning and choices about scoring particular cases. In the vocabulary of Sartori (1970, 1040-6), this concerns the relation between the intension (meaning) and extension (set of positive cases) of the concept. For Sartori, an essential aspect of concept formation is the procedure of adjusting this relation between cases and concept. In the framework of Figure 1, this procedure involves revising the indicator (i.e., the scoring procedure) in order to sort cases in a way that better fits conceptual expectations, and potentially fine-tuning the systematized concept to better fit the cases. Ragin (1994, 98) terms this process of mutual adjustment "double fitting." This procedure avoids "conceptual stretching" (Collier and Mahon 1993; Sartori 1970), that is, a mismatch between a systematized concept and the scoring of cases, which is clearly an issue of validity.

An example of case-oriented content validation is found in O'Donnell's (1996) discussion of democratic consolidation. Some scholars suggest that one indicator of consolidation is the capacity of a democratic regime to withstand severe crises. O'Donnell argues that by this standard, some Latin American democracies would be considered more consolidated than those in southern Europe. He finds this an implausible classification because the standard leads to a "reductio ad absurdum" (p. 43). This example shows how attention to specific cases can spur recognition of dilemmas in the adequacy of content and can be a productive tool in content validation.

In sum, for case-oriented content validation, upward movement in Figure 1 is especially important. It can lead to both refining the indicator in light of scores and fine-tuning the systematized concept. In addition, although the systematized concept being measured is usually relatively stable, this form of validation may lead to friendly amendments that modify the systematized concept by drawing ideas from the background concept. To put this another way, in this form of validation both an inductive component and conceptual innovation are especially important.
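A stylized sketch of this adjustment process is given below; the cases, scores, and thresholds are hypothetical and are not drawn from these authors:

```python
# Stylized "double fitting": try alternative scoring rules for an
# indicator and check how each sorts the cases against conceptual
# expectations; persistent misfits might instead prompt fine-tuning
# the systematized concept itself.
cases = {"A": 0.91, "B": 0.74, "C": 0.55, "D": 0.30}  # hypothetical index scores
expected_positive = {"A", "B"}  # cases conceptually expected to qualify

def classify(threshold):
    """Score a case as positive if its index meets the threshold."""
    return {name for name, score in cases.items() if score >= threshold}

for t in (0.5, 0.6, 0.7, 0.8):
    sorted_cases = classify(t)
    verdict = "fits expectations" if sorted_cases == expected_positive else "misfits"
    print(f"threshold {t}: {sorted(sorted_cases)} ({verdict})")
```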

Limitations of Content Validation. Content validation makes an important contribution to the assessment of measurement validity, but alone it is incomplete, for two reasons. First, although a necessary condition, the findings of content validation are not a sufficient condition for establishing validity (Shepard 1993, 414-5; Sireci 1998, 112). The key point is that an indicator with valid content may still produce scores with low overall measurement validity, because further threats to validity can be introduced in the coding of cases. A second reason concerns the trade-off between parsimony and completeness that arises because indicators routinely fail to capture the full content of a systematized concept. Capturing this content may require a complex indicator that is hard to use and adds greatly to the time and cost of completing the research. It is a matter of judgment for scholars to decide when efforts to further improve the adequacy of content may become counterproductive.

It is useful to complement the conceptual criticism of indicators by examining whether particular modifications in an indicator make a difference in the scoring of cases. To the extent that such modifications have little influence on scores, their contribution to improving validity is more modest. An example in which their contribution is shown to be substantial is provided by Paxton (2000). She develops an alternative indicator of democracy that takes female suffrage into account, compares the scores it produces with those produced by the indicators she originally criticized, and shows that her revised indicator yields substantially different findings. Her content validation argument stands on conceptual grounds alone, but her information about scoring demonstrates the substantive importance of her concerns. This comparison of indicators in a sense introduces us to convergent/discriminant validation, to which we now turn.
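Before proceeding, the logic of such a score comparison can be sketched in a few lines; the cases and suffrage codings below are invented for illustration and are not Paxton's data:

```python
# Compare how two operationalizations of democracy, one requiring only
# male suffrage and one requiring universal suffrage, sort the same cases.
cases = {
    # country: (male_suffrage, female_suffrage, competitive_elections)
    "Country A": (True, True, True),
    "Country B": (True, False, True),   # competitive, but men-only franchise
    "Country C": (True, False, False),
}

def male_suffrage_rule(male, female, competitive):
    return male and competitive

def universal_suffrage_rule(male, female, competitive):
    return male and female and competitive

for name, attrs in cases.items():
    a, b = male_suffrage_rule(*attrs), universal_suffrage_rule(*attrs)
    note = "" if a == b else "  <- scoring diverges"
    print(f"{name}: male-only={a}, universal={b}{note}")
```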

    Convergent/Discriminant Validation

Basic Question. Are the scores (level 4) produced by alternative indicators (level 3) of a given systematized concept (level 2) empirically associated and thus convergent? Furthermore, do these indicators have a weaker association with indicators of a second, different systematized concept, thus discriminating this second group of indicators from the first? Stronger associations constitute evidence that supports interpreting indicators as measuring the same systematized concept, thus providing convergent validation; whereas weaker associations support the claim that they measure different concepts, thus providing discriminant validation. The special case of convergent validation in which one indicator is taken as a standard of reference, and is used to evaluate one or more alternative indicators, is called criterion validation, as discussed above.

Discussion. Carefully defined systematized concepts, and the availability of two or more alternative indicators of these concepts, are the starting point for convergent/discriminant validation. They lay the groundwork for arguments that particular indicators measure the same or different concepts, which in turn create expectations about how the indicators may be empirically associated. To the extent that empirical findings match these descriptive expectations, they provide support for validity.

Empirical associations are crucial to convergent/discriminant validation, but they are often simply the point of departure for an iterative process. What initially appears to be negative evidence can spur refinements that ultimately enhance validity. That is, the failure to find expected convergence may encourage a return to the conceptual and logical analysis of indicators, which may lead to their modification. Alternatively, researchers may conclude that divergence suggests the indicators measure different systematized concepts and may reevaluate the conceptualization that led them to expect convergence. This process illustrates the intertwining of convergent and discriminant validation.

Examples of Convergent/Discriminant Validation. Scholars who develop measures of democracy frequently use convergent validation. Thus, analysts who create a new indicator commonly report its correlation with previously established indicators (Bollen 1980, 380-2; Coppedge and Reincke 1990, 61; Mainwaring et al. 2001, 52; Przeworski et al. 1996, 52). This is a valuable procedure, but it should not be employed atheoretically. Scholars should have specific conceptual reasons for expecting convergence if it is to constitute evidence for validity. Let us suppose a proposed indicator is meant to capture a facet of democracy overlooked by existing measures; then too high a correlation is in fact negative evidence regarding validity, for it suggests that nothing new is being captured.
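In its simplest correlational form, the procedure might look like the sketch below. The indicators and data are simulated, and in a real application the expectation of which correlations should be high or low would itself need a conceptual rationale:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # hypothetical countries

# Simulate two indicators of the same concept ("democracy") plus one
# indicator assumed to tap a different concept ("party-system structure").
democracy_latent = rng.normal(size=n)
dem_indicator_a = democracy_latent + rng.normal(scale=0.5, size=n)
dem_indicator_b = democracy_latent + rng.normal(scale=0.5, size=n)
party_system = rng.normal(size=n)  # unrelated by construction

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

# Convergent evidence: indicators of the same concept should correlate highly.
print(f"dem_a vs dem_b: {r(dem_indicator_a, dem_indicator_b):.2f}")
# Discriminant evidence: correlations across concepts should be weaker.
print(f"dem_a vs party_system: {r(dem_indicator_a, party_system):.2f}")
print(f"dem_b vs party_system: {r(dem_indicator_b, party_system):.2f}")
```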

An example of discriminant validation is provided by Bollen's (1980, 373-4) analysis of voter turnout. As in the studies just noted, different measures of democracy are compared, but in this instance the goal is to find empirical support for divergence. Bollen claims, based on content validation, that voter turnout is an indicator of a concept distinct from the systematized concept of political democracy. The low correlation of voter turnout with other proposed indicators of democracy provides discriminant evidence for this claim. Bollen concludes that turnout is best understood as an indicator of political participation, which should be conceptualized as distinct from political democracy.

Although qualitative researchers routinely lack the data necessary for the kind of statistical analysis performed by Bollen, convergent/discriminant validation is by no means irrelevant for them. They often assess whether the scores for alternative indicators converge or diverge. Paxton, in the example discussed above, in effect uses discriminant validation when she compares alternative qualitative indicators of democracy in order to show that recommendations derived from her assessment of content validation make a difference empirically. This comparison, based on the assessment of scores, discriminates among alternative indicators. Convergent/discriminant validation is also employed when qualitative researchers use a multimethod approach involving "triangulation" among multiple indicators based on different kinds of data sources (Brewer and Hunter 1989; Campbell and Fiske 1959; Webb et al. 1966). Orum, Feagin, and Sjoberg (1991, 19) specifically argue that one of the great strengths of the case study tradition is its use of triangulation for enhancing validity. In general, the basic ideas of convergent/discriminant validation are at work in qualitative research whenever scholars compare alternative indicators.

Concerns about Convergent/Discriminant Validation. A first concern here is that scholars might think that in convergent/discriminant validation empirical findings always dictate conceptual choices. This frames the issue too narrowly. For example, Bollen (1993, 1208-9, 1217) analyzes four indicators that he takes as components of the concept of political liberties and four indicators that he understands as aspects of democratic rule. An examination of Bollen's covariance matrix reveals that these do not emerge as two separate empirical dimensions. Convergent/discriminant validation, mechanically applied, might lead to a decision to eliminate this conceptual distinction. Bollen does not take that approach. He combines the two clusters of indicators into an overall empirical measure, but he also maintains the conceptual distinction. Given the conceptual congruence between the two sets of indicators and the concepts of political liberties and democratic rule, the standard of content validation is clearly met, and Bollen continues to use these overarching labels.

Another concern arises over the interpretation of low correlations among indicators. Analysts who lack a "true" measure against which to assess validity must base convergent validation on a set of indicators, none of which may be a very good measure of the systematized concept. The result may be low correlations among indicators, even though they have shared variance that measures the concept. One possible solution is to focus on this shared variance, even though the overall correlations are low. Standard statistical techniques may be used to tap this shared variance.
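For instance, a common factor model can recover the shared variance even when pairwise correlations are modest. The sketch below uses simulated data and assumes a single-factor specification purely for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 200  # hypothetical cases

# Three noisy indicators that share only modest variance from one concept.
concept = rng.normal(size=n)
indicators = np.column_stack([
    0.5 * concept + rng.normal(size=n),
    0.4 * concept + rng.normal(size=n),
    0.6 * concept + rng.normal(size=n),
])

# Pairwise correlations are low even though the indicators share a factor.
print("pairwise r:\n", np.corrcoef(indicators.T).round(2))

# A one-factor model isolates the shared variance and yields an
# estimated standing on the common factor for each case.
fa = FactorAnalysis(n_components=1, random_state=0)
scores = fa.fit_transform(indicators)
print("loadings:", fa.components_.round(2))
```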

The opposite problem also is a concern: the limitations of inferring validity from a high correlation among indicators. Such a correlation may reflect factors other than valid measurement. For example, two indicators may be strongly correlated because they both measure some other concept; or they may measure different concepts, one of which causes the other. A plausible response is to think through, and seek to rule out, alternative reasons for the high correlation.10

Although framing these concerns in the language of high and low correlations appears to orient the discussion toward quantitative researchers, qualitative researchers face parallel issues. Specifically, these issues arise when qualitative researchers analyze the sorting of cases produced by alternative classification procedures that represent different ways of operationalizing either a given concept (i.e., convergent validation) or two or more concepts that are presumed to be distinct (i.e., discriminant validation). Given that these scholars are probably working with a small N, they may be able to draw on their knowledge of cases to assess alternative explanations for convergences and divergences among the sorting of cases yielded by different classification procedures. In this way, they can make valuable inferences about validity. Quantitative researchers, by contrast, have other tools for making these inferences, to which we now turn.

Convergent Validation and Structural Equation Models with Latent Variables. In quantitative research, an important means of responding to the limitations of simple correlational procedures for convergent/discriminant validation is offered by structural equation models with latent variables (also called LISREL-type models). Some treatments of such models, to the extent that they discuss measurement error, focus their attention on random error, that is, on reliability (Hayduk 1987, e.g., 118-24; 1996).11 However, Bollen has made systematic error, which is the concern of the present article, a central focus in his major methodological statement on this approach (1989, 190-206). He demonstrates, for example, its distinctive contribution for a scholar concerned with convergent/discriminant validation who is dealing with a data set with high correlations among alternative indicators. In this case, structural equations with latent variables can be used to estimate the degree to which these high correlations derive from shared systematic bias, rather than reflect the valid measurement of an underlying concept.12

This approach is illustrated by Bollen (1993) and Bollen and Paxton's (1998, 2000) evaluation of eight indicators of democracy taken from data sets developed by Banks, Gastil, and Sussman.13 For each indicator, Bollen and Paxton estimate the percent of total variance that validly measures democracy, as opposed to reflecting systematic and random error. The sources of systematic error are then explored. Bollen and Paxton conclude, for example, that Gastil's indicators have conservative bias, giving higher scores to countries that are Catholic, that have traditional monarchies, and that are not Marxist-Leninist (Bollen 1993, 1221; Bollen and Paxton 2000, 73). This line of research is an outstanding example of the sophisticated use of convergent/discriminant validation to identify potential problems of political bias.
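The variance decomposition behind such percentages can be written compactly. The following is a generic sketch, assuming the concept, a shared method factor, and random noise are mutually uncorrelated; it is an illustrative specification, not Bollen and Paxton's exact model:

```latex
% Indicator x_q as a function of the latent concept \xi (e.g., democracy),
% a shared method factor M (a source of systematic bias), and noise:
\begin{align*}
  x_q &= \lambda_q \xi + \gamma_q M + \varepsilon_q \\[4pt]
  \underbrace{\operatorname{Var}(x_q)}_{\text{total}}
    &= \underbrace{\lambda_q^2 \operatorname{Var}(\xi)}_{\text{valid}}
     + \underbrace{\gamma_q^2 \operatorname{Var}(M)}_{\text{systematic error}}
     + \underbrace{\operatorname{Var}(\varepsilon_q)}_{\text{random error}}
\end{align*}
```

On this accounting, the share of an indicator's variance that validly measures the concept is the first term divided by the total, which is the kind of quantity reported in such analyses.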

In discussing Bollen's treatment and application of structural equation models we would like to note both similarities, and a key contrast, in relation to the practice of qualitative researchers. Bollen certainly shares the concern with careful attention to concepts, and with knowledge of cases, that we have emphasized above, and that is characteristic of case-oriented content validation as practiced by qualitative researchers. He insists that complex quantitative techniques cannot replace careful conceptual and theoretical reasoning; rather, they presuppose it. Furthermore, "structural equation models are not very helpful if you have little idea about the subject matter" (Bollen 1989, vi; see also 194). Qualitative researchers, carrying out a case-by-case assessment of the scores on different indicators, could of course reach some of the same conclusions about validity and political bias reached by Bollen. A structural equation approach, however, does offer a fundamentally different procedure that allows scholars to assess carefully the magnitude and sources of measurement error for large numbers of cases and to summarize this assessment systematically and concisely.

10 On the appropriate size of the correlation, see Bollen and Lennox 1991, 305-7.
11 To take a political science application, Green and Palmquist's (1990) study also reflects this focus on random error. By contrast, Green (1991) goes farther by considering both random and systematic error. Like the work by Bollen discussed below, these articles are an impressive demonstration of how LISREL-type models can incorporate a concern with measurement error into conventional statistical analysis, and how this can in turn lead to a major reevaluation of substantive findings, in this case concerning party identification (Green 1991, 67-71).
12 Two points about structural equation models with latent variables should be underscored. First, as noted below, these models can also be used in nomological/construct validation, and hence should not be associated exclusively with convergent/discriminant validation, which is the application discussed here. Second, we have emphasized that convergent/discriminant validation focuses on descriptive relations among concepts and their components. Within this framework, it merits emphasis that the indicators that measure a given latent variable (i.e., concept) in these models are conventionally interpreted as effects of this latent variable (Bollen 1989, 65; Bollen and Lennox 1991, 305-6). These effects, however, do not involve causal interactions among distinct phenomena. Such interactions, which in structural equation models involve causal relations among different latent variables, are the centerpiece of the conventional understanding of explanation. By contrast, the links between one latent variable and its indicators are productively understood as involving a descriptive relationship.
13 See, for example, Banks 1979; Gastil 1988; Sussman 1982.

    Nomological/Construct Validation

Basic Question. In a domain of research in which a given causal hypothesis is reasonably well established, we ask: Is this hypothesis again confirmed when the cases are scored (level 4) with the proposed indicator (level 3) for a systematized concept (level 2) that is one of the variables in the hypothesis? Confirmation is treated as evidence for validity.

Discussion. We should first reiterate that because the term "construct validity" has become synonymous in the psychometric literature with the broader notion of measurement validity, to reduce confusion we use Campbell's term "nomological validation" for procedures that address this basic question. Yet, given common usage in political science, in headings and summary statements we call this nomological/construct validation. We also propose an acronym that vividly captures the underlying idea: AHEM validation; that is, Assume the Hypothesis, Evaluate the Measure.

Nomological validation assesses the performance of indicators in relation to causal hypotheses in order to gain leverage in evaluating measurement validity. Whereas convergent validation focuses on multiple indicators of the same systematized concept, and discriminant validation focuses on indicators of different concepts that stand in a descriptive relation to one another, nomological validation focuses on indicators of different concepts that are understood to stand in an explanatory, causal relation with one another. Although these contrasts are not sharply presented in most definitions of nomological validation, they are essential in identifying this type as distinct from convergent/discriminant validation. In practice the contrast between description and explanation depends on the researcher's theoretical framework, but the distinction is fundamental to the contemporary practice of political science.

The underlying idea of nomological validation is that scores which can validly be claimed to measure a systematized concept should fit well-established expectations derived from causal hypotheses that involve this concept. The first step is to take as given a reasonably well-established causal hypothesis, one variable in which corresponds to the systematized concept of concern. The scholar then examines the association of the proposed indicator with indicators of the other concepts in the causal hypothesis. If the assessment produces an association that the causal hypothesis leads us to expect, then this is positive evidence for validity.

Nomological validation provides additional leverage in assessing measurement validity. If other types of validation raise concerns about the validity of a given indicator and the scores it produces, then analysts probably do not need to employ nomological validation. When other approaches yield positive evidence, however, then nomological validation is valuable in teasing out potentially important differences that may not be detected by other types of validation. Specifically, alternative indicators of a systematized concept may be strongly correlated and yet perform very differently when employed in causal assessment. Bollen (1980, 383-4) shows this, for example, in his assessment of whether regime stability should be a component of measures of democracy.

Examples of Nomological/Construct Validation. Lijphart's (1996) analysis of democracy and conflict management in India provides a qualitative example of nomological validation, which he uses to justify his classification of India as a consociational democracy. Lijphart first draws on his systematized concept of consociationalism to identify descriptive criteria for classifying any given case as consociational. He then uses nomological validation to further justify his scoring of India (pp. 262-4). Lijphart identifies a series of causal factors that he argues are routinely understood to produce consociational regimes, and he observes that these factors are present in India. Hence, classifying India as consociational is consistent with an established causal relationship, which reinforces the plausibility of his descriptive conclusion that India is a case of consociationalism.

Another qualitative example of nomological validation is found in a classic study in the tradition of comparative-historical analysis, Perry Anderson's Lineages of the Absolutist State.14 Anderson (1974, 413-5) is concerned with whether it is appropriate to classify as "feudalism" the political and economic system that emerged in Japan beginning roughly in the fourteenth century, which would place Japan in the same analytic category as European feudalism. His argument is partly descriptive, in that he asserts that "the fundamental resemblance between the two historical configurations as a whole [is] unmistakable" (p. 414). He validates his classification by observing that Japan's subsequent development, like that of postfeudal Europe, followed an economic trajectory that his theory explains as the historical legacy of a feudal state. "The basic parallelism of the two great experiences of feudalism, at the opposite ends of Eurasia, was ultimately to receive its most arresting confirmation of all, in the posterior destiny of each zone" (p. 414). Thus, he uses evidence concerning an expected explanatory relationship to increase confidence in his descriptive characterization of Japan as feudal. Anderson, like Lijphart, thus follows the two-step procedure of making a descriptive claim about one or two cases, and then offering evidence for the validity of this claim by observing that it is consistent with an explanatory claim in which he has confidence.

14 Sebastian Mazzuca suggested this example.

A quantitative example of nomological validation is found in Elkins's evaluation of the proposal that democracy versus nondemocracy should be treated as a dichotomy, rather than in terms of gradations. One potential defense of a dichotomous measure is based on convergent validation. Thus, Alvarez and colleagues (1996, 21) show that their dichotomous measure is strongly associated with graded measures of democracy. Elkins (2000, 294-6) goes on to apply nomological validation, exploring whether, notwithstanding this association, the choice of a dichotomous measure makes a difference for causal assessment. He compares tests of the democratic peace hypothesis using both dichotomous and graded measures. According to the hypothesis, democracies are in general as conflict prone as nondemocracies but do not fight one another. The key finding from the standpoint of nomological validation is that this claim is strongly supported using a graded measure, whereas there is no statistically significant support using the dichotomous measure. These findings give nomological evidence for the greater validity of the graded measure, because they better fit the overall expectations of the accepted causal hypothesis. Elkins's approach is certainly more complex than the two-step procedure followed by Lijphart and Anderson, but the basic idea is the same.
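In schematic form, the mechanics of such a comparison look like the sketch below. The dyad data are simulated and the 0.5 cutpoint is arbitrary; the sketch illustrates only how one compares the two specifications, not Elkins's substantive result:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000  # hypothetical country dyads

# Graded democracy scores (0-1) for the two states in each dyad.
dem1, dem2 = rng.uniform(size=n), rng.uniform(size=n)
joint_dem = dem1 * dem2  # high only when both states are highly democratic

# Simulate dispute onset so that it grows less likely as joint democracy rises.
logit = -1.0 - 2.0 * joint_dem
dispute = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def fit(measure):
    """Logit of dispute onset on a joint-democracy measure."""
    return sm.Logit(dispute, sm.add_constant(measure)).fit(disp=False)

graded = fit(joint_dem)
dichotomous = fit(((dem1 > 0.5) & (dem2 > 0.5)).astype(float))  # arbitrary cutpoint

for label, m in (("graded", graded), ("dichotomous", dichotomous)):
    print(f"{label}: coef={m.params[1]:.2f}, p={m.pvalues[1]:.3f}")
```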

Skepticism about Nomological Validation. Many scholars are skeptical about nomological validation. One concern is the potential problem of circularity. If one assumes the hypothesis in order to validate the indicator, then the indicator cannot be used to evaluate the same hypothesis. Hence, it is important to specify that any subsequent hypothesis-testing should involve hypotheses different from those used in nomological validation.

A second concern is that, in addition to taking the hypothesis as given, nomological validation also presupposes the valid measurement of the other systematized concept involved in the hypothesis. Bollen (1989, 188-90) notes that problems in the measurement of the second indicator can undermine this approach to assessing validity, especially when scholars rely on simple correlational procedures. Obviously, researchers need evidence about the validity of the second indicator. Structural equation models with latent variables offer a quantitative approach to addressing such difficulties because, in addition to evaluating the hypothesis, these models can be specified so as to provide an estimate of the validity of the second indicator. In small-N, qualitative analysis, the researcher has the resource of detailed case knowledge to help evaluate this second indicator. Thus, both qualitative and quantitative researchers have a means for making inferences about whether this important presupposition of nomological validation is indeed met.

A third problem is that, in many domains in which political scientists work, there may not be a sufficiently well-established hypothesis to make this a viable approach to validation. In such domains, it may be plausible to assume the measure and evaluate the hypothesis, but not the other way around. Nomological validation therefore simply may not be viable. Yet, it is helpful to recognize that nomological validation need not be restricted to a dichotomous understanding in which the hypothesis either is or is not reconfirmed, using the proposed indicator. Rather, nomological validation may focus, as it does in Elkins (2000; see also Hill, Hanna, and Shafqat 1997), on comparing two different indicators of the same systematized concept, and on asking which better fits causal expectations. A tentative hypothesis may not provide an adequate standard for rejecting claims of measurement validity outright, but it may serve as a point of reference for comparing the performance of two indicators and thereby gaining evidence relevant to choosing between them.

Another response to the concern that causal hypotheses may be too tentative a ground for measurement validation is to recognize that neither measurement claims nor causal claims are inherently more epistemologically secure. Both types of claims should be seen as falsifiable hypotheses. To take a causal hypothesis as given for the sake of measurement validation is not to say that the hypothesis is set in stone. It may be subject to critical assessment at a later point. Campbell (1977/1988, 477) expresses this point metaphorically: "We are like sailors who must repair a rotting ship at sea. We trust the great bulk of the timbers while we replace a particularly weak plank. Each of the timbers we now trust we may in turn replace. The proportion of the planks we are replacing to those we treat as sound must always be small."

    CONCLUSION

In conclusion, we return to the four underlying issues that frame our discussion. First, we have offered a new account of different types of validation. We have viewed these types in the framework of a unified conception of measurement validity. None of the specific types of validation alone establishes validity; rather, each provides one kind of evidence to be integrated into an overall process of assessment. Content validation makes the indispensable contribution of assessing what we call the adequacy of content of indicators. Convergent/discriminant validation, taking as a baseline descriptive understandings of the relationship among concepts and of their relation to indicators, focuses on shared and nonshared variance among indicators that the scholar is evaluating. This