PSYCHOLOGICAL BULLETIN, Vol. 54, No. 4, 1957

FACTORS RELEVANT TO THE VALIDITY OF EXPERIMENTS IN SOCIAL SETTINGS¹

DONALD T. CAMPBELL
Northwestern University

What do we seek to control in experimental designs? What extraneous variables which would otherwise confound our interpretation of the experiment do we wish to rule out? The present paper attempts a specification of the major categories of such extraneous variables and employs these categories in evaluating the validity of standard designs for experimentation in the social sciences.

Validity will be evaluated in terms of two major criteria. First, and as a basic minimum, is what can be called internal validity: did in fact the experimental stimulus make some significant difference in this specific instance? The second criterion is that of external validity, representativeness, or generalizability: to what populations, settings, and variables can this effect be generalized? Both criteria are obviously important although it turns out that they are to some extent incompatible, in that the controls required for internal validity often tend to jeopardize representativeness.

The extraneous variables affecting internal validity will be introduced in the process of analyzing three pre-experimental designs. In the subsequent evaluation of the applicability of three true experimental designs, factors leading to external invalidity will be introduced. The effects of these extraneous variables will be considered at two levels: as simple or main effects, they occur independently of or in addition to the effects of the experimental variable; as interactions, the effects appear in conjunction with the experimental variable. The main effects typically turn out to be relevant to internal validity, the interaction effects to external validity or representativeness.

¹ A dittoed version of this paper was privately distributed in 1953 under the title "Designs for Social Science Experiments." The author has had the opportunity to benefit from the careful reading and suggestions of L. S. Burwen, J. W. Cotton, C. P. Duncan, D. W. Fiske, C. I. Hovland, L. V. Jones, E. S. Marks, D. C. Pelz, and B. J. Underwood, among others, and wishes to express his appreciation. They have not had the opportunity of seeing the paper in its present form, and bear no responsibility for it. The author also wishes to thank S. A. Stouffer (33) and B. J. Underwood (36) for their public encouragement.

The following designation for experimental designs will be used: X will represent the exposure of a group to the experimental variable or event, the effects of which are to be measured; O will refer to the process of observation or measurement, which can include watching what people do, listening, recording, interviewing, administering tests, counting lever depressions, etc. The Xs and Os in a given row are applied to the same specific persons. The left to right dimension indicates temporal order. Parallel rows represent equivalent samples of persons unless otherwise specified. The designs will be numbered and named for cross-reference purposes.

THREE PRE-EXPERIMENTAL DESIGNS AND THEIR CONFOUNDED EXTRANEOUS VARIABLES

The One-Shot Case Study. As Stouffer (32) has pointed out, much social science research still uses Design 1, in which a single individual or group is studied in detail only once, and in which the observations are attributed to exposure to some prior situation.

X O        1. One-Shot Case Study

This design does not merit the title of experiment, and is introduced only to provide a reference point. The very minimum of useful scientific information involves at least one formal comparison and therefore at least two careful observations (2).

The One-Group Pretest-Posttest Design. This design does provide for one formal comparison of two observations, and is still widely used.

O1 X O2        2. One-Group Pretest-Posttest Design

However, in it there are four or five categories of extraneous variables left uncontrolled which thus become rival explanations of any difference between O1 and O2, confounded with the possible effect of X.

The first of these is the main effect of history. During the time span between O1 and O2 many events have occurred in addition to X, and the results might be attributed to these. Thus in Collier's (8) experiment, while his respondents² were reading Nazi propaganda materials, France fell, and the obtained attitude changes seemed more likely a result of this event than of the propaganda.³

By history is meant the specific event series other than X, i.e., the extra-experimental uncontrolled stimuli. Relevant to this variable is the concept of experimental isolation, the employment of experimental settings in which all extraneous stimuli are eliminated. The approximation of such control in much physical and biological research has permitted the satisfactory employment of Design 2. But in social psychology and the other social sciences, if history is confounded with X the results are generally uninterpretable.

² In line with the central focus on social psychology and the social sciences, the term respondent is employed in place of the terms subject, patient, or client.

³ Collier actually used a more adequate design than this, an approximation to Design 4.

The second class of variables confounded with X in Design 2 is here designated as maturation. This covers those effects which are systematic with the passage of time, and not, like history, a function of the specific events involved. Thus between O1 and O2 the respondents may have grown older, hungrier, tireder, etc., and these may have produced the difference between O1 and O2, independently of X. While in the typical brief experiment in the psychology laboratory, maturation is unlikely to be a source of change, it has been a problem in research in child development and can be so in extended experiments in social psychology and education. In the form of "spontaneous remission" and the general processes of healing it becomes an important variable to control in medical research, psychotherapy, and social remediation.

There is a third source of variance that could explain the difference between O1 and O2 without a recourse to the effect of X. This is the effect of testing itself. It is often true that persons taking a test for the second time make scores systematically different from those taking the test for the first time. This is indeed the case for intelligence tests, where a second mean may be expected to run as much as five IQ points higher than the first one. This possibility makes important a distinction between reactive measures and nonreactive measures. A reactive measure is one which modifies the phenomenon under study, which changes the very thing that one is trying to measure. In general, any measurement procedure which makes the subject self-conscious or aware of the fact of the experiment can be suspected of being a reactive measurement. Whenever the measurement process is not a part of the normal environment it is probably reactive. Whenever measurement exercises the process under study, it is almost certainly reactive. Measurement of a person's height is relatively nonreactive. However, measurement of weight, introduced into an experimental design involving adult American women, would turn out to be reactive in that the process of measuring would stimulate weight reduction. A photograph of a crowd taken in secret from a second-story window would be nonreactive, but a news photograph of the same scene might very well be reactive, in that the presence of the photographer would modify the behavior of people seeing themselves being photographed. In a factory, production records introduced for the purpose of an experiment would be reactive, but if such records were a regular part of the operating environment they would be nonreactive. An English anthropologist may be nonreactive as a participant-observer at an English wedding, but might be a highly reactive measuring instrument at a Dobu nuptials. Some measures are so extremely reactive that their use in a pretest-posttest design is not usually considered. In this class would be tests involving surprise, deception, rapid adaptation, or stress. Evidence is amply present that tests of learning and memory are highly reactive (35, 36). In the field of opinion and attitude research our well-developed interview and attitude test techniques must be rated as reactive, as shown, for example, by Crespi's (9) evidence.

Even within the personality and attitude test domain, it may be found that tests differ in the degree to which they are reactive. For some purposes, tests involving voluntary self-description may turn out to be more reactive (especially at the interaction level to be discussed below) than are devices which focus the respondent upon describing the external world, or give him less latitude in describing himself (e.g., 5). It seems likely that, apart from considerations of validity, the Rorschach test is less reactive than the TAT or MMPI. Where the reactive nature of the testing process results from the focusing of attention on the experimental variable, it may be reduced by imbedding the relevant content in a comprehensive array of topics, as has regularly been done in Hovland's attitude change studies (14). It seems likely that with attention to the problem, observational and measurement techniques can be developed which are much less reactive than those now in use.

Instrument decay provides a fourth uncontrolled source of variance which could produce an O1-O2 difference that might be mistaken for the effect of X. This variable can be exemplified by the fatiguing of a spring scales, or the condensation of water vapor in a cloud chamber. For psychology and the social sciences it becomes a particularly acute problem when human beings are used as a part of the measuring apparatus, as judges, observers, raters, coders, etc. Thus O1 and O2 may differ because the raters have become more experienced, more fatigued, have acquired a different adaptation level, or have learned about the purpose of the experiment, etc. However infelicitously, this term will be used to typify those problems introduced when shifts in measurement conditions are confounded with the effect of X, including such crudities as having a different observer at O1 and O2, or using a different interviewer or coder. Where the use of different interviewers, observers, or experimenters is unavoidable, but where they are used in large numbers, a sampling equivalence of interviewers is required, with the relevant N being the N of interviewers, not interviewees, except as refined through cluster sampling considerations (18).

A possible fifth extraneous factor deserves mention. This is statistical regression. When, in Design 2, the group under investigation has been selected for its extremity on O1, O1-O2 shifts toward the mean will occur which are due to random imperfections of the measuring instrument or random instability within the population, as reflected in the test-retest reliability. In general, regression operates like maturation in that the effects increase systematically with the O1-O2 time interval. McNemar (22) has demonstrated the profound mistakes in interpretation which failure to control this factor can introduce in remedial research.
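
To make the regression artifact concrete, the following simulation (not part of the original article; the population size, score scale, and reliability value are assumed purely for illustration) selects a group for its extremity on O1 and shows the O2 shift back toward the mean even though no X intervenes:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000          # population size (assumed for the example)
reliability = 0.8   # assumed test-retest reliability

# True scores plus independent measurement error at each testing occasion.
true = rng.normal(100, 15 * np.sqrt(reliability), n)
error_sd = 15 * np.sqrt(1 - reliability)
o1 = true + rng.normal(0, error_sd, n)   # pretest
o2 = true + rng.normal(0, error_sd, n)   # posttest, with no X administered

# Select the group for its extremity on O1 (top 10 per cent of scorers).
extreme = o1 >= np.quantile(o1, 0.90)

print("extreme group, mean O1:", round(o1[extreme].mean(), 1))
print("extreme group, mean O2:", round(o2[extreme].mean(), 1))
# The O2 mean falls back toward 100 although nothing intervened -- a shift
# that Design 2 would confound with the effect of X.
```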

The Static Group Comparison. The third pre-experimental design is the Static Group Comparison.

X  O1
---------        3. Static Group Comparison
   O2

In this design, there is a comparison of a group which has experienced X with a group which has not, for the purpose of establishing the effect of X. In contrast with Design 6, there is in this design no means of certifying that the groups were equivalent at some prior time. (The absence of sampling equivalence of groups is symbolized by the row of dashes.) This design has its most typical occurrence in the social sciences, and both its prevalence and its weakness have been well indicated by Stouffer (32). It will be recognized as one form of the correlational study. It is introduced here to complete the list of confounding factors. If the Os differ, this difference could have come about through biased selection or recruitment of the persons making up the groups; i.e., they might have differed anyway without the effect of X. Frequently, exposure to X (e.g., some mass communication) has been voluntary and the two groups have an inevitable systematic difference on the factors determining the choice involved, a difference which no amount of matching can remove.

A second variable confounded with the effect of X in this design can be called experimental mortality. Even if the groups were equivalent at some prior time, O1 and O2 may differ now not because individual members have changed, but because a biased subset of members have dropped out. This is a typical problem in making inferences from comparisons of the attitudes of college freshmen and college seniors, for example.

TRUE EXPERIMENTAL DESIGNS

The Pretest-Posttest Control Group Design. One or another of the above considerations led psychologists between 1900 and 1925 (2, 30) to expand Design 2 by the addition of a control group, resulting in Design 4.

O1 X O2        4. Pretest-Posttest Control Group
O3      O4            Design

Because this design so neatly controls for the main effects of history, maturation, testing, instrument decay, regression, selection, and mortality, these separate sources of variance are not usually made explicit. It seems well to state briefly the relationship of the design to each of these confounding factors, with particular attention to the application of the design in social settings.

If the differences between O1 and O2 were due to intervening historical events, then they should also show up in the O3-O4 comparison. Note, however, several complications in achieving this control. If respondents are run in groups, and if there is only one experimental session and one control session, then there is no control over the unique internal histories of the groups. The O1-O2 difference, even if not appearing in O3-O4, may be due to a chance distracting factor appearing in one or the other group. Such a design, while controlling for the shared history or event series, still confounds X with the unique session history. Second, the design implies a simultaneity of O1 with O3 and O2 with O4 which is usually impossible. If one were to try to achieve simultaneity by using two experimenters, one working with the experimental respondents, the other with the controls, this would confound experimenter differences with X (introducing one type of instrument decay). These considerations make it usually imperative that, for a true experiment, the experimental and control groups be tested and exposed individually or in small subgroups, and that sessions of both types be temporally and spatially intermixed.

As to the other factors: if maturation or testing contributed an O1-O2 difference, this should appear equally in the O3-O4 comparison, and these variables are thus controlled for their main effects. To make sure the design controls for instrument decay, the same individual or small-session approximation to simultaneity needed for history is required. The occasional practice of running the experimental group and control group at different times is thus ruled out on this ground as well as that of history. Otherwise the observers may have become more experienced, more hurried, more careless, the maze more redolent with irrelevant cues, the lever-tension and friction diminished, etc. Only when groups are effectively simultaneous do these factors affect experimental and control groups alike. Where more than one experimenter or observer is used, counterbalancing experimenter, time, and group is recommended. The balanced Latin square is frequently useful for this purpose (4).
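
As one sketch of such counterbalancing (an illustration added here, not taken from the article; a Williams-type zigzag construction is used and an even number of conditions is assumed), a balanced Latin square for four conditions can be generated so that each condition appears once in every position and follows every other condition exactly once:

```python
# Balanced Latin square by a Williams-type construction (assumed even n):
# first row zigzags 0, 1, n-1, 2, n-2, ...; remaining rows are cyclic
# increments of the first row.
def balanced_latin_square(n):
    if n % 2:
        raise ValueError("this simple construction assumes an even n")
    first, lo, hi = [0], 1, n - 1
    while lo <= hi:
        first.append(lo)
        lo += 1
        if lo <= hi:
            first.append(hi)
            hi -= 1
    return [[(c + i) % n for c in first] for i in range(n)]

labels = "ABCD"   # e.g., four experimenter/time/treatment combinations
for row in balanced_latin_square(4):
    print(" ".join(labels[c] for c in row))
# A B D C
# B C A D
# C D B A
# D A C B
```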

While regression is controlled in the design as a whole, frequently secondary analyses of effects are made for extreme pretest scorers in the experimental group. To provide a control for effects of regression, a parallel analysis of extremes should also be made for the control group.

Selection is of course handled by the sampling equivalence ensured through the randomization employed in assigning persons to groups, perhaps supplemented by, but not supplanted by, matching procedures. Where the experimental and control groups do not have this sort of equivalence, one has a compromise design rather than a true experiment. Furthermore, the O1-O3 comparison provides a check on possible sampling differences.
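
A minimal sketch of this random assignment and of the O1-O3 check (the roster size, score scale, and group sizes below are invented for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical roster of respondents and their pretest scores.
respondents = np.arange(60)
pretest = rng.normal(50, 10, respondents.size)

# Random assignment to experimental and control groups.
shuffled = rng.permutation(respondents)
experimental, control = shuffled[:30], shuffled[30:]

# The O1-O3 comparison: a check on possible sampling differences.
t, p = stats.ttest_ind(pretest[experimental], pretest[control])
print(f"pretest means: {pretest[experimental].mean():.1f} vs "
      f"{pretest[control].mean():.1f}  (t = {t:.2f}, p = {p:.2f})")
```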

The design also makes possible the examination of experimental mortality, which becomes a real problem for experiments extended over weeks or months. If the experimental and control groups do not differ in the number of lost cases nor in their pretest scores, the experiment can be judged internally valid on this point, although mortality reduces the generalizability of effects to the original population from which the groups were selected.

For these reasons, the Pretest-Posttest Control Group Design has been the ideal in the social sciences for some thirty years. Recently, however, a serious and avoidable imperfection in it has been noted, perhaps first by Schanck and Goodman (29). Solomon (30) has expressed the point as an interaction effect of testing. In the terminology of analysis of variance, the effects of history, maturation, and testing, as described so far, are all main effects, manifesting themselves in mean differences independently of the presence of other variables. They are effects that could be added on to other effects, including the effect of the experimental variable. In contrast, interaction effects represent a joint effect, specific to the concomitance of two or more conditions, and may occur even when no main effects are present. Applied to the testing variable, the interaction effect might involve not a shift due solely or directly to the measurement process, but rather a sensitization of respondents to the experimental variable so that when X was preceded by O there would be a change, whereas both X and O would be without effect if occurring alone. In terms of the two types of validity, Design 4 is internally valid, offering an adequate basis for generalization to other sampling-equivalent pretested groups. But it has a serious and systematic weakness in representativeness in that it offers, strictly speaking, no basis for generalization to the unpretested population. And it is usually the unpretested larger universe from which these samples were taken to which one wants to generalize.

A concrete example will help make this clearer. In the NORC study of a United Nations information campaign (31), two equivalent samples, of a thousand each, were drawn from the city's population. One of these samples was interviewed, following which the city of Cincinnati was subjected to an intensive publicity campaign using all the mass media of communication. This included special features in the newspapers and on the radio, bus cards, public lectures, etc. At the end of two months, the second sample of 1,000 was interviewed and the results compared with the first 1,000. There were no differences between the two groups except that the second group was somewhat more pessimistic about the likelihood of Russia's cooperating for world peace, a result which was attributed to history rather than to the publicity campaign. The second sample was no better informed about the United Nations nor had it noticed in particular the publicity campaign which had been going on. In connection with a program of research on panels and the reinterview problem, Paul Lazarsfeld and the Bureau of Applied Social Research arranged to have the initial sample reinterviewed at the same time as the second sample was interviewed, after the publicity campaign. This reinterviewed group showed significant attitude changes, a high degree of awareness of the campaign and important increases in information. The inference in this case is unmistakably that the initial interview had sensitized the persons interviewed to the topic of the United Nations, had raised in them a focus of awareness which made the subsequent publicity campaign effective for them but for them only. This study and other studies clearly document the possibility of interaction effects which seriously limit our capacity to generalize from the pretested experimental group to the unpretested general population. Hovland (15) reports a general finding which is of the opposite nature but is, nonetheless, an indication of an interactive effect. In his Army studies the initial pretest served to reduce the effects of the experimental variable, presumably by creating a commitment to a given position. Crespi's (9) findings support this expectation. Solomon (30) reports two studies with school children in which a spelling pretest reduced the effects of a training period. But whatever the direction of the effect, this flaw in the Pretest-Posttest Control Group Design is serious for the purposes of the social scientist.

The Solomon Four-Group Design. It is Solomon's (30) suggestion to control this problem by adding to the traditional two-group experiment two unpretested groups as indicated in Design 5.

O1 X O2
O3      O4        5. Solomon Four-Group Design
      X O5
           O6

This Solomon Four-Group Design enables one both to control and measure both the main and interaction effects of testing and the main effects of a composite of maturation and history. It has become the new ideal design for social scientists. A word needs to be said about the appropriate statistical analysis. In Design 4, an efficient single test embodying the four measurements is achieved through computing for each individual a pretest-posttest difference score which is then used for comparing by t test the experimental and control groups. Extension of this mode of analysis to the Solomon Four-Group Design introduces an inelegant awkwardness to the otherwise elegant procedure. It involves assuming as a pretest score for the unpretested groups the mean value of the pretest from the first two groups. This restricts the effective degrees of freedom, violates assumptions of independence, and leaves one without a legitimate base for testing the significance of main effects and interaction. An alternative analysis is available which avoids the assumed pretest scores. Note that the four posttests form a simple two-by-two analysis of variance design:

                 No X     X
Pretested         O4      O2
Unpretested       O6      O5

The column means represent the main effect of X, the row means the main effect of pretesting, and the interaction term the interaction of pretesting and X. (By use of a t test the combined main effects of maturation and history can be tested through comparing O6 with O1 and O3.)
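
The two analyses just described might be sketched computationally as follows (a hedged illustration using randomly generated scores; the group size, means, and standard deviations are assumed, and pooling O1 with O3 is one way of making the final comparison):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 25   # assumed number of respondents per group

# Design 4: pretest-posttest difference scores compared by t test.
o1, o2 = rng.normal(50, 10, n), rng.normal(55, 10, n)   # pretested, X
o3, o4 = rng.normal(50, 10, n), rng.normal(50, 10, n)   # pretested, no X
t, p = stats.ttest_ind(o2 - o1, o4 - o3)
print(f"Design 4 difference-score t test: t = {t:.2f}, p = {p:.3f}")

# Solomon Four-Group Design: two-by-two analysis of the four posttests only.
o5 = rng.normal(55, 10, n)   # unpretested, X
o6 = rng.normal(50, 10, n)   # unpretested, no X
x_effect = (o2.mean() + o5.mean()) / 2 - (o4.mean() + o6.mean()) / 2
pretest_effect = (o2.mean() + o4.mean()) / 2 - (o5.mean() + o6.mean()) / 2
interaction = (o2.mean() - o4.mean()) - (o5.mean() - o6.mean())
print(f"main effect of X:            {x_effect:.2f}")
print(f"main effect of pretesting:   {pretest_effect:.2f}")
print(f"pretesting x X interaction:  {interaction:.2f}")

# Combined main effects of maturation and history: O6 against O1 and O3
# (here pooled into a single comparison group).
t, p = stats.ttest_ind(o6, np.concatenate([o1, o3]))
print(f"O6 vs. O1 and O3: t = {t:.2f}, p = {p:.3f}")
```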

The Posttest-Only Control Group Design. While the statistical procedures of analysis of variance introduced by Fisher (10) are dominant in psychology and the other social sciences today, it is little noted in our discussions of experimental arrangements that Fisher's typical agricultural experiment involves no pretest: equivalent plots of ground receive different experimental treatments and the subsequent yields are measured.⁴ Applied to a social experiment, as in testing the influence of a motion picture upon attitudes, two randomly assigned audiences would be selected, one exposed to the movie, and the attitudes of each measured subsequently for the first time.

A X O1        6. Posttest-Only Control Group
A    O2            Design

In this design the symbol A has been added, to indicate that at a specific time prior to X the groups were made equivalent by a random sampling assignment. A is the point of selection, the point of allocation of individuals to groups. It is the existence of this process that distinguishes Design 6 from Design 3, the Static Group Comparison. Design 6 is not a static cross-sectional comparison, but instead truly involves control and observation extended in time. The sampling procedures employed assure us that at time A the groups were equal, even if not measured. A provides a point of prior equality just as does the pretest. A point A is, of course, involved in all true experiments, and should perhaps be indicated in Designs 4 and 5. It is essential that A be regarded as a specific point in time, for groups change as a function of time since A, through experimental mortality. Thus in a public opinion survey situation employing probability sampling from lists of residents, the longer the time since A, the more the sample underrepresents the transient segments of society, the newer dwelling units, etc. When experimental groups are being drawn from a self-selected extreme population, such as applicants for psychotherapy, time since A introduces maturation (spontaneous remission) and regression factors. In Design 6 these effects would be confounded with the effect of X if the As as well as the Os were not contemporaneous for experimental and control groups.

⁴ This is not to imply that the pretest is totally absent from Fisher's designs. He suggests the use of previous year's yields, etc., in covariance analysis. He notes, however, "with annual agricultural crops, knowledge of yields of the experimental area in a previous year under uniform treatment has not been found sufficiently to increase the precision to warrant the adoption of such uniformity trials as a preliminary to projected experiments" (10, p. 176).

Like Design 4, this design controls for the effects of maturation and history through the practical simultaneity of both the As and the Os. In superiority over Design 4, no main or interaction effects of pretesting are involved. It is this feature that recommends it in particular. While it controls for the main and interaction effects of pretesting as well as does Design 5, the Solomon Four-Group Design, it does not measure these effects, nor the main effect of history-maturation. It can be noted that Design 6 can be considered as the two unpretested "control" groups from the Solomon Design, and that Solomon's two traditional pretested groups have in this sense the sole purpose of measuring the effects of pretesting and history-maturation, a purpose irrelevant to the main aim of studying the effect of X (25). However, under normal conditions of not quite perfect sampling control, the four-group design provides in addition greater assurance against mistakenly attributing to X effects which are not due to it, inasmuch as the effect of X is documented in three different fashions (O1 vs. O2, O2 vs. O4, and O5 vs. O6). But, short of the four-group design, Design 6 is often to be preferred to Design 4, and is a fully valid experimental design.

Design 6 has indeed been used in the social sciences, perhaps first of all in the classic experiment by Gosnell, Getting Out the Vote (11). Schanck and Goodman (29), Hovland (15) and others (1, 12, 23, 24, 27) have also employed it. But, in spite of its manifest advantages of simplicity and control, it is far from being a popular design in social research and indeed is usually relegated to an inferior position in discussions of experimental designs if mentioned at all (e.g., 15, 16, 32). Why is this the case?

In the first place, it is often confused with Design 3. Even where Ss have been carefully assigned to experimental and control groups, one is apt to have an uneasiness about the design because one "doesn't know what the subjects were like before." This objection must be rejected, as our standard tests of significance are designed precisely to evaluate the likelihood of differences occurring by chance in such sample selection. It is true, however, that this design is particularly vulnerable to selection bias and where random assignment is not possible it remains suspect. Where naturally aggregated units, such as classes, are employed intact, these should be used in large numbers and assigned at random to the experimental and control conditions; cluster sampling statistics (18) should be used to determine the error term. If but one or two intact classrooms are available for each experimental treatment, Design 4 should certainly be used in preference.

A second objection to Design 6, in comparison with Design 4, is that it often has less precision. The difference scores of Design 4 are less variable than the posttest scores of Design 6 if there is a pretest-posttest correlation above .50 (15, p. 323), and hence for test-retest correlations above that level a smaller mean difference would be statistically significant for Design 4 than for Design 6, for a constant number of cases. This advantage to Design 4 may often be more than dissipated by the costs and loss in experimental efficiency resulting from the requirement of two testing sessions, over and above the considerations of representativeness.
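
The .50 threshold can be seen directly from the variance of a difference score (a brief derivation added here for clarity; it assumes equal pretest and posttest variances σ² and a test-retest correlation ρ):

\[
\operatorname{Var}(O_2 - O_1) \;=\; \sigma^2 + \sigma^2 - 2\rho\sigma^2 \;=\; 2\sigma^2(1 - \rho),
\]

which falls below the posttest variance σ² used in Design 6 exactly when 2(1 − ρ) < 1, that is, when ρ > .50.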

Design 4 has a particular advantage over Design 6 if experimental mortality is high. In Design 4, one can examine the pretest scores of lost cases in both experimental and control groups and check on their comparability. In the absence of this in Design 6, the possibility is opened for a mean difference resulting from differential mortality rather than from individual change, if there is a substantial loss of cases.

A final objection comes from those who wish to study the relationship of pretest attitudes to kind and amount of change. This is a valid objection, and where this is the interest, Design 4 or 5 should be used, with parallel analysis of experimental and control groups. Another common type of individual difference study involves classifying persons in terms of amount of change and finding associated characteristics such as sex, age, education, etc. While unavailable in this form in Design 6, essentially the same correlational information can be obtained by subdividing both experimental and control groups in terms of the associated characteristics, and examining the experimental-control difference for such subtypes.
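
A minimal sketch of that subgroup comparison (the grouping variable, group labels, and scores below are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical Design 6 data: posttest scores plus an associated
# characteristic (e.g., education level) recorded for each respondent.
educ = rng.choice(["low", "high"], size=200)
group = rng.choice(["experimental", "control"], size=200)
posttest = rng.normal(50, 10, 200) + np.where(group == "experimental", 4, 0)

# Examine the experimental-control difference within each subtype.
for level in ("low", "high"):
    exp = posttest[(educ == level) & (group == "experimental")]
    ctl = posttest[(educ == level) & (group == "control")]
    print(f"{level}: experimental-control difference = "
          f"{exp.mean() - ctl.mean():.2f}")
```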

For Design 6, the Posttest-Only Control Group Design, there is a class of social settings in which it is optimally feasible, settings which should be more used than they now are. Whenever the social contact represented by X is made to single individuals or to small groups, and where the response to that stimulus can be identified in terms of individuals or type of X, Design 6 can be applied. Direct mail and door-to-door contacts represent such settings. The alternation of several appeals from door to door in a fund-raising campaign can be organized as a true experiment without increasing the cost of the solicitation. Experimental variation of persuasive materials in a direct-mail sales campaign can provide a better experimental laboratory for the study of mass communication and persuasion than is available in any university. The well-established, if little-used, split-run technique in comparing alternative magazine ads is a true experiment of this type, usually limited to coupon returns rather than sales because of the problem of identifying response with stimulus type (20). The split-ballot technique (7) long used in public opinion polls to compare different question wordings or question sequences provides an excellent example which can obviously be extended to other topics (e.g., 12). By and large these laboratories have not yet been used to study social science theories, but they are directly relevant to hypotheses about social persuasion.

Multiple X designs. In presenting the above designs, X has been opposed to No-X, as is traditional in discussions of experimental design in psychology. But while this may be a legitimate description of the stimulus-isolated physical science laboratory, it can only be a convenient shorthand in the social sciences, for any No-X period will not be empty of potentially change-inducing stimuli. The experience of the control group might better be categorized as another type of X, a control experience, an Xc instead of No-X. It is also typical of advance in science that we are soon no longer interested in the qualitative fact of effect or no-effect, but want to specify degree of effect for varying degrees of X. These considerations lead into designs in which multiple groups are used, each with a different X1, X2, X3, ... Xn, or in multiple factorial design, as X1a, X1b, X2a, X2b, etc. Applied to Designs 4 and 6, this introduces one additional group for each additional X. Applied to 5, the Solomon Four-Group Design, two additional groups (one pretested, one not, both receiving Xn) would be added for each variant on X.

In many experiments, X1, X2, X3, and Xn are all given to the same group, differing groups receiving the Xs in different orders. Where the problem under study centers around the effects of order or combination, such counterbalanced multiple X arrangements are, of course, essential. Studies of transfer in learning are a case in point (34). But where one wishes to generalize to the effect of each X as occurring in isolation, such designs are not recommended because of the sizable interactions among Xs, as repeatedly demonstrated in learning studies under such labels as proactive inhibition and learning sets. The use of counterbalanced sets of multiple Xs to achieve experimental equation, where natural groups not randomly assembled have to be used, will be discussed in a subsequent paper on compromise designs.

Testing for effects extended in time. The researches of Hovland and his associates (14, 15) have indicated repeatedly that the longer range effects of persuasive Xs may be qualitatively as well as quantitatively different from immediate effects. These results emphasize the importance of designing experiments to measure the effect of X at extended periods of time. As the misleading early research on reminiscence and on the consolidation of the memory trace indicates (36), repeated measurement of the same persons cannot be trusted to do this if a reactive measurement process is involved. Thus, for Designs 4 and 6, two separate groups must be added for each posttest period. The additional control group cannot be omitted, or the effects of intervening history, maturation, instrument decay, regression, and mortality are confounded with the delayed effects of X. To follow fully the logic of Design 5, four additional groups are required for each posttest period.

True experiments in which O is not under E's control. It seems well to call the attention of the social scientist to one class of true experiments which are possible without the full experimental control over both the "when" and "to whom" of both X and O. As far as this analysis has been able to go, no such true experiments are possible without the ability to control X, to withhold it from carefully randomly selected respondents while presenting it to others. But control over O does not seem so indispensable. Consider the following design.

A X O1
A    O2
(O)            6. Posttest-Only Design, where O cannot be
(O)                withheld from any respondent
(O)

The parenthetical Os are inserted to indicate that the studied groups, experimental and control, have been selected from a larger universe all of which will get O anyway. An election provides such an O, and using "whether voted" rather than "how voted," this was Gosnell's design (11). Equated groups were selected at time A, and the experimental group subjected to persuasive materials designed to get out the vote. Using precincts rather than persons as the basic sampling unit, similar studies can be made on the content of the voting (6). Essential to this design is the ability to create specified randomly equated groups, the ability to expose one of these groups to X while withholding it (or providing X2) from the other group, and the ability to identify the performance of each individual or unit in the subsequent O. Since such measures are natural parts of the environment to which one wishes to generalize, they are not reactive, and Design 4, the Pretest-Posttest Control Group Design, is feasible if O has a predictable periodicity to it. With the precinct as a unit, this was the design of Hartmann's classic study of emotional vs. rational appeals in a public election (13). Note that 5, the Solomon Four-Group Design, is not available, as it requires the ability to withhold O experimentally, as well as X.

FURTHER PROBLEMS OF REPRESENTATIVENESS

The interaction effect of testing, affecting the external validity or representativeness of the experiment, was treated extensively in the previous section, inasmuch as it was involved in the comparison of alternative designs. The present section deals with the effects upon representativeness of other variables which, while equally serious, can apply to any of the experimental designs.

The interaction effects of selection. Even though the true experiments control selection and mortality for internal validity purposes, these factors have, in addition, an important bearing on representativeness. There is always the possibility that the obtained effects are specific to the experimental population and do not hold true for the populations to which one wants to generalize. Defining the universe of reference in advance and selecting the experimental and control groups from this at random would guarantee representativeness if it were ever achieved in practice. But inevitably not all those so designated are actually eligible for selection by any contact procedure. Our best survey sampling techniques, for example, can designate for potential contact only those available through residences. And, even of those so designated, up to 19 per cent are not contactable for an interview in their own homes even with five callbacks (37). It seems legitimate to assume that the more effort and time required of the respondent, the larger the loss through nonavailability and noncooperation. If one were to try to assemble experimental groups away from their own homes it seems reasonable to estimate a 50 per cent selection loss. If, still trying to extrapolate to the general public, one further limits oneself to docile preassembled groups, as in schools, military units, studio audiences, etc., the proportion of the universe systematically excluded through the sampling process must approach 90 per cent or more. Many of the selection factors involved are indubitably highly systematic. Under these extreme selection losses, it seems reasonable to suspect that the experimental groups might show reactions not characteristic of the general population. This point seems worth stressing lest we unwarrantedly assume that the selection loss for experiments is comparable to that found for survey interviews in the home at the respondent's convenience. Furthermore, it seems plausible that the greater the cooperation required, the more the respondent has to deviate from the normal course of daily events, the greater will be the possibility of nonrepresentative reactions. By and large, Design 6 might be expected to require less cooperation than Design 4 or 5, especially in the natural individual contact setting. The interactive effects of experimental mortality are of similar nature. Note that, on these grounds, the longer the experiment is extended in time the more respondents are lost and the less representative are the groups of the original universe.

Reactive arrangements. In any of the experimental designs, the respondents can become aware that they are participating in an experiment, and this awareness can have an interactive effect, in creating reactions to X which would not occur had X been encountered without this "I'm a guinea pig" attitude. Lazarsfeld (19), Kerr (17), and Rosenthal and Frank (28) all have provided valuable discussions of this problem. Such effects limit generalizations to respondents having this awareness, and preclude generalization to the population encountering X with nonexperimental attitudes. The direction of the effect may be one of negativism, such as an unwillingness to admit to any persuasion or change. This would be comparable to the absence of any immediate effect from discredited communicators, as found by Hovland (14). The result is probably more often a cooperative responsiveness, in which the respondent accepts the experimenter's expectations and provides pseudoconfirmation. Particularly is this positive response likely when the respondents are self-selected seekers after the cure that X may offer. The Hawthorne studies (21) illustrate such sympathetic changes due to awareness of experimentation rather than to the specific nature of X. In some settings it is possible to disguise the experimental purpose by providing plausible facades in which X appears as an incidental part of the background (e.g., 26, 27, 29). We can also make more extensive use of experiments taking place in the intact social situation, in which the respondent is not aware of the experimentation at all.

The discussion of the effects of selection on representativeness has argued against employing intact natural preassembled groups, but the issue of conspicuousness of arrangements argues for such use. The machinery of breaking up natural groups such as departments, squads, and classrooms into randomly assigned experimental and control groups is a source of reaction which can often be avoided by the use of preassembled groups, particularly in educational settings. Of course, as has been indicated, this requires the use of large numbers of such groups under both experimental and control conditions.

The problem of reactive arrangements is distributed over all features of the experiment which can draw the attention of the respondent to the fact of experimentation and its purposes. The conspicuous or reactive pretest is particularly vulnerable, inasmuch as it signals the topics and purposes of the experimenter. For communications of obviously persuasive aim, the experimenter's topical intent is signaled by the X itself, if the communication does not seem a part of the natural environment. Even for the posttest-only groups, the occurrence of the posttest may create a reactive effect. The respondent may say to himself, "Aha, now I see why we got that movie." This consideration justifies the practice of disguising the connection between O and X even for Design 6, as through having different experimental personnel involved, using different facades, separating the settings and times, and embedding the X-relevant content of O among a disguising variety of other topics.⁵

⁵ For purposes of completeness, the interaction of X with history and maturation should be mentioned. Both affect the generalizability of results. The interaction effect of history represents the possible specificity of results to a given historical moment, a possibility which increases as problems are more societal, less biological. The interaction of maturation and X would be represented in the specificity of effects to certain maturational levels, fatigue states, etc.

Generalizing to other Xs. After the internal validity of an experiment has been established, after a dependable effect of X upon O has been found, the next step is to establish the limits and relevant dimensions of generalization not only in terms of populations and settings but also in terms of categories and aspects of X. The actual X in any one experiment is a specific combination of stimuli, all confounded for interpretative purposes, and only some relevant to the experimenter's intent and theory. Subsequent experimentation should be designed to purify X, to discover that aspect of the original conglomerate X which is responsible for the effect. As Brunswik (3) has emphasized, the representative sampling of Xs is as relevant a problem in linking experiment to theory as is the sampling of respondents. To define a category of Xs along some dimension, and then to sample Xs for experimental purposes from the full range of stimuli meeting the specification while other aspects of each specific stimulus complex are varied, serves to untie or unconfound the defined dimension from specific others, lending assurance of theoretical relevance.

In a sense, the placebo problem can be understood in these terms. The experiment without the placebo has clearly demonstrated that some aspect of the total X stimulus complex has had an effect; the placebo experiment serves to break up the complex X into the suggestive connotation of pill-taking and the specific pharmacological properties of the drug, separating two aspects of the X previously confounded. Subsequent studies may discover with similar logic which chemical fragment of the complex natural herb is most essential. Still more clearly, the sham operation illustrates the process of X purification, ruling out general effects of surgical shock so that the specific effects of loss of glandular or neural tissue may be isolated. As these parallels suggest, once recurrent unwanted aspects of complex Xs have been discovered for a given field, control groups especially designed to eliminate these effects can be regularly employed.

Generalizing to other Os. In parallel form, the scientist in practice uses a complex measurement procedure which needs to be refined in subsequent experimentation. Again, this is best done by employing multiple Os all having in common the theoretically relevant attribute but varying widely in their irrelevant specificities. For Os this process can be introduced into the initial experiment by employing multiple measures. A major practical reason for not doing so is that it is so frequently a frustrating experience, lending hesitancy, indecision, and a feeling of failure to studies that would have been interpreted with confidence had but a single response measure been employed.

Transition experiments. The two previous paragraphs have argued against the exact replication of experimental apparatus and measurement procedures on the grounds that this continues the confounding of theory-relevant aspects of X and O with specific artifacts of unknown influence. On the other hand, the confusion in our literature generated by the heterogeneity of results from studies all on what is nominally the "same" problem but varying in implementation, is leading some to call for exact replication of initial procedures in subsequent research on a topic. Certainly no science can emerge without dependably repeatable experiments. A suggested resolution is the transition experiment, in which the need for varying the theory-independent aspects of X and O is met in the form of a multiple X, multiple O design, one segment of which is an "exact" replication of the original experiment, exact at least in those major features which are normally reported in experimental writings.

Internal vs. external validity. If one is in a situation where either internal validity or representativeness must be sacrificed, which should it be? The answer is clear. Internal validity is the prior and indispensable consideration. The optimal design is, of course, one having both internal and external validity. Insofar as such settings are available, they should be exploited, without embarrassment from the apparent opportunistic warping of the content of studies by the availability of laboratory techniques. In this sense, a science is as opportunistic as a bacteria culture and grows only where growth is possible. One basic necessity for such growth is the machinery for selecting among alternative hypotheses, no matter how limited those hypotheses may have to be.

SUMMARY

In analyzing the extraneous variables which experimental designs for social settings seek to control, seven categories have been distinguished: history, maturation, testing, instrument decay, regression, selection, and mortality. In general, the simple or main effects of these variables jeopardize the internal validity of the experiment and are adequately controlled in standard experimental designs. The interactive effects of these variables and of experimental arrangements affect the external validity or generalizability of experimental results. Standard experimental designs vary in their susceptibility to these interactive effects. Stress is also placed upon the differences among measuring instruments and arrangements in the extent to which they create unwanted interactions. The value for social science purposes of the Posttest-Only Control Group Design is emphasized.

REFERENCES

1. ANNIS, A. D., & MEIER, N. C. The induction of opinion through suggestion by means of planted content. J. soc. Psychol., 1934, 5, 65-81.
2. BORING, E. G. The nature and history of experimental control. Amer. J. Psychol., 1954, 67, 573-589.
3. BRUNSWIK, E. Perception and the representative design of psychological experiments. Berkeley: Univer. of California Press, 1956.
4. BUGELSKI, B. R. A note on Grant's discussion of the Latin square principle in the design and analysis of psychological experiments. Psychol. Bull., 1949, 46, 49-50.
5. CAMPBELL, D. T. The indirect assessment of social attitudes. Psychol. Bull., 1950, 47, 15-38.
6. CAMPBELL, D. T. On the possibility of experimenting with the "Bandwagon" effect. Int. J. Opin. Attitude Res., 1951, 5, 251-260.
7. CANTRIL, H. Gauging public opinion. Princeton: Princeton Univer. Press, 1944.
8. COLLIER, R. M. The effect of propaganda upon attitude following a critical examination of the propaganda itself. J. soc. Psychol., 1944, 20, 3-17.
9. CRESPI, L. P. The interview effect in polling. Publ. Opin. Quart., 1948, 12, 99-111.
10. FISHER, R. A. The design of experiments. Edinburgh: Oliver & Boyd, 1935.
11. GOSNELL, H. F. Getting out the vote: an experiment in the stimulation of voting. Chicago: Univer. of Chicago Press, 1927.
12. GREENBERG, A. Matched samples. J. Marketing, 1953-54, 18, 241-245.
13. HARTMANN, G. W. A field experiment on the comparative effectiveness of "emotional" and "rational" political leaflets in determining election results. J. abnorm. soc. Psychol., 1936, 31, 99-114.

14. HOVLAND, C. I., JANIS, I. L., & KELLEY, H. H. Communication and persuasion. New Haven: Yale Univer. Press, 1953.
15. HOVLAND, C. I., LUMSDAINE, A. A., & SHEFFIELD, F. D. Experiments on mass communication. Princeton: Princeton Univer. Press, 1949.
16. JAHODA, M., DEUTSCH, M., & COOK, S. W. Research methods in social relations. New York: Dryden Press, 1951.
17. KERR, W. A. Experiments on the effect of music on factory production. Appl. Psychol. Monogr., 1945, No. 5.
18. KISH, L. Selection of the sample. In L. Festinger and D. Katz (Eds.), Research methods in the behavioral sciences. New York: Dryden Press, 1953, 175-239.
19. LAZARSFELD, P. F. Training guide on the controlled experiment in social research. Dittoed. Columbia Univer., Bureau of Applied Social Research, 1948.
20. LUCAS, D. B., & BRITT, S. H. Advertising psychology and research. New York: McGraw-Hill, 1950.
21. MAYO, E. The human problems of an industrial civilization. New York: Macmillan, 1933.
22. McNEMAR, Q. A critical examination of the University of Iowa studies of environmental influences upon the IQ. Psychol. Bull., 1940, 37, 63-92.
23. MENEFEE, S. C. An experimental study of strike propaganda. Soc. Forces, 1938, 16, 574-582.
24. PARRISH, J. A., & CAMPBELL, D. T. Measuring propaganda effects with direct and indirect attitude tests. J. abnorm. soc. Psychol., 1953, 48, 3-9.
25. PAYNE, S. L. The ideal model for controlled experiments. Publ. Opin. Quart., 1951, 15, 557-562.

26. POSTMAN, L., & BRUNER, J. S. Perception under stress. Psychol. Rev., 1948, 55, 314-322.
27. RANKIN, R. E., & CAMPBELL, D. T. Galvanic skin response to Negro and white experimenters. J. abnorm. soc. Psychol., 1955, 51, 30-33.
28. ROSENTHAL, D., & FRANK, J. D. Psychotherapy and the placebo effect. Psychol. Bull., 1956, 53, 294-302.
29. SCHANCK, R. L., & GOODMAN, C. Reactions to propaganda on both sides of a controversial issue. Publ. Opin. Quart., 1939, 3, 107-112.
30. SOLOMON, R. L. An extension of control group design. Psychol. Bull., 1949, 46, 137-150.
31. STAR, S. A., & HUGHES, H. M. Report on an educational campaign: the Cincinnati plan for the United Nations. Amer. J. Sociol., 1949-50, 55, 389.
32. STOUFFER, S. A. Some observations on study design. Amer. J. Sociol., 1949-50, 55, 355-361.
33. STOUFFER, S. A. Measurement in sociology. Amer. sociol. Rev., 1953, 18, 591-597.
34. UNDERWOOD, B. J. Experimental psychology. New York: Appleton-Century-Crofts, 1949.
35. UNDERWOOD, B. J. Interference and forgetting. Psychol. Rev., 1957, 64, 49-60.
36. UNDERWOOD, B. J. Psychological research. New York: Appleton-Century-Crofts, 1957.
37. WILLIAMS, R. Probability sampling in the field: a case history. Publ. Opin. Quart., 1950, 14, 316-330.

Received October 7, 1956