
Observational Studies 1 (2015) 124-125 Submitted 8/15; Published 8/15

Introduction to Observational Studies and the Reprint of Cochran's paper "Observational Studies" and Comments

Dylan S. Small [email protected]

Department of Statistics, University of Pennsylvania

Philadelphia, PA 19104, USA

In this first issue of Observational Studies, we reprint a review of observational studies by William Cochran, a pioneer of statistical research on observational studies, followed by comments by leading current researchers in observational studies. Cochran (1965, Journal of the Royal Statistical Society, Series A) defined an observational study as

an empiric investigation [in which]...the objective is to elucidate cause-and-effect relationships...[in which] it is not feasible to use controlled experimentation, in the sense of being able to impose the procedures or treatments whose effects it is desired to discover, or to assign subjects at random to different procedures.

Observational Studies is a new peer-reviewed journal that seeks to publish papers on all aspects of observational studies. Researchers from all fields that make use of observational studies are encouraged to submit papers. Topics covered by the journal include, but are not limited to, the following:

• Study protocols for observational studies. The journal seeks to promote the planning and transparency of observational studies. In addition to publishing study protocols, the journal will publish comments on the study protocols and allow the authors of the study protocol to respond to the comments.

• Methodologies for observational studies. This includes statistical methods for all aspects of observational studies and methods for the conduct of observational studies such as methods for collecting data. In addition to novel methodological articles, the journal welcomes review articles on methodology relevant to observational studies as well as illustrations/explanations of methodologies that may have been developed in a more technical article in another journal.

• Software for observational studies. The journal welcomes articles describing software relevant to observational studies.

• Descriptions of observational study data sets. The journal welcomes descriptions of observational study data sets and how to access them. The goal of the descriptions of observational study data sets is to enable readers to form collaborations, to learn from each other and to maximize use of existing resources. The journal also encourages submission of examples of how a publicly available observational study database can be used.

© 2015 Dylan S. Small.


• Analyses of observational studies. The journal welcomes analyses of observational studies. The journal encourages submissions of analyses that illustrate use of sound methodology and conduct of observational studies.

The paper we reprint of Cochran's and the comments by leading current researchers in observational studies provide illuminating perspectives on important issues in observational studies that the journal seeks to address. The contents of the rest of this section are as follows:

William Cochran, "Observational Studies" (126-136)
Norman Breslow, "William G. Cochran and the 1964 Surgeon's General Report" (137-140)
Thomas Cook, "The Inheritance bequeathed to William G. Cochran that he willed forward and left for others to will forward again: The Limits of Observational Studies that seek to Mimic Randomized Experiments" (140-163)
David Cox & Nanny Wermuth, "Design and interpretation of studies: relevant concepts from the past and some extensions" (165-170)
Stephen Fienberg, "Comment on 'Observational Studies' by William G. Cochran" (171-172)
Joseph Gastwirth & Barry Graubard, "Comment on Cochran's 'Observational Studies'" (173-181)
Andrew Gelman, "The State of the Art in Causal Inference: Some Changes Since 1972" (182-183)
Ben Hansen & Adam Sales, "Comment on Cochran's 'Observational Studies'" (184-193)
Miguel Hernan, "A good deal of humility: Cochran on observational studies" (194-195)
Jennifer Hill, "Lessons we are still learning" (196-199)
Judea Pearl, "Causal Thinking in the Twilight Zone" (200-204)
Paul Rosenbaum, "Cochran's Causal Crossword" (205-211)
Donald Rubin, "Comment on Cochran's 'Observational Studies'" (212-216)
Herbert Smith, "Comment on Cochran's 'Observational Studies'" (217-219)
Mark van der Laan, "Comment on 'Observational Studies' by Dr. W.G. Cochran (1972)" (220-222)
Tyler VanderWeele, "Observational Studies and Study Designs: An Epidemiologic Perspective" (223-230)
Stephen West, "Reflections on 'Observational Studies': Looking Backward and Looking Forward" (231-240)


Observational Studies 1 (2015) 126-136 Originally published 1972; Reprinted 8/15

Observational Studies

William G. Cochran

Editor's Note: William G. Cochran (1909-1980) was Professor of Statistics, Harvard University, Cambridge, Massachusetts. This article was originally published in Statistical Papers in Honor of George W. Snedecor, ed. T.A. Bancroft, 1972, Iowa State University Press, pp. 77-90. The paper is reprinted with permission of the copyright holder, Iowa State University Press. Comments by leading current researchers in observational studies follow.

1. Introduction

OBSERVATIONAL STUDIES are a class of statistical studies that have increased in frequency and importance during the past 20 years. In an observational study the investigator is restricted to taking selected observations or measurements on the process under study. For one reason or another he cannot interfere in the process in the way that one does in a controlled laboratory type of experiment.

Observational studies fall roughly into two broad types. The first is often given the name of "analytical surveys." The investigator takes a sample survey of a population of interest and proceeds to conduct statistical analyses of the relations between variables of interest to him. An early example was Kinsey's study (1948) of the relation between the frequencies of certain types of sexual behavior and variables like the age, sex, social level, religious affiliation, rural-urban background, and direction of social mobility of the person involved. Dr. Kinsey gave much thought to the methodological problems that he would face in planning his study. More recently, in what is called the "midtown Manhattan study" (Srole et al., 1962), a team of psychiatrists studied the relation in Manhattan, New York, between age, sex, parental and own social level, ethnic origin, generation in the United States, and religion and nonhospitalized mental illness.

The second type of observational study is narrower in scope. The investigator has in mind some agents, procedures, or experiences that may produce certain causal effects (good or bad) on people. These agents are like those the statistician would call treatments in a controlled experiment, except that a controlled experiment is not feasible. Examples of this type abound. A simple one structurally is a Cornell study of the effect of wearing a lap seat belt on the amount and type of injury sustained in an automobile collision. This study was done from police and medical records of injuries in automobile accidents. The prospective smoking and health studies (1964) are also a well-known example. These are comparisons of the death rates and causes of death of men and women with different smoking patterns in regard to type and amount. An example known as the "national halothane study" (Bunker et al., 1969) attempted to make a fair comparison of the death rates due to the five leading anesthetics used in hospital operations.

© 2015 Iowa State University Press.


Several factors are probably responsible for the growth in the number of studies of this kind. One is a general increase in funds for research in the social sciences and medicine. A related reason is the growing awareness of social problems. A study known as the "Coleman report" (1966) has attracted much discussion. This was begun because Congress gave the U.S. Office of Education a substantial sum and asked it to conduct a nationwide survey of elementary schools and high schools to discover to what extent minority-group children in the United States (Blacks, Indians, Puerto Ricans, Mexican-Americans, and Orientals) receive a poorer education than the majority whites. A third reason is the growing area of program evaluation. All over the world, administrative bodies – central, regional, and local – spend the taxpayers' money on new programs intended to benefit some or all of the population or to combat social evils. Similarly, a business organization may institute changes in its operations in the hope of improving the running of the business. The idea is spreading that it might be wise to devote some resources to trying to measure both the intended and the unintended effects of these programs. Such evaluations are difficult to do well, and they make much use of observational studies. Finally, some studies are undertaken to investigate stray reports of unexpected effects that appear from time to time. The halothane study is an example; others are studies of side effects of the contraceptive pill and studies of health effects of air pollution.

This paper is confined mainly to the second, narrower class of observational studies, although some of the problems to be considered are also met in the broader analytical ones.

For this paper I naturally sought a topic that would reflect the outlook and research interests of George Snedecor. In his career activity of helping investigators, he developed a strong interest in the design of experiments, a subject on which numerous texts are now available. The planning of observational studies, in which we would like to do an experiment but cannot, is a closely related topic which cries aloud for George's mature wisdom and the methodological truths that he expounded so clearly.

Succeeding sections will consider some of the common issues that arise in planning.

2. The Statement of Objectives

Early in the planning it is helpful to construct and discuss as clear and specific a written statement of the objectives as can be made at that stage. Otherwise it is easy in a study of any complexity to take later decisions that are contrary to the objectives or to find that different team members have conflicting ideas about the purpose of the study. Some investigators prefer a statement in the form of hypotheses to be tested, others in the form of quantities to be estimated or comparisons to be made. An example of the hypothesis type comes from a study (Buck et al., 1968), by a Johns Hopkins team, of the effects of coca-chewing by Peruvian Indians. Their hypotheses were stated as follows.

1. Coca, by diminishing the sensation of hunger, has an unfavorable effect on the nutritional state of the habitual chewer. Malnutrition and conditions in which nutritional deficiencies are important disease determinants occur more frequently among chewers than among control subjects.


2. Coca chewing leads to a state of relative indifference which can result in inferior personal hygiene.

3. The work performance of coca chewers is lower than that of comparable nonchewers.

One objection sometimes made to this form of statement is its suggestion that the answers are already known, and thus it hints at personal bias. However, these statements could easily have been put in a neutral form, and the three specific hypotheses about coca were suggested by a previous League of Nations commission. The statements perform the valuable purpose of directing attention to the comparisons and measurements that will be needed.

3. The Comparative Structure of the Plan

The statement of objectives should have suggested the type of comparisons on which logical judgments about the effects of treatment would be based. Some of the most common structures are outlined below. First, the study may be restricted to a single group of people, all subject to the same treatment. The timing of the measurements may take several forms.

1. After Only (i.e., after a period during which the treatment should have had time to produce its effects).

2. Before and After (planned comparable measurements both before and after the period of exposure to the agent or treatment).

3. Repeated Before and Repeated After.

In both (1) and (2) there may be a series of After measurements if there is interest in the long-term effects of treatment.

Single-group studies are so weak logically that they should be avoided whenever possible, but in the case of a compulsory law or change in business practice, a comparable group not subject to the treatment may not be available. In an After Only study we can perhaps judge whether or not the situation after the period of treatment was satisfactory but have no basis for judging to what extent, if any, the treatment was a cause, except perhaps by an opinion derived from a subjective impression as to the situation before exposure. Supplementary observations might of course teach something useful about the operation of a law – e.g., that it was widely disobeyed through ignorance or unpopularity with the public or that it was unworkable as too complex for the administrative staff.

In the single-group Before and After study we at least have estimates of the changes that took place during the period of treatment. The problem is to judge the role of the treatment in producing these changes. For this step it is helpful to list and judge any other contributors to the change that can be envisaged. Campbell and Stanley (1966) have provided a useful list with particular reference to the field of education.

Consider a Before-After rise. This might be due to what I vaguely call "external" causes. In an economic study a Before-After rise might accompany a wide variety of "treatments," good or bad, during a period of increasing national employment and prosperity. In educational examinations contributors might be the increasing maturity of the students or familiarity with the tests. In a study of an apparently low group on some variable (e.g.,


poor at some task) a rise might be due to what is called the regression effect. If a person's score fluctuates from time to time through human variability or measurement error, the "low" group selected is likely to contain persons who were having an unusually bad day or had a negative error of measurement on that day. In the subsequent After measurement, such persons are likely to show a rise in score even under no treatment – either they are having one of their "up" days or the error of measurement is positive on that day. After World War I the French government instituted a wage bonus for civil servants with large families to stimulate an increase in the birthrate and the population of France. I have been told the primary effect was an influx of men with large families into French civil service jobs, creating a Before-After rise that might be interpreted as a success of the "treatment." An English Before-After evaluation of a publicity campaign to encourage people to come into London clinics for needed protective shots obtained a Before-After drop in number of shots given. The clinics, who were asked to keep the records, had persuaded patrons to come in at once if they were known to be intending to have shots (Before), so that these people would be out of the way when the presumed big rush from the campaign started.
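The regression effect is easy to reproduce by simulation. In this sketch (invented score distributions, not data from any study cited here), a "low" group is selected on a noisy Before score, and its After mean rises although no treatment is applied to anyone:

```python
import random

# Sketch of the regression effect (assumed distributions; no treatment anywhere).
random.seed(0)
n = 100_000
true_scores = [random.gauss(50, 10) for _ in range(n)]
before = [t + random.gauss(0, 10) for t in true_scores]  # Before = truth + noise
after = [t + random.gauss(0, 10) for t in true_scores]   # After = truth + fresh noise

# Select the "low" group on the Before measurement only.
low = [i for i in range(n) if before[i] < 40]
mean_before = sum(before[i] for i in low) / len(low)
mean_after = sum(after[i] for i in low) / len(low)

# The low group's After mean moves back toward 50 although nothing was done.
print(round(mean_before, 1), round(mean_after, 1))
```

The group was low partly because of bad luck in the Before noise; the After noise is fresh, so the group's mean regresses toward the population mean.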

A time-series study with repeated measurements Before and After presents interesting problems – that of appraising whether the Before-After change during the period of treatment is real in relation to changes that occur from external causes in the Before and After periods and that of deciding what is suggested about the time-response curve to the treatment. Campbell and Ross (1968) give an excellent account of the types of analysis and judgment needed in connection with a study of the Connecticut state law imposing a crackdown on speeding, and Campbell (1969) has discussed the role of this and other techniques in a highly interesting paper on program evaluation.

Single-group studies emphasize a characteristic that is prominent in the analysis of nearly all observational studies – the role of judgment. No matter how well-constructed a mathematical model we have, we cannot expect to plan a statistical analysis that will provide an almost automatic verdict. The statistician who intends to operate in this field must cultivate an ability to judge and weigh the relative importance of different factors whose effects cannot be measured at all accurately.

Reverting to types of structure, we come now to those with more than one group. The simplest is a two-group study of treated and untreated groups (seat-belt wearers and nonwearers). We may also have various treatments or forms of treatment, as in the smoking and health studies (pipes, cigars, cigarettes, different amounts smoked, and ex-smokers who had stopped for different lengths of time and had previously smoked different amounts). Both After Only and Before and After measurements are common. Sometimes both an After Only and a Before-After measurement are recommended for each comparison group if there is interest in studying whether the taking of the Before measurement influenced the After measurement.

Comparison groups bring a great increase in analytical insight. The influence of external causes on both groups will be similar in many types of study and will cancel or be minimized when we compare treatment with no treatment. But such studies raise a new problem – How do we ensure that the groups are comparable? Some relevant statistical techniques are outlined in section 6. In regard to incomparability of the groups the Before and After study is less vulnerable than the After Only since we should be able to judge comparability of the treated and untreated groups on the response variable at a time when they have not been


subjected to the difference in treatment. Occasionally, we might even be able to select the two groups by randomization, having a randomized experiment instead of an observational study; but this is not feasible when the groups are self-selected (as in smokers) or selected by some administrative fiat or outside agent (e.g., illness).

4. Measurements

The statement of objectives will also have suggested the types of measurements needed; their relevance is obviously important. For instance, early British studies by aerial photographs in World War II were reported to show great damage to German industry. Knowing that early British policy was to bomb the town center and that German factories were often concentrated mainly on the outskirts, Yates (1968) confined his study to the factory areas, with quite a different conclusion which was confirmed when postwar studies could be made. The question of what is considered relevant is particularly important in program evaluation. A program may succeed in its main objectives but have undesirable side effects. The verdict on the program may differ depending on whether or not these side effects are counted in the evaluation.

It is also worth reviewing what is known about the accuracy and precision of proposed measurements. This is especially true in social studies, which often deal with people's attitudes, motivations, opinions, and behavior – factors that are difficult to measure accurately. Since we may have to manage with very imperfect measurements, statisticians need more technical research on the effects of errors of measurement. Three aspects are: (1) more study of the actual distribution of errors of measurement, particularly in multivariate problems, so that we work with realistic models; (2) investigation, from these models, of the effects on the standard types of analysis; (3) study of methods of remedying the situation by different analyses with or without supplementary study of the error distributions. To judge by work to date on the problem of estimating a structural regression, this last problem is formidable.
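One well-known effect of errors of measurement, attenuation of a fitted regression slope toward zero, can be shown in a few lines. This is a sketch under assumed normal errors (the reliability here is 0.5, so the slope is roughly halved):

```python
import random

# Sketch: error in the predictor attenuates a fitted regression slope.
random.seed(1)
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]          # true covariate, variance 1
y = [2.0 * xi + random.gauss(0, 1) for xi in x]     # true slope = 2
x_obs = [xi + random.gauss(0, 1) for xi in x]       # measured with error, variance 1

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)
    var = sum((a - mx) ** 2 for a in xs) / len(xs)
    return cov / var

print(round(slope(x, y), 2))      # close to 2.0
print(round(slope(x_obs, y), 2))  # close to 2 * 1/(1+1) = 1.0 (attenuated)
```

The fitted slope shrinks by the reliability ratio, the true variance divided by the observed variance, which is why realistic models of the error distribution matter for the standard analyses.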

It is also important to check comparability of measurement in the comparison groups. In a medical study a trained nurse who has worked with one group for years but is a stranger to the other group might elicit different amounts of trustworthy information on sensitive questions. Cancer patients might be better informed about cases of cancer among blood relatives than controls free from cancer.

The scale of the operation may also influence the measuring process. The midtown Manhattan study, for instance, at first planned to use trained psychiatrists for obtaining the key measurements, but they found that only enough psychiatrists could be provided to measure a sample of 100. The analytical aims of the study needed a sample of at least 1,000. In numerous instances the choice seems to lie between doing a study much smaller and narrower in scope than desired but with high quality of measurement, or an extensive study with measurements of dubious quality. I am seldom sure what to advise.

In large studies one occasionally sees a mistake in plans for measurement that is perhaps due to inattention. If two laboratories or judges are needed to measure the responses, an administrator sends all the treatment group to laboratory 1 and the untreated to laboratory 2 – it is at least a tidy decision. But any systematic difference between laboratories or judges


becomes part of the estimated treatment effect. In such studies there is usually no difficulty in sending half of each group, selected at random, to each judge.

5. Observations and Experiments

In the search for techniques that help to ensure comparability in observational studies, it is worth recalling the techniques used in controlled experiments, where the investigator faces similar problems but has more resources to command. In simple terms these techniques might be described as follows.

Identify the major sources of variation (other than the treatments) that affect the response variable. Conduct the experiment and analysis so that the effects of such sources are removed or balanced out. The two principal devices for this purpose are blocking and the analysis of covariance. Blocking is employed at the planning stage of the experiment. With two treatments, for example, the subjects are first grouped into pairs (blocks of size 2) such that the members of a pair are similar with respect to the major anticipated sources of variation. Covariance is used primarily when the response variable y is quantitative and some of the major extraneous sources of variation can also be represented by quantitative variables x1, x2, . . . From a mathematical model expressing y in terms of the treatment effects and the values of the xi, estimates of the treatment effects are obtained that have been adjusted to remove the effects of the xi. Covariance and blocking may be combined.
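The blocking device can be sketched as follows. The pairing scheme and the numbers here are invented for illustration, not taken from the paper: subjects are paired on a prognostic variable x, treatment is randomized within each pair, and the within-pair differences are averaged:

```python
import random

# Sketch of blocking (pairing) in a randomized experiment; numbers assumed.
random.seed(2)
true_tau = 3.0

# x is a prognostic score; sorting and pairing neighbors makes pair members similar.
x = sorted(random.gauss(0, 5) for _ in range(200))

pair_diffs = []
for i in range(0, len(x), 2):
    first_treated = random.random() < 0.5        # randomize within the pair
    y_first = x[i] + (true_tau if first_treated else 0.0) + random.gauss(0, 1)
    y_second = x[i + 1] + (0.0 if first_treated else true_tau) + random.gauss(0, 1)
    diff = y_first - y_second if first_treated else y_second - y_first
    pair_diffs.append(diff)                      # treated minus control, per pair

tau_hat = sum(pair_diffs) / len(pair_diffs)
print(round(tau_hat, 2))  # near 3.0: pairing has removed most of the x variation
```

Because pair members are close on x, the large between-subject variation in x mostly cancels in each difference, leaving the treatment effect plus small noise.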

For minor and unknown sources of variation, use randomization. Roughly speaking, randomization makes such sources of error equally likely to favor either treatment and ensures that their contribution is included in the standard error of the estimated treatment effect if properly calculated for the plan used.

In general, extraneous sources of variation may influence the estimated treatment effect τ̂ in two ways. They may create a bias B. Instead of estimating the true treatment effect τ, the expected value of τ̂ is (τ + B), where B is usually unknown. They also increase the variance of τ̂. In experiments a result of randomization and other precautions (e.g., blindness in measurement) is that the investigator usually has little worry about bias. Discussions of the effectiveness of blocking and covariance (e.g., Cox, 1957) are confined to their effect on V(τ̂) and on the power of tests of significance.

In observational studies we cannot use random assignment of subjects, but we can try to use techniques like blocking and covariance. However, in the absence of randomization these techniques have a double task – to remove or reduce bias and to increase precision by decreasing V(τ̂). The reduction of bias should, I think, be regarded as the primary objective – a highly precise estimate of the wrong quantity is not much help.
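The double task can be illustrated with a small simulation under an assumed data-generating process (not an analysis from the paper): a covariate x drives both selection into treatment and the response, so the raw difference in means carries a bias B, while a linear covariance adjustment, here ordinary least squares of y on treatment and x, removes it:

```python
import math
import random

# Sketch of covariance adjustment in an observational setting. Assumed model:
# y = 3x + tau*treated + noise, with x also driving selection into treatment.
random.seed(3)
true_tau = 2.0
rows = []
for _ in range(100_000):
    x = random.gauss(0, 1)
    treated = 1 if random.random() < 1 / (1 + math.exp(-2 * x)) else 0
    y = 3 * x + true_tau * treated + random.gauss(0, 1)
    rows.append((x, treated, y))

n = len(rows)
t_y = [y for x, tr, y in rows if tr]
c_y = [y for x, tr, y in rows if not tr]
raw_diff = sum(t_y) / len(t_y) - sum(c_y) / len(c_y)  # biased by the x imbalance

# Linear covariance: OLS of y on (treatment, x), via centered normal equations.
sx = sum(r[0] for r in rows); st = sum(r[1] for r in rows); sy = sum(r[2] for r in rows)
cxx = sum(r[0] ** 2 for r in rows) - sx * sx / n
ctt = sum(r[1] ** 2 for r in rows) - st * st / n
cxt = sum(r[0] * r[1] for r in rows) - sx * st / n
cxy = sum(r[0] * r[2] for r in rows) - sx * sy / n
cty = sum(r[1] * r[2] for r in rows) - st * sy / n
tau_hat = (cxx * cty - cxt * cxy) / (cxx * ctt - cxt * cxt)  # treatment coefficient

print(round(raw_diff, 2), round(tau_hat, 2))  # raw difference biased; tau_hat near 2.0
```

When the linear model is correctly specified, as assumed here, the adjustment removes all of B; the paper's later sections discuss what remains when it is not.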

6. Matching and Adjustments

In observational studies as in experiments we start with a list of the most important extraneous sources of variation that affect the response variable. The Cornell study, based on automobile accidents involving seat-belt wearers and nonwearers, listed 12 major variables. The most important was the intensity and direction of the physical force at impact. A head-on collision at 60 mph is a very different matter from a sideswipe at 25 mph. In the smoking–death-rate studies age gradually becomes a predominating variable for men over


55. In the raw data supplied to the Surgeon General's committee by the British and Canadian studies and in a U.S. study cigarette smokers and nonsmokers had about the same death rates. The high death rates occurred among the cigar and pipe smokers. If these data had been believed, television warnings might now be advising cigar and pipe smokers to switch to cigarettes. However, cigar and pipe smokers in these studies were found to be markedly older than nonsmokers, while cigarette smokers were, on the whole, younger. All studies regarded age as a major extraneous variable in the analysis. After adjustment for age differences, death rates for cigar and pipe smokers were close to those for nonsmokers; those for cigarette smokers were consistently higher.

In observational studies three methods are in common use in an attempt to remove bias due to extraneous variables.

Blocking, usually known as matching in observational studies. Each member of the treated group has a match or partner in the untreated group. If the x variables are classified, we form the cells created by the multiple classification (e.g., x1 with 3 classes and x2 with 4 classes create 12 cells). A match means a member of the same cell. If x is quantitative (discrete or continuous), a common method is to turn it into a classified variate (e.g., age in 10-year classes). Another method, caliper matching, is to call x11i (in group 1) and x12j (in group 2) matches with respect to x1 if |x11i − x12j| ≤ a.
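A minimal sketch of the caliper rule follows. The greedy pairing strategy and the ages below are hypothetical illustrations, not from the Cornell study:

```python
# Sketch of caliper matching on one quantitative x (hypothetical data).
# A treated subject i and a control j match if |x1i - x2j| <= a.

def caliper_match(treated_x, control_x, a):
    """Greedy caliper matching; returns a list of (treated_idx, control_idx)."""
    used = set()
    pairs = []
    for i, xt in enumerate(treated_x):
        # Pick the closest still-unused control within the caliper a.
        best, best_j = None, None
        for j, xc in enumerate(control_x):
            if j in used:
                continue
            d = abs(xt - xc)
            if d <= a and (best is None or d < best):
                best, best_j = d, j
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs

treated = [25, 34, 61]            # e.g., ages of treated subjects
controls = [23, 36, 40, 59, 70]   # the larger reservoir of potential controls
print(caliper_match(treated, controls, a=3))  # [(0, 0), (1, 1), (2, 3)]
```

The need for a "large reservoir" noted below is visible here: each match consumes a control, and a treated subject with no unused control inside the caliper goes unmatched.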

Standardization (adjustment by subclassification). This is the analogue of covariance when the x's are classified and we do not match. Arrange the data from the treated and untreated samples in cells, the ith cell containing say n1i, n2i observations with response means ȳ1i, ȳ2i. If the effect τ of the treatment is the same in every cell, this method depends on the result that for any set of weights wi with Σwi = 1, the quantity τ̂ = Σwi(ȳ1i − ȳ2i) is an unbiased estimate of τ (apart from any within-cell biases). The weights can therefore be chosen to minimize V(τ̂). If it is clear that τ varies from cell to cell, as often happens, the choice of weights becomes more critical, since it determines the quantity Σwiτi that is being estimated. In vital statistics a common practice is to take the weights from some standard population to which we wish the comparison to apply.
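The standardization estimate τ̂ = Σwi(ȳ1i − ȳ2i) is simple to compute. The cell counts and means below are made up for illustration, and the weights are taken proportional to cell size, one common choice:

```python
# Sketch of standardization across cells (hypothetical counts and means).
cells = [
    # (n1i, ybar1i, n2i, ybar2i) for each cell of the x-classification
    (40, 12.0, 60, 10.0),
    (35, 15.5, 25, 13.0),
    (25, 20.0, 15, 18.5),
]

# One common choice of weights: each cell's share of the combined sample.
total = sum(n1 + n2 for n1, _, n2, _ in cells)
weights = [(n1 + n2) / total for n1, _, n2, _ in cells]
assert abs(sum(weights) - 1) < 1e-12   # weights must sum to 1

# tau_hat = sum_i w_i * (ybar_1i - ybar_2i)
tau_hat = sum(w * (y1 - y2) for w, (_, y1, _, y2) in zip(weights, cells))
print(round(tau_hat, 3))  # 2.05 for these numbers
```

With weights from a standard population instead, the same line estimates the comparison for that population, which is exactly why the choice of wi matters when τ varies across cells.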

Covariance (with x's quantitative), used just as in experiments. The idea of matching is easy to grasp, and the statistical analysis is simple. On the operational side, matching requires a large reservoir in at least one group (treated or untreated) in which to look for matches. The hunt for matches (particularly with caliper matching) may be slow and frustrating, although computers should be able to help if data about the x's can be fed into them. Matching is avoided when the planned sample size is large, there are numerous treatments, subjects become available only slowly through time, and it is not feasible to measure the x's until the samples have already been chosen and y is also being measured.

There has been relatively little study of the effects of these devices on bias and precision, although particular aspects have been discussed by Billewicz (1965), Cochran (1968) and Rubin (1970). If x is classified and two members of the same class are identical in regard to the effect of x on y, matching and standardization remove all the bias, while matching should be somewhat superior in regard to precision. I am not sure, however, how often such ideal classifications actually exist. Many classified variables, especially ordered classifications, have an underlying quantitative x – e.g., for sex with certain types of response there is a whole gradation from very manly men to very womanly women. This is obviously true for quantitative x's that are deliberately made classified in order to use within-cell matching.


In such cases, matching and standardization remove between-cell bias but not within-cell bias. Of an initial bias in means µ1x − µ2x, they remove about 64%, 80%, 87%, 91%, and 93% with 2, 3, 4, 5, and 6 classes, the actual amount varying a little with the choice of class boundaries and the nature of the x distribution (Cochran, 1968). Caliper matching removes about 76%, 84%, 90%, 95%, and 99% with a/σx = 1, 0.8, 0.6, 0.4, and 0.2. These percentages also apply to y under a linear or nearly linear regression of y on x.

With a quantitative x, covariance adjustments remove all the initial bias if the correct model is fitted, and they are superior to within-class matching of x when this assumption holds. In practice, covariance nearly always means linear covariance to most users, and some bias remains after covariance adjustment if the y, x relation is nonlinear and a linear covariance is fitted. If nonlinearity is of the type that can be approximated by a quadratic curve, results by Rubin (1970) suggest that the residual bias should be small if σ²1x = σ²2x and x is symmetrical or nearly so in distribution. When σ²1x/σ²2x is 1/2 or 2, the adjustment can either overcorrect or undercorrect to a material extent.
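The point that covariance adjustment removes the initial bias when the linear model is correct can be illustrated with a small simulation (data invented for the example; the adjustment uses the pooled within-group slope, the textbook analysis-of-covariance form):

```python
import random

def covariance_adjusted_diff(g1, g2):
    """Linear covariance (ANCOVA-style) adjustment: (ybar1 - ybar2)
    minus b*(xbar1 - xbar2), where b is the pooled within-group slope
    of y on x. Each group is a list of (x, y) pairs."""
    def summarize(g):
        n = len(g)
        xb = sum(x for x, _ in g) / n
        yb = sum(y for _, y in g) / n
        sxx = sum((x - xb) ** 2 for x, _ in g)
        sxy = sum((x - xb) * (y - yb) for x, y in g)
        return xb, yb, sxx, sxy
    x1, y1, sxx1, sxy1 = summarize(g1)
    x2, y2, sxx2, sxy2 = summarize(g2)
    b = (sxy1 + sxy2) / (sxx1 + sxx2)
    return (y1 - y2) - b * (x1 - x2)

random.seed(0)
tau = 3.0  # true treatment effect
treated, control = [], []
for _ in range(1000):
    # Treated units have systematically higher x: initial bias in means = 1.
    x = random.gauss(1.0, 1.0)
    treated.append((x, 2.0 + 0.5 * x + tau + random.gauss(0.0, 0.2)))
    x = random.gauss(0.0, 1.0)
    control.append((x, 2.0 + 0.5 * x + random.gauss(0.0, 0.2)))

crude = sum(y for _, y in treated) / 1000 - sum(y for _, y in control) / 1000
adjusted = covariance_adjusted_diff(treated, control)
# crude is inflated by roughly 0.5 * 1.0; adjusted is close to tau = 3.0
```

If the true y, x relation were curved, the same linear adjustment would leave some residual bias, which is the situation Rubin (1970) studied.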

Caliper matching, on the other hand, and even within-class matching do not lean on an assumed linear relation between y and x. If σ²1x/σ²2x is near 1 (perhaps between 0.8 and 1.2), the evidence to date suggests, however, that linear covariance is superior to within-class matching in removing bias under a moderately curved y, x relation, although more study of this point is needed. Linear covariance applied to even loosely caliper-matched samples should remove nearly all the initial bias in this situation. Billewicz (1965) compared linear covariance and within-class matching (3 or 4 classes) in regard to precision in a model in which x was distributed as N(0, 1) in both populations. For the curved relations y = 0.4x − 0.1x², y = 0.8x − 0.14x² and y = tanh x, he found covariance superior in precision on samples of size 40.

Larger studies in which matching becomes impractical present difficult problems in analysis. Protection against bias from numerous x variables is not easy. Further, if there are say four x variables, the treatment effect may change with the levels of x2 and x3. For applications of the conclusions it may be important to find this out. The obvious recourse is to model construction and analysis based on the model, which has been greatly developed, particularly in regression. Nevertheless the Coleman report on education (1966) and the national halothane study (Bunker et al., 1969) illustrate difficulties that remain.

7. Further Points on Planning

7.1 Sample Size

Statisticians have developed formulas that provide guidance on the sample size needed in a study. The formulas tend to be harder to use in observational studies than in experiments because less may be known about the likely values of population parameters that appear in the formulas, and the formulas assume that bias is negligible. Nevertheless there is frequently something useful to be learned – for instance, that the proposed size looks adequate for estimating a single overall effect of the treatment, but not if the variation in effect with an x is of major interest.


7.2 Nonresponse

Certain administrative bodies may refuse to cooperate in a study; certain people may be unwilling or unable to answer the questions asked or may not be found at home. In modern studies, standards with regard to the nonresponse problem seem to me to be lax. In both the smoking and Coleman studies nonresponse rates of over 30% were common. The main difficulty with nonresponse is not the reduction in sample size but that nonrespondents may be to some extent different types of people from respondents and give different types of answers, so that results from respondents are biased in this sense. Fortunately, nonresponse can often be reduced materially by hard work during the study, but definite plans for this need to be made in advance.

7.3 Pilot Study

The case for starting with a small pilot study should be considered – for instance, to work out the field procedures and check the understanding and acceptability of the questions and the interviewing methods and time taken. When information is wanted on a new problem, the cheapest and quickest method is to base a study on routine records that already exist. However, such records are often incomplete and have numerous gross errors. A law or administrative rule specifying that records shall be kept does not ensure that the records are usable for research purposes. A good pilot study of the records should reveal the state of affairs. It is worth looking at variances; a suspiciously low variance has sometimes led to detection of the practice of copying previous values instead of making an independent determination.

7.4 Critique

When the draft of plans for a study is prepared, it helps to find a colleague willing to play the role of devil's advocate – to read the plan and to point out any methodological weaknesses that he sees. Since observational studies are vulnerable to such defects, the investigator should of course also be doing this, but it is easy to get in a rut and overlook some aspect. It helps even more if the colleague can suggest ways of removing or reducing these faults. In the end, however, the best plan that investigator and colleague can devise may still be subject to known weaknesses. In the report of the results these should be discussed in a clearly labeled section, with the investigator's judgment about their impact.

7.5 Sampled and Target Populations

Ideally, the statistician would recommend that a study start with a probability sample of the target population about which the investigator wishes to obtain information. But both in experiments and in observational surveys many factors – feasibility, costs, geography, supply of subjects, opportunity – influence the choice of samples. The population actually sampled may therefore differ in several respects from the target population. In his report the investigator should try to describe the sampled population and relevant target populations and give his opinion as to how any differences might affect the results, although this is admittedly difficult.


One reason why this step is useful is that an administrator in California, say, may want to see the results of a good study on some social issue for policy guidance and may find that the only relevant study was done in Philadelphia or Sweden. He will appreciate help in judging whether to expect the same results in California.

7.6 Judgment about Causality

Techniques of statistical analysis of observational studies have in general employed standard methods and will not be discussed here. When the analysis is completed, there remains the problem of reaching a judgment about causality. On this point I have little to add to a previous discussion (Cochran, 1965). It is well known that evidence of a relationship between x and y is no proof that x causes y. The scientific philosophers to whom we might turn for expert guidance on this tricky issue are a disappointment. Almost unanimously and with evident delight they throw the idea of cause and effect overboard. As the statistical study of relationships has become more sophisticated, the statistician might admit, however, that his point of view is not very different, even if he wishes to retain the terms cause and effect.

The probabilistic approach enables us to discard oversimplified deterministic notions that make the idea look ridiculous. We can conceive of a response y having numerous contributory causes, not just one. To say that x is a cause of y does not imply that x is the only cause. With 0,1 variables we may merely mean that if x is present, the probability that y happens is increased – but not necessarily by much. If x and y are continuous, a causal relation may imply that as x increases, the average value of y increases, or some other feature of its distribution changes. The relation may be affected by the levels of other variables; it may be strengthened or weakened or entirely disappear, depending on these levels. One can see why the idea becomes tortuous. For successful prediction, however, a knowledge of the nature and stability of these relationships is an essential step and this is something that we can try to learn in observational studies.

A claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced. Except in cases where the mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized. Thus in most cases the study ends with an opinion or judgment about causality, not a claim of proof.

Given a specific causal hypothesis that is under investigation, the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow. The cigarette-smoking and death-rate studies are a good example. For causes of death to which smoking is thought to be a leading contributor, we can compare death rates for nonsmokers and for smokers of different amounts, for ex-smokers who have stopped for different lengths of time but used to smoke the same amount, for ex-smokers who have stopped for the same length of time but used to smoke different amounts, and (in later studies) for smokers of filter and nonfilter cigarettes. We can do this separately for men and women and also for causes of death to which, for physiological reasons, smoking should not be a contributor. In each comparison the direction of the difference in death rates and a very rough guess at the relative size can be made from a causal hypothesis and can be put to the test.


The same can be done for any alternative hypotheses that occur to the investigator. It might be possible to include in the study response measurements or supplementary observations for which alternative hypotheses give different predictions. In this way, ingenuity and hard work can produce further relevant data to assist the final judgment. The final report should contain a discussion of the status of the evidence about these alternatives as well as about the main hypothesis under study.

In conclusion, observational studies are an interesting and challenging field which demands a good deal of humility, since we can claim only to be groping toward the truth.

References

Billewicz, W. Z. (1965). The efficiency of matched samples. Biometrics 21: 623-44.

Buck, A. A. et al. (1968). Coca chewing and health. Am. J. Epidemiol. 88: 159-77.

Bunker, J. P. et al., eds. (1969). The national halothane study. Washington, D.C.: USGPO.

Campbell, D. T. (1969). Reforms as experiments. Am. Psychologist 24: 409-29.

Campbell, D. T., and H. L. Ross. (1968). The Connecticut crackdown on speeding: Time series data in quasi-experimental analysis. Law and Society Rev. 3: 33-53.

Campbell, D. T., and J. C. Stanley. (1966). Experimental and quasi-experimental designs in research. Chicago: Rand McNally.

Cochran, W. G. (1965). The planning of observational studies. J. Roy. Statist. Soc. Ser. A, 128: 234-66.

Cochran, W. G. (1968). The effectiveness of adjustment by classification in removing bias in observational studies. Biometrics 24: 295-314.

Coleman, J. S. (1966). Equality of educational opportunity. Washington, D.C.: USGPO.

Cox, D. R. (1957). The use of a concomitant variable in selecting an experimental design. Biometrika 44: 150-58.

Kinsey, A. C., W. B. Pomeroy, and C. E. Martin. (1948). Sexual behavior in the human male. Philadelphia: Saunders.

Rubin, D. B. (1970). The use of matched sampling and regression adjustment in observational studies. Ph.D. thesis, Harvard Univ., Cambridge.

Srole, L., T. S. Langner, S. T. Michael, M. K. Opler, and T. A. C. Rennie. (1962). Mental health in the metropolis (The midtown Manhattan study). New York: McGraw-Hill.

U.S. Surgeon-General's committee (1964). Smoking and health. Washington, D.C.: USGPO.

Yates, F. (1968). Theory and practice in statistics. J. Roy. Statist. Soc. Ser. A, 131: 463-77.


Observational Studies 1 (2015) 137-140 Submitted 3/15; Published 8/15

William G. Cochran and the 1964 Surgeon General’s Report

Norman Breslow [email protected]

Department of Biostatistics

University of Washington

Seattle, WA 98195, USA

By the late 1950s the causal connection between cigarette smoking and lung cancer was well established. Several excellent retrospective (case-control) and prospective (cohort) studies had been published that led the US Surgeon General to declare "excessive smoking is one of the causative factors of lung cancer" (Burney, 1959). The next few years brought new evidence of this and other major health effects of smoking. Although "medical opinion had shifted significantly against smoking" (United States Surgeon General's Advisory Committee Report, 1964), no concerted action had yet been taken to alert the public to its dangers. The Federal Trade Commission (FTC) was clamoring for guidance on how to regulate the labeling and advertising of tobacco products.

Accordingly, in 1962, Surgeon General Luther Terry selected an advisory committee of ten members to revisit the scientific evidence and produce a technical report on the health hazards of smoking. Representatives of government, medicine and industry, including some from the Tobacco Institute Inc., submitted a list of over 150 candidates for possible appointment to the committee. Each organization reserved the right to veto, without explanation, any name on the list. People who had taken a position on the issue, which included all those who performed the studies under review, were excluded from consideration.

The committee on smoking and health ultimately comprised eight physicians, one chemist and one statistician, William Cochran. Reputed to be a "statistician you could talk to," Cochran was by then well known for prior service on several national advisory committees dealing with prominent science policy issues: the effectiveness of the battery additive ADX2; an evaluation of the Kinsey report on sexual behavior; and the planning of the Salk polio vaccine trial (Meier, 1984). His acceptability to all the organizations responsible for proposing candidates may have been helped by the fact that he was a heavy smoker (Colton, 1981). Indeed, smokers made up half the committee.

Cochran's influence on the report and its conclusions was enormous. Although none of its chapters were attributed to individual committee members, he was known in particular to have written Chapter 8, Mortality, and its appendices. This chapter reviewed seven large cohort studies of smoking and mortality in men. In his recent bestseller, Siddhartha Mukherjee (2010) stated:

The precise and meticulous Cochran devised a new mathematical insight to judge the trials [studies]. Rather than privilege any particular study, he reasoned, perhaps one could use a method to estimate the relative risk as a composite number through all the trials in the aggregate. (This method, termed meta-analysis, would deeply influence academic epidemiology in the future.)

© 2015 Norman Breslow.


Table 26 of Chapter 8 contained the key results. Its importance to the overall evaluation of the evidence was apparent from the fact that an abridged version appeared as Table 2 in Chapter 4, Summaries and Conclusions. For each of 25 specific causes of death, and for all causes, the table listed for each of the seven studies the observed numbers of deaths in smokers, the expected numbers and their ratio. Following principles of indirect standardization, the expected numbers were the sum over age categories of the age-specific death rates among non-smokers times the age-specific person-years of observation for smokers. Age adjustment was essential. Since smokers were younger than non-smokers, their crude death rates were less than those for non-smokers. Cochran's innovation was to present two summaries of the seven mortality ratios for each cause of death. The first was a summary mortality ratio, where the expected number was obtained by pooling the age-specific data over studies. The second was simply the median of the mortality ratios for the seven studies. These were remarkably consistent: 10.8 vs. 11.7, respectively, for lung cancer; 1.7 vs. 1.7 for coronary artery disease, the most common cause of death; and 1.68 vs. 1.65 for all causes.
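The arithmetic behind Table 26 is compact enough to sketch. The numbers below are invented toy data, not the 1964 figures; `expected_deaths` implements the indirect standardization just described, and the two summaries mirror the pooled ratio and the median (here the pooled expected number is taken as the sum of the study-specific expected numbers, a simplification of Cochran's pooling of the age-specific data):

```python
from statistics import median

def expected_deaths(nonsmoker_rates, smoker_pyears):
    """Indirect standardization: expected smoker deaths = sum over age
    groups of (non-smoker death rate) x (smoker person-years)."""
    return sum(r * py for r, py in zip(nonsmoker_rates, smoker_pyears))

# Per study: non-smoker death rates by age group, smoker person-years
# by age group, and observed smoker deaths (all numbers invented).
studies = [
    ([0.002, 0.010], [40000, 10000], 290),
    ([0.003, 0.012], [30000, 12000], 370),
    ([0.002, 0.011], [50000, 15000], 420),
]

ratios, tot_obs, tot_exp = [], 0.0, 0.0
for rates, pyears, obs in studies:
    exp_d = expected_deaths(rates, pyears)
    ratios.append(obs / exp_d)
    tot_obs += obs
    tot_exp += exp_d

summary_ratio = tot_obs / tot_exp  # pooled summary (fixed-effects flavor)
median_ratio = median(ratios)      # median summary (random-effects flavor)
```

With consistent studies the two summaries agree closely, which is exactly the pattern reported for lung cancer and coronary artery disease above.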

In the parlance of modern meta-analysis, the first method, the summary mortality ratio, approximates the summary measure from a fixed effects model whereas the second, the median, corresponds more to a random effects model. In a 2014 letter to the editor of the New England Journal of Medicine, Schumacher et al. (2014) used modern software to produce a graphical "forest plot" of the 1964 results that shows study-specific and summary confidence intervals under both models.

In two appendices to Chapter 8, Cochran described the statistical methods he used to estimate the bias and uncertainty of results presented in the main report. The first appendix reports a sensitivity analysis of possible bias in the mortality ratios caused by non-response. The second appendix presents two approximate methods for obtaining a confidence interval for the mortality ratio. The first, derived under the admittedly false assumption that the age-specific ratios of person-years of observation for smokers vs. non-smokers were constant over age, used the fact that the ratio of a Poisson variable to its sum with another, independent Poisson variable is binomial. The second method avoided the person-years assumption, but involved other assumptions including a normal approximation that "are shaky with small numbers of deaths" (United States Surgeon General's Advisory Committee Report, 1964). Fortunately, the two methods produced comparable results, especially for the lower confidence limit.
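The first method can be sketched as follows. If D1 and D2 are independent Poisson counts, then conditional on their total, D1 is binomial with p = ρk/(1 + ρk), where ρ is the smoker/non-smoker rate ratio and k the (assumed constant) person-year ratio; a normal-approximation interval for p maps back to an interval for ρ. The function and numbers are invented for illustration, not taken from the report:

```python
from math import sqrt

def mortality_ratio_ci(d_smoke, d_non, py_ratio, z=1.96):
    """Approximate CI for the rate ratio rho. Conditional on the total,
    d_smoke ~ Binomial(n, p) with p = rho*k / (1 + rho*k), where
    k = py_ratio (smoker/non-smoker person-years, assumed constant
    over age). Invert p -> rho at the point estimate and both limits."""
    n = d_smoke + d_non
    p = d_smoke / n
    se = sqrt(p * (1 - p) / n)
    to_rho = lambda q: q / ((1 - q) * py_ratio)
    return to_rho(p - z * se), to_rho(p), to_rho(p + z * se)

lo, est, hi = mortality_ratio_ci(d_smoke=300, d_non=100, py_ratio=2.0)
# est = 1.5; the interval is asymmetric around it, roughly (1.21, 1.91)
```

The binomial interval is symmetric on the p scale but not on the ρ scale, which is one reason the two appendix methods agree best at the lower limit.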

Cochran's mortality ratio would be most compelling as a summary measure if the age-specific death rates for smokers vs. non-smokers were in constant ratio, in which case it consistently estimates the constant. A modern approach would be to fit the model assuming constant age-specific rate ratios using Poisson regression (Greenland and Robins, 1985). The "robust" standard error for the regression estimate of the (constant) log rate ratio, allowing for model misspecification, provides an alternative to the ad-hoc methods proposed by Cochran. The 1964 report makes clear, however, that the rate ratios declined with age, dropping by nearly half from ages 40-49 to 80-89. Accounting for this systematic decline in the model could clarify the interpretation of the summary measure as pertaining to a specific age, e.g., 65 years, with predictable changes for younger or older men. On the other hand, the simplicity of the observed/expected formulation was likely more persuasive to most readers of the report than a modeling approach would have been.


Many other aspects of Chapter 8, and of the 1964 report in general, reflect Cochran's philosophy regarding observational studies. Section 7.6 of his paper "Observational Studies" (reprinted in this volume), titled Judgment about Causality, expresses a common theme in his writing: "Given a specific causal hypothesis that is under investigation, the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow." He goes on to illustrate this point with examples drawn from Chapter 8 of the report. This contained sections that dealt with mortality ratios by amount smoked, by age at which smoking started, by duration of smoking, by inhalation of smoke, by current vs. ex-smokers, by causes of death that one might expect to be related to smoking and by other causes one might not. There was a section on non-response and another on confounding ("disturbing") variables, which were measured and considered in some studies.

Needless to say, the 1964 report had enormous impact (Mukherjee, 2010). The morning after its release, on January 11, 1964, it was front-page news and the subject of widespread media coverage throughout the world. The tobacco industry initially took some refuge in the fact that public reaction was not as strong as feared, and for a time it appeared that they might escape significant regulation. The FTC proposed a strongly worded warning for cigarette packages, but this was watered down by Congress. What eventually led to the voluntary withdrawal of tobacco advertising from radio and television, in 1971, was a 1968 court decision mandating that stations broadcasting tobacco ads had to give equal time to anti-tobacco advertising under the "fairness doctrine" that applied to controversial issues. While Cochran himself remained a heavy smoker until the end of his days, his work on the committee contributed to a dramatic decline in smoking in the US, to the easing of the burden of chronic disease and to demonstrably increased longevity.

References

Burney, L.E. (1959). Smoking and lung cancer – a statement of the Public Health Service. Journal of the American Medical Association, 171, 1829-1837.

Colton, T. (1981). Cochran, Bill – his contributions to medicine and public-health and some personal recollections. American Statistician, 35, 167-170.

Greenland, S. and Robins, J.M. (1985). Estimation of a common effect parameter from sparse follow-up data. Biometrics, 41, 55-68.

Meier, P. (1984). William G. Cochran and public health. In: Rao, P.S.R.S. and Sedransk, J., eds. W.G. Cochran's Impact on Statistics. Wiley, New York: 73-81.

Mukherjee, S. (2010). The Emperor of All Maladies: A Biography of Cancer. Scribner, New York.

Schumacher, M., Rucker, G. and Schwarzer, G. (2014). Meta-analysis and the Surgeon General's report on smoking and health. New England Journal of Medicine, 370, 186-188.

United States Surgeon General's Advisory Committee Report (1964). Smoking and Health. U.S. Department of Health, Education and Welfare, Washington D.C.


A Personal Recollection

To my knowledge I met Bill Cochran only once. The occasion, during the spring of 1962, was a trip East to visit universities where I had applied to graduate school in mathematics. My father, a public health physician, was the principal investigator on two of the seven mortality studies summarized in the 1964 report and later testified before the committee. He knew and admired Cochran and arranged for me to have an interview at Harvard, undoubtedly hoping that I might become interested in statistics. I remember Cochran as a tall and, to me, somewhat formal figure who displayed little interest in my career choice. He may have known that Harvard's math faculty would reject my application. My experience at The Johns Hopkins University was different. After a disastrous interview with one of the senior math faculty, a meeting with Alan Kimball that had been similarly arranged by my father led me to withdraw my application on the spot and to re-apply to the statistics department that Kimball was attempting to resurrect. When I returned to my undergraduate college and told my professors what I had done, they said that if I was truly interested in statistics I should apply to UC Berkeley and to Stanford, from which I ultimately graduated. I always wondered how my career might have evolved had the interview with Cochran gone differently and I had applied (and been accepted) by Harvard statistics.


Observational Studies 1 (2015) 141-164 Submitted 7/15; Published 8/15

The Inheritance bequeathed to William G. Cochran that he willed forward and left for others to will forward again: The Limits of Observational Studies that seek to Mimic Randomized Experiments

Thomas D. Cook [email protected]

Northwestern University &
Mathematica Policy Research, Inc.

Introduction

Seamus Heaney had the courage to want to use the past to construct a better future from among the many potential futures always available. In "The Settle Bed" of 1991 he wrote:

And now this is "an inheritance" –
Upright, rudimentary, unshiftably planked
In the long ago, yet willable forward
Again and again and again.

To etymologists, "upright" connotes full of rectitude, being correct, while "rudimentary" connotes beginnings and first principles. Together, they signify that Heaney's concern is with inheritances that address fundamental issues and want to be right about them. That all things are linked to the past is a truism, but that many things are "unshiftably planked" there is not. Firmly rooted inheritances require generation-transcending transmission mechanisms, whether objects like books, or cognitive habits like theories, or social institutions like universities, or subsequent generations like one's own students and students' students. Such mechanisms transcend lives, including those of individual scholars. Heaney insists that unshiftably planked inheritances have to be willed forward, not just once or twice, but "again and again and again". So he rejects ossifying traditions that fail to accommodate new realities and instead recommends using human will to make continuous changes in an inheritance. Presumably this is by implementing possible changes, learning about their successes and failures, and incorporating their results into the original inheritance that is thereby modified. The consequence is an inheritance that is planked in the past, modified in the present, and repeatedly modified in the future by dint of human will.

Cochran was himself the beneficiary of an important inheritance that he improved and passed on. It does not diminish him to note that his writings on causation and the design of experiments and observational studies are inconceivable without Fisher. The two did not work together at Rothamsted, but Fisher often returned there after his move to Cambridge and they talked there as well as at Royal Society meetings (Watson, 1982). We even have Cochran's own reports of conversations with Fisher, including the insight that observational studies should make the implications of a single causal hypothesis more elaborated in the data (as reported, for example, in Rosenbaum, 2005). Yates mentored Cochran at Rothamsted and had earlier been a colleague of Fisher there (Yates, 1964). It seems inconceivable that Cochran and Yates did not speak in detail about Fisher's contributions, given their general intellectual resonance and their clear relevance to their own research agendas and the mission of their research station. Cochran brought detailed knowledge of Fisher to his first stay at Iowa State; and after he emigrated he continued this dissemination process at Iowa State again and then in North Carolina, at Johns Hopkins and at Harvard. His dissemination of Fisher took place face-to-face, in teaching and in writing, perhaps most saliently in his seminal texts with Cox and Snedecor. Fisher himself did not become an intellectual giant out of nothing, but he nonetheless created most of the planks on which Cochran first stood. Here are what I think are the five major ones.

© 2015 Thomas D. Cook.

(1) It is legitimate for statisticians to choose their intellectual problems from among the practical problems faced by those who seek to improve the physical world, be they farmers, health workers, engineers or social welfare professionals. The main alternative to such practice-based problem choice is when research agendas emanate from puzzles in existing statistical theory or from past and emerging issues in mathematics.

(2) Many of the practical problems practitioners face require identifying whether a manipulable action is "causally" related to possible consequences. The inference entailed here moves Statistics beyond its historical concerns with reducing uncertainty and improving prediction (Stigler, 1986). It legitimates research on causal bias and its control, the central point of this paper. But it also legitimates working with what is perhaps a less fundamental theory of causation. Variously labeled by philosophers of science as the activity, manipulability or recipe theory of causation (see Cook and Campbell, 1979), it seeks the valid description of concrete If/Then connections rather than explanations of why things happen in the world, including causal connections.

(3) Causal bias is best controlled through experimental design. Such design requires clear null hypotheses that are then tested through a combination of how units are allocated to the treatments, how and when the study outcome is assessed, and how comparison groups are selected. The hope is that such structural elements will provide a perfect no-treatment counterfactual against which observed performance in the treatment group can be compared. Then, no other interpretation of the relationship between the independent and dependent variable is possible other than that it is causal. Many non-experimental causal methods exist, primarily "causal modeling" methods where substantive theory about a set of interdependent temporal influences is tested, essentially by estimating how well the obtained and predicted data match.

(4) Within experimental design, random assignment is the best tool for warranting unbiased counterfactual estimates. Random assignment balances the study groups on all observed and unobserved variables, entailing a perfect counterfactual in expectation and a probabilistically equivalent counterfactual in individual studies. Random assignment is also demonstrably implementable in real-world settings where its assumptions are often clearly met. There are other cause-probing “experimental” traditions that do not require random assignment. For instance, the experimental laboratory sciences routinely test causal hypotheses, but they rule out alternative causal interpretations by virtue (1) of closed-system test settings and apparatus that physically exclude most contending hypotheses, (2) of substantive theories whose predictions are so numerically or functionally specific that no contending theory can explain them, and (3) of implementing procedures that have evolved, and are still evolving, to reduce the confounds from the laboratory researchers’ own hopes, expectancies and interests. “Quasi-experiments” are also experimental, but by definition they are without random assignment. Instead, they seek to mimic the logic and structure of random assignment experiments, usually without the benefit of closed-system test settings. Causal ambiguities often remain with quasi-experiments, therefore, given how rare it is for the non-random process of selection into treatment to be perfectly known and measured.
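The balancing claim in proposition (4) can be checked numerically. Below is a minimal simulation (my own illustration, not from the paper) in which an entirely unobserved covariate is balanced, in expectation, by random assignment alone:

```python
import random

random.seed(1)

def mean_gap(n=1000, reps=200):
    """Average treated-minus-control difference in an *unobserved*
    covariate, across many hypothetical randomized experiments."""
    gaps = []
    for _ in range(reps):
        u = [random.gauss(0, 1) for _ in range(n)]  # covariate no one measured
        idx = list(range(n))
        random.shuffle(idx)                          # random assignment
        treat, ctrl = idx[: n // 2], idx[n // 2:]
        gaps.append(sum(u[i] for i in treat) / len(treat)
                    - sum(u[i] for i in ctrl) / len(ctrl))
    return sum(gaps) / reps

gap = mean_gap()
print(abs(gap) < 0.05)  # the expected imbalance is essentially zero
```

In any single replication the two groups differ by chance, which is the “probabilistically equivalent counterfactual in individual studies” the text describes; only in expectation is the counterfactual perfect.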

(5) When random assignment is not possible, the preference is for observational study designs that mimic random assignment as much as possible. From Fisher’s Latin Square designs on, these mimetic designs test a null hypothesis about a deliberately manipulated treatment versus a comparison group; the comparison group is deliberately selected to minimize pre-intervention group differences; the occasions of measurement are those found in many outdoor and long-lasting random assignment studies that include pretest assessments as well as posttest ones; and these pretest covariates are then “somehow” used to control for any bias remaining after the study groups have been chosen. In more modern language, the goal of the mimetic tradition is to provide the best approximations for those potential outcomes that are missing in the quasi-experiment but are available in the random assignment experiment.

Legitimate debate is possible about these five propositions, and we could add others. But they help describe and demarcate the inheritance Cochran received. Other causal inheritances were available to him at the time. One was the Galton/Pearson tradition with its emphasis on prediction, substantive modeling, multivariate data analysis and epistemological verification – an evident counterpoint to Fisher’s emphasis on causation, experimentation, random assignment and what later came to be called falsification (Meehl, 1978). Cochran could also have worked on identifying the processes through which laboratory research promotes clear causal conclusions, given the strong interest at the time in how experimental physics, chemistry and biology advanced theory in their respective disciplines. But he did neither of these things, instead representing the third position just described. Each of the three has evolved as an adaptation to different parts of the world of research. For most of the 20th century, the experimental model has been preferred for open-system applications in agriculture, medicine, psychology and some engineering. More recently, it has also been imported into micro-economics, political science, education and sociology. However, it is rarely relevant in chemistry, physics or micro-biology, where closed-system experiments are used to describe and explain cause-effect relationships. Nor is the experimental model of much relevance in meteorology, macro-economics, macro-sociology, historical geology and much of population genetics. Cause is still a goal in these fields, but control through deliberate manipulation is difficult, causal modeling is the norm, and local understandings of “cause” stress explanation more than the description of If/Then relationships. Knowing about the alternatives Cochran did not choose helps us distinguish the unique boundaries of the experimental inheritance he was bequeathed, willed forward and passed on.

How he made progress is evident in several ways. One is his role in disseminating the core inheritance through his physical presence, expository writings and transmission of insights not yet committed to print (Cochran, 1965). A second is through his personal research. I am not a statistician or historian of science, but note that he sharpened thinking about how observational studies should seek to mimic the logic of randomized experiments, especially through his work on matching (Cochran, 1983). He had begun exploring the bias-reduction role of matching much earlier (Cochran, 1950) and was particularly creative in dealing with sub-classification (Cochran, 1968) and even determining how different numbers of strata affect the amount of bias reduced (Cochran, 1965). In addition, he discovered that the mode of analyzing observational study data makes little practical difference, and that bias control is made more difficult by the size of the initial overlap between groups (Cochran, 1968). He also forwarded his inheritance through his influence on others. He trained 39 doctoral students and they (and their students) have created an even more systematic theory of causal identification and estimation that assumes the five main planks described earlier and is best embodied by the Rubin Causal Model. Expositions of it explicitly acknowledge Cochran’s influences (e.g., Rosenbaum and Rubin, 1983; Imbens and Rubin, 2015). He also influenced many other scholars of causation, including fellow statisticians like Mosteller, Moses and Tukey as well as admirers from more distant fields, like Donald T. Campbell and his students whose work often cites Cochran (see the bibliography in Shadish et al., 2002).
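Cochran’s sub-classification idea is easy to demonstrate. The sketch below (hypothetical data and effect sizes, chosen by me purely for illustration) stratifies a confounded comparison into five strata on the confounder, mirroring his finding that a handful of strata removes most of the bias:

```python
import math
import random

random.seed(2)

# Hypothetical observational data: confounder x raises both the chance of
# treatment and the outcome; the true treatment effect is zero.
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
t = [random.random() < 1 / (1 + math.exp(-xi)) for xi in x]
y = [2 * xi + random.gauss(0, 1) for xi in x]

def mean(v):
    return sum(v) / len(v)

# Naive comparison: entirely confounding bias, since the true effect is zero.
raw = mean([yi for yi, ti in zip(y, t) if ti]) - \
      mean([yi for yi, ti in zip(y, t) if not ti])

# Sub-classification on quintiles of x, in the spirit of Cochran (1968).
cuts = sorted(x)[n // 5::n // 5][:4]
def stratum(xi):
    return sum(xi > c for c in cuts)

diffs, weights = [], []
for s in range(5):
    yt = [yi for yi, ti, xi in zip(y, t, x) if ti and stratum(xi) == s]
    yc = [yi for yi, ti, xi in zip(y, t, x) if not ti and stratum(xi) == s]
    diffs.append(mean(yt) - mean(yc))
    weights.append(len(yt) + len(yc))
adjusted = sum(d * w for d, w in zip(diffs, weights)) / sum(weights)

print(abs(adjusted) < abs(raw) / 2)  # five strata remove well over half the bias
```

The weighted within-stratum comparison shrinks the bias dramatically even though the confounder is only coarsely controlled, which is the practical appeal of the method.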

A full historical analysis of how Cochran directly and indirectly willed forward his inheritance is beyond the scope of this essay. But the point is clear. Cochran was embedded in an upright and rudimentary inheritance; he willed it forward first in one paper and then again in another and then again in another; and he bequeathed this marginally improved inheritance to others, including his own students, so that they might will it even further forward, again and again and again.

The Present Purpose

As a practitioner of mostly quasi-experimental research in complex field settings, I am denied the protections of random assignment and closed-system laboratories. So I am used to feeling vulnerable and suspect and even envious of others’ certainties about method choice. I am not formally connected to the intellectual history of causal design in Statistics. Nonetheless, I feel very comfortable with the first four propositions characterizing Cochran’s inheritance, but am ambivalent about my connections to the mimetic conception of observational study design. This is because I am aware of contrary examples I will present and of three issues central to quasi-experimental practice that the Cochran inheritance rarely discusses. I want to present the issues here and ask: (1) whether they help demarcate the inheritance’s boundaries and (2) whether one or more of them might be worth “willing forward” to incorporate into the inheritance, even if only at one of its margins. Of course, no appointed court of statisticians exists to deliberate about exclusion from the causal inheritance or inclusion into it. Scientific agenda-setting and -modification is a much more ad hoc process that almost certainly has chance components. My hope is merely to start a conversation about what should and should not be part of Cochran’s inheritance, not as he found it, but as he and his students, friends and followers have elaborated it.

The first issue from my own work as a quasi-experimental practitioner speaks to the reality that I sometimes cannot construct a single focused null hypothesis test of a causal hypothesis, even though all random assignment studies and all mimetic quasi-experiments aspire to such a test. Archetypically, this test evaluates the difference between two posttest means from two initially identical groups. Fisher himself advised Cochran that observational studies should not take this approach and should instead elaborate the same causal hypothesis until it has multiple implications within the data that are subsequently tested. “Somehow” multiple sub-hypothesis tests have to be constructed – even from multiple data sources – and the case has to be made that they are collectively sufficient to test the hypothesis. Fisher’s advice probably surprised Cochran, for it seems to be at odds with Fisher’s own writings on null hypothesis testing. Nonetheless, Cochran (1965) provided some brief examples of causal questions that cannot be answered with a single focused test.

Other scholars have noted the same, and have worked on concepts that overlap with Fisher’s off-hand remark. Campbell’s (1966) “pattern-matching” requires postulating and testing a complex pattern of differences that might rule out all other alternative interpretations. The “critical multiplism” of Cook (1985) depends on multiple tests that are critically chosen because they collectively rule out all the currently identifiable causal alternatives to the hypothesis under test. Rosenbaum’s (2005; 2009; 2011) “coherence” notion depends on the consistency of results from multiple tests across different datasets that provide a coherent link to a single causal hypothesis. Finally, the idea of Generalized Differences in Differences (GDD) involves testing statistical interaction hypotheses that are of higher order (and thus more “pattern-laden”) than the two-way interactions in standard differences in differences approaches (e.g., Imbens and Wooldridge, 2007; Chetty et al., 2009). In what follows, the national educational reform program No Child Left Behind (NCLB) illustrates a case where no single focused hypothesis test is possible but where elaborated and testable sub-hypotheses are. All the results are consistent with the hypothesis that NCLB raised academic achievement, but they flirt with verifying a predicted pattern of data and fail to falsify all possible alternative interpretations even if they do falsify all “plausible” ones. So I want to ask: Does the inheritance under discussion want to wash its hands of such inelegant and marginally less successful omnibus tests? If it does not, how might the Fisher strategy be incorporated into an inheritance where its current role is minimal?
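For readers who want the arithmetic, the standard two-way difference-in-differences contrast that GDD generalizes is just a double subtraction. The group-by-period means below are invented for illustration, not taken from any of the studies cited:

```python
# Hypothetical group-by-period means (not the NCLB data).
means = {
    ("treated", "pre"): 210.0, ("treated", "post"): 219.0,
    ("comparison", "pre"): 225.0, ("comparison", "post"): 228.0,
}

def diff_in_diff(m):
    """Gain in the treated group minus gain in the comparison group."""
    return (m[("treated", "post")] - m[("treated", "pre")]) \
         - (m[("comparison", "post")] - m[("comparison", "pre")])

print(diff_in_diff(means))  # 6.0
```

GDD adds a further layer of subtraction – for example, contrasting this quantity across groups expected to be more and less affected – so the prediction becomes a higher-order interaction that fewer rival explanations can mimic.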

The second issue I want to address is that causal statements require more than identifying whether two variables are “causally” related and then estimating the size and statistical significance of the obtained relationship. Also needed is a name for both the cause and effect in general language, given the impossibility of providing a comprehensive description of the cause (or effect) each time it is mentioned. Cochran (1965) briefly mentions this issue, but in the examples he cites he leaves construct validity to psychologists and sociologists, not statisticians. I do not dispute that his inheritance has made some breakthroughs in labeling causal manipulations – e.g., in discussions of comparison group types (e.g., no-treatment versus placebo control groups) and of factorial designs that decompose a complex intervention into parts that are then separately examined. Even so, randomized experiments were designed to optimize construct validity much less than internal validity, and I suspect many people would question the utility of valid cause-effect relationships in which either the cause or effect were wrongly named. So I present the example of an otherwise successful bail bond reform program that was discontinued because of how the manipulation was (mis-)labeled. Its benefits might have continued, however, if the treatment had been correctly labeled. What intellectual responsibility, if any, does the Cochran inheritance want to take for the construct validity of independent and dependent variables? Is this issue demarcated out, or worth eventually incorporating?

The final issue I bring up concerns the generalization of causal relationships, using the regression discontinuity design (RDD) as an example. RDD uses a deterministic treatment assignment procedure to identify causal effects. It is deterministic because assignment to treatment or control status depends only on whether a unit scores above or below a single observed score on a continuous measure that is often of need, merit or age. Since this assignment procedure creates non-overlapping groups on each side of the cutoff, causal identification requires extrapolating the functional form from the untreated regression segment into the treated segment, where it estimates the missing but crucial potential outcome – what would have happened to treated units had they not been treated. Unfortunately, no independent support exists for this crucial extrapolation, and so causal inference is usually limited to the cutoff point where only a small fraction of those receiving treatment are to be found. The ensuing loss in causal generalization contrasts badly with the randomized experiment whose treatment and comparison groups totally overlap and so warrant the estimation of an average rather than a local treatment effect. The combination of RDD’s unsupported extrapolation, limited causal generalization, and dependence on modeling may explain why few statisticians have paid much attention to the design (Cook, 2008). However, simple design elements can be added to the basic RDD structure and will provide some support for the extrapolation RDD always needs. These supplementary data elements might be untreated regressions from pretest observations (Wing and Cook, 2013), from other covariates (Angrist and Rokkanen, 2015), or from non-equivalent comparison groups (Tang et al., 2015). When certain assumptions are met, we later show that RDDs with an untreated comparison function (CRD) lead to causal results that are demonstrably valid in the whole treated area beyond the cutoff. Do some versions of RDD, like CRD, deserve willing forward? More generally, should the inheritance pay more attention to causal generalization and raise its profile relative to causal identification and estimation?
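A minimal sharp-RDD sketch (simulated data, my own illustration) shows both the identification strategy and its locality: a regression is fit on each side of the cutoff, and the effect estimate is the gap between the two fitted values at the cutoff itself:

```python
import random

random.seed(3)

# Simulated sharp RDD: units scoring >= 0 are treated; true effect at cutoff = 5.
n = 4000
score = [random.uniform(-1, 1) for _ in range(n)]
y = [2 * s + (5 if s >= 0 else 0) + random.gauss(0, 1) for s in score]

def intercept_at_cutoff(pts):
    """Least-squares fit of y = a + b*s; returns a, the fitted value at s = 0."""
    m = len(pts)
    sx = sum(s for s, _ in pts)
    sy = sum(v for _, v in pts)
    sxx = sum(s * s for s, _ in pts)
    sxy = sum(s * v for s, v in pts)
    b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    return (sy - b * sx) / m

h = 0.5  # bandwidth: only observations near the cutoff enter each fit
above = [(s, v) for s, v in zip(score, y) if 0 <= s < h]
below = [(s, v) for s, v in zip(score, y) if -h < s < 0]
effect = intercept_at_cutoff(above) - intercept_at_cutoff(below)
print(abs(effect - 5) < 0.5)  # recovers the effect, but only at the cutoff
```

Nothing in the fitted untreated segment validates its extension deep into the treated region, which is exactly the extrapolation problem the supplementary design elements are meant to discipline.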

Issue 1: When no single focused test of a causal hypothesis is possible

Random assignment experiments create a single focused test of a hypothesis about a single cause and a single effect, usually a test of mean differences. Something very similar happens in most quasi-experiments. The Rubin Causal Model seeks to create treatment and comparison groups that are equivalent conditional on covariates, thus allowing the group mean differences to be examined at posttest. I prefer such focused tests and can usually create them in observational study work. But I cannot always do so, and it is this occasional failure that motivates the issue addressed here.

Cochran was apparently fond of saying something like “Unless you can give me an example that illustrates your statistical problem, I won’t find it important enough to bother with”. In this spirit let me offer the example of NCLB. It sought to improve academic achievement in all public schools nationwide. The program began in 2002, and the law specified that by 2014 children in all schools in all states had to attain passing scores on state achievement tests. States were free to set their own time schedule for reaching the national goal, but they had to implement a system in which sanctions escalated as the number of consecutive years increased over which a school had failed to reach pre-specified annual performance levels.

Random assignment was not possible because NCLB was the product of a national law that was rolled out immediately. Nonetheless, Dee and Jacob (2011) reasoned that the key component of the national program was “consequential accountability”, the system of escalating reforms a school had to undertake as a function of the number of consecutively failed years. Each state had to have such a system by 2002. But some states already had one, allowing states to be partitioned into those with and without accountability in 2002. Using the national Main NAEP math test for 4th graders over eight time points, some before and some after NCLB, Dee and Jacob constructed Figure 1.

Figure 1: Trends in grade 4 math in Main NAEP, by timing of accountability policy

It shows that for states that already had consequential accountability before 2002, slopes for achievement are steeper; but after 2002 the slope difference changes to become parallel, suggesting a post-2002 improvement in those states getting accountability in 2002. This is consistent with a causal impact, but only if a number of problems are addressed. First, the object being evaluated is not NCLB; it is only one mechanism within it, thus compromising the construct validity of the cause. Second, it is not clear whether other state-level, math-correlated forces might have differentially affected achievement before and after 2002 – an issue of internal validity. And finally, there are few data points during the baseline period, leading to questions about how well baseline functional form differences have been modeled – another internal validity issue.
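The logic of Dee and Jacob’s contrast reduces to a slope comparison before and after 2002. The yearly means below are invented to reproduce the qualitative pattern just described, not the actual NAEP values:

```python
# Hypothetical yearly state-group means (not the NAEP data): states that
# adopted consequential accountability early vs. only in 2002.
years = [1996, 1998, 2000, 2003, 2005, 2007]
early = [226, 229, 232, 236, 239, 242]   # already accountable before 2002
late = [222, 223, 224, 229, 232, 235]    # accountable only from 2002

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (v - my) for x, v in zip(xs, ys)) \
         / sum((x - mx) ** 2 for x in xs)

def pre_post_slopes(vals):
    pre = [(y, v) for y, v in zip(years, vals) if y < 2002]
    post = [(y, v) for y, v in zip(years, vals) if y > 2002]
    return slope(*zip(*pre)), slope(*zip(*post))

gap_pre = pre_post_slopes(early)[0] - pre_post_slopes(late)[0]
gap_post = pre_post_slopes(early)[1] - pre_post_slopes(late)[1]
print(gap_pre > gap_post)  # slopes converge after 2002, as in Figure 1
```

A steeper early-adopter slope before 2002 that becomes parallel afterward is precisely the pattern read as a post-2002 gain for the late-adopting states.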

An alternative design (Wong et al., 2015) pits national public schools that are subject to NCLB against the nation’s private schools, to which NCLB hardly applied. This is not a sharply focused test, though. Prior to NCLB, public and private schools were quite different in achievement levels and maybe even slopes, and they were also subject to different historical forces that might have changed around 2002. Moreover, though there are now more baseline time points, they are still few. Nonetheless, Figures 2 and 3 plot 4th grade differences in math on Main NAEP when public schools are contrasted with Catholic schools and then with non-Catholic private schools.

Figure 2: 4th grade math for Main NAEP: Public and Catholic schools

Figure 3: 4th grade math for Main NAEP: Public and non-Catholic private schools

Figure 4: 8th grade math scores for Main NAEP: Public and Catholic schools

Figure 5: 8th grade math for Main NAEP: Public and non-Catholic private schools

For both grades, a large mean selection difference is evident at baseline, with performance higher in each type of private school. The baseline time trends are less clear, however. Simple visual tests comparing differences in differences suggest that the public schools came to do better after 2002, narrowing the achievement gaps visible before NCLB. Statistical tests used baseline means and linear trends to examine immediate posttest mean differences, posttest linear slope differences, and final mean differences. While every estimate has a sign indicating positive NCLB effects, most are not statistically significant – perhaps due to the small number of degrees of freedom in a national level analysis.

For 8th grade math and 4th grade reading scores, the corresponding data are in Figures 4 through 7. All causal signs are again positive, but few are statistically significant. And effects seem even smaller for reading than math.

Figure 6: 4th grade reading scores for Main NAEP: Public and Catholic schools

Figure 7: 4th grade reading for Main NAEP: Public and non-Catholic private schools

The national Main NAEP results just presented are based on items that vary over time in order to reflect national changes in teaching content. In contrast, Trend NAEP holds test items constant. Figures 8 through 10 plot the corresponding Trend NAEP differences for 4th grade math, 8th grade math and 4th grade reading. (Trend NAEP data for non-Catholic private schools are not available, and only one interpretable posttest time point exists due to a change in sampling design after 2004.) All the results point to greater mean change after 2002 in public schools and to a reduced achievement gap. Now, the two math differences in differences are statistically significant.

State level tests have advantages over national ones, particularly as regards sample size. NCLB left states free to set their own standards about how quickly to reach the ultimate 2014 national goal, and some states chose to make faster initial changes. So states varied in whether they initially had higher, moderate or lower standards for initial improvement. (This contrast was not correlated with which states adopted consequential accountability before or after 2002.) Figures 11 through 13 give the results for 4th and 8th grade math and 4th grade reading.

Immediately after 2002, all estimates of the difference of differences are positive in sign, with states taking NCLB more seriously doing better, and those in the moderate category doing next best. Adding these state-level results to the national ones in Wong et al. indicates coefficients favoring NCLB in every one of the 39 difference of difference tests across 4th or 8th grade, math or English, Main NAEP or Trend NAEP, Catholic or non-Catholic private schools, and state standard differences assessed in categorical or continuous form. Of these 39, 10 are independent tests of immediate mean and slope differences for math (p < .001); and eight are independent tests of differences of differences in slope and final endpoint (p < .01). The statistical significance levels are generally weaker for Main than Trend NAEP, but they are strong for Main NAEP when the strongest tests are conducted that combine both the initial mean and subsequent slope effects in order to examine mean differences at the last data collection point. Of the nine such estimates, all have the same sign, and five are statistically significant at the .10 level or lower. The case is strong, then, that the obtained data match the pattern predicted from the single hypothesis that NCLB raised achievement.
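One way to read the quoted p-values – an assumption on my part about how they were computed, since Wong et al. may have used a different procedure – is as a simple sign test: under a null of no NCLB effect, each independent estimate is equally likely to be positive or negative, so a run of same-signed results is itself improbable:

```python
# Probability that all k independent estimates share the positive sign
# when each is positive with probability one half under the null.
p_ten = 0.5 ** 10    # ten independent math tests, all positive
p_eight = 0.5 ** 8   # eight independent slope/endpoint tests, all positive

print(p_ten < 0.001, p_eight < 0.01)  # True True
```

This is the weakest kind of evidence aggregation, using only signs and ignoring effect sizes, yet it already yields the stated significance thresholds.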

Figure 8: 4th grade math scores for Trend NAEP: Public and Catholic schools

Figure 9: 8th grade math for Trend NAEP: Public and Catholic schools

Figure 10: 4th grade reading for Trend NAEP: Public and Catholic schools

Yet none of the single tests warrants a strong causal conclusion; only together do they do what Fisher recommended by rendering the single causal hypothesis more complex and testing how well the predicted and obtained data patterns correspond. However, such verification does not necessarily fulfill what a well executed random assignment experiment achieves – ruling out all alternative interpretations. Fisher’s advice will almost always reduce the range of alternative interpretations, but it will not necessarily eliminate all of them. Indeed, Wong et al. (2015) discuss three possible alternatives. Given the coherent pattern of results, each is of low plausibility – but nonetheless each is still possible if a highly unlikely concatenation of alternative forces operates.

Figure 11: 4th grade math for Main NAEP: High vs. medium vs. low proficiency standard states

Figure 12: 8th grade math for Main NAEP: High vs. medium vs. low proficiency standard states

Figure 13: 4th grade reading scores on Main NAEP: High vs. medium vs. low proficiency standard states

One alternative is that students suddenly began to leave Catholic schools for public schools after 2002 when scandals about the sexual behavior of priests began to emerge. This might suddenly affect mean performance in these schools and also in the other schools to which the departing students moved. Moves from Catholic schools were indeed greater just after 2002 – by about one-fifth of one percent relative to prior years. But these rates are too small to account for the public school gains, given that public schools cater to about 90% of all students nationally and Catholic schools to about 5%. Moreover, Wong et al. show that Catholic school movers had no detectable impact on the post-2002 composition of schools on three demographic measures that are usually highly correlated with achievement. In addition, for inter-school student composition changes to account for the obtained data pattern, it would further be required that (a) the students who left Catholic schools in states with higher 2002 standards achieved more than students who exited Catholic schools in states with lower standards, and (b) students in states with high consequential accountability were more likely to leave Catholic schools than their counterparts in states with low consequential accountability.

Another interpretation is that school officials manipulated NAEP achievement test scores after 2002, and that they were more motivated to do so in public than private schools, in states with higher standards, and in states with consequential accountability beginning in 2002. Two popular ways of manipulating test scores are to reduce how many students with disabilities (SWD) or English language learners (ELL) are tested. Table 9 of Wong et al. (2015) shows that the percentages excused from NAEP testing did shift more in public than private schools after 2002, but that no such differential exclusion pattern is evident in the contrast of states varying in standards. Also, school officials have less motivation to manipulate the testing process for NAEP than for state achievement tests, since no consequential decisions depend on a school’s NAEP results. The only manipulation-related explanation we can envisage is that high stakes testing at the state level changed the culture of all achievement test taking in and after 2002 – including for NAEP – and that this changing culture was stronger in public than Catholic schools even though both kinds of schools take part in Main NAEP testing. No data exist on this quite circumscribed version of an alternative interpretation based on manipulating test scores.

A third alternative invokes the National Council of Teachers of Mathematics (NCTM). It updated its math standards in 2000 and later claimed that this was responsible for the subsequent national improvements in NAEP math scores (National Council of Teachers of Mathematics, 2008). While this alternative might explain why the math effect was larger than the reading one in 4th grade, it also requires that NCTM standards were adopted more often, or were implemented better, in high standard versus low standard states, in states adopting consequential accountability only after 2002, and in public schools more than the two private school types. Also needed are the assumptions that: (a) math standards operate with a three-year causal delay, for the standards changed in 2000 but the first evidence of a possible effect appears in 2003; and (b) issuing new standards is by itself sufficient to change performance at the national level!

The need for discussions about ruling out alternative interpretations leaves a bad taste in the mouth when compared to the simple use of a single focused hypothesis test. But consider the alternatives to this bad taste. One would be to do no evaluation. Would that leave an even worse taste? Another would be to do a randomized experiment under atypical circumstances, perhaps by issuing waivers from NCLB to exempt some school districts. However, this set-aside waiver experiment would be compromised by the kinds of districts or schools eligible for waivers and applying for them. More serious would be the nature of the schools available to parents wanting to leave a school. In the waiver context we envisage, there would be some public schools without NCLB in each school district. Yet this would not be possible in a national program and would fundamentally alter the school choices available. When study designs are evaluated by both internal and external validity criteria, a randomized experiment of a systematic and likely small sample of schools operating in a social context to which one does not want to generalize might not be superior to an unaesthetic bundle of multiple interrupted time-series tests conducted at both the national and state levels. Construing study quality within a multi-attribute decision framework, rather than today’s single-attribute framework that prioritizes only internal validity, suggests that a sharply focused experimental test from a waiver experiment might not be superior to the messy multiple tests described above.

The epistemological bottom line with elaborating the causal hypothesis is that empirical coherence is only part of the bias control story; the rest depends on how well alternative interpretations are ruled out by the uniqueness of the predictions. In his own work, Fisher wrote about single sharp hypothesis tests, and he did not exemplify his own advice about elaborating a causal hypothesis, though I know little of his later genetics research. Nor, to my knowledge, did Cochran do anything more with Fisher’s advice than pass it on, all the time advocating for observational studies based on matching procedures designed to mimic single sharp tests. Most of Cochran’s students (and their students) have not heeded Fisher’s advice either, with the salient exception of Rosenbaum. I grant the superiority of single focused tests. My question is whether the critical multiplist perspective described here should be demarcated out of the inheritance, even though it is consonant with Fisher’s dictum. Or is it worth willing forward when single sharp hypothesis tests are not possible?

Issue 2: Simultaneously Identifying Causal Relationships and Causal Constructs

The central task of the inheritance under discussion is to identify causal relationships through a falsification process that distributes conceptually irrelevant causal forces equally across conditions. In random assignment studies, this is done unconditionally, though in quasi-experiments it is done conditionally on the covariates used. Other meanings of cause are broader. Let us leave aside the distinction between causal explanation and causal description, recognizing that experimentation always deals with causal description but only deals with causal explanation if (1) the independent and dependent variables are deliberately chosen to index explanatory concepts from a clear substantive theory, or (2) all plausible temporally mediating variables in a study are measured and the analysis identifies which of them offer a more plausible causal explanation than others. Let us instead focus on the fact that the cause and effect in each study need to be labeled in abstract language if scientific communication is to ensue. The alternative, providing a complete description of the manipulandum, is impossible because the volume of words needed is impractical and, anyway, full knowledge of causal agency is often elusive. The next example explores the conundrum that a causal relationship can be valid but useless, or even dangerous, when the causal agent is invalidly labeled.

Cook County, Illinois changed its system of bail hearings in 1999, hoping to reduce costs by substituting video hearings for face-to-face hearings with a judge. The reform applied to all charged offenses except the most serious ones – murder and serious sexual offenses, for which face-to-face hearings persisted. All the labels attached to the reform featured its video component, reflecting the universal consensus that this was the core intervention component. Some legal advocates came to dislike the reform, and one reason for this was the suspicion that it may have unintentionally increased bail amounts. By so doing, more individuals could not now afford bail and would have to remain in jail even though presumed innocent; their families and jobs would be endangered; and the county would have to pay increased jail and welfare costs.

A study was conducted using 17 years of data on daily bail amounts, but aggregated by quarter (Diamond et al., 2010). About half of the days came from before the video intervention and half after it. The charged offenses were partitioned by seriousness, with the less serious ones receiving the intervention and the most serious ones serving as no-treatment controls. The resulting comparative interrupted time series showed that, after the intervention, bail amounts changed by more for the less serious than the more serious charges. These results were first privately released to the county senior judges with information that they would soon be released to the press. Just before they were released, the video innovation of 8 years was rescinded. Face-to-face hearings returned for all charges.

There is little doubt that bail amounts increased after the video reform. Figure 14 gives the lowess-smoothed daily aggregate data over 17 years from Cook et al. (2014). The date of the reform is set at 0 and the range of days runs from −3,000 to +3,000. The greater mean shift after the intervention is clear.
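The smoothing idea behind Figure 14 can be illustrated with a small simulation. Everything below is invented for illustration (the bail levels, noise, and shift size are not the Cook County data), and the smoother is a minimal tricube-weighted local linear fit in the LOWESS style rather than any particular package's implementation:

```python
import numpy as np

# Illustrative simulation only: days run from -3000 to +3000 with the reform
# at day 0 and a hypothetical upward level shift in mean bail afterwards.
rng = np.random.default_rng(0)
days = np.arange(-3000, 3001, 10).astype(float)
bail = 10_000 + 2_000 * (days >= 0) + rng.normal(0, 1_500, days.size)

def lowess_smooth(x, y, frac=0.3):
    """Minimal LOWESS-style smoother: tricube-weighted local linear fits."""
    n = len(x)
    k = max(int(frac * n), 2)
    yhat = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]                      # k nearest neighbours of x[i]
        w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube weights
        coef = np.polyfit(x[idx], y[idx], 1, w=np.sqrt(w))
        yhat[i] = np.polyval(coef, x[i])
    return yhat

smooth = lowess_smooth(days, bail)
pre_level = smooth[days < 0].mean()
post_level = smooth[days >= 0].mean()
```

Because the smoother borrows strength from neighbouring days, the step at day 0 is softened rather than sharp, which is why the figures show a gradual transition around the reform date even when the underlying shift is abrupt.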

153



Figure 14: Serious and less serious offenses over time: LOWESS-smoothed daily aggregated bail amounts from before and after the bail bond reform intervention.

Figure 15: LOWESS-smoothed number of judges hearing serious and less serious cases by day before and after the bail bond reform intervention.

Figure 15 shows the number of judges who adjudicated bail bond hearings each day. For the less serious cases the number of officiating judges dropped precipitously from about eight per day to three, indicating that the reform was a multi-attribute treatment package. One part was the video, and another was the reduction in the number of judges. Other evidence not produced here shows that the volume of bail bond hearings did not change over the intervention period, suggesting that the post-video judges became more specialized.




Figure 16: Percentage of all felony bail hearings held before and after the bail bond reform by the groups of 3, 13 and 83 judges who held bail hearings both before and after the reform.

Of the 530 judges adjudicating in bail bond hearings over 17 years, 99 heard cases both before and after the video-based reform. We call these the before-after judges. All the other judges heard cases only before or only after the change. Of the 99 before-after judges, three were responsible for almost 100% of all the cases for the first year after the change and for 50% of them for the next two years. In contrast, they heard 40% in the year immediately before the reform and very little before that. So they were heavily overrepresented immediately after the reform when compared to before it. This is revealed in Figure 16, which plots the number of pre- and post-intervention hearings adjudicated by the 3 most specialized, the 13 next most specialized, and the 83 who heard very few cases even after the reform.

Key issues are whether these before-after judges were disposed to set higher bail amounts than their pre-intervention colleagues even before the intervention occurred, and whether their bail amounts increased at all. Figure 17 gives the relevant data for the bail amounts set by the three most prolific judges, and we see no change in bail amounts from before to after the reform!

Data are more sparse in the immediate post-video period for the other before-after judges, but Figure 18 for the 83 before-after judges shows there is again no evidence that they set higher bail amounts after the intervention than before. The apparent mean increase for less serious offenses is matched by an apparent increase for the more serious offenses that were always adjudicated face-to-face.

If within-judge analyses show that bail amounts did not change from before to after the reform, and if analyses aggregated across all judges (including the before-only and after-only judges) show that bail amounts changed from before to after the video intervention, what explains the seeming discrepancy? The most likely explanation is that the before-only judges set lower bail amounts than their before-after colleagues, and Figure 4




Figure 17: Serious and less serious offenses over time: LOWESS-smoothed individual case record data for the three most active before-after judges.

Figure 18: Serious and less serious offenses over time for the 83 before-after judges: Overlap between the LOWESS-smoothed individual case record data and a quadratic regression.

in Cook et al. (2014) confirms this. So the higher post-reform bail amounts found with all of the judges are probably due to the before-after change in judge composition rather than to the introduction of the new video system. Even before the video reform, the judges hearing cases both before and after the reform were prone to set higher bail amounts than their colleagues who only adjudicated before it.
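This composition effect can be reproduced in miniature. The sketch below uses entirely made-up numbers (the bail levels and caseload shares are hypothetical, not the real case records): each judge group's bail-setting behaviour is held constant over time, yet the aggregate mean rises because the mix of judges hearing cases shifts at the reform, mirroring the specialization visible in Figure 16:

```python
import numpy as np

# Toy model: stable within-group behaviour, shifting caseload composition.
rng = np.random.default_rng(1)
LEVEL = {"before_only": 8_000.0, "before_after": 12_000.0}  # hypothetical propensities

def simulate(share_before_after, n=5_000):
    """Draw n hearings; share_before_after is the caseload share of the
    high-bail 'before-after' judges in that period."""
    is_ba = rng.random(n) < share_before_after
    mu = np.where(is_ba, LEVEL["before_after"], LEVEL["before_only"])
    amounts = rng.normal(mu, 1_000)
    return is_ba, amounts

# Before the reform most hearings go to before-only judges; afterwards the
# before-after judges dominate the caseload.
ba_pre, amt_pre = simulate(0.2)
ba_post, amt_post = simulate(0.9)

aggregate_shift = amt_post.mean() - amt_pre.mean()  # looks like a large rise
within_shift = {
    "before_only": amt_post[~ba_post].mean() - amt_pre[~ba_pre].mean(),
    "before_after": amt_post[ba_post].mean() - amt_pre[ba_pre].mean(),
}
```

The aggregate difference is large even though neither group changed its behaviour, which is exactly the pattern the within-judge analyses uncovered.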

What would have happened if the true causal agent had been correctly identified? Could the Cook County Court system have changed the judges who held bail hearings without sacrificing the video component and the cost savings it generated through mechanisms outlined in Cook et al. (2014)? Could the post-intervention judges have been randomly chosen? Could a different set of judges with different proclivities have been selected? Could the 3 or 99 judges have been instructed to reconsider how they set bail amounts? These are all pertinent questions, but they were never asked at the time because the universal social consensus was that the reform entailed changing only the mode in which hearings were held – from face-to-face to video. The resulting mis-identified causal agent points to a dilemma: What is the value of learning that a causal relationship is true if the cause involved in it




is wrongly named and a name has to be provided? Does the Cochran inheritance take no responsibility for mis-identified causal agents, only for mis-identified causal relationships? Many non-statisticians may find it odd to limit the idea of causation to identifying and estimating causal relationships without considering how the cause and effect are named. Is it worthwhile willing forward research on this cause-naming topic?

Issue 3: Generalizing Causal Relationships, including in the Regression Discontinuity Context

Despite research on pooling experimental results and on identifying statistical interactions between treatments and study details, the Achilles heel of experimentation remains what is variously called external validity, heterogeneity, or causal generalization – the capacity to identify the boundary conditions on which a specific causal relationship depends, including total generalization across all boundary conditions within our space-time continuum.

Few experiments are based on samples randomly drawn from a clearly described population; most sampling particulars are products of opportunism. In research syntheses it is rare for the sample of available studies to be formally linked to a meaningful population and, anyway, formal sampling models are not relevant to all the objects of causal generalization – e.g., to different types of settings, times, and ways of operationalizing the cause and effect. Also, studies rarely have samples so large that they permit examining how the treatment interacts with all, or even many, of the possible sources of heterogeneity. Scientific decisions about causal robustness are usually more weakly supported and depend, say, on a predominantly positive causal sign or on whether effect sizes differ across the possible sources of heterogeneity that happen to be examined. When causal results are not robust, it is common to use the pattern of results to begin the process of specifying the boundary conditions that determine variation in effect sizes. The limitation here is the extensiveness of the potential moderator variables available. Longitudinal surveys typically have better sampling bases and more measures of potential moderators, but internal validity is a much bigger problem with them than with experiments. And, if researchers cannot be “reasonably” sure of a causal connection, what is the point of worrying about its generalization?

When it comes to causal generalization, RDD is in even worse shape than the usual experiment. Although the process of selection into treatment and control is perfectly known and can be modeled, the unbiased causal inference that results depends on (1) having specified the correct functional form or bandwidth size (Imbens and Lemieux, 2008); (2) the utility of a local average treatment effect at the cutoff (LATE); and (3) the willingness to tolerate less statistical power than in the experiment. To try to counteract these limitations, the comparative RDD design (CRD) adds an untreated regression to the basic RDD structure, typically, to date, from a pretest measure (Wing and Cook, 2013), other covariates (Angrist and Rokkanen, 2015), or a non-equivalent comparison group (Tang et al., 2015). The purposes of this untreated regression are to provide additional support for the extrapolation that RDD always needs, to increase power by adding data, and to test causal generalization not just at the cutoff but also in the entire area of the assignment variable where the treatment is available.
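For readers less familiar with the basic RDD machinery that CRD extends, a minimal sketch on simulated data may help (all numbers are invented; this is not the Head Start analysis). It also makes point (1) concrete: the estimate is a local linear fit whose answer depends on the chosen bandwidth:

```python
import numpy as np

# Sharp RDD on simulated data: treatment switches on at the cutoff and
# the true effect there is tau = 2.
rng = np.random.default_rng(2)

n, tau, cutoff = 4_000, 2.0, 0.0
x = rng.uniform(-10, 10, n)                 # assignment variable
treated = x >= cutoff
y = 1.0 + 0.3 * x + tau * treated + rng.normal(0, 1, n)

def rdd_estimate(x, y, cutoff=0.0, bandwidth=3.0):
    """Fit a line on each side of the cutoff within the bandwidth and
    difference the two intercepts at the cutoff (a local linear estimate)."""
    left = (x >= cutoff - bandwidth) & (x < cutoff)
    right = (x >= cutoff) & (x <= cutoff + bandwidth)
    b_left = np.polyfit(x[left] - cutoff, y[left], 1)
    b_right = np.polyfit(x[right] - cutoff, y[right], 1)
    return np.polyval(b_right, 0.0) - np.polyval(b_left, 0.0)

tau_hat = rdd_estimate(x, y)
```

Only observations within the bandwidth contribute, which is why RDD gives a LATE at the cutoff and pays the power penalty noted in point (3).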

In Wing and Cook (2013), the key assumption of CRD is that the three untreated regression lines are parallel. These are the pretest scores on the untreated side of the




cutoff and on the treated side as well, plus the posttest observations in the untreated part of the assignment variable. When these slopes are plausibly parallel and the pretest is included in the outcome model together with the assignment variable and binary cutoff score, then the relationship of the assignment variable to the outcome will be zero. This is the ignorability condition that Angrist and Rokkanen (2015) believe is necessary for valid causal generalization beyond the cutoff score in CRD.
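The parallelism assumption can be checked directly by estimating the slope of each untreated segment and comparing them. The sketch below simulates three segments that share a slope by construction; the data, names, and tolerance are all invented for illustration, not drawn from Wing and Cook's study:

```python
import numpy as np

# Three untreated regression segments from a hypothetical CRD: pretest on
# both sides of the cutoff (at 0), posttest on the untreated side only.
rng = np.random.default_rng(3)

n, slope = 1_500, 0.5
x_pre = rng.uniform(-10, 10, n)    # pretest observed over the whole range
x_post = rng.uniform(-10, 0, n)    # posttest observed on the untreated side only
pre = 2.0 + slope * x_pre + rng.normal(0, 1, n)
post = 3.0 + slope * x_post + rng.normal(0, 1, n)   # higher level, same slope

def fitted_slope(x, y):
    return np.polyfit(x, y, 1)[0]

slopes = {
    "pretest, untreated side": fitted_slope(x_pre[x_pre < 0], pre[x_pre < 0]),
    "pretest, treated side": fitted_slope(x_pre[x_pre >= 0], pre[x_pre >= 0]),
    "posttest, untreated side": fitted_slope(x_post, post),
}
# "Plausibly parallel" here means the three slope estimates nearly coincide.
max_gap = max(slopes.values()) - min(slopes.values())
```

Note that the segments may differ in level (the posttest line sits higher) without threatening the assumption; only the slopes need to agree.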

Tang et al. have just tested these ideas using the national Head Start impact study, a nationally representative study of 2,499 children who were randomly assigned to Head Start or control status. As causal benchmarks, experimental estimates were computed both at the RD cutoff and also in the total treated area above it. The CRD was constructed out of the experimental data by creating an assignment variable (IQ test scores), defining a cutoff value on it, and then omitting the control cases from the treated side of the assignment variable and the treated cases from the control side. One comparison regression function came from pretest scores on each of these outcomes, and the second from a cohort of non-Head Start children one year older and from the same catchment areas. There were three study outcomes measured before and after entering Head Start – performance in math, literacy and social behavior. In addition, a second assignment variable was also designated to test if the results achieved with the IQ assignment variable would be replicated. They were, and in most of what follows we present results for the single IQ assignment variable. We use CRD-Pre to designate use of a pretest no-treatment regression function and CRD-CG to designate use of a non-equivalent independent comparison group. We refer to the randomized experiment as an RCT.
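The logic of carving an RDD out of an RCT can be sketched in a few lines. Everything here is simulated with invented numbers and variable names of our own choosing; it mimics only the construction step Tang et al. describe, not their actual models or data:

```python
import numpy as np

# Hypothetical RCT whose pretest doubles as the RDD assignment variable.
rng = np.random.default_rng(4)

n, tau, cutoff = 2_000, 5.0, 100.0
pretest = rng.normal(100, 15, n)          # stand-in for the IQ assignment variable
rand_treat = rng.random(n) < 0.5          # random assignment in the RCT
posttest = 0.8 * pretest + tau * rand_treat + rng.normal(0, 5, n)

# RCT benchmark: simple difference in means under random assignment.
rct_effect = posttest[rand_treat].mean() - posttest[~rand_treat].mean()

# Construct the RDD sample: keep treated cases only above the cutoff and
# control cases only below it, so the groups no longer overlap on pretest.
keep = (rand_treat & (pretest >= cutoff)) | (~rand_treat & (pretest < cutoff))
x, y, d = pretest[keep], posttest[keep], rand_treat[keep]

# A naive group difference is now badly confounded by the assignment variable...
naive_gap = y[d].mean() - y[~d].mean()

# ...but modelling the assignment variable (here with the true linear form)
# recovers an estimate close to the RCT benchmark.
X = np.column_stack([np.ones(x.size), x - cutoff, d.astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted_effect = beta[2]
```

The carved design deliberately destroys group overlap on the assignment variable, so everything then rides on modelling that variable correctly, which is why the comparison regressions that CRD adds matter so much.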

Analyses not reported here showed that the three untreated regression segments were plausibly parallel for each outcome and each assignment variable. One question is how similar the experimental, basic RDD and CRD estimates are at the cutoff. This is where theory indicates each should be unbiased, and so the two causal estimates should be similar, not just to each other within the limits of sampling error, but also to the RCT when it is estimated as a LATE. More important is the second question: How similar are the RCT, CRD-Pre and CRD-CG estimates away from the cutoff, given that basic RDD is not estimated there and the internal validity of CRD designs is unknown there? A final question is how similar standard error estimates will be across the three design variants, after adjusting for the inevitably smaller RDD group due to how it was constructed from the RCT groups.

Figure 19 shows results comparing only the RCT and basic RDD for each outcome and assignment variable. They show the expected results: no bias at the cutoff, and larger confidence intervals with basic RDD.

Figure 20 shows the results for CRD-Pre at the cutoff, and Figure 21 for CRD-CG at the cutoff. In each case the estimates seem close to the experimental ones (and hence also the basic RDD). The confidence intervals are narrower than with the basic RDD and are thus closer to those from the RCT. But they are not equal to the RCT.

Figure 22 shows the causal estimates away from the cutoff for CRD-Pre and thus calculated for more cases than just at the cutoff. Figure 23 shows the results for CRD-CG. In each case, both the treatment estimates and confidence intervals are similar to those from the RCT. Under the conditions tested, the RCT, CRD-Pre and CRD-CG results are




Figure 19: Basic RD estimates and confidence intervals compared to the corresponding RCT benchmarks at the cutoff over three outcomes and two assignment variables.

Figure 20: CRD-Pre estimates and confidence intervals at the cutoff compared to the corresponding RCT benchmarks for one assignment variable.

virtually interchangeable with respect to both bias away from the cutoff and statistical power.

Are such findings of any relevance to the inheritance in which Cochran was embedded, that he marginally improved, and that he then passed on for others to improve again and again and again? In Tang et al., as in Wing and Cook, the RCT causal estimates and standard errors are reproduced in the CRDs merely by adding supplementary data to what is otherwise a weak but theoretically unbiased design – the basic RDD. Does RDD fall




Figure 21: CRD-CG estimates and confidence intervals at the cutoff compared to the corresponding RCT benchmarks for one assignment variable.

Figure 22: CRD-Pre estimates and confidence intervals above the cutoff compared to the corresponding RCT benchmarks for one assignment variable.

outside of the Cochran inheritance's bailiwick, or is it worth including in it even if only in some forms (like CRD) but not others (like basic RDD)? CRD certainly needs willing forward, but will statisticians of causation take on this task?

Conclusion

The three examples presented here support causal conclusions; they may even strongly warrant them. One conclusion is about the achievement consequences of a national law; another about the consequences of a county-level reform on bail bond amounts and the number of officiating judges; and the third is about the consequences of CRD for the quality of




Figure 23: CRD-CG estimates and confidence intervals above the cutoff compared to the corresponding RCT benchmarks for one assignment variable.

causal inferences relative to an RCT. None of these studies started from the premise of creating an observational study that mimics what would be done in an experiment. The first example involves multiple hypothesis tests, each individually imperfect, rather than a single sharply focused test. The second compares legal charges whose seriousness widely differs and whose bail amounts, unlike in an experiment, hardly overlap at all. The third is about a design (RDD) whose treatment and control groups also do not overlap and so are not matched. Yet each of these designs supported strong causal conclusions without any matching for a single strong test.

Indubitably, there are great conceptual advantages when observational studies are considered as though they were broken-down experiments; and in actual research practice, there are many opportunities to match individual cases where the selection process is mostly known and where relevant and otherwise rich covariates are available. I am not against the utility of the mimetic conception of observational study design that Cochran willed forward. My concern is limited. It is whether the current dominance of this conception threatens to rule out other ways of advancing observational study design that are predicated on different principles than the pursuit of ineffable exact matching. I would like to see more principles than matching implicated in observational study design, to see more arrows in the observational study quiver. I have tried to model some of them here.

No body of statisticians is charged with making decisions about which issues to include in Cochran's inheritance or to exclude from it. So there is no mechanism for formally answering the questions this paper posed about demarcation: what should and should not be part of the inheritance, and what might be willed forward because it seems worthy of inclusion?

The most one can hope is that, over time, a consensus will emerge about (1) supplementing the traditional research agenda on internal validity by increasing the profiles of both construct and external validity, and (2) adding RDD and CRD to the pantheon of acceptable designs, even though they are predicated on minimizing the very group overlap that experiments and




mimetic quasi-experiments seek to maximize. The issues I have described are important in the daily struggles of practitioners of observational studies who, like me, look to statisticians for help. I wish I could talk with Cochran to learn how he would make demarcation decisions. Of course, he might not be interested and simply reject the assumptions of this paper, perhaps believing I have reified the intrinsically fuzzy concept of an inheritance. But that would still be interesting to learn.

Though conversation with him is impossible, it is possible with some of his students and their students. Most of them are acquainted, I would guess, with the distinguished figure and record of Professor William G. Cochran of Harvard University. But it is important not to forget that in the working-class Glasgow of his childhood he would have been Oor Wullie, that he was Bill to his friends on both sides of the Atlantic, and that the few students of his I know remember him with affection and with respect for his work on observational study procedures that seek to approximate the logic and practice of randomized experiments. That his students and his students' students have willed his work forward so tellingly is a blessing for his memory; but it is not something any scholar should take for granted.

Consider Galileo. He used his lens as a telescope and revealed a heliocentric universe that transformed how Man, God and Nature are related. A decade later, he realized his lens was also a microscope, but he was not interested enough to venture far into the micro worlds that eventually transformed biology and medicine. Fortunately, others were more interested in microscopes than he was, and about 50 years later “the pioneers of microscopy Antonie van Leeuwenhoek and Robert Hooke revealed that a Lilliputian universe existed all around and even inside of us. But neither of them had students (my italics), and their researches ended in another false dawn for microscopy. It was not until the middle of the nineteenth century... that the discovery of the very small began to alter science in fundamental ways” (Flannery, 2015, p. 30). How lucky Cochran's students and his students' students are in being so centrally networked within the extraordinarily upright and rudimentary inheritance that Cochran received, improved and passed on to them. Does this very privileged position in history entail any obligation to take seriously the demarcation tasks so amateurishly described in this paper? I guess not, but...

Acknowledgments

This work was supported by NSF grant #DRL-1228866.

References

Angrist, J. D. and Rokkanen, M. (2015). Wanna get away? Regression discontinuity estimation of exam school effects away from the cutoff. Journal of the American Statistical Association. (Just accepted).

Campbell, D. T. (1966). Pattern matching as an essential in distal knowing. In Hammond, K. R., editor, The psychology of Egon Brunswik, pages 81–106. Holt, Rinehart & Winston, Oxford, UK.




Chetty, R., Looney, A., and Kroft, K. (2009). Salience and taxation: Theory and evidence. American Economic Review, 99:1145–1177.

Cochran, W. G. (1950). The comparison of percentages in matched samples. Biometrika,37(3/4):256–266.

Cochran, W. G. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society. Series A (General), 128(2):234–266.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24(2):295–313.

Cochran, W. G. (1983). Planning and analysis of observational studies. Wiley, New York.

Cook, T. D. (1985). Post-positivist critical multiplism. In Shotland, R. L. and Mark, M. M., editors, Social Science and Social Policy, pages 21–62. Sage Publications, Beverly Hills, CA.

Cook, T. D. (2008). “Waiting for life to arrive”: A history of the regression-discontinuity design in psychology, statistics and economics. Journal of Econometrics, 142(2):636–654.

Cook, T. D. and Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Houghton Mifflin, Boston.

Cook, T. D., Tang, Y., and Diamond, S. S. (2014). Causally valid relationships that invoke the wrong causal agent: Construct validity of the cause in policy research. Journal of the Society for Social Work & Research, 5(4):379–414.

Dee, T. S. and Jacob, B. (2011). The impact of No Child Left Behind on student achievement. Journal of Policy Analysis and Management, 30:418–446.

Diamond, S. S., Bowman, L. E., Wong, M., and Patton, M. M. (2010). Efficiency and cost: The impact of videoconferenced hearings on bail decisions. Journal of Criminal Law and Criminology, 100:869–902.

Flannery, T. (2015). How you consist of thousands of tiny machines. New York Review of Books, LXII(12):30–32.

Imbens, G. W. and Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2):615–635.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY.

Imbens, G. W. and Wooldridge, J. (2007). What’s new in econometrics? Lecture notes, NBER Summer Institute.

Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46:806–834.




Rosenbaum, P. R. (2005). Observational studies. In Everitt, B. S. and Howell, D. C., editors, Encyclopedia of statistics in behavioral science, Vol. 3, pages 1451–1462. Wiley & Sons, Chichester, UK.

Rosenbaum, P. R. (2009). Observational Studies. Springer-Verlag, New York, NY, 2nd edition.

Rosenbaum, P. R. (2011). Some approximate evidence factors in observational studies. Journal of the American Statistical Association, 106:285–295.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston.

Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Harvard University Press, Cambridge, MA, and London, England.

Tang, Y., Cook, T. D., and Kisbu-Sakarya, Y. (2015). Reducing bias and increasing precision by adding either a pretest measure of the study outcome or a nonequivalent comparison group to the basic regression discontinuity design. Working paper, Institute for Policy Research at Northwestern University.

Watson, G. S. (1982). William Gemmell Cochran 1909–1980. The Annals of Statistics,10(1):1–10.

Wing, C. and Cook, T. D. (2013). Strengthening the regression discontinuity design using additional design elements: A within-study comparison. Journal of Policy Analysis and Management, 32(4):853–877.

Wong, M., Cook, T. D., and Steiner, P. M. (2015). Adding design elements to improve time series designs: No Child Left Behind as an example of causal pattern-matching. Journal of Research on Educational Effectiveness, 8(2):245–279.

Yates, F. (1964). Sir Ronald Fisher and the design of experiments. Biometrics, 20(2):307–321.



Observational Studies 1 (2015) 165-170 Submitted 12/14; Published 8/15

Design and interpretation of studies: relevant concepts from the past and some extensions

David R. Cox [email protected]
College, Oxford University
Oxford OX1 1NF, United Kingdom

Nanny Wermuth [email protected]

Mathematical Statistics, Chalmers University of Technology

Gothenburg, Sweden and

Medical Psychology and Medical Sociology, Gutenberg-University

Mainz, Germany

We are happy to have the chance of discussing the paper by W. G. (Bill) Cochran, titled “Observational Studies” and reprinted here. It appeared first in 1972, and we call it the “present paper” below. We start, however, by describing our personal encounters with Bill.

1. Personal Encounters with Bill Cochran

DRC: I first heard Bill Cochran lecture in 1956 and, about that time, greatly benefited from his pre-publication comments on a draft of a book on experimental design. I recall also a memorable meeting of the Royal Statistical Society at which the precursor (Cochran, 1965) of the present paper was given for discussion.

NW: As a Ph.D. student, I was fortunate to get to know Bill Cochran as an excellent teacher and researcher. His way of teaching was typically most illuminating for me. He was involved in many different types of empirical studies and he shared his experiences openly with the students. He would talk with joy about successes but would also report on disappointing developments that had led to difficult, unsolved problems. I regarded him as the heart of our department. He stressed the positive features of his colleagues and he remembered the names of all the students as well as what he had discussed with them before. This could concern statistical questions or personal experiences. He was kind and modest, typically full of energy, and always ready to listen and talk. I learned a lot from him, not only about statistics.

2. Discussion of Cochran’s “Observational Studies”

The present paper is striking for its relevance even after so many years. Cochran’s concepts and ideas are presented with clarity and simplicity. Many of them appear to be ignored in the current inrush of “big data.” This makes many of Cochran’s points ever more topical.

The discussion of principles of design makes it clear that there are essential differences between experiments and observational studies. In experiments, crucial aspects are under the investigator’s control, while in observational studies the features measured will largely

© 2015 David Cox and Nanny Wermuth.


Cox and Wermuth

have to be accepted as they happen to arise. Cochran stresses however that, nevertheless,experiments and observational studies have much in common.

In particular, for the types of observational study he is discussing, the motivation is a search for causes. Several variables may be viewed as treatments in a broad sense. For instance, stronger positive effects may be expected for a set of new teaching methods, or stronger negative effects after exposure to higher levels of several risk factors for a given disease. When experiments are not feasible, the main aim is still to establish, as firmly as possible, the link with an underlying data generating process. Cochran states this as: “A claim of proof of cause and effect must carry with it an explanation of the mechanism by which this effect is produced.” Thus, an underlying data generating process is to be scientifically explainable in the given subject matter context.

Some of the terminology has changed since the paper was written, but several key aspects remain essential for any planned study today:

• stating the main objectives of a study before the data are collected,

• planning for well-defined comparisons and for one or several control groups,

• thinking about the types of measurements needed and how to assure their comparability,

• specifying target populations and being aware of nonresponse as one reason for missing a target.

The relative importance of these aspects may differ in different fields of application. For example, in many areas of physics there is likely to be a secure base of background knowledge and theory, whereas in some types of social science research, this may not yet be the case.

The broad approach to design must depend also on the time-scale and costs of a single investigation. Whenever new studies can be designed speedily and the data can be collected quickly and analyses are easily computed and interpreted, then a flexible approach with a sequence of simple studies may be feasible. But when the effort and time involved in any single study is considerable, all four of the above points become essential for the study to be successful. A noteworthy example is the prospective study by Doll and Hill (1956) establishing cigarette smoking as a cause of lung cancer.

For experiments, R.A. Fisher (1926, 1935) had suggested, as principles of design, the need to avoid systematic distortions in treatment effects and the enhancement of the precision in estimates of effects. He stressed also the value of considering several treatments simultaneously rather than one factor at a time. This gives the chance to see whether effects are substantially modified for particular levels of another factor or for level combinations of several factors, that is, to understand major interactions. More importantly, it may help to establish the stability of an effect under a range of conditions by showing the absence of major interactions. This idea carries over directly to observational studies.

However, avoiding systematic distortions, often also called “bias,” is considerably harder in observational studies. In experiments, in addition to creating laboratory-like conditions for obtaining measurements for quantitative variables and observations for categorical variables, the main tools are randomization, that is, random allocation of participants to treatment levels, stratification (also called subclassification or standardisation), the use of important covariates (in some contexts called concomitant variables) and blocking (which in observational studies turns into matching).

Clinical trials with randomized allocation of patients to treatments ideally may be regarded as experiments rather than observational studies. But in reality, distorted estimates of treatment effects can occur even in such clinical trials, for instance, when relevant intermediate variables are overlooked, such as non-compliance of patients with assigned treatments, or when there is a substantial undetected interactive effect of a treatment and a background variable on the response, even though, by successful randomization, this background variable has become independent of the treatment. Thus, Cochran’s statement (on page 85) that “in regard to the effect of x on y, matching and standardization remove all bias” cannot hold when one of the above-mentioned sources of distortion for treatment effects is present.

When randomization is not an option, the next best approach is to design a prospective longitudinal study. But it may take a long time to see any results and these types of study are often expensive. They offer, however, the possibility of deriving and studying data generating processes. This option was not yet available in the 1970s except in the special situation of only linear relations and with responses that are affected one after the other, that is, when path analysis, called recursive systems by Strotz and Wold (1960), is applicable. The importance of such an approach was rarely appreciated at that time; the textbook by Snedecor and Cochran (1966) was a notable exception.

The direct generalisation of path analysis, to include other than linear relations and arbitrary types of variables, is to the directed acyclic graph (DAG) models. A more appropriate class of models for data generating processes are the recursive systems in single as well as joint responses, called traceable regressions; see Wermuth (2012), Wermuth and Cox (2013, 2015). In these models, several responses may be affected at the same time, such as, for instance, systolic and diastolic blood pressure, which are two aspects of a single phenomenon, namely the blood pressure wave. Both will, for instance, be influenced at the same time when patients receive a medication to reduce high blood pressure.

These sequences of regressions form one subclass of the so-called graphical chain models and they include DAG models as a subclass. They often permit the use of a corresponding graph to trace pathways of development and they may be compatible with causal interpretations. They also take care of a main criticism of DAG models regarding causal interpretations by Lindley (2002): that DAGs do not include joint responses and therefore cannot capture many types of causal processes.

In the last section of the present paper, there is a beautiful illustration of the suggestion “make your theories elaborate,” given by R.A. Fisher when asked how to clarify the step from association to causation; see Cochran (1965). We fully agree that this step needs careful planning of studies and good judgement in interpreting statistical evidence.

In the meantime, some of our colleagues have derived a “causal calculus” for the challenging process of inferring causality; see Pearl (2015). In our view, it is unlikely that a virtual intervention on a probability distribution, as specified in this calculus, is an accurate representation of a proper intervention in a given real world situation. Their virtual intervention on a given distribution just introduces some conditional independence constraints and leaves all other aspects unchanged. This may sometimes happen, but experience from controlled clinical trials suggests that this is a relatively rare situation.

Even before the step to a causal interpretation, it is, as discussed below, less clear that matching or some adjustment will always be beneficial in observational studies. For instance, with pair-matched samples, no clear target population is defined, hence it often remains unclear to which situations the results could be generalized. Blocking in experiments and matching in observational studies clearly make the measurements in different treatment groups more comparable. And it has been demonstrated explicitly how, with more homogeneous groups to compare, both sampling variability and sensitivity to other sources of distortion are reduced; see Rosenbaum (2005).

But for data looked at only after pair-matching, it becomes impossible to study dependences among the matching variables, in particular, to recognize an extremely strong dependence among them in a target population that could even lead to a reversal of the dependence of the response on the treatment. In addition, if results for the dependence of a response are computed exclusively for explanatory variables other than the matching variables, then an important interactive effect, of a treatment and a matching variable on the response, may get overlooked; for some examples see McKinlay (1977).

The same holds for caliper matching, as defined in the present paper, and for a formal extension of it, called propensity-score matching by Rosenbaum and Rubin (1983). For a careful study and discussion of the large differences in estimated bias that can result with different choices of variables included in the propensity score and with different types of matching methods, see for instance Smith and Todd (2005).

Similarly, any adjustment of estimates depends typically on how well the associated model is specified; see for instance Bushway et al. (2007). For poor estimates or with some model misspecifications, adjustments may do harm instead of being beneficial. For approaches to move away from mere adjustments, see for instance Genback et al. (2014).

In all of these discussions of matching and adjustments in the literature, generating processes are rarely mentioned. But their importance was already stressed in the present paper even though at that time, more than 40 years ago, the corresponding sequences of regressions, necessary for full discussions, had been studied intensively only for the very special situation of exclusively quantitative responses and linear dependences.

Generating processes lead from background variables, such as intrinsic features of the individuals, via treatments and intermediate variables to the outcomes of main interest. In corresponding sequences of regressions, the dependence structure among directly and indirectly important explanatory variables is estimated and different pathways for dependences of the responses are displayed in corresponding regression graphs.

Such graphs may be derived from underlying statistical analyses for a given set of data and they represent hypothetical processes that can be tested in future studies. In addition, consequences of any given regression graph can be derived. Consequences that result after marginalizing over some of the variables, or after conditioning on other variables in such a way that the conditional independences present in the generating process are preserved for the remaining variables, can be collected into a “summary graph” by using, for instance, subroutines in the program environment R; see Sadeghi and Marchetti (2012).

In this way, it will become evident which variables need to be conditioned on, and such knowledge may possibly lead to a single measure for conditioning. Generating processes will point directly to situations in which seemingly replicated results in several groups, such as strong positive dependences, change substantially after marginalising over some of the groups, in some cases even turning positive into negative dependences. This can happen only when some of the grouping variables are strongly dependent. This well-known phenomenon has been named differently in different contexts, for instance as the presence of multicollinearity, as highly unbalanced groupings, or as the Yule-Simpson paradox. Conditions for the absence of such situations have been named and studied as conditions for “transitivity of association signs” by Jiang et al. (2015).
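
The sign reversal after marginalising can be seen in a small numerical sketch. The figures below are the well-known kidney-stone comparison of Charig et al. (1986), used here purely as an illustration; they are not data from Cochran’s paper or from the present comment.

```python
# Numerical sketch of the Yule-Simpson paradox.  Counts are the
# well-known kidney-stone comparison (Charig et al., 1986), used
# purely as an illustration.
# (successes, total) by treatment, within each stone-size group:
small = {"A": (81, 87), "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}

def rate(s, n):
    return s / n

# Treatment A is better within each subgroup ...
assert rate(*small["A"]) > rate(*small["B"])   # 0.93 vs 0.87
assert rate(*large["A"]) > rate(*large["B"])   # 0.73 vs 0.69

# ... yet worse after marginalizing over stone size, because the
# grouping variable (stone size) is strongly unbalanced across treatments.
overall = {t: (small[t][0] + large[t][0], small[t][1] + large[t][1])
           for t in ("A", "B")}
assert rate(*overall["A"]) < rate(*overall["B"])  # 0.78 vs 0.83
```

The reversal arises exactly as described above: the grouping variable is strongly dependent on the treatment, so the marginal comparison mixes very differently composed groups.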

With the dissemination of fully directed acyclic graph models, some more recent terminology has become common. For instance, when an outcome has one important explanatory variable and there exists, in addition, an important common explanatory variable for both, the latter is a confounding variable; when unobserved, it is now named an “unmeasured confounder” that may distort the true dependence substantially. Similarly, when an outcome has one important explanatory variable and another outcome depends strongly on both, then by conditioning on this common response, a distortion of the first dependence is introduced and is named “selection bias.”

In the current literature on “causal models” known to us, both these types of distortion are discussed separately. A related phenomenon, for which a first example had been given by Robins and Wasserman (1997), is typically overlooked: by a combination of marginalizing over and conditioning on variables in a given generating process, a much stronger distortion, now named “indirect confounding,” may be introduced than by an unmeasured confounder alone or by selection bias alone. Parametric examples for exclusively linear dependences and graphical criteria for detecting indirect confounding, in general, are available. The latter use summary graphs that are derived by marginalizing only; see Wermuth and Cox (2008, 2015).

The broad issues so clearly emphasized in the present paper remain central, challenging and relevant. That is to say, not only are firm statistical relations of particular kinds to be established, such as the estimation of treatment effects and of possibly underlying data-generating processes, but the statistical results need to be interpretable in terms of the underlying science.

References

Bushway, S., Johnson, B.D. and Slocum, L.A. (2007). Is the magic still there? The use of the Heckman two-step correction for selection bias in criminology. J. Quant. Criminol., 23, 151–178.

Cochran, W. G. (1965). The planning of observational studies. J. Roy. Statist. Soc. Ser. A, 128, 234–266.

Cochran, W. G. (1972). Observational studies. In T. A. Bancroft (ed.), Statistical Papers in Honor of George W. Snedecor. Iowa State University Press, Ames, Iowa.

Doll, R. and Hill, A. (1956). Lung cancer and other causes of death in relation to smoking. A second report on the mortality of British doctors. British Medical Journal, 2, 1071–1081.

Fisher, R.A. (1926). The arrangement of field experiments. J. Ministry of Agric., 33, 503–513.

Fisher, R.A. (1935). The Design of Experiments. Edinburgh: Oliver and Boyd.

Genback, M., Stanghellini, E. and de Luna, X. (2014). Uncertainty intervals for regression parameters with non-ignorable missingness in the outcome. Statistical Papers, open access article.

Jiang, Z., Ding, P. and Geng, Z. (2015). Qualitative evaluation of associations by the transitivity of the association signs. Statistica Sinica, 25, 1065–1079.

Lindley, D.V. (2002). Seeing and doing: the concept of causation. Int. Statist. Rev., 70, 191–214.

McKinlay, S.M. (1977). Pair-matching - a reappraisal of a popular technique. Biometrics, 33, 725–735.

Pearl, J. (2015). Trygve Haavelmo and the emergence of causal calculus. Econometric Theory, 31, 152–179.

Robins, J. and Wasserman, L. (1997). Estimation of effects of sequential treatments by reparametrizing directed acyclic graphs. In: Proc. 13th Ann. Conf. on Uncertainty in Artificial Intelligence, D. Geiger and P. Shenoy (eds.), Morgan Kaufmann, San Mateo, 409–420.

Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rosenbaum, P.R. (2005). Heterogeneity and causality: unit heterogeneity and design sensitivity in observational studies. The American Statistician, 59, 147–152.

Sadeghi, K. and Marchetti, G.M. (2012). Graphical Markov models with mixed graphs in R. The R Journal, 4, 65–73.

Snedecor, G.W. and Cochran, W.G. (1966). Statistical Methods (5th ed.). Ames: Iowa State University Press.

Smith, J.A. and Todd, P.E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? J. Econometrics, 125, 305–353.

Strotz, R.H. and Wold, H.O.A. (1960). Recursive vs. nonrecursive systems: an attempt at synthesis. Econometrica, 28, 417–427.

Wermuth, N. (2012). Traceable regressions. Intern. Statist. Review, 80, 415–438.

Wermuth, N. and Cox, D.R. (2008). Distortions of effects caused by indirect confounding. Biometrika, 95, 17–33.

Wermuth, N. and Cox, D.R. (2013). Concepts and a case study for a flexible class of graphical Markov models. In: Robustness and Complex Data Structures. Festschrift in honour of Ursula Gather. Becker, C., Fried, R. and Kuhnt, S. (eds.), Springer, Heidelberg, 331–350; also on ArXiv: 1303.1436.

Wermuth, N. and Cox, D.R. (2015). Graphical Markov models: overview. In International Encyclopedia of the Social and Behavioral Sciences, 2nd ed., J. Wright (ed.), 341–350; also on ArXiv: 1407.7783.

Observational Studies 1 (2015) 171-172 Submitted 5/15; Published 8/15

Comment on “Observational Studies”, by William G. Cochran

Stephen E. Fienberg [email protected]

Department of Statistics

Carnegie Mellon University

Pittsburgh, PA 15213-3890, USA

It has been a pleasure for me to reread this paper by Wm. Cochran after a number of years and to see it reprinted in the opening issue of the journal, Observational Studies. Although I was a graduate student in the Department of Statistics at Harvard in the 1960s, I never actually took a course from him, but I did sit in on a seminar where he described the work that appeared initially as department technical reports and then ultimately in a series of papers on the topic, e.g., see Cochran (1965, 1968) and Cochran and Rubin (1973); his lectures included early versions of the ideas that ultimately found their way into this paper and his posthumously published book on the topic (Cochran, 1983). I actually still have copies of those technical reports in my files! As I interacted with Cochran over the ensuing decade or so, I came to appreciate, more and more, the wisdom of his insights into drawing causal inferences from observational studies.

Although he is often thought of primarily as a sampling statistician or as a major force in the design of experiments, e.g., see Cochran and Cox (1957), Cochran was clearly heavily influenced in his work on observational studies by his experiences with several very large studies, such as the Kinsey Report (Cochran et al., 1953, 1954), and especially by his work as the only statistician on the 10-member scientific advisory committee for the 1964 U.S. Surgeon General’s report (United States Public Health Service, 1964) concluding that cigarette smoking caused lung cancer. In the Kinsey report, Cochran et al. argued that there was no basis for many of the inferences drawn by the authors, although he says this in quite a gentle way in the present paper, whereas in the case of the effects of smoking on cancer the observational evidence was strong, and he describes some of his thinking on the topic in Section 6.

Perhaps the most important thing we can take from this paper is Cochran’s clear message: If we really want to draw causal inferences from observational studies, then we need to think about their design and how such designs approximate those that we might have developed, had we only been able to design a randomized controlled trial. As methods for dealing with observational studies have developed over the ensuing decades, this same philosophy can be found in the books by Rosenbaum (2002, 2010) and the article by Rubin (2008).

This leads me to speculate how Cochran would have viewed the current “Big Data” movement. As others have remarked in recent years, drawing causal inferences from large amounts of data gleaned from the WWW is fraught with difficulty, and this involves both internal validity (to which randomization provides the key) and external validity (our ability to generalize to a relevant population). If only the proponents of big data for causal purposes would take the time to read Cochran’s 1972 paper with care!

© 2015 Stephen Fienberg.

Acknowledgments

The preparation of this comment was supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

References

Cochran, W. G. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society, Series A (General), pages 234–266.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, pages 295–313.

Cochran, W. G. (1983). Planning and Analysis of Observational Studies. John Wiley & Sons, New York. Lincoln E. Moses and Frederick Mosteller, eds.

Cochran, W. G. and Cox, G. M. (1957). Experimental Designs. John Wiley & Sons, New York, 2nd edition.

Cochran, W. G., Mosteller, F., and Tukey, J. W. (1953). Statistical problems of the Kinsey report. Journal of the American Statistical Association, 48(264):673–716.

Cochran, W. G., Mosteller, F., and Tukey, J. W. (1954). Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male. American Statistical Association, Washington, DC.

Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, pages 417–446.

Rosenbaum, P. R. (2002). Observational Studies. Springer, New York.

Rosenbaum, P. R. (2010). Design of Observational Studies. Springer, New York.

Rubin, D. B. (2008). For objective causal inference, design trumps analysis. Annals of Applied Statistics, 2(3):808–840.

United States Public Health Service (1964). Smoking and Health: Report of the Advisory Committee to the Surgeon General of the Public Health Service. US Department of Health, Education, and Welfare, Washington, DC.

Observational Studies 1 (2015) 173-181 Submitted 4/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Joseph L. Gastwirth [email protected]

Department of Statistics, George Washington University

Washington, DC 20052, USA

Barry I. Graubard [email protected]

Division of Cancer Epidemiology & Genetics, National Cancer Institute

Bethesda, Maryland 20892, USA

1. Introduction

The timelessness of Prof. Cochran’s contributions to the planning, design and analysis of both large-scale surveys and observational studies is exemplified by his 1972 paper summarizing many years of research. Prof. Small should be thanked by the statistics community for devoting a special section of the new journal Observational Studies to bring this important work to the attention of the next generation of statisticians and data scientists. Researchers in almost every field will benefit from reading the advice given in the paper at the very start of thinking about a study, whether randomized or observational. Our commentary will focus on differences between studies designed to make inferences applicable to the general population and studies carried out to understand what occurred in the population studied, i.e., where the study population is the population for which inferences will be made. In the first and most common setting, the importance of Cochran’s observation that one needs to consider whether the study population differs in some important ways from the general population will be illustrated by reviewing studies concerning the relationship between obesity, and more generally body weight, and mortality and morbidity. While that issue does not arise in the second setting, which occurs frequently in legal cases dealing with discrimination or violation of occupational safety and health rules, the wisdom of Prof. Cochran’s recommendations on analytic techniques and methods for controlling for the potential effect of confounders is very relevant to the proper analysis of statistical evidence.

2. Inference from samples from a population

Cochran points out that the target population should be identified and that a probability sample of the target population be collected. However, he points out that a probability sample may not be possible, e.g., because of cost and operational considerations, so that many studies obtain a sample from a population that differs somewhat from the target population. For example, to study the association of body weight (actually body weight adjusted for height, or body mass index (BMI), i.e., weight in kg divided by the square of height in meters) with all-cause mortality, researchers have used existing cohorts of adults such as the National Institutes of Health (NIH)-AARP Diet and Health Study (Adams et al., 2006), which we will call the NIH-AARP Study. The NIH-AARP Study sent a questionnaire in 1995-96 to all members of the AARP 50-71 years old who resided in six U.S. states (California, Florida, Louisiana, New Jersey, North Carolina, and Pennsylvania) and two metropolitan areas (Atlanta and Detroit). The questionnaire collected self-reported body weight and height, along with other relevant covariates such as smoking, alcohol consumption and physical activity, that were incorporated in the analysis to remove the potential for confounding in the estimated association of BMI and mortality. The sample studied had information on the 18% of male and female members of AARP (n = 567,169) who returned the questionnaires. The low response rate also raises important statistical concerns about the generalizability of the conclusions to the population of AARP members, much less the entire adult U.S. population 50-71 years old.

© 2015 Joseph Gastwirth and Barry Graubard.

The Adams et al. article states “Even against the background of advances in the management of obesity-related chronic diseases in the past few decades, our findings suggest that adiposity, including overweight, is associated with an increased risk of death” and compares their results to those reported by Flegal et al. (2005). The earlier study used national probability samples from the National Health and Nutrition Examination Survey (NHANES) to construct US-representative cohorts and did not find an increased risk of death in overweight (25 ≤ BMI < 30) individuals. The NIH-AARP Study does not state restrictions on the population to which their conclusions are applicable and implies that they are valid for the entire US population. Investigators (Durazo-Arvizo et al. 1997; Calle et al. 1999) have reported that the BMI-mortality relationship may vary by race, so those groups should be appropriately represented in the study sample. Although membership in the AARP is open to the entire population, it is not a representative sub-population of the over-50 population of the nation, as members must pay annual dues. Further, the low response rate potentially exacerbates the non-representativeness of the NIH-AARP sample.
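
The BMI calculation defined above can be sketched in a few lines. The cut-points below follow the standard WHO convention; the text itself only quotes the overweight range (25 ≤ BMI < 30), and the respondent's measurements are hypothetical.

```python
# BMI as defined in the text: weight in kilograms divided by the square
# of height in meters.  Cut-points follow the standard WHO convention;
# the range 25 <= BMI < 30 is the "overweight" category discussed above.
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

def category(b):
    if b < 18.5:
        return "underweight"
    if b < 25:
        return "normal"
    if b < 30:
        return "overweight"
    return "obese"

b = bmi(85.0, 1.78)   # hypothetical respondent: 85 kg, 1.78 m
print(round(b, 1), category(b))   # 26.8 overweight
```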

Cochran’s recommendation to discuss how differences between a study sample and the target population, such as the US population, may affect the interpretation of the inferences drawn is very important. The BMI-mortality relationship found in the NIH-AARP Study may not be generalizable to the entire US population, as its race/ethnic mix differs from that of the entire US. The 1995 US census projections for the race/ethnicity distribution of 50-69 year olds (the closest age range to the NIH-AARP available) were 80.7%, 9.5%, 6.5%, 3.3% (see http://www.census.gov/prod/1/pop/p25-1130/p251130.pdf) compared to the NIH-AARP race/ethnicity distribution of 91.6%, 3.7%, 1.8%, 1.6% for white, African-American, Hispanic and Asian, Pacific Islander or Native American, respectively. Not surprisingly, whites are over-represented in the NIH-AARP study; notice that all three minority groups are under-represented by a factor of two or more. The use of volunteers, i.e., nonrandom samples, from selected populations is typical of many epidemiologic studies; examples are plentiful, e.g., the Harvard University Nurses Health Study (http://www.channing.harvard.edu/nhs/), the National Cancer Institute U.S. Radiologic Technologists Cohort (http://dceg.cancer.gov/research/who-we-study/cohorts/us-radiologic-technologists), and the City of Hope National Medical Center California Teacher Study (https://www.calteachersstudy.org/study_data.html). Even though the sample size of these cohorts is quite large, the potential bias resulting from having specialized samples of a target population (e.g., the US population) is not ameliorated by the small standard errors of estimates of association obtained from large samples. This important point is often ignored by epidemiologists even though statisticians are well aware of the potential magnitude of this type of bias, e.g., from the Literary Digest pre-election poll in 1936. For conclusions concerning a relative risk obtained from a sampled population differing from the target one, the risk model fit to the data needs to be correctly specified, with the correct functional form of the relationship of the response to the covariates along with appropriate interactions, and information on all major covariates should be obtained. Only then will there be a good chance of estimating the risk accurately for a target population. Furthermore, the sample should span the same distribution of covariates as the target population; e.g., if the relative risks differ by age group, the age groups of the target population should be represented in the sample population.
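
The under-representation just described is simple arithmetic on the two percentage distributions quoted above; the short group labels below are our own shorthand, not categories from the census or the study.

```python
# Census projections vs. NIH-AARP sample, in the order quoted in the
# text: white, African-American, Hispanic, Asian/Pacific Islander/
# Native American.  Labels are shorthand for this sketch only.
census = {"white": 80.7, "black": 9.5, "hispanic": 6.5, "other": 3.3}
cohort = {"white": 91.6, "black": 3.7, "hispanic": 1.8, "other": 1.6}

for group in census:
    print(f"{group}: census/cohort = {census[group] / cohort[group]:.1f}")
# Whites are over-represented (ratio below 1); the three minority
# groups are under-represented by factors of about 2.6, 3.6, and 2.1.
```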

From a public health policy point of view, the results from cohort studies or other types of epidemiologic studies are most useful if they are generalizable to the target population of interest. In the Adams et al. paper and many other epidemiologic analyses, population attributable risks (PAR) for an “exposure” are used to estimate how many cases or deaths would be prevented if the exposure were eliminated or reduced by a suitable intervention. If the estimated relative risks from a particular study are not applicable to the target population, then the estimated PARs could be misleading, resulting in a misallocation of resources that might be better directed to more important public health exposures, e.g., preventing smoking, or to warning the public of a risk of a serious disease, e.g., Reye syndrome, from a frequently used product.
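
The text does not say which PAR estimator Adams et al. used; one common choice is Levin's formula, sketched below with hypothetical prevalence and relative-risk values to show how a non-transportable relative risk distorts the PAR.

```python
# Levin's formula for the population attributable risk: with exposure
# prevalence p and relative risk rr in the target population,
#     PAR = p*(rr - 1) / (p*(rr - 1) + 1).
# All numbers below are hypothetical.
def par(p, rr):
    return p * (rr - 1) / (p * (rr - 1) + 1)

true_par = par(0.35, 1.3)    # rr that actually holds in the target population
biased_par = par(0.35, 1.5)  # rr carried over from a non-representative cohort
print(round(true_par, 3), round(biased_par, 3))   # 0.095 0.149
```

Here a modestly inflated relative risk (1.5 instead of 1.3) raises the apparent fraction of preventable cases by more than half, which is exactly the kind of misallocation of resources discussed above.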

3. Inference for the study population

Although the objective of most statistical surveys and studies is to draw inferences from a subset, ideally a random sample, of a population that will apply to the much larger population, in some important applications one is concerned with drawing conclusions that will apply only to the study population. In many legal cases the question addressed by the statistical analysis can shed light on concerns about the appropriateness of the practices of a specific employer or firm. For example, in a fair trade case the question may focus on whether a particular exporter “dumped” or sold goods below cost, which is unfair to the importing nation’s producers; in an equal pay case the issue is whether female employees are paid the same as similarly qualified males. In both situations, the conclusions will only apply to the particular firm or employer; i.e., if it turns out that the firm did not dump goods or that the employer underpaid female employees by $2.00 an hour, those conclusions will not be considered in a similar case involving a different firm, nor would they imply that females employed in similar jobs throughout the nation are under-paid by an average of $2.00 an hour. This section will show that several of Prof. Cochran’s wise suggestions and guidelines are very useful in analyzing this type of observational study but also have been misinterpreted by “experts” and courts, as they were developed for situations where the ultimate inference will apply to a much larger population than the one studied.

Gastwirth and Graubard

In the legal setting, where one has data for the entire finite population for the period under review, summary statistics calculated from the data are in fact population quantities. Statisticians often impose a probability model to aid in interpreting and understanding the evidentiary strength of a difference in averages or percentages. For example, in an equal employment case concerning the fairness of an employer's promotions, suppose that 2 of 15 (13.3%) eligible female employees and 12 of 23 (52.2%) eligible males were promoted during the relevant time period. There is no sampling error involved; however, as an aid to interpreting the data, statisticians often assume that the promotions were randomly chosen from the pool of eligible employees. Assuming that all other relevant factors, e.g., seniority, are balanced in the two groups, the number of females among the 14 promotions follows a hypergeometric distribution, and Fisher's exact test yields a p-value of .02 (two-sided). Notice that the proportion of women promoted is about one-fourth the corresponding proportion of male employees, which is clearly meaningful. The statistical test informs the court that the data are unlikely to occur if promotions were randomly selected from the eligible pool. Thus, we infer that the gender of an eligible employee affected their chance of promotion. Courts then require the employer to justify their promotion process.
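The hypergeometric calculation behind this example is easy to reproduce. The sketch below implements a two-sided Fisher's exact test directly from its definition using only the Python standard library; the counts are those given in the text, and the function name and interface are ours, not from the sources cited.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more likely than the observed one."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def hyper(k):  # P(k successes in row 1, given fixed margins)
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = hyper(a)
    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    return sum(hyper(k) for k in range(lo, hi + 1)
               if hyper(k) <= p_obs * (1 + 1e-9))

# Promotion data from the text: 2 of 15 females and 12 of 23 males promoted
p = fisher_exact_two_sided(2, 13, 12, 11)
print(round(p, 3))  # about .02, the value quoted in the text
```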

Notice that the total sample size in the above situation is much smaller than in the applications discussed by Prof. Cochran. Unfortunately, courts often ignore data sets referring to the complete set of eligible employees and simply say the sample is too small. For example, the decision in the age discrimination case, Fallis v. Kerr-McGee Corp., 944 F.2d 743 (10th Cir. 1991), stated that the "sample" of 51 employees was too small.[1] A related problem is that expert witnesses have convinced courts that samples of 200-400 may be needed before data comparing the pass rates of minority and majority applicants on a pre-employment exam can be used to assess whether it has a disparate impact on the minority applicants.[2]

Because courts have not encouraged the analysis of data pertaining to a seemingly small population, they may not fully appreciate the meaning of a simple statistical summary. The case Chappell v. Ayala,[3] currently being considered by the U.S. Supreme Court, provides an illuminating example. The case concerns whether the lower courts properly considered a defendant's Batson allegation that the prosecutor discriminated against minorities by removing all seven minority members from the venire of potential jurors through peremptory challenges. Although the main legal issues concern the propriety of the trial judge excluding the defendant's lawyer from part of the proceedings where the prosecutor explained why the minorities were challenged, and the apparent loss of many questionnaires potential jurors filled out, the courts might have benefitted from a formal statistical analysis of the data. Even Judge Callahan, who dissented from the 9th Circuit's opinion granting the defendant a new trial, noted, "The only indicia of possible racial bias was the fact seven of the eighteen peremptory challenges exercised by the prosecutor excused African-American and Hispanic jurors." To properly interpret this information, one needs to know the number of non-minorities who were on the venire. The majority opinion noted that in the case, each side could remove 20 members of the venire by peremptory challenge when the jury of 12 was chosen and then had six more peremptory challenges to use when the six alternates were chosen. Thus, there must have been at least seventy individuals on the venire in order for the court to end with a jury of twelve and six alternates. To maximize the proportion of

[1] In the case, 3 of 9 employees over 40 were fired in contrast to 4 of 42 employees under 40. Analyzing the data with Fisher's test yields a non-significant result (one-sided p-value = .095), which would support the court's decision and avoid making an "ad hoc" judgement that a sample of 51 is too small.

[2] See Lopez v. Lawrence (D. Mass.), 2014 U.S. Dist. LEXIS 124139.

[3] The Supreme Court granted certiorari in Chappell v. Ayala, 2014 U.S. LEXIS 7094 (U.S., Oct. 20, 2014) and will review the decision Ayala v. Wong, 756 F.3d 656 (9th Cir. 2014); 2014 U.S. App. LEXIS 3699.


Comment on Cochran’s “Observational Studies”

minorities in the pool from which the jury of twelve was chosen, let us assume that the trial court proceeded in two stages: first selecting the jury and then the alternates. Allowing for each side to have twenty peremptory challenges, the minimum size of the panel from which the twelve jurors were chosen is 52, of whom 7 were minority. The prosecution actually removed 18 members of the panel, all seven minorities and eleven whites. Applying Fisher's exact test shows that the probability that a random sample of 18 taken from a pool of 7 minorities and 45 whites would include all 7 minorities is .00024, or about 1 in 4000.[4] This is quite a significant result, which suggests that the court should carefully examine the reasons the prosecution offers to justify its challenges when the judge compares the characteristics of the minority members excluded with those of the majority members who were not excluded, to see whether the offered reasons were applied to all members of the venire.[5]
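The figure of .00024 follows from a one-line hypergeometric calculation. The sketch below, using only Python's math.comb, reproduces it under the two-stage selection model described in the text (a panel of 52 with 7 minority members, from which 18 are struck at random).

```python
from math import comb

# Probability that a random set of 18 strikes from a panel of 52
# (7 minority, 45 white) contains all 7 minority members
p = comb(7, 7) * comb(45, 11) / comb(52, 18)
print(f"{p:.5f}")  # prints 0.00024, i.e. about 1 in 4000
```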

Prof. Cochran emphasized the usefulness of matching and stratification methods, and they are especially appropriate when the results of one's analysis need to be explained to a non-statistician, who can understand that the factors used in the matching/stratification process are controlled for. In other words, proper matching and stratification can reduce the analysis to examining simple means or proportions when otherwise a less intuitive regression method might be needed to adjust for the stratification or matching variables.

If the concomitant factor used in the matching process is ordinal, however, one may lose some relevant information. As an example, consider the pay data in Table 1 from EEOC v. Shelby County Government, a case concerning whether women clerical employees in the county's criminal court were discriminated against in pay and promotion. In the opinion, the judge noted that judges are very familiar with the duties of these clerical workers and found unequal pay after considering the pay data in Table 1, stratified into four seniority levels. Although the data are so clear that a formal statistical test was not required, Gastwirth (1992) applied the Mann-Whitney form of the Wilcoxon test to the data in each stratum and combined the results using the van Elteren procedure. The result was highly significant (p < .001). One feature of the data, however, is ignored in this analysis: some men are paid more than women with noticeably more seniority. For example, D.V., a male, is paid more than the four females who have higher seniority (i.e., F.R., T.D., P.B., and P.E.), and B.W., another male who has even less seniority than D.V., is paid more than three of those females and the same as the fourth. This phenomenon held true even in 1988, five years after the charge was filed (Gastwirth, 1992).

[4] If the trial court started with a panel large enough for it to select twelve jurors and six alternates, the minimum size would be 70, and the probability that all seven minorities would be in a random sample of 18 from this larger pool would be 2.55 × 10^-5, or just over one in 40,000. Unfortunately, none of the opinions reports the full data set or provides a detailed description of the original jury selection procedure.

[5] In United States v. Omoruyi, 7 F.3d 880 (9th Cir. 1993), the prosecutor peremptorily challenged the two single minority females, and the defendant raised a Batson claim after the second one was removed. The trial judge accepted the prosecutor's claim that he removed them because they were single. The appellate court noted that the prosecutor had not peremptorily challenged single, unmarried men in the jury panel and granted the defendant a new trial. In contrast, in Alviero v. Sam's Warehouse Club Inc., 253 F.3d 933, 940-41 (7th Cir. 2001), the court accepted the prosecutor's explanation of the removal of all three female members of the jury panel on the basis of their limited work experience and level of education, even though some males with similar educational backgrounds but more work experience were not challenged.


Table 1. Pay Data for Male and Female Clerical Employees of the Shelby County Criminal Court, and Estimated Damages for Female Employees Using the Peters-Belson Approach

Initials   Gender   Hire date   Salary in 1983     Estimated 1983 damages for
                                ($ per month)      female employees ($ per month)
F.R.       F        5/73        1474               203.61
J.t.V.     M        9/73        1666
T.D.       F        1/74        1403               227.50
C.H.       M        1/74        1666
P.B.       F        5/74        1403               203.94
L.A.       M        5/74        1548
C.C.       M        5/74        1548
P.E.       F        8/74        1403               186.27
D.V.       M        9/74        1548
T.t.       F        5/75        1112               424.27
D.L.       F        1/76        1306               183.15
S...       F        2/76        1336               147.26
D.V.       M        3/76        1548
J...       F        9/76        1336               106.04
B.W.       M        1/78        1474
..D.       F        9/78        1000               300.69
..t.       F        10/79       1000               224.12
J.A.       M        10/79       1157
C.D.       M        8/82        1000
t.S.       F        9/82        929                88.99
a.D.       F        12/82       929                71.32
V.H.       F        1/83        929                65.43
S.C.       F        7/83        800                159.10


The Peters-Belson (Peters, 1941; Belson, 1956) approach to regression, discussed in Cochran and Rubin (1973), was used by Gastwirth and Greenhouse (1994) to analyze these data. First, one fits a model relating the salaries of male employees to their seniority. Then one predicts the salary a female employee would have received had she been paid according to the male equation. The differences Di for each female estimate the shortfall (if negative) in her salary, and Z = D̄/√V(D̄), where D̄ is the average of the Di and V(D̄) is the variance of D̄; Z is approximately normally distributed in large samples and follows a t-distribution in small ones. For the Shelby data the model was a linear regression predicting a worker's salary in 1983 from the number of months they had worked. Table 1 displays the salary, hire date and gender data that we use here; see Table 7 in Gastwirth (1992) for these data and salary data for other years. The observed average shortfall D̄ = $185.12 in the monthly salary of females has a standard error of 35.62, resulting in a two-sided p-value < .001. Another analytic approach, which does not require the assumption that the errors in the regression model follow a normal distribution and follows logically from the idea that one is imposing a probability model on the data, is a permutation test. A complete permutation test would consist of swapping the gender labels and repeatedly applying the Peters-Belson approach to each relabeled data set. As there are 9 males and 14 females, there are C(23, 9) = 817,190 ways to relabel gender in the Shelby data. For computational purposes we randomly selected (without replacement) 1,000 relabeled data sets and found only 3 of the resulting D̄ values to be as large or larger in absolute value than the observed shortfall, yielding a two-sided p-value of 0.003. Table 1 shows that the Peters-Belson estimate of Di for each female employee is negative, which illustrates the unfairness of the pay system examined and provides an estimate of the amount of money each woman deserves. Other uses of permutation methods in least squares are described in Sprent (1998).
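As a rough illustration of the Peters-Belson permutation procedure described above, the sketch below fits the male salary equation, computes the average female shortfall, and permutes the gender labels. The data are synthetic stand-ins, not the Shelby figures (those appear in Table 1 and in Gastwirth, 1992); the linear model of salary on months of seniority mirrors the one described in the text.

```python
import random
from statistics import mean

def pb_shortfall(months, salary, is_male):
    """Peters-Belson sketch: fit salary ~ months on the male group by least
    squares, predict each female's counterfactual salary, and return the
    average difference (predicted minus actual) over females."""
    mx = [m for m, g in zip(months, is_male) if g]
    my = [s for s, g in zip(salary, is_male) if g]
    xbar, ybar = mean(mx), mean(my)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(mx, my))
             / sum((x - xbar) ** 2 for x in mx))
    intercept = ybar - slope * xbar
    diffs = [(intercept + slope * m) - s
             for m, s, g in zip(months, salary, is_male) if not g]
    return mean(diffs)

# Illustrative synthetic data (not the Shelby figures): 9 males, 14 females,
# with female salaries set about $150/month below the male seniority line.
random.seed(0)
months = [random.randint(6, 130) for _ in range(23)]
is_male = [True] * 9 + [False] * 14
salary = [900 + 5 * m + random.gauss(0, 40) - (0 if g else 150)
          for m, g in zip(months, is_male)]

observed = pb_shortfall(months, salary, is_male)

# Permutation test: relabel gender at random and recompute the shortfall
reps = 1000
count = sum(1 for _ in range(reps)
            if abs(pb_shortfall(months, salary,
                                random.sample(is_male, len(is_male))))
            >= abs(observed))
p_value = count / reps
print(round(observed), p_value)
```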

The Peters-Belson (PB) method is related to the use of counterfactuals in the Neyman-Rubin (Holland, 1986; Morgan and Winship, 2007) approach to causal inference if one considers the PB estimated salary for each female, obtained from the male equation, as the estimated salary of her "male counterfactual." This predicted salary, however, may not be the salary of any of the male employees; rather, it is a "statistical match" in the terminology of Peters (1944). The accuracy of the shortfall obtained from PB regression depends on the appropriateness of the model and the completeness of the information on the covariates. In the context of an "Equal Pay" case, the employer knows the relevant factors used in determining salaries and the relative weight given to each of them, and should ensure that accurate information on them is obtained and retained.

4. Summary and future thoughts

Very few publications remain highly relevant to their subject after forty years have passed. Professor Cochran's 1972 paper and his earlier work, which is summarized in it, are in that special category. Every investigator should review his recommendations on the need for a clear statement of the objectives of a study when planning and designing one, and follow his suggestions, e.g., to have a pilot study, at those stages. His discussion of the various methods for removing the effects of confounders remains the basis of much current research (Rosenbaum, 2002).


Our comments focused on the value of Cochran's emphasis on the need to consider the effect of differences between the study population and the population to which the inferences drawn from the study will be applied, and on the situation when the study group is the entire population. In the context of examining the complete population, especially in a legal case, Prof. Cochran's concern with the stability of the relationship, presumably over time, is less important than in the usual setting, where one desires to draw inferences valid for a much larger population from a sample and to learn about the underlying mechanisms producing the response. In an equal employment case, one's focus is on what happened during the few years in which the employer used the practice (job requirement, pay decision process) under review. Indeed, quite often an employer will change policies in response to a claim, so that the earlier relationship between salary and gender or race and other covariates may well change.

In practice, almost no large-scale study will be "perfect": the statistical model will generally be only a good "approximation" of the relationship of the response to the predictors, there will be errors of measurement, and a potentially relevant covariate may be omitted. Readers should be aware of the usefulness of sensitivity analysis (Rosenbaum, 2002) and, in particular, of the importance of the Cornfield inequality (Cornfield et al., 1959; Gastwirth and Greenhouse, 1995) in assessing whether a possibly omitted factor can explain a statistically significant difference between two groups. Briefly, Cornfield gave conditions on the strength of the omitted variable and on the imbalance or difference in its prevalence between the groups that must be satisfied in order for it to explain an observed relative risk. The result was used by Gastwirth (1992) to show that judges who required the party suggesting that an observed difference or relative risk was due to an omitted variable to submit an analysis including that factor were correct.
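Cornfield's conditions are simple enough to state as a short check. The function below is a minimal sketch of the necessary (not sufficient) conditions as they are commonly stated for a binary confounder; the function name and interface are illustrative, not from the sources cited.

```python
def cornfield_can_explain(rr_observed, rr_confounder_outcome, prevalence_ratio):
    """Cornfield's minimum conditions: for an unmeasured binary confounder
    to fully account for an observed relative risk, both its association
    with the outcome and its prevalence ratio between the comparison
    groups must be at least as large as the observed relative risk.
    These conditions are necessary, not sufficient."""
    return (rr_confounder_outcome >= rr_observed and
            prevalence_ratio >= rr_observed)

# A confounder with outcome risk ratio 2 and prevalence ratio 3 cannot
# explain an observed relative risk of 4:
result = cornfield_can_explain(4.0, 2.0, 3.0)
print(result)  # False
```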

In view of recent interest in the issue of reproducibility of scientific studies, basing inferences that will be applied to the target population on random samples from it has the advantage that investigators using different random samples should arrive at similar results, within sampling variation.

References

Adams K.F., Schatzkin A., Harris T.B., Kipnis V., Mouw T., Ballard-Barbash R., Hollenbeck A. and Leitzmann M.F. (2006). Overweight, obesity, and mortality in a large prospective cohort of persons 50 to 71 years old. New England Journal of Medicine, 355, 763–778.

Belson W.A. (1956). A technique for studying the effects of a television broadcast. Journal of the Royal Statistical Society, Series C (Applied Statistics), 5, 195–202.

Calle, E.E., Thun, M.J., Petrelli, J.M., Rodriguez, C. and Heath, C.W. Jr. (1999). Body-mass index and mortality in a prospective cohort of U.S. adults. New England Journal of Medicine, 341, 1097–1105.

Cochran, W.G. and Rubin, D.B. (1973). Controlling bias in observational studies: a review. Sankhyā, Series A, 35, 417–446.

Cornfield, J., Haenszel, W., Hammond, E.C., Lilienfeld, A.M., Shimkin, M.B. and Wynder, E.L. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22, 173–203.


Durazo-Arvizu R., Cooper R.S., Luke A., Prewitt T.E., Liao Y. and McGee D.L. (1997). Relative weight and mortality in U.S. blacks and whites: findings from representative national population samples. Annals of Epidemiology, 7, 383–395.

Flegal K.M., Graubard B.I., Williamson D.F. and Gail M.H. (2005). Excess deaths associated with underweight, overweight, and obesity. Journal of the American Medical Association, 293, 1861–1867.

Gastwirth, J.L. (1992). Methods for assessing the sensitivity of statistical comparisons used in Title VII cases to omitted variables. Jurimetrics, 33, 19–34.

Gastwirth, J.L. (1992). Statistical reasoning in the legal setting. American Statistician, 46, 55–69.

Gastwirth, J.L. and Greenhouse, S.W. (1995). Biostatistical concepts and methods in the legal setting. Statistics in Medicine, 14, 1641–1653.

Holland P.W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–970.

Morgan, S.L. and Winship, C. (2007). Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, Cambridge.

Peters C.C. (1941). A method of matching groups for experiment with no loss of population. The Journal of Educational Research, 34, 606–612.

Sprent, P. (1998). Data Driven Methods in Statistics. Chapman & Hall, London.


Observational Studies 1 (2015) 182-183 Submitted 1/15; Published 8/15

The State of the Art in Causal Inference: Some Changes Since 1972

Andrew Gelman [email protected]

Department of Statistics and Department of Political Science

Columbia University

New York, NY 10027, USA

William Cochran’s 1972 article on observational studies is refreshing and includes rec-ommendations and warnings that remain relevant today. Also interesting are the ways thatCochrans advice differs from recent textbook treatments of causal inference from observa-tional data.

Most notable, perhaps, is that Cochran talks about design, estimation, and the general goals of a study but says almost nothing about causality, devoting only one page out of ten to the topic. In statistical terms, Cochran spends a lot of time on the estimator (and, more generally, the procedure for deciding what estimator to use) but never defines the estimand in an observational study. He refers to bias but gives no clear sense of what exactly is being estimated (he does not, for example, define any sort of average causal effect). Modern treatments of causal inference are much more direct on this point, with the benefit of the various formal models of causal inference that were developed by Rubin and others starting in the 1970s. Scholars have pointed out the ways in which the potential-outcome formulation derives from earlier work by statisticians and economists, but Cochran's chapter reveals what was missing in this earlier era: there was only a very weak connection between the substantive concerns of design and measurement, and statistical inference decisions regarding matching, weighting, and regression. In more recent years, the filling in of this gap has been an important research area of Rosenbaum and others; again, seeing Cochran's essay gives us a sense of how much needed to be done.

One area that Cochran discusses in detail, and which I think could use more attention in modern textbooks (including those of my collaborators and myself), is measurement. Statistics has been described as living at the intersection of variation, comparison, and measurement, and most textbooks in statistics and econometrics tend to focus on the first two of these, taking measurement for granted. Only in psychometrics do we really see measurement getting its due. So I was happy to see Cochran discuss measurement, even if he did not get to all the relevant issues, in particular external validity, which has been the subject of much recent discussion in the context of laboratory experiments vs. field experiments vs. observational studies for social science and policy.

In reading Cochran’s chapter, I was struck by his apparent lack of interest in causalidentification. Modern textbooks (for example, the econometrics book of Angrist and Pis-chke) discuss the search for natural experiments, along with the assumptions under whichan observational study can yield valid causal inference, and various specific methods such asinstrumental variables and regression discontinuity that can identify causal effects if defined

c⃝2015 Andrew Gelman.

Page 60: Observational Studies and Comments · observational study data sets and how to access them. The goal of the descriptions of observational study data sets is to enable readers to form

The State of the Art in Causal Inference: Some Changes Since 1972

carefully enough under specified conditions. In contrast, Cochran discusses generic before-and-after designs and restricts himself to analysis strategies that do basic controlling forpre-treatment covariates by matching and regression. He is not so clear on what variablesshould be controlled for (which perhaps can be expected given that he was writing beforeRubin codified the concept of ignorability), and this has the practical consequence that hedevotes little space to any discussion of the data-generating process. Sure, an experimentis, all else equal, better than an observational study, but we don’t get much guidance onhow an observational study can be closer or further from the experimental ideal. Cochrandid write, “a claim of proof of cause and effect must carry with it an explanation of themechanism by which the effect is produced,” which could be taken as an allusion to the sub-stantive assumptions required for causal inference from observational data but he suppliedno specifics, nothing like, for example, the exclusion restriction in instrumental variablesanalysis.

Another topic that has appeared from time to time in the causal inference literature, notably in work by Leamer in the 1970s and in recent years by researchers such as Ioannidis, Button, and Simonsohn in medicine and psychology, is the bias resulting from the search for low p-values and the selective publication of large and surprising results. We are increasingly aware of how the "statistical significance filter" and other sorts of selection bias can distort our causal estimates in a variety of applied settings. Cochran, though, followed the standard statistical tradition of approaching studies one at a time; the terms "selection" and "meta-analysis" do not appear at all in his essay. Just to be clear: in noting this perspective, I am not suggesting that his own analyses were rife with selection bias. It is my impression that, in his work, Cochran was much more interested in improving the highest-quality research around him and was not particularly interested in criticizing the worst stuff. I get the sense, though, that, whatever things may have been like in the 1960s, in recent years selection bias has become a serious problem even in much of the most serious work in social science and medicine, and that careful analysis of individual studies is only part of the picture.

Let me conclude by emphasizing that the above discussion is not intended to be exhaustive. The design and analysis of observational studies is a huge topic, and I have merely tried to point to some areas that today are considered central to causal inference, yet were barely noted at all by a leader in the field in 1972. Much of the research we are doing today can be viewed as a response to the challenges laid down by Cochran in his thought-provoking essay mixing practical concerns with specific statistical techniques.


Observational Studies 1 (2015) 184-193 Submitted 7/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Ben B. Hansen [email protected]
Department of Statistics, University of Michigan
Ann Arbor, MI 48109, USA

Adam Sales [email protected]

Department of Educational Psychology, University of Texas at Austin

Austin, TX 78712, USA

1. Cochran’s advice

It is a tribute to Cochran's wisdom and foresight that the issues he identified as central in 1972 remain so for the methodologists of 2015, with newer subfield journals as well as older, mainstream statistical journals steadily presenting techniques with which to ease "Judgment about Causality." Among the newer journals, Observational Studies stands out for facilitating the study planning Cochran urged, presenting a forum for the sharing of study protocols, even those that do not require the prior endorsement of a funding panel or regulatory body. Cochran saw protocols as important for soliciting and addressing criticism before rather than after completion of the study, and for keeping studies on track once they are underway. This journal's Aims and Scope statement correctly adds that a study conducted according to a public protocol will generally be more transparent and persuasive.

Both Cochran and the Observational Studies founding document emphasized the utility of planning with regard to the choice and measurement of outcomes, as opposed to the choice and implementation of statistical methods. However, in-advance study protocols turn out to be particularly useful for structuring and organizing the many small choices underlying the statistical analysis of almost any comparative study. It is self-evident that a detailed and publicly posted plan may inoculate a study against suspicions of its having been engineered to confirm the preconceptions of the investigators. What is more subtle is that if the study's end verdict depends directly or indirectly on choices among statistical models (as nearly all studies do), then in a purely statistical sense the chosen model's integrity is protected by codifying the sequence of modeling choices, with each selection guided by a test with prespecified frequency characteristics.

This is particularly so if the stepwise construction of the model proceeds in a "forward" rather than a "backward" direction, starting simple before narrowing or complicating as needed, rather than starting with something elaborate and progressively seeking to simplify it. Then the frequency properties of the overall procedure follow in a particularly simple way from those of the constituent tests, by dint of an insight of R. Berger, the "stepwise intersection-union principle" (SIUP). The term appears in an unpublished manuscript, but Berger et al. (1988) gave an application and Rosenbaum (2008, Proposition 1) noted its relevance to observational studies. To our knowledge its special relevance to setting up an observational study has not previously been noted. The concept itself is very simple; the remainder of this comment restates it before discussing its application in observational study designs of increasing complexity.

© 2015 Ben B. Hansen and Adam C. Sales.

2. Stepwise intersection-union testing

Let the set A be totally ordered by ≺: for all a ≠ a′ in A, either a ≺ a′ or a′ ≺ a. Suppose that each a ∈ A corresponds to a null hypothesis, Ha, to be tested with size α ∈ (0, 1). For the set of hypotheses {Ha}a∈A assume:

1. There exists a level-α test of Ha for each a ∈ A.

2. Either Ha is false for all a ∈ A, or there exists an a ∈ A such that Ha is true and Hb is false for all b ≺ a; that is, Ha is the first true hypothesis.

Let F be a family of tests of the hypotheses {Ha}a∈A, and let a∗ = inf{a : Ha is not rejected by F}. Then let F∗ be a modified family of tests, also indexed by A, which rejects every Ha with a ≺ a∗ and rejects no others. Then the family F∗ strongly controls the family-wise error rate.

That is, the SIUP states:

Proposition 1 If for each a ∈ A the corresponding test from F incorrectly rejects true hypotheses with probability at most α, then with probability at least 1 − α, F∗ rejects only hypotheses that are false.

Proof If all hypotheses {Ha}a∈A are false, it is trivially impossible to reject a true hypothesis. Otherwise, following the second condition above, let Ha be the first true hypothesis. Let R be the event that a researcher rejects at least one true hypothesis. Under F∗, R is equivalent to rejecting Ha: since Ha is true, rejecting Ha implies R, and since F∗ can reject additional true hypotheses only after first rejecting Ha, R entails rejecting Ha. Therefore, Pr(R) = Pr(reject Ha) ≤ α.

The SIUP states that if a researcher pre-specifies a sequence of hypotheses and corresponding level-α tests, tests those hypotheses in order, and stops testing after the first non-rejected hypothesis, then the probability of incorrectly rejecting at least one true hypothesis is at most α. As Rosenbaum (2008) pointed out, inverting a test to form a confidence interval can be thought of as an application of the SIUP. Say a researcher seeks a one-sided 95% confidence interval for the average height of a population, using data from a random sample. Assume, for the sake of this example, that the distribution of heights in the population is normal, so a t-test yields exact p-values. Then, to approximate a 95% interval, she could specify a grid of possible average heights measured in inches, say a = 50, 51, 52, .... For each of these she would test the hypothesis Ha that the population average height is less than or equal to a, rejecting those hypotheses with p-values lower than α = 0.05. If H50 is rejected, she then tests H51; if H51 is rejected, she then tests H52, and so on. Eventually she will test a hypothesis that cannot be rejected, say H60. She may then state, with 1 − α = 95% confidence, that the average height of the target population is greater than 59 inches. Even though this procedure may involve many hypothesis tests, with no correction for multiplicity, the SIUP states that because the hypotheses are tested in order, the probability of rejecting a true hypothesis is only α = 0.05.
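The stepwise procedure in this example is easy to sketch in code. The version below substitutes a large-sample z approximation for the exact t-test (purely to keep the sketch dependency-free) and uses simulated height data; the function name and grid are illustrative.

```python
import random
from statistics import mean, stdev, NormalDist

def lower_confidence_bound(data, grid, alpha=0.05):
    """SIUP-style test inversion: test H_a (mu <= a) for each a in the grid,
    in increasing order, stopping at the first hypothesis that cannot be
    rejected.  A large-sample z approximation stands in for the exact
    t-test for simplicity."""
    n = len(data)
    se = stdev(data) / n ** 0.5
    xbar = mean(data)
    last_rejected = None
    for a in grid:
        z = (xbar - a) / se
        p = 1 - NormalDist().cdf(z)  # one-sided p-value for H_a: mu <= a
        if p < alpha:
            last_rejected = a        # H_a rejected; move on to the next a
        else:
            break                    # first non-rejection: stop testing
    return last_rejected             # conclude mu > last_rejected

random.seed(1)
heights = [random.gauss(64, 3) for _ in range(100)]  # simulated sample
lb = lower_confidence_bound(heights, grid=range(50, 80))
print(lb)
```

Despite the many tests performed, no multiplicity correction is applied; the tests run in order and stop at the first non-rejection, so the SIUP bounds the family-wise error rate at alpha.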

3. Regression discontinuity bandwidth selection

In a Regression Discontinuity Design (RDD), treatment Z is a deterministic function of a numeric “running variable” R: Z = 1 when R > c for some known constant c, and Z = 0 otherwise. For example, Bartlett III and McCrary (2015) (B&M) discuss the effect of a securities trading rule that applies only when the price of the security (R) meets or exceeds one dollar (c). Under this design, they estimate the effect of the rule on a number of market outcomes, such as quotes-per-second.

RDDs are uniquely persuasive because the treatment assignment mechanism is known, and hence there is only one confounder, R, and that confounder is measured. On the other hand, matching on R is impossible, since all subjects on a given side of c are either all treated or all untreated. For that reason, statistical modeling is necessary. With statistical models come modeling assumptions, and corresponding specification tests. For instance, B&M model expected outcomes as a function of price, E[Y |R = r], with a local-linear smoother. One test of this specification is a covariate placebo test: researchers typically have access to a set of covariates X, such as prior weekly returns in B&M, that could not have been affected by the treatment. A covariate placebo test estimates treatment effects for X; if the method discovers a treatment effect, something must be wrong.

Covariate placebo tests in RDDs involve two forms of multiple testing, both discussed in Sales and Hansen (2015, Section 2). First, say several covariates are available, so each subject i has a covariate vector of length p, Xi = (Xi1, . . . , Xip). As p increases, so does the probability of falsely rejecting a null hypothesis at a fixed level α for at least one of the covariates. This is not unique to RDDs; in a general context, Hansen and Bowers (2008) suggested combining covariate placebo tests into an omnibus test. In an RDD context, Lee and Lemieux (2010) suggested combining separate RDD analyses, one for each column of X, with the omnibus p-value from seemingly unrelated regressions. B&M used a version of Hotelling's T² statistic to combine p = 4 covariates.
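A simplified sketch of an omnibus covariate placebo test by permutation may make the idea concrete. This is our stand-in for the Hotelling-type randomization test B&M used: it combines per-covariate standardized mean differences into one statistic, ignoring correlations among covariates, and refers it to its permutation distribution.

```python
import random
import statistics

def omnibus_placebo_pvalue(treated, control, n_perm=1000, seed=0):
    """Permutation p-value for an omnibus covariate placebo test.

    `treated` and `control` are lists of covariate rows. The
    statistic is the sum of squared standardized mean differences
    across covariates (a simplified analogue of Hotelling's T^2)."""
    rng = random.Random(seed)
    p_cov = len(treated[0])
    n_t = len(treated)

    def imbalance(a_rows, b_rows):
        total = 0.0
        for k in range(p_cov):
            a = [r[k] for r in a_rows]
            b = [r[k] for r in b_rows]
            sd = statistics.stdev(a + b) or 1.0  # guard: constant covariate
            total += ((statistics.fmean(a) - statistics.fmean(b)) / sd) ** 2
        return total

    observed = imbalance(treated, control)
    pooled = list(treated) + list(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += imbalance(pooled[:n_t], pooled[n_t:]) >= observed
    return hits / n_perm
```

When the two groups differ sharply on the covariates, no permuted relabeling reproduces the observed imbalance and the p-value is near zero; balanced groups yield an unremarkable p-value.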

A second form of multiple testing emerges from a search for specification. For instance, to mitigate the bias caused by model misspecification in RDDs, it is common practice to estimate effects using only data in a window around the cutoff, Wb = {i : Ri ∈ (c − b, c + b)}, for some bandwidth b > 0. One method for choosing b, recommended in Sales and Hansen (2015), Cattaneo et al. (2015) and, in a slightly different context, Angrist and Rokkanen (2015), is based on sequential specification tests: for each candidate b, conduct a covariate specification test using the data in Wb. This is illustrated in Bartlett and McCrary's (2015) Figure 1, reproduced with permission as Figure 1 here, which displays p-values issuing from covariate placebo tests in windows Wb, b = 0.05, 0.06, . . . , 1.5. In principle, there are two ways to conduct this procedure: one is to start with the smallest possible bandwidth b, and expand Wb until a specification test rejects. In the B&M example, for α = 0.1, say, this would result in a very small bandwidth, b = 0.11. Alternatively, selecting forward, an analyst can begin with the largest possible b, and restrict the window


Appendix Figure 1. Randomization Inference p-value as a Function of Bandwidth

[Figure: permutation test p-value (vertical axis) plotted against bandwidth (horizontal axis, 0 to 1.5).]

Note: For bandwidths over a grid of {0.05, 0.06, . . . , 1.5}, figure shows randomization inference p-value associated with Hotelling's T² and the four covariates from Figures 1, 2, 3A, and 3B.

Appendix Figure 2. Placebo Distribution of t-Ratios for Delisting Risk

[Figure: histogram of placebo t-ratios.]

Note: Figure shows the histogram of t-ratios obtained by estimating placebo discontinuities in delisting risk from $0.50 to $4.00. The superimposed grey curve is the standard normal density function.

Figure 1: The stepwise intersection-union principle applied to regression discontinuity bandwidth selection. For a sequence of candidate bandwidths b, Bartlett III and McCrary (2015) conducted covariate placebo tests using data in Wb. Figure copied with permission from Bartlett III and McCrary (2015). In practice, the details of B&M's bandwidth selection procedure differed from what we present here.

Wb until a specification test fails to reject the specification. This would allow much larger bandwidths in the B&M dataset.

The SIUP recommends the latter procedure, which strongly controls familywise error rates. Let B ⊂ (0, ∞) be the set of candidate bandwidths, and let F be a family of tests, one for each b ∈ B, testing the null hypothesis Hb that the specification holds in Wb. Then define the modified family of tests F∗ that rejects each hypothesis Hb at level α if and only if F rejects each Hb′ with b′ ≥ b. The testing-in-order principle states that, with probability 1 − α, F∗ rejects no true Hb.
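The backward-selection scheme reduces to a short loop. In this sketch (our own), `placebo_pvalue` is an abstract callable standing in for whatever level-α covariate specification test the analyst has chosen for each window:

```python
def select_bandwidth(candidates, placebo_pvalue, alpha=0.1):
    """SIUP-style bandwidth selection: scan candidate bandwidths
    from the largest down, run the specification test in each
    window W_b, and return the first b whose test does NOT reject.

    `placebo_pvalue(b)` returns the placebo-test p-value for W_b."""
    for b in sorted(candidates, reverse=True):
        if placebo_pvalue(b) >= alpha:  # specification survives in W_b
            return b                    # largest plausible window
    return None                         # every candidate window rejected

# Hypothetical p-values mimicking the shape of Figure 1
# (rejections at large bandwidths, plausibility at small ones):
pvals = {0.11: 0.6, 0.5: 0.3, 1.0: 0.04, 1.5: 0.01}
select_bandwidth(pvals, pvals.get, alpha=0.1)  # returns 0.5
```

Because each window is tested only after every larger window has been rejected, this is exactly F∗ above, and the familywise level is α with no multiplicity penalty for the length of the candidate list.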

4. Pair matching, nearest neighbor matching or something in between?

Suppose that nt members of a treatment group are to be paired to members of a reservoir of nc controls on the basis of some summary of their baseline differences, such as their absolute differences along an estimated propensity score. The simplest and most intuitive matching structures emerge from pair matching without replacement, which generates precisely min(nc, nt) disjoint matched sets, each containing a single representative of each group.


At the outset of an investigation there can be no guarantee that pair matching will achieve a satisfactory reduction in baseline differences, even with the use of an optimal matching algorithm, and even with a large control reservoir. Difficult-to-match members of the treatment group may compete for the same controls; in this case, with-replacement pair matching may better remove bias due to differences between the groups at baseline, as might full matching to n∗c controls, n∗c < nc (Rosenbaum, 1991; Hansen and Klopfer, 2006). In the worst case, common support (Heckman et al., 1998) may fail, with a subset of the treatment group possessing combinations of characteristics that could never appear in the control group, even in indefinitely large samples. Then one might instead settle for matching only a subset of the treatment group. A suitable optimal matching algorithm can select a size-n∗t subset of the treatment group, n∗t < nt, along with a without-replacement match to n∗t controls, to achieve the minimum sum of paired matching discrepancies among all possible collections of n∗t nonoverlapping 1:1 matched pairs (Rosenbaum, 2012). While there can be no knowing in advance which option will be most appropriate, or whether either one will be necessary, a protocol has every reason to anticipate the eventuality that such a choice might be needed.

One approach to such decisions is to structure them with statistical hypothesis tests, the first pertinent null hypothesis being that treated subjects and their matched controls were drawn from the same population, so far as baseline characteristics are concerned. (Or, more precisely, that matched counterparts' propensity scores are always equal [Rosenbaum, 2010].) The part of a decision that's guided by a hypothesis test is relatively easy to codify in a protocol, through pre-specification of a hypothesis, a test statistic and an α level. The decision may also have additional components, particularly if the eventuality of rejection would force a choice among contrasting methods: after rejection of an ordinary pair match, one might decide between full matching to a fraction of the control reservoir and optimal pair matching of an optimally chosen subset of the treatment group. This decision may be based on some diagnostic of whether the problem appears to be competition for controls or a failure of overlap.

Whatever fallback procedure may be selected, it's natural to subject the matches it generates to a test of the same baseline equivalence hypothesis, proceeding to later stages of analysis only if that hypothesis can be sustained. This recycling of hypothesis tests leads naturally to misgivings; the repetition would seem to threaten the frequentist error rates associated with the test. But the misgivings are misplaced: this turns out to be stepwise intersection-union testing, because the later matches are constructed only after earlier matches were rejected. If at each step the decision to try another match or to stick with the current one is guided by a test with local level α, then the entire procedure has level α, in the familywise sense.

This is fortunate, because our pair matching alternatives each require the choice of a tuning parameter: for with-replacement matching, a positive integer n∗c < min(nc, nt); for matching of an optimally chosen subset, an integer n∗t < min(nc, nt). Were a multiple-testing penalty to accrue from increasing the number of specification tests, then determining how fine a grid to search would itself constitute a wrenching decision. With stepwise intersection-union testing, however, there's no reason not to start with n∗c = min(nc, nt) − 1, or with n∗t = nt − 1, and to reduce them in increments of 1 until precisely the point at which baseline equivalence is no longer rejected.
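The unit-step search over the tuning parameter might be sketched as follows (a hypothetical helper: `balance_pvalue(m)` stands in for re-running the matching algorithm at size m and testing baseline equivalence of the resulting match):

```python
def shrink_until_balanced(n_t, n_c, balance_pvalue, alpha=0.2):
    """Begin optimistically with subset size m = min(n_t, n_c) - 1
    and shrink in steps of 1 until the baseline-equivalence test
    stops rejecting. Because this is stepwise intersection-union
    testing, the unit-step grid incurs no multiplicity penalty.

    `balance_pvalue(m)` tests equivalence of the best m-pair match."""
    for m in range(min(n_t, n_c) - 1, 0, -1):
        if balance_pvalue(m) >= alpha:
            return m   # first size at which balance is plausible
    return None        # no size achieved balance: rethink the design
```

For instance, with nt = 50, nc = 100, and balance attainable once m ≤ 40, the search stops at m = 40; each smaller match is attempted only after the larger one was rejected, so the whole sequence has familywise level α.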


Indeed, this perspective suggests beginning the investigation of matching options at a still more optimistic position than pair matching. If nc ≫ nt and good matches are plentiful for all subjects, then 1:2 matched triples, or 1:3 matched quadruples, might be arranged with little increase in bias but significant benefit for the variances of eventual effect estimates. Full matching (Rosenbaum, 1991; Gu and Rosenbaum, 1993) modifies pair matching in both directions at once, permitting difficult-to-match treated subjects to share controls while pairing easier-to-match treatment group subjects to multiple controls, in varying configurations. Bias-variance tradeoffs arise also in full matching, however, when it is combined with structural restrictions to enhance the effective sample size (Hansen, 2004, 2011); hence there continues to be an important role for sequential intersection-union testing.

5. Falsification with planned fallbacks

The No Child Left Behind (NCLB) Act of 2002 aimed to have all K-12 public school pupils meet grade-level proficiency standards, in reading and math, and to do so by 2014. It coerced the 50 states to track their schools' performance and to sanction those found to be making inadequate progress. The year 2014 has since come and gone, with the law's lofty goal remaining to be achieved. Yet Dee and Jacob's (2011) authoritative study of it maintains that NCLB generated meaningful gains, particularly in lower-grade mathematics.

Many states had self-imposed NCLB-type accountability measures over the 10 years preceding the law, and the basis of Dee and Jacob's (D & J's) analysis is a comparison of trends in these states and in those states, the treatment group, for which NCLB appreciably strengthened school accountability. Between 1992 and 2002, fourth graders in both groups of states improved on the math portion of the National Assessment of Educational Progress (NAEP), with states independently adopting accountability measures during this period starting lower in 1992 but improving more rapidly, surpassing their counterparts by 2000. These states' improvements continued at much the same rate between 2003 and 2007, while scores in states affected by NCLB jumped sharply after 2002, then continued to increase at a rate paralleling that of the unaffected group. Much of the paper investigates and rules out attributions of the post-2002 jump to factors other than the NCLB law.

Some of the investigations closely resembled the covariate placebo tests discussed above in § 3. D & J fit regression models parallel to their eventual analysis of NCLB's associations with NAEP scores, but with dependent variables other than, and arguably unrelated to, measures of student achievement. Specifically, they considered whether, net of state fixed effects and time trends, the passage of NCLB covaried with: the states' NAEP participation rates; fractions of grade-level cohorts eligible to participate in the NAEP that had previously attended full-day kindergarten, or preschool; proportions of school-age children attending public school; public school cohort demographics; or aggregate economic indicators. The results were largely consistent with hypotheses of no difference (on these measures) between states NCLB did and did not immediately affect. However, a small NCLB “effect” on median household income was noted, along with differences in the student cohorts' race compositions that would be difficult to attribute to chance, even accounting for multiplicity.

Procedures of this type are sometimes labeled “falsification tests,” implying that rejection of the null would invalidate some meaningful premise of the research. But few working researchers would be so wasteful as to discard a study of a policy because its introduction


was found to coincide with changes in other potentially relevant variables; the obvious remedy is to fold adjustments for those variables into the outcome analysis. D & J do just this, if somewhat apologetically: since their validation strategy relied on models of the same form being used in the eventual outcome analysis, their primary results are based on regressions of state mean NAEP scores on precisely the independent variables used for the validation. That is, they didn't adjust their primary analyses in light of the mixed result of the validity check; instead, the specification adding demographic composition adjustments is presented as a supplement in an appendix. It was one “sensitivity analysis” alongside 13 others altering their analytic choices in various ways. D & J label this validation exercise a “robustness check,” which sits more comfortably with their approach than would “falsification test.”

To avoid rigid falsificationism is to show good judgment. In this instance and others, however, a modicum of Popperian orientation (Popper, 1963, ch. 1; Wong et al., 2015) can be practical as well as idealistic. Dee and Jacob's robustness checks lead them to present 15 pairs of estimates and standard errors for each of their 4 outcomes. The results are broadly in agreement, fortunately; but what if they had not been? Would it not have been better to somewhat narrow this wide field prior to outcome analysis?

A pre-arranged sequence of robustness checks, with designated adjustments to the analytic framework corresponding to the eventuality of any particular test's failure, would have streamlined the process. For example, at stage I, D & J might have used their preferred, more parsimonious model specification to estimate policy effects on each of the putatively unrelated variables, combining the estimates into a single test reflecting the multiplicity of the estimations. For instance, a Bonferroni adjustment with overall size set to αf = .25 entails rejecting this stage I specification if any of the 10 estimates differs from zero at local level αl = .025. The plan should prespecify how the basic model specification is to be changed if this stage I test ends in rejection. It might, for example, say that each of the variables for which the significance criterion was met is to be adjusted for in the specifications tested at later stages, and in the specification eventually used for outcome analysis.
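The stage-level Bonferroni arithmetic is simple enough to state in code (a sketch of the hypothetical protocol just described, not anything D & J actually ran):

```python
def stage_rejects(p_values, alpha_stage=0.25):
    """Bonferroni ensemble test for one falsification stage: with
    10 placebo estimates and overall stage size alpha_f = 0.25,
    the stage is rejected when any single p-value falls below the
    local level alpha_f / 10 = 0.025."""
    alpha_local = alpha_stage / len(p_values)
    return any(p < alpha_local for p in p_values)
```

For instance, nine unremarkable placebo p-values of 0.4 plus one of 0.01 would trigger the stage I fallback, while ten p-values of 0.4 would let the analysis proceed unmodified.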

Perhaps better yet, rather than attempting to address each shortcoming of the stage I specification at once, the fallback plan for the eventuality of stage I's ending in rejection might specify that adjustment be made for just one of the unruly variables, or even for a variable with which no problem was detected. In light of the SIUP, one can always plan additional fallback measures if minor adjustments to the model prove insufficient. In the D & J study, the fallback to a rejection at stage I might simply have been to add adjustments for the proportions of Black, Hispanic, and free-lunch-eligible students, whether or not any of these variables were responsible for the stage I null being rejected. Stage II would then be another level-α ensemble test, adding these adjustments and omitting the regressions with student demographics as dependent variables.

If falsification at stage I led to demographic adjustments, then a second falsification at stage II might, for example, recommend economic indicator adjustments. Or perhaps it would occasion more comprehensive adjustments for student demographics, with additional adjustment variables being added only after stage III (if the procedure gets to stage III). Importantly, at each stage the familywise error rate, accounting for previous stages as well as multiplicities within the current stage, is fully controlled because of the sequential intersection-union test, and the principle that later stages are reached only if each earlier stage ended in rejection.


While there is nothing to prevent the protocol from detailing eventualities in which every validation variable might be factored into the outcome analysis model as a covariate, it may be advantageous to hold some of them back, even in the worst-case scenario that the last putatively unrelated variable left standing still seems to be affected by the policy. D & J included as validation variables the cohort fractions that had attended kindergarten, and preschool, in order to address the possibility that states that did not begin accountability measures before being forced to by NCLB had allocated resources to early childhood education rather than school accountability, investments which may in turn have increased their students' later achievement scores relative to other states. Had partial associations of the NCLB law with these variables remained even after adjustment for the remaining validation variables, the research effort might be better served to step back and consider a more major revision to its identification strategy; that is, to declare the original identification strategy “falsified.” The overall probability of such falsification occurring falsely is capped at αf, again due to the SIUP.

6. Summary

The sequential intersection-union principle is well suited to aid the selection of quasi-experimental study designs. Observational studies require models, and hence modeling choices and assumptions, some of which can be tested. When performed without structure, such tests result in a hard-to-interpret proliferation of p-values, and in spuriously rejected model specifications. However, if the researcher maps out in advance a sequence of modeling choices and corresponding specification tests, the SIUP allows her to strongly control the type-I error rate of her model selection process. We have argued by way of example that this process works best when researchers begin with a specification that is likely to be overly broad, or a model that is too simple, and sequentially add restrictions or complexity as broad or simple specifications are rejected. Of course, researchers are free to use even narrower or more complex specifications than the first that is not rejected. The first hypothesis whose specification test fails to reject at α forms, in a way, the boundary of a region of plausible specifications.

It is important to note that the appropriate size for a specification test often exceeds the typical α = 0.05 threshold for outcome analysis. This is because, in the search for a plausible specification, the costs of mistakenly accepting an improper specification often greatly outweigh the costs of rejecting a valid specification. Rejecting a valid specification may merely lead the researcher to settle for a more restrictive, but still valid, specification, if by virtue of a type I error the sequence of tests terminates at a later position than it could have. If it occurs that the sequence of modifications anticipated in the protocol is exhausted, the investigator may still opt to proceed, presenting the findings with an acknowledgment of the possibility of design bias. Alternatively, she can go back to the drawing board, mapping out a new sequence of decisions culminating in an analysis with similar but more feasible goals.

Acknowledgments


B.H. and A.S. thank Jake Bowers, Justin McCrary and Dylan Small for helpful comments and encouragement (while retaining full responsibility for any errors or oversights). A.S. is partially supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305B1000012. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors.

References

Joshua D. Angrist and Miikka Rokkanen. Wanna get away? Regression discontinuity estimation of exam school effects away from the cutoff. Journal of the American Statistical Association, 2015. Advance online publication.

Robert P. Bartlett III and Justin McCrary. Dark trading at the midpoint: Pricing rules, order flow and price discovery. Technical report, UC Berkeley School of Law, 2015.

Roger L. Berger, Dennis D. Boos, and Frank M. Guess. Tests and confidence sets for comparing two mean residual life functions. Biometrics, pages 103–115, 1988.

Matias D. Cattaneo, Brigham R. Frandsen, and Rocio Titiunik. Randomization inference in the regression discontinuity design: An application to party advantages in the U.S. Senate. Journal of Causal Inference, 3(1):1–24, 2015.

Thomas S. Dee and Brian Jacob. The impact of No Child Left Behind on student achievement. Journal of Policy Analysis and Management, 30(3):418–446, 2011.

X.S. Gu and Paul R. Rosenbaum. Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4):405–420, 1993.

Ben B. Hansen. Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609–618, September 2004.

Ben B. Hansen. Propensity score matching to extract latent experiments from nonexperimental data: A case study. In Neil Dorans and Sandip Sinharay, editors, Looking Back: Proceedings of a Conference in Honor of Paul W. Holland, chapter 9, pages 149–181. Springer, 2011.

Ben B. Hansen and Jake Bowers. Covariate balance in simple, stratified and clustered comparative studies. Statistical Science, 23(2):219–236, 2008.

Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006. URL http://www.stat.lsa.umich.edu/%7Ebbh/hansenKlopfer2006.pdf.

James J. Heckman, Hidehiko Ichimura, and Petra E. Todd. Matching as an econometric evaluation estimator. Review of Economic Studies, 65(2):261–294, 1998.

D.S. Lee and T. Lemieux. Regression discontinuity designs in economics. Journal of Economic Literature, 48:281–355, 2010.


Karl Popper. Conjectures and Refutations. London: Routledge and Kegan Paul, 1963.

Paul R. Rosenbaum. A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, 53:597–610, 1991.

Paul R. Rosenbaum. Testing hypotheses in order. Biometrika, 2008.

Paul R. Rosenbaum. Design of Observational Studies. Springer-Verlag, 2010.

Paul R. Rosenbaum. Optimal matching of an optimally chosen subset in observational studies. Journal of Computational and Graphical Statistics, 21(1):57–71, 2012.

Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55, 1983.

Adam Sales and Ben B. Hansen. Limitless regression discontinuity. arXiv preprint arXiv:1403.5478, 2015.

Manyee Wong, Thomas D. Cook, and Peter M. Steiner. Adding design elements to improve time series designs: No Child Left Behind as an example of causal pattern-matching. Journal of Research on Educational Effectiveness, 8(2):245–279, 2015.


Observational Studies 1 (2015) 194-195 Submitted 6/15; Published 8/15

A good deal of humility: Cochran on observational studies

Miguel A. Hernan miguel [email protected]

Department of Epidemiology and Department of Biostatistics

Harvard T.H. Chan School of Public Health Boston, MA 02115, USA

Cochran's commentary shows that much of our conceptual framework for observational studies was already in place in the early 1970s. It also illuminates the technical progress that has been achieved since then, and identifies crucial methodologic challenges that still remain and that may remain forever.

In his commentary Cochran classifies observational studies into two groups. The first group, “analytical surveys” in Cochran's terminology, investigates the “relation between variables of interest” in a sample of a target population. In these studies the goal is prediction, not causality. The second group studies the causal effects of “agents, procedures, or experiences.” Cochran's commentary is almost exclusively concerned with this second group of observational studies, whose goal is causal inference about comparative effects.

Cochran views these observational studies as attempts to answer causal questions in settings “in which we would like to do an experiment but cannot” because it is impractical, unethical, or untimely. The agents being compared in observational studies “are like those the statistician would call treatments in a controlled experiment.” Cochran effectively argues that observational studies for comparative effects can be viewed as attempts to emulate randomized experiments. Cochran, the Harvard statistician, was not alone. Other prominent researchers, like Feinstein, the Yale epidemiologist, espoused similar views.

The concept of observational studies as an attempt to emulate randomized experiments was central for the next generation of statisticians and epidemiologists. Cochran argues that, as in randomized experiments, a prerequisite for causal inference from observational data is the statement of objectives, or the “description of the quantities to be estimated.” Rubin, Cochran's former student and future Chair of Statistics at Harvard, championed the use of counterfactual notation to unambiguously express these quantities as contrasts involving potential outcomes. A decade later Robins, also at Harvard, generalized counterfactual theory to time-varying treatments, a generalization that extends the concept of trial emulation to settings in which treatment strategies are sustained over time. These formalizations had profound effects on the field of causal inference from observational data.

A practical consequence of Cochran's viewpoint is that observational studies can benefit from the basic principles that guide the design and analysis of randomized experiments. His commentary reminds us that causal analyses of observational data, like those of randomized trials, need to specify the “sample and target populations.” When discussing the “comparative structure,” he reminds us that studies with a control group, whether they are observational or randomized, are generally preferred to those without a control group: “Single group studies are so weak logically that they should be avoided whenever possible.” And he identifies the defining problem of causal inference from observational data when

© 2015 Miguel Hernan.


A good deal of humility: Cochran on observational studies

two or more groups are compared: “How do we ensure that the groups are comparable?” This is the fundamental problem of confounding or, in Cochran's terminology, “bias due to extraneous variables.”

Cochran classifies methods for confounding adjustment into three classes: analysis of covariance, matching, and standardization. In the decades after Cochran's commentary, each of these methods was deeply transformed from simple techniques that could only be used under serious constraints (few covariates, linear models...) to powerful analytical tools with few restrictions. The analysis of covariance morphed into sophisticated outcome regression models that can easily handle complexities such as repeated measures, random effects, flexible dose-response functions, and failure time data. Matching was further developed for high-dimensional applications, which can incorporate propensity scores (co-developed by Rubin). Standardization in its two modern forms, the parametric g-formula and inverse probability weighting of marginal structural models (by Robins), can now be applied to complex longitudinal data with multiple time points and covariates. In addition to these three classes of methods for confounding adjustment, a fourth class emerged in the early 1990s: g-estimation (Robins again). In many settings, the above methods can be made doubly robust, another technical development that arose at the turn of the century. Finally, a whole suite of econometric methods, like instrumental variable estimation, are being progressively embraced by statisticians and epidemiologists interested in causal inference.

All these technical and conceptual developments, however, do not alter Cochran's take-home message: causal inference from observational data “demands a good deal of humility.” Fancy techniques for confounding adjustment will not protect us from bias if the confounders are unknown or if key variables of the analysis are mismeasured. Cochran reminds us that those who aspire to make causal inferences from observational data “must cultivate an ability to judge and weigh the relative importance of different factors whose effects cannot be measured at all accurately.” Because human judgment and subject-matter knowledge are fallible, causal inference from observational data is also fallible in ways that causal inference from ideal randomized experiments is not. A fascinating question is how much machine learning algorithms will be able to replace subject-matter knowledge in the years to come. For the time being, however, expert knowledge continues to be as paramount for the design and analysis of studies based on observational data as it was in Cochran's time.


Observational Studies 1 (2015) 196-199 Submitted 6/15; Published 8/15

Lessons we are still learning

Jennifer L. Hill [email protected]

Department of Humanities and Social Sciences

New York University

New York, NY 10003, USA

I thoroughly enjoyed re-reading Cochran’s commentary on observational studies. In particular, Cochran captured my feelings towards the topic of observational studies quite aptly in his final sentence, “observational studies are an interesting and challenging field which demands a good deal of humility, since we can claim only to be groping toward the truth.” Scientific inquiry would benefit from greater humility among researchers pursuing causal answers today.

In this comment I will briefly highlight some of the knowledge that has been gained about causal inference since that time (apologies in advance for not referencing all the scholars who have contributed; there are too many people to do it equitably). I will then focus on what I feel has been lost in the past decades and point out what I see to be important directions for the future.

1. Causal inference without randomized experiments

Causal inference typically requires satisfying both structural and parametric assumptions. Randomized experiments have the advantage of addressing both of these types of assumptions. The most problematic structural assumption, ignorability (also referred to as all confounders measured, selection on observables, the conditional independence assumption, exchangeability, etc.), is trivially satisfied in a pristine randomized experiment. Randomized experiments also ensure common support across treatment and control groups.

Randomized experiments have the added advantage that they do not require conditioning on confounders for unbiased estimation, thus eliminating dependence on parametric assumptions. Moreover, even if we use a model to estimate treatment effects in this setting (for instance with the goal of increasing efficiency), it is likely that our estimates will be robust to violations of the parametric assumptions of the model.

Of course in practice, noncompliance, missing data, measurement issues and other complications can still wreak havoc with treatment effect estimation even in the context of a randomized experiment. Even more problematic, randomized experiments are often not possible due to ethical, financial, or logistical reasons.

In the absence of a randomized experiment (or natural experiment), the structural assumptions required to identify a causal effect become more heroic, requiring appropriate conditioning on confounding covariates. Unfortunately, our dependence on parametric assumptions grows as well, since we now must appropriately estimate expectations conditional on the set of proposed confounding covariates.

© 2015 Jennifer Hill.


Much of the work in causal inference methodology in the decades since Cochran’s paper was first published has focused on relaxation of parametric assumptions. Cochran discusses matching, subclassification, and covariance adjustment. Since that time, however, the use of propensity scores, both for matching and for inverse probability weighting, has yielded improvements in our ability to estimate causal effects with less bias due to a reduced reliance on parametric assumptions. More recently, more sophisticated matching methods that capitalize on advances in computer science have increased our ability to find good balance targeted to particular balance criteria without undue investment of researcher time. Along another vein, it has been proposed that flexible modeling of the response surface using Bayesian nonparametrics, along with appropriate checks for overlap, may largely obviate the need for such preprocessing methods. All in all, our ability to condition on potential confounders without making extreme parametric assumptions has increased greatly in the past few decades.
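As a concrete sketch of this family of methods (mine, not Hill’s), here is 1:1 nearest-neighbour propensity score matching on simulated data. The data-generating model and all names are assumptions for illustration, and the true propensity score stands in for an estimated one to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Simulated data: confounder L drives both treatment A and outcome Y;
# the true treatment effect is 1.0.
L = rng.normal(size=n)
e = 1 / (1 + np.exp(-L))          # true propensity score P(A=1 | L)
A = rng.binomial(1, e)
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)

naive = Y[A == 1].mean() - Y[A == 0].mean()  # confounded by L

# 1:1 nearest-neighbour matching on the propensity score, with replacement:
# pair each treated unit with the control whose score is closest.
treated = np.flatnonzero(A == 1)
controls = np.flatnonzero(A == 0)
dist = np.abs(e[treated][:, None] - e[controls][None, :])
matched = controls[dist.argmin(axis=1)]

# Effect of treatment on the treated, from matched-pair differences.
att = (Y[treated] - Y[matched]).mean()
```

Matching on the scalar propensity score balances the confounder between groups, so `att` falls near the true effect while the naive contrast remains badly biased; none of this helps, of course, if the confounder is unmeasured.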

Other groundbreaking work has been done since Cochran’s paper around causal inference in longitudinal settings, greater understanding of the role of double robustness in estimation, mediation, and approaches to SUTVA violations. Moreover, our awareness of and ability to exploit rigorous quasi-experimental designs has grown far stronger.

Most of these advances have been facilitated by the creation of a shared formal language for describing both causal estimands and assumptions. Actually, two “languages” currently hold sway: the potential outcome framework and directed acyclic graphs, along with their extensions. I will not advocate for one or another of these frameworks but rather suggest that to the extent that researchers interested in pursuing causal estimation understand these languages, we can learn from each other more easily and advance science with greater efficiency and clarity.

2. A broader perspective

One of the most refreshing aspects of the Cochran article is his concern over many different parts of the research process, including: the framing of the research question (how often do we formally talk about that?), study design (too often statisticians are not involved!), measurement (we largely ignore... that’s for the psychometricians!), non-response (we sometimes address...), power (boring!), and generalizability (there is renewed interest in this topic). While all of these topics are still actively pursued as research, it is rare to see a study that does due diligence on all of these concerns. Perhaps this is due to the pressure in academia to be specialists, and the lack of reward for working in large research teams (in many, though not all, fields). Few academics rigorously address more than two or three of these concerns in their work. Moreover, these issues are typically addressed in separate papers rather than in work that grapples with the complexity of the relationships between them.

And in applied papers we typically pick our battles. How often do we apply a new method in an applied problem that might eliminate a small percentage of the bias, only to ignore measurement error or missing data issues that might be causing bias that would easily swamp this gain? If we address only some of the statistical issues, what can we expect to achieve overall?

I advocate that we could increase our impact by focusing on being more broad than deep. Rather than trying to act as surgeons working in a specialized field, we would reach


more people by thinking of ourselves as army surgeons at the front trying to extract some meaning from the messy and imperfect world of empirical science.

3. Beyond “correlation is not causation”

Understanding what situations allow for causal inference is non-trivial. This challenge is complicated by the fact that most people intuitively assign causal attribution without thinking too hard about it; it feels natural. Psychological experiments have demonstrated time and again, however, that humans are easily misled into drawing causal conclusions even when none are warranted. Yet understanding the answers to causal questions is critical for making progress in science, assessing policy implications, and even trying to better understand the implications of our actions in our individual lives.

While we have made advances in creating a broader understanding of causal issues, it is mostly captured by the phrase “correlation is not causation”. This is useful and has led to more broadly visible manifestations of this advice. For instance, science writer Gary Taubes wrote a helpful (if slightly imperfect) New York Times Magazine article about the limitations of observational studies (Taubes, 2007). As a sillier example, Tyler Vigen has a website http://www.tylervigen.com/spurious-correlations and a book (Vigen, 2015) that highlight amusing real world examples of correlations that quite clearly do not reflect causal relationships (for instance the 94% correlation between per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets).

This is helpful but we need to do much more. Too many scientific journals promote the practice of using the word association (rather than effect) as a panacea. Consequently, authors dutifully describe relationships between variables as associations (typically not even as conditional associations with a very particular conditioning set!) and then proceed in the discussion section of the paper to make recommendations for policy or practice (clearly interpreting their own results causally). That is hardly less damaging than just using the word cause throughout.

Instead, we need to encourage a culture of transparency about causal claims. After all, almost all interesting scientific questions are causal in nature. We’re all trying to do it. So let’s be honest about it, be clear about our assumptions, and explore how far we might stray from those assumptions and what the implications of those violations might be.

In terms of education, we need to move beyond the catch-all “correlation is not causation” admonition and help create a deeper understanding among the broader populace of what it means to make a causal claim and what kind of research strongly supports such claims. That would mean starting to teach about counterfactuals and a wider range of designs even in introductory statistics courses (I would like some of these concepts to be taught in grade school). It would also mean totally rethinking the way we typically teach regression.

4. Embrace your inner scold

Returning to Cochran’s final sentence: humility may be one of the least exhibited traits among academics today. We see a fair amount of hubris with regard to causality, both within the academy and in industries that rely heavily on “data science”. For instance, in a


now infamous article in Wired magazine, Chris Anderson wrote “There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show.” (Anderson, 2007). These types of overstatements may seem amusing and nonthreatening, but this type of hype about “big data” or new technologies could set us back decades.

Unfortunately, statisticians are often seen as scolds or Chicken Little. They are less apt to say “Yes you can!” and more apt to say “Oh my goodness you can’t do that!!!” This does not always make us popular either as cocktail party guests or collaborators. I don’t suggest that we entirely give up this role; it’s too important and nobody else seems prepared to take it on. However, if we embrace the role of scold we also need to point out what can be done to fix the problems. A focus on better designs (bring statisticians in to a study from day one!) and on sensitivity analyses that explore how far from the mark our estimates may be if our assumptions are violated is a start to this process.

References

Anderson, C. (2007). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine.

Taubes, G. (September 17, 2007). Do We Really Know What Makes Us Healthy? New York Times Magazine.

Vigen, T. (2015). Spurious Correlations. Hachette Books, New York.


Observational Studies 1 (2015) 200-204 Submitted 7/15; Published 8/15

Causal Thinking in the Twilight Zone

Judea Pearl [email protected]

Computer Science Department

University of California, Los Angeles

Los Angeles, CA 90095, USA

To students of causality, the writings of William Cochran provide an excellent and intriguing vantage point for studying how statistics, lacking the necessary mathematical tools, managed nevertheless to cope with increasing demands for policy evaluation from observational studies. Cochran met this challenge in the years 1955-1980, when statistics was preparing for a profound, albeit tortuous transition from a science of data to a science of data generating processes. The former, governed by Fisher’s dictum (Fisher, 1922) “the object of statistical methods is the reduction of data,” was served well by the traditional language of probability theory. The latter, on the other hand, seeking causal effects and policy recommendations, required an extension of probability theory to facilitate mathematical representations of generating processes.

No such representation was allowed into respectable statistical circles in the 1950-60s, when Cochran started looking into the social effects of public housing in Baltimore. While data showed improvement in health and well-being of families that moved from slums to public housing, it soon became obvious that the estimated improvement was strongly biased; Cochran reasoned that in order to become eligible for public housing the parent of a family may have to possess both initiative and some determination in dealing with the bureaucracy, thus making their families more likely to obtain better healthcare than non-eligible families.1

This led him to suggest “adjustment for covariates” for the explicit purpose of reducing this causal effect bias. While there were others before Cochran who applied adjustment for various purposes, Cochran is credited for introducing this technique to statistics (Salsburg, 2002), primarily because he popularized the method and taxonomized it by purpose of usage.

Unlike most of his contemporaries, who considered cause-effect relationships “ill-defined” outside the confines of Fisherian experiments, Cochran had no qualms about admitting that he sought such relationships in observational studies. He in fact went as far as defining the objective of an observational study: “to elucidate cause-and-effect relationships” in situations where controlled experiments are infeasible (Cochran, 1965). Indeed, in the paper before us, the word “cause” is used fairly freely, and other causal terms such as “effect,” “influence,” and “explanation” are almost as frequent as “regression” or “variance.” Still, Cochran was well aware that he was dealing with uncharted extra-statistical territory and cautioned us:

“Claim of proof of cause and effect must carry with it an explanation of the mechanism by which this effect is produced.”

1. Narrated in Cochran (1983, p. 24)

© 2015 Judea Pearl.


Today, when an analyst declares that a claim depends on “the mechanism by which an effect is produced,” we expect the analyst to specify what features of the mechanism would make the claim valid. For example, when Rosenbaum and Rubin (1983) claimed that propensity score methods may lead to unbiased estimates of causal effects, they conditioned the claim on a counterfactual assumption named “strong ignorability.” Such identifying assumptions, though cognitively formidable, provided a formal instrument for proving that some adjustments can yield unbiased estimates. Similarly, when a structural analyst claims that an “indirect effect” is estimable from observational studies, the claim must follow from assumptions about the structure of the underlying graph which, again, assure us of zero-bias estimates (see Pearl, 2014b).

Things were quite different in Cochran’s era; an appeal to “a mechanism,” like an appeal to “subject matter information,” stood literally for a confession of helplessness, since “mechanisms” and causal relationships had no representation in statistics. Structural equation models (SEM), the language used by economists to represent mechanisms, were deeply mistrusted by statisticians, who could not bring themselves to distinguish structural from regression models (Guttman, 1977; Freedman, 1987; Cliff, 1983; Wermuth, 1992; Holland, 1995).2 Counterfactuals, on the other hand, were still in the embryonic state that Neyman left them in: symbols with no model, no formal connection to realizable variables, and no inferential machinery with which to support or refute claims.3 Fisher’s celebrated advice, “make your theories elaborate,” was no help in this transitional era of pre-formal causation; there is no way to elaborate on a theory that cannot be represented in some language.

It is not surprising, therefore, that Cochran’s conclusions are quite gloomy:

“It is well known that evidence of a relationship between x and y is no proof that x causes y. The scientific philosophers to whom we might turn for expert guidance on this tricky issue are a disappointment. Almost unanimously and with evident delight they throw the idea of cause and effect overboard. As the statistical study of relationships has become more sophisticated, the statistician might admit, however, that his point of view is not very different, even if he wishes to retain the terms cause and effect.”

It is likewise not surprising that in the present article, Cochran does not offer readers any advice on which covariates are likely to reduce bias and which would amplify bias. Any such advice, as we know today, requires a picture of reality, which Cochran understood to be both needed and lacking in his time.4 On the positive side, though, he did have the vision to anticipate the emergence of a new type of research paradigm within statistics, a paradigm centered on mechanisms:

“A claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced. Except in cases where the mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized.”

2. This mistrust persists to some degree even in our century; see Berk (2004) or Sobel (2008).
3. These had to wait for Rubin (1974), Robins (1986), and the structural semantics of Balke and Pearl (1994).
4. To the best of my knowledge, the only adjustment-related advice in the entire statistics literature prior to 1980 was Cox’s warning that “the concomitant observations be quite unaffected by the treatments” (Cox, 1958, p. 48); it was the first defiance of an unwritten taboo against the use of data-generating models.

I believe the type of research we see flourishing today, based on a symbiosis between the graphical and counterfactual languages (Morgan and Winship, 2014; VanderWeele, 2015; Bareinboim and Pearl, 2015), would perfectly meet Cochran’s vision of a “completely different type of research.” This research differs fundamentally from the type of research conducted in Cochran’s generation. First, it commences with a commitment to understanding what reality must be like for a statistical routine to succeed and, second, it represents reality in terms of data-generating models (read: “mechanisms”), rather than probability distributions.

Encoded as nonparametric structural equations, these models have led to a fruitful symbiosis between graphs and counterfactuals and have unified the potential outcome framework of Neyman, Rubin, and Robins with the econometric tradition of Haavelmo, Marschak, and Heckman. In this symbiosis, counterfactuals (potential outcomes) emerge as natural byproducts of structural equations and serve to formally articulate research questions of interest. Graphical models, on the other hand, are used to encode scientific assumptions in a qualitative (i.e., nonparametric) and transparent language and to identify the logical ramifications of these assumptions, in particular their testable implications.5

A summary of results emerging from this symbiotic methodology is given in Pearl (2014a) and includes complete solutions6 to several long-standing problem areas, ranging from policy evaluation (Tian and Shpitser, 2010) and selection bias (Bareinboim, Tian and Pearl, 2014) to external validity (Bareinboim and Pearl, 2015; Pearl and Bareinboim, 2014) and missing data (Mohan, Pearl and Tian, 2013).

This development has not met with universal acceptance. Cox and Wermuth (2015), for example, are still reluctant to endorse the tools that this symbiosis has spawned, questioning in essence whether interventions can ever be mathematized.7 Others regard the symbiosis as unscientific (Rubin, 2008) or less than helpful (Imbens and Rubin, 2015, p. 22), insisting for example that investigators should handle ignorability judgments by unaided intuition.

I strongly believe, however, and I say it with a deep sense of responsibility, that future explorations of observational studies will rise above these inertial barriers and take full advantage of the tools that the graphical-counterfactual symbiosis now offers.

5. Note that the potential outcome framework alone does not meet these qualifications. Scientific assumptions must be converted to conditional ignorability statements (Rosenbaum and Rubin, 1983; Imbens and Rubin, 2015) which, being cognitively formidable, escape the scrutiny of plausibility judgment and impede the search for their testable implications.
6. By “complete solution” I mean a method of producing consistent estimates of (causal) parameters of interest, applicable to any hypothesized model, and accompanied by a proof that no other method can do better except by strengthening the model assumptions.
7. Unwittingly, the very calculus that they reject happens to resolve the problem that they pose (“indirect confounding”) in just four steps (Pearl, 2015a; Pearl, 2015b).

References

Bareinboim, E. and Pearl, J. (2015). Causal inference from big data: Theoretical foundations and the data-fusion problem. Tech. Rep. R-450, http://ftp.cs.ucla.edu/pub/statser/r450.pdf, Department of Computer Science, University of California, Los Angeles, CA. Forthcoming, Proceedings of the National Academy of Sciences.

Bareinboim, E., Tian, J. and Pearl, J. (2014). Recovering from selection bias in causal and statistical inference. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (C. E. Brodley and P. Stone, eds.). AAAI Press, Palo Alto, CA. Best Paper Award, http://ftp.cs.ucla.edu/pub/statser/r425.pdf.

Berk, R. (2004). Regression Analysis: A Constructive Critique. Sage, Thousand Oaks, CA.

Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115-126.

Cochran, W. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society (Series A), 128, 234-255.

Cochran, W. G. (1983). Planning and Analysis of Observational Studies. Wiley, New York.

Cox, D. (1958). The Planning of Experiments. John Wiley and Sons, NY.

Cox, D. and Wermuth, N. (2015). Design and interpretation of studies: Relevant concepts from the past and some extensions. Observational Studies, 1, 165-170.

Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309-368.

Freedman, D. (1987). As others see us: A case study in path analysis (with discussion). Journal of Educational Statistics, 12, 101-223.

Guttman, L. (1977). What is not what in statistics. The Statistician, 26, 81-107.

Holland, P. (1995). Some reflections on Freedman’s critiques. Foundations of Science, 1, 50-57. URL http://arxiv.org/pdf/1505.02452v1.pdf

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York.

Mohan, K., Pearl, J. and Tian, J. (2013). Graphical models for inference with missing data. In Advances in Neural Information Processing Systems 26 (C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Weinberger, eds.). Curran Associates, Inc., 1277-1285. http://papers.nips.cc/paper/4899-graphical-models-for-inference-with-missing-data.pdf

Morgan, S. L. and Winship, C. (2014). Counterfactuals and Causal Inference: Methods and Principles for Social Research (Analytical Methods for Social Research). 2nd ed. Cambridge University Press, New York.

Pearl, J. (2014a). The deductive approach to causal inference. Journal of Causal Inference, 2, 115-129.

Pearl, J. (2014b). Interpretation and identification of causal mediation. Psychological Methods, 19, 459-481.

Pearl, J. (2015a). Indirect Confounding and Causal Calculus (On three papers by Cox and Wermuth). Blog entry: http://www.mii.ucla.edu/causality/.

Pearl, J. (2015b). Indirect Confounding and Causal Calculus (On three papers by Cox and Wermuth). Tech. Rep. R-457, http://ftp.cs.ucla.edu/pub/statser/r457.pdf, Department of Computer Science, University of California, Los Angeles, CA.

Pearl, J. and Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29, 579-595.

Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect. Mathematical Modelling, 7, 1393-1512.


Rosenbaum, P. and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.

Rubin, D. (2008). Author’s reply (to Ian Shrier’s Letter to the Editor). Statistics in Medicine, 27, 2741-2742.

Salsburg, D. (2002). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. Henry Holt and Company, LLC, New York.

Sobel, M. (2008). Identification of causal parameters in randomized studies with mediating variables. Journal of Educational and Behavioral Statistics, 33, 230-231.

Tian, J. and Shpitser, I. (2010). On identifying causal effects. In Heuristics, Probability and Causality: A Tribute to Judea Pearl (R. Dechter, H. Geffner and J. Halpern, eds.). College Publications, UK, 415-444.

VanderWeele, T. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press, New York.

Wermuth, N. (1992). On block-recursive regression equations (with discussion). Brazilian Journal of Probability and Statistics, 6, 1-56.


Observational Studies 1 (2015) 205-211 Submitted 5/15; Published 8/15

Cochran’s Causal Crossword

Paul R. Rosenbaum [email protected]

Department of Statistics, Wharton School

University of Pennsylvania

Philadelphia, PA 19104-6340 US

Abstract

In discussing the “step from association to causation,” Cochran described a certain “multi-phasic attack” as “one of the most potent weapons in observational studies.” This method emphasized assembling several weak strands of evidence that become stronger through mutual support by virtue of intersecting in appropriate ways.

Keywords: Differential effects; elaborate theories; evidence factors; intersecting strands of evidence; quasi-experiments.

1. Introduction

Cochran’s “Observational Studies” is the less familiar, less readily accessible member of a pair of papers in which Cochran (1965, 1972) outlined the general structure of observational studies as a type of statistical investigation. Studies of this type were not new in 1965, nor was the attempt to think systematically about them (e.g., Campbell and Stanley 1963, Hill 1965), but Cochran was the first person to define the subject abstractly, that is, as a subject applicable to and informed by many academic disciplines. It is fitting that Dylan Small’s new interdisciplinary journal Observational Studies makes Cochran’s (1972) paper easily accessible once again.

Cochran’s papers have many interesting aspects, but I will focus on just one aspect that appears in different forms near the end of both papers. The final sections of Cochran (1965, 1972) are entitled “The step from association to causation” and “Judgment about causality.” These sections make several distinct and useful points, but I would like to focus on one of these, first in §2 by quoting what Cochran says, then in §3 by adding some interpretation.

2. What Cochran says

In discussing judgments about causality, in going beyond modelling empirical associations to reach causal conclusions, Cochran (1965, 1972) speaks again and again about “many different consequences,” “variety of consequences,” the “mechanism by which the effect is produced,” and “completely different type[s] of research.” How can many individually weak strands of evidence combine to become strong evidence by considering many different and varied consequences of a causal mechanism?

In one of the more often quoted remarks about causal inference, Cochran (1965, p. 252) wrote:

© 2015 Paul R. Rosenbaum.


First, as regards planning. About 20 years ago, when asked in a meeting what can be done in observational studies to clarify the step from association to causation, Sir Ronald Fisher replied: “Make your theories elaborate.” The reply puzzled me at first, since by Occam’s razor the advice usually given is to make theories as simple as is consistent with the known data. What Sir Ronald meant, as the subsequent discussion showed, was that when constructing a causal hypothesis one should envisage as many different consequences of its truth as possible, and plan observational studies to discover whether each of these consequences is found to hold.

After presenting a few illustrations of “many different consequences,” Cochran (1965, p. 252) continues:

Of course, the number and variety of consequences depends on the nature of the causal hypothesis, but imaginative thinking will sometimes reveal consequences that were not at first realized, and this multi-phasic attack is one of the most potent weapons in observational studies. In particular, the task of deciding between alternative hypotheses is made easier, since they may agree in predicting some consequences but will differ in others.

The second paper repeats similar points in different words and adds (Cochran 1972, p. 89):

A claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced. Except in cases where the mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized.

3. Analogies and methods

3.1 A limited analogy: The cable of many slender fibers

That individually weak strands of evidence may combine to form strong evidence was most famously suggested by Charles Sanders Peirce (1868):

[We should] trust rather to the multitude and variety of . . . arguments than to the conclusiveness of any one. [Our] reasoning should not form a chain which is no stronger than its weakest link, but a cable whose fibers may be ever so slender, provided they are sufficiently numerous and intimately connected.

Although memorable, the cable analogy distorts in a key respect: the many fibers of a cable play identical roles in forming a strong cable, but strands of evidence must exhibit variety, must speak to different consequences of a theory, because, as Cochran says, “deciding between alternative hypotheses is made easier, since they may agree in predicting some consequences but will differ in others.”

There is a better analogy.



Cochran’s Causal Crossword

3.2 A better analogy: a crossword puzzle

“Generalization naturally starts from the simplest, the most transparent particular case,” Georg Polya (1968, p. 60) wrote in discussing heuristic reasoning in mathematics. This simplest, most transparent case is not the important or general case that led us to be concerned with the topic in question; rather, it is the least cluttered, most immediately accessible and surveyable case, the example that perfectly exemplifies one issue in isolation from unneeded complications. Susan Haack (1995) suggests that the simplest, most transparent case of weak strands of evidence becoming stronger by virtue of mutual support is the case of a crossword puzzle. She writes (1995, pp. 81-82):

The model is not . . . how one determines the soundness or otherwise of a mathematical proof; it is, rather, how one determines the reasonableness or otherwise of entries in a crossword puzzle. . . . [T]he crossword model permits pervasive mutual support, rather than, like the model of a mathematical proof, encouraging an essentially one-directional conception. . . . How reasonable one’s confidence is that a certain entry in a crossword is correct depends on: how much support is given to this entry by the clue and any intersecting entries that have already been filled in; how reasonable, independently of the entry in question, one’s confidence is that those other already filled-in entries are correct; and how many of the intersecting entries have been filled in.

Haack is making two points here, the obvious one being that much of the conviction we develop that a crossword puzzle is filled in correctly comes not from the individual clues, but from entries intersecting in appropriate ways. When we first pencil in an entry based on a clue, we may doubt that it is correct, but later, when other entries meet it in an appropriate way, we may be nearly certain it is correct, even though the direct evidence from the clue remains unconvincing on its own. It is important to recognize that, beyond this obvious point, there is a second point. Haack’s second, subtle, point relates to her phrase above: “independently of the entry in question.” She is concerned to exhibit mutual support without vicious circularity. If I can deduce B from assuming A, and if I can deduce A from assuming B, then the assertion of A-and-B based on these two deductions would be a logical error — vicious circularity — because both deductions are perfectly compatible with both A and B being false. In the crossword, two entries may meet appropriately yet both be incorrect entries. Haack is saying that B provides support for A only to the extent that we are confident about B not employing the support provided by its intersection with A, and A provides support to B only to the extent that we are confident about A not employing the support provided by its intersection with B; but, with this caveat, A and B may each support the other. The appropriate intersection of A and B provides support for both A and B, but we may reflect upon the evidence for B that does not derive from its appropriate intersection with A, and Haack refers to this as the “independent security” of B. Haack (1995, pp. 84-86) continues:

The idea of independent security is easiest to grasp in the context of the crossword analogy . . . How reasonable one’s confidence is that 4 across is correct depends, inter alia, on how reasonable one’s confidence is that 2 down is correct. True, how reasonable one’s confidence is that 2 down is correct in turn depends, inter alia, on how reasonable one’s confidence is that 4 across is correct. But in judging how reasonable one’s confidence is that 4 across is correct one need not, for fear of getting into a vicious circle, ignore the support given it by 2 down; it is enough that one judge how reasonable one’s confidence is that 2 down is correct leaving aside the support given it by 4 across.

In a crossword puzzle, entries need not intersect to provide mutual support. If 2 down meets both 4 across and 6 across, then an entry in 6 across may support the entry in 2 down, and the entry in 2 down may support the entry in 4 across, so the entry in 6 across supports the entry in 4 across even though 6 across and 4 across do not intersect.
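This transitivity of support can be pictured with a tiny sketch (a hypothetical illustration, not part of Haack's account): treat entries as nodes, appropriate intersections as edges, and ask whether one entry can reach another through filled-in entries.

```python
# Toy illustration: entries are nodes, appropriate intersections are edges,
# and one entry lends indirect support to another if a path of filled-in
# entries connects them.
intersects = {
    "4across": {"2down"},
    "2down": {"4across", "6across"},
    "6across": {"2down"},
}

def supports(a, b, graph):
    """Graph search: can entry b be reached from entry a via intersections?"""
    seen, frontier = {a}, [a]
    while frontier:
        node = frontier.pop()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False
```

Here `supports("6across", "4across", intersects)` is `True` even though the two entries share no cell: the support flows through their common neighbor, 2 down.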

Consider the same ideas in a biological context. A high level of exposure to a toxin, such as cigarette smoke, is associated with a particular disease, say a particular cancer, in a human population, where experimentation with toxins is not ethical. A controlled randomized experiment shows that deliberate exposure to the toxin causes this same cancer in laboratory animals. A DNA-adduct is a chemical derived from the toxin that is covalently bound to DNA, perhaps disrupting or distorting DNA transcription. Exposure to the toxin is observed to be associated with DNA-adducts in lymphocytes in humans; e.g., Phillips (2002). A further controlled experiment shows that the toxin causes these DNA-adducts in cell cultures. A case-control study finds cases of this cancer have high levels of these DNA-adducts, whereas noncases (so-called “controls”) have low levels. A pathology study finds these DNA-adducts in tumors of the particular cancer under study, but not in most tumors from other types of cancer. Certain genes are involved in repairing DNA, for instance, in removing DNA adducts; see Goode et al. (2002). In human populations, a rare genetic variant has a reduced ability to repair DNA, in particular a reduced ability to remove adducts, and people with this variant exhibit a higher frequency of this cancer, even without high levels of exposure to the toxin. Each of these entries in the larger puzzle is quite tentative as an indicator that the toxin causes cancer in humans, and some of the entries do not directly intersect; e.g., the rare genetic variant is not directly linked to the toxin. Yet, the filled-in puzzle with its many intersections may be quite convincing.

Consider the same ideas in an economic context. Economic understanding depends, in part, on mathematical theories that derive predictions of economic actions from behavioral assumptions, and, in part, on empirical studies of how people or institutions do act in particular economic contexts. Taken in isolation, the assumptions in one mathematical theory may be quite speculative. Taken in isolation, the findings in one empirical study may be quite insecure, ambiguous and tentative. However, one mathematical theory may intersect with many empirical studies, and may also intersect with many other mathematical theories. Important economic facts – say, a high level of unemployment among recent high school graduates in a particular region at a particular time – may be compatible with several mutually incompatible economic theories – say, a theory that emphasizes rigidities in the labor market, or another that emphasizes the absence of a mechanism to provide adequate investments in human capital. But each theory intersects many particular facts, many empirical studies, and many other theories. Clarification comes, if it does, when an initially speculative theory has correctly met so many ambiguous facts or tentative empirical findings that the theory is no longer speculative, the facts no longer ambiguous, the findings no longer tentative.




3.3 What would it mean to take Cochran’s advice seriously?

If you took Cochran’s advice seriously, then you would ask of each new study what it contributes to the currently incomplete, partly penciled-in puzzle. You would welcome the completion of a new entry, even a small entry, compatible with the current tentative completion. You would also welcome a compelling new entry that challenged some current entries. You would welcome the suggestion that a particular entry is mistaken and constitutes a barrier to correct completion of the puzzle. A pencil and an eraser would be two tools of equal importance. You would agree with Sunstein (2005) in finding positive value in dissonance and dissent, and you would agree with Rescher (1995) in finding positive value in consensus only to the extent that this consensus has its origins in a rational appraisal of the evidence, whereupon the mere existence of consensus would have little importance beyond its important origins. You would tolerate inconsistency and uncertainty as necessary stepping stones on a path to greater consistency and greater certainty. You would welcome systematic attempts to take stock, to view the tentative completion as a whole, the appraisal of the gaps, the parts that appear secure, the other parts that are uncertain, needing work, in conflict, perhaps mistaken. You would welcome careful, patient, methodical scientific work. You would agree with Kafka (1917): “All human errors are impatience, the premature breaking off of what is methodical.”

To take Cochran’s advice seriously is to be skeptical of investigations that derive stout conclusions from slender evidence. It is to be skeptical of grand studies and grand conclusions, the suggestion that a single proposed entry settles a major issue, that consistent completion of the puzzle is inevitable given this one entry, and hence consistent completion is not needed and not worth the effort.

3.4 Methods

Several statistical methods cultivate varied strands of evidence within a single study, each strand being weak on its own, each strand vulnerable in a different way, but with the several strands gaining in strength if they agree in appropriate ways. Traditional methods are quasi-experimental designs; see Campbell and Stanley (1963), Shadish et al. (2002), West et al. (2008) and Wong et al. (2015). More recent methods include evidence factors (Rosenbaum 2010, 2015; Zhang et al. 2011), differential effects (Rosenbaum 2006, 2013, 2015; Zubizarreta et al. 2014) and attempts to integrate qualitative and quantitative causal inference (Rosenbaum and Silber 2001; Weller and Barnes 2014). Vanderweele (2015) expands on one of Cochran’s themes, the role of mechanisms as evidence. Yang et al. (2014) encourage the tolerance of statistical inferences that terminate in dissonance, that is, inferences that demonstrate unresolved inconsistencies among intersecting strands of evidence.

Acknowledgments

Supported by a grant from the Measurement, Methodology and Statistics Program of the US National Science Foundation.




References

Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally.

Cochran, W. G. (1965). The planning of observational studies of human populations (with Discussion). Journal of the Royal Statistical Society, Series A, 128:234-266.

Cochran, W. G. (1972). Observational studies. In Statistical Papers in Honor of George W. Snedecor. Ames: Iowa State University Press, 70-90.

Goode, E. L., Ulrich, C. M. and Potter, J. D. (2002). Polymorphisms in DNA repair genes and associations with cancer risk. Cancer Epidemiology Biomarkers and Prevention, 11:1513-1530.

Haack, S. (1995). Evidence and Inquiry. Oxford: Blackwell.

Hill, A. B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58:295-300.

Kafka, F. (1917). The Blue Octavo Notebooks. Cambridge: Exact Change.

Peirce, C. S. (1868). Some consequences of four incapacities. Journal of Speculative Philosophy, 2:140-157. Reprinted in: Talisse, R. B. and Aikin, S. F., eds. (2011), The Pragmatism Reader: From Peirce through the Present. Cambridge, MA: Harvard University Press.

Phillips, D. H. (2002). Smoking-related DNA and protein adducts in human tissues. Carcinogenesis, 23:1979-2004.

Polya, G. (1968). Mathematics and Plausible Reasoning, Volume II, 2nd edition. Princeton, NJ: Princeton University Press.

Rescher, N. (1995). Pluralism: Against the Demand for Consensus. New York: Oxford University Press.

Rosenbaum, P. R. and Silber, J. H. (2001). Matching and thick description in an observational study of mortality after surgery. Biostatistics, 2:217-232.

Rosenbaum, P. R. (2006). Differential effects and generic biases in observational studies. Biometrika, 93:573-586.

Rosenbaum, P. R. (2010). Evidence factors in observational studies. Biometrika, 97:333-345.

Rosenbaum, P. R. (2013). Using differential comparisons in observational studies. Chance, 26:18-25.

Rosenbaum, P. R. (2015). How to see more in observational studies: Some new quasi-experimental devices. Annual Review of Statistics and Its Application, 2:21-48.

Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.

Sunstein, C. R. (2005). Why Societies Need Dissent. Cambridge, MA: Harvard University Press.

Vanderweele, T. J. (2015). Explanation in Causal Inference. New York: Oxford University Press.

Weller, N. and Barnes, J. (2014). Finding Pathways: Mixed-method Research for Studying Causal Mechanisms. Cambridge: Cambridge University Press.

West, S. G., Duan, N., Pequegnat, W., Gaist, P., Des Jarlais, D. C., Holtgrave, D., Szapocznik, J., Fishbein, M., Rapkin, B., Clatts, M. and Mullen, P. D. (2008). Alternatives to the randomized controlled trial. American Journal of Public Health, 98:1359-1366.




Wong, M., Cook, T. D. and Steiner, P. M. (2015). Adding design elements to improve time series designs: No Child Left Behind as an example of causal pattern matching. Journal of Research on Educational Effectiveness, 8:245-279.

Yang, F., Zubizarreta, J. R., Small, D. S., Lorch, S. and Rosenbaum, P. R. (2014). Dissonant conclusions when testing the validity of an instrumental variable. American Statistician, 68:253-263.

Zhang, K., Small, D. S., Lorch, S., Srinivas, S. and Rosenbaum, P. R. (2011). Using split samples and evidence factors in an observational study of neonatal outcomes. Journal of the American Statistical Association, 106:511-524.

Zubizarreta, J. R., Small, D. S. and Rosenbaum, P. R. (2014). Isolation in the construction of natural experiments. Annals of Applied Statistics, 8:2096-2121.



Observational Studies 1 (2015) 212-216 Submitted 3/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Donald B. Rubin [email protected]

Department of Statistics

Harvard University

Cambridge, MA 02138, USA

First, I have to thank Dylan Small for inviting me to contribute comments on the reprinted Cochran (1972) target article. This is, in fact, the third time that I have read this nice, almost colloquial, chapter Bill Cochran wrote in honor of his coauthor, George Snedecor. The first time was about 1970 when I was finishing my PhD under Bill’s direction, and he circulated it as a pre-print. The second time was while I was writing my own chapter (Rubin 1984) summarizing Bill’s contributions to observational studies, which appeared in the volume edited by Rao and Sedransk; because my discussion of it is less than three pages and the original is in a relatively massive book, I include it here in the appendix.

But what would I say that’s different today than I said back in 1984? Obviously, I still am awed by Bill’s straightforward and no-nonsense style of communication – that hasn’t changed. But I think that today I would include some points not emphasized in 1984.

One aspect that I think I should have emphasized is the usefulness of the formal idea of an “assignment mechanism” to distinguish between randomized experiments and observational studies, and the formal concept of “potential outcomes” to define causal effects precisely. That is, under the “stable unit treatment value assumption” (SUTVA; Rubin 1980, 1986), Y_i(1) are the i-th unit’s values of outcomes under the active treatment and Y_i(0) are the i-th unit’s values of outcomes under the control treatment, where the causal effect of the active versus the control treatment for unit i is the comparison of these two potential outcomes. Also, the assignment mechanism is the probability distribution of the vector of treatment indicators, W, given the arrays of potential outcomes and covariates; this perspective was termed “Rubin’s Causal Model” by Holland (1986), but the potential outcomes notation had its roots in the work of Neyman (1923) in the context of randomized experiments; the term “assignment mechanism” and the use of potential outcomes to define causal effects in general originates with work in the 1970s (Rubin 1974, 1976, 1977, 1978).
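These definitions can be illustrated with a minimal simulation sketch (not from Rubin's text; the sample size, the constant effect of 3, and all variable names here are hypothetical): each unit carries two potential outcomes, a randomized assignment mechanism selects which one is observed, and the observed difference in group means estimates the average causal effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Potential outcomes under SUTVA: each unit i has two fixed values,
# Y_i(0) under the control treatment and Y_i(1) under the active treatment.
y0 = rng.normal(loc=10.0, scale=2.0, size=n)
y1 = y0 + 3.0                    # hypothetical unit-level causal effect of 3

# A randomized assignment mechanism: W is independent of (Y(0), Y(1)).
w = rng.binomial(1, 0.5, size=n)

# Only one potential outcome per unit is ever observed.
y_obs = np.where(w == 1, y1, y0)

# Under randomization, the difference in observed group means is unbiased
# for the average causal effect E[Y(1) - Y(0)], which is 3 by construction.
effect_hat = y_obs[w == 1].mean() - y_obs[w == 0].mean()
```

The sketch makes Holland's point tangible: because only `y_obs` is ever available, causal inference turns on what the assignment mechanism was, which is exactly what separates a randomized experiment from an observational study.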

In my 1984 discussion of Cochran (1972), I did not emphasize the clarity that this conceptualization brings to causal inference in observational studies. In hindsight, I think that omission was because that conceptualization seemed so obvious to me. It is only in recent years that I have been befuddled by all the confusion created by some writers who eschew this formulation with its attendant clarity. The recent text by Imbens and Rubin (2015) hopefully contributes to rectifying this situation, at least from my perspective.

Another noteworthy omission in my 1984 discussion is my recent focus on the importance of outcome-free design for observational studies (Rubin 2006, 2008). I am not alone in having this current emphasis; see, for example, Yue (2006) and D’Agostino and D’Agostino (2007). In hindsight I wished that I had emphasized this aspect, although with generous interpretation one could read that theme into parts of Cochran (1972), although I do not see much distinction made between things like propensity score design, which is blind to outcome data, and model-based adjustment methods, which require outcome data and so are subject to inapposite manipulation. I think that this desire to correct this omission arises from being repeatedly exposed to more problematic examples in recent years.

© 2015 Donald B. Rubin.

A final comment on Cochran (1972) concerns a statement from his concluding section “Judgment About Causality,” where he fairly blatantly reveals his disappointment in the answers provided by scientific philosophers. Often decisions about interventions must be made, even if based on limited empirical evidence, and we should help decision makers make sensible decisions under clearly stated assumptions, so that “consumers” of the conclusion about the effects of some intervention can honestly weigh the support for that conclusion.

In conclusion: A fine choice for a classic article to reprint.

Appendix

Cochran (1972) is an easy-to-read article that begins with a review of examples of observational studies and notes their increased numbers in recent years. He mentions, as examples of observational studies for treatment effects, the Cornell study of seat belts (Kihlberg and Narragon, 1964), studies of smoking and health (U.S. Surgeon General’s Advisory Committee Report, 1964), the halothane study (Bunker et al., 1969), and the Coleman report (Coleman et al., 1966). In many ways this article is an updated conversational summary of Cochran (1965). Here, even more than before, mature advice to the investigator is the focus of the paper.

First, the investigator should clearly state the objective and hypothesis of the study because such “...statements perform the valuable purpose of directing attention to the comparisons and measurements that will be needed.” Second, the investigator should carefully consider the type of study. Cochran seems to be more obviously negative about studies without control groups than he was earlier:

Single group studies are so weak logically that they should be avoided whenever possible...

Single group studies emphasize a characteristic that is prominent in the analysis of nearly all observational studies – the role of judgment. No matter how well-constructed a mathematical model we have, we cannot expect to plan a statistical analysis that will provide an almost automatic verdict. The statistician who intends to operate in this field must cultivate an ability to judge and weigh the relative importance of different factors whose effects cannot be measured at all accurately.

Comparison groups bring a great increase in analytical insight. The influence of external causes on both groups will be similar in many types of study and will cancel or be minimized when we compare treatment with no treatment. But such studies raise a new problem – How do we ensure that the groups are comparable?

Cochran still emphasizes the importance of effective measurement:




The question of what is considered relevant is particularly important in program evaluation. A program may succeed in its main objectives but have undesirable side effects. The verdict on the program may differ depending on whether or not these side effects are counted in the evaluations... Since we may have to manage with very imperfect measurements, statisticians need more technical research on the effects of errors of measurement.

But the primary statistical advice Cochran has to offer is on controlling bias: “The reduction of bias should, I think, be regarded as the primary objective – a highly precise estimate of the wrong quantity is not much help... In observational studies, three methods are in common use in an attempt to remove bias due to extraneous variables... Blocking, usually known as matching in observational studies... Standardization (adjustment by subclassification)... Covariance (with x’s quantitative) used just as in experiments.” The de-emphasis of efficiency relative to bias removal was evident when I began my thesis work under Cochran in 1968. The results of this thesis (Rubin, 1970), in large part published in Rubin (1973a, 1973b) and summarized in Cochran and Rubin (1973), led to some new advice on the tradeoff between matching and covariance: in order to guard against nonlinearities in the regression of y on x, the combination of regression and matching appears superior to either method alone.

Recent work (Rubin, 1979) extends this conclusion to more than one x. Specifically, based on Monte Carlo results with 24 moderately nonlinear but parallel response surfaces and 12 bivariate normal distributions of x, and using percentage reduction in expected squared bias of the treatment effect as the criterion, it appears quite clear that the combination of matched sampling and regression adjustment is superior to either matching or regression adjustment alone. Furthermore, Mahalanobis metric matching, which defines the distance between a treatment and control unit using the inverse of the sample covariance matrix of the matching variables and then sequentially finds the closest unmatched control unit for each experimental unit, was found superior to discriminant matching, which forms the best linear discriminant between the groups and sequentially finds the closest unmatched control unit for each experimental unit with respect to this discriminant. Moreover, regression adjustment that estimates the regression coefficient from the regression of the matched pair y differences on the matched pair x differences is superior to the standard covariance-adjusted estimator, which estimates the coefficients from the pooled within-group covariance matrix.
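As a toy sketch of the two procedures described above (this is not Rubin's Monte Carlo setup; the sample sizes, the effect of 2, and the linear outcome model are all hypothetical), greedy Mahalanobis metric matching followed by regression adjustment on matched-pair differences might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical covariates: 50 treated units and 200 controls, two x's,
# with the treated group shifted so that matching has work to do.
xt = rng.normal(0.5, 1.0, size=(50, 2))
xc = rng.normal(0.0, 1.0, size=(200, 2))

# Mahalanobis metric: inverse of the sample covariance matrix of the x's.
s_inv = np.linalg.inv(np.cov(np.vstack([xt, xc]).T))

# Sequentially match each treated unit to its closest unmatched control.
available = list(range(len(xc)))
pairs = []
for i, x in enumerate(xt):
    d = [(x - xc[j]) @ s_inv @ (x - xc[j]) for j in available]
    j = available.pop(int(np.argmin(d)))
    pairs.append((i, j))

# Hypothetical outcomes from a linear model with treatment effect 2.
tau = 2.0
yt = tau + xt @ np.array([1.0, 1.0]) + rng.normal(0, 0.5, size=50)
yc = xc @ np.array([1.0, 1.0]) + rng.normal(0, 0.5, size=200)

# Regression adjustment on matched-pair differences: regress the y
# differences on the x differences; the intercept estimates the effect.
dy = np.array([yt[i] - yc[j] for i, j in pairs])
dx = np.array([xt[i] - xc[j] for i, j in pairs])
design = np.column_stack([np.ones(len(dx)), dx])
beta, *_ = np.linalg.lstsq(design, dy, rcond=None)
tau_hat = beta[0]                # estimated treatment effect
```

Note that the matching step uses only the x's, never the outcomes, which is the sense in which such designs are outcome-free.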

Cochran goes on to offer advice on sample sizes, handling nonresponse, the use of a pilot study, the desire for a critical colleague in the planning stages, and the relation between sample and target populations.

It is not surprising that this article concludes with a short section called “Judgment About Causality.” Cochran’s views are somewhat more bluntly presented here than in previous writing:

It is well known that evidence of a relationship between x and y is no proof that x causes y. The scientific philosophers to whom we might turn for expert guidance on this tricky issue are a disappointment. Almost unanimously and with evident delight they throw the idea of cause and effect overboard... A claim of proof of cause and effect must carry with it an explanation of the mechanism by which the effect is produced. Except in cases where the mechanism is obvious and undisputed, this may require a completely different type of research from the observational study that is being summarized. Thus in most cases the study ends with an opinion or judgment about causality, not a claim of proof.

Cochran closes with the standard advice to make causal hypotheses complex:

Given a specific causal hypothesis that is under investigation, the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow.

References

Bunker, J. P. et al., eds. (1969). The National Halothane Study. Washington, D.C.: USGPO.

Cochran, W. G. (1965). The planning of observational studies. Journal of the Royal Statistical Society, Series A, 128, 234-266.

Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: a review. Sankhya, Series A, 35, 417-446.

Coleman, J. S. et al. (1966). Equality of Educational Opportunity. Washington, D.C.: USGPO.

D’Agostino, R. B., Jr. and D’Agostino, R. B., Sr. (2007). Estimating treatment effects using observational data. Journal of the American Medical Association, 297, 314-316.

Holland, P. W. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association, 81, 945-970.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press.

Neyman, J. (1923). On the application of probability theory to agricultural experiments: Essay on principles, Section 9. Translated in Statistical Science, 5, 465-480, 1990.

Rubin, D. B. (1970). The Use of Matched Sampling and Regression Adjustment in Observational Studies. Ph.D. thesis, Department of Statistics, Harvard University.

Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, 29, 159-183.

Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29, 184-203.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.

Rubin, D. B. (1976). Inference and missing data (with discussion). Biometrika, 63, 581-592.

Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2, 1-26.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34-68.

Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74, 318-328.




Rubin, D. B. (1980). Discussion of “Randomization analysis of experimental data: the Fisher randomization test” by Basu. Journal of the American Statistical Association, 75, 591-593.

Rubin, D. B. (1984). William G. Cochran’s contribution to the design, analysis and evaluation of observational studies. In W. G. Cochran’s Impact on Statistics, eds. P. S. R. S. Rao and J. Sedransk. New York: Wiley.

Rubin, D. B. (1986). Which ifs have causal answers? (Comment on “Statistics and causal inference” by P. W. Holland.) Journal of the American Statistical Association, 81, 961-962.

Rubin, D. B. (2008). For objective causal inference, design trumps analysis. Annals of Applied Statistics, 2, 808-840.

Rubin, D. B. and Waterman, R. L. (2006). Estimating causal effects of marketing interventions using propensity score methodology. Statistical Science, 21, 206-222.

United States Surgeon General’s Advisory Committee Report (1964). Smoking and Health. Washington, D.C.: U.S. Department of Health, Education and Welfare.



Observational Studies 1 (2015) 217-219 Submitted 3/15; Published 8/15

Comment on Cochran’s “Observational Studies”

Herbert L. Smith [email protected]

Department of Sociology and Population Studies Center

University of Pennsylvania

Philadelphia, PA 19104, USA

Cochran’s perspective on observational studies (inter alia, Cochran 1965, Cochran 1972) remains a touchstone for population studies and other fields in which observations are plentiful and experiments are implausible. It repays careful study and reflection. Cochran (1972, pp. 77-78) starts by distinguishing “analytical studies” from the “type of observational study [that] is narrower in scope[, where t]he investigator has in mind agents, procedures, or experiences that may produce certain causal effects...on people” [my emphasis]. The essay discusses difficulties that arise when treatments cannot be assigned at random, as in an experiment. It also discusses many other features of research design that have nothing to do with random assignment but much to do with our warrant for causal inference, including the need to be explicit regarding what the research is about (the specification of what is to be estimated and where effects are supposed to obtain), trade-offs in measurement between accuracy and bias, sample size, and non-response. Somewhere along the line “doing causation” became associated with establishing the internal validity of a study and/or form of estimation, and everything else got swept into external validity and measurement – important, to be sure, but somehow not as important as “causal analysis.” To re-read Cochran (1972) is to realize this should not have happened (Smith 2009b, 2013).

In the discussion of sample size, Cochran (1972, p. 87) distinguishes between “estimating a single overall effect of the treatment” and the alternative situation in which “the variation in effect with an x is of major interest.” In the latter circumstance, randomization does not buy one out of the need to specify accurately what is being estimated and for whom (Smith 1990). The definition of treatment effects at the unit level (Rubin 2005) has made clear how contingent all causal inference is on varying qualities of the population. Many of the xs are unobservable and are confounded with selection into treatment, such that familiar statistical estimators must be re-conceptualized as pertaining to causal effects only in certain slices of the population (Angrist, Imbens, and Rubin 1996; Morgan and Winship 2007, ch. 7).

The renewed emphasis on the population heterogeneity of treatment effects (Xie 2013) has paid great dividends in the social sciences. It was always suspected that “returns to education” may be exaggerated, given that those who obtain college educations may be precisely the type of people who would do well financially in the absence of a degree; notorious counterfactuals such as Bill Gates and Steve Jobs have only burnished the argument for adjusting maintained causal effects downward to account for positive selectivity into higher education. Although this may indeed be the case at the upper end of the social background and ability distribution, there is a concomitant negative selection such that those most likely

© 2015 Herbert L. Smith.



to benefit from a college education are those least likely to be able, or to be encouraged, to seek one (Hout 2012).

Cochran (1972, p. 87) wrote that “[i]n modern studies, standards with regard to the nonresponse problem seem to me to be lax.” The same is doubtless true in our post-modern era, but the problem has become so much worse that it is hard to disentangle lax standards in survey research from social change (technologies, household and family structures, trust in institutions, community adhesion, non-stop marketing, and just plain fatigue with what was once an innovation but is now a hovering presence). Moreover, a blind faith in what are essentially arbitrary standards for response rates can distract from serious studies of bias due to non-response (Asch, Jedrziewski, and Christakis 1997). We are learning interesting things. Early polling failures with respect to predicting presidential election outcomes have long been the basis for introducing students to problems of coverage and non-response bias alike (Freedman, Pisani, and Purves 1978, pp. 301-307). There are now so many polls, executed at varying degrees of departure–typically large–from the ideal of sampling frames isomorphic with target populations, respondent selection with known probabilities, and successful effort to deter non-response, and so many elections, that we know that non-response, though large in the extreme, is not a problem with respect to bias (Gelman and King 1993, pp. 423-427). Happily, all sorts of biases seem to cancel one another out, to the point where a mass of such polls, undifferentiated with respect to the classic criteria for such observational studies, provides a firm basis for some impressive forecasting of election results (Linzer 2013). Similar results have also been obtained via post-stratification adjustment of polls with no pretense to sampling (Wang et al. 2015).

But there are no guarantees that it will be ever thus–the 2015 British parliamentary election appears to be a contemporary example of results defying polls (if not modeling of poll data). Moreover, there are few if any similar domains of survey research in which the extent of observation is so great and the criterion for validity so decisive. Our hunches and received wisdom may not suffice. In a mail-out, mail-back survey of over 100,000 registered nurses, only one-third could be induced to respond. The potential bias was large, and a careful follow-up survey of over 1,000 non-respondents (of whom only nine percent proved to be “hard core”) confirmed differential non-response with respect to a number of demographic characteristics, including gender, race, national origin, and education. These are among the characteristics often available in sampling frames, hence the basis of post-stratification weighting schemes. Yet the follow-up survey of non-respondents also revealed that, in spite of the bias with respect to these demographic factors, there was no bias whatsoever regarding the core content items that had motivated the study (Smith 2009a). I would be hard-pressed to dissent from Cochran’s (1972, p. 89) conclusion that “observational studies are an interesting and challenging field which demands a great deal of humility, since we can claim only to be groping toward the truth.”

References

Angrist, J.D., Imbens, G.W., and Rubin, D.B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-455.

Asch, D.A., Jedrziewski, M.K. and Christakis, N.A. (1997). Response rates to mail surveys published in medical journals. Journal of Clinical Epidemiology, 50, 1129-1136.




Cochran, W.G. (1965). The planning of observational studies. Journal of the Royal Statistical Society, Series A, 128, 234-266.

Cochran, W.G. (1972). Observational studies. In Statistical Papers in Honor of George W. Snedecor, ed. T.A. Bancroft, 70-90. Iowa State University Press, Ames. Reprinted in Observational Studies, 1, 126-136.

Gelman, A. and King, G. (1993). Why are American presidential election campaign polls so variable when votes are so predictable? British Journal of Political Science, 23, 409-451.

Hout, M. (2012). Social and economic returns to college education in the United States. Annual Review of Sociology, 38, 379-400.

Linzer, D.A. (2013). Dynamic Bayesian forecasting of presidential elections in the states. Journal of the American Statistical Association, 108, 124-134.

Rubin, D.B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100, 322-331.

Smith, H.L. (1990). Specification problems in experimental and nonexperimental social research. Sociological Methodology, 20, 59-91.

Smith, H.L. (2009a). Double sample to minimize bias due to non-response in a mail survey. PSC Working Paper Series, December. http://repository.upenn.edu/psc_working_papers/20.

Smith, H.L. (2009b). Causation and Its Discontents. In Causal Analysis in Population Studies, ed. H. Engelhardt, H.-P. Kohler, and A. Fürnkranz-Prskawetz, 233-242. The Springer Series on Demographic Methods and Population Analysis 23. Springer Netherlands. http://link.springer.com/chapter/10.1007/978-1-4020-9967-0_10.

Smith, H.L. (2013). Research design: Toward a realistic role for causal analysis. In Handbook of Causal Analysis for Social Research, ed. S.L. Morgan, 45-73. Springer Netherlands. http://link.springer.com/chapter/10.1007/978-94-007-6094-3_4.

Wang, W., Rothschild, D., Goel, S. and Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31, 980-991.

Xie, Y. (2013). Population heterogeneity and causal inference. Proceedings of the National Academy of Sciences, 110, 6262-6268.



Observational Studies 1 (2015) 220-222 Submitted 3/15; Published 8/15

Comment on “Observational Studies” by Dr. W.G. Cochran (1972)

Mark J. van der Laan [email protected]

Division of Biostatistics, School of Public Health

University of California, Berkeley

Berkeley, CA 94720, USA

This article “Observational Studies” by Dr. W.G. Cochran (1972) is not only a must-read but also an excellent article motivating the need for the Observational Studies journal that is now publishing its first issue.

Reading this paper by Dr. Cochran, a few points came to my mind. To start with, his description of the growing importance of observational studies aiming to shed light on causal effects could have been written today with a few minor changes. For example, the importance of governments in driving the need for well-designed observational studies, as he describes, is as timely as ever, with the FDA running large-scale safety analysis programs to determine harmful side effects of FDA-approved drugs, and with the precision medicine initiatives at the national and state level to improve patient care while making it cost-effective, to name just a few. These initiatives demand setting up large observational studies, such as the Sentinel Project, and create collaborations among government, insurance companies, industry, and academics. By having a clear goal in mind, these initiatives provide unique opportunities for inspirational multidisciplinary bundles of scientific activities, all keeping an eye on the ball.

Dr. Cochran’s articles represent the writing of a wise scientist who cares greatly about the real world, the truth, and the impact of these observational studies on society. The ending of his article is telling of this care: “In conclusion, observational studies are an interesting and challenging field which demands a good deal of humility, since we can claim only to be groping toward the truth.”

Dr. Cochran provides a wonderful roadmap for planning observational studies, while providing crystal-clear examples to demonstrate the dangers that lurk in the background and can completely destroy the success of an observational study. His demonstrations are as timely as ever, and the mistakes he warns against are not only still commonplace in the current era of Big Data but are, in fact, more prevalent than ever.

Even though much of his wisdom appears to be commonplace by now in a typical epidemiology or biostatistics education, he puts his finger on crucial spots that are easily overlooked by most practitioners and data analysts involved in designing observational studies, including their statistical analysis plans.

In particular, Dr. Cochran clarifies the importance of defining the causal quantity of interest and addressing what data need to be measured, and how, in order to be able to learn the value of this causal quantity from the observed data. In addition, he stresses that once one has the question of interest in mind, the decisions regarding the design can

© 2015 Mark J. van der Laan.



now be targeted towards this question, resulting in what one might call targeted designs. An important part of our research has centered precisely on developing targeted group-sequential designs that not only optimally estimate the desired causal quantity, but also adapt the design (e.g., allocation of treatment) towards its goal based on what one learned from past data (e.g., Chambaz et al., 2015), without sacrificing the scientific rigor of a controlled design.

Most importantly, time after time Dr. Cochran expresses concern about the underlying assumptions under which the design, combined with the proposed analysis, actually answers the question of interest. For example, he cares about the validity of a regression model even when it involves only relatively few variables. I can only imagine how concerned he would be to see us run a high-dimensional linear regression analysis with hundreds or thousands of variables, as if these models approximate the truth. He emphasizes the need for supplementary studies, including pilot studies, to obtain a better understanding of measurement errors “so that we can work with realistic models”.

Clearly, Dr. Cochran cares about using statistical models that are realistic, and, even when the study is an observational study, he wants to control as much about the experiment as possible and incorporate this knowledge about the experiment into the statistical analysis plan for assessing the desired causal effect. Compare this with the common approach that throws one of our standard unrealistic parametric regression models at the data based on the format of the data, and manually tunes its choice to get the desired answer! In Dr. Cochran’s discussion of the effect of misspecified models on bias and variance, he concludes with “Reduction of bias should be the primary objective”. From this perspective, a recent opinion piece, “Why we need a statistical revolution” (van der Laan, 2015), is very much in the spirit of Dr. Cochran. Our research in the field of Targeted Learning (e.g., van der Laan and Rose, 2011) aims to respond to the enormous challenges our field is confronted with by providing methods that optimally learn a specific target quantity, incorporating only real knowledge about the data-generating experiment, and fully utilizing the state of the art in machine learning through Super Learning, while still providing formal statistical inference. I would have loved the opportunity to talk with Dr. Cochran to hear his view on these robust methods based on realistic models, aiming to minimize bias and maximize precision.

After having pointed out the lack of statistical methods that appropriately deal with nonresponse, Dr. Cochran concludes: “Fortunately, nonresponse can often be reduced materially by hard work during the study, but definite plans for this need to be made in advance.” He realizes that observational studies require as much careful planning as a controlled experiment, and that hard work can prevent missingness or provide a fundamental understanding of the missingness mechanism, so that statistical methods can correct accordingly for bias induced by informative censoring.

Beyond the great concern Dr. Cochran expresses regarding statistical bias due to violations of assumptions such as linearity of a regression model or measurement error assumptions, he pays particular attention to the non-testable assumptions required to draw causal conclusions. He states that these non-testable assumptions should not only be clearly presented and discussed in a separate section of a manuscript, but that substantial effort should be invested in additional analyses that can shed some light on these assumptions, and that possible explanations of the statistical findings should be provided.




Cochran strongly recommends the inclusion of a colleague who plays the role of devil’s advocate to hit the weak spots of the statistical analysis plan. He would have liked the use of negative controls (i.e., in the actual data set of interest, one assesses the effect of a variable on an outcome for which it is known that the causal effect equals zero) to showcase possible causal bias in the statistical method.

In light of the political maneuvering taking place in the current Big Data arena, the following remark by Dr. Cochran is very relevant and timely: “In numerous instances the choice seems to lie between doing a study much smaller and narrower in scope than desired but with high quality of measurement, or an extensive study with measurements of dubious quality. I am seldom sure what to advise.”

To conclude this commentary: Dr. Cochran is one of the very important contributors to our discipline, and his spirit is at least as important now as it was in his time. His spirit stands for the advancement of science in pursuit of truth; careful and targeted planning of observational studies; targeting of the statistical approach towards the scientific question of interest while integrating knowledge about the experiment; and hard work combined with humility when it comes to drawing conclusions. I am convinced that this new journal, Observational Studies, will stand for all this, greatly advance our scientific discipline, and thereby honor Dr. Cochran and the likes of him accordingly.

References

Chambaz, A. and van der Laan, M.J. (2014). Inference in targeted group-sequential covariate-adjusted randomized clinical trials. Scandinavian Journal of Statistics, 41, 104-140.

van der Laan, M.J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York.

van der Laan, M.J. (2015). Why we need a statistical revolution. http://www.stats.org/super-learning-and-the-revolution-in-knowledge/

Zheng, W., Chambaz, A. and van der Laan, M.J. (2015, forthcoming). Group sequential clinical trials with response-adaptive randomization. In Modern Adaptive Randomized Clinical Trials: Statistical, Operational, and Regulatory Aspects, ed. A. Sverdlov. Springer, New York. See also http://biostats.bepress.com/ucbbiostat/paper323



Observational Studies 1 (2015) 223-230 Submitted 3/15; Published 8/15

Observational Studies and Study Designs: An Epidemiologic Perspective

Tyler J. VanderWeele [email protected]

Department of Epidemiology and Department of Biostatistics

Harvard University

Boston, MA 02115, USA

Cochran’s Contribution to Observational Studies

Cochran’s remarkable paper, published over forty years ago, clearly articulates, or touches upon and anticipates, numerous principles of research for observational studies that are still of central relevance today. Amongst the principles and ideas covered in his paper are: the central importance of a control group (Section 3); issues of confounding and confounding adjustment (Sections 3 and 6); the relevance of baseline measurements of the outcome (Section 3); the issue of direction of causation, even in what today would be called an interrupted time-series design (Section 3); issues of measurement error, including differential measurement error (Section 4); the importance of examining effects on multiple outcomes in assessing the relevance of a treatment or exposure (Section 4) - an issue that is still very much neglected today; randomized experiments as a template for thinking about observational studies (Section 5); matching, blocking, standardization, and covariance adjustment (Section 6); issues of generalizability and external validity (Section 7.5); and of course the challenges, and relevance, of causal inference (Section 7.6).

The discussion in Cochran’s paper moreover anticipates numerous important ideas that were to be developed fully only much later. His discussion of matching and standardization and the difficulties encountered with multiple covariates (Section 6) anticipates the need for the use of propensity scores (Rosenbaum and Rubin, 1983, 1984) in the handling of these issues. His discussion of the role of blocking and covariance adjustment in observational studies to both increase precision and control bias, along with the position that in observational studies reduction of bias ought to be given priority (Section 5), partially anticipates Rosenbaum’s notion of design sensitivity being more important than efficiency in observational studies (Rosenbaum, 2010). His discussion of causation as a change in the treatment variable leading to a change in the outcome (rather than the two merely being associated) anticipates the resurrection of Neyman’s potential outcomes notation (Neyman, 1923) by Rubin (1974, 1978) for use in observational studies just a couple of years after Cochran’s paper was written. His mention of “repeated before and repeated after” studies (Section 3), though not developed, anticipates in some sense Robins’ development of concepts and methodology for the causal effects of time-varying exposures (Robins, 1986, 1997; Robins et al., 2000). Cochran’s point that assessing the explanation of, or mechanisms for, an effect may require a completely different type of research from that assessing the overall effect (Section 7.6)

© 2015 Tyler J. VanderWeele.



anticipates in some sense the literature on mediation and the challenges therein (Robins and Greenland, 1992; Pearl, 2001; VanderWeele, 2015). His encouraging the use of outcomes for which an exposure has no effect on the outcome (Section 7.6) anticipated the more formal development of the use of “negative controls” for reasoning about causality (Lipsitch et al., 2010; Tchetgen Tchetgen, 2014; Imbens and Rubin, 2015). In many ways, then, Cochran’s paper provided a roadmap for a great deal of research on observational studies and causal inference that was to take place in the decades to follow.

Observational Studies and Study Design

Cochran’s discussion of observational studies is perhaps best understood as a statistician’s perspective. Many of the issues he discussed are still at the forefront of statistical research on observational studies today. Within observational studies in epidemiology, however, while all of Cochran’s principles and discussion are still relevant, the scope of such studies is somewhat broader. Cochran distinguishes two broad types of observational studies: (i) analytic surveys and (ii) studies of the effects of treatments. His focus, he notes, is on the latter. Implicit throughout nearly all of Cochran’s discussion of this second type of study is that the design of the study in question is what we might today call a cohort study. However, many other types of observational study designs are arguably also relevant in assessing the effects of treatments, and many of these designs have arisen within epidemiology. Indeed, as discussed below, one of epidemiology’s central methodological contributions has been to study design and to the development of new types of study designs. Several of these new designs are described below. Before doing so, however, it will be good to consider more precisely what we mean by terms such as “study” and “study design.”

Cochran describes an observational study as a study in which “the investigator is restricted to taking selected observations or measurements on the process under study [and does not] interfere in the process in the way that one does in a controlled laboratory type of experiment.” Both (i) analytic surveys and (ii) studies of the effects of treatments fall within this description. His distinction between the two might, however, be understood either as one of the data available (e.g. all measurements simultaneous, versus some before and some after an exposure or event) or as one of purpose (descriptive versus causal). These two dimensions do not entirely coincide. In a survey in which all variables are measured on a single occasion, it may be possible to retrospectively assess certain exposures or treatments, which can then be used in causal/etiologic research. Likewise, even with data available in a longitudinal cohort study, a simple description of the sample characteristics at baseline may, in some instances, be of considerable interest. We should thus distinguish between the data itself and how it was collected (study as a data resource) versus how the data is to be used (study as a particular empirical inquiry).

The term “study” might thus refer either to (i) “the process of systematically collecting data and the resulting data resource” or to (ii) “the use of data to address a specific inquiry or set of related inquiries” (often resulting in a paper or report). Cochran’s use of “study” seems usually, though perhaps not always, to be in the latter sense. Even when he refers to specific studies by naming the data resource (e.g. the “National Halothane Study”), this is often in the context of a specific empirical inquiry. And indeed many studies (processes




of data collection and the resulting data resources) are focused upon addressing a single empirical question or set of closely related questions. This is especially the case with randomized trials, in which the effectiveness of a specific intervention on a specific population is assessed as rigorously as possible. But, as will be discussed further below, a single study (i.e. a single data resource) can often be used for multiple studies in the second sense, i.e. for multiple empirical inquiries on different topics. When referring to a “study” it is thus good to distinguish between a “study” as the collecting of data resulting in a data resource and a “study” as the steps taken to address a specific empirical inquiry or set of related inquiries.

Likewise, “study design” might be understood either as (i) the design of the data collection process for a specific data resource, or (ii) the design of the analytic process and the use of that data resource to address a specific empirical question. Once again, most of Cochran’s discussion of the design of observational studies concerns study design in this latter sense. When the study (data resource) is put together for a focused, specific inquiry, these two aspects of study design are integrally intertwined. But in other cases the study (data resource) is put together for multiple purposes and can be used to address multiple very different inquiries; in these cases the two aspects of study design may often be quite distinct. Even after a study (data collection resulting in a data resource) has been conducted and the data is available, considerable work may be required in determining how that data resource can be used to address a specific empirical question.

The term “study design” also has at least one further, and perhaps even more common, use: that of a template. A specific “study design” might be understood as a template in which the actions undertaken to generate and collect the data fit a certain pattern or mold. A randomized trial might be viewed as one type of template, a closed cohort study as another. Each of these is a type of study design. Thus, with the term “study design”, we have at least three uses: (i) the details of the design of the data collection process for a specific data resource (e.g. the design of the “National Halothane Study” or the “Framingham Heart Study”); (ii) the design of the analytic process used, including the data selected and the analyses conducted, to answer a specific empirical inquiry; and (iii) a template for obtaining data with a particular set of shared characteristics with respect to the data-generating mechanism. This third use of “study design” merits some further discussion, and in the following section several new types of study designs that have arisen in epidemiology are described.

Study Designs in Epidemiology

As indicated above, many different types of study designs (templates for the collection of data) can be used to assess causal/etiologic questions. Randomized trials and cohort-type designs are perhaps the most common, but many other types of designs are available. In Section 4 of his paper, Cochran makes brief mention that cancer patients may be better informed of instances of cancer among blood relatives than are the controls who are free of cancer. Cochran is perhaps here touching upon the case-control design (though this is not entirely clear, and a cohort design with case status ascertained at the end of follow-up, and perhaps certain covariate data ascertained retrospectively at that time, might also be in view). In any case, such case-control designs collect exposure/treatment data,




along with covariate data, on a set of individuals with the outcome, and also on a set of controls (selected either from the non-cases or from the underlying population). Intuitively, if the exposure has no effect on the outcome, the prevalence of the exposure should be no higher among the cases than among the controls. More specifically, in such case-control designs, if adequate control has been made for confounding, then it is possible to assess the effect of the exposure on the outcome at least on the odds ratio or rate ratio scale; one can obtain the same odds ratio or rate ratio from case-control data as one would have obtained in the underlying cohort (Miettinen, 1976; Prentice and Pyke, 1979). If further information is available on the prevalence of the outcome or the exposure, further progress can be made in estimating effects on the difference scale as well (Rothman et al., 2008). Such case-control studies are subject to additional biases, many of which have to do with the selection of controls, but case-control designs can be very efficient ways to study cause-effect relationships, especially when outcomes are rare. Such designs developed out of epidemiology, but there is no reason in principle why they could not be used in the social sciences as well, especially for rare outcomes for which larger sample sizes are needed when using cohort designs. Arguably their more extensive use in epidemiology is due to historic, rather than substantive, considerations. See Keogh and Cox (2014) for a recent comprehensive overview of case-control designs.
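Why the odds ratio survives case-control sampling can be seen in a small numerical sketch (the counts below are hypothetical, chosen only for illustration): sampling cases and non-cases at different fractions multiplies the two rows of the 2x2 table by separate constants, and those constants cancel in the cross-product ratio.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Cross-product odds ratio for a 2x2 exposure-by-outcome table."""
    return (exp_cases * unexp_controls) / (unexp_cases * exp_controls)

# Full (hypothetical) cohort: counts of cases and non-cases by exposure.
exp_cases, unexp_cases = 200, 900
exp_noncases, unexp_noncases = 9800, 89100

cohort_or = odds_ratio(exp_cases, unexp_cases, exp_noncases, unexp_noncases)

# Case-control sampling: keep all cases but only 1% of the non-cases.
f = 0.01
cc_or = odds_ratio(exp_cases, unexp_cases,
                   exp_noncases * f, unexp_noncases * f)

# The sampling fraction f cancels in the cross-product, so the two agree.
assert abs(cohort_or - cc_or) < 1e-9
```

In real data the expected counts are replaced by observed counts, so a case-control estimate matches the cohort odds ratio only up to sampling variability; the point of the sketch is that the sampling design itself introduces no bias on the odds ratio scale, which is the intuition behind the results cited above.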

Epidemiologic study designs have also been extended in a number of additional ways, including varieties of case-control designs such as nested and matched case-control designs (cf. Rothman et al., 2008) and two-stage sampling case-control designs (White, 1982; Breslow and Cain, 1988), variants of the case-control design such as the case-cohort design (Prentice, 1986; Barlow, 1999), and two-stage designs involving both ecologic and case-control data (Haneuse and Wakefield, 2008). These other study designs could likewise have applicability outside of epidemiology. Even more creative, specialized, and in some ways surprising designs have arisen within epidemiology to address questions of cause-effect relations. So-called case-crossover designs collect data only on subjects who have the outcome in question and compare exposure information in a window of time substantially prior to the outcome (e.g. a week before the outcome) to exposure immediately preceding the outcome, to examine whether particular exposures triggered the outcome in question (Maclure, 1991; Maclure and Mittleman, 2000). Such designs have been used, for example, to study triggers for myocardial infarction (Mittleman et al., 1993) and the relationship between cell-phone use and car accidents (Redelmeier and Tibshirani, 1997; McEvoy et al., 2005). They are useful for etiologic research but address a different type of causal question (e.g. does the exposure trigger the outcome in a short window of time following the exposure) than is addressed in cohort or case-control studies, which focus on longer-term effects (cf. Maclure, 2007). Such distinctions make clear the importance of specifying the object of investigation in etiologic research (Miettinen and Karp, 2012). It is not sufficient to simply specify the exposure and outcome being studied; one must also specify the measure (prevalence, incidence; difference, ratio, rate, etc.) and the time-frame.

The case-crossover design described above is subject to biases that can arise from temporal trends in the exposure; various other types of designs, such as the self-controlled case-series design (Farrington, 1995; Whitaker et al., 2006), the case-time-control design (Suissa, 1995), and the case-case-time-control design (Wang et al., 2011), have been developed to help address some of these biases. This case-only type of approach has also been modified to study the effects of spatially relevant features or exposures, e.g. proximity to power lines that are only on one side of a street, by using data only on cases, and examining how distances to the feature would be modified if mirror images were taken with respect to some other spatial characteristic, e.g. the center line of a street. Such designs are sometimes called case-specular study designs (Zaffanella et al., 1998). Yet another case-only study design variant arises from the observation, essentially a simple consequence of Bayes' theorem, that if a genetic and an environmental exposure are independent in distribution in a population, then the odds ratio relating the genetic and environmental exposures among the cases gives the measure of multiplicative interaction in the effects of the two exposures on the outcome in the population (Piegorsch et al., 1994; Schmidt and Schaid, 1999).
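The case-only argument can be made explicit. A sketch of the standard derivation (textbook notation, not drawn from the commentary itself), writing $R_{ge} = P(D{=}1 \mid G{=}g, E{=}e)$ and assuming $G$ and $E$ are independent in the source population:

```latex
% Case-only estimation of multiplicative interaction,
% assuming G independent of E in the source population.
\begin{aligned}
\mathrm{OR}_{GE \mid D=1}
  &= \frac{P(G{=}1,E{=}1 \mid D{=}1)\,P(G{=}0,E{=}0 \mid D{=}1)}
          {P(G{=}1,E{=}0 \mid D{=}1)\,P(G{=}0,E{=}1 \mid D{=}1)} \\
  &= \frac{R_{11}\,R_{00}}{R_{10}\,R_{01}},
  \qquad R_{ge} = P(D{=}1 \mid G{=}g,\, E{=}e),
\end{aligned}
```

since by Bayes' theorem $P(G{=}g, E{=}e \mid D{=}1) \propto R_{ge}\,P(G{=}g)\,P(E{=}e)$ under independence, and the marginal genotype and exposure probabilities cancel in the cross-product. The right-hand side is exactly the multiplicative interaction of $G$ and $E$ on the risk-ratio scale, obtained from cases alone, without a control group.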

As with case-control designs, each of these other designs makes additional assumptions above and beyond those of a cohort study, and is thus subject to a wider range of biases (cf. Greenland, 1996; Rothman et al., 2008). But each of these designs also has the potential to give substantial insight in etiologic research, even though they are not traditional follow-up cohort studies and even though they sometimes do not even have a control group as traditionally conceived. In some instances, a study (data resource) produced for the purpose of one specific empirical inquiry can be useful in addressing an entirely different empirical study question by making use of one of the aforementioned, rather clever, study designs (templates). It is arguably the decoupling of the notion of study as data resource from that of study as template, and as the use that is made of the data, that allows for such clever secondary uses of data, and that perhaps led to the development of these new study designs within epidemiology to begin with. Such decoupling has arguably also been important in trying to provide some unification of the different types of epidemiologic study designs (Miettinen and Karp, 2012). Once again, even though these designs were developed within epidemiology, there is generally nothing now restricting their use in other fields. Other disciplines would likely benefit from the use of these alternative observational study designs as well.

Large Multi-Purpose Cohort Studies

Even with cohort designs, the practice of epidemiology has arguably allowed for new insights. Cochran discusses the need for clear, specific, focused hypotheses and argues that a study should be designed in light of these (Section 2). It is hard to argue against such advice. However, if a study is conceived of as the creation of a data resource (e.g. the Nurses' Health Study, the Framingham Heart Study, etc.), then such a study can in fact be used to address multiple hypotheses. Hundreds of individual studies in the form of published empirical inquiries have come out of the Nurses' Health Study on topics as diverse as nutrition, cancer, depression, social support, etc. In such a context, the principles of study design (understood as the design of the process by which the data resource is created and the resulting content of that resource) are arguably somewhat different from those for a study with a single narrow hypothesis. In the creation of a study (data collection resulting in a data resource) that is to be used for multiple research questions, careful thought must be given to the possibility of many hypotheses and to what confounding variables are relevant not just for one exposure, or one outcome, but for a whole host of exposure-outcome relationships; often extensive questionnaires and detailed follow-up to avoid non-response are needed to ensure adequate data.


The considerations when a single data resource is to be used for numerous empirical inquiries are challenging; but the undertaking of such studies (i.e. the creation of such data resources) holds tremendous promise for research. Cochran, in his paper, discusses the difficult trade-off between conducting “a study much smaller...than desired but with high quality measurements, or an extensive study with measurements of dubious quality” (Section 4). In one of the rare moments of uncertainty in his paper, Cochran concludes: “I am seldom sure what to advise.” However, with large cohort studies that are to be used for multiple purposes, and which, often as a consequence, have extensive human and financial resources in support of them, there is sometimes no need to choose between the two. Large studies with excellent and extensive measurements are at least sometimes achievable. It is, once again, arguably the conceptual decoupling of “study” as the creation of a data resource and “study” as a specific empirical inquiry that underlies the pursuit of such large multi-purpose cohort studies.

Of course, even with a multi-purpose study (data resource), once the data have been collected, a specific topic of study (empirical inquiry) has been selected, and an appropriate study design (template) chosen, a great deal of work may still remain in determining how best to make use of the data to answer the specific inquiry. And here, irrespective of the design (template) selected, all of Cochran’s principles come, once again, into play. Issues of selecting controls, baseline measurements, confounders and confounder adjustment, measurement error, method of adjustment, generalizability and external validity, and the assessment of the evidence for judging causation must all be carefully evaluated. Cochran’s discussion of observational studies is indeed as relevant today as it was forty years ago.

References

Barlow, W.E., Ichikawa, L., Rosner, D., and Izumi, S. (1999). Analysis of case-cohort designs. Journal of Clinical Epidemiology, 52:1165–1172.

Breslow, N. and Cain, K. (1988). Logistic regression for two-stage case-control data. Biometrika, 75:11–20.

Farrington, C.P. (1995). Relative incidence estimation from case series for vaccine safety evaluation. Biometrics, 51:228–235.

Greenland, S. (1996). Confounding and exposure trends in case-crossover and case-time-control designs. Epidemiology, 7(3):231–239.

Haneuse, S.J.-P.A. and Wakefield, J.C. (2008). The combination of ecological and case-control data. Journal of the Royal Statistical Society: Series B, 70:73–93.

Imbens, G.W. and Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Keogh, R.H. and Cox, D.R. (2014). Case-Control Studies. Cambridge University Press.

Lipsitch, M., Tchetgen Tchetgen, E. and Cohen, T. (2010). Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology, 21:383–388.

McEvoy, S.P., Stevenson, M.R., McCartt, A.T., Woodward, M., Haworth, C., Palamara, P. and Cercarelli, R. (2005). Role of mobile phones in motor vehicle crashes resulting in hospital attendance: a case-crossover study. BMJ, 331.

Maclure, M. (1991). The case-crossover design: a method for studying transient effects on the risk of acute events. American Journal of Epidemiology, 133:144–153.


Maclure, M. (2007). “Why me?” versus “why now?” – differences between operational hypotheses in case-control versus case-crossover studies. Pharmacoepidemiology and Drug Safety, 16:850–853.

Maclure, M. and Mittleman, M.A. (2000). Should we use a case-crossover design? Annual Review of Public Health, 21:193–221.

Miettinen, O.S. (1976). Estimability and estimation in case-referent studies. American Journal of Epidemiology, 103:226–235.

Miettinen, O.S. and Karp, I. (2012). Epidemiological Research: An Introduction. Springer.

Mittleman, M.A., Maclure, M., Tofler, G.H., Sherwood, J.B., Goldberg, R.J. and Muller, J.E. (1993). Triggering of acute myocardial infarction by heavy physical exertion. New England Journal of Medicine, 329:1677–1683.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Excerpts reprinted (1990) in English (D. Dabrowska and T. Speed, Trans.) in Statistical Science, 5:463–472.

Pearl, J. (2001). Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, 411–420.

Piegorsch, W.W., Weinberg, C.R. and Taylor, J.A. (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine, 13:153–162.

Prentice, R.L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 73:1–11.

Prentice, R.L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66:403–411.

Redelmeier, D.A. and Tibshirani, R.J. (1997). Association between cellular telephone calls and motor vehicle collisions. New England Journal of Medicine, 336:453–458.

Robins, J.M. (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512.

Robins, J.M. (1997). Causal inference from complex longitudinal data. In: Latent Variable Modeling and Applications to Causality. Lecture Notes in Statistics (120), M. Berkane, Editor. Springer-Verlag, New York, 69–117.

Robins, J.M. and Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3:143–155.

Robins, J.M., Hernan, M.A. and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11:550–560.

Rosenbaum, P.R. (2010). Design sensitivity and efficiency in observational studies. Journal of the American Statistical Association, 105:692–702.

Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.

Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79:516–524.

Rothman, K.J., Greenland, S. and Lash, T.L. (2008). Modern Epidemiology, 3rd edition. Philadelphia: Lippincott Williams and Wilkins.


Rubin, D. (1974). Estimating causal effects of treatments in randomized and non-randomized studies. Journal of Educational Psychology, 66:688–701.

Rubin, D.B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6:34–58.

Schmidt, S. and Schaid, D.J. (1999). Potential misinterpretation of the case-only study to assess gene-environment interaction. American Journal of Epidemiology, 150:878–885.

Suissa, S. (1995). The case-time-control design. Epidemiology, 6(3):248–253.

Tchetgen Tchetgen, E.J. (2014). The control outcome calibration approach for causal inference with unobserved confounding. American Journal of Epidemiology, 179:633–640.

VanderWeele, T.J. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press, New York.

Wang, S., Linkletter, C., Maclure, M., Dore, D., Mor, V., Buka, S. and Wellenius, G.A. (2011). Future-cases as present controls to adjust for exposure-trend bias in case-only studies. Epidemiology, 22(4):568–574.

Whitaker, H.J., Farrington, C.P. and Musonda, P. (2006). Tutorial in Biostatistics: The self-controlled case series method. Statistics in Medicine, 25(10):1768–1797.

White, J. (1982). A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology, 115:119–128.

Zaffanella, L.E., Savitz, D.A., Greenland, S. and Ebi, K.L. (1998). The residential case-specular method to study wire codes, magnetic fields, and disease. Epidemiology, 9:16–20.


Observational Studies 1 (2015) 231-240 Submitted 4/15; Published 8/15

Reflections on “Observational Studies”: Looking Backward and Looking Forward

Stephen G. West [email protected]

Arizona State University and Freie Universität Berlin

Tempe, AZ 85287, USA

Abstract

The classic works of William Cochran and Donald Campbell provided an important foundation for the design and analysis of non-randomized studies. From the remarkably similar perspectives of these two early figures, distinct perspectives have developed in statistics and in psychology. The potential outcomes perspective in statistics has focused on the conceptualization and the estimation of causal effects. This perspective has led to important new statistical models that provide appropriate adjustments for problems like missing data on the outcome variable, treatment non-compliance, and pre-treatment differences on baseline covariates in non-randomized studies. The Campbell perspective in psychology has focused on practical design procedures that prevent or minimize the occurrence of problems that potentially confound the interpretation of causal effects. It has also emphasized empirical comparisons of the estimates of causal effects obtained from different designs. Greater interplay between the potential outcomes and Campbell perspectives, together with consideration of applications of a third perspective developed in computer science by Judea Pearl, portends continued improvements in the design, conceptualization, and analysis of non-randomized studies.

1. Reflections on “Observational Studies”: Looking Backward and Looking Forward

William G. Cochran in statistics and Donald T. Campbell in psychology provided much of the foundation for the major approaches currently taken to the design and analysis of non-randomized studies in my field of psychology. The initial similarity of the positions taken by Cochran (1965, 1972, 1983) and Campbell (1957, 1963/1966) in their early writings on this topic is remarkable. Their work helped define the area and raised a number of key issues that have been the focus of methodological work since that time. Truly significant progress has been made in providing solutions to several of these key issues. Going forward to the present, work in statistics by Cochran’s students (particularly Donald Rubin and his students) and in psychology by Campbell’s colleagues (particularly Thomas Cook and William Shadish) has diverged in its emphases. Reconsideration of the newer work from the foundation of Cochran and Campbell helps identify some persisting issues.

2. Looking Backward

Cochran (1972) defined the domain of observational studies as excluding randomization, but including some agents, procedures, or experiences...[that] are like those the statistician

© 2015 Stephen G. West.


would call treatments in a controlled experiment... (p. 1). This definition reflects a middle ground between experiments and surveys without intervention, two areas in which Cochran had made major contributions (Cochran, 1950; 1963). The goal and the challenge of the observational study is causal inference: Did the treatment cause a change in the outcome? The domain established by Cochran’s definition can still be considered relevant today. There is continued debate over exactly what quantities should be called “treatments in a controlled experiment” (e.g., Holland, 1986; Rubin, 2010). And some authors (e.g., Cook, Shadish & Wong, 2008; Rosenbaum, 2010; Rubin, 2006) appear to have narrowed Cochran’s more inclusive definition of observational study to focus only on those designs that include baseline measures, non-randomized treatment and control groups, and at least one outcome measure. I will restrict the use of observational study to this narrower definition below, using the term non-randomized design to indicate the more inclusive definition.

Cochran (1972, section 3) discusses several designs that might be used to investigate the effects of a treatment in the absence of randomization. He also provides discussion of some potential confounders that potentially undermine the causal interpretation of any prima facie observed effects of the treatment. Campbell (Campbell & Stanley, 1963/1966) attempted to describe the full list of the non-randomized and randomized designs then available, including some he helped invent (e.g., the regression discontinuity design, Thistlethwaite & Campbell, 1960). Associated with each non-randomized design are specific types of potential confounders that undermine causal inference. Campbell attempted to enumerate a comprehensive list of potential confounders, which he termed threats to internal validity. These threats represented “an accumulation of our field’s criticisms of each other’s research” (Campbell, 1988, p. 322). Among these are such threats as history, maturation, instrumentation, testing, statistical regression, and attrition in pretest-posttest designs; selection in designs comparing non-randomized treatment and control groups using only a posttest measure; and interactions of selection with each of the earlier list of threats in observational studies (narrow definition). Both Cochran and Campbell clearly recognized the differential ability of each of the non-randomized designs to account for potential confounders.

Echoing his famous earlier quoting of Fisher to “Make your theories elaborate” (Cochran, 1965, p. 252), Cochran (1972, p. 10) stated that “the investigator should think of as many consequences of the hypothesis as he can and in the study try to include response measurements that will verify whether these consequences follow.” Campbell (1968; Campbell & Stanley, 1963/1966; Cook & Campbell, 1979) emphasized the similar concept of pattern matching, in which the ability of the treatment and each of the plausible confounders to account for the obtained pattern of results is compared. Campbell emphasized that both response variables and additional design features (e.g., multiple control groups having distinct strengths and weaknesses; repeated pre-treatment measurements over time) be included in the design of the study to distinguish between the competing explanations.

Cochran (1972) also considered several of the ways in which measurement could affect the results of observational studies, notably the effects of accuracy and precision of measurement on the results of analyses and the possibility that measurements were non-equivalent in the treatment and comparison groups. Campbell and Stanley (1963/1966) included measurement-related issues prominently among their threats to internal validity, and Campbell and Fiske (1959) offered methods of detecting potential biases (termed “method effects”) associated with different approaches to measurement (e.g., different types of raters; different measurement operations). Based on his experiences attempting to evaluate compensatory education programs, Campbell particularly emphasized the role of measurement issues, notably unreliability and lack of stability over time of baseline measurements, in producing artifactual results in the analysis of observational studies (Campbell & Boruch, 1975; Campbell & Erlebacher, 1970; Campbell & Kenny, 1999).

3. Looking Forward to the Present

From the initial similarity of the perspectives of Cochran and Campbell on non-randomized studies, their followers have diverged in their emphases. In statistics, Donald Rubin, one of Cochran’s students, has developed the potential outcomes approach to causal inference (Rubin, 1978; 2005; Imbens & Rubin, 2015), which provides a formal mathematical statistical approach for the conceptualization and the analysis of the effects of treatments. In psychology, Campbell’s colleagues have continued to develop aspects of his original approach, focusing on systematizing our understanding of design approaches to ruling out threats to internal validity. I highlight a few of these differences below (see West & Thoemmes, 2010 for a fuller discussion). Table 1 summarizes the typical design and statistical analysis approaches to strengthening causal inference associated with some randomized and non-randomized designs that come out of the Rubin and Campbell perspectives, respectively.

3.1 Rubin’s Potential Outcomes Model

The potential outcomes model has provided a useful mathematical statistical framework for conceptualizing many issues in randomized and non-randomized designs. This framework starts with the (unattainable) ideal of comparing the response of a single participant under the treatment condition with the response of the same participant under the control condition at the same time and in the same setting. Designs that approximate this ideal to varying degrees can be proposed, including the randomized experiment, the regression discontinuity design, and the observational study. The potential outcomes framework makes explicit the exact assumptions needed to meet the ideal and defines mathematically the precise causal effects that can be achieved if these assumptions can be met. The framework draws heavily on Rubin’s (1976; Little & Rubin, 2002) seminal work on developing unbiased estimates of parameters when data are missing. Randomized experiments can be conceived of as designs in which no observations are available for treatment group participants under the control condition and no observations are available for control group participants under the treatment condition, but data are missing completely at random.
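In the standard notation of this framework (a textbook summary, not specific to West's text), each unit $i$ has two potential outcomes, only one of which is ever observed:

```latex
% Average causal effect (ACE) in potential outcomes notation.
\tau = E[\,Y_i(1) - Y_i(0)\,]
% Under randomization, T_i \perp (Y_i(1), Y_i(0)), so the ACE is
% identified from the observed group means:
\tau = E[\,Y_i \mid T_i = 1\,] - E[\,Y_i \mid T_i = 0\,]
```

with the observed outcome $Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)$. The second line formalizes the missing-data view in the paragraph above: under randomization the unobserved potential outcomes are missing completely at random, so the observed group means are unbiased for the causal contrast.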

The potential outcomes framework permits the unbiased estimation of the magnitude of the average causal effect in experiments given that four assumptions are met: (a) successful randomization, (b) full compliance with the assigned treatment condition, (c) full measurement of the outcome variables, and (d) the stable unit treatment value assumption (SUTVA: participants’ outcomes are unaffected by the treatment assignments of others; no hidden variations of treatments). In cases when these assumptions break down (“broken randomized experiments”), new approaches requiring additional assumptions have been developed from the foundation of the potential outcomes model. Angrist, Imbens, and Rubin (1996) developed an approach that provides unbiased estimates of the causal effect for those participants who would take the treatment in a randomized experiment if assigned


Table 1: Key Assumptions/Threats to Internal Validity and Example Remedies for Randomized Experiments and Non-randomized Alternative Designs

Randomized Experiment

Independent units. Design approach: temporal or geographical isolation of units. Statistical approach: multilevel analysis; other statistical adjustment for clustering.

Stable Unit Treatment Value Assumption (SUTVA): other treatment conditions do not affect participant's outcome; no hidden variation in treatments. Design approach: temporal or geographical isolation of treatment groups. Statistical approach: statistical adjustment for measured exposure to other treatments.

Full treatment adherence. Design approach: incentive for adherence. Statistical approach: instrumental variable analysis (assume exclusion restriction).

No attrition. Design approach: sample retention procedures. Statistical approach: missing data analysis (assume missing at random).

Regression Discontinuity Design

Functional form of relationship between assignment variable and outcome is properly modeled. Design approach: replication with different cutpoint; nonequivalent dependent variables. Statistical approach: nonparametric regression; sensitivity analysis.

Interrupted Time Series Analysis

Functional form of the relationship for the time series is properly modeled; another historical event, a change in population (selection), or a change in measures coincides with the introduction of the intervention. Design approach: nonequivalent control series in which intervention is not introduced; switching replication in which intervention is introduced at another time point; nonequivalent dependent measure. Statistical approach: diagnostic plots (autocorrelogram; spectral density); sensitivity analysis.

Observational Study

Measured baseline variables equated. Design approach: multiple control groups. Statistical approach: propensity score analysis.

Unmeasured baseline variables equated. Design approach: nonequivalent dependent measures. Statistical approach: sensitivity analysis.

Differential maturation. Design approach: additional pre- and post-intervention measurements. Statistical approach: subgroup analysis.

Note. The list of assumptions/threats to internal validity identifies issues that commonly occur in each of the designs. The alternative designs may be subject to each of the issues listed for the randomized experiment in addition to the issues listed for the specific design. The examples of statistical and design approaches for mitigating the threat to internal validity illustrate some commonly used approaches and are not exhaustive. For the observational study design, Rubin’s and Campbell’s perspectives differ, so the statistical and design approaches do not map 1:1 onto the assumptions/threats to internal validity that are listed. Reprinted from West, S. G. (2009). Alternatives to randomized experiments. Current Directions in Psychological Science, 18, 299–304.

to the treatment condition but take the control if assigned to the control condition (a.k.a. the complier average causal effect; see Sagarin, West, Ratnikov, Homan, Ritchie, & Hansen, 2014 for a recent review of approaches to treatment non-compliance). Little and Rubin


(2002) and Yang and Maxwell (2014) offer methods for estimating unbiased causal effects when there is attrition from measurement of the outcome variable.

In the context of observational studies, Rosenbaum and Rubin (1983) developed propensity score analysis as a vehicle to adjust for the effect of a large number of covariates (potential confounders) measured at baseline (see Austin, 2011 and West, Thoemmes, Cham, Renneberg, Schultze, & Weiler, 2014 for recent reviews). Hong and Raudenbush (2005; 2013) extended propensity score analysis to provide proper estimates of the average causal effect when the treatment was delivered to a pre-existing group (e.g., treatments delivered to existing classrooms of students) in group-based observational studies. Thoemmes and West (2011) developed approaches to the analysis of group-based observational studies in which the treatment is delivered to originally independent individuals who are constituted into groups for the purpose of the study. In each case, the potential outcomes perspective provides the foundation for conceptualizing the analysis, helps identify the necessary assumptions, and specifies the exact causal effect that may be estimated.
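The core idea of propensity score adjustment can be sketched numerically. The following minimal example is not from West's commentary: it uses synthetic data with a single binary confounder X, and estimates the propensity score by within-stratum frequencies (a stand-in for the logistic regression usually used in practice):

```python
# Hedged sketch: inverse-probability weighting on the propensity score.
# Each record is (x, t, y): binary confounder, treatment indicator, outcome.
# All numbers are synthetic; the true treatment effect is +1 in both strata.
rows = (
    [(0, 1, 1.0)] * 20 + [(0, 0, 0.0)] * 80 +   # X=0: 20% treated
    [(1, 1, 3.0)] * 80 + [(1, 0, 2.0)] * 20     # X=1: 80% treated
)

def mean(vals):
    return sum(vals) / len(vals)

# Naive treated-vs-control difference: confounded, because X predicts both
# treatment uptake and the outcome level.
naive = (mean([y for x, t, y in rows if t == 1])
         - mean([y for x, t, y in rows if t == 0]))

# Propensity score e(x) = P(T=1 | X=x), estimated by simple frequencies.
e = {xv: mean([t for x, t, y in rows if x == xv]) for xv in (0, 1)}

# Inverse-probability weighting: treated units get weight 1/e(x),
# controls get 1/(1 - e(x)); weighted means estimate the average causal effect.
w = [1 / e[x] if t == 1 else 1 / (1 - e[x]) for x, t, y in rows]
treated = [(wi, y) for wi, (x, t, y) in zip(w, rows) if t == 1]
control = [(wi, y) for wi, (x, t, y) in zip(w, rows) if t == 0]
ipw = (sum(wi * y for wi, y in treated) / sum(wi for wi, _ in treated)
       - sum(wi * y for wi, y in control) / sum(wi for wi, _ in control))

print(round(naive, 2), round(ipw, 2))  # 2.2 (biased) vs 1.0 (true effect)
```

Weighting each unit by the inverse probability of the treatment it actually received creates a pseudo-population in which X no longer predicts treatment, which is why the weighted contrast recovers the stratum-constant effect that the naive contrast misses.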

The potential outcomes approach has been particularly fruitful for the analysis of designs in which there are baseline covariates, outcome measures collected at a single time point, treatment and control conditions, and in which treatment assignment is assumed to be either independent of potential confounders or to be independent of all potential confounders after conditioning on covariates (ignorable). It has provided a valuable tool for analyzing randomized experiments, broken randomized experiments, the regression discontinuity design, and the observational study (narrow definition). However, the potential outcomes approach becomes more challenging to apply in designs in which measurement of the outcome variable is extended over time (e.g., interrupted time series designs, see Imbens & Wooldridge, 2009) or there are time-varying treatments (see Hong & Raudenbush, 2006). In addition, assumptions underlying the application of the potential outcomes approach (e.g., ignorability: no other unmeasured confounders exist) may not be testable, an important limitation.

3.2 Campbell’s Practical Working Scientist Approach

In psychology, design-based approaches have been given priority in the Campbell tradition over statistical adjustment of treatment effects: “When it comes to causal inference from quasi-experiments, design rules, not statistics” (Shadish & Cook, 1999, p. 300). The preference is to use the strongest design that can be implemented in the research context (Shadish, Cook, & Campbell, 2002). Advice is given to researchers about justifications that may be given to individual participants, communities, and organizations so that they will permit the strongest possible design to be implemented. Each of the potential threats to internal validity associated with the specific design is then carefully evaluated for plausibility in the research context. Attempts are then made to prevent or minimize the threat. For example, Ribisl et al. (1996) developed a valuable compendium of the then-available strategies for minimizing attrition in longitudinal designs (which needs updating to incorporate new technologies for tracking and keeping in contact with individual participants). Efforts to prevent the threat are supplemented by the identification of design elements that specifically address each identified threat (e.g., multiple pretests; multiple control groups; see Shadish et al., 2002, p. 157). Then, the pattern of obtained results is compared to the results predicted by


the hypothesis and by each of the threats to internal validity (confounders). Shadish et al. present illustrations of the use of this strategy and numerous research applications in which it has been successful. Campbell’s approach does not eschew the elegant statistical analysis approaches offered by the potential outcomes framework (indeed, it welcomes them), but it gives higher priority to design solutions.

There are two key difficulties in applying Campbell’s approach to non-randomized designs. First, the researcher’s decision to rule out or ignore certain threats to internal validity a priori may be incorrect. In Campbell’s approach, there are suggestions to use common sense, prior research, and theory to eliminate some potential threats. While often a good guide, common sense, prior research, and theory might be misleading. Second, although the pattern matching strategy can be compelling when the evidence supporting one hypothesis is definitive, more ambiguous partial support for several competing hypotheses can also occur. Rosenbaum (2010) offers some methods of integrating results across multiple design elements, but methods of formally assessing the match of the obtained results to the hypothesis and each of the plausible threats to validity need further development.

Campbell emphasized a practical approach that mimics the approach of the working scientist. One intriguing development within the Campbell tradition is the empirical comparison of the results of overlapping study designs in which a randomized and a non-randomized design are both used. A series of papers have compared non-randomized designs to the gold standard of the randomized experiment under conditions in which the two designs share a common treatment group and ideally sample from the same population of participants. No difference was found between the estimates of the magnitude of the treatment effect from the randomized experiment and the regression discontinuity design (Cook & Wong, 2008; see also Shadish et al., 2011, for a randomized experiment comparing the estimates of treatment effects from the two designs). Similarly, no differences were found between the estimates of the treatment effect from a randomized experiment and an interrupted time series design (St. Clair, Cook, & Hallberg, 2014). In contrast, syntheses of existing comparison studies (Cook, Shadish, & Wong, 2008), as well as randomized experiments comparing the effect sizes for participants randomly assigned to treatment and control conditions or randomly assigned to self-select between the same treatment and control conditions (Shadish, Clark, & Steiner, 2008), identified cases in which the two designs produced comparable and non-comparable results. Several factors facilitated obtaining comparable results (Cook, Shadish, & Wong, 2008; Cook, Steiner, & Pohl, 2009; Steiner, Cook, Shadish, & Clark, 2010): (a) a known selection rule that determines assignment to treatment and control conditions, (b) inclusion of measures of all relevant confounders in the statistical adjustment model, (c) inclusion of a pretest measure of the outcome variable in the statistical adjustment model, (d) reliable measurement of the covariates, and (e) control participants selected from the same population as the treatment participants (originally descriptively termed a “focal, local” control group). Generalization of the results of these studies was originally limited by the small number of research contexts in the existing studies upon which the conclusions were based. However, in an ongoing project more than 65 studies of this type have been identified and compiled thus far, and this database continues to expand. Synthesis of these studies is increasingly providing a practical basis for the design of non-experimental studies that can help minimize bias in the estimates of causal effects of treatments.
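The logic behind factors (a) through (c) can be illustrated with a minimal simulation (a hypothetical sketch, not data from any of the studies cited): when treatment is assigned by a known selection rule on a measured pretest of the outcome, adjusting for that pretest approximately recovers the true treatment effect, whereas the naive difference in group means is badly biased by selection.

```python
# Hypothetical simulation of a non-randomized study with a known selection
# rule. All names (pretest, treated, outcome) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_effect = 2.0

pretest = rng.normal(0.0, 1.0, n)          # baseline measure of the outcome
treated = (pretest < 0.0).astype(float)    # known selection rule: low scorers receive treatment
outcome = 1.0 + true_effect * treated + 1.5 * pretest + rng.normal(0.0, 1.0, n)

# Naive estimate: difference in group means, biased because the treated group
# started lower on the pretest.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Adjusted estimate: ordinary least squares on [intercept, treated, pretest],
# i.e., the selection variable is included in the statistical adjustment model.
X = np.column_stack([np.ones(n), treated, pretest])
beta = np.linalg.lstsq(X, outcome, rcond=None)[0]
adjusted = beta[1]

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")  # adjusted is close to 2.0
```

The same logic underlies the regression discontinuity design: because assignment depends only on an observed variable, conditioning on that variable removes the selection bias. If the selection variable were unmeasured, or measured with substantial error (factor (d)), the adjusted estimate would remain biased.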



4. Conclusion

Cochran and Campbell helped define the domain of non-randomized studies and the threats to internal validity (potential confounders) that may compromise the interpretation of observed treatment effects. Subsequent work in statistics and psychology has taken complementary paths. Work in the potential outcomes model tradition in statistics has substantially improved the conceptualization of causal effects and has provided improved estimates of their magnitude. Work in psychology following in the tradition of Campbell has emphasized the development of practical methods for researchers to improve the design of non-randomized studies. Some of this work has built on the insights of Cochran and Campbell to make theories complex and to design studies so that pattern matching may be used to distinguish between potential explanations. Some of this work has re-emphasized the importance of some of Cochran and Campbell’s views about the key status of baseline measures of the outcome variable and the importance of highly reliable measurement of baseline measures in non-randomized designs, features that have received less attention in the potential outcomes approach. Within the potential outcomes framework, less attention has been given to the development of new methods in which the observations are measured repeatedly over time or time-varying treatments are implemented (but see, e.g., Robins & Hernán, 2009). Within the Campbell framework, practical methods for strengthening causal inferences in the interrupted time series have been developed (Shadish et al., 2002), and new work has focused on improving the design and analysis of the single subject design (Shadish, 2014; Shadish, Hedges, Pustejovsky, Rindskopf, Boyajian, & Sullivan, 2014). And within both approaches the development of formal methods for assessing Cochran’s elaborate theories or Campbell’s pattern matching has received relatively little attention. This situation may be beginning to change. Judea Pearl’s (2009) approach to causal inference, developed in computer science, is being applied to systems of variables. Although Pearl’s approach can be seen as having its own limitations (Shadish & Sullivan, 2012; West & Koch, 2014), it has helped sharpen our conceptualization of some causal inference problems (e.g., mediation, in which treatment causes changes in intermediate variables [mediators], which, in turn, produce changes in the outcome variable; which confounders should and should not be controlled in observational studies). It has also provided challenges to the potential outcomes approach given its alternative (but overlapping) approach to the conceptualization and estimation of causal effects. Although important advances have occurred since the foundational work of Cochran and Campbell, greater interplay among the potential outcomes, Campbell, and Pearl perspectives portends continued improvements in our design, conceptualization, and analysis of non-randomized studies.

Acknowledgments

I thank Thomas D. Cook, William R. Shadish, and Dylan Small for their comments on an earlier version of the manuscript.


References

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association, 91, 444–472.

Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399–424.

Campbell, D. T. (1957). Factors affecting the validity of experiments in social settings. Psychological Bulletin, 54, 297–312.

Campbell, D. T. (1966). Pattern matching as an essential in distal knowing. In K. R. Hammond (Ed.), The psychology of Egon Brunswik (pp. 81–106). Holt, Rinehart, & Winston, New York.

Campbell, D. T. (1988). Can we be scientific in applied social science? In E. S. Overman (Ed.), Methodology and epistemology for social science: Selected papers of Donald T. Campbell. University of Chicago Press, Chicago.

Campbell, D. T., and Boruch, R. F. (1975). Making the case for randomized assignment to treatments by considering the alternatives: Six ways in which quasi-experimental evaluations in compensatory education tend to underestimate effects. In C. A. Bennett & A. A. Lumsdaine (Eds.), Evaluation and experiments: Some critical issues in assessing social programs (pp. 195–296). Academic Press, New York.

Campbell, D. T., and Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.

Campbell, D. T., and Erlebacher, A. E. (1970). How regression artifacts can mistakenly make compensatory education programs look harmful. In J. Hellmuth (Ed.), The disadvantaged child: Vol. 3. Compensatory education: A national debate (pp. 185–210). Brunner/Mazel, New York.

Campbell, D. T., and Kenny, D. A. (1999). A primer on regression artifacts. Guilford, New York.

Campbell, D. T., and Stanley, J. C. (1963/1966). Experimental and quasi-experimental designs for research. Rand McNally, Chicago. Originally published in N. L. Gage (Ed.), Handbook of research on teaching (pp. 171–246). Rand McNally, Chicago.

Cochran, W. G. (1953). Sampling techniques. Wiley, New York.

Cochran, W. G. (1965). The planning of observational studies of human populations (with discussion). Journal of the Royal Statistical Society, Series A, 128, 234–266.

Cochran, W. G. (1972). Observational studies. In T. A. Bancroft (Ed.), Statistical papers in honor of George W. Snedecor (pp. 77–90). Iowa State University Press, Ames, IA. Reprinted in Observational Studies, 1, 126–136.

Cochran, W. G. (1983). Planning and analysis of observational studies. Wiley, New York.

Cochran, W. G., and Cox, G. M. (1950). Experimental designs. Wiley, New York.

Cook, T. D., and Wong, V. C. (2008). Empirical tests of the validity of the regression discontinuity design. Annales d'Economie et de Statistique, 91–92, 127–150.

Cook, T. D., Shadish, W. R., and Wong, V. C. (2008). Three conditions under which observational studies produce the same results as experiments. Journal of Policy Analysis and Management, 27(4), 724–750.

Cook, T. D., Steiner, P. M., and Pohl, S. (2009). Assessing how bias reduction is influenced by covariate choice, unreliability and data analytic mode: An analysis of different kinds of within-study comparisons in different substantive domains. Multivariate Behavioral Research, 44, 828–847.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.

Hong, G., and Raudenbush, S. W. (2006). Evaluating kindergarten retention policy. Journal of the American Statistical Association, 101(475), 901–910.

Hong, G., and Raudenbush, S. W. (2013). Heterogeneous agents, social interactions, and causal inference. In S. L. Morgan (Ed.), Handbook of causal analysis for social research (pp. 331–352). Springer Netherlands.

Imbens, G. W., and Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge University Press, New York.

Imbens, G. W., and Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47, 5–86.

Little, R. J. A., and Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Wiley, Hoboken, NJ.

Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press, New York.

Ribisl, K. M., Walton, M. A., Mowbray, C. T., Luke, D. A., Davidson, W. S., and Bootsmiller, B. J. (1996). Minimizing participant attrition in panel studies through the use of effective retention and tracking strategies: Review and recommendations. Evaluation and Program Planning, 19(1), 1–25.

Robins, J. M., and Hernán, M. A. (2009). Estimation of the causal effects of time-varying exposures. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Advances in longitudinal data analysis. Chapman and Hall, New York.

Rosenbaum, P. R. (2010). Design of observational studies. Springer, New York.

Rosenbaum, P. R., and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34–58.

Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100, 322–331.

Rubin, D. B. (2006). Matched sampling for causal effects. Cambridge University Press, New York.

Rubin, D. B. (2010). Reflections stimulated by the comments of Shadish (2010) and West and Thoemmes (2010). Psychological Methods, 15(1), 38–46.

Sagarin, B. J., West, S. G., Ratnikov, A., Homan, W. K., Ritchie, T. D., and Hansen, E. J. (2014). Treatment noncompliance in randomized experiments: Statistical approaches and design issues. Psychological Methods, 19(3), 317–333.

Shadish, W. R. (2014). Statistical analyses of single-case designs: The shape of things to come. Current Directions in Psychological Science, 23, 139–146.

Shadish, W. R., Clark, M. H., and Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment (with commentary). Journal of the American Statistical Association, 103, 1334–1356.

Shadish, W. R., and Cook, T. D. (1999). Design rules: More steps towards a complete theory of quasi-experimentation. Statistical Science, 14, 294–300.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Wadsworth, Boston.

Shadish, W. R., Galindo, R., Wong, V. C., Steiner, P. M., and Cook, T. D. (2011). A randomized experiment comparing random to cutoff-based assignment. Psychological Methods, 16(2), 179–191.

Shadish, W. R., Hedges, L. V., Pustejovsky, J., Rindskopf, D. M., Boyajian, J. G., and Sullivan, K. J. (2014). Analyzing single-case designs: d, G, multilevel models, Bayesian estimators, generalized additive models, and the hopes and fears of researchers about analyses. In T. R. Kratochwill and J. R. Levin (Eds.), Single-case intervention research: Methodological and statistical advances (pp. 247–281). American Psychological Association, Washington, D.C.

Shadish, W. R., and Sullivan, K. (2012). Theories of causation in psychological science. In H. Cooper (Ed.), APA handbook of research methods in psychology (Vol. 1, pp. 23–52). American Psychological Association, Washington, D.C.

St. Clair, T., Cook, T. D., and Hallberg, K. (2014). Examining the internal validity and statistical precision of the comparative interrupted time series design by comparison with a randomized experiment. American Journal of Evaluation, 35(3), 311–327.

Thistlethwaite, D. L., and Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51, 309–317.

Thoemmes, F. J., and West, S. G. (2011). The use of propensity scores for nonrandomized designs with clustered data. Multivariate Behavioral Research, 46(3), 514–543.

West, S. G., and Koch, T. (2014). Restoring causal analysis to structural equation modeling. Review of Judea Pearl, Causality: Models, reasoning, and inference (2nd ed.). Structural Equation Modeling, 21, 161–166.

West, S. G., and Thoemmes, F. (2010). Campbell's and Rubin's perspectives on causal inference. Psychological Methods, 15, 18–37.

West, S. G., Cham, H., Thoemmes, F., Renneberg, B., Schulze, J., and Weiler, M. (2014). Propensity scores as a basis for equating groups: Basic principles and application in clinical treatment outcome research. Journal of Consulting and Clinical Psychology, 82(5), 906–919.

Yang, M., and Maxwell, S. E. (2014). Treatment effects in randomized longitudinal trials with different types of non-ignorable dropout. Psychological Methods, 19, 188–210.