Chapters 1 and 2 from The Design of Experiments by Ronald A. Fisher. 8th Edition, 1966.

THE DESIGN OF EXPERIMENTS

I

INTRODUCTION

1. The Grounds on which Evidence is Disputed

WHEN any scientific conclusion is supposed to be proved on experimental evidence, critics who still refuse to accept the conclusion are accustomed to take one of two lines of attack. They may claim that the interpretation of the experiment is faulty, that the results reported are not in fact those which should have been expected had the conclusion drawn been justified, or that they might equally well have arisen had the conclusion drawn been false. Such criticisms of interpretation are usually treated as falling within the domain of statistics. They are often made by professed statisticians against the work of others whom they regard as ignorant of or incompetent in statistical technique; and, since the interpretation of any considerable body of data is likely to involve computations, it is natural enough that questions involving the logical implications of the results of the arithmetical processes employed, should be relegated to the statistician. At least I make no complaint of this convention. The statistician cannot evade the responsibility for understanding the processes he applies or recommends. My immediate point is that the questions involved can be dissociated from all that is strictly technical in the statistician's craft, and, when so detached, are questions only of the right use of human reasoning powers, with which all intelligent people, who hope to be intelligible, are equally concerned, and on which the statistician, as such, speaks with no special authority. The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation.

The other type of criticism to which experimental results are exposed is that the experiment itself was ill designed, or, of course, badly executed. If we suppose that the experimenter did what he intended to do, both of these points come down to the question of the design, or the logical structure of the experiment. This type of criticism is usually made by what I might call a heavyweight authority. Prolonged experience, or at least the long possession of a scientific reputation, is almost a pre-requisite for developing successfully this line of attack. Technical details are seldom in evidence. The authoritative assertion "His controls are totally inadequate" must have temporarily discredited many a promising line of work; and such an authoritarian method of judgment must surely continue, human nature being what it is, so long as theoretical notions of the principles of experimental design are lacking, notions just as clear and explicit as we are accustomed to apply to technical details.

Now the essential point is that the two sorts of criticism I have mentioned are aimed only at different aspects of the same whole, although they are usually delivered by different sorts of people and in very different language. If the design of an experiment is faulty, any method of interpretation which makes it out to be decisive must be faulty too. It is true that there are a great many experimental procedures which are well designed in that they may lead to decisive conclusions, but on other occasions may fail to do so; in such cases, if decisive conclusions are in fact drawn when they are unjustified, we may say that the fault is wholly in the interpretation, not in the design. But the fault of interpretation, even in these cases, lies in overlooking the characteristic features of the design which lead to the result being sometimes inconclusive, or conclusive on some questions but not on all. To understand correctly the one aspect of the problem is to understand the other. Statistical procedure and experimental design are only two different aspects of the same whole, and that whole comprises all the logical requirements of the complete process of adding to natural knowledge by experimentation.

2. The Mathematical Attitude towards Induction

In the foregoing paragraphs the subject-matter of this book has been regarded from the point of view of an experimenter, who wishes to carry out his work competently, and having done so wishes to safeguard his results, so far as they are validly established, from ignorant criticism by different sorts of superior persons. I have assumed, as the experimenter always does assume, that it is possible to draw valid inferences from the results of experimentation; that it is possible to argue from consequences to causes, from observations to hypotheses; as a statistician would say, from a sample to the population from which the sample was drawn, or, as a logician might put it, from the particular to the general. It is, however, certain that many mathematicians, if pressed on the point, would say that it is not possible rigorously to argue from the particular to the general; that all such arguments must involve some sort of guesswork, which they might admit to be plausible guesswork, but the rationale of which, they would be unwilling, as mathematicians, to discuss. We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression. In the theory of probability, as developed in its application to games of chance, we have the classic example proving this possibility. If the gamblers' apparatus are really true or unbiased, the probabilities of the different possible events, or combinations of events, can be inferred by a rigorous deductive argument, although the outcome of any particular game is recognised to be uncertain. The mere fact that inductive inferences are uncertain cannot, therefore, be accepted as precluding perfectly rigorous and unequivocal inference.

Naturally, writers on probability have made determined efforts to include the problem of inductive inference within the ambit of the theory of mathematical probability, developed in discussing deductive problems arising in games of chance. To illustrate how much was at one time thought to have been achieved in this way, I may quote a very lucid statement by Augustus de Morgan, published in 1838, in the preface to his essay on probabilities in The Cabinet Cyclopædia. At this period confidence in the theory of inverse probability, as it was called, had reached, under the influence of Laplace, its highest point. Boole's criticisms had not yet been made, nor the more decided rejection of the theory by Venn, Chrystal, and later writers. De Morgan is speaking of the advances in the theory which were leading to its wider application to practical problems.

"There was also another circumstance which stood in the way of the first investigators, namely, the not having considered, or, at least, not having discovered the method of reasoning from the happening of an event to the probability of one or another cause. The questions treated in the third chapter of this work could not therefore be attempted by them. Given an hypothesis presenting the necessity of one or another out of a certain, and not very large, number of consequences, they could determine the chance that any given one or other of those consequences should arrive; but given an event as having happened, and which might have been the consequence of either of several different causes, or explicable by either of several different hypotheses, they could not infer the probability with which the happening of the event should cause the different hypotheses to be viewed. But, just as in natural philosophy the selection of an hypothesis by means of observed facts is always preliminary to any attempt at deductive discovery; so in the application of the notion of probability to the actual affairs of life, the process of reasoning from observed events to their most probable antecedents must go before the direct use of any such antecedent, cause, hypothesis, or whatever it may be correctly termed. These two obstacles, therefore, the mathematical difficulty, and the want of an inverse method, prevented the science from extending its views beyond problems of that simple nature which games of chance present."

Referring to the inverse method, he later adds: "This was first used by the Rev. T. Bayes, and the author, though now almost forgotten, deserves the most honourable remembrance from all who treat the history of this science."


3. The Rejection of Inverse Probability

Whatever may have been true in 1838, it is certainly not true to-day that Thomas Bayes is almost forgotten. That he seems to have been the first man in Europe to have seen the importance of developing an exact and quantitative theory of inductive reasoning, of arguing from observational facts to the theories which might explain them, is surely a sufficient claim to a place in the history of science. But he deserves honourable remembrance for one fact, also, in addition to those mentioned by de Morgan. Having perceived the problem and devised an axiom which, if its truth were granted, would bring inverse inferences within the scope of the theory of mathematical probability, he was sufficiently critical of its validity to try to avoid the axiomatic approach, and, perhaps for the same reason, to withhold his entire treatise from publication until his doubts should have been satisfied. In the event, the work was published after his death by his friend, Price, and we cannot say what views he ultimately held on the subject.

The discrepancy of opinion among historical writers on probability is so great that to mention the subject is unavoidable. It would, however, be out of place here to argue the point in detail. I will only state three considerations which will explain why, in the practical applications of the subject, I shall not assume the truth of Bayes' axiom. Two of these reasons would, I think, be generally admitted, but the first, I can well imagine, might be indignantly repudiated in some quarters. The first is this: The axiom leads to apparent mathematical contradictions. In explaining these contradictions away, advocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity measured by observable frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes.

My second reason is that it is the nature of an axiom that its truth should be apparent to any rational mind which fully apprehends its meaning. The axiom of Bayes has certainly been fully apprehended by a good many rational minds, including that of its author, without carrying this conviction of necessary truth. This, alone, shows that it cannot be accepted as the axiomatic basis of a rigorous argument.

My third reason is that inverse probability has been only very rarely used in the justification of conclusions from experimental facts, although the theory has been widely taught, and is widespread in the literature of probability. Whatever the reasons are which give experimenters confidence that they can draw valid conclusions from their results, they seem to act just as powerfully whether the experimenter has heard of the theory of inverse probability or not.

4. The Logic of the Laboratory

In fact, in the course of this book, I propose to consider a number of different types of experimentation, with especial reference to their logical structure, and to show that when the appropriate precautions are taken to make this structure complete, entirely valid inferences may be drawn from them, without using the disputed axiom. If this can be done, we shall, in the course of studies having directly practical aims, have overcome the theoretical difficulty of inductive inferences.

Inductive inference is the only process known to us by which essentially new knowledge comes into the world. To make clear the authentic conditions of its validity is the kind of contribution to the intellectual development of mankind which we should expect experimental science would ultimately supply. Men have always been capable of some mental processes of the kind we call "learning by experience." Doubtless this experience was often a very imperfect basis, and the reasoning processes used in interpreting it were very insecure; but there must have been in these processes a sort of embryology of knowledge, by which new knowledge was gradually produced. Experimental observations are only experience carefully planned in advance, and designed to form a secure basis of new knowledge; that is, they are systematically related to the body of knowledge already acquired, and the results are deliberately observed, and put on record accurately. As the art of experimentation advances the principles should become clear by virtue of which this planning and designing achieve their purpose.

It is as well to remember in this connection that the principles and method of even deductive reasoning were probably unknown for several thousand years after the establishment of prosperous and cultured civilisations. We take a knowledge of these principles for granted, only because geometry is universally taught in schools. The method and material taught is essentially that of Euclid's text-book of the third century B.C., and no one can make any progress in that subject without thoroughly familiarising his mind with the requirements of a precise deductive argument. Assuming the axioms, the body of their logical consequences is built up systematically and without ambiguity. Yet it is certainly something of an accident historically that this particular discipline should have become fashionable in the Greek Universities, and later embodied in the curricula of secondary education. It would be difficult to overstate how much the liberty of human thought has owed to this fortunate circumstance. Since Euclid's time there have been very long periods during which the right of unfettered individual judgment has been successfully denied in legal, moral, and historical questions, but in which it has, none the less, survived, so far as purely deductive reasoning is concerned, within the shelter of apparently harmless mathematical studies.

The liberation of the human intellect must, however, remain incomplete so long as it is free only to work out the consequences of a prescribed body of dogmatic data, and is denied the access to unsuspected truths, which only direct observation can give. The development of experimental science has therefore done much more than to multiply the technical competence of mankind; and if, in these introductory lines, I have seemed to wander far from the immediate purpose of this book, it is only because the two topics with which we shall be concerned, the arts of experimental design and of the valid interpretation of experimental results, in so far as they can be technically perfected, must constitute the core of this claim to the exercise of full intellectual liberty.

The chapters which follow are designed to illustrate the principles which are common to all experimentation, by means of examples chosen for the simplicity with which these principles are brought out. Next, to exhibit the principal designs which have been found successful in that field of experimentation, namely agriculture, in which questions of design have been most thoroughly studied, and to illustrate their applicability to other fields of work. Many of the most useful designs are extremely simple, and these deserve the greatest attention, as showing in what ways, and on what occasions, greater elaboration may be advantageous. The careful reader should be able to satisfy himself not only, in detail, why some experiments have a complex structure, but also how a complex observational record may be handled with intelligibility and precision.

The subject is a new one, and in many ways the most that the author can hope is to suggest possible lines of attack on the problems with which others are confronted. Progress in recent years has been rapid, and the few sections devoted to the subject in the author's Statistical Methods for Research Workers, first published in 1925, have, with each succeeding edition, come to appear more and more inadequate. On purely statistical questions the reader must be referred to that book; on logic, and the analysis of meaning, to Statistical Methods and Scientific Inference. The present volume is an attempt to do more thorough justice to the problems of planning and foresight with which the experimenter is confronted.

REFERENCES AND OTHER READING

T. BAYES (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, liii. 370.

A. DE MORGAN (1838). An essay on probabilities and on their application to life contingencies and insurance offices. Preface, vi. Longman & Co.

R. A. FISHER (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, xxvi. 528-535.

R. A. FISHER (1932). Inverse probability and the use of likelihood. Proceedings of the Cambridge Philosophical Society, xxviii. 257-261.

R. A. FISHER (1935). The logic of inductive inference. Journal Royal Statistical Society, xcviii. 39-54.

R. A. FISHER (1936). Uncertain inference. Proceedings of the American Academy of Arts and Sciences, 71. 245-258.

R. A. FISHER (1925-1963). Statistical methods for research workers. Oliver and Boyd Ltd., Edinburgh.

R. A. FISHER (1956, 1959). Statistical methods and scientific inference. Oliver and Boyd Ltd., Edinburgh.

II

THE PRINCIPLES OF EXPERIMENTATION, ILLUSTRATED BY A PSYCHO-PHYSICAL EXPERIMENT

5. Statement of Experiment

A LADY declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. For this purpose let us first lay down a simple form of experiment with a view to studying its limitations and its characteristics, both those which appear to be essential to the experimental method, when well developed, and those which are not essential but auxiliary.

Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is in an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, cards, dice, roulettes, etc., or, more expeditiously, from a published collection of random sampling numbers purporting to give the actual results of such manipulation. Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.

6. Interpretation and its Reasoned Basis

In considering the appropriateness of any proposed experimental design, it is always needful to forecast all possible results of the experiment, and to have decided without ambiguity what interpretation shall be placed upon each one of them. Further, we must know by what argument this interpretation is to be sustained. In the present instance we may argue as follows. There are 70 ways of choosing a group of 4 objects out of 8. This may be demonstrated by an argument familiar to students of "permutations and combinations," namely, that if we were to choose the 4 objects in succession we should have successively 8, 7, 6, 5 objects to choose from, and could make our succession of choices in 8 × 7 × 6 × 5, or 1680 ways. But in doing this we have not only chosen every possible set of 4, but every possible set in every possible order; and since 4 objects can be arranged in order in 4 × 3 × 2 × 1, or 24 ways, we may find the number of possible choices by dividing 1680 by 24. The result, 70, is essential to our interpretation of the experiment. At best the subject can judge rightly with every cup and, knowing that 4 are of each kind, this amounts to choosing, out of the 70 sets of 4 which might be chosen, that particular one which is correct. A subject without any faculty of discrimination would in fact divide the 8 cups correctly into two sets of 4 in one trial out of 70, or, more properly, with a frequency which would approach 1 in 70 more and more nearly the more often the test were repeated. Evidently this frequency, with which unfailing success would be achieved by a person lacking altogether the faculty under test, is calculable from the number of cups used. The odds could be made much higher by enlarging the experiment, while if the experiment were much smaller even the greatest possible success would give odds so low that the result might, with considerable probability, be ascribed to chance.

7. The Test of Significance

It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20, the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent, then it would be useless for him to experiment with only 3 cups of tea of each kind. For 3 objects can be chosen out of 6 in only 20 ways, and therefore complete success in the test would be achieved without sensory discrimination, i.e. by "pure chance," in an average of 5 trials out of 100. It is usual and convenient for experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. No such selection can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly "significant," in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the "one chance in a million" will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.
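The counting argument of sections 6 and 7 can be checked mechanically. What follows is a minimal Python sketch added for illustration, not part of the original text; the function name `ways_to_choose` is an assumption introduced here. It reproduces the reasoning given in the text: count the ordered choices, then divide by the number of orderings of the chosen set.

```python
from math import comb, factorial


def ways_to_choose(n_total: int, n_pick: int) -> int:
    """Count unordered selections by the argument used in the text:
    ordered choices (8 x 7 x 6 x 5) divided by orderings (4 x 3 x 2 x 1)."""
    ordered = 1
    for k in range(n_pick):
        ordered *= n_total - k          # 8, then 7, then 6, then 5
    return ordered // factorial(n_pick)  # 1680 // 24 for the 8-cup case


ways_8_cups = ways_to_choose(8, 4)  # four cups of each kind
ways_6_cups = ways_to_choose(6, 3)  # three cups of each kind

# Chance of complete success without any discrimination.
print(ways_8_cups, 1 / ways_8_cups)
print(ways_6_cups, 1 / ways_6_cups)

# Cross-check against the standard library's binomial coefficient.
assert ways_8_cups == comb(8, 4)
assert ways_6_cups == comb(6, 3)
```

With eight cups the chance of a perfect score is 1 in 70, which is below the 5 per cent. level; with six cups it is 1 in 20, exactly 5 per cent., which is why the text calls the smaller experiment useless to an experimenter who ignores probabilities as high as 1 in 20.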

Returning to the possible results of the psycho-physical experiment, having decided that if every cup were rightly classified a significant positive result would be recorded, or, in other words, that we should admit that the lady had made good her claim, what should be our conclusion if, for each kind of cup, her judgments are 3 right and 1 wrong? We may take it, in the present discussion, that any error in one set of judgments will be compensated by an error in the other, since it is known to the subject that there are 4 cups of each kind. In enumerating the number of ways of choosing 4 things out of 8, such that 3 are right and 1 wrong, we may note that the 3 right may be chosen, out of the 4 available, in 4 ways and, independently of this choice, that the 1 wrong may be chosen, out of the 4 available, also in 4 ways. So that in all we could make a selection of the kind supposed in 16 different ways. A similar argument shows that, in each kind of judgment, 2 may be right and 2 wrong in 36 ways, 1 right and 3 wrong in 16 ways and none right and 4 wrong in 1 way only. It should be noted that the frequencies of these five possible results of the experiment make up together, as it is obvious they should, the 70 cases out of 70.
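The enumeration in the preceding paragraph may be tabulated directly. The sketch below is an illustration added here, assuming Python 3.9 or later; the function name `selection_counts` is hypothetical. For each number of right judgments it multiplies the ways of choosing the right cups by the ways of choosing the wrong ones, as the text does for the case of 3 right and 1 wrong.

```python
from math import comb


def selection_counts(cups_per_kind: int = 4) -> dict[int, int]:
    """Map 'number right' to the number of selections producing it.

    The r right cups are chosen from the n correct ones, and the
    n - r wrong cups are chosen independently from the n incorrect ones.
    """
    n = cups_per_kind
    return {r: comb(n, r) * comb(n, n - r) for r in range(n, -1, -1)}


counts = selection_counts()
for right, ways in counts.items():
    print(f"{right} right, {4 - right} wrong: {ways} ways")

total = sum(counts.values())
print("total:", total)
```

The five frequencies are 1, 16, 36, 16 and 1, and their total agrees with the 70 ways of choosing 4 objects out of 8.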

It is obvious, too, that 3 successes to 1 failure, although showing a bias, or deviation, in the right direction, could not be judged as statistically significant evidence of a real sensory discrimination. For its frequency of chance occurrence is 16 in 70, or more than 20 per cent. Moreover, it is not the best possible result, and in judging of its significance we must take account not only of its own frequency, but also of the frequency of any better result. In the present instance "3 right and 1 wrong" occurs 16 times, and "4 right" occurs once in 70 trials, making 17 cases out of 70 as good as or better than that observed. The reason for including cases better than that observed becomes obvious on considering what our conclusions would have been had the case of 3 right and 1 wrong only 1 chance, and the case of 4 right 16 chances of occurrence out of 70. The rare case of 3 right and 1 wrong could not be judged significant merely because it was rare, seeing that a higher degree of success would frequently have been scored by mere chance.
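The rule of counting every result as good as or better than the one observed can be written out as a short calculation. This is a sketch added for illustration, not the author's own procedure; the function name `chance_of_at_least` and the constant `LEVEL` are assumptions introduced here, the latter standing for the 5 per cent. convention of section 7.

```python
from fractions import Fraction
from math import comb

LEVEL = Fraction(5, 100)  # the conventional 5 per cent. standard


def chance_of_at_least(right: int, n: int = 4) -> Fraction:
    """Chance, absent any discrimination, of `right` or more correct
    judgments of each kind when there are n cups of each kind."""
    total = comb(2 * n, n)
    favourable = sum(comb(n, r) * comb(n, n - r) for r in range(right, n + 1))
    return Fraction(favourable, total)


p_three_or_better = chance_of_at_least(3)  # 16 + 1 cases out of 70
p_all_four = chance_of_at_least(4)         # 1 case out of 70

print(p_three_or_better, p_three_or_better < LEVEL)
print(p_all_four, p_all_four < LEVEL)
```

Only the perfect score falls below the 5 per cent. level; 3 right and 1 wrong, with 17 cases out of 70 as good or better, does not.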

8. The Null Hypothesis

Our examination of the possible results of theexperiment has therefore led us to a statistical test ofsignificance, by which these results are divided intotwo classes with opposed interpretations. Tests ofsignificance are of many different kinds, which neednot be considered here. Here we are only concernedwith the fact that the easy calculation in permutationswhich we encountered. and which gave us our test ofsignificance, stands for something present in everypossible experimental arrangement; or, at least, forsomething required in its interpretation. The twoclasses of results which are distinguished by our test ofsignificance are, on the one hand, those which show asignificant discrepancy from a certain hypothesis;namely, in this case, the hypothesis that the judgments

w.ithout specifying what these death-rates actually are.In such cases it is evidently the equality rather thanany particular values of the death-rates that the experiment is designed to test, and possibly to disprove.

In cases involving statistical "estimation" theseideas may he extended to the simultaneous considerationof a series of hypothetical possibilities. The notion ofan error of the so-called" second kind," due to acceptingthe null hypothesis H when it is false" may then begiven a meaning in reference to the quantity to beestimated. It has no meaning with respect to simpletests of significance, in which the only available expectations are those which flow from the null hypothesisbeing true. Problems of the more elaborate typeinvolving estimation are discussed in Chapter IX.

9. Randomisation; the Physical Basis of the Validityof the Test

vVe have spoken of the experiment as testing acertain null hypothesis, namely, in this case, that thesubject possesses no sensory discrimination whatever ofthe kind claimed; we have, too, assigned as appropriateto this hypothesis a certain frequency distribution ofoccurrences, based on the equal frequency of the 70possible ways of assigning 8 objects to two classes of4 each; in other words, the frequency distributionappropriate to a classification by pure chance. Wehave now to examine the physical conditions of theexperimental technique needed to justify the assumpti?nthat, if discrimination of the kind under test is absent,the result of the experiment will he wholly governedby the laws of chance. I t is easy to see that it mightwell be otherwise. If all those cups made with themilk first had sugar added, while those made with thetea first had I)one. a very obvious difference in flavour

B

17RANDOMISATION16 THE l'lUNCIPLES OF EXPERIMENTATION

given are in no way influenced by the order in whichthe ingredients have been added; and on the otherhand, results which show no significant discrepancyfrom this hypothesis. This hypothesis, which mayormay not be impugned by the result of an experiment, _is again characteristic of all experimentation. Muchconfusion would often be avoided if it were explicitlyformulated when the experiment is designed. In relationto any experiment we may speak of this hypothesis asthe " null hypothesis," and it should be noted that thenull hypothesis is never proved or established, but ispossibly disproved, in the course of experimentation.Every experiment may be said to exist only in orderto give the facts a chance of disproving the null hypothesis.

I t might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensorydiscrimination between two different sorts of object, itmust therefore be able to prove the opposite hypothesis,that she can make some such discrimination. But thislast hypothesis, however reasonable or true it may be,is ineligible as a null hypothesis to be tested by experiment, because it is inexact. If it were asserted that thesubject would never be "'Tong in her judgments weshould again have an exact hypothesis, and it is easyto see that this hypothesis could be disproved by asingle failure, but could never be proved by any finiteamount of experimentation. It is evident that the nullhypothesis must he exact, that is free from vaguenessand ambiguity, because it must supply the basis of the" problem of distribution," of which the test of significance is the solution. A null hypothesis may, indeed,contain arbitrary elements, and in more complicatedcases often does so: as, for example, if it should assertthat the death-rates of two groups of animals are equal.

particular, to which causes of disturbance care shouldbe given, and which ought to be deliberately ignored.To ascertain, too, fot those which are not to be ignored,to what extent it is worth while to. take the trouble todiminish their magnitude. For our present purpose,however, it is only necessary to recognise that, whateverdegree of care and experimental skill is expended inequalising the conditions, other than the one. undertest, which are liable to affect the result, this equalisationmust always be to a greater or less extent incomplete,and in many important practical cases will certainly begrossly defective. We are concerned, therefore, thatthis inequality, whether it be great or small, shall notimpugn the exactitude of the frequency distribution,on the basis of which the result of the experiment is tobe appraised.

10. The Elfectiveness of Randomisation

The element in the experimental procedure whichcontains the essential safeguard is that the two modifications of the test beverage are to be prepared "inrandom order." This, in fact, is the only point in theexperimental procedure in which the laws of chance,which are to be in exclusive control of our frequencydistribution, have been explicitly introduced. Thephrase "random order" itself, .however, must beregarded as an incomplete instruction, standing as akind of shorthand symbol for the full procedure ofrandomisation, by which the validity of the test' ofsignificance may be guaranteed against corruption bythe causes of disturbance which have not been eliminated.To demonstrate that, with satisfactory randomisation,its validity is, indeed, wholly unimpaired, let us imagineall causes of disturbance--the strength of the infusion;the quantity of milk, the temperature at which it i~


would have been introduced which might well ensure that all those made with sugar should be classed alike. These groups might either be classified all right or all wrong, but in such a case the frequency of the critical event in which all cups are classified correctly would not be 1 in 70, but 35 in 70 trials, and the test of significance would be wholly vitiated. Errors equivalent in principle to this are very frequently incorporated in otherwise well-designed experiments.

It is no sufficient remedy to insist that "all the cups must be exactly alike" in every respect except that to be tested. For this is a totally impossible requirement in our example, and equally in all other forms of experimentation. In practice it is probable that the cups will differ perceptibly in the thickness or smoothness of their material, that the quantities of milk added to the different cups will not be exactly equal, that the strength of the infusion of tea may change between pouring the first and the last cup, and that the temperature also at which the tea is tasted will change during the course of the experiment. These are only examples of the differences probably present; it would be impossible to present an exhaustive list of such possible differences appropriate to any one kind of experiment, because the uncontrolled causes which may influence the result are always strictly innumerable. When any such cause is named, it is usually perceived that, by increased labour and expense, it could be largely eliminated. Too frequently it is assumed that such refinements constitute improvements to the experiment. Our view, which will be much more fully exemplified in later sections, is that it is an essential characteristic of experimentation that it is carried out with limited resources, and an essential part of the subject of experimental design to ascertain how these should be best applied; or, in

there will be none, by definition, or (c) effects supervening by chance independently from the treatments applied. Apart, therefore, from the avoidable error of the experimenter himself introducing with his test treatments, or subsequently, other differences in treatment, the effects of which the experiment is not intended to study, it may be said that the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged.

11. The Sensitiveness of an Experiment. Effects of Enlargement and Repetition

A probable objection, which the subject might well make to the experiment so far described, is that only if every cup is classified correctly will she be judged successful. A single mistake will reduce her performance below the level of significance. Her claim, however, might be, not that she could draw the distinction with invariable certainty, but that, though sometimes mistaken, she would be right more often than not; and that the experiment should be enlarged sufficiently, or repeated sufficiently often, for her to be able to demonstrate the predominance of correct classifications in spite of occasional errors.

An extension of the calculation upon which the test of significance was based shows that an experiment with 12 cups, six of each kind, gives, on the null hypothesis, 1 chance in 924 for complete success, and 36 chances for 5 of each kind classified right and 1 wrong. As 37 is less than a twentieth of 924, such a test could be counted as significant, although a pair of cups have been wrongly classified; and it is easy to verify that, using larger numbers still, a significant result could be obtained with a still higher proportion of errors. By


tasted, etc.--to be predetermined for each cup; then since these, on the null hypothesis, are the only causes influencing classification, we may say that the probabilities of each of the 70 possible choices or classifications which the subject can make are also predetermined. If, now, after the disturbing causes are fixed, we assign, strictly at random, 4 out of the 8 cups to each of our experimental treatments, then every set of 4, whatever its probability of being so classified, will certainly have a probability of exactly 1 in 70 of being the 4, for example, to which the milk is added first. However important the causes of disturbance may be, even if they were to make it certain that one particular set of 4 should receive this classification, the probability that the 4 so classified and the 4 which ought to have been so classified should be the same, must be rigorously in accordance with our test of significance.
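The argument of this paragraph can be checked by enumeration. The sketch below is only an illustration: the cup labels 0 to 7, the function name `match_probability` and the two example sets of choice probabilities are assumptions introduced here, not part of the text. It supposes that the disturbing causes have fixed, in advance, the probability with which the subject will name each set of 4 cups, and then averages over the 70 equally likely random assignments.

```python
from fractions import Fraction
from itertools import combinations

# Label the 8 cups 0..7 (labels are illustrative). There are 70 ways to
# pick the 4 cups that receive the milk first.
CUPS = range(8)
ASSIGNMENTS = list(combinations(CUPS, 4))


def match_probability(choice_probabilities):
    """Probability that the subject's chosen set equals the randomised set.

    choice_probabilities maps a sorted 4-tuple of cup labels to the
    predetermined probability that the subject names that set. The random
    assignment is uniform over all 70 sets and independent of her choice.
    """
    p_assignment = Fraction(1, len(ASSIGNMENTS))
    total = Fraction(0)
    for assigned in ASSIGNMENTS:
        total += p_assignment * choice_probabilities.get(assigned, Fraction(0))
    return total


# Extreme case: disturbances make one particular set certain to be chosen.
certain = {(0, 1, 2, 3): Fraction(1)}

# A heavily skewed case with three possible choices (hypothetical weights).
skewed = {
    (0, 1, 2, 3): Fraction(1, 2),
    (4, 5, 6, 7): Fraction(1, 3),
    (0, 2, 4, 6): Fraction(1, 6),
}

print(len(ASSIGNMENTS))            # number of possible assignments
print(match_probability(certain))  # chance of a complete match
print(match_probability(skewed))   # unchanged by the skew
```

However the choice probabilities are distributed, the chance of a complete match comes out the same, because each set has the same chance of being the one actually assigned.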

It is apparent, therefore, that the random choice of the objects to be treated in different ways would be a complete guarantee of the validity of the test of significance, if these treatments were the last in time of the stages in the physical history of the objects which might affect their experimental reaction. The circumstance that the experimental treatments cannot always be applied last, and may come relatively early in their history, causes no practical inconvenience; for subsequent causes of differentiation, if under the experimenter's control, as, for example, the choice of different pipettes to be used with different flasks, can either be predetermined before the treatments have been randomised, or, if this has not been done, can be randomised on their own account; and other causes of differentiation will be either (a) consequences of differences already randomised, or (b) natural consequences of the difference in treatment to be tested, of which on the null hypothesis


increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis. Since in every case the experiment is capable of disproving, but never of proving this hypothesis, we may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved.
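The figures quoted for the twelve-cup experiment (1 chance in 924 for complete success, and 36 chances for 5 of each kind right and 1 wrong) can be verified by direct counting. The following is a minimal sketch; the function name `tail_counts` is an assumption made for illustration. It relies on the fact that, with n cups of each kind, the number of classifications with exactly k cups of each kind misplaced is the square of the number of ways of choosing k things out of n.

```python
from math import comb


def tail_counts(cups_per_kind, max_wrong_pairs):
    """Count classifications with at most max_wrong_pairs misplaced pairs.

    Returns (favourable, total), where total is the number of ways of
    naming cups_per_kind cups out of 2 * cups_per_kind as one treatment.
    Exactly k misplaced pairs can arise in comb(n, k) ** 2 ways.
    """
    n = cups_per_kind
    total = comb(2 * n, n)
    favourable = sum(comb(n, k) ** 2 for k in range(max_wrong_pairs + 1))
    return favourable, total


# Original design: 8 cups, complete success only.
print(tail_counts(4, 0))

# Original design, allowing one misplaced pair: far above 1 in 20.
print(tail_counts(4, 1))

# Enlarged design: 12 cups, allowing one misplaced pair.
favourable, total = tail_counts(6, 1)
print(favourable, total, favourable * 20 < total)
```

With 8 cups, admitting a single misplaced pair gives 17 chances in 70, which is nowhere near significance; with 12 cups the corresponding count falls below a twentieth of the total, as the text states.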

The same result could be achieved by repeating the experiment, as originally designed, upon a number of different occasions, counting as a success all those occasions on which 8 cups are correctly classified. The chance of success on each occasion being 1 in 70, a simple application of the theory of probability shows that 2 or more successes in 10 trials would occur, by chance, with a frequency below the standard chosen for testing significance; so that the sensory discrimination would be demonstrated, although, in 8 attempts out of 10, the subject made one or more mistakes. This procedure may be regarded as merely a second way of enlarging the experiment and, thereby, increasing its sensitiveness, since in our final calculation we take account of the aggregate of the entire series of results, whether successful or unsuccessful. It would clearly be illegitimate, and would rob our calculation of its basis, if the unsuccessful results were not all brought into the account.
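The "simple application of the theory of probability" referred to here is a binomial calculation. A hedged sketch follows; the helper name `prob_at_least` is an assumption, and the 1 in 20 threshold is taken to be the standard of significance adopted earlier in the chapter.

```python
from fractions import Fraction
from math import comb


def prob_at_least(successes, trials, p):
    """Binomial probability of at least `successes` successes in `trials`
    independent trials, each succeeding with probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p_success = Fraction(1, 70)   # chance of classifying all 8 cups correctly
threshold = Fraction(1, 20)   # the 5 per cent. standard of significance

one_or_more = prob_at_least(1, 10, p_success)
two_or_more = prob_at_least(2, 10, p_success)

print(float(one_or_more), one_or_more < threshold)
print(float(two_or_more), two_or_more < threshold)
```

On these assumptions a single success in 10 trials would not be significant, whereas 2 or more successes would occur by chance with a frequency of rather less than 1 in 100.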

12. Qualitative Methods of increasing Sensitiveness

Instead of enlarging the experiment we may attempt to increase its sensitiveness by qualitative improvements; and these are, generally speaking, of two kinds: (a) the reorganisation of its structure, and (b) refinements of technique. To illustrate a change of structure we


might consider that, instead of fixing in advance that 4 cups should be of each kind, determining by a random process how the subdivision should be effected, we might have allowed the treatment of each cup to be determined independently by chance, as by the toss of a coin, so that each treatment has an equal chance of being chosen. The chance of classifying correctly 8 cups randomised in this way, without the aid of sensory discrimination, is 1 in 2^8, or 1 in 256 chances, and there are only 8 chances of classifying 7 right and 1 wrong; consequently the sensitiveness of the experiment has been increased, while still using only 8 cups, and it is possible to score a significant success, even if one is classified wrongly. In many types of experiment, therefore, the suggested change in structure would be evidently advantageous. For the special requirements of a psycho-physical experiment, however, we should probably prefer to forego this advantage, since it would occasionally occur that all the cups would be treated alike, and this, besides bewildering the subject by an unexpected occurrence, would deny her the real advantage of judging by comparison.
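The counts for the coin-tossing design can be reproduced in a few lines. This is only a sketch under the stated null hypothesis, in which each of the 2^8 possible sets of judgments is equally likely to be the correct one; the names `prob_at_most_wrong` and `all_alike` are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

CUPS = 8
TOTAL = 2**CUPS  # 256 equally likely outcomes when each cup is tossed for


def prob_at_most_wrong(max_wrong):
    """Chance, without discrimination, of at most max_wrong misjudged cups."""
    ways = sum(comb(CUPS, k) for k in range(max_wrong + 1))
    return Fraction(ways, TOTAL)


all_right = prob_at_most_wrong(0)
one_wrong_or_better = prob_at_most_wrong(1)

# Chance that the coin gives every cup the same treatment (all heads or
# all tails), the drawback noted for a psycho-physical experiment.
all_alike = Fraction(2, TOTAL)

print(all_right, one_wrong_or_better, float(one_wrong_or_better))
print(all_alike)
```

The 9 chances in 256 for at most one error lie below 1 in 20, which is why a single mistake is tolerated in this design, while the chance of presenting 8 identical cups is 1 in 128.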

Another possible alteration to the structure of the experiment, which would, however, decrease its sensitiveness, would be to present determined, but unequal, numbers of the two treatments. Thus we might arrange that 5 cups should be of the one kind and 3 of the other, choosing them properly by chance, and informing the subject how many of each to expect. But since the number of ways of choosing 3 things out of 8 is only 56, there is now, on the null hypothesis, a probability of a completely correct classification of 1 in 56. It appears in fact that we cannot by these means do better than by presenting the two treatments in equal numbers, and the choice of this equality is now seen to be


methods of obtaining equivalent results; and will be wise to choose whichever method is easiest to him, irrespective of the fact that previous experimenters may have tried, and recommended as very important, or even essential, various ingenious and troublesome precautions.

12.1. Scientific Inference and Acceptance Procedures

In "The Improvement of Natural Knowledge", that is, in learning by experience, or by planned chains of experimentation, conclusions are always provisional and in the nature of progress reports, interpreting and embodying the evidence so far accrued. Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1%, we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger, or weaker. The situation is entirely different in the field of Acceptance Procedures, in which irreversible action may have to be taken, and in which, whichever decision is arrived at, it is quite immaterial whether it is arrived at on strong evidence or on weak. All that is needed is a Rule of Action which is to be taken automatically, and without thought devoted to the individual decision. The procedure as a whole is arrived at by minimising the losses due to wrong decisions, or to unnecessary testing, and to frame such a procedure successfully the cost of such faulty decisions must be assessed in advance; equally, also, prior knowledge is required of the expected distribution of the material in supply. In the field of pure research no assessment of the cost of wrong conclusions, or of delay in arriving at more correct conclusions can conceivably be more than a pretence, and in any case



justified by its giving to the experiment its maximal sensitiveness.
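That equal numbers give the greatest sensitiveness can be seen by tabulating, for 8 cups, the number of equally likely assignments under each fixed split. A minimal sketch, with the names `TOTAL_CUPS`, `assignments_by_split` and `best_split` introduced here purely for illustration:

```python
from math import comb

TOTAL_CUPS = 8

# Number of ways of choosing which cups receive the first treatment,
# for every split from 1-and-7 to 7-and-1.
assignments_by_split = {
    k: comb(TOTAL_CUPS, k) for k in range(1, TOTAL_CUPS)
}

for k, ways in assignments_by_split.items():
    print(f"{k} and {TOTAL_CUPS - k}: complete success has 1 chance in {ways}")

# The split with the most assignments makes chance success least likely.
best_split = max(assignments_by_split, key=assignments_by_split.get)
print(best_split, assignments_by_split[best_split])
```

The 5 and 3 split yields 56 assignments against 70 for the equal split, and every more unequal split yields fewer still.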

With respect to the refinements of technique, we have seen above that these contribute nothing to the validity of the experiment, and of the test of significance by which we determine its result. They may, however, be important, and even essential, in permitting the phenomenon under test to manifest itself. Though the test of significance remains valid, it may be that without special precautions even a definite sensory discrimination would have little chance of scoring a significant success. If some cups were made with India and some with China tea, even though the treatments were properly randomised, the subject might not be able to discriminate the relatively small difference in flavour under investigation, when it was confused with the greater differences between leaves of different origin. Obviously, a similar difficulty could be introduced by using in some cups raw milk and in others boiled, or even condensed milk, or by adding sugar in unequal quantities. The subject has a right to claim, and it is in the interests of the sensitiveness of the experiment, that gross differences of these kinds should be excluded, and that the cups should, not as far as possible, but as far as is practically convenient, be made alike in all respects except that under test.

How far such experimental refinements should be carried is entirely a matter of judgment, based on experience. The validity of the experiment is not affected by them. Their sole purpose is to increase its sensitiveness, and this object can usually be achieved in many other ways, and particularly by increasing the size of the experiment. If, therefore, it is decided that the sensitiveness of the experiment should be increased, the experimenter has the choice between different


such an assessment would be inadmissible and irrelevant in judging the state of the scientific evidence; moreover, accurately assessable prior information is ordinarily known to be lacking. Such differences between the logical situations should be borne in mind whenever we see tests of significance spoken of as "Rules of Action". A good deal of confusion has certainly been caused by the attempt to formalise the exposition of tests of significance in a logical framework different from that for which they were in fact first developed.

REFERENCES AND OTHER READING

R. A. FISHER (1925-1963). Statistical methods for research workers. Chap. III., §§ 15-19.

R. A. FISHER (1926). The arrangement of field experiments. Journal of Ministry of Agriculture, xxxiii. 503-513.


III

A HISTORICAL EXPERIMENT ON GROWTH RATE

13. WE have illustrated a psycho-physical experiment, the result of which depends upon judgments, scored "right" or "wrong," and may be appropriately interpreted by the method of the classical theory of probability. This method rests on the enumeration of the frequencies with which different combinations of right or wrong judgments will occur, on the hypothesis to be tested. We may now illustrate an experiment in which the results are expressed in quantitative measures, and which is appropriately interpreted by means of the theory of errors.

In the introductory remarks to his book on "The effects of cross and self-fertilisation in the vegetable kingdom," Charles Darwin gives an account of the considerations which guided him in the design of his experiments and in the presentation of his data, which will serve well to illustrate the principles on which biological experiments may be made conclusive. The passage is of especial interest in illustrating the extremely crude and unsatisfactory statistical methods available at the time, and the manner in which careful attention to commonsense considerations led to the adoption of an experimental design, in itself greatly superior to these methods of interpretation.

14. Darwin's Discussion of the Data

"I long doubted whether it was worth while to give the measurements of each separate plant, but have
