6
CONTEMPORARY PSYCHOLOGY, 1991, VOL. 36, NO. 2 (PP. 102-105) NOTE: This reprint created by the G.Loftus. Page numbers do not correspond to those in the original article. 1 CONTEMPORARY PSYCHOLOGY, 1991, Vol. 36, No. 2 On the Tyranny of Hypothesis Testing in the Social Sciences Gerd Gigerenzer, Zeno Swijink, Theodore Porter, Lorraine Daston, John Beatty, and Lorenz Krüger, The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge England: Cambridge University Press, 1989. 340 pp. IBSN 0-521-33115-3. $44.50 Review by: Geoffrey R. Loftus Gerd Gigerenzer, professor of psychology at the University of Salzburg (Austria), is coeditor, with L. Krüger, L.J. Daston, and M. Heidelberger, of The Probabilistic Revolution, Vol. 1: Ideas in History. Zeno Swijtink is assistant professor f history and philosophy of science at Indiana University Bloomington. Theodore Porter is associate professor of history at the University of Virginia (Charlottesville) and author of The Rise of Statistical Thinking. Lorraine Daston is pro- fessor of the history of science at the University of Göttingen (Germany) and author of Classical Probability in the Enlightenment. John Beatty is associate professor of the history of science and technology at the University of Minnesota (Minneapolis). Lorenz Krüger is professor of philoso- phy at the University of Göttingen (Germany). Geoffrey R. Loftus, professor of psychology at the University of Washington (Seattle), is coauthor, with E.F. Loftus of Essence of Statistics (2nd ed.). The Empire of Chance is about the history and current use of probability theory and statis- tics. The book provides a broad treatment of these topics; one could, accordingly, read or review it from quite a variety of different per- spectives. Because this review is for psychol- ogists, I will organize it around the book's in- sights into a question that I believe is at the heart of much malaise in psychological re- search: how has the virtually barren technique of hypothesis testing come to assume such im- portance in the process by which we arrive at our conclusions from our data? In what follows, I first describe why this question is timely and important. I then provide a brief synopsis of the book. And finally, I detail the book's answers to the question. The Ascent of Hypothesis Testing Since the 1940s, the practice of hypothesis testing has been seeping into all nooks and crannies of social-science methodology. Today, hypothesis testing constitutes the major foundation of data analysis in experimental psychology: it is used to justify conclusions from data in over 90% of articles in major psy- chology journals. Gigerenzer et al. underscore hypothesis testing's importance again and again - perhaps most dramatically when they para- phrase the editorial edicts issued by Arthur Melton (1962) upon assuming editorship of the august Journal of Experimental Psychology. "Melton's message was, in short, that manuscripts that did not reject the null hypothesis were al- most never published, and that results significant only at the 0.05 level were barely acceptable, whereas those significant at the 0.01 level de- served a place in the journal. Psychology stu- dents could no longer avoid statistics, and the experimenter who hoped to publish could no longer avoid a test of significance." (p. 206) The Molding of Young Minds We psychologists all learned about hy- pothesis testing during our undergraduate days. Many of us remember thinking at the time that

Loftus91 Tyranny

Embed Size (px)

DESCRIPTION

Loftus91 Tyranny

Citation preview

Page 1: Loftus91 Tyranny

CONTEMPORARY PSYCHOLOGY, 1991, VOL. 36, NO. 2 (PP. 102-105)NOTE: This reprint created by the G.Loftus. Page numbers do not correspond to those in the original article.

1 CONTEMPORARY PSYCHOLOGY, 1991, Vol. 36, No. 2

On the Tyranny of HypothesisTesting in the Social SciencesGerd Gigerenzer, Zeno Swijink,Theodore Porter, Lorraine Daston, JohnBeatty, and Lorenz Krüger,The Empire of Chance: HowProbability Changed Science andEveryday Life. Cambridge England:Cambridge University Press, 1989. 340pp. IBSN 0-521-33115-3. $44.50

Review by:Geoffrey R. Loftus

Gerd Gigerenzer, professor of psychology at the University of Salzburg (Austria), is coeditor,with L. Krüger, L.J. Daston, and M. Heidelberger, of The Probabilistic Revolution, Vol. 1: Ideas inHistory. •Zeno Swijtink is assistant professor f history and philosophy of science at IndianaUniversity Bloomington. •Theodore Porter is associate professor of history at the University ofVirginia (Charlottesville) and author of The Rise of Statistical Thinking. •Lorraine Daston is pro-fessor of the history of science at the University of Göttingen (Germany) and author of ClassicalProbability in the Enlightenment. •John Beatty is associate professor of the history of science andtechnology at the University of Minnesota (Minneapolis). •Lorenz Krüger is professor of philoso-phy at the University of Göttingen (Germany). •Geoffrey R. Loftus, professor of psychology at theUniversity of Washington (Seattle), is coauthor, with E.F. Loftus of Essence of Statistics (2nd ed.).

The Empire of Chance is about the historyand current use of probability theory and statis-tics. The book provides a broad treatment ofthese topics; one could, accordingly, read orreview it from quite a variety of different per-spectives. Because this review is for psychol-ogists, I will organize it around the book's in-sights into a question that I believe is at theheart of much malaise in psychological re-search: how has the virtually barren techniqueof hypothesis testing come to assume such im-portance in the process by which we arrive atour conclusions from our data?

In what follows, I first describe why thisquestion is timely and important. I then providea brief synopsis of the book. And finally, Idetail the book's answers to the question.

The Ascent of Hypothesis TestingSince the 1940s, the practice of hypothesis

testing has been seeping into all nooks and

crannies of social-science methodology.Today, hypothesis testing constitutes the majorfoundation of data analysis in experimentalpsychology: it is used to justify conclusionsfrom data in over 90% of articles in major psy-chology journals. Gigerenzer et al. underscorehypothesis testing's importance again and again- perhaps most dramatically when they para-phrase the editorial edicts issued by ArthurMelton (1962) upon assuming editorship of theaugust Journal of Experimental Psychology."Melton's message was, in short, that manuscriptsthat did not reject the null hypothesis were al-most never published, and that results significantonly at the 0.05 level were barely acceptable,whereas those significant at the 0.01 level de-served a place in the journal. Psychology stu-dents could no longer avoid statistics, and theexperimenter who hoped to publish could nolonger avoid a test of significance." (p. 206)

The Molding of Young MindsWe psychologists all learned about hy-

pothesis testing during our undergraduate days.Many of us remember thinking at the time that

Page 2: Loftus91 Tyranny

GEOFFREY R. LOFTUS HYPOTHESIS TESTING

CONTEMPORARY PSYCHOLOGY, 1991, Vol. 36, No. 2 2

it seemed kind of backward and perverse. Youestablish a null hypothesis, usually somethinglike "the population means are the same in allthe experimental conditions," or "the popula-tion correlation is zero." Some statistic like anF or a t or a z then provides evidence for oragainst the null hypothesis's plausibility.Depending on some arbitrary value of thestatistic's magnitude, you either reject or fail toreject the null hypothesis.

No one ever seemed to know exactly whathypothesis testing could tell you that was at allinteresting or important. Many budding psy-chologists (perhaps in wishful desperation)came to believe in a variety of murky and gen-erally incorrect implications of the process (forexample, that the p value's magnitude tells ussomething about an effect's magnitude or an ef-fect's replicability or the probability that the nullor alternative hypothesis is true or false).Somewhere along the line, however, we all in-ternalized one lesson that is entirely correct: themore you reject the null hypothesis, the morelikely it is that you'll get tenure.

Deficits of Hypothesis TestingDespite the stranglehold that hypothesis

testing has on experimental psychology, I findit difficult to imagine a less insightful means oftransiting from data to conclusions. A list (byno means exhaustive) of its more glaringdeficits are as follows.

What's the point of rejecting a hypothesisyou know is false to begin with? One is hard-pressed to think of a situation in which a nullhypothesis might plausibly be true. Consider atypical experiment in which, say, one isexamining the difference between two clinicaltreatments. One group of people is givenTreatment A and the other is given Treatment B.The null hypothesis is that the population meanoutcome measures are exactly the same for thetwo treatments.

No one would seriously consider this hy-pothesis to be literally true. So the results of ahypothesis test can only tell you is whether youhave sufficient experimental power to detect thepresence of whatever treatment effect must in-evitably exist. Yet the conclusions from theexperiment (and its suitability for publication)rest entirely on whether one is able to reject the

null hypothesis. It's bizarre. Gigerenzer et al.provide an apt summary from Nunnally (1960):"if rejection of the null hypothesis were the realintention in psychological experiments, thereusually would be no need to gather data." (p.210)

In fairness I should issue two caveats.First, some writers, (e.g., Hays, 1973) suggestthe use of a "null range" rather than an exactnull hypothesis. However a null range (1) israrely suggested (Hays himself tucks it away inan end-of-the-book Bayesian-methods section),(2) is even more rarely (if ever) used in practice,and (3) is more a kind of contorted interval-estimation technique than a bona-fide hy-pothesis-testing technique.

The second caveat is that, while difficult, itis not impossible to concoct experimental sit-uations in which a null hypothesis may beplausible and rejection of the null hypothesisinteresting (attempts to demonstrate the exis-tence of parapsychological phenomena provideexamples). However, the vast majority of so-cial-science experiments are of the TreatmentA/Treatment B variety rather than of the plau-sible-null-hypothesis variety.

Where are the error bars i n social-science journals? The emphasis on hypothesistesting produces a concomitant de emphasis onan alternative technique for coping withstatistical error that is simple, direct, intuitive,and has wide acceptance in the natural sciences:the use of confidence intervals. Whereas hy-pothesis testing emphasizes a very narrowquestion (do the population means fail to con-form to specific pattern?) the use of confidenceintervals emphasizes a much broader question(what are the population means?). Knowingwhat the means are, of course, implies knowingwhether they fail to conform to a specificpattern, although the reverse is not true. In thissense, use of confidence intervals subsumes theprocess of hypothesis testing.

The complications of experimentalpower. The emphasis on rejecting the null hy-pothesis also produces a de emphasis on exper-imental power (e.g., Guilford's widely read,Fundamental Statistics in Psychology andEducation declared power, as late as 1956, to be"too complicated to discuss.") Things haveimproved in the intervening 35 years, but only

Page 3: Loftus91 Tyranny

GEOFFREY R. LOFTUS HYPOTHESIS TESTING

3 CONTEMPORARY PSYCHOLOGY, 1991, Vol. 36, No. 2

marginally so. When power is discussed, it istypically discussed within the context of hy-pothesis testing. The focus, accordingly, is onthe probability of a Type-II error which is fairlymeaningless, given (1) the almost universal lackof quantitative alternative hypotheses and (2)the implausibility of the null hypothesis tobegin with. (All readers whose students can'tseem to understand power when it's discussedin this way, please raise your hands. Ha! Nowonder Guilford gave up). A more enlighten-ing exposition of power would relate it to con-fidence intervals: the more power you have, thesmaller are your confidence intervals, i.e., thebetter your knowledge of where populationmeans are.

Biases against the null hypothesis. Thereis a profound asymmetry between conclusionsissuing from rejecting the null hypothesis onthe one hand, and failing to reject it on the other(see Greenwald, 1975 for a lengthy discussionof this bias). Somewhat ironically, the mainreason for this asymmetry is that, as noted,accepting the null hypothesis is almost always aguaranteed error. However, if there is sufficientexperimental power, then failing to reject a nullhypothesis can be interesting anyway, as itimplies that some specific model can accountfor the data reasonably well. Although there isa small tradition within psychology ofconclusion-by-root-mean-square-error, thistradition is overwhelmed by the alternative ofconclusion-by-p-value.

Off-the-shelf assumptions. Finally theprocess of hypothesis testing forces (or at leastnudges) theoretical psychology into a fairy-taleland wherein the assumptions and traditions ofhypothesis testing (usually parametrichypothesis testing) are all correct. Variancesare unaffected by treatments, variables aredistributed normally, monotonic relationshipsare linear relationships, and all results arereduced to binary statements about the presenceor absence of main and interaction effectswithin a linear model.

It isn't actually the validity of these as-sumptions that I worry about. It's well knownthat most relevant sampling distributions arerobust against most assumption violations (and,of course, there are always non-parametrictests). Rather, the problem is that our early

(and continuing) imprinting on these assump-tions and traditions engenders strong biasesagainst formulating theories incorporatingother, perhaps more interesting and realistic as-sumptions. Thus, psychological theory be-comes generic analysis-of-variance theory andthe potential for insight is lost. To paraphraseFreedman, Rothenberg, and Sutch (1983), wholeveled analogous complaints about standardregression analysis in econometric theory, itseems that off-the-shelf assumptions produceoff-the-shelf conclusions.

A Synopsis of the BookThe point of this lengthy prologue has

been to argue that, given its rather severe short-comings, the question of "why hypothesistesting" is a baffling one. What light doGigerenzer et al. have to shed on this question?Let me start the answer by providing a briefsketch of the book itself.

Historical OriginsThe Empire of Chance accomplishes two

goals that, roughly, constitute the first and sec-ond halves of the book. The first goal is toprovide a history of the twin disciplines ofprobability and statistical theory. The saga be-gins at the beginning with Pascal in the 17thcentury. Then, driven by the triple influences ofeconomics (on what basis should we set in-surance rates?), science (how are we supposedto deal with variability in our data?), and phi-losophy (what is meant by chance, anyway?), itwends its way to modern times when, in thewords of the penultimate chapter title,"Numbers rule the world."

The authors emphasize controversy. Rightfrom the start, they point out, different thinkersand practitioners of probability and statisticshad rather different ways of viewing both fun-damental concepts and their applications topractical issues. These differences never seemto show up in the social sciences either in text-books or in lectures. I had never, for example,quite realized how much Ronald Fisher andEgon Pearson disdained each other's view-points. The descriptions of their differences areinstructive for the practitioner because they un-derscore two very different ways of applyingprobability theory to statistical issues that havesomehow become merged and misbegotten.

Page 4: Loftus91 Tyranny

GEOFFREY R. LOFTUS HYPOTHESIS TESTING

CONTEMPORARY PSYCHOLOGY, 1991, Vol. 36, No. 2 4

(The descriptions are also entertaining as gos-sip).

Current PracticeThe book's second goal is to illustrate ap-

plications of probability and statistics in fourfields: biology, physics, psychology, and "thereal world" (e.g., sports). In each of these in-stances two distinct themes emerge. The firstcarries over the book's historical motif, dealingwith the question: how did the current usage ofprobability and statistics emerge over time?The second theme involves the interplay ofdata-analysis techniques on the one hand andtheoretical development on the other. Withinpsychology, for example, a prominent viewarising from the statistical tradition, that of"mind as statistician," permeates subfields asdiverse as signal-detection theory and causalreasoning.

On Hypothesis TestingThe Empire of Chance is a wide-ranging

book. Its insight about the ascent of hypothesistesting constitutes but one of several themes ofinterest to scientists in general and psycholo-gists in particular. But to me it is the most in-teresting theme and its inclusion is the mostimportant reason that psychologists should readthe book. As described by Gigerenzer et al.,there are two major reasons why hypothesistesting has come to enjoy its present statelyposition.

Miscast ObjectivityFirst, for mid-20th-century experimental

psychologists (status hungry, perhaps, amidsttheir natural-science colleagues) hypothesistesting provided the illusion of endowing psy-chological data - which are intrinsically compli-cated, messy, multidimensional, and subjective -with a seductive simplicity and objectivity.Banished was the slovenly panoply of "eye-balling curves, personal judgment, descriptionwithout inference, or bargaining with thereader" to be replaced by one clean, simple, re-freshing rule: if p < .05 (or so) an effect is real;otherwise, it's not. In the domain of pure re-search, this switch eased the decision processfor journal editors, while in the domain ofpractical applications (particularly educationaland military applications) it provided re-

searchers with a new concept, easily explainableto policy makers, that could be used to justify(or denounce) the implementation of noveltechniques.

Statistical NewspeakThe second reason was that legions of

"Statistics for the Social Sciences" textbookwriters simply ignored history. They leftenormous gaps in their teachings (such asFisher's development of interval-estimationtechniques or his long-forgotten distinctionbetween a significant result and the demonstra-tion of a natural phenomenon) and they be-stowed upon their subject matter a consensus ofits founders that was simply never there tobegin with. Gigerenzer et al. illustrate this latterprocess as it applied to the controversy betweenFisher (who emphasized the binary test of asingle null hypothesis) and Pearson andNeyman (who viewed a statistical test as ameans of choosing among a slate of candidatehypotheses). The authors note that,...almost no [social science statistics] text pre-sented Neyman and Pearson's theory as an alter-native to Fisher's, still less as a competing theory.The great mass of texts tried to fuse the contro-versial ideas into some hybrid statistical the-ory...Of course this meant doing the impossible.But...statisticians were eager to sell, and psychol-ogists were eager to buy the method of inductiveinferences. The statistical texts now taught hy-brid statistics, of which neither Fisher nor, to besure, Neyman and Pearson would have ap-proved. The type-II errors became added tonull hypothesis testing (although it could not bedetermined in this context), Neyman andPearson's interpretation of the level of signifi-cance as the proportion of type-I errors in thelong run became mishmashed with Fisher's andso on. Whatever the textbooks taught, it was notindicated that some of the ideas stemmed fromFisher, others from Neyman and Pearson. Thehybrid statistics was presented anonymously, asif it were the only truth, as if there existed onlyone type of statistics. There was no mention ofthe existence of a deep controversy, much less ofthe controversial issues, nor of the existence ofalternative statistical theories...(p. 208)Students and practitioners of statistics were ac-cordingly left with the impression that "statis-tics is statistics" and all that is really necessaryis to learn the rules - which leaves very little in-centive or historical precedent for innovation

Page 5: Loftus91 Tyranny

GEOFFREY R. LOFTUS HYPOTHESIS TESTING

5 CONTEMPORARY PSYCHOLOGY, 1991, Vol. 36, No. 2

and creativity. This remarkable state of affairsis analogous to engineers teaching (and believ-ing) that light consists only of waves, while ig-noring its particle characteristics - and losing inthe process, of course, any motivation to pursuethe most interesting puzzles and paradoxes inthe field.

ConclusionsThe negative tone that I've adopted in this

review shouldn't be construed to reflect an op-position to the teaching and use of quantitativemethods. Social science research is, perforce,cursed with more than its share of statistical er-ror and inaccessible measures, which have to bedealt with somehow. A thorough knowledge of(at the very least) probability theory,measurement, and descriptive statistics is acritical component of any social scientist'smethodological arsenal.

The Empire of Chance provides an histor-ical context that constitutes a refreshing coun-terpoint to the tedious catechism delivered bythe vast majority of social-science statisticstextbooks that have appeared over the past fiftyyears. It is not a textbook itself, but it includes

most of the background information that anysocial scientist should have in order to realizethat probability theory and statistics is far morethan just a static collection of dry equations andinviolate rules.

ReferencesFreedman, D.A., Rothenberg, T. and Sutch, R.

(1983). On Energy Policy Models. Journalof Business and Economic Statistics, 1, 24-36.

Greenwald, A.G. (1975). Consequences ofprejudice against the null hypothesis.Psychological Bulletin, 83, 1-20.

Guilford, J.P. (1942). Fundamental Statistics inPsychology and Education. First Edition,1942; third edition, 1956. New York:McGraw-Hill.

Hays, W. (1973). Statistics for the SocialSciences (second edition). New York: Holt.

Melton, A.W. (1962). Editorial. Journal ofExperimental Psychology, 64, 553-557.

Nunnally, J. (1960). The place of statistics inpsychology. Educational andPsychological Measurement, 20, 641-650.

Page 6: Loftus91 Tyranny

GEOFFREY R. LOFTUS HYPOTHESIS TESTING

CONTEMPORARY PSYCHOLOGY, 1991, Vol. 36, No. 2 6