The Case against Null-Hypothesis Statistical Significance Tests: Flaws, Alternatives and Action Plans
Andreas Schwab, Iowa State University
Institute of Technology - Bandung
Universitas Gadjah Mada, April 30, 2014
Who would you really like to be here?
Bill Starbuck, University of Oregon
The Case Against Null Hypothesis Significance Testing: Flaws, Alternatives, and Action Plans
William Starbuck, Andreas Schwab, Eric Abrahamson, Bruce Thompson
Research Method Division, Professional Development Workshop
Atlanta 2006, Philadelphia 2007, Anaheim 2008, Chicago 2009, Montreal 2010, San Antonio 2011, Orlando 2013.
Donald Hatfield, Jose Cortina, Ray Hubbard, Lisa Lambert
Researchers Should Make Thoughtful Assessments Instead of Null-Hypothesis Significance Tests
Andreas Schwab, Iowa State University
Eric Abrahamson, Columbia University
Bill Starbuck, University of Oregon
Fiona Fidler, La Trobe University
Perspective Article
2011 Organization Science, 22(4), 1105-1120.
What is wrong with Null-Hypothesis Significance Testing?
Formal Statistics Perspective: Nothing!
Application Perspective: Nearly everything!
Main Message: NHST simply does not answer the questions we are really interested in. Our ritualized NHST applications impede scientific progress.
NHSTs have been controversial for a long time
Fisher proposed NHSTs in 1925
Immediately, Neyman & Pearson questioned testing a null-hypothesis without testing any alternative hypothesis.
Other complaints have been added over time.
Statistics textbooks teach a ritualized use of NHSTs without reference to these complaints.
Many scholars remain unaware of the strong arguments against NHSTs.
NHSTs make assumptions that many studies do not satisfy
NHSTs calculate statistical significance based on a sampling distribution for a random sample.
For any other type of sample, NHST results have no meaningful interpretation:
Non-random samples
Population data
Incomplete data, where the missing data are unlikely to be missing at random
NHSTs portray truth as dichotomous and definite (= real , important , and certain)
Either reject or fail to reject the null hypothesis.
Ritualized choice of same arbitrary significance levels for all studies (p < .05).
"Cliff effects" amplify very small differences in the data into very large differences in implications.
No explicit discussion or reporting of detailed uncertainty information, which impedes model testing and development (Dichotomous Thinking).
NHSTs do not answer the questions we are really interested in
H0: A new type of training has no effect on knowledge of nurses.
NHST estimates the probability of observing an effect at least as large as the one in our data due to random sampling alone -- assuming H0 is true.
If p is small, we consider H0 unlikely to be true.
... and we conclude training has an effect on nurses' knowledge.
NHSTs do not answer the questions we are really interested in
Problem 1: In most cases, we already know H0 is never true. Any intervention will have some effect – potentially small. (nil hypotheses)
Problem 2: Apparent validity of findings becomes a function of researchers' efforts because NHSTs are sensitive to sample size. (sample-size sensitivity)
Problem 3: The important question is not whether an effect is different from zero, but whether the effect is large enough to matter. (effect size evaluation)
Problem 4: NHSTs make no direct probability statements about whether H0 or H1 is true given the data. (inverse probability fallacy & infused meaning)
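Problem 2 can be illustrated with a minimal sketch. It assumes a two-group comparison with a known, equal standard deviation (so a simple z-test applies) and a hypothetical, trivially small effect of 0.05 standard deviations. None of these numbers come from the slides.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_value_for_mean_difference(effect: float, sd: float, n_per_group: int) -> float:
    """z-test p-value for a difference between two group means.

    Assumes both groups share the same known standard deviation `sd`.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return two_sided_p_from_z(effect / standard_error)


# Hypothetical effect: 0.05 standard deviations, held fixed while n grows.
for n in (50, 500, 5_000, 50_000):
    p = p_value_for_mean_difference(effect=0.05, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>6}: p = {p:.4f}")
```

The effect never changes, yet the p-value moves from clearly "not significant" to clearly "significant" purely because the researcher collected more data.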
Higher-order negative consequences of the ritualized NHST applications
Risks of false-positive findings
Risks of false-negative findings
Corrosion of research ethics
Higher-order consequences: Risk of false-positive findings
NHSTs use a low threshold for what is considered important (p < .05; typical sample sizes).
Empirical research is a search for "needles in a haystack" (Webster & Starbuck, 1988). In management research, the average correlation between unrelated variables is not zero but 0.09.
When choosing two variables at random, NHST offers a 67% chance of significant findings on the first try, and a 96% chance with three tries for average reported sample sizes.
Hence, we mistake lots of “straws” for “needles”.
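The step from 67% to 96% is simple arithmetic, sketched below under the assumption that the three tries are independent. The function name is illustrative.

```python
def chance_of_at_least_one_hit(p_single: float, tries: int) -> float:
    """Probability of at least one 'significant' result in `tries` independent attempts."""
    return 1.0 - (1.0 - p_single) ** tries


# 67% chance on a single try, as quoted on the slide.
print(round(chance_of_at_least_one_hit(0.67, 1), 2))  # 0.67
print(round(chance_of_at_least_one_hit(0.67, 3), 2))  # 0.96
```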
Second-order consequences: “Significant” findings often do not replicate
Published NHST research findings often do not replicate or duplicate. (Type 1 error)
Three-eighths of most cited and discussed medical treatments supported by significant results in initial studies were later disconfirmed. (Ioannidis, 2005)
Refusal of management journals to publish successful or failed replications:
Discourages replication studies.
Distorts meta-analyses.
Supports belief in false claims.
Second-order consequences: “Significant” findings often do not replicate
P Values and Replication
p = .01: false-positive 11%
p = .05: false-positive 29%
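The slide does not say how the 11% and 29% figures were derived. They match the lower bound on the false-positive risk obtained from the −e·p·ln(p) calibration of p-values (Sellke, Bayarri & Berger, 2001) when H0 and H1 are given equal prior probability. The sketch below reproduces the numbers under that assumption. It is one plausible reading, not a confirmed source for the slide.

```python
import math


def min_false_positive_risk(p: float, prior_h0: float = 0.5) -> float:
    """Lower bound on Pr(H0 | data) for an observed p-value.

    Uses the -e * p * ln(p) bound on the Bayes factor in favour of H0,
    which is only valid for p < 1/e. `prior_h0` = 0.5 is an assumption.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound requires 0 < p < 1/e")
    bayes_factor_bound = -math.e * p * math.log(p)
    prior_odds_h0 = prior_h0 / (1.0 - prior_h0)
    posterior_odds_h0 = prior_odds_h0 * bayes_factor_bound
    return posterior_odds_h0 / (1.0 + posterior_odds_h0)


print(round(min_false_positive_risk(0.01), 2))  # 0.11
print(round(min_false_positive_risk(0.05), 2))  # 0.29
```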
P-Hacking
effect size sensitivity
choice of alternative dependent variables
choice of alternative independent and control variables
choice within statistical procedures
choice of moderating variables
Simulation studies show combined effect of choices: 60% or more false-positives (Simmons et al. 2011)
Clustering of published p-values below .05, .01 and .001 suggests p-hacking (Simonsohn et al. 2013)
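A much-simplified simulation shows how analytic choices inflate false positives. It is not the Simmons et al. design. It assumes the null hypothesis is true, that a researcher measures several independent outcome variables with known standard deviation, and that the study is reported as "significant" if any one of them reaches p < .05. All parameter values are hypothetical.

```python
import math
import random


def two_sided_p(sample: list[float], sd: float = 1.0) -> float:
    """z-test p-value for H0: mean = 0, assuming a known standard deviation."""
    z = (sum(sample) / len(sample)) / (sd / math.sqrt(len(sample)))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_outcomes: int, n_studies: int = 4000,
                        n: int = 20, alpha: float = 0.05, seed: int = 1) -> float:
    """Share of null studies reporting at least one 'significant' outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Every outcome is pure noise: the true effect is exactly zero.
        p_values = [two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n)])
                    for _ in range(n_outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


for k in (1, 2, 4):
    print(f"{k} outcome(s) tried: false-positive rate = {false_positive_rate(k):.3f}")
```

With independent outcomes the expected rate is 1 − 0.95^k, so four outcomes alone push it to roughly 19%. Combining this with flexible covariates, moderators, and stopping rules is what drives the much higher rates reported in the simulation literature.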
Second-order consequences: Risks of false-negative findings
For extremely beneficial or detrimental outcomes, the p < .05 threshold can be too high. (Type 2 error)
Example: Hormone treatments
NHSTs with fixed significance thresholds ignore important trade-offs between costs and benefits of research outcomes.
Third-order consequences: NHSTs corrode researchers’ motivation and ethics
Often repeated and very public misuse of NHSTs creates cynicism and confusion.
Familiar applications of NHST are published. Justified deviations from the familiar attract extra scrutiny followed by rejection.
Research feels more like a game played to achieve promotion or visibility -- less of a search for truth or relevant solutions.
Accumulation of useful scientific knowledge is hindered.
NHSTs have severe limitations
How can we do better?
Start by Considering Contingencies
One attraction of NHSTs is superficial versatility. Researchers can use the same tests in most contexts.
However, this appearance of similarity is deceptive and, in itself, causes poor evaluations.
Research contexts in management are extremely diverse.
Researchers should take account of and discuss these contingencies (methodological toolbox).
Improvements – an example
Effects of training on 59 nurses’ knowledge about nutrition.
Traditional NHST told us that training had a “statistically significant” effect, but it did not show us:
How much knowledge changed (effect size).
The actual variability and uncertainty of these changes.
1: Focus on effect size measures and tailor them to research contexts
What metrics best capture changes in the dependent variables?
Describe effects in the meaningful units used to measure dependent variables – tons, numbers of people, bales, barrels.
Example: Percentage of correct answers by nurses on knowledge tests!
Other effect size measures (e.g., ΔR², Cohen's d, f², ω², Glass's Δ) (Cumming, 2011)
Would multiple assessments be informative?
Nurses, patients, hospital administrators, society may need different measures of effects.
Triangulation opportunities.
Should measures capture both benefits and their costs?
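As a minimal sketch of two of the measures named above, the code below reports an effect in raw units (percentage points of correct answers) alongside Cohen's d with a pooled standard deviation. The scores are invented for illustration and are not the nurses' data.

```python
import math
import statistics


def cohens_d(treated: list[float], control: list[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_variance = (
        (n1 - 1) * statistics.variance(treated)
        + (n2 - 1) * statistics.variance(control)
    ) / (n1 + n2 - 2)
    return (statistics.mean(treated) - statistics.mean(control)) / math.sqrt(pooled_variance)


# Hypothetical test scores (% correct) for a trained and an untrained group.
trained = [72.0, 65.0, 80.0, 58.0, 77.0, 69.0]
untrained = [55.0, 61.0, 48.0, 66.0, 52.0, 59.0]

raw_difference = statistics.mean(trained) - statistics.mean(untrained)
print(f"Raw effect: {raw_difference:.1f} percentage points")
print(f"Cohen's d:  {cohens_d(trained, untrained):.2f}")
```

The raw difference speaks to practitioners in the units they use, while d supports comparison across studies that used different tests.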
2: Report the uncertainty associated with measures of effects
Report variability and uncertainty of effect estimates (e.g., confidence intervals).
Although nurses’ knowledge rose 21% on average, changes ranged from -23% to +73%. Some nurses knew less after training!
Alternatives to CIs include likelihood ratios of alternative hypotheses and posterior distributions of estimated parameters.
Show graphs of complete distributions – say, the probability distribution of effect sizes. (Tukey 1977; Kosslyn 2006)
Reporting CIs supports aggregation of findings across studies (meta-analyses). (Cumming 2010)
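One way to report both the average effect and its uncertainty is sketched below with a percentile bootstrap confidence interval. The bootstrap is a swapped-in technique, since the slides only say "confidence intervals". The change scores are hypothetical: only the extremes (-23 and +73) echo the range quoted above, and the rest are invented.

```python
import random
import statistics

# Hypothetical per-nurse changes in % correct (NOT the study's data).
HYPOTHETICAL_CHANGES = [-23, -5, 0, 8, 12, 15, 18, 21, 24, 27, 30, 35, 41, 52, 73]


def bootstrap_ci(values: list[float], level: float = 0.95,
                 n_resamples: int = 5000, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_index = round((1.0 - level) / 2.0 * n_resamples)
    upper_index = round((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lower_index], means[upper_index]


mean_change = statistics.fmean(HYPOTHETICAL_CHANGES)
low, high = bootstrap_ci(HYPOTHETICAL_CHANGES)
print(f"Mean change: {mean_change:.1f} points")
print(f"95% bootstrap CI for the mean: [{low:.1f}, {high:.1f}]")
print(f"Observed range: {min(HYPOTHETICAL_CHANGES)} to {max(HYPOTHETICAL_CHANGES)}")
```

Reporting the interval and the range together makes visible what a bare "p < .05" hides: the average gain is uncertain, and some individuals moved in the opposite direction.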
Endorsement of effect size and CI reporting by APA Manual
"The degree to which any journal emphasizes (or de-emphasizes) NHST is a decision of the individual editor. However, complete reporting of all tested hypotheses and estimates of appropriate effect sizes and confidence intervals are the minimum expectation for all APA journals."
APA Manual (2010, p. 33)
3: Compare new data with baseline models rather than null hypotheses
Compare favored theories with hypotheses more challenging than a no-effect hypothesis.
Alternative treatments as baselines
Naïve baseline type 1: Data arise from very simple random processes.
Example: Suppose that organizational survival is a random walk.
Naïve baseline type 2: Crude stability or momentum processes.
Example: Tomorrow will be the same as today.
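A comparison against the "tomorrow will be the same as today" baseline could be sketched as follows. The skill score and the numbers are illustrative assumptions. The point is that a favored model must beat the naive forecast, not merely differ from zero.

```python
def mean_absolute_error(actual: list[float], predicted: list[float]) -> float:
    """Average absolute forecast error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(series: list[float]) -> list[float]:
    """Naive baseline: the forecast for period t+1 is the value observed at t."""
    return series[:-1]


def skill_vs_persistence(series: list[float], model_forecasts: list[float]) -> float:
    """1 - MAE(model) / MAE(baseline); positive means the model beats the baseline."""
    actual = series[1:]
    baseline_error = mean_absolute_error(actual, persistence_forecast(series))
    model_error = mean_absolute_error(actual, model_forecasts)
    return 1.0 - model_error / baseline_error


# Hypothetical observations and hypothetical model forecasts for periods 2..5.
observed = [10.0, 12.0, 11.0, 13.0, 15.0]
forecasts = [11.0, 12.0, 12.0, 14.0]
print(f"Skill relative to persistence: {skill_vs_persistence(observed, forecasts):.2f}")
```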
3: ... more information on baseline models
William H. Starbuck
University of Oregon
Andreas Schwab
Iowa State University
Using Baseline Models to Improve Theories About Emerging Markets
Research Methodology in Strategy and Management
Advances in International Management Research
Why Baseline Modelling is better than Null-Hypothesis Testing: Examples from International Business Research
4: Can Bayesian statistics help?
Revisit: NHSTs answer the wrong question.
Probability of observing the data assuming the null hypothesis is true:
Pr(data|H0)
Question of interest: Probability of the proposed hypothesis being true given the observed data (Arbuthnot, 1710; Male vs. female birth rates):
Pr(H1|data)
Bayesian approaches try to answer the latter question!
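The gap between Pr(data|H0) and Pr(H1|data) can be made concrete with Bayes' theorem for two competing hypotheses. The likelihoods and prior below are hypothetical and chosen only to show that a small Pr(data|H0) does not translate into a correspondingly large Pr(H1|data).

```python
def posterior_h1(likelihood_h1: float, likelihood_h0: float, prior_h1: float) -> float:
    """Pr(H1 | data) via Bayes' theorem, assuming H0 and H1 are the only candidates."""
    prior_h0 = 1.0 - prior_h1
    evidence = likelihood_h1 * prior_h1 + likelihood_h0 * prior_h0
    return likelihood_h1 * prior_h1 / evidence


# Hypothetical inputs: the data are unlikely under H0 (0.05) and likely under H1 (0.8),
# but H1 was considered improbable beforehand (prior of 0.1).
print(round(posterior_h1(likelihood_h1=0.8, likelihood_h0=0.05, prior_h1=0.1), 2))  # 0.64
```

Here Pr(data|H0) = .05, yet Pr(H1|data) is .64, not .95. The answer depends on the prior and on how well H1 predicts the data, neither of which an NHST considers.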
4: … more information on Bayesian stats
William H. Starbuck, University of Oregon
Andreas Schwab, Iowa State University
Research Method Division, Professional Development Workshops
Eugene D. Hahn, Salisbury University
Zhanyun Zhao, Rider University
Philadelphia, August 2014
Advanced Bayesian Statistics: How to Conduct and Publish High-Quality Bayesian Studies
How to promote and support methodological change
Please speak up –
When null hypotheses cannot be true
When researchers apply NHSTs to non-random samples or to entire populations
When people misinterpret significance tests
When researchers draw definitive conclusions from results that are inherently uncertain and probabilistic
When findings that are not statistically significant may be substantively very important
When researchers do not report effect sizes
Critics of NHSTs and P-Values
NHSTs and P-Values have been likened to:
Mosquitoes (annoying and impossible to swat away)
Emperor's New Clothes (fraught with obvious problems that everyone ignores)
Sterile Intellectual Rake (that ravishes science but leaves it with no progeny)
"Statistical Hypothesis Inference Testing" (because it provides a more fitting acronym)
… and support your colleagues when they raise such issues!
The Case Against Null Hypothesis Significance Testing
Additional Slides
Individuals infuse more meaning into NHSTs than these tests can offer.
NHSTs estimate the probability that the data would occur in a random sample -- if the H0 were true.
p does NOT represent the probability that the null hypothesis is true given the data.
1 – p does NOT represent the probability that H1 is true.
NHSTs are frequently misinterpreted
With large samples, NHSTs can turn random errors, measurement errors and trivial differences into statistically significant findings.
Consequently, a researcher who gathers a large enough sample can reject any point null hypothesis.
Computer technology facilitates efforts to obtain larger samples.
However, using smaller samples is not the solution because power problems help turn “noise” into significant effects.
NH Significance depends on researchers’ efforts
Small samples offer limited protection against false positives.
NHSTs can turn random errors, measurement errors and trivial differences into statistically significant findings.
Risks of exploiting instability of estimates.
Journals should require the following final sentence: "... or maybe this will turn out to be unreplicable noise" in font size of (3000 ÷ N)
NH Significance and Statistical Power
Recommended literature
Cumming, Geoff (2011): Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge, New York.
Effect Sizes, CIs
Recommended literature
Stephen M. Kosslyn (2006)
Graph Design for the Eye and Mind.
John W. Tukey (1977)
Exploratory Data Analysis.
5: Use robust statistics to make estimates, especially robust regression
Actual distributions of data often deviate from the probability distributions that tests assume.
Example: Even with samples from perfect Normal populations, ordinary least-squares regression (OLS) makes inaccurate coefficient estimates for samples smaller than 400.
With samples from non-Normal distributions, OLS becomes even more unreliable.
Robust statistics seek to provide more accurate estimates when data deviate from assumptions.
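A small sketch of the idea follows. It uses the Theil-Sen estimator (the median of all pairwise slopes) as one example of a robust regression method. The slides do not name a specific estimator, so this choice is an assumption. The data are hypothetical: a perfect line y = 2x with a single contaminated observation.

```python
import statistics
from itertools import combinations


def ols_slope(x: list[float], y: list[float]) -> float:
    """Ordinary least-squares slope for a simple regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    covariance_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return covariance_sum / sum((a - mean_x) ** 2 for a in x)


def theil_sen_slope(x: list[float], y: list[float]) -> float:
    """Robust slope: the median of the slopes through every pair of points."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return statistics.median(slopes)


# Hypothetical data: y = 2x exactly, except for one outlier at the last point.
x_values = [float(i) for i in range(10)]
y_values = [2.0 * v for v in x_values]
y_values[-1] = 60.0  # contaminated observation (would be 18.0 on the line)

print(f"OLS slope:       {ols_slope(x_values, y_values):.2f}")
print(f"Theil-Sen slope: {theil_sen_slope(x_values, y_values):.2f}")
```

A single bad observation more than doubles the OLS slope, while the median-based estimate stays at the slope of the uncontaminated line.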
4: To support generalization and replicability, frame hypotheses within very simple models
• If seeking applicability, beware of using many independent variables.
• If seeking generalization to new data, beware of using many independent variables.
• A few independent variables are useful, but the optimum occurs after a few.
• Additional variables fit random noise or idiosyncratic effects.
Recommended