Andreas Schwab Iowa State University Institute of Technology - Bandung Universitas Gadjah Mada

The Case against Null-Hypothesis Statistical Significance Tests: Flaws, Alternatives and Action Plans

Andreas Schwab, Iowa State University

Institute of Technology - Bandung

Universitas Gadjah Mada, April 30, 2014

Whom would you really like to be here?

Bill Starbuck, University of Oregon

The Case Against Null Hypothesis Significance Testing: Flaws, Alternatives, and Action Plans

William Starbuck, Andreas Schwab, Eric Abrahamson, Bruce Thompson

Research Method Division, Professional Development Workshop

Atlanta 2006, Philadelphia 2007, Anaheim 2008, Chicago 2009, Montreal 2010, San Antonio 2011, Orlando 2013.

Donald Hatfield, Jose Cortina, Ray Hubbard, Lisa Lambert

Researchers Should Make Thoughtful Assessments Instead of Null-Hypothesis Significance Tests

Andreas Schwab, Iowa State University

Eric Abrahamson, Columbia University

Bill Starbuck, University of Oregon

Fiona Fidler, La Trobe University

Perspective Article

2011 Organization Science, 22(4), 1105-1120.

What is wrong with Null-Hypothesis Significance Testing?

Formal Statistics Perspective: Nothing!

Application Perspective: Nearly everything!

Main Message: NHST simply does not answer the questions we are really interested in. Our ritualized NHST applications impede scientific progress.

NHSTs have been controversial for a long time

Fisher proposed NHSTs in 1925

Immediately, Neyman & Pearson questioned testing a null-hypothesis without testing any alternative hypothesis.

Other complaints have been added over time.

Statistics textbooks teach a ritualized use of NHSTs without reference to these complaints.

Many scholars remain unaware of the strong arguments against NHSTs.

NHSTs make assumptions that many studies do not satisfy

NHSTs calculate statistical significance based on a sampling distribution for a random sample.

For any other type of sample, NHST results have no meaningful interpretation:

Non-random samples

Population data

Incomplete data, because the missing data are unlikely to be random

NHSTs portray truth as dichotomous and definite (= real, important, and certain)

Either reject or fail to reject the null hypothesis.

Ritualized choice of same arbitrary significance levels for all studies (p < .05).

"Cliff effects" amplify very small differences in the data into very large differences in implications.

The lack of explicit discussion and reporting of detailed uncertainty information impedes model testing and development (dichotomous thinking).

NHSTs do not answer the questions we are really interested in

H0: A new type of training has no effect on knowledge of nurses.

NHST estimates the probability of observing an effect at least as large as the one in our data through random sampling alone, assuming H0 is true.

If p is small, we consider H0 unlikely to be true.

... and we conclude training has an effect on nurses' knowledge.
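
To make concrete what this p value does and does not measure, here is a minimal sketch of an exact sign-flip test in Python. It assumes paired before/after scores, so that under H0 each nurse's change is equally likely to be positive or negative. The function name and the data are hypothetical and invented for illustration; they are not the study's data.

```python
from itertools import product
from statistics import fmean

def sign_flip_p_value(changes):
    """Exact two-sided p-value for H0: changes are symmetric around zero."""
    observed = abs(fmean(changes))
    extreme = 0
    total = 0
    # Under H0 every pattern of plus/minus signs is equally likely.
    for signs in product((1, -1), repeat=len(changes)):
        total += 1
        flipped_mean = fmean(s * c for s, c in zip(signs, changes))
        if abs(flipped_mean) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical changes in percent correct, invented for illustration only.
changes = [21, -23, 35, 10, 73, 5, 18, -4, 40, 27, 12, 30]
print(sign_flip_p_value(changes))
```

The result is Pr(data at least this extreme | H0). It says nothing about how large the effect is or how probable H0 is given the data.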

NHSTs do not answer the questions we are really interested in

Problem 1: In most cases, we already know H0 is never true. Any intervention will have some effect, potentially small. (nil hypotheses)

Problem 2: Apparent validity of findings becomes a function of researchers' efforts because NHSTs are sensitive to sample size. (sample-size sensitivity)

Problem 3: The important question is not whether an effect is different from zero, but whether the effect is large enough to matter. (effect size evaluation)

Problem 4: NHSTs make no direct probability statements about whether H0 or H1 is true given the data. (inverse probability fallacy & infused meaning)

Higher-order negative consequences of the ritualized NHST applications

Risks of false-positive findings

Risks of false-negative findings

Corrosion of research ethics

Higher-order consequences: Risk of false-positive findings

NHSTs use a low threshold for what is considered important (p < .05; typical sample sizes).

Empirical research is a search for "needles in a haystack" (Webster & Starbuck, 1988). In management research, the average correlation between unrelated variables is not zero but 0.09.

When choosing two variables at random, NHST offers a 67% chance of significant findings on the first try, and a 96% chance with three tries for average reported sample sizes.

Hence, we mistake lots of “straws” for “needles”
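
The step from 67% on the first try to 96% within three tries can be checked with a short calculation. This sketch assumes the three tries are independent, which the slide does not state; the function name is hypothetical.

```python
def chance_of_any_significant(p_single: float, tries: int) -> float:
    """Probability that at least one of `tries` independent attempts is significant."""
    return 1.0 - (1.0 - p_single) ** tries

# 0.67 is the single-try figure quoted above; independence of tries is assumed.
for tries in (1, 2, 3):
    print(tries, round(chance_of_any_significant(0.67, tries), 2))
```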

Second-order consequences: “Significant” findings often do not replicate

Published NHST research findings often do not replicate or duplicate. (Type 1 error)

Three-eighths of most cited and discussed medical treatments supported by significant results in initial studies were later disconfirmed. (Ioannidis, 2005)

Refusal of management journals to publish successful or failed replications:

Discourages replication studies.

Distorts meta-analyses.

Supports belief in false claims.

Second-order consequences: “Significant” findings often do not replicate

P values and replication:

p = .01: false-positive 11%

p = .05: false-positive 29%

P hacking

Effect size sensitivity

choice of alternative dependent variables

choice of alternative independent and control variables

choice within statistical procedures

choice of moderating variables

Simulation studies show combined effect of choices: 60% or more false-positives (Simmons et al. 2011)

Clustering of published p-values just below .05, .01 and .001 suggests p-hacking (Simonsohn et al. 2013)
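
A minimal simulation sketch of one of these choices, picking the best of several dependent variables, is shown below. It is not the design used by Simmons et al.; it assumes independent outcomes, normally distributed data with a known standard deviation of 1, no true effect, and equal group sizes. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def p_value_two_groups(a, b):
    """Two-sided z-test for a difference in means, assuming a known SD of 1."""
    z = (fmean(a) - fmean(b)) / (2.0 / len(a)) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_outcomes, n_studies=2000, n_per_group=20, seed=1):
    """Share of null studies reporting p < .05 on at least one outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            # H0 is true by construction: both groups share the same distribution.
            treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
            control = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(p_value_two_groups(treated, control))
        if min(p_values) < 0.05:
            hits += 1
    return hits / n_studies

for k in (1, 5):
    print(k, false_positive_rate(k))
```

With independent outcomes the expected rate is 1 - 0.95^k, so five candidate outcomes already push the false-positive rate above 20% before any of the other choices are added.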

Second-order consequences: Risks of false-negative findings

For extremely beneficial or detrimental outcomes, the p < .05 threshold can be too high. (Type 2 error)

Example: Hormone treatments

NHSTs with fixed significance thresholds ignore important trade-offs between costs and benefits of research outcomes.

Third-order consequences: NHSTs corrode researchers’ motivation and ethics

Often repeated and very public misuse of NHSTs creates cynicism and confusion.

Familiar applications of NHST are published. Justified deviations from the familiar attract extra scrutiny followed by rejection.

Research feels more like a game played to achieve promotion or visibility, less like a search for truth or relevant solutions.

Accumulation of useful scientific knowledge is hindered.

NHSTs have severe limitations

How can we do better?

Start by Considering Contingencies

One attraction of NHSTs is superficial versatility. Researchers can use the same tests in most contexts.

However, this appearance of similarity is deceptive and, in itself, causes poor evaluations.

Research contexts in management are extremely diverse.

Researchers should account for and discuss these contingencies (methodological toolbox).

Improvements – an example

Effects of training on 59 nurses’ knowledge about nutrition.

Traditional NHST told us that training had a “statistically significant” effect, but it did not show us:

How much knowledge changed (effect size).

The actual variability and uncertainty of these changes.

1: Focus on effect size measures and tailor them to research contexts

What metrics best capture changes in the dependent variables?

Describe effects in the meaningful units used to measure dependent variables: tons, numbers of people, bales, barrels.

Example: Percentage of correct answers by nurses on knowledge tests!

Other effect size measures (e.g., ΔR², Cohen's d, f², ω², Glass's Δ) (Cumming, 2011)

Would multiple assessments be informative?

Nurses, patients, hospital administrators, society may need different measures of effects. Triangulation opportunities.

Should measures capture both benefits and their costs?
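
As a rough sketch of two of the measures named above, the following Python code reports an effect in its natural unit (percentage points of correct answers) and as Cohen's d with a pooled standard deviation. The scores and variable names are hypothetical and invented for illustration; they are not the nurses' data.

```python
from statistics import fmean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / pooled_var ** 0.5

# Hypothetical percent-correct scores, invented for illustration only.
trained = [72, 65, 80, 58, 77, 69]
untrained = [55, 60, 48, 63, 52, 58]

raw_difference = fmean(trained) - fmean(untrained)  # in percentage points
print(raw_difference, cohens_d(trained, untrained))
```

The raw difference is usually easier for practitioners to interpret; the standardized measure helps when comparing studies that used different scales.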

2: Report the uncertainty associated with measures of effects

Report variability and uncertainty of effect estimates (e.g., confidence intervals).

Although nurses’ knowledge rose 21% on average, changes ranged from -23% to +73%. Some nurses knew less after training!

Alternatives to CIs include likelihood ratios of alternative hypotheses and posterior distributions of estimated parameters.

Show graphs of complete distributions – say, the probability distribution of effect sizes. (Tukey 1977; Kosslyn 2006)

Reporting CIs supports aggregation of findings across studies (meta analyses). (Cumming 2010)
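
One way to obtain such an interval without distributional assumptions is a percentile bootstrap. The sketch below is only one of several possible CI methods and is not necessarily the one used in the nurse study; the data, function name, and parameter defaults are hypothetical.

```python
import random
from statistics import fmean

def bootstrap_ci(values, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        fmean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]

# Hypothetical changes in percent correct, invented for illustration only.
changes = [21, -23, 35, 10, 73, 5, 18, -4, 40, 27, 12, 30]
low, high = bootstrap_ci(changes)
print(f"mean change {fmean(changes):.1f}, 95% CI [{low:.1f}, {high:.1f}]")
print(f"individual changes range from {min(changes)} to {max(changes)}")
```

Reporting the interval alongside the full range keeps both kinds of information visible: uncertainty about the average effect and variability across individuals.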

Endorsement of effect size and CI reporting by the APA Manual

"The degree to which any journal emphasizes (or de-emphasizes) NHST is a decision of the individual editor. However, complete reporting of all tested hypotheses and estimates of appropriate effect sizes and confidence intervals are the minimum expectation for all APA journals."

APA Manual (2010, p. 33)

3: Compare new data with baseline models rather than null hypotheses

Compare favored theories with hypotheses more challenging than a no-effect hypothesis.

Alternative treatments as baselines

Naïve baseline type 1: Data arise from very simple random processes. Example: Suppose that organizational survival is a random walk.

Naïve baseline type 2: Crude stability or momentum processes. Example: Tomorrow will be the same as today.
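
A minimal sketch of a type 2 comparison: forecasts from a favored theory are scored against the "tomorrow will be the same as today" baseline instead of against a no-effect null. The series, the theory forecasts, and the choice of mean absolute error as the yardstick are all hypothetical assumptions made for illustration.

```python
from statistics import fmean

def mean_absolute_error(predictions, actuals):
    return fmean(abs(p - a) for p, a in zip(predictions, actuals))

def naive_persistence_forecast(series):
    """Baseline type 2: predict each value as the previous observed value."""
    return series[:-1]

# Hypothetical yearly figures and hypothetical theory-based forecasts for years 2 to 6.
sales = [100, 104, 103, 108, 112, 111]
theory_forecast = [103, 106, 105, 111, 114]

actuals = sales[1:]
baseline_error = mean_absolute_error(naive_persistence_forecast(sales), actuals)
theory_error = mean_absolute_error(theory_forecast, actuals)
print(f"baseline MAE {baseline_error:.1f}, theory MAE {theory_error:.1f}")
```

A theory earns credit only to the extent that it beats the naive baseline, which is a harder standard than differing from zero.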

3: ... more information on baseline models

William H. Starbuck

University of Oregon

Andreas Schwab

Iowa State University

Using Baseline Models to Improve Theories About Emerging Markets

Research Methodology in Strategy and Management

Advances in International Management Research

Why Baseline Modelling is better than Null-Hypothesis Testing: Examples from International Business Research

4: Can Bayesian statistics help?

Revisit: NHSTs answer the wrong question.

Probability of observing the data assuming the null hypothesis is true:

Pr(data|H0)

Question of interest: Probability of the proposed hypothesis being true given the observed data (Arbuthnot, 1710; male vs. female birth rates):

Pr(H1|data)

Bayesian approaches try to answer the latter question!
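
The difference between the two quantities can be sketched with Bayes' rule for two simple point hypotheses. Everything numeric here is a hypothetical assumption: the data (60 successes in 100 trials), the two candidate rates, and the equal prior. Real Bayesian analyses usually work with full prior and posterior distributions rather than two point hypotheses.

```python
from math import comb

def binomial_likelihood(successes, trials, rate):
    """Pr(data | rate) for a binomial outcome."""
    return comb(trials, successes) * rate ** successes * (1 - rate) ** (trials - successes)

def posterior_h1(successes, trials, rate_h0, rate_h1, prior_h1=0.5):
    """Pr(H1 | data) when H0 and H1 are the only two candidate hypotheses."""
    like_h0 = binomial_likelihood(successes, trials, rate_h0)
    like_h1 = binomial_likelihood(successes, trials, rate_h1)
    evidence = like_h1 * prior_h1 + like_h0 * (1 - prior_h1)
    return like_h1 * prior_h1 / evidence

# Hypothetical data and hypotheses, invented for illustration only.
print("Pr(data|H0) =", binomial_likelihood(60, 100, 0.5))
print("Pr(H1|data) =", posterior_h1(60, 100, rate_h0=0.5, rate_h1=0.6))
```

The posterior depends on the prior and on the alternative that is specified, which is exactly the information an NHST leaves out.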

4: … more information on Bayesian stats

William H. Starbuck, University of Oregon

Research Method Division, Professional Development Workshops

Eugene D. Hahn, Salisbury University

Zhanyun Zhao, Rider University

Philadelphia, August 2014

Advanced Bayesian Statistics: How to Conduct and Publish High-Quality Bayesian Studies

How to promote and support methodological change

Please speak up:

When null hypotheses cannot be true

When researchers apply NHSTs to non-random samples or to entire populations

When people misinterpret significance tests

When researchers draw definitive conclusions from results that are inherently uncertain and probabilistic

When findings that are not statistically significant may be substantively very important

When researchers do not report effect sizes

Critics of NHSTs and P-Values

NHSTs and P-Values have been likened to:

Mosquitoes (annoying and impossible to swat away)

Emperor's New Clothes (fraught with obvious problems that everyone ignores)

Sterile Intellectual Rake (that ravishes science but leaves it with no progeny)

"Statistical Hypothesis Inference Testing" (because it provides a more fitting acronym)

… and support your colleagues when they raise such issues!


The Case Against Null Hypothesis Significance Testing

Additional Slides

Individuals infuse more meaning into NHSTs than these tests can offer.

NHSTs estimate the probability that the data would occur in a random sample if H0 were true.

p does NOT represent the probability that the null hypothesis is true given the data.

1 - p does NOT represent the probability that H1 is true.

NHSTs are frequently misinterpreted

With large samples, NHSTs can turn random errors, measurement errors and trivial differences into statistically significant findings.

Consequently, a researcher who gathers a large enough sample can reject any point null hypothesis.

Computer technology facilitates efforts to obtain larger samples.

However, using smaller samples is not the solution because power problems help turn “noise” into significant effects.

NH Significance depends on researchers’ efforts
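
The sample-size sensitivity described above can be sketched with the Fisher z approximation for testing a correlation against zero. The correlation of 0.02 is a hypothetical stand-in for a substantively trivial relationship, and the function name is invented for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist

def correlation_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z approximation."""
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# r = 0.02 is a hypothetical, substantively trivial correlation.
for n in (100, 1_000, 10_000, 100_000):
    print(n, correlation_p_value(0.02, n))
```

The same trivial correlation moves from clearly non-significant to highly significant purely because n grows, while its practical importance stays unchanged.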

Small samples offer limited protection against false positives.

NHSTs can turn random errors, measurement errors and trivial differences into statistically significant findings.

Risks of exploiting instability of estimates.

Journals should require the following final sentence: "... or maybe this will turn out to be unreplicable noise" in font size of (3000 ÷ N)

NH Significance and Statistical Power

Recommended literature

Cumming, Geoff (2011): Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge, New York.

Effect Sizes, CI

Recommended literature

Kosslyn, Stephen M. (2006): Graph Design for the Eye and Mind.

Tukey, John W. (1977): Exploratory Data Analysis.

5: Use robust statistics to make estimates, especially robust regression

Actual distributions of data often deviate from the probability distributions that tests assume.

Example: Even with samples from perfect Normal populations, ordinary least-squares regression (OLS) makes inaccurate coefficient estimates for samples smaller than 400.

With samples from non-Normal distributions, OLS becomes even more unreliable.

Robust statistics seek to provide more accurate estimates when data deviate from assumptions.
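
As one illustration, the sketch below compares the OLS slope with the Theil-Sen estimator (the median of all pairwise slopes) on data containing a single gross outlier. Theil-Sen is only one example of a robust estimator and is not named on the slides; the data are hypothetical.

```python
from itertools import combinations
from statistics import fmean, median

def ols_slope(x, y):
    """Ordinary least-squares slope for a single predictor."""
    x_bar, y_bar = fmean(x), fmean(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator

def theil_sen_slope(x, y):
    """Median of slopes over all pairs of points with distinct x values."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return median(slopes)

# Hypothetical data: y = 2x exactly, except for one gross outlier at the end.
x = list(range(10))
y = [2 * xi for xi in x]
y[-1] = 60

print("OLS slope:", ols_slope(x, y))
print("Theil-Sen slope:", theil_sen_slope(x, y))
```

One aberrant observation pulls the OLS slope far from the underlying value of 2, while the median-based estimate is unaffected.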

4: To support generalization and replicability, frame hypotheses within very simple models

• If seeking applicability, beware of using many independent variables.

• If seeking generalization to new data, beware of using many independent variables.

• A few independent variables are useful, but the optimum occurs after a few.

• Additional variables fit random noise or idiosyncratic effects.
