The Case against Null-Hypothesis Statistical Significance Tests: Flaws, Alternatives and Action Plans. Andreas Schwab, Iowa State University. Institute of Technology - Bandung, Universitas Gadjah Mada, April 30, 2014

Page 1

The Case against Null-Hypothesis Statistical Significance Tests: Flaws, Alternatives and Action Plans

Andreas Schwab
Iowa State University

Institute of Technology - Bandung

Universitas Gadjah Mada
April 30, 2014

Page 2

Whom would you really like to have here?

Bill Starbuck
University of Oregon

Page 3

The Case Against Null Hypothesis Significance Testing: Flaws, Alternatives, and Action Plans

William Starbuck, Andreas Schwab, Eric Abrahamson, Bruce Thompson

Research Method Division
Professional Development Workshop

Atlanta 2006, Philadelphia 2007, Anaheim 2008, Chicago 2009, Montreal 2010, San Antonio 2011, Orlando 2013.

Donald Hatfield, Jose Cortina, Ray Hubbard, Lisa Lambert

Page 4

Researchers Should Make Thoughtful Assessments Instead of Null-Hypothesis Significance Tests

Andreas Schwab, Iowa State University
Eric Abrahamson, Columbia University
Bill Starbuck, University of Oregon
Fiona Fidler, La Trobe University

Perspective Article

2011 Organization Science, 22(4), 1105-1120.

Page 5

What is wrong with Null-Hypothesis Significance Testing?

Formal Statistics Perspective: Nothing!

Application Perspective: Nearly everything!

Main Message: NHST simply does not answer the questions we are really interested in. Our ritualized NHST applications impede scientific progress.

Page 6

NHSTs have been controversial for a long time

Fisher proposed NHSTs in 1925

Immediately, Neyman & Pearson questioned testing a null-hypothesis without testing any alternative hypothesis.

Other complaints have been added over time.

Statistics textbooks teach a ritualized use of NHSTs without reference to these complaints.

Many scholars remain unaware of the strong arguments against NHSTs.

Page 7

NHSTs make assumptions that many studies do not satisfy

NHSTs calculate statistical significance based on a sampling distribution for a random sample.

For any other type of sample, NHST results have no meaningful interpretation:
• Non-random samples
• Population data
• Incomplete data, because missing data are unlikely to be random

Page 8

NHSTs portray truth as dichotomous and definite (= real, important, and certain)

Either reject or fail to reject the null hypothesis.

Ritualized choice of same arbitrary significance levels for all studies (p < .05).

"Cliff effects" amplify very small differences in the data into very large differences in implications.

The lack of explicit discussion and reporting of detailed uncertainty information impedes model testing and development (dichotomous thinking).
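
A minimal sketch of the cliff effect, assuming two hypothetical studies whose test statistics (z = 1.97 and z = 1.95) are illustrative values that do not come from the slides. A two-sided normal test is assumed for simplicity.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical test statistics from two nearly identical studies.
z_study_a = 1.97
z_study_b = 1.95

p_a = two_sided_p_from_z(z_study_a)
p_b = two_sided_p_from_z(z_study_b)

for label, z, p in [("A", z_study_a, p_a), ("B", z_study_b, p_b)]:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"Study {label}: z = {z:.2f}, p = {p:.4f} -> {verdict}")
```

The two p-values differ by less than 0.003, yet the conventional p < .05 rule labels one study a finding and the other a non-finding.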

Page 9

NHSTs do not answer the questions we are really interested in

H0: A new type of training has no effect on knowledge of nurses.

NHST estimates the probability of observing an effect at least as large as the one in our data as a result of random sampling alone -- assuming H0 is true.

If p is small, we consider H0 unlikely to be true.

... and we conclude training has an effect on nurses' knowledge.
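
The logic above can be sketched with a sign-flip permutation test, which is one simple way (assumed here, not named on the slide) to estimate the probability of data at least this extreme given H0. The score changes are hypothetical.

```python
import random

# Hypothetical before/after score changes (percentage points) for 12 nurses.
changes = [21, 35, -5, 12, 40, 8, -10, 27, 18, 30, 3, 15]

def sign_flip_p_value(diffs, n_resamples=20000, seed=0):
    """Estimate Pr(|mean change| >= observed | H0) by randomly flipping signs.

    Under H0 (training has no effect), each change is as likely to be
    positive as negative, so flipping signs mimics sampling under H0.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / n_resamples

p = sign_flip_p_value(changes)
print(f"mean change = {sum(changes) / len(changes):.1f} points, p ~ {p:.4f}")
```

Note what the number is: the share of H0-worlds that produce a mean change this large. It is not the probability that H0 is true.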

Page 10

NHSTs do not answer the questions we are really interested in

Problem 1: In most cases, we already know H0 is never true. Any intervention will have some effect – potentially small. (nil hypotheses)

Problem 2: Apparent validity of findings becomes a function of researchers' efforts because NHSTs are sensitive to sample size. (sample-size sensitivity)

Problem 3: The important question is not whether an effect is different from zero, but whether the effect is large enough to matter. (effect size evaluation)

Problem 4: NHSTs offer no direct probability statements about whether H0 or H1 is true given the data. (inverse probability fallacy & infused meaning)

Page 11

Higher-order negative consequences of the ritualized NHST applications

Risks of false-positive findings

Risks of false-negative findings

Corrosion of research ethics

Page 12

Higher-order consequences: Risk of false-positive findings

NHSTs use a low threshold for what is considered important (p < .05; typical sample sizes).

Empirical research is a search for "needles in a haystack" (Webster & Starbuck, 1988).

In management research, the average correlation between unrelated variables is not zero but 0.09.

When choosing two variables at random, NHST offers a 67% chance of significant findings on the first try, and a 96% chance with three tries for average reported sample sizes.

Hence, we mistake lots of “straws” for “needles”.
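
The step from 67% to 96% is simple arithmetic if the three tries are treated as independent (an assumption made here for illustration). A short sketch:

```python
# Chance of a "significant" result on one try, as reported on the slide.
p_single_try = 0.67

def chance_of_at_least_one_hit(p_single: float, tries: int) -> float:
    """Probability of at least one significant result in independent tries."""
    return 1 - (1 - p_single) ** tries

for tries in (1, 2, 3):
    chance = chance_of_at_least_one_hit(p_single_try, tries)
    print(f"{tries} tries: {chance:.3f}")
```

With three tries the chance is 1 - 0.33³, or about 0.964, which rounds to the 96% quoted above.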

Page 13

Second-order consequences: “Significant” findings often do not replicate

Published NHST research findings often do not replicate or duplicate. (Type 1 error)

Three-eighths of the most cited and discussed medical treatments supported by significant results in initial studies were later disconfirmed. (Ioannidis, 2005)

Refusal of management journals to publish successful or failed replications:
• Discourages replication studies.
• Distorts meta-analyses.
• Supports belief in false claims.

Page 14

Second-order consequences: “Significant” findings often do not replicate

P Values and Replication
• p = .01: false-positive 11%
• p = .05: false-positive 29%
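
The two figures above agree with a well-known calibration of p-values by Sellke, Bayarri and Berger (2001), which bounds the posterior probability of H0 from below by 1 / (1 + 1 / B) with B = -e · p · ln(p), assuming 1:1 prior odds and p < 1/e. Whether the slide relied on this calibration is an assumption. The sketch only shows that the formula reproduces 11% and 29%.

```python
import math

def min_false_positive_risk(p: float) -> float:
    """Lower bound on Pr(H0 | data) at 1:1 prior odds, valid for p < 1/e."""
    bayes_factor_bound = -math.e * p * math.log(p)
    return 1 / (1 + 1 / bayes_factor_bound)

for p in (0.01, 0.05):
    risk = min_false_positive_risk(p)
    print(f"p = {p}: false-positive risk of at least {risk:.0%}")
```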

P Hacking
• effect size sensitivity
• choice of alternative dependent variables
• choice of alternative independent and control variables
• choice within statistical procedures
• choice of moderating variables

Simulation studies show combined effect of choices: 60% or more false-positives (Simmons et al. 2011)

Clustering of published p-values below .05, .01 and .001 suggests p-hacking (Simonsohn et al. 2013)
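
A minimal simulation sketch of one such choice: trying several dependent variables and reporting whichever reaches p < .05. It is a deliberately simplified stand-in for the simulations cited above, not a reproduction of them. The outcomes are assumed independent, the population standard deviation is assumed known (so a z-test applies), and all sample sizes are hypothetical.

```python
import math
import random

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def study_p_value(rng, n_per_group):
    """p-value of a z-test comparing two groups drawn from the SAME population."""
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(a) / n_per_group - sum(b) / n_per_group
    z = diff / math.sqrt(2 / n_per_group)  # population sd assumed known (= 1)
    return two_sided_p_from_z(z)

def false_positive_rate(n_outcomes, n_studies=2000, n_per_group=20, seed=1):
    """Share of null studies reporting p < .05 on at least one outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = [study_p_value(rng, n_per_group) for _ in range(n_outcomes)]
        if min(p_values) < 0.05:
            hits += 1
    return hits / n_studies

for k in (1, 3, 5):
    print(f"{k} outcome(s) tried: false-positive rate ~ {false_positive_rate(k):.3f}")
```

Even this single degree of freedom pushes the false-positive rate well above the nominal 5%. Combining several such choices, as in the cited simulations, compounds the inflation.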

Page 15

Second-order consequences: Risks of false-negative findings

For extremely beneficial or detrimental outcomes, the p < .05 threshold can be too high. (Type 2 error)

Example: Hormone treatments

NHSTs with fixed significance thresholds ignore important trade-offs between costs and benefits of research outcomes.

Page 16

Third-order consequences: NHSTs corrode researchers’ motivation and ethics

Often repeated and very public misuse of NHSTs creates cynicism and confusion.

Familiar applications of NHST are published.

Justified deviations from the familiar attract extra scrutiny followed by rejection.

Research feels more like a game played to achieve promotion or visibility -- less of a search for truth or relevant solutions.

Accumulation of useful scientific knowledge is hindered.

Page 17

NHSTs have severe limitations

How can we do better?

Page 18

Start by Considering Contingencies

One attraction of NHSTs is superficial versatility. Researchers can use the same tests in most contexts.

However, this appearance of similarity is deceptive and, in itself, causes poor evaluations.

Research contexts in management are extremely diverse.

Researchers should take account of and discuss these contingencies (methodological toolbox).

Page 19

Improvements – an example

Effects of training on 59 nurses’ knowledge about nutrition.

Traditional NHST told us that training had a “statistically significant” effect, but it did not show us:
• How much knowledge changed (effect size).
• The actual variability and uncertainty of these changes.

Page 20

1: Focus on effect size measures and tailor them to research contexts

What metrics best capture changes in the dependent variables?

Describe effects in the meaningful units used to measure dependent variables – tons, numbers of people, bales, barrels.

Example: Percentage of correct answers by nurses on knowledge tests!

Other effect size measures (e.g., ΔR², Cohen's d, f², ω², Glass's Δ) (Cumming, 2011)

Would multiple assessments be informative?

Nurses, patients, hospital administrators, society may need different measures of effects.

Triangulation opportunities.

Should measures capture both benefits and their costs?
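
A minimal sketch contrasting an effect in meaningful units (percentage points of correct answers) with a standardized measure (Cohen's d with a pooled standard deviation). The two-group design and all scores are hypothetical and chosen only for illustration.

```python
import math
import statistics

# Hypothetical test scores (% correct), for illustration only.
trained = [72, 65, 80, 58, 77, 69, 83, 61]
untrained = [55, 60, 48, 66, 52, 59, 63, 50]

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

raw_difference = statistics.mean(trained) - statistics.mean(untrained)
print(f"raw difference: {raw_difference:.1f} percentage points")
print(f"Cohen's d: {cohens_d(trained, untrained):.2f}")
```

The raw difference speaks to practitioners in the units they use. The standardized value is mainly useful for comparing across studies that measure knowledge on different scales.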

Page 21

2: Report the uncertainty associated with measures of effects

Report variability and uncertainty of effect estimates (e.g., confidence intervals).

Although nurses’ knowledge rose 21% on average, changes ranged from -23% to +73%. Some nurses knew less after training!

Alternatives to CIs include likelihood ratios of alternative hypotheses and posterior distributions of estimated parameters.

Show graphs of complete distributions – say, the probability distribution of effect sizes. (Tukey 1977; Kosslyn 2006)

Reporting CIs supports aggregation of findings across studies (meta-analyses). (Cumming 2010)
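
A minimal sketch of reporting uncertainty alongside the average, using a percentile bootstrap interval (one of several possible CI methods, chosen here because it needs only the standard library). The 15 score changes are hypothetical. They were picked so that the mean is 21 and the range runs from -23 to +73, echoing the slide, but they are not the study's data.

```python
import random
import statistics

# Hypothetical knowledge changes (percentage points) for 15 nurses.
changes = [21, 35, -5, 12, 40, 8, -23, 27, 18, 73, 3, 15, 30, 22, 39]

def bootstrap_ci(data, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]

low, high = bootstrap_ci(changes)
print(f"mean change: {statistics.mean(changes):.1f} points")
print(f"range: {min(changes)} to {max(changes)} points")
print(f"95% bootstrap CI for the mean: {low:.1f} to {high:.1f} points")
```

The interval describes uncertainty about the average change, while the range describes how differently individual nurses responded. Both are hidden by a bare "p < .05".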

Page 22

Endorsement of effect size and CI reporting by APA Manual

"The degree to which any journal emphasizes (or de-emphasizes) NHST is a decision of the individual editor. However, complete reporting of all tested hypotheses and estimates of appropriate effect sizes and confidence intervals are the minimum expectation for all APA journals."

APA Manual (2010, p. 33)

Page 23

3: Compare new data with baseline models rather than null hypotheses

Compare favored theories with hypotheses more challenging than a no-effect hypothesis.

Alternative treatments as baselines

Naïve baseline type 1: Data arise from very simple random processes. Example: Suppose that organizational survival is a random walk.

Naïve baseline type 2: Crude stability or momentum processes. Example: Tomorrow will be the same as today.
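
A minimal sketch of a comparison against the second kind of naïve baseline, "tomorrow will be the same as today". The yearly series, the forecasts from a hypothetical theory-based model, and the choice of mean absolute error as the yardstick are all assumptions made for illustration.

```python
# Hypothetical yearly outcomes (e.g., sales) for seven years.
actual = [100, 104, 103, 108, 112, 111, 117]

# Hypothetical one-year-ahead forecasts from a theory-based model, years 2 to 7.
model_forecast = [102, 105, 105, 109, 113, 114]

def mean_absolute_error(predictions, outcomes):
    """Average absolute gap between predictions and realized outcomes."""
    errors = [abs(p - o) for p, o in zip(predictions, outcomes)]
    return sum(errors) / len(errors)

# Naive baseline: each year is predicted to equal the previous year.
persistence_forecast = actual[:-1]
outcomes = actual[1:]

baseline_mae = mean_absolute_error(persistence_forecast, outcomes)
model_mae = mean_absolute_error(model_forecast, outcomes)

print(f"persistence baseline MAE: {baseline_mae:.2f}")
print(f"theory-based model MAE:   {model_mae:.2f}")
```

The interesting question becomes how much the theory improves on a crude rule, which is a harder test than beating a no-effect hypothesis.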

Page 24

3: ... more information on baseline models

William H. Starbuck

University of Oregon

Andreas Schwab

Iowa State University

Using Baseline Models to Improve Theories About Emerging Markets

Research Methodology in Strategy and Management

Advances in International Management Research

Why Baseline Modelling is better than Null-Hypothesis Testing: Examples from International Business Research

Page 25

4: Can Bayesian statistics help?

Revisit: NHSTs answer the wrong question.

Probability of observing the data assuming the null hypothesis is true

Pr(data|H0)

Question of interest: Probability of the proposed hypothesis being true given the observed data (Arbuthnot, 1710; Male vs. female birth rates)

Pr(H1|data)

Bayesian approaches try to answer the latter question!
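
A minimal sketch of how Bayes' theorem turns Pr(data|H0) and Pr(data|H1) into Pr(H1|data). It assumes H0 and H1 are the only two candidates, and the likelihoods and priors are hypothetical numbers chosen for illustration.

```python
def posterior_h1(prior_h1, likelihood_h1, likelihood_h0):
    """Pr(H1 | data) from Bayes' theorem with two exhaustive hypotheses."""
    prior_h0 = 1 - prior_h1
    evidence = likelihood_h1 * prior_h1 + likelihood_h0 * prior_h0
    return likelihood_h1 * prior_h1 / evidence

# Hypothetical likelihoods of the observed data under each hypothesis.
pr_data_given_h0 = 0.05
pr_data_given_h1 = 0.50

for prior in (0.5, 0.1):
    post = posterior_h1(prior, pr_data_given_h1, pr_data_given_h0)
    print(f"prior Pr(H1) = {prior:.1f} -> Pr(H1 | data) = {post:.3f}")
```

The same data support quite different conclusions depending on how plausible H1 was beforehand, which Pr(data|H0) alone cannot reveal.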

Page 26

4: … more information on Bayesian stats

William H. Starbuck, University of Oregon

Andreas Schwab, Iowa State University

Research Method Division
Professional Development Workshops

Eugene D. Hahn, Salisbury University

Zhanyun Zhao, Rider University

Philadelphia, August 2014

Advanced Bayesian Statistics: How to Conduct and Publish High-Quality Bayesian Studies

Page 27

How to promote and support methodological change

Please speak up:
• When null hypotheses cannot be true
• When researchers apply NHSTs to non-random samples or to entire populations
• When people misinterpret significance tests
• When researchers draw definitive conclusions from results that are inherently uncertain and probabilistic
• When findings that are not statistically significant may be substantively very important
• When researchers do not report effect sizes

Page 28

Critics of NHSTs and P-Values

NHSTs and P-Values have been likened to:

• Mosquitoes (annoying and impossible to swat away)
• Emperor's New Clothes (fraught with obvious problems that everyone ignores)
• Sterile Intellectual Rake (that ravishes science but leaves it with no progeny)
• "Statistical Hypothesis Inference Testing" (because it provides a more fitting acronym)

Page 29

… and support your colleagues when they raise such issues!

Page 30

The Case against Null-Hypothesis Statistical Significance Tests: Flaws, Alternatives and Action Plans

Andreas Schwab
Iowa State University

Institute of Technology - Bandung

Universitas Gadjah Mada
April 30, 2014

Page 31

The Case Against Null Hypothesis Significance Testing

Additional Slides

Page 32

NHSTs are frequently misinterpreted

Individuals infuse more meaning into NHSTs than these tests can offer.

NHSTs estimate the probability that the data would occur in a random sample -- if H0 were true.

p does NOT represent the probability that the null hypothesis is true given the data.

1 – p does NOT represent the probability that H1 is true.

Page 33

NH Significance depends on researchers’ efforts

With large samples, NHSTs can turn random errors, measurement errors and trivial differences into statistically significant findings.

Consequently, a researcher who gathers a large enough sample can reject any point null hypothesis.

Computer technology facilitates efforts to obtain larger samples.

However, using smaller samples is not the solution because power problems help turn “noise” into significant effects.
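
A minimal sketch of this sample-size sensitivity, assuming a fixed and substantively negligible correlation of r = 0.02 (a hypothetical value) and the Fisher z approximation for testing a zero correlation.

```python
import math

def p_value_for_correlation(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0 via the Fisher z transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

trivial_r = 0.02  # hypothetical, substantively negligible correlation

for n in (100, 1_000, 10_000, 100_000):
    print(f"N = {n:>7,}: p = {p_value_for_correlation(trivial_r, n):.4f}")
```

The correlation never changes. Only the researcher's effort in collecting observations moves the result from clearly non-significant to significant.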

Page 34

NH Significance and Statistical Power

Small samples offer limited protection against false positives.

NHSTs can turn random errors, measurement errors and trivial differences into statistically significant findings.

Risks of exploiting instability of estimates.

Journals should require the following final sentence: "... or maybe this will turn out to be unreplicable noise" in font size of (3000 ÷ N).
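
A minimal simulation sketch of how unstable small-sample estimates can be exploited. It assumes a small true effect of 0.2 standard deviations, 20 observations per group, and a z-test with known standard deviation. All three are hypothetical choices for illustration.

```python
import math
import random

TRUE_EFFECT = 0.2   # hypothetical true mean difference, in sd units
N_PER_GROUP = 20    # hypothetical small sample
CRITICAL_Z = 1.96   # two-sided 5% threshold

def simulate_estimate(rng):
    """Estimated mean difference from one simulated two-group study."""
    control = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    treated = [rng.gauss(TRUE_EFFECT, 1) for _ in range(N_PER_GROUP)]
    return sum(treated) / N_PER_GROUP - sum(control) / N_PER_GROUP

def simulate_studies(n_studies=4000, seed=2):
    """Return the share of significant studies and their mean absolute estimate."""
    rng = random.Random(seed)
    standard_error = math.sqrt(2 / N_PER_GROUP)  # population sd assumed known (= 1)
    estimates = [simulate_estimate(rng) for _ in range(n_studies)]
    significant = [e for e in estimates if abs(e) / standard_error > CRITICAL_Z]
    power = len(significant) / n_studies
    mean_significant = sum(abs(e) for e in significant) / len(significant)
    return power, mean_significant

power, mean_significant = simulate_studies()
print(f"share of studies reaching p < .05: {power:.3f}")
print(f"true effect: {TRUE_EFFECT}, mean |estimate| among significant studies: {mean_significant:.2f}")
```

Under these assumptions only estimates larger than about 0.62 can pass the threshold, so every "significant" study overstates the true effect by at least a factor of three.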

Page 35

Recommended literature

Cumming, Geoff (2011): Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge, New York.

Effect Sizes
CI

Page 36

Recommended literature

Stephen M. Kosslyn (2006)

Graph Design for the Eye and Mind.

John W. Tukey (1977)

Exploratory Data Analysis.

Page 37

5: Use robust statistics to make estimates, especially robust regression

Actual distributions of data often deviate from the probability distributions that tests assume.

Example: Even with samples from perfect Normal populations, ordinary least-squares regression (OLS) makes inaccurate coefficient estimates for samples smaller than 400.

With samples from non-Normal distributions, OLS becomes even more unreliable.

Robust statistics seek to provide more accurate estimates when data deviate from assumptions.
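
A minimal sketch of the idea, using the Theil-Sen estimator (the median of all pairwise slopes) as one example of a robust regression method. The slides do not name a specific estimator, so this choice is an assumption, and the data are hypothetical: an exact line with slope 2 plus one gross outlier.

```python
import statistics
from itertools import combinations

def ols_slope(x, y):
    """Ordinary least-squares slope for a single predictor."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def theil_sen_slope(x, y):
    """Robust slope: the median of the slopes through every pair of points."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return statistics.median(slopes)

x = list(range(10))
y = [2 * xi + 1 for xi in x]  # exact line with slope 2
y[9] = 60                     # one hypothetical gross outlier (true value 19)

print(f"OLS slope:       {ols_slope(x, y):.2f}")
print(f"Theil-Sen slope: {theil_sen_slope(x, y):.2f}")
```

A single contaminated observation more than doubles the OLS slope, while the median-based estimate stays at the slope of the uncontaminated line.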

Page 38

Page 39

4: To support generalization and replicability, frame hypotheses within very simple models

• If seeking applicability, beware of using many independent variables.

• If seeking generalization to new data, beware of using many independent variables.

• A few independent variables are useful, but the optimum occurs after a few.

• Additional variables fit random noise or idiosyncratic effects.
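
A minimal sketch of how extra variables fit random noise. It is a simplification of the multi-variable case: instead of fitting a regression with many predictors, it searches over pure-noise candidate predictors and keeps the one that correlates best with a pure-noise outcome, then checks that predictor on fresh data. Sample size, number of candidates, and seed are hypothetical.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def best_noise_predictor(n_candidates, n_obs=30, seed=3):
    """Pick the noise predictor with the largest in-sample |r|.

    Returns its in-sample correlation and its correlation on new data.
    """
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0, 1) for _ in range(n_obs)]

    y_train, y_new = noise(), noise()
    best_in, best_out = 0.0, 0.0
    for _ in range(n_candidates):
        x_train, x_new = noise(), noise()
        r_in = pearson_r(x_train, y_train)
        if abs(r_in) > abs(best_in):
            best_in, best_out = r_in, pearson_r(x_new, y_new)
    return best_in, best_out

for k in (1, 10, 50):
    r_in, r_out = best_noise_predictor(k)
    print(f"{k:>2} candidates: in-sample r = {r_in:+.2f}, new-data r = {r_out:+.2f}")
```

The more candidates are searched, the stronger the best in-sample correlation looks, even though nothing here is related to anything. The new-data correlation of the chosen predictor carries no such upward drift.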

Page 40