
Page 1

Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology

Society for Philosophy of Science in Practice (SPSP), June 17, 2016

Deborah G. Mayo and Caitlin Parker, Virginia Tech

Page 2

Statistical Crisis of Replication

• Everywhere you look, “Science is in Crisis”

• Researchers report exciting (statistically significant) findings which promptly disappear

• High-profile failures of replication, not just in the social sciences but also in biology, have led people to take the problem more seriously

Page 3

Reforms without philosophy of statistics are blind

• Taskforces, journalistic reforms, and debunking treatises are legion

• Proposed methodological reforms: many welcome (e.g., preregistration), some quite radical

• The issue cries out for illumination from philosophers of science wanting to be relevant to practice

Page 4

Replication crisis in social psychology

• Diederik Stapel, the social psychologist who fabricated his data (2011)

• Investigating Stapel revealed a culture of verification bias; selective reporting was so common that such practices came to be called questionable research practices (QRPs)

Page 5

“I see a train-wreck looming”: Daniel Kahneman calls for a “daisy chain” of replication (Sept. 2012)

OSC Reproducibility Project: Psychology, 2011-15 (Science 2015): a crowd-sourced effort to replicate 100 articles (led by Brian Nosek, U. of Virginia)

Page 6

American Statistical Association (ASA): Statement on P-values

“The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as…to ban P-values” (ASA 2016)

Page 7

I was a ‘philosophical observer’ at the ASA P-value “pow wow”

Page 8

“Don’t throw out the error control baby with the bad statistics bathwater” (Mayo 2016, The American Statistician)

Page 9

Error Statistics

• Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes

• The inference may be in error

• It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities)

Page 10

“p-value. … to test the conformity of the particular data under analysis with H0 in some respect: … we find a function t = t(y) of the data, to be called the test statistic, such that

• the larger the value of t the more inconsistent are the data with H0;

• the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

… the p-value corresponding to any t as p = p(t) = P(T ≥ t; H0)”

(Mayo and Cox 2006, p. 81)
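A minimal sketch of this definition, assuming a one-sided Normal (z) test of H0: μ = 0 with σ known, where t(y) = √n·ȳ/σ and p = P(T ≥ t; H0); the function name and numbers below are our own illustration, not from the slides.

```python
# Minimal sketch (our illustration): p = P(T >= t; H0) for a one-sided z-test
# of H0: mu = 0 against mu > 0, with sigma known.
import numpy as np
from scipy import stats

def one_sided_z_pvalue(y, sigma=1.0):
    """Return the observed test statistic t and p = P(T >= t; H0: mu = 0)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t_obs = np.sqrt(n) * y.mean() / sigma   # larger t: data more inconsistent with H0
    p = stats.norm.sf(t_obs)                # survival function: P(Z >= t_obs) under H0
    return t_obs, p

# Toy example: 25 observations drawn with true mean 0.4 and sigma = 1
rng = np.random.default_rng(0)
y = rng.normal(loc=0.4, scale=1.0, size=25)
t_obs, p = one_sided_z_pvalue(y)
print(f"t = {t_obs:.2f}, p = {p:.3f}")
```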

Page 11

• Clearly, if even larger differences than t occur fairly frequently under H0 (p-value is not small), there’s scarcely evidence of incompatibility 

• But a small p-value doesn’t warrant inferring a genuine statistical effect H, let alone a scientific conclusion H*

• Sticking to Fisherian tests with a single null hypothesis encourages this fallacy (NHST)

Stat-sub fallacy: H ⇒ H*

Page 12

Neyman-Pearson (N-P) tests: a null and an alternative hypothesis, H0 and H1, that exhaust the parameter space

• So the fallacy of rejection (H ⇒ H*) is impossible

• Rejecting the null only indicates statistical alternatives

Page 13

• It’s not that we’re keen to defend many common uses of significance tests

• The criticisms are often based on misunderstandings; consequently so are many “reforms”

• Replication research falls into a paradox

Page 14

A paradox for significance test critics

Critic: It’s much too easy to get small P-values.

You: Why do they find it so difficult to replicate the small P-values others found?* 

Is it easy or is it hard? *(Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration’s replication project)

Page 15

• R.A. Fisher: it’s easy to lie with statistics by selective reporting (he called it the “political principle”)

• Sufficient finagling—cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere—may practically guarantee a preferred claim C gets support, even if it’s unwarranted by evidence

(biasing selection effects)

• Note: rejecting a null is taken as support for some non-null claim C

Page 16

You report: Such results would be difficult to achieve under the assumption of H0

When in fact such results are common under the assumption of H0

(Formally):

• You say Pr(P-value < Pobs; H0) = Pobs (small)

• But in fact Pr(P-value < Pobs; H0) = high
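A simulation sketch of this gap, assuming a researcher tests 20 independent null outcomes and reports only the smallest p-value (our own toy setup): each single test has level 0.05, but the probability of reporting p ≤ 0.05 under H0 is roughly 1 − 0.95^20 ≈ 0.64.

```python
# Sketch (our illustration): cherry-picking the best of k null tests makes
# Pr(reported p <= 0.05; H0) high, even though each single test has level 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, trials, alpha = 30, 20, 10_000, 0.05
count = 0
for _ in range(trials):
    # k independent outcome variables, all with true mean 0 (H0 true for every one)
    data = rng.normal(0.0, 1.0, size=(k, n))
    t = np.sqrt(n) * data.mean(axis=1) / data.std(axis=1, ddof=1)
    p = stats.t.sf(t, df=n - 1)        # one-sided p-values
    if p.min() <= alpha:               # report only the "best" result
        count += 1
print(f"Pr(reported p <= {alpha}; H0) ≈ {count / trials:.2f}")   # ≈ 0.64, not 0.05
```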

Page 17

Severity Requirement: If data x0 agree with a claim C, but the test procedure had little or no capability of finding flaws with C (even if the claim is incorrect), then x0 provide poor evidence for C

• Such a test fails a minimal requirement for a stringent or severe test

• Our account: severe testing based on error statistics (requires reinterpreting tests)
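A minimal sketch of a post-data severity calculation for the simple one-sided Normal (z) test (our own illustration, in the spirit of the “seveRity” R program listed in the references): after a significant result, the severity with which “μ > μ1” passes is Pr(X̄ ≤ observed x̄; μ = μ1).

```python
# Sketch (our illustration): post-data severity for inferring mu > mu1 after a
# significant one-sided Normal (z) test of H0: mu <= mu0 vs. H1: mu > mu0.
# SEV(mu > mu1) = Pr(X-bar <= observed x-bar; mu = mu1).
import numpy as np
from scipy import stats

def severity(xbar_obs, mu1, sigma, n):
    se = sigma / np.sqrt(n)
    return stats.norm.cdf((xbar_obs - mu1) / se)

# Example: n = 100, sigma = 1, observed mean 0.25 (z = 2.5, p ≈ 0.006 against mu0 = 0)
for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1:.1f}) = {severity(0.25, mu1, 1.0, 100):.3f}")
# High severity for mu > 0 or mu > 0.1, but low for mu > 0.3: the data do not
# warrant inferring a discrepancy as large as 0.3.
```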

Page 18

Our view alters the role of probability: typically just two roles are recognized

• Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0.

(e.g., Bayesian, likelihoodist)—with regard for inner coherency

• Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)

Page 19

What happened to using probability to assess error probing capacity and severity?

• Neither “probabilism” nor “performance” directly captures it

• Good long-run performance is a necessary, not a sufficient, condition for severity

Page 20

• Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs—

• It’s that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data

Page 21

A claim C is not warranted _______

• Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)

• Performance: unless it stems from a method with low long-run error rates

• Probativism (severe testing): unless something (a fair amount) has been done to probe the ways we can be wrong about C

Page 22

• Claim: “If you assume probabilism, error probabilities are relevant for inference only by misinterpretation.” False!

• Error probabilities play a crucial role in appraising well-testedness

• It’s crucial to be able to say: C is highly believable or plausible but poorly tested

• With this in mind, go back to the paradox of replication

Page 23

• Critic: It’s too easy to satisfy standard significance thresholds

• You: Why do replicationists find it so hard to achieve significance thresholds (with preregistration)?

• Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, data-dredging

• You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

• Critic: Actually the “reforms” recommend methods where the need to alter p-values due to data dredging vanishes

Page 24

Likelihood Principle (LP)

The vanishing act links to a pivotal disagreement in the philosophy of statistics battles

In probabilisms (Bayes factors, posteriors), the import of the data is via the ratios of likelihoods of hypotheses

P(x0;H1)/P(x0;H0) for x0 fixed

• Probabilisms condition on the actual data;

• Error probabilities take into account other outcomes that could have occurred but did not (the sampling distribution)
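A standard textbook illustration of this disagreement (our own sketch, not from the slides): 9 successes and 3 failures yield proportional likelihood functions whether n = 12 was fixed in advance or sampling continued until the third failure, so LP-respecting measures agree, yet the p-values differ because the sampling distributions differ.

```python
# Sketch (our illustration): the same data -- 9 successes, 3 failures, theta0 = 0.5 --
# give proportional likelihood functions under two stopping rules, but different
# p-values, because error probabilities depend on the sampling distribution.
from scipy import stats

k, r, theta0 = 9, 3, 0.5   # 9 successes, 3 failures

# Stopping rule A: n = 12 trials fixed in advance (binomial)
p_binom = stats.binom.sf(k - 1, k + r, theta0)     # P(X >= 9 | n = 12, 0.5) ≈ 0.073

# Stopping rule B: sample until the 3rd failure (negative binomial)
p_nbinom = stats.nbinom.sf(k - 1, r, 1 - theta0)   # P(>= 9 successes before 3rd failure) ≈ 0.033

print(f"binomial p = {p_binom:.3f}, negative-binomial p = {p_nbinom:.3f}")

# The likelihood ratio for theta = 0.7 vs. 0.5 is identical under both rules:
lr = (0.7**k * 0.3**r) / (0.5**k * 0.5**r)
print(f"likelihood ratio L(0.7)/L(0.5) = {lr:.2f}  (same under either stopping rule)")
```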

Page 25

All error probabilities violate the LP (even without selection effects):

“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436) 

“The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of before hand or was introduced to explain known effects.” (Rosenkrantz, 1977, p. 122)

Page 26

Leader of a big meta-research center: taking account of biasing selection effects “defies scientific sense”

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…

But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010)

(To his credit, he’s open about this; heads the Meta-Research Innovation Center at Stanford)
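A minimal sketch of the kind of adjustment Goodman is objecting to, assuming a simple Bonferroni correction over m comparisons (the p-values below are our own toy numbers):

```python
# Sketch (our toy numbers): a Bonferroni adjustment scales each p-value by the
# number of comparisons made, reflecting the searching that produced it.
pvals = [0.012, 0.030, 0.047, 0.210]
m = len(pvals)
adjusted = [min(1.0, p * m) for p in pvals]
print(adjusted)   # [0.048, 0.12, 0.188, 0.84]: only the first stays below 0.05
```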

Page 27

What’s wrong with attempts to restore the credibility of science via existing reforms?

• The reforms (based on “probabilisms”) enable rather than check unreliable results due to biasing selection effects

• They ignore the need to control severity (probativeness)

• Replication research doesn’t critically assess whether the measurements are picking up on the purported phenomenon

• Philosophers of science in practice (SPSP) should get involved

Page 28

• Replication research doesn’t critically assess whether the measurements are picking up on the purported phenomenon

Part II

Page 29

Replication Initiatives

“Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence.” (Open Science Collaboration, 2015)

• Replication initiatives such as the Reproducibility Project use direct replications

• Direct replications strive for fidelity to the original procedure, through “consultation with original authors, obtaining original materials, and internal review”

• Discriminate experimental effects from statistical artifacts

Page 30

Work completed in Aug 2015

• Benefits of the replications in the OSC study (which wrapped up in August 2015):

• Preregistered, designed to have high power (see the power sketch below)

• Free of the “perverse incentives” of usual research: guaranteed to be published

• No file drawers
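As a rough illustration of the power planning mentioned above (a normal-approximation sketch with our own numbers, not the OSC’s actual protocol): the per-group sample size needed for 90% power against a medium standardized effect.

```python
# Rough sketch (normal approximation, our own numbers): per-group sample size for a
# two-sample test to have high power against a given standardized effect size d.
import math
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.90):
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided test
    z_power = stats.norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.5))   # ≈ 85 per group for a medium effect (d = 0.5)
```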

Page 31

• Our problem with these projects is that they stick to what might be called “purely statistical” issues: can we get a low P-value or not?

• As important as these statistical issues are, researchers are overlooking important issues related to the way the statistical and theoretical hypotheses are linked to one another

Page 32

• Consider a case where researchers, upon repeated replication, consistently find a statistically significant effect.

• A (legitimate) statistically significant result does not necessarily support the corresponding research hypothesis.

• Without additional probing, or knowledge of adequate connections between the models involved in the experimental inquiry, we have not shown support for any causal model, or even that there is an effect independent of the experimental procedure

Statistical ≠> substantive (H ≠> H*)

Page 33

Replications and Severe Tests

• In error-statistical terms: data from a replication inquiry warrant a hypothesis if and only to the extent that the experimental inquiry has subjected that hypothesis to a severe test.

Page 34

• Recall the severity requirement: If data x0 agree with a claim C, but the test procedure had little or no capability of finding flaws with C (even if the claim is incorrect), then x0 provide poor evidence for C

• Flaws in a substantive/theoretical hypothesis H* are not probed by a test of the statistical hypothesis H

• The inference from a statistically significant result to H* fails to pass with severity

Statistical ≠> substantive (H ≠> H*)

Page 35

A hypothesis that must be considered:

Our findings reflect the inability of our inquiry to severely probe the (substantive) research hypothesis H*

Severely Testing H*

Page 36

• It is fallacious to see a low p-value as automatically licensing the inference to a genuine effect.

• For analogous reasons, it is erroneous to jump from merely having a reliable statistical effect to accepting a research hypothesis.

• Inferring support for the research hypothesis requires us to rule out sources of error such as problems with experimental design, experimental assumptions, and model specification.

• The significance test (ANOVA, regression, etc.) doesn’t have the capacity to uncover error at the level of the research hypothesis

Replications and Severe Tests

Page 37

Examples of sources of error that can undermine the link between the statistical and substantive research hypotheses, such that the inquiry has a poor capacity to uncover error:

• Using measurements that don’t actually capture what we’re interested in

• Scales with poor validity

• Poorly chosen proxy variables

• Lacking experimental control

• Using participants who are aware of the goal of the study

Page 38

• Study: “The Value of Believing in Free Will: Encouraging a Belief in Determinism Increases Cheating” (Vohs & Schooler 2008)

• Hypothesis: Inducing participants to believe that human behavior is predetermined will increase their cheating behavior

• Does the experiment track this?

Thinking About Replications

Page 39

How Should We Interpret Replication Results?

• OSC: “In some cases, the replications increase confidence in the reliability of the original results; in other cases, the replications suggest that more investigation is needed to establish the validity of the original findings” (2015)

• Implication: consistent results suggest the original findings may be trusted; the validity of the original findings only comes into question when results are not replicated

Page 40

How Should We Interpret Replication Results?

• Should we interpret negative (direct) replications as indicating that the original was a false positive?

• Failed replications are often dismissed for supposed failure to be faithful to the original procedure, or even just incompetence on the part of the second set of researchers

Page 41

Warrant for a hypothesis is related to the capacities of the test the hypothesis is subjected to. Different procedures are capable of providing different levels of warrant for a hypothesis.

• For this reason it is important to distinguish between having “positive results”, e.g. in the form of a rejected null, and having good evidence for a theory

• For example: if there is something wrong with an experiment’s protocol such that it guarantees the generation of a positive result regardless of the truth of the research hypothesis, the results will be highly replicable, but this is not good evidence for the associated theory
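A toy sketch of that last point, assuming the protocol itself injects a systematic bias into the “treatment” measurements, unrelated to the substantive construct (our own made-up numbers): the spurious effect then replicates almost every time.

```python
# Sketch (our toy model): a flawed protocol that injects a constant bias into the
# "treatment" measurements makes a null substantive effect replicate reliably.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, bias, replications = 100, 0.6, 20
pvals = []
for _ in range(replications):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(0.0, 1.0, n) + bias   # bias from the protocol, not the construct
    pvals.append(stats.ttest_ind(treated, control).pvalue)
print(f"{np.mean(np.array(pvals) < 0.05):.0%} of replications 'significant'")  # nearly all
```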

Page 42

This takes us back to a point about proposed statistical reforms…

Page 43

Potential Reforms That Court Fallacious Inferences

To block unwarranted inferences due to selection effects, one reform that has been proposed (e.g., by Gelman) is the following:

Include your beliefs in the interpretation of the evidence (i.e., assign a low prior to the implausible hypothesis) to prevent declaring there’s statistical evidence for some unbelievable claim
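For concreteness, a schematic version of this kind of reform (our own toy numbers, not any author’s specific proposal): treat rejection as a binary “significant” outcome and compute Pr(H1 | rejection); a low prior keeps the posterior low even when p < 0.05.

```python
# Sketch (our illustration, not any author's specific proposal): a low prior on H1
# keeps Pr(H1 | significant result) low even when the result reaches p < 0.05.
def posterior_given_significance(prior_h1, alpha=0.05, power=0.8):
    """Pr(H1 | test rejects), treating rejection as a binary 'significant' outcome."""
    joint_h1 = prior_h1 * power            # true positives
    joint_h0 = (1 - prior_h1) * alpha      # false positives
    return joint_h1 / (joint_h1 + joint_h0)

for prior in (0.5, 0.1, 0.01):
    print(f"prior Pr(H1) = {prior:.2f} -> posterior ≈ {posterior_given_significance(prior):.2f}")
# e.g. prior 0.01 gives posterior ≈ 0.14: the 'implausible' claim stays improbable.
```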

Page 44

The Problem With Plausibility

PROBLEM: This also countenances blurring the statistical and substantive hypotheses!

• Whether or not we have tested a hypothesis well is a different matter from whether it is plausible

• The relationship between the plausibility of substantive hypotheses and the spuriousness of experimental results is by no means direct.

• Even if a purported effect is entirely plausible, it may be highly implausible that a particular inquiry has generated good evidence for accepting it.

• We can easily imagine cases where a hypothesis we know to be true is “supported” using biased data or an irrelevant experiment

Page 45

The Problem With Plausibility

Other potential issues:

• Researchers often sincerely believe their hypotheses: how to gauge plausibility?

• Now you’ve got two sources of flexibility: priors and biasing selection effects

Page 46

Tasks for Philosophers of Science

• The mess surrounding the replication crisis specifically is rife with problems amenable to the work of philosophers of science and statistics

• Yet despite the large amount of dust kicked up around replicability problems in psychology, there has been a striking lack of professional philosophical investigation into the matter

• This has left psychologists in a curious situation, where attempts at reform are made prior to having any internally consistent notion of what it means to perform probative experimental inquiry or practice “good statistics”.

Page 47

Tasks for Philosophers of Science

Philosophers of science and statistics are equipped to clarify concepts and uncover the presuppositions of those engaged in research and reform projects; to uncover tensions or inconsistencies within positions; to improve on existing methodology; and to help solve scientific problems.

They can help restore credibility to scientific enterprises by inculcating consistent interpretations of statistical tests and explaining the motivations behind methodological rules.

Page 48

Part I References

• Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (forthcoming). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses." Journal of Mathematical Psychology.  Invited paper for special issue on “Bayesian hypothesis testing.”

• Berger, J. O. 2003. ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and ‘Rejoinder’, Statistical Science 18(1): 1-12; 28-32.

• Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.

• Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033.

• Efron, B. 2013. 'A 250-Year Argument: Belief, Behavior, and the Bootstrap', Bulletin of the American Mathematical Society 50(1): 126-46.

• Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T. and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and Robustness. New York: Academic Press.

• Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.

• Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University Press.

Page 49

• Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.

• Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.

• Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder’” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80.

• Gigerenzer, G., Swijtink, Porter, T. Daston, L. Beatty, J, and Kruger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.

• Gilbert, D. Twitter post: https://twitter.com/dantgilbert/status/470199929626193921

• Gill, comment on “Suspicion of Scientific Misconduct by Jens Forster” by Neuroskeptic, May 6, 2014, on Discover Magazine Blog: http://blogs.discovermagazine.com/neuroskeptic/2014/05/06/suspicion-misconduct-forster/#.Vynr3j-scQ0.

• Goldacre, B. 2008. Bad Science. HarperCollins Publishers.

• Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature 530(7588); 7; online 04Feb2016.

• Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor.” Annals of Internal Medicine 130: 1005–1013.

• Handwerk, B. 2015. “Scientists Replicated 100 Psychology Studies, and Fewer than Half Got the Same Results.” Smithsonian Magazine (August 27, 2015) http://www.smithsonianmag.com/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/?no-ist

• Hasselman, F. and Mayo, D. 2015, April 17. “seveRity” (R-program). Retrieved from osf.io/k6w3h

Page 50

• Levelt Committee, Noort Committee, Drenth Committee. 2012. 'Flawed science: The fraudulent research practices of social psychologist Diederik Stapel', Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. (https://www.commissielevelt.nl/)

• Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

• Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.

• Mayo, D. G. 2016. 'Don't Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary', The American Statistician, online March 7, 2016. http://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108.

• Mayo, D. G. Error Statistics Philosophy Blog: errorstatistics.com

• Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

• Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.

• Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.

Page 51

• Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–300.

• Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

• Open Science Collaboration (Nosek, B. et al). 2015. “Estimating the Reproducibility of Psychological Science.” Science 349(6251).

• Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul. Acad. Pol.Sci. 73-96.

• Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

• Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

• Selvin, H. 1970. “A critique of tests of significance in survey research.” In The significance test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

• Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone", Psychological Science, vol. 24, no. 10, pp. 1875-1888.

• Smithsonian Magazine (See Handwerk)

• Trafimow D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology 37(1): pp. 1-2.

• Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context, process, and purpose”, The American Statistician. Link to ASA statement and commentaries (under supplemental): http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108

Page 52

Part II References

• Aarts et al. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349(6251): aac4716.

• Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385–402.

• Flam, F. D. 2014. “The Odds, Continually Updated.” The New York Times (29 Sept. 2014): D1. http://www.nytimes.com/2014/09/30/science/the-odds-continually-updated.html

• Nadelhoffer, Thomas, Jason Shepard, Eddy Nahmias, Chandra Sripada, and Lisa Thomson Ross. 2014. “The Free Will Inventory: Measuring Beliefs about Agency and Responsibility.” Consciousness and Cognition 25: 27-41.

• Viney, Wayne, David A. Waldman, and Jacqueline Barchilon. 1982. “Attitudes Toward Punishment in Relation to Beliefs in Free Will and Determinism.” Human Relations 35(11): 939-950.

• Vohs, Kathleen D., and Jonathan W. Schooler. 2008. “The Value of Believing in Free Will: Encouraging a Belief in Determinism Increases Cheating.” Psychological Science 19(1): 49-54.

• Wagenmakers, Wetzels, Borsboom, and van der Maas. 2011. “Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011).” Journal of Personality and Social Psychology 100: 426-432.

Page 53

Abstract:  Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.