science 2.0 : an illustration of good research practices in a real study

wolf vanpaemel kortrijk, march 9 2015

1. the crisis in psychology

Why can we definitively say that? Because psychology often does not meet the five basic requirements for a field to be considered scientifically rigorous: clearly defined terminology, quantifiability, highly controlled experimental conditions, reproducibility and, finally, predictability and testability.

2% fabricated data

- mundane 'regular' misbehaviours present greater threats to the scientific enterprise than those caused by high-profile misconduct cases such as fraud.

- first assessment of questionable research practices (QRP)

- 2002 assessment: NIH-funded research
--- 1768 mid-career (52% response rate)
--- 1479 early-career (43% response rate)

- first assessment of QRP in psychology

- 2155 respondents (36% response rate)

http://www.hbs.edu/faculty/Pages/profile.aspx?facId=589473

the problems of QRP are widespread, and have very severe consequences

why is that the case?

“never attribute to malice what can be adequately explained by incompetence”

the main reasons are lack of guidelines, and the high publication pressure

i’m not interested in fraud (e.g., diederik stapel who made up his own data)

preventing fraud requires a different approach

2. science 2.0

a new way of doing science that aims to increase the confidence in research results

not one, single, coherent whole

a demonstration of science 2.0 with a real study

reference: Steegen, S., Dewitte, L., Tuerlinckx, F., & Vanpaemel, W. (2014). Measuring the crowd within again: A pre-registered replication study. Frontiers in Psychology, 5, 786, 1-8. doi:10.3389/fpsyg.2014.00786

paper: http://ppw.kuleuven.be/okp/_pdf/Steegen2014MTCWA.pdf

OSF page: https://osf.io/ivfu6/

based on some recommendations on good research practices made in the literature

• not exhaustive
• non-directive examples
• for inspiration

most recommendations can be implemented separately from each other
• not an all-or-none package deal

crowd within effect (vul & pashler, 2008)

• averaging multiple guesses from one person provides a better estimate than either guess alone

experiment

• 8 general knowledge questions
- e.g., what percent of the world's roads are in India?

• guess 1 guess 2
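the logic of the effect can be sketched with a small simulation. everything in it is an illustrative assumption rather than the design or data of the study: each guess is modelled as the true answer plus an error shared by both guesses of a person plus independent noise, and the values of `truth`, `bias_sd`, `noise_sd` and `n_people` are hypothetical.

```python
import random

def simulate_crowd_within(n_people=5000, truth=50.0, bias_sd=10.0,
                          noise_sd=10.0, seed=0):
    """Return mean squared errors of guess 1, guess 2, and their average."""
    rng = random.Random(seed)
    se1 = se2 = se_avg = 0.0
    for _ in range(n_people):
        bias = rng.gauss(0, bias_sd)                 # error shared by both guesses
        g1 = truth + bias + rng.gauss(0, noise_sd)   # first guess
        g2 = truth + bias + rng.gauss(0, noise_sd)   # second guess
        avg = (g1 + g2) / 2
        se1 += (g1 - truth) ** 2
        se2 += (g2 - truth) ** 2
        se_avg += (avg - truth) ** 2
    return se1 / n_people, se2 / n_people, se_avg / n_people

mse1, mse2, mse_avg = simulate_crowd_within()
print(f"MSE guess 1: {mse1:.1f}, guess 2: {mse2:.1f}, average: {mse_avg:.1f}")
```

in this toy model, averaging cancels part of the independent noise but none of the shared error, so the average has a smaller error than either guess without being perfect.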

features of science 2.0

before data collection
1. replication
2. registration
3. high power

after data collection/during data analysis
4. bayesian statistics
5. alpha level
6. estimations
7. co-pilot multi-software approach

after data analysis
8. distinction between confirmatory and exploratory analyses
9. open science

for each feature: what? how? why?

2.1 replicate!

replication

what?

do the same, following the experimental and analytical procedure as closely as possible

direct replication study

things can never be exactly the same

indicate the known differences

replication

how?

communicate with the original authors: ask for information and feedback

ideal for a master's thesis (masterproef): less focus on creativity, more on skill building

replication

why?

- lots of variability between studied phenomena
- lots of variability between labs/replications
- what can we learn from a single study?

2.2 register!

registration

what?

we specified all research details before data collection

data collection

• sample size planning (stopping rule; see below)

• recruitment: how to recruit participants (e.g., pool)

data analysis

• data cleaning plan (when to delete data)

• analysis plan
- which exact hypotheses to test
- which variables to use
- analyses for testing the hypotheses

• code for the analyses

experimental details (optional)

• experimental materials
- stimuli (questions)
- exact instructions

• experimental procedure
- randomization etc.

registration

how?

• Registered Report
- new format of publishing
- review prior to data collection
- accepted papers are then (almost) guaranteed publication if the authors follow through with the registered methodology
- AIMS Neuroscience; Attention, Perception & Psychophysics; Cortex; Drug and Alcohol Dependence; Experimental Psychology; Frontiers in Cognition; Perspectives on Psychological Science; Social Psychology; …

• “independent” pre-registration
- e.g., Open Science Framework (OSF)
- open source software project
- free

registration

why?

prevent readers from thinking you might have exploited your researcher degrees of freedom

extreme flexibility in

• data collection
- e.g., data peeking

• data analysis
- what is an outlier?
- when to add covariates?
- when to transform the data?

• reporting
- did you report all variables, conditions, experiments, analyses?

exploiting researcher degrees of freedom can lead to an increase in false positives

-- e.g., data peeking: without adjustment, a true null hypothesis will always be rejected if sampling continues long enough

if you can convince readers that you didn’t exploit the researcher degrees of freedom, they will put more confidence in your result; it will be seen as more trustworthy
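how much data peeking inflates false positives can be sketched with a simulation. this is a minimal illustration under stated assumptions, not an analysis from the study: the null hypothesis is true in every simulated experiment, a z test with known standard deviation stands in for the t test, and the batch sizes (`n_start`, `step`, `n_max`) are hypothetical.

```python
import random
from statistics import NormalDist

def z_test_p(values, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, with known standard deviation."""
    n = len(values)
    z = (sum(values) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(peek, n_start=10, n_max=100, step=10,
                        alpha=0.05, n_sim=2000, seed=1):
    """Share of simulated null experiments that end with p < alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sim):
        data = [rng.gauss(0, 1) for _ in range(n_start)]  # H0 is true: mean 0
        while True:
            done = len(data) >= n_max
            # test after every batch when peeking, otherwise only at the end
            if (peek or done) and z_test_p(data) < alpha:
                false_positives += 1
                break
            if done:
                break
            data.extend(rng.gauss(0, 1) for _ in range(step))
    return false_positives / n_sim

fixed = false_positive_rate(peek=False)    # registered, fixed sample size
peeking = false_positive_rate(peek=True)   # stop as soon as p < alpha
print(f"fixed n: {fixed:.3f}, with peeking: {peeking:.3f}")
```

with a fixed sample size the rate stays close to the nominal alpha level; stopping at the first significant look pushes it well above that level, and more looks push it higher still.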

2.3 power up!

high power

what?

among the decisions you have to make and register in advance is when you’ll stop collecting data

our stopping rule was based on fixing the sample size

fixing the sample size was based on a power calculation

power = P(reject null hypothesis | null hypothesis is false)

as far as constraining the researcher degrees of freedom is concerned, low power is as good as high power

we aimed for high power (95%)

high power

how?

compute the sample size needed to achieve the desired power level
- given the statistical test
- given the significance level
- given the effect size (e.g., based on previous studies)

G*Power, R packages (pwr), …
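the calculation behind such tools can be sketched as follows. this is a rough normal approximation for a two-sided one-sample (or paired) test, not the exact noncentral-t computation that G*Power or pwr perform, so it tends to give a slightly smaller n than those tools. the function names and the effect size `d = 0.5` are hypothetical and are not the values used in the study.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n(effect_size, alpha=0.05, power=0.95):
    """Approximate sample size for a two-sided test of a standardized effect."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the test
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)

def approx_power(effect_size, n, alpha=0.05):
    """Approximate power for a given n (ignores the negligible opposite tail)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * sqrt(n) - z_alpha)

d = 0.5  # hypothetical standardized effect size
n = required_n(d)
print(n, round(approx_power(d, n), 3))
```

the required sample size grows with the inverse square of the effect size, which is why overestimated published effect sizes lead to underpowered replications.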

high power

why?

• low power reduces the probability of discovering effects that are there

• low power reduces the probability that a significant result reflects a true effect (button et al., 2013)

• low power leads to an inflation of estimated effect sizes
- only overestimates will be significant
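the second point can be made concrete with the positive predictive value (PPV) formula given by button et al. (2013): PPV = (power × R) / (power × R + alpha), where R is the pre-study odds that a tested effect is real. the sketch below only evaluates that formula; the default `prior_odds = 0.25` (one real effect for every four null effects) is a hypothetical value chosen for illustration.

```python
def ppv(power, alpha=0.05, prior_odds=0.25):
    """Probability that a significant result reflects a true effect.

    prior_odds is the ratio of real effects to null effects among the
    hypotheses being tested (hypothetical default).
    """
    return power * prior_odds / (power * prior_odds + alpha)

for power in (0.20, 0.50, 0.95):
    print(f"power = {power:.2f} -> PPV = {ppv(power):.2f}")
```

under these assumed odds, a significant result from a study with 20% power is as likely to be false as true, whereas at 95% power most significant results reflect real effects.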

there are other stopping rules!

ways to decide when to stop collecting data

- when I have a participant with the name of my mother

- availability
--- when the day/test week is over

- when I have a fixed number of participants
--- 100
--- based on power calculations
--- based on accuracy in parameter estimation

in general, the most important thing is that you do it, more than how to do it

all these stopping rules are equally valid to constrain the researcher degrees of freedom

but some will lead to better research than others
--- more informative
--- more precise and less biased estimates of, e.g., effect size

2.4 go bayes

NHST & Bayesian testing

what?

we did not just use Null Hypothesis Significance Testing (NHST, i.e., p-values) but also Bayes factors (the bayesian counterpart of the p-value)

the core of bayesian statistics is bayes’ rule

bayes treats probabilities as degrees of belief


what?

we can use bayes’ rule to compute the belief in our hypothesis H, given the data d, written p(H|d)

bayes’ rule tells us how we should update our belief about H after observing data
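written out, bayes’ rule and its form for comparing two hypotheses look as follows. the labels H_0 and H_1 for the null and alternative hypothesis and the symbol BF_10 are notation assumed here; the Bayes factor is the term that turns prior odds into posterior odds.

```latex
p(H \mid d) = \frac{p(d \mid H)\, p(H)}{p(d)}
\qquad
\frac{p(H_1 \mid d)}{p(H_0 \mid d)}
  = \underbrace{\frac{p(d \mid H_1)}{p(d \mid H_0)}}_{\text{Bayes factor } BF_{10}}
    \times \frac{p(H_1)}{p(H_0)}
```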


how?

• several online tools (e.g., Rouder’s website)

• BayesFactor package in R (Morey & Rouder, 2014)


why?

• p(H|d) seems exactly what science needs

• evidence for null hypothesis
• intuitive to interpret
• consistent: correct answer in large sample limit
• exact for small sample size
• clear interpretation of evidence
• based on the observed data, not on hypothetical replications of experiments
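a small worked example shows how a Bayes factor can express evidence for the null hypothesis. it deliberately swaps in a simpler case than the one in the study: a binomial test of H0: theta = 0.5 against H1 with a uniform prior on theta, for which the Bayes factor has a closed form. it is not the t-test Bayes factor computed by the BayesFactor package, and the counts used are hypothetical.

```python
from math import comb

def bf01_binomial(k, n):
    """Bayes factor for H0: theta = 0.5 against H1: theta ~ Uniform(0, 1)."""
    likelihood_h0 = comb(n, k) * 0.5 ** n   # p(d | H0)
    marginal_h1 = 1 / (n + 1)               # p(d | H1), averaged over the uniform prior
    return likelihood_h0 / marginal_h1

print(round(bf01_binomial(50, 100), 2))   # data close to what H0 predicts
print(bf01_binomial(90, 100))             # data far from what H0 predicts
```

values above 1 favour the null hypothesis and values below 1 favour the alternative, so unlike a non-significant p-value the result can count as positive support for H0.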

2.5 lower alpha

[figure: probability of H1 (axis values 1, .99, .97, .90, .75, .50)]

2.6 test and estimate

NHST & estimation

what?

we did not just use p-values and Bayes factors but also effect size estimates and their confidence intervals

how?

Matlab, R, SPSS, ESCI (Cumming, 2013), …

why?

diverts focus from the presence of an effect to the more informative size of an effect and its precision
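what an estimate with its precision looks like can be sketched for paired differences. the sketch is an illustration under stated assumptions: the interval uses a normal quantile, which is too narrow for small samples where a t quantile is the appropriate choice, and both the function name and the data in `diffs` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def estimate_effect(diffs, level=0.95):
    """Mean difference, approximate confidence interval, and Cohen's d."""
    n = len(diffs)
    m = mean(diffs)
    sd = stdev(diffs)                                   # sample standard deviation
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)       # normal approximation
    half_width = z * sd / sqrt(n)
    return {"mean": m, "ci": (m - half_width, m + half_width), "d": m / sd}

# hypothetical paired differences (e.g., error of guess 1 minus error of the average)
diffs = [4.0, -1.5, 6.2, 3.1, 0.8, 5.5, -0.7, 2.9, 4.4, 1.3]
result = estimate_effect(diffs)
print(result)
```

reporting the interval alongside the point estimate shows how precisely the effect was measured, which a p-value alone does not convey.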

2.7 co-pilot

co-pilot multi-software approach

what/how?

• two people independently processed and analyzed the same data …

• … using different software (MATLAB, SPSS)
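one way to compare the output of the two analysts is an automated check of the key statistics. this is a hypothetical sketch, not the procedure used in the study: it assumes each analyst exports a summary as name-value pairs, and the names `compare_results`, `matlab_out` and `spss_out` as well as all numbers are made up for illustration.

```python
from math import isclose

def compare_results(pilot, copilot, rel_tol=1e-6):
    """Return names of statistics that are missing on one side or disagree."""
    mismatches = []
    for name in sorted(set(pilot) | set(copilot)):
        if name not in pilot or name not in copilot:
            mismatches.append(name)      # reported by only one analyst
        elif not isclose(pilot[name], copilot[name], rel_tol=rel_tol):
            mismatches.append(name)      # values differ beyond the tolerance
    return mismatches

# hypothetical summaries exported from two independent analyses
matlab_out = {"n": 100, "mean_diff": 3.2101, "t": 4.5512}
spss_out = {"n": 100, "mean_diff": 3.2101, "t": 4.5712}
print(compare_results(matlab_out, spss_out))
```

any name in the returned list marks a statistic the two analysts should trace back together before anything is reported.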

why?

decreases the likelihood of errors

errors are easily made:

50% of published papers in psychology contain reporting errors (bakker & wicherts, 2011)

e.g., an error in sample size planning (G*Power)

2.8 distinguish between confirmatory and exploratory

clear distinction between confirmatory and exploratory (post hoc) analyses

what?

we indicated whether the analyses were specified before seeing the data, or based on the data (see registration)

how?

be transparent

easy if you have registered

why?

you still want to report analyses you thought about too late! they can be useful for generating hypotheses

2.9 go open

open science

what?

we made our full research output publicly available to everybody
- experimental materials (stimuli, questionnaire items, instructions, and so on)
- raw data
- processed data
- code for data processing
- code for confirmatory analyses
- code for post-hoc analyses
- paper

open science

how?

Open Science Framework (public)

- online repository
- free
- under development

goal: share and find research materials

make study materials (experimental material, data, code, …) public so that other researchers can find, use and cite them

several other sharing possibilities

make sure OSF is not the only place where your stuff is!

who knows what will happen with these servers in 20 years?

unclear what the best data format is

open science

why?

• the current standards of what is considered research output (paper with summary statistics and conclusion) are not inspired by desiderata for good science, but rather by arbitrary and outdated technical constraints (paper + publishing costs)

• if we were to start doing science right now, in the computer and internet age, we would probably set a completely different standard

• facilitates
- replication studies
- follow-up studies (e.g., use same stimuli)
- new or re-analyses
- meta-analyses
- accumulation of scientific knowledge
- detection of errors or fraud

• yields useful teaching material

• increases visibility

• increases citability

• decreases number of emails about experiments, data or analyses, …

• is a moral obligation to taxpayers (publicly funded research is a public good)

3. discussion

3.1 why not?

replication

why not?

- it is impossible!
--- things can never be exactly the same (e.g., population)
--- the details of the original study are lost (e.g., which questions were used in a post-experimental interview)

- it is a waste of time and resources!
--- should we value novelty more than truth?

- it is not good for my career
--- can I publish this?

registration

why not?

• it takes time, thought and effort

• it is harder than it seems!
- writing the code helps a lot

• exploration might be the only possibility

• domain specific (qualitative studies? complex studies?)

high power

why not?

• can be hard to guess expected effect size or trust published effect size

• often requires a large sample size
- collaborate!

• restricted to NHST framework

Bayes it

why not?

• priors

• education?

• Bayes factors used to be hard to compute

Open up

why not?

sharing data takes time

sharing data might jeopardize a potential future publication
- but: embargo period

Other (co-pilot, alpha, confirmation vs exploration, estimation)

why not?

lack of education

old habits

takes time and is not rewarded

3.2 feasibility

this illustration used a very simple study
• replication study
• easily administered 8-item questionnaire
• basic t test

this made pre-registration, sample size planning, high power, estimation, bayesian statistics, sharing protocol, code and data, co-pilot multi-software, etc. probably much easier than in most other studies

but everything is also possible (though harder) for non-replication studies!

feasibility will depend on the type and scope of your research

science 2.0 is no package deal

--- you can register, but not share
--- you can share, but not use bayes

some practices are graded
--- you can register without code
--- you can estimate without reporting CI

3.3 what should i take home?

• the (psychological) literature is littered with spurious findings

• which results can you trust?
– has this result been replicated?
– did the researchers exploit their researcher degrees of freedom?
– is the evidence based on NHST with a liberal alpha level?
– was the analysis correct (e.g., at the very least, check the dfs; better, redo the analysis yourself with the shared data and code)
– ???

3.4 is there a crowd within effect?

successful replication

• error guess 1 > error average
• error guess 2 > error average

the end (or the beginning!)