science 2.0
an illustration of good research practices in a real study
wolf vanpaemel kortrijk, march 9 2015
1. the crisis in psychology
Why can we definitively say that? Because psychology often does not meet the five basic requirements for a field to be considered scientifically rigorous: clearly defined terminology, quantifiability, highly controlled experimental conditions, reproducibility and, finally, predictability and testability.
2% fabricated data
- mundane 'regular' misbehaviours present greater threats to the scientific enterprise than those caused by high-profile misconduct cases such as fraud.
- first assessment of questionable research practices (QRP)
- 2002 assessment: NIH funded research
--- 1768 mid-career (52% response rate)
--- 1479 early-career (43% response rate)
- first assessment of QRP in psychology
- 2155 respondents (36% response rate)
the problems of QRP are widespread, and have very severe consequences
why is that the case?
“never attribute to malice what can be adequately explained by incompetence”
the main reasons are lack of guidelines, and the high publication pressure
i’m not interested in fraud (e.g., diederik stapel who made up his own data)
preventing fraud requires a different approach
2. science 2.0
a new way of doing science that aims to increase the confidence in research results
not one, single, coherent whole
a demonstration of science 2.0 with a real study
reference: Steegen, S., Dewitte, L., Tuerlinckx, F., & Vanpaemel, W. (2014). Measuring the crowd within again: A pre-registered replication study. Frontiers in Psychology, 5, 786, 1-8. doi:10.3389/fpsyg.2014.00786
paper: http://ppw.kuleuven.be/okp/_pdf/Steegen2014MTCWA.pdf
OSF page: https://osf.io/ivfu6/
based on some recommendations on good research practices made in the literature
• not exhaustive
• non-directive examples
• for inspiration
most recommendations can be implemented separately from each other
• not an all or none package deal
crowd within effect (vul & pashler, 2008)
• averaging multiple guesses from one person provides a better estimate than either guess alone
experiment
• 8 general knowledge questions, e.g., what percent of the world's roads are in India?
• guess 1 guess 2
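a minimal sketch of what "a better estimate" means here, assuming squared error as the error measure. the numbers are made up for illustration and are not data from the study; `mean_squared_error` is a hypothetical helper.

```python
from statistics import mean

def mean_squared_error(guesses, truths):
    """Mean squared error of a list of guesses against the true answers."""
    return mean((g - t) ** 2 for g, t in zip(guesses, truths))

# hypothetical numbers for one participant and three questions (not study data)
truths = [17.0, 40.0, 8.0]   # made-up "true" percentages
guess1 = [10.0, 55.0, 5.0]   # first guess per question
guess2 = [30.0, 35.0, 15.0]  # second guess per question

# the "crowd within": average the two guesses of the same person
average = [(a + b) / 2 for a, b in zip(guess1, guess2)]

error_guess1 = mean_squared_error(guess1, truths)
error_guess2 = mean_squared_error(guess2, truths)
error_average = mean_squared_error(average, truths)
```

in this made-up example the averaged guess has a smaller error than either guess alone, which is the pattern the crowd within effect predicts.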
1. replication
2. registration
3. high power
4. bayesian statistics
5. alpha level
6. estimations
7. co-pilot multi-software approach
8. distinction between confirmatory and exploratory analyses
9. open science
what? how? why?
features of science 2.0
before data collection
after data collection/during data analysis
after data analysis
2.1 replicate!
replication
what?
do the same, following the experimental and analytical procedure as closely as possible
direct replication study
things can never be exactly the same
indicate the known differences
replication
how?
communicate with the original authors; ask for information and feedback
ideal for a master's thesis (masterproef): not much focus on creativity, but more on skill building
replication
why?
- lots of variability between studied phenomena
- lots of variability between labs/replications
- what can we learn from a single study?
2.2 register!
registration
what?
we specified all research details before data collection
data collection
• sample size planning (stopping rule; see below)
• recruitment: how to recruit participants (e.g., pool)
data analysis
• data cleaning plan (when to delete data)
• analysis plan
- which exact hypotheses to test
- which variables to use
- analyses for testing the hypotheses
• code for the analyses
experimental details (optional)
• experimental materials
- stimuli (questions)
- exact instructions
• experimental procedure
- randomization etc.
registration
how?
• Registered Report
- new format of publishing
- review prior to data collection
- accepted papers are then (almost) guaranteed publication if the authors follow through with the registered methodology
AIMS Neuroscience; Attention, Perception & Psychophysics; Cortex; Drug and Alcohol Dependence; Experimental Psychology; Frontiers in Cognition; Perspectives on Psychological Science; Social Psychology; …
• “independent” pre-registration
e.g., Open Science Framework (OSF)
- open source software project
- free
registration
why?
prevent readers from thinking you might have exploited your researcher degrees of freedom
extreme flexibility in
• data collection
- e.g., data peeking
• data analysis
- what is an outlier?
- when to add covariates?
- when to transform the data?
• reporting
- did you report all variables, conditions, experiments, analyses?
exploiting researcher degrees of freedom can lead to an increase in false positives
- e.g., data peeking: without adjustment, a true null hypothesis will always be rejected if sampling continues long enough
if you can convince readers that you didn’t exploit the researcher degrees of freedom, they will put more confidence in your result; it will be seen as more trustworthy
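a small simulation sketch of why data peeking inflates false positives. it is not part of the study; it assumes a z test with known standard deviation 1, a true null hypothesis, and a look at the data after every 10 participants. `simulate_false_positives` and all its parameters are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist

def simulate_false_positives(n_sims=2000, n_max=100, peek_every=10,
                             alpha=0.05, seed=1):
    """Return (fixed-n rate, peeking rate) of false positives under a true null.

    Assumes n_max is a multiple of peek_every, so the last look is at n_max.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_hits = 0    # significant at the planned final sample size
    peeking_hits = 0  # significant at any of the interim looks
    for _ in range(n_sims):
        total = 0.0
        rejected_any = False
        rejected_final = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0, 1)        # null is true: mean 0, sd 1
            if n % peek_every == 0:
                z = total / n ** 0.5        # z statistic: mean * sqrt(n)
                rejected = abs(z) > z_crit
                rejected_any = rejected_any or rejected
                if n == n_max:
                    rejected_final = rejected
        fixed_hits += rejected_final
        peeking_hits += rejected_any
    return fixed_hits / n_sims, peeking_hits / n_sims

fixed_rate, peeking_rate = simulate_false_positives()
```

with a fixed sample size the false positive rate stays near alpha; stopping at the first significant look can only raise it, because the final look is one of the looks.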
2.3 power up!
high power
what?
among the decisions you have to make and register in advance is when you’ll stop collecting data
our stopping rule was based on fixing the sample size
fixing the sample size was based on a power calculation
power = P(reject null hypothesis | null hypothesis is false)
as far as constraining the researcher degrees of freedom is concerned, low power is as good as high power
we aimed for high power (95%)
high power
how?
compute sample size needed to achieve desired power level
- given the statistical test
- given the significance level
- given the effect size (e.g., based on previous studies)
G*Power, R packages (pwr), …
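a rough sketch of such a sample size computation, using only the python standard library. it assumes a two-sided one-sample (or paired) test and a normal approximation, so it is not a substitute for G*Power or pwr; `required_sample_size` and the effect size of 0.5 are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(effect_size, alpha=0.05, power=0.95):
    """Approximate n for a two-sided one-sample (or paired) test.

    Normal approximation: n = ((z_(1 - alpha/2) + z_power) / d) ** 2.
    An exact t-test calculation gives a slightly larger n.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)

# hypothetical standardized effect size d = 0.5, alpha = .05, power = .95
n = required_sample_size(effect_size=0.5)  # 52
```

the three inputs are exactly the ones listed above: the test (here fixed by the formula), the significance level and the effect size.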
high power
why?
• low power reduces the probability of discovering effects that are there
• low power reduces the probability that a significant result reflects a true effect (button et al., 2013)
• low power leads to an inflation of estimated effect sizes
- only overestimates will be significant
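a simulation sketch of the last point, not taken from the study: with a small true effect and a small sample, the studies that reach significance overestimate the effect on average. it assumes a z test with known standard deviation 1; `mean_significant_estimate` and its default values are hypothetical.

```python
import random
from statistics import NormalDist, mean

def mean_significant_estimate(true_effect=0.2, n=20, n_sims=5000,
                              alpha=0.05, seed=1):
    """Return (mean estimate among significant studies, share significant)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        sample = [rng.gauss(true_effect, 1) for _ in range(n)]
        estimate = sum(sample) / n                 # sd is 1, so this estimates d
        if abs(estimate) * n ** 0.5 > z_crit:      # z test with known sd = 1
            significant.append(estimate)
    return mean(significant), len(significant) / n_sims

true_effect = 0.2
avg_significant, share_significant = mean_significant_estimate(true_effect)
```

with these settings only a minority of the simulated studies is significant, and their average estimate lies well above the true effect of 0.2.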
there are other stopping rules!
sources for how to decide when to stop collecting data
- when I have a participant with the name of my mother
- availability
--- when the day/test week is over
- when I have a fixed number of participants
--- 100
--- based on power calculations
--- based on accuracy in parameter estimation
in general, the most important thing is that you do it, more than how to do it
all these stopping rules are equally valid to constrain the researcher degrees of freedom
but some will lead to better research than others
--- more informative
--- more precise and less biased estimates of e.g. effect size
2.4 go bayes
NHST & Bayesian testing
what?
we did not just use Null Hypothesis Significance Testing (NHST, i.e., p-values) but also Bayes factors (the Bayesian counterpart of the p-value)
the core of bayesian statistics is bayes’ rule
bayes treats probabilities as degrees of belief
we can use bayes to compute the belief in our hypothesis H, given the data d
bayes rule tells us how we should update our belief about H after observing data
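a minimal sketch of this updating step, assuming just two competing hypotheses (H1 and H0). the Bayes factor is written BF10 = p(d|H1) / p(d|H0); `posterior_probability` is a hypothetical helper, not a function from any package mentioned here.

```python
def posterior_probability(prior, bayes_factor):
    """P(H1 | d) from the prior P(H1) and the Bayes factor BF10.

    Bayes' rule in odds form: posterior odds = Bayes factor * prior odds.
    Assumes H1 and H0 are the only two hypotheses under consideration.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# equal prior belief in H1 and H0, data three times more likely under H1
p = posterior_probability(prior=0.5, bayes_factor=3)  # 0.75
```

a Bayes factor of 1 leaves the belief unchanged, and a Bayes factor below 1 shifts belief towards the null hypothesis.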
NHST & Bayesian testing
how?
• several online tools (e.g., Rouder’s website)
• BayesFactor package in R (Morey & Rouder, 2014)
NHST & Bayesian testing
why?
• p(H|d) seems exactly what science needs
• evidence for null hypothesis
• intuitive to interpret
• consistent: correct answer in large sample limit
• exact for small sample size
• clear interpretation of evidence
• based on the observed data, not on hypothetical replications of experiments
2.5 lower alpha
[figure: probability of H1; axis ticks 1, .99, .97, .90, .75, .50]
2.6 test and estimate
NHST & estimation
what?
we did not just use p-values and Bayes factors but also effect size estimates and their confidence intervals
how?
Matlab, R, SPSS, ESCI (Cumming, 2013), …
why?
diverts focus from the presence of an effect to the more informative size of an effect and its precision
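a sketch of an effect size estimate with a confidence interval for paired differences. it is an illustration, not the study's analysis code: it uses a normal approximation for the interval (for small samples a t quantile should be used instead), and `effect_size_with_ci` and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def effect_size_with_ci(differences, confidence=0.95):
    """Mean difference, its approximate CI, and Cohen's d_z for paired data."""
    n = len(differences)
    m = mean(differences)
    sd = stdev(differences)              # sample sd (n - 1 in the denominator)
    se = sd / n ** 0.5                   # standard error of the mean difference
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "mean_difference": m,
        "ci": (m - z * se, m + z * se),  # normal-approximation interval
        "cohens_dz": m / sd,             # standardized mean difference
    }

# hypothetical paired differences (not study data)
result = effect_size_with_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

the width of the interval is the precision mentioned above: a wide interval signals that the size of the effect is poorly pinned down, even when the test is significant.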
2.7 co-pilot
co-pilot multi-software approach
what/how?
• two people independently processed and analyzed the same data …
• … using different software (MATLAB, SPSS)
why?
decreases the likelihood of errors
errors are easily made:
50% of published papers in psychology contain reporting errors (bakker & wicherts, 2011)
e.g., an error in the sample size planning (G*Power)
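one way the two analysts' outputs could be compared, sketched under the assumption that each analysis exports its key statistics as name/value pairs. `compare_results`, the statistic names and the numbers are all hypothetical and not the study's results.

```python
from math import isclose

def compare_results(pilot, copilot, rel_tol=1e-6):
    """Return the names of statistics on which the two analyses disagree."""
    mismatches = []
    for name in sorted(set(pilot) | set(copilot)):
        if name not in pilot or name not in copilot:
            mismatches.append(name)  # computed by only one of the analysts
        elif not isclose(pilot[name], copilot[name], rel_tol=rel_tol):
            mismatches.append(name)  # both computed it, values differ
    return mismatches

# hypothetical outputs of two independent analyses (not the study's results)
matlab_results = {"n": 140, "t": 2.31, "p": 0.0223}
spss_results = {"n": 140, "t": 2.31, "p": 0.0232}

disagreements = compare_results(matlab_results, spss_results)  # ["p"]
```

any name in the returned list is a prompt for the two analysts to trace back where their pipelines diverge.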
2.8 distinguish between confirmatory and exploratory
clear distinction between confirmatory and exploratory (post hoc) analyses
what?
we indicated whether the analyses were specified before seeing the data, or based on the data (see registration)
how?
be transparent
easy when having registered
why?
you still want to report analyses you thought about too late! they can be useful for generating hypotheses
2.9 go open
open science
what?
we made our full research output publicly available to everybody
- experimental materials (stimuli, questionnaire items, instructions, and so on)
- raw data
- processed data
- code for data processing
- code for confirmatory analyses
- code for post-hoc analyses
- paper
open science
how?
Open Science Framework (public)
- online repository
- free
- under development
goal: share and find research materials
make study materials (experimental material, data, code, …) public so that other researchers can find, use and cite them
several other sharing possibilities
make sure OSF is not the only place where your stuff is!
who knows what will happen with these servers in 20 years?
unclear what the best data format is
open science
why?
• the current standards of what is considered research output (paper with summary statistics and conclusion) are not inspired by desiderata for good science, but rather by arbitrary and outdated technical constraints (paper + publishing costs)
• if we started doing science right now, in the computer and internet age, we would probably set a completely different standard
• facilitates
- replication studies
- follow up studies (e.g., use same stimuli)
- new or re-analyses
- meta-analyses
- accumulation of scientific knowledge
- detection of errors or fraud
• yields useful teaching material
• increases visibility
• increases citability
• decreases number of emails about experiments, data or analyses, …
• is a moral obligation to the taxpayer (publicly funded research is a public good)
3 discussion
3.1 why not?
1. replication
2. registration
3. high power
4. bayesian statistics
5. alpha level
6. estimations
7. co-pilot multi-software approach
8. distinction between confirmatory and exploratory analyses
9. open science
what? how? why? why not?
features of science 2.0
before data collection
after data collection/during data analysis
after data analysis
replication
why not
- it is impossible!
--- things can never be exactly the same (e.g. population)
--- the details of the original study are lost (e.g., which questions were used in a post experimental interview)
- it is a waste of time and resources!
--- should we value novelty more than truth?
- it is not good for my career
--- can I publish this?
registration
why not?
• it takes time, thought and effort
• it is harder than it seems!
- writing the code helps a lot
• exploration might be the only possibility
• domain specific (qualitative studies? complex studies?)
high power
why not?
• can be hard to guess expected effect size or trust published effect size
• often requires large sample size
- collaborate!
• restricted to NHST framework
Bayes it
why not?
• priors
• education?
• Bayes factors used to be hard to compute
Open up
why not?
sharing data takes time
sharing data might jeopardize a potential future publication
but: embargo period
Other (co-pilot, alpha, confirmation vs exploration, estimation)
why not?
lack of education
old habits
takes time and is not rewarded
3.2 feasibility
this illustration used a very simple study
• replication study
• easily administered 8-item questionnaire
• basic t test
this made pre-registration, sample size planning, high power, estimation, bayesian statistics, sharing protocol, code and data, co-pilot multi software, etc probably much easier than in most other studies
but everything is also possible (though harder) for non-replication studies!
feasibility will depend on the type and scope of your research
science 2.0 is no package deal
--- you can register, but not share
--- you can share, but not use bayes
some practices are graded
--- you can register without code
--- you can estimate without reporting CI
3.3 what should i take home?
• the (psychological) literature is littered with spurious findings
• which results can you trust?
– has this result been replicated?
– did the researchers exploit their researcher degrees of freedom?
– is the evidence based on NHST with a liberal alpha level?
– was the analysis correct? (e.g., at least check the dfs; better, do the analysis yourself with the shared data and code)
– ???
3.4 is there a crowd within effect?
successful replication
• error guess 1 > error average
• error guess 2 > error average
the end (or the beginning!)