NESCent visit: Measuring progress toward a cultural norm of shared (and reused!) biomedical...


Preliminary work and future directions in measuring biomedical research data sharing


Measuring progress toward a cultural norm of shared (and reused!) biomedical research data

Heather Piwowar

Department of Biomedical Informatics, University of Pittsburgh

Sharing research data

PAST MEDICAL HISTORY:

Past medical history showed she had

superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for

four years.

She had been hypothyroid for three years.

HISTORY OF PRESENT ILLNESS:

The patient is a 58-year-old female, …

http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441

Shared data benefits science

Verify, Understand, Extend, Explore, Combine, Synergize, Train, Reduce

But... costly for authors

Find, Organize, Document, Deidentify, Format, Decide, Ask, Submit

Answer questions

Worry about mistakes being found

Worry about data being misinterpreted

Worry about being scooped

Forgo money and IP and prestige???

As a result, policy makers have spent lots of time and money ....

http://www.flickr.com/photos/tonivc/2283676770/

http://www.flickr.com/photos/johnnyvulkan/381941233/

... on initiatives, requests, requirements, and tools

NIH data sharing plan requirement

Journal requirements

Public databases

Data sharing grids like BIRN and caBIG

Data formatting standards

Editorials, letters to the editor, discussion....

lots of data sharing!

http://www.genome.jp/en/db_growth.html

but how much isn’t shared?

what isn’t shared?

who isn’t sharing it?

why not?

what can we do about it?

how much does it matter?

you can not manage what you do not measure

http://www.flickr.com/photos/archeon/2941655917/

1. Is there benefit for those who share?

2. Do journal policies increase rates of sharing?

3. What other factors are correlated with sharing and withholding data?

research questions

microarray data

http://en.wikipedia.org/wiki/DNA_microarray

http://en.wikipedia.org/wiki/Image:Heatmap.png

http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG

microarray data

http://www.flickr.com/photos/sunrise/35819369/

1. Is there benefit for those who share?

currency of value?

Citations.

$50!

Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215

Prior work focused on the citation advantage of an open access publishing model.

Our question: are articles that share their raw research data cited more than articles that don’t?

dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)

citations: ISI Web of Science Citation Index, citations from 2004-2005

data sharing locations: publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine

statistics: multivariate linear regression

Note: log scale

In multivariate regression, we found studies that had made their data publicly available received 69% more citations than similar studies that did not share their data (95% confidence interval: 18% to 143%)
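One way to read a log-scale regression result as a percentage, as a minimal sketch. It assumes the outcome was the natural log of the citation count (the slides say only "log scale"), so that a coefficient b on a data-sharing indicator corresponds to a difference of (e^b - 1) x 100 percent. The function name is illustrative and not taken from the study's scripts.

```python
import math


def percent_difference_from_log_coefficient(coefficient: float) -> float:
    """Convert a coefficient from a log-outcome regression to a percent difference.

    Assumption: the outcome is ln(citations), so exponentiating the
    coefficient gives the multiplicative effect on citation count.
    """
    return (math.exp(coefficient) - 1.0) * 100.0


# A ratio of 1.69 between sharing and non-sharing studies corresponds to
# a log-scale coefficient of ln(1.69).
coefficient = math.log(1.69)
print(round(percent_difference_from_log_coefficient(coefficient)))
```

The same conversion applies to the confidence limits: each limit is exponentiated separately, which is why an interval that is symmetric on the log scale looks asymmetric as percentages.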

Piwowar, Day and Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308

• collect a larger dataset for citation analysis (stay tuned)

• investigate other datatypes

• examine citation context

future work

http://www.flickr.com/photos/ryanr/142455033/

2. Do journal data sharing policies increase sharing?

“An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …”

http://www.nature.com/authors/editorial_policies/availability.html

http://www.nature.com/nature/journal/v453/n7197/index.html

Prior work examined data sharing policies in biomedicine, but these reviews are now dated, consider a variety of resources, and don’t correlate policy to behaviour.

McCain. Science Communication, Vol. 16, No. 4. (1 June 1995), pp. 403-431

NAS. Sharing Publication-Related Data and Materials. (2003), p. 33

Our aim: look at data sharing policies within the Instructions to Authors statements of 70 journals, as they apply to gene expression microarray data.

Very diverse policies in terms of:

• statements of policy motivation

• datatype-specific policies

• requested vs. required

• data location

• data format

• data completeness

• timeliness of sharing

• consequences for not sharing

• exceptions

content of data sharing policies

No applicable policy (43%)

Weak policy (24%)

should, recommend, request

must, but without database accession number

Strong policy (33%)

must, required, condition of publication

requires database accession number

strength of data sharing policies
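The three strength categories can be expressed as a simple keyword rule. This is a hypothetical sketch only: the slides do not say the policies were coded automatically, and the function name, patterns, and matching logic here are assumptions made for illustration. It expects a statement already known to be about data sharing.

```python
import re

# Wording cues taken from the category descriptions on the slide.
STRONG_PATTERNS = (r"\bmust\b", r"\brequired\b", r"\bcondition of publication\b")
WEAK_PATTERNS = (r"\bshould\b", r"\brecommend", r"\brequest")


def classify_policy(statement: str) -> str:
    """Return 'strong', 'weak', or 'none' for a data sharing policy statement."""
    text = statement.lower()
    has_strong_wording = any(re.search(p, text) for p in STRONG_PATTERNS)
    has_weak_wording = any(re.search(p, text) for p in WEAK_PATTERNS)
    asks_for_accession = "accession number" in text

    # Strong: mandatory wording AND a database accession number is required.
    if has_strong_wording and asks_for_accession:
        return "strong"
    # Weak: advisory wording, or mandatory wording without an accession number.
    if has_strong_wording or has_weak_wording:
        return "weak"
    return "none"


print(classify_policy("Authors must deposit data and supply a database accession number."))
print(classify_policy("Authors should make their microarray data available."))
```

A rule like this would only be a first pass; borderline statements would still need to be read by a person.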

[Figure labels: Journal has a data sharing policy?; Impact Factor; Open Access?; Society Publisher?; Biochemistry & Molecular Biology; Oncology]

strength of data sharing policies: multivariate associations

High-impact journals tend to have a strong data-sharing policy

strength of data sharing policies: associated with impact factor

For each of the 70 journals, we measured the percent of articles that were cited from within GEO and ArrayExpress.

We considered this a proxy for percent of articles with shared data.
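The per-journal proxy can be sketched as a simple tally. The input format (pairs of journal name and a flag for whether the article is cited from within a database) and the function name are assumptions for illustration; the journal names in the usage lines are placeholders, not data from the study.

```python
from collections import defaultdict


def percent_shared_by_journal(articles):
    """Compute, per journal, the percent of articles cited from within a database.

    `articles` is an iterable of (journal_name, is_cited_from_database) pairs.
    """
    totals = defaultdict(int)
    linked = defaultdict(int)
    for journal, is_cited_from_database in articles:
        totals[journal] += 1
        if is_cited_from_database:
            linked[journal] += 1
    return {journal: 100.0 * linked[journal] / totals[journal] for journal in totals}


# Placeholder records for illustration only.
records = [("Journal A", True), ("Journal A", False), ("Journal B", True)]
print(percent_shared_by_journal(records))
```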

data sharing policies: associated with amount of sharing

[Figure labels: % of articles with shared data; Impact Factor; Open Access?; Society Publisher?; Genetics & Heredity; Multidisciplinary Sciences; Having a data-sharing policy?]

data sharing policies: associated with amount of sharing

• our corpus of “gene expression microarray” articles may have included some that reused data and did not themselves produce primary data

• these results should be considered preliminary, pending a more precise filter (stay tuned)

http://www.flickr.com/photos/vlastula/300102949/

• use a more precise filter to isolate data producing articles and thereby understand the absolute levels of data sharing

• investigate other datatypes

• look at associations with reviewer instructions and opinions

future work on journal policies

• are they effective? (stay tuned)

• what do people propose in data sharing plans? Do they do what they propose? Why not?

• quantify the perceived worth of data sharing plans and accomplishments in funding and promotion decisions

future work on funder policies

http://www.flickr.com/photos/cogdog/123072/

3. What other factors are correlated with sharing and withholding data?

Prior work has focused on surveys and studies of intention.

Our aim: measure associations between observed data sharing behaviour and environmental variables

Campbell et al. JAMA. 2002.

Kyzas et al. J Natl Cancer Inst. 2005.

Vogeli et al. Acad Med. 2006.

Reidpath et al. Bioethics 2001.

Blumenthal et al. Acad Med. 2006.

Ochsner et al. manually reviewed 20 journals for 2007:

400 studies

200 shared their microarray data

Ochsner et al. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods, 5(12), 991.

pilot dataset

Is research data shared after publication?

Funder mandates

Journal impact factor

Investigator “experience”

Journal mandates

pilot variables

funder mandates

NIH 2003 Data Sharing Requirement

Requires a data sharing plan

for studies funded after October 2003

that receive more than $500 000 in direct funding per year

Assumed data sharing requirement was applicable if:

the NIH grant numbers associated with the PubMed entry had

$750 000 in total funding in any year since 2004

plus

an NIH grant number with a leading “1” or “2” since 2004
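The applicability rule above could be written as a small predicate. This is a sketch under stated assumptions: the record layout and names are hypothetical, "$750 000 in total funding" is read as "at least $750,000 summed across the article's grants in a single fiscal year", and the example grant numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass
class GrantYear:
    grant_number: str      # placeholder format, e.g. "1R01XX000000-01"
    fiscal_year: int
    total_funding: float   # total funding for this grant in this fiscal year


def sharing_plan_assumed(grant_years, threshold=750_000.0, first_year=2004):
    """Return True if the assumed NIH data sharing plan requirement applies."""
    recent = [g for g in grant_years if g.fiscal_year >= first_year]

    # Assumption: funding is summed across grants within each fiscal year.
    funding_by_year = {}
    for g in recent:
        funding_by_year[g.fiscal_year] = funding_by_year.get(g.fiscal_year, 0.0) + g.total_funding
    has_large_year = any(total >= threshold for total in funding_by_year.values())

    # A leading "1" or "2" in the grant number, as stated on the slide.
    has_leading_1_or_2 = any(g.grant_number[:1] in ("1", "2") for g in recent)

    return has_large_year and has_leading_1_or_2


print(sharing_plan_assumed([GrantYear("1R01XX000000-01", 2005, 800_000.0)]))
```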

funder mandates

Publication history and impact proxy

First and last authors:

• years since first paper

• h-index (the largest number N such that an author has N papers cited at least N times)

• a-index

author experience
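The h-index definition above translates directly into code. The a-index is not defined on the slide; the sketch below assumes the common definition (the mean citation count of the papers that make up the h-core). Function names are illustrative.

```python
def h_index(citation_counts):
    """Largest N such that N papers have at least N citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


def a_index(citation_counts):
    """Mean citations of the h most-cited papers (assumed definition)."""
    h = h_index(citation_counts)
    if h == 0:
        return 0.0
    ranked = sorted(citation_counts, reverse=True)
    return sum(ranked[:h]) / h


papers = [10, 8, 5, 4, 3]
print(h_index(papers), a_index(papers))
```

For the five example papers, four have at least four citations each, so the h-index is 4, and the a-index averages the citations of those four papers.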

Author publication history:

Citation counts:

Author-ity web service

Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11.

Author name disambiguation:

Derived h-index (pubmedi citation indices):

author experience

Is research data shared after publication?

Funder mandates

Journal impact factor

Investigator “experience”

Journal mandates

pilot variables

Univariate odds ratios

Multivariate logistic regression

stats
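A univariate odds ratio for one binary variable against the shared / not shared outcome can be computed from a 2x2 table. This is a generic sketch with a Wald 95% confidence interval, not the study's own script; the counts in the usage line are made up for illustration.

```python
import math


def odds_ratio(shared_with, not_shared_with, shared_without, not_shared_without):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    *_with / *_without: studies with / without the characteristic
    (for example, published in a journal with a mandate).
    All four counts must be greater than zero.
    """
    estimate = (shared_with * not_shared_without) / (not_shared_with * shared_without)
    standard_error = math.sqrt(
        1 / shared_with + 1 / not_shared_with + 1 / shared_without + 1 / not_shared_without
    )
    lower = math.exp(math.log(estimate) - 1.96 * standard_error)
    upper = math.exp(math.log(estimate) + 1.96 * standard_error)
    return estimate, lower, upper


# Illustrative counts only, not results from the pilot.
print(odds_ratio(30, 10, 20, 40))
```

An interval that excludes 1 corresponds to a statistically significant association at the 5% level.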

Is research data shared after publication?

Funder mandates

Journal impact factor

Investigator “experience”

Journal mandates

Statistically significant

Not statistically significant

results of pilot

33%

results of pilot


More samples, more variables

http://www.flickr.com/photos/krcla/2069243613/

PhD dissertation

Developed and evaluated automated methods to:

• Identify studies that generate datasets that could potentially be shared

• Determine which of these have in fact been shared

More samples:

To identify studies that generate datasets,

use a query on the full text of published articles:

("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
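To make the logic of this query concrete, here is a rough local approximation in Python. It is a sketch only: real full-text search engines apply their own tokenization and stemming, so results would differ, and the function names are assumptions.

```python
import re


def contains(text: str, phrase: str) -> bool:
    """Whole-word phrase match; a trailing * allows any word ending."""
    if phrase.endswith("*"):
        pattern = r"\b" + re.escape(phrase[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text) is not None


def looks_like_data_producing_study(full_text: str) -> bool:
    """Apply the three clauses of the query: all of, any of, none of."""
    text = full_text.lower()
    all_required = all(
        contains(text, p) for p in ("gene expression", "microarray", "cell", "rna")
    )
    any_lab_method = any(contains(text, p) for p in ("rneasy", "trizol", "real-time pcr"))
    any_excluded = any(contains(text, p) for p in ("tissue microarray*", "cpg island*"))
    return all_required and any_lab_method and not any_excluded


sample = ("We profiled gene expression on a microarray. "
          "Total RNA from each cell line was extracted with TRIzol.")
print(looks_like_data_producing_study(sample))
```

The lab-method terms (rneasy, trizol, real-time pcr) act as evidence that the authors prepared samples themselves, which is what separates data producers from articles that only reuse existing data.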

To determine which articles have shared data,

use a query on the full text of published articles:

pubmed_gds[filter] and query ArrayExpress
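For the GEO half of this step, one option is to combine a list of PubMed IDs with the `pubmed_gds[filter]` term in an NCBI E-utilities ESearch request. The sketch below only builds the request URL and does not call the service; the function name and the PubMed IDs are placeholders, and the ArrayExpress lookup is not covered.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_link_query_url(pubmed_ids):
    """Build an ESearch URL that keeps only the given articles linked to GEO datasets."""
    id_clause = " OR ".join(f"{pmid}[uid]" for pmid in pubmed_ids)
    term = f"({id_clause}) AND pubmed_gds[filter]"
    params = {"db": "pubmed", "term": term, "retmax": len(pubmed_ids)}
    return ESEARCH_URL + "?" + urlencode(params)


# Placeholder IDs for illustration only.
print(build_geo_link_query_url(["11111111", "22222222"]))
```

Articles returned by such a request would be treated as having shared data in GEO; articles absent from the result would still need the ArrayExpress check.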

More variables:

Use PubMed and a variety of other internet resources...

Funder: funded by NIH?, size of grant, sharing plan req’d?, funded by non-NIH?

Journal: impact factor, strength of policy, open access?, number of microarray studies published

Investigator: years since first paper, h-index, a-index, previously shared?, previously reused?, gender

Institution: sector, size, impact rank, country

Study: humans?, mice?, plants?, cancer?, clinical trial?, number of authors, year

Univariate odds ratios

Multivariate logistic regression

Exploratory factor analysis

stats
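A minimal sketch of logistic regression on a shared / not shared outcome, written in plain Python with gradient ascent so it has no dependencies. It is not the analysis code from the talk, the data are invented, and in practice a statistics package would be used. With a single binary predictor, the exponentiated coefficient equals the univariate odds ratio.

```python
import math


def fit_logistic(rows, outcomes, learning_rate=0.5, iterations=5000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood.

    rows: list of feature lists; outcomes: list of 0/1 values.
    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for features, outcome in zip(rows, outcomes):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], features))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = outcome - predicted
            gradient[0] += error
            for j, value in enumerate(features):
                gradient[j + 1] += error * value
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Invented data: predictor = journal has a mandate (1) or not (0); outcome = data shared.
rows = [[0], [0], [0], [0], [1], [1], [1], [1]]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
intercept, coefficient = fit_logistic(rows, outcomes)
print(math.exp(coefficient))  # odds ratio for the predictor
```

Adding more columns to each row gives the multivariate case, where each exponentiated coefficient is an odds ratio adjusted for the other variables.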

http://www.flickr.com/photos/skrb/2427171774/

results?

1. Is there benefit for those who share?

2. Do journal policies increase rates of sharing?

3. What other factors are correlated with sharing and withholding data?

research questions

what’s next?

• citation analysis of larger cohort

• journal policies with refined filter

• beyond microarray data

• deeper into journal and funder policies

• and, finally....

future work previously mentioned...

Reuse.

http://www.flickr.com/photos/boitabulle/3668162701/

who reuses data?

when?

why aren’t they?

which datasets are most likely to be reused?

what can we do about it?

how many datasets could be reused but aren’t?

why?

who doesn’t?

One possible reuse research agenda

1. Inventory reuse acknowledgement patterns

2. Build full-text and metadata filters to identify instances of data reuse

3. Analyze patterns in data reuse choices

4. Survey data producers and data consumers to augment with intentions and perspectives
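As one possible starting point for step 2, a full-text filter could look for database accession numbers mentioned in an article. The sketch below assumes GEO accessions of the form GSE or GDS followed by digits and ArrayExpress accessions of the form E-XXXX-digits; the function name is illustrative. An accession mention alone does not distinguish reuse from the original deposit, so it would only flag candidates.

```python
import re

# GEO series/dataset accessions (GSE123, GDS123) and ArrayExpress accessions (E-MEXP-123).
ACCESSION_PATTERN = re.compile(r"\b(?:GSE|GDS)\d+\b|\bE-[A-Z]{4}-\d+\b")


def find_accessions(full_text: str):
    """Return accession-like strings in order of appearance."""
    return ACCESSION_PATTERN.findall(full_text)


text = "We reanalyzed GSE2034 and E-MEXP-123 from public repositories."
print(find_accessions(text))
```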

Resources

• GEO list of reuse articles (currently 618)

• Previous work in citation context classification

• Amazon Mechanical Turk for annotation

• Experimental Philosophy for insight into cultural norms

• ...

Teufel et al. (2006) Automatic classification of citation function. EMNLP.

• readers

• reusers

• authors

• editors

• reviewers

• funders

• database designers, maintainers, curators

• patients, subjects, or populations

Stakeholders

For their perspectives, and also to design studies that have actionable results for these groups

I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar

Share yours too!

http://www.flickr.com/photos/myklroventine/892446624/

Data sharing plan

thank you

Dept of Biomedical Informatics at U of Pittsburgh

NLM for training grant funding

Open science online community and those who release their articles, datasets and photos openly

Dr Wendy Chapman for her support and feedback

“Does anyone want your data?

That’s hard to predict […] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.

Your data, too, may simply be awaiting an effective matchmaker.”

Got data? Nature Neuroscience 10, 931 (2007)

variables

Journal mandates

Blumenthal et al. Acad Med. 2006

[Chart labels: industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity]

Correlates with self-reported data withholding

Campbell et al. JAMA 2002.

[Chart labels: sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results]

Self-reported reasons for data withholding

[Chart labels: self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data]

Prevalence of data withholding via surveys

Campbell et al. JAMA. 2002.

Kyzas et al. J Natl Cancer Inst. 2005.

Vogeli et al. Acad Med. 2006.

Reidpath et al. Bioethics 2001.
