Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data
Heather Piwowar, Department of Biomedical Informatics, University of Pittsburgh

Thesis Proposal Piwowar Presentation 20091109

Presented at ASIS&T 2009 in the student awards section. The presentation contains an overview of my dissertation proposal, as 2009 winner of the Thomson Reuters Information Science Doctoral Dissertation Proposal Scholarship, administered by the ASIS&T Information Science Education Committee

Page 1: Thesis Proposal Piwowar Presentation 20091109

Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data

Heather Piwowar

Department of Biomedical Informatics, University of Pittsburgh

Page 4: Thesis Proposal Piwowar Presentation 20091109

Sharing research data

PAST MEDICAL HISTORY:

Past medical history showed she had

superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for

four years.

She had been hypothyroid for three years.

HISTORY OF PRESENT ILLNESS:

The patient is a 58-year-old female, …

http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441

Page 8: Thesis Proposal Piwowar Presentation 20091109

http://www.flickr.com/photos/75166820@N00/5318468/

Page 9: Thesis Proposal Piwowar Presentation 20091109

Shared data benefits science

• Verify
• Understand
• Extend
• Explore
• Combine
• Synergize
• Train
• Reduce

Page 10: Thesis Proposal Piwowar Presentation 20091109

But... costly for authors

• Find
• Organize
• Document
• Deidentify
• Format
• Decide
• Ask
• Submit

• Answer questions
• Worry about mistakes being found
• Worry about data being misinterpreted
• Worry about being scooped
• Forgo money and IP and prestige???

Page 11: Thesis Proposal Piwowar Presentation 20091109

As a result, policy makers have spent lots of time and money ....

http://www.flickr.com/photos/tonivc/2283676770/

http://www.flickr.com/photos/johnnyvulkan/381941233/

Page 12: Thesis Proposal Piwowar Presentation 20091109

... on initiatives, requests, requirements, and tools

Funder data sharing requirements

Journal requirements and requests

Databases

Data sharing collaboration grids

Standards

Editorials, letters to the editor, discussion....

Page 14: Thesis Proposal Piwowar Presentation 20091109

lots of data sharing!

http://www.genome.jp/en/db_growth.html

Page 15: Thesis Proposal Piwowar Presentation 20091109

but how much isn’t shared?

what isn’t shared?

who isn’t sharing it?

why not?

what can we do about it?

how much does it matter?

Page 16: Thesis Proposal Piwowar Presentation 20091109

you can not manage what you do not measure

http://www.flickr.com/photos/archeon/2941655917/

Page 18: Thesis Proposal Piwowar Presentation 20091109

Related research

Data usually collected via surveys and/or manual audits

http://www.flickr.com/photos/jima/606588905/

Page 19: Thesis Proposal Piwowar Presentation 20091109

Models of data and knowledge sharing

Page 20: Thesis Proposal Piwowar Presentation 20091109

Andriessen. Conditions for the willingness to share knowledge, 2006.

Page 21: Thesis Proposal Piwowar Presentation 20091109

Harder. SMG WP 6/2008.

Page 22: Thesis Proposal Piwowar Presentation 20091109
Page 23: Thesis Proposal Piwowar Presentation 20091109

Cabrera and Cabrera. Int J of HR Mgmt. 2005.

Page 24: Thesis Proposal Piwowar Presentation 20091109
Page 25: Thesis Proposal Piwowar Presentation 20091109

Kuo. JASIST. 2008.

Page 26: Thesis Proposal Piwowar Presentation 20091109

Limitations of the related research

• manual audits: small sample sizes

• surveys: few variables + self-reporting bias

• not much focus on measuring demonstrated behavior

• not much focus on rewards

• not much focus on policy

• not much focus on biomedical data other than DNA sequences

Page 27: Thesis Proposal Piwowar Presentation 20091109

Needed:

a study of data sharing behaviour and impact

that includes

• a measurement of demonstrated behavior
• policy variables
• estimate of rewards
• a broad and deep selection of data creation instances

Page 28: Thesis Proposal Piwowar Presentation 20091109

Aim 1: Does sharing have benefit for those who share?

Aim 2: Can sharing and withholding be systematically measured?

Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?

Page 29: Thesis Proposal Piwowar Presentation 20091109

Scope of proposed study

studies: Published studies with English full text available in a centralized portal

variables for examination: extracted from Medline and other sources

Page 30: Thesis Proposal Piwowar Presentation 20091109

http://en.wikipedia.org/wiki/DNA_microarray

http://en.wikipedia.org/wiki/Image:Heatmap.png

http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG

Microarray data

Page 32: Thesis Proposal Piwowar Presentation 20091109

Aim 1

Page 33: Thesis Proposal Piwowar Presentation 20091109

Aim 1:  Does sharing have benefit for those who share?

http://www.flickr.com/photos/sunrise/35819369/

Page 34: Thesis Proposal Piwowar Presentation 20091109

Aim 1:  Does sharing have benefit for those who share?

Benefit of value: Citations.

Page 35: Thesis Proposal Piwowar Presentation 20091109

Aim 1:  Does sharing have benefit for those who share?

dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)

citations: ISI Web of Science Citation index, citations from 2004-2005

data sharing locations: Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine

statistics: Multivariate linear regression

Page 36: Thesis Proposal Piwowar Presentation 20091109

Aim 1:  Does sharing have benefit for those who share?

Page 37: Thesis Proposal Piwowar Presentation 20091109

Aim 1:  Does sharing have benefit for those who share?

Note the logarithmic scale

Page 38: Thesis Proposal Piwowar Presentation 20091109

Aim 1:  Does sharing have benefit for those who share?

In multivariate regression, we found studies that had made their data publicly available received 69% more citations than similar studies that did not share their data (95% confidence interval: 18% to 143%)

Piwowar, Day and Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308
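
One way to read the 69% figure, assuming the regression outcome was the logarithm of citation count (the "Note the logarithmic scale" slide hints at this, but the slides do not state it outright): a coefficient b on the sharing indicator then corresponds to a multiplicative change of exp(b), that is, a percent change of 100 * (exp(b) - 1). The sketch below back-computes the coefficient implied by the reported percentages. The function names are illustrative and not taken from the study's scripts.

```python
import math

def pct_to_log_coef(pct_increase):
    """Log-scale coefficient implied by a percent increase in citations."""
    return math.log(1.0 + pct_increase / 100.0)

def log_coef_to_pct(coef):
    """Percent increase in citations implied by a log-scale coefficient."""
    return 100.0 * (math.exp(coef) - 1.0)

# Percentages reported on the slide: point estimate and 95% interval bounds.
point = pct_to_log_coef(69)
lower = pct_to_log_coef(18)
upper = pct_to_log_coef(143)

# If the interval was built on the log scale (an assumption), its midpoint
# there should land close to the point estimate.
midpoint = (lower + upper) / 2.0

print("implied coefficient:", round(point, 3))
print("implied interval:", round(lower, 3), "to", round(upper, 3))
print("midpoint as percent:", round(log_coef_to_pct(midpoint), 1))
```

The interval looks lopsided in percent terms (18% to 143% around 69%) but is close to symmetric once converted to the log scale, which is consistent with the assumption above without proving it.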

Page 39: Thesis Proposal Piwowar Presentation 20091109

Aim 1 conclusion:  data sharing has a benefit for sharers

Page 42: Thesis Proposal Piwowar Presentation 20091109

Next:  What factors predict sharing?

http://www.flickr.com/photos/ryanr/142455033/

Can I use the same methods as in Aim 1 to choose studies and determine data sharing status?

No, those methods don’t scale to identify or classify enough data points.

Page 43: Thesis Proposal Piwowar Presentation 20091109

Aim 2

Page 44: Thesis Proposal Piwowar Presentation 20091109

Need automated methods to:

Identify studies that generate datasets that could potentially be shared (Aim 2a)

Determine which of these have in fact been shared (Aim 2b)

Page 45: Thesis Proposal Piwowar Presentation 20091109

Aim 2a: Identify studies that create gene expression microarray data

http://www.flickr.com/photos/lofaesofa/248546821/

Page 46: Thesis Proposal Piwowar Presentation 20091109

Aim 2a: Identify studies that create gene expression microarray data

Easy, via MeSH indexing terms?

gene expression profiling and/or microarray analysis

Unfortunately, this has neither high recall nor high precision.

Page 47: Thesis Proposal Piwowar Presentation 20091109

Aim 2a: Identify studies that create gene expression microarray data

Instead, look for wetlab methods in full text:

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745

Page 48: Thesis Proposal Piwowar Presentation 20091109

Aim 2a: Identify studies that create gene expression microarray data

And query the full text through full-text query portals:

Page 49: Thesis Proposal Piwowar Presentation 20091109

Aim 2a: Identify studies that create gene expression microarray data

query development: Use supervised natural language processing techniques on a corpus of Open Access articles

query evaluation: 400 studies that created gene expression microarray data, as identified by Ochsner et al (2008)

goal: >90% precision, and sufficient recall to retrieve >1250 articles
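
The precision and recall targets above can be checked with a few lines once a query's hits and a reference standard are both expressed as sets of article identifiers. This is a minimal sketch: the identifiers are invented for illustration, and it assumes the query hits have already been restricted to articles that were actually assessed for the reference standard (otherwise unassessed hits would wrongly count against precision).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical identifiers, not real PubMed IDs from the study.
reference_standard = {"234", "345", "456", "567", "678"}  # known to create data
query_hits = {"234", "345", "456", "789"}                 # returned by the query

precision, recall = precision_recall(query_hits, reference_standard)
print("precision:", precision)
print("recall:", recall)
```

A query that favours precision over recall, as the goal here does, returns fewer false positives at the cost of missing some true data-creating studies.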

Page 50: Thesis Proposal Piwowar Presentation 20091109

Aim 2b

Page 51: Thesis Proposal Piwowar Presentation 20091109

Aim 2b: Identify studies that share their expression microarray data

http://www.flickr.com/photos/dcassaa/422261773/

Page 52: Thesis Proposal Piwowar Presentation 20091109

Aim 2b: Identify studies that share their expression microarray data

Page 53: Thesis Proposal Piwowar Presentation 20091109

Aim 2b: Identify studies that share their expression microarray data

Page 54: Thesis Proposal Piwowar Presentation 20091109

Aim 2b: Identify studies that share their expression microarray data

pmc_gds[filter]

+ text processing on ArrayExpress website

Enough? Unbiased?

Page 55: Thesis Proposal Piwowar Presentation 20091109

Aim 2b: Identify studies that share their expression microarray data

reference standard: 200 of the 400 studies that created gene expression microarray data have shared their microarray data, as identified by Ochsner et al (2008)

goal: Establish that the filter has >70% recall with an unbiased representation of MeSH terms, dataset size, and dataset species

Page 56: Thesis Proposal Piwowar Presentation 20091109

Aim 3

Page 57: Thesis Proposal Piwowar Presentation 20091109

Aim 3 – How often is data shared? What predicts sharing? How can we model sharing behavior?

http://www.flickr.com/photos/ryanr/142455033/

Page 58: Thesis Proposal Piwowar Presentation 20091109

Aim 3a:  Prevalence of data sharing

Page 59: Thesis Proposal Piwowar Presentation 20091109

Aim 3a:  Prevalence of data sharing

PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes
789         HighPr   No
890         PMC      No
901         -        ?

Page 61: Thesis Proposal Piwowar Presentation 20091109

Aim 3a:  Prevalence of data sharing

PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes

Page 63: Thesis Proposal Piwowar Presentation 20091109

Aim 3a:  Prevalence of data sharing

PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO

Prevalence = (Number with Shared data) / (Number with Created data)
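
As a small worked instance of the prevalence ratio, the sketch below applies it to the five illustrative rows shown on this slide. The record layout and the function name are assumptions made for the example, not the study's actual data format.

```python
# The five example rows from the slide; these are placeholders, not real data.
rows = [
    {"pmid": "234", "portal": "PMC", "created": True, "shared": True},
    {"pmid": "345", "portal": "HighPr", "created": True, "shared": True},
    {"pmid": "456", "portal": "Scirus", "created": True, "shared": True},
    {"pmid": "567", "portal": "PMC", "created": True, "shared": False},
    {"pmid": "678", "portal": "PMC", "created": True, "shared": False},
]

def prevalence(records):
    """Share of data-creating studies whose data was found to be shared."""
    created = [r for r in records if r["created"]]
    if not created:
        raise ValueError("no data-creating studies to compute prevalence over")
    shared = sum(1 for r in created if r["shared"])
    return shared / len(created)

print("prevalence:", prevalence(rows))
```

Studies that did not create data are excluded from the denominator, so the ratio describes sharing among studies that had something to share.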

Page 64: Thesis Proposal Piwowar Presentation 20091109

Aim 3b:  Correlates with data sharing

Page 65: Thesis Proposal Piwowar Presentation 20091109

Aim 3b:  Correlates with data sharing

PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO

Covariates

Page 66: Thesis Proposal Piwowar Presentation 20091109

Aim 3b:  Correlates with data sharing

Features to include:

• Does the journal have a data sharing policy?
• Is the study funded by the NIH?
• Is it subject to the NIH data sharing plan requirement?
• Number of authors
• Journal impact factor
• Are the experimental samples from humans?
• Disease of study
• Year of publication
• …

Page 67: Thesis Proposal Piwowar Presentation 20091109

Aim 3b:  Correlates with data sharing

PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2

Covariates: Journal policy, NIH funds?, # authors, ...

Page 68: Thesis Proposal Piwowar Presentation 20091109

Aim 3b:  Correlates with data sharing

Univariate odds ratios

Multivariate logistic regression
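
For the univariate step, an odds ratio for a single yes/no covariate can be computed from a 2x2 table of counts. The sketch below uses the standard log-scale (Woolf) approximation for the 95% interval. The counts are invented for illustration, with "journal has a data sharing policy" as a hypothetical covariate. The multivariate logistic regression would need a statistics library and is not shown.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate confidence interval from a 2x2 table.

    a: covariate present, data shared      b: covariate present, not shared
    c: covariate absent, data shared       d: covariate absent, not shared
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this approximation")
    estimate = (a * d) / (b * c)
    # Standard error of log(odds ratio) under the Woolf approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(estimate) - z * se)
    high = math.exp(math.log(estimate) + z * se)
    return estimate, low, high

# Hypothetical counts: 50 studies in journals with a policy, 50 without.
estimate, low, high = odds_ratio(30, 20, 15, 35)
print("odds ratio:", estimate)
print("approximate 95% interval:", round(low, 2), "to", round(high, 2))
```

An odds ratio above 1 with an interval that excludes 1 would suggest the covariate is associated with sharing, though not that it causes it.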

Page 69: Thesis Proposal Piwowar Presentation 20091109

Aim 3b:  Correlates with data sharing

PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2

Covariates: Journal policy, NIH funds?, # authors, ...

Outcome: Shared data?

Predictors: Journal policy? NIH funded? # authors ...

Page 70: Thesis Proposal Piwowar Presentation 20091109

Aim 3c: Model of data sharing

Page 72: Thesis Proposal Piwowar Presentation 20091109

Aim 3c: Model of data sharing

Exploratory factor analysis

Page 74: Thesis Proposal Piwowar Presentation 20091109

Aim 3c: Model of data sharing

PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2

Covariates: Journal policy, NIH funds?, # authors, ...

Shared data?

Mandates    Amount of Collaboration    ...

Strong    Weak

Page 76: Thesis Proposal Piwowar Presentation 20091109

Limitations

• Association does not imply causation

• Important influences will be missed due to focus on measurable variables

• Some derived variables involve many estimates and assumptions

• Only considering public sharing in primary centralized databases

• Only one datatype

• Only research studies made available in full-text portals

Page 77: Thesis Proposal Piwowar Presentation 20091109

Risks and contingency plans

NLP performance may be inadequate
    supplement with manual annotation via Mechanical Turk

Author ambiguity may introduce extreme outliers
    use Author-ity (Smalheiser and Torvik, 2005) for name disambiguation

Unable to derive a robust exploratory factor model
    try other clustering techniques

Several variables may be unexpectedly difficult to extract and cross-reference
    if not essential, defer analysis of that variable

Page 78: Thesis Proposal Piwowar Presentation 20091109

Current status

Aim 1: Does sharing have benefit for those who share?

Aim 2: Can sharing and withholding be systematically measured?

Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?

pilot completed.

Now: full dataset collection

Page 79: Thesis Proposal Piwowar Presentation 20091109

Anticipated contributions

• Published assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing

• Publicly available dataset associating microarray study publications with data sharing status

• Generalizable approach for developing practical, real-world information retrieval using centralized full-text query portals

• Preliminary model of data sharing behaviour based on this large dataset

Page 80: Thesis Proposal Piwowar Presentation 20091109

Future work

• Identify and model data reuse

• Citation analysis of the large cohort

• Supplement with survey responses

http://www.flickr.com/photos/cogdog/123072/

Page 81: Thesis Proposal Piwowar Presentation 20091109

Data sharing plan

I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar

Share yours too!

http://www.flickr.com/photos/myklroventine/892446624/

Page 82: Thesis Proposal Piwowar Presentation 20091109

Thanks to:

➡ the NLM for funding training grant 5 T15 LM007059-22
➡ the Dept of Biomedical Informatics at the U of Pittsburgh
➡ my committee

Dr Wendy Chapman, Biomed Informatics
Dr Ellen Detlefsen, iSchool
Dr Madhavi Ganapathiraju, Bioinformatics
Dr Brian Butler, Katz School of Business
Dr Gunther Eysenbach, U of Toronto, Health Policy Mgmt and Evaluation

Page 83: Thesis Proposal Piwowar Presentation 20091109
Page 84: Thesis Proposal Piwowar Presentation 20091109

aim

Funder Journal Investigator Institution Study

Is research data shared after publication?

Page 85: Thesis Proposal Piwowar Presentation 20091109

Prevalence of data withholding via surveys

[Bar chart; horizontal axis from 0% to 40%. Bars:]

• self-reported denying a request in last 3 years
• trainees self-reported denying a request
• been denied access to data, materials, code
• authors “not able to retrieve raw data”
• not willing to release data

Sources: Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005. Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics. 2001.

Page 86: Thesis Proposal Piwowar Presentation 20091109

Self-reported reasons for data withholding

[Bar chart; horizontal axis from 0% to 80%. Bars:]

• sharing is too much effort
• want student or jr faculty to publish more
• they themselves want to publish more
• cost
• industrial sponsor
• confidentiality
• commercial value of results

Source: Campbell et al. JAMA. 2002.

Page 87: Thesis Proposal Piwowar Presentation 20091109

Correlates with self-reported data withholding

[Chart; horizontal axis from 0 to 3. Items:]

• industry involvement
• perceived competitiveness of field
• male
• sharing discouraged in training
• human participants
• academic productivity

Source: Blumenthal et al. Acad Med. 2006.