Presented at ASIS&T 2009 in the student awards section. The presentation contains an overview of my dissertation proposal, as 2009 winner of the Thomson Reuters Information Science Doctoral Dissertation Proposal Scholarship, administered by the ASIS&T Information Science Education Committee.
Foundational studies for measuring the impact, prevalence, and patterns
of publicly sharing biomedical research data
Heather Piwowar
Department of Biomedical Informatics
University of Pittsburgh
Sharing research data
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
PAST MEDICAL HISTORY:
Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …
http://www.flickr.com/photos/75166820@N00/5318468/
Shared data benefits science
• Verify
• Understand
• Extend
• Explore
• Combine
• Synergize
• Train
• Reduce
But... costly for authors
• Find
• Organize
• Document
• Deidentify
• Format
• Decide
• Ask
• Submit
• Answer questions
• Worry about mistakes being found
• Worry about data being misinterpreted
• Worry about being scooped
• Forgo money and IP and prestige???
As a result, policy makers have spent lots of time and money ....
http://www.flickr.com/photos/tonivc/2283676770/
http://www.flickr.com/photos/johnnyvulkan/381941233/
... on initiatives, requests, requirements, and tools
Funder data sharing requirements
Journal requirements and requests
Databases
Data sharing collaboration grids
Standards
Editorials, letters to the editor, discussion....
http://www.flickr.com/photos/mesh/14102209/
lots of data sharing!
http://www.genome.jp/en/db_growth.html
but how much isn’t shared?
what isn’t shared?
who isn’t sharing it?
why not?
what can we do about it?
how much does it matter?
you can not manage what you do not measure
http://www.flickr.com/photos/archeon/2941655917/
Related research
Data usually collected via surveys and/or manual audits
http://www.flickr.com/photos/jima/606588905/
Models of data and knowledge sharing
Andriessen. Conditions for the willingness to share knowledge, 2006.
Harder. SMG WP 6/2008.
Cabrera and Cabrera. Int J of HR Mgmt. 2005.
Kuo. JASIST. 2008.
Limitations of the related research
• manual audits: small sample sizes
• surveys: few variables + self-reporting bias
• not much focus on measuring demonstrated behavior
• not much focus on rewards
• not much focus on policy
• not much focus on biomedical data other than DNA sequences
Needed:
a study of data sharing behaviour and impact
that includes
• a measurement of demonstrated behavior
• policy variables
• estimate of rewards
• a broad and deep selection of data creation instances
Aim 1: Does sharing have benefit for those who share?
Aim 2: Can sharing and withholding be systematically measured?
Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
Scope of proposed study
studies: Published studies with English full text available in a centralized portal
variables for examination: extracted from Medline and other sources
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
Microarray data
http://farm3.static.flickr.com/2146/2389590651_9bbcc9d07e.jpg
Aim 1
Aim 1: Does sharing have benefit for those who share?
http://www.flickr.com/photos/sunrise/35819369/
Benefit of value: Citations.
dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)
citations: ISI Web of Science Citation index, citations from 2004-2005
data sharing locations: Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine
statistics: Multivariate linear regression
Note the logarithmic scale
In multivariate regression, we found that studies that had made their data publicly available received 69% more citations than similar studies that did not share their data (95% confidence interval: 18% to 143%).
Piwowar, Day and Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308
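A minimal sketch of how a figure like "69% more citations" can be read off a regression fitted on a logarithmic scale. It assumes, which the slide does not state, that the outcome was the natural log of the citation count and that data sharing entered as a 0/1 indicator. The function name is hypothetical; the ratios 1.69, 1.18, and 2.43 are simply the reported percentages rewritten as multipliers.

```python
import math

def percent_change_from_log_coef(beta: float) -> float:
    """Percent change in the outcome implied by a coefficient on a natural-log outcome."""
    return (math.exp(beta) - 1.0) * 100.0

# Coefficients implied by the reported result (69% more; 95% CI 18% to 143%).
# Assumption: these are back-calculated for illustration, not taken from the paper.
implied_coefs = {
    "point estimate": math.log(1.69),
    "lower bound": math.log(1.18),
    "upper bound": math.log(2.43),
}

for label, beta in implied_coefs.items():
    pct = percent_change_from_log_coef(beta)
    print(f"{label}: beta = {beta:.3f}, percent change = {pct:.0f}%")
```

The interval is asymmetric on the percent scale (18% to 143% around 69%) because it is symmetric on the log scale, which is what one would expect under this assumption.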
Aim 1 conclusion: data sharing has a benefit for sharers
Next: What factors predict sharing?
http://www.flickr.com/photos/ryanr/142455033/
Can I use the same methods of Aim 1 to choose studies and determine data sharing status?
No, those methods don't scale to identify or classify enough datapoints.
Aim 2
Need automated methods to:
Identify studies that generate datasets that could potentially be shared (Aim 2a)
Determine which of these have in fact been shared (Aim 2b)
Aim 2a: Identify studies that create gene expression microarray data
http://www.flickr.com/photos/lofaesofa/248546821/
Easy, via MeSH indexing terms?
gene expression profiling and/or microarray analysis
Unfortunately, this has neither high recall nor high precision.
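As a rough sketch of what the MeSH-based approach might look like in practice, the snippet below builds a PubMed search string from the two indexing terms and wraps it in an NCBI E-utilities `esearch` URL. The helper names and the `retmax` value are illustrative assumptions, not part of the proposal, and no network request is made.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_mesh_query(terms):
    """OR together MeSH headings using PubMed's [MeSH Terms] field tag."""
    return " OR ".join(f'"{term}"[MeSH Terms]' for term in terms)

def build_esearch_url(query, retmax=100):
    """Return an esearch URL for the query; this only builds the string."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)

query = build_mesh_query(["gene expression profiling", "microarray analysis"])
url = build_esearch_url(query)
print(query)
print(url)
```

A query like this depends entirely on how articles were indexed, which is consistent with the recall and precision problem noted above.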
Instead, look for wetlab methods in full text:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
And query the full text through full-text query portals:
query development: Use supervised natural language processing techniques on a corpus of Open Access articles
query evaluation: 400 studies that created gene expression microarray data, as identified by Ochsner et al (2008)
goal: >90% precision, and sufficient recall to retrieve >1250 articles
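The precision and recall targets above could be checked along these lines. This is a sketch only: it assumes an evaluation set in which the true status of every article is known, and all PubMed IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved IDs against known-relevant IDs."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs, not real study identifiers.
reference_standard = {"101", "102", "103", "104"}  # known to have created microarray data
query_results = {"101", "102", "103", "999"}       # returned by a candidate full-text query

precision, recall = precision_recall(query_results, reference_standard)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

If the reference standard lists only positive examples, retrieved articles outside it have unknown status, so precision would need a separately annotated sample.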
Aim 2b
Aim 2b: Identify studies that share their expression microarray data
http://www.flickr.com/photos/dcassaa/422261773/
pmc_gds[filter]
+ text processing on ArrayExpress website
Enough? Unbiased?
reference standard: 200 of the 400 studies that created gene expression microarray data have shared their microarray data, as identified by Ochsner et al (2008)
goal: Establish that the filter has >70% recall with an unbiased representation of MeSH terms, dataset size, and dataset species
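One possible way to probe the "unbiased representation" goal is to compute recall separately within each stratum (for example, by species) and compare. The sketch below assumes a list of studies known to have shared data, each tagged with a stratum; the IDs, strata, and function name are hypothetical.

```python
from collections import defaultdict

def recall_by_stratum(shared_studies, found_ids):
    """Recall of the filter within each stratum.

    shared_studies: iterable of (pmid, stratum) pairs known to have shared data.
    found_ids: set of pmids that the filter flagged as shared.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for pmid, stratum in shared_studies:
        totals[stratum] += 1
        if pmid in found_ids:
            hits[stratum] += 1
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}

# Hypothetical reference-standard entries and filter output.
shared_studies = [("1", "human"), ("2", "human"), ("3", "mouse"), ("4", "mouse")]
found_ids = {"1", "2", "3"}

per_stratum = recall_by_stratum(shared_studies, found_ids)
print(per_stratum)
```

Large differences between strata would suggest the filter over-represents some kinds of datasets, even if overall recall meets the target.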
Aim 3
Aim 3 – How often is data shared? What predicts sharing? How can we model sharing behavior?
http://www.flickr.com/photos/ryanr/142455033/
Aim 3a: Prevalence of data sharing
PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes
789         HighPr   No
890         PMC      No
901         -        ?
PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes
PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO
Prevalence = (Number with Shared data) / (Number with Created data)
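The prevalence ratio can be sketched directly from the illustrative rows shown on the slide. The PubMed IDs are the slide's placeholder values, not real identifiers, and the field and function names are assumptions made for this example.

```python
# Illustrative rows copied from the slide's example table.
rows = [
    {"pmid": "234", "portal": "PMC", "created": True, "shared": True},
    {"pmid": "345", "portal": "HighPr", "created": True, "shared": True},
    {"pmid": "456", "portal": "Scirus", "created": True, "shared": True},
    {"pmid": "567", "portal": "PMC", "created": True, "shared": False},
    {"pmid": "678", "portal": "PMC", "created": True, "shared": False},
]

def prevalence(rows):
    """Fraction of data-creating studies that also shared their data."""
    created = [row for row in rows if row["created"]]
    if not created:
        raise ValueError("no studies with created data")
    return sum(1 for row in created if row["shared"]) / len(created)

print(prevalence(rows))
```

With these five example rows, three of the five data-creating studies shared, so the ratio is 3/5.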
Aim 3b: Correlates with data sharing
Features to include:
• Does the journal have a data sharing policy?
• Is the study funded by the NIH?
• Is it subject to the NIH data sharing plan requirement?
• Number of authors
• Journal impact factor
• Are the experimental samples from humans?
• Disease of study
• Year of publication
• …
PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2
Covariates: Journal policy, NIH funds?, # authors, ...
Univariate odds ratios
Multivariate logistic regression
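For the univariate step, an odds ratio for a single binary covariate (say, NIH funding) can be computed from a 2x2 table. The sketch below uses invented counts purely for illustration and a Woolf-type (log-scale) confidence interval, which is one common choice and not necessarily the one used in the study. It assumes no cell is zero.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: covariate present, data shared      b: covariate present, not shared
    c: covariate absent, data shared       d: covariate absent, not shared
    """
    return (a * d) / (b * c)

def woolf_confidence_interval(a, b, c, d, z=1.96):
    """Approximate 95% interval computed on the log-odds scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * standard_error), math.exp(log_or + z * standard_error)

# Hypothetical counts, not results from the study.
a, b, c, d = 30, 20, 15, 35
estimate = odds_ratio(a, b, c, d)
low, high = woolf_confidence_interval(a, b, c, d)
print(f"OR = {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The multivariate logistic regression would then estimate these associations jointly, adjusting each covariate for the others.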
[Diagram: Journal policy?, NIH funded?, # authors, ... as predictors of Shared data?]
Aim 3c: Model of data sharing
Exploratory factor analysis
[Diagram: factors such as Mandates, Amount of Collaboration, ... linked to Shared data?, with a Strong/Weak legend]
http://www.flickr.com/photos/donjuanna/322798429/
Limitations
• Association does not imply causation
• Important influences will be missed due to focus on measurable variables
• Some derived variables involve many estimates and assumptions
• Only considering public sharing in primary centralized databases
• Only one datatype
• Only research studies made available in full-text portals
Risks and contingency plans
• NLP performance may be inadequate: supplement with manual annotation via Mechanical Turk
• Author ambiguity may introduce extreme outliers: use Author-ity (Smalheiser and Torvik, 2005) for name disambiguation
• Unable to derive a robust exploratory factor model: try other clustering techniques
• Several variables may be unexpectedly difficult to extract and cross-reference: if not essential, defer analysis of that variable
Current status
Aim 1: Does sharing have benefit for those who share?
Aim 2: Can sharing and withholding be systematically measured?
Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
pilot completed.
Now: full dataset collection
Anticipated contributions
• Published assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing
• Publicly available dataset associating microarray study publications with data sharing status
• Generalizable approach for developing practical, real-world information retrieval using centralized full-text query portals
• Preliminary model of data sharing behaviour based on this large dataset
Future work
• Identify and model data reuse
• Citation analysis of the large cohort
• Supplement with survey responses
http://www.flickr.com/photos/cogdog/123072/
I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar
Share yours too!
http://www.flickr.com/photos/myklroventine/892446624/
Data sharing plan
Thanks to:
➡ the NLM for funding training grant 5 T15 LM007059-22
➡ the Dept of Biomedical Informatics at the U of Pittsburgh
➡ my committee
Dr Wendy Chapman, Biomed Informatics
Dr Ellen Detlefsen, iSchool
Dr Madhavi Ganapathiraju, Bioinformatics
Dr Brian Butler, Katz School of Business
Dr Gunther Eysenbach, U of Toronto, Health Policy Mgmt and Evaluation
aim
Funder Journal Investigator Institution Study
Is research data shared after publication?
Prevalence of data withholding via surveys (bar chart, axis 0% to 40%)
• self-reported denying a request in last 3 years
• trainees self-reported denying a request
• been denied access to data, materials, code
• authors “not able to retrieve raw data”
• not willing to release data
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics. 2001.
Self-reported reasons for data withholding (bar chart, axis 0% to 80%)
• sharing is too much effort
• want student or jr faculty to publish more
• they themselves want to publish more
• cost
• industrial sponsor
• confidentiality
• commercial value of results
Campbell et al. JAMA. 2002.
Correlates with self-reported data withholding (bar chart, axis 0 to 3)
• industry involvement
• perceived competitiveness of field
• male
• sharing discouraged in training
• human participants
• academic productivity
Blumenthal et al. Acad Med. 2006.