Presentation by Heather Piwowar as part of UBC's Open Access Week 2010
Open research data
Heather Piwowar
DataONE postdoc with Dryad and NESCent, UBC
@researchremix
OA week 2010
University of British Columbia
#1
It matters
http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
http://www.flickr.com/photos/jsmjr/62443357/
http://www.flickr.com/photos/camilleharrington/3587294608/
http://www.flickr.com/photos/rkuhnau/3318245976/
http://www.flickr.com/photos/conformpdx/1796399674/
http://www.flickr.com/photos/rkuhnau/3317418699/
http://www.flickr.com/photos/zemlinki/261617721/
http://www.flickr.com/photos/tracenmatt/3020786491/
http://www.flickr.com/photos/the-o/2078239333/
http://www.flickr.com/photos/75166820@N00/5318468/
#2
Wayfinding + progress
http://www.flickr.com/photos/paulhami/1020538523//
Which data?
Where?
With whom?
When?
Under what terms?
Find
Organize
Document
Deidentify
Format
Ask
Submit
Answer questions
Worry about mistakes being found
Worry about data being misinterpreted
Worry about being scooped
Forgo money and IP and prestige???
not very motivating.
http://www.flickr.com/photos/tonivc/2283676770/
http://www.flickr.com/photos/johnnyvulkan/381941233/
a) policies + expectations
- NSF
- Joint Data Archiving Policy
- BioMed Central
- PLoS
b) repositories
- datatype-based
- institution-based
- discipline-based
- journal-based
c) standards
- data licenses
- data citation
- IDs for datasets, people, entities
d) part of something bigger
- open government data
- citizen science
- supplemental materials
- dataset-based usage metrics
- awards, recognition
#3
Is it working?
http://www.genome.jp/en/db_growth.html
lots of data sharing!
but how much isn’t shared?
what isn’t shared?
who isn’t sharing it?
why not?
what can we do about it?
how much does it matter?
you can not manage what you do not measure
quote: Lord Kelvin
http://www.flickr.com/photos/archeon/2941655917/
http://www.flickr.com/photos/ryanr/142455033/
Why is it important?
Are we sure?
Errors.
Gore et al 1977, Kantoer and Taylor 1994, McGuigan 1995, Hurlbert and White 1993
More than half of all papers contain errors
5‐10% contain errors that change the conclusions
Ok, let’s share on request.
Doesn’t work
[Bar chart; axis 0% to 40%]
- self-reported denying a request in last 3 years
- trainees self-reported denying a request
- been denied access to data, materials, code
- authors “not able to retrieve raw data”
- not willing to release data
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics 2001.
Don’t get the email
Evangelou et al. FASEB J. 2006.
Wren. Bioinformatics 2008.
Wren et al. EMBO Rep 2006.
Say no
Hedstrom. Society of Am Archivists Ann Meeting. 2008.
[Bar chart; axis 0% to 50%]
- want to publish more papers first
- want exclusive use
- ensure data confidentiality
- control
- avoid cost of preparation
Ask why
Reidpath et al. Bioethics 2001.
“Before I send you the data could I ask what you want it for?”
“Can you be more explicit, please, about the analyses you have in mind and what you plan to do with them?”
“We'll have to discuss your request with the other coauthors. Before we do that, I'd like to know your proposed analysis plan.”
“We are not finished using the data, but when we are finished with it, we would be open to requests for the data.”
“Any use of the data other than for the specific purpose laid down in the contract of collaboration is effectively ruled out.”
Not efficient. Not fair.
Campbell et al. 2000
Not random:
‐ young
‐ productive
Has real costs.
Survey of doctoral students and postdocs:
28-50% reported negative effects of withholding:
• hurt progress of their research,
• hurt rate of discovery in their lab/research group,
• hurt quality of their relationships with academic scientists,
• hurt quality of their education,
• hurt level of communication in their lab/research group.
Vogeli et al. Acad Med. 2006 Feb; 81(2):128-36
Ok, then on a website?
No. URLs stop working.
Evangelou et al. FASEB J. 2006.
Wren. Bioinformatics 2008.
Wren et al. EMBO Rep 2006.
Ok, in a repository?
lots of data sharing!
http://www.genome.jp/en/db_growth.html
http://www.flickr.com/photos/g_kat26/4255119413/
http://www.flickr.com/photos/jima/606588905/
Combined, these full-text portals reach 85% of the articles available through U of Pittsburgh library subscriptions.
microarray data
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
11,603 studies that created gene expression microarray data
Is research data shared after publication?
Funder: funded by NIH?, size of grant, sharing plan req’d?, funded by non-NIH?
Journal: impact factor, strength of policy, open access?, number of microarray studies published
Investigator: years since first paper, # pubs, # citations, previously shared?, previously reused?, gender
Institution: sector, size, impact rank, country
Study: humans?, mice?, plants?, cancer?, clinical trial?, number of authors, year
“An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …”
http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
journal data sharing policy
journal rank
institution rank
Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17
funding level
PubMed grant lists + NIH grant details
study type
author gender
and so on...
124 variables
11,603 studies
25% had links from datasets in databases
[Figure: Proportion of articles with shared datasets, by year. x-axis: year article published, 2000 to 2009; y-axis: proportion of articles with datasets found in GEO or ArrayExpress, ticks 0.05 to 0.35.]
Across time
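The per-year proportion on this slide is a simple ratio: articles with a dataset link found, divided by all articles published that year. A minimal sketch of that calculation follows. The function name and the example records are hypothetical, made up for illustration; they are not the study's data or code.

```python
from collections import defaultdict


def proportion_shared_by_year(articles):
    """Return {year: fraction of that year's articles with a shared dataset}.

    `articles` is an iterable of (publication_year, dataset_found) pairs.
    """
    totals = defaultdict(int)
    shared = defaultdict(int)
    for year, dataset_found in articles:
        totals[year] += 1
        if dataset_found:
            shared[year] += 1
    return {year: shared[year] / totals[year] for year in sorted(totals)}


# Hypothetical records, NOT the study's data:
# (publication year, was a dataset link found in a database?)
example_articles = [
    (2005, True), (2005, False), (2005, False), (2005, False),
    (2006, True), (2006, True), (2006, False), (2006, False),
]

proportions = proportion_shared_by_year(example_articles)
print(proportions)
```

With real records, each year's value would correspond to one point on the "Proportion of articles with shared datasets, by year" plot.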
What can we do about it?
Funder policies.
19%
Piwowar and Chapman. Journal of Informetrics 2010
What can we do about it? Journal policies.
We looked at data sharing policies within Instruction to Author statements of 70 journals, as they apply to gene expression microarray data.
Piwowar and Chapman. ELPUB 2008
No applicable policy (43%)
Weak policy (24%)
- should, recommend, request
- must, but without requiring database accession number
Strong policy (33%)
- must, required, condition of publication
- requires database accession number
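The three policy categories above can be read as a small keyword rule. The sketch below is one possible reading and is not the coding procedure used in the study: it assumes the input is a policy sentence about microarray data that has already been pulled out of the Instructions to Authors, and the function name and term lists are illustrative choices (with "required" stemmed to "require").

```python
import re

# Illustrative term lists taken from the slide's wording (stemmed where useful).
STRONG_TERMS = ("must", "require", "condition of publication")
WEAK_TERMS = ("should", "recommend", "request")


def _mentions(text, terms):
    """True if any term starts at a word boundary somewhere in text."""
    return any(re.search(r"\b" + re.escape(term), text) for term in terms)


def classify_policy(statement):
    """Classify one policy statement as 'strong', 'weak', or 'none'."""
    text = statement.lower()
    asks_for_accession = "accession number" in text
    if _mentions(text, STRONG_TERMS):
        # "must" without an accession number counts as weak on the slide.
        return "strong" if asks_for_accession else "weak"
    if _mentions(text, WEAK_TERMS):
        return "weak"
    return "none"


print(classify_policy(
    "Authors must deposit microarray data and provide the database accession number."
))
print(classify_policy("We recommend that authors deposit data."))
```

A keyword rule like this will misread many real policy texts, so it is best treated as a first pass before manual review.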
strength of data sharing policies
High-impact journals tend to have a strong data-sharing policy
Articles published in journals with a strong data-sharing policy are more likely to have publicly available datasets
What can we do about it?
Learn
• Learn from those who do it well
• Focus on places that need it
[Figure: Journals (Physiological Genomics). Proportion of datasets shared (y-axis, 0.0 to 1.0) for each journal (x-axis).]
[Figure: Institutions (Stanford). Proportion of datasets shared (y-axis, 0.0 to 1.0) for each institution (x-axis).]
[Figure: Institution rank. Proportion of datasets shared (y-axis, 0.0 to 1.0) against institution rank (x-axis, 1 to 1901).]
[Figure: odds ratios from multivariate nonlinear regressions with interactions. Axis: Odds Ratio, ticks 0.25 0.50 1.00 2.00 4.00 8.00; plot label 0.95. Factors shown:]
- Has journal policy
- Count of R01 & other NIH grants
- Authors prev GEOAE sharing & OA & microarray creation
- NO K funding or P funding
- Institution high citations & collaboration
- Journal impact
- Journal policy consequences & long halflife
- NOT animals or mice
- Institution is government & NOT higher ed
- Last author num prev pubs & first year pub
- Large NIH grant
- Humans & cancer
- NO geo reuse + YES high institution output
- First author num prev pubs & first year pub
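For readers less familiar with odds ratios, the sketch below shows the simplest, unadjusted form: the odds of sharing in one group divided by the odds of sharing in another. It is not the multivariate regression behind the plot, which adjusts for many factors at once. The function name and all counts are hypothetical.

```python
def odds_ratio(shared_with, not_shared_with, shared_without, not_shared_without):
    """Unadjusted odds ratio of sharing for a group with a factor vs. without it."""
    odds_with = shared_with / not_shared_with
    odds_without = shared_without / not_shared_without
    return odds_with / odds_without


# Hypothetical counts, NOT from the study:
# 100 articles in journals with a policy (40 shared, 60 did not),
# 100 articles in journals without one (25 shared, 75 did not).
ratio = odds_ratio(40, 60, 25, 75)
print(ratio)
```

An odds ratio above 1.00 means the factor is associated with more sharing, below 1.00 with less, which is why 1.00 sits in the middle of the plot's log-spaced axis.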
[Figure: odds ratios from multivariate nonlinear regression with interactions. Axis: Odds Ratio, ticks 0.25 0.50 1.00 2.00 4.00; plot label 0.95. Factors shown:]
- OA journal & previous GEO-AE sharing
- Amount of NIH funding
- Journal impact factor and policy
- Higher Ed in USA
- Cancer & humans
Carrot?
http://www.flickr.com/photos/sunrise/35819369/
currency of value?
Citations.
$50!
Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)
citations: ISI Web of Science Citation index, citations from 2004-2005
data sharing locations: Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine
statistics: Multivariate linear regression
Note: log scale
~70%
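When a linear regression is fitted on a log scale, a coefficient translates into a percentage difference rather than a fixed count. The sketch below shows that conversion under two assumptions that the slides do not spell out: that the outcome is the natural log of the citation count, and that the ~70% figure is the citation difference associated with sharing data. The function names are illustrative.

```python
import math


def percent_change_from_log_coefficient(beta):
    """Percent difference in the outcome implied by a coefficient on ln(outcome)."""
    return (math.exp(beta) - 1.0) * 100.0


def log_coefficient_from_percent_change(percent):
    """Coefficient on ln(outcome) that corresponds to a given percent difference."""
    return math.log(1.0 + percent / 100.0)


# Under the assumptions above, a ~70% difference corresponds to a
# log-scale coefficient of ln(1.7).
beta_for_70 = log_coefficient_from_percent_change(70.0)
print(round(beta_for_70, 3))
print(round(percent_change_from_log_coefficient(beta_for_70), 1))
```

The practical point is that the effect is multiplicative: under this reading, a paper that would otherwise have 10 citations and one that would have 100 both gain about 70%, not the same number of citations.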
Next?
http://www.flickr.com/photos/gatewaystreets/3838452287/
Impact of JDAP
Abadie et al. Journal of the American Statistical Association 2010
Reuse.
http://www.flickr.com/photos/boitabulle/3668162701/
http://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Gamma_distribution_pdf.svg/500px-Gamma_distribution_pdf.svg.png
#4
We are the culture.
Let’s do it.
http://www.flickr.com/photos/joellevand/279468607/
http://www.flickr.com/photos/huzzahvintage/4577075021/
a) in our communities
- strengthening policies: journal, conference, institutional
- decision-makers
- role-models and educators
b) in our tools
- measure opinions
- measure use
- be transparent!
c) with our data
- share it.
- ugly? incomplete? strange?
“Flawed, but out there” is a million times better than “perfect, but unattainable”
http://sciblogs.co.nz/seeing-data/2010/10/12/the-zen-of-open-data/
“Does anyone want your data?
That’s hard to predict […] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.
Your data, too, may simply be awaiting an effective matchmaker.”
Got data? Nature Neuroscience (2007)
I post my data, code, and statistical scripts: http://researchremix.org
Share yours too!
http://www.flickr.com/photos/myklroventine/892446624/
More info?
• OATP oa.data tag on Connotea, Twitter
• FriendFeed
• Mendeley “data sharing” group
• @researchremix [email protected]
thank you
Todd Vision, Michael Whitlock, Wendy Chapman
The open science online community and those who release their articles, datasets and photos openly
http://www.flickr.com/photos/youraddresshere/6649228/