CDL, UC Curation Center
Defining the Data Citation Problem in the DataNet Context
December 2009
John Kunze, University of California Curation Center, Oakland, CA
Robert Cook, Environmental Sciences Division, Oak Ridge National Laboratory, TN
Patricia Cruse, University of California Curation Center, Oakland, CA
Carol Tenopir, School of Information Sciences, University of Tennessee, Knoxville, TN
Todd Vision, Department of Biology, University of North Carolina, Chapel Hill, NC
William Michener, University Libraries, University of New Mexico, Albuquerque, NM
Data’s shameful neglect
“Research cannot flourish if data are not preserved and made accessible.
All concerned must act accordingly.”
10 September 2009
The scientific record is at risk
• Incompatible formats, models, semantics
• Poor preservation practice
• Dispersed sources
• Science needs this record to verify findings and test new hypotheses
• Record at risk → planet at risk
Collage: J. Callaway, USF
Data preservation is hard; start small with data publication
The risk is complex, with social and technical dimensions – can we start small?
• Insight: the data that drive much of the scientific journal literature are produced in islands of practice, resulting in unshared, incompatible datasets
• Hypothesis: establishing a system of data publishing will promote data sharing and re-use by providing standards and producer incentives
Publishing → Sharing → Use → Preservation
Data publishing challenges
• Datasets encompass everything
  – Data plus documents, images, audio, video, etc.
  – Tension between standardization and innovation
• Data is similar to software, but even more specialized
  – OK to maintain in-house, but tedious to prepare for release
  – Technical dependence complicates long-term maintenance
  – Internal consistency requirements, plus provenance
• Some built-in instability: long-term value of some data can depend on change, such as annotation
Data publication is hard; start small with data citation
Published data, from the outset, will call for citations
• Need links from journal articles to data used
Hypothesis: establishing simple, easy conventions for data citation will encourage its practice, hence data publishing, hence data preservation
data citation → data publishing → data preservation
Data citation leads to data set
Luyssaert, S., I. Inglima and M. Jung. 2009. Global Forest Ecosystem Structure and Function Data for Carbon Balance Research. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/949
http://dx.doi.org/10.3334/ORNLDAAC/949
http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=949
Leads often to one or more surrogates
If data set is archived, leads to data files
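The first hop in this chain, from the DOI printed in the citation to a clickable resolver link, can be sketched in a few lines. This is only an illustration: the function name doi_to_url is hypothetical, and the resolver prefix is simply the one shown in the link above.

```python
def doi_to_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Turn a DOI as printed in a citation into a resolver URL.

    The default resolver prefix is the one used on this slide; treating it
    as a configurable parameter is an assumption of this sketch.
    """
    doi = doi.strip()
    # Citations usually print the identifier with a "doi:" label; drop it.
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return resolver + doi


# The identifier from the Luyssaert et al. citation above.
print(doi_to_url("doi:10.3334/ORNLDAAC/949"))
# prints http://dx.doi.org/10.3334/ORNLDAAC/949
```

The resolver then redirects to the archive's landing page (a surrogate), which in turn leads to the data files if the data set is archived.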
Citation target
Small surrogate
Smaller surrogate
Smallest surrogate
Data citation examples
World Data Center for Paleoclimatology Data (NOAA)
Anderson, D.W., W.L. Prell, and N.J. Barratt. 1989. Estimates of sea surface temperature in the Coral Sea at the last glacial maximum. Paleoceanography 4(6):615-627. Data archived at the World Data Center for Paleoclimatology, Boulder, Colorado, USA.
No identifier
Publishing Network for Geoscientific & Environmental Data in Germany
Nishioka, J et al. (2008): Profiles of iron concentration from GoFlow bottles during the CARUSO-EISENEX experiment, doi:10.1594/PANGAEA.701305, Supplement to: Nishioka, Jun; Takeda, Shigenobu; de Baar, Hein JW; Croot, Peter L; Boyé, Marie; Laan, Patrick; Timmermans, Klaas R (2005): Changes in the concentration of iron in different size fractions during an iron enrichment experiment in the open Southern Ocean, Marine Chemistry, 95(1-2), 51-63, doi:10.1016/j.marchem.2004.06.040
2 identifiers: 1 for publication, 1 for data
More data citation examples
ICPSR
Kessler, Ronald C. National Comorbidity Survey: Baseline (NCS-1), 1990-1992 (Restricted Version) [Computer file]. ICPSR25381-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2009-05-11. doi:10.3886/ICPSR25381
Economic Modeling
Figure 3. Change of relative agricultural producer prices since 1998. Middle-income CIS show average for Russia, Kazakhstan, and Ukraine. …. Source: OECD, 2004 and CIS Statistics, 2003.
archival data center?
2 organizations listed, but which of their 100s of datasets were used?
Contrasting citation styles
Some commonalities (who, when, where), but
• Prose is interspersed with metadata elements
• A standard citation format/recipe would be easier to read
• Not every citation had an actionable identifier
• Name of dataset and data subset used (what) unclear
• Archival commitment unclear
• Date of publication vs date of collection unclear
• One citation contained another citation (for publication)
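One of these observations, that not every citation carries an actionable identifier, lends itself to a mechanical check. The sketch below scans free-text citations for DOI-like strings. The regular expression is a deliberately loose assumption, not an official DOI grammar, and the helper name find_dois is hypothetical. The sample strings are shortened fragments of the citations shown on the earlier slides.

```python
import re
from typing import List

# Loose, illustrative pattern: "10.", a numeric registrant code, "/",
# then a suffix that runs until whitespace, comma, or semicolon.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s,;]+")


def find_dois(citation: str) -> List[str]:
    """Return every DOI-like string found in a citation, in order."""
    # Strip a trailing period that belongs to the sentence, not the DOI.
    return [match.rstrip(".") for match in DOI_PATTERN.findall(citation)]


paleo = ("Paleoceanography 4(6):615-627. Data archived at the World Data "
         "Center for Paleoclimatology, Boulder, Colorado, USA.")
pangaea = ("doi:10.1594/PANGAEA.701305, Supplement to: Marine Chemistry, "
           "95(1-2), 51-63, doi:10.1016/j.marchem.2004.06.040")

print(find_dois(paleo))    # prints []
print(find_dois(pangaea))  # prints ['10.1594/PANGAEA.701305', '10.1016/j.marchem.2004.06.040']
```

A citation that yields two identifiers, as the second one does, still leaves a reader to work out which one points at the data and which at the publication. This is one reason a standard recipe would help.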
What we want from data citation
• Precise identification of dataset
  – At level of version, file, table, etc., or groups thereof
  – So that readers can find and understand the data
• Credit to data producers and data publishers
  – Vital incentive for data sharing and archiving
• A link from the traditional literature to the data
  – Gives intellectual legitimacy to creation of data sets
• Research metrics for datasets
  – Sponsors want publication and retention numbers
Starter data citation wish list
– Any dataset, database, data file
– All levels of granularity (table, row, cell)
– For any snapshot (version, e.g., in time)
– Any formatted view: XML, HTML, CSV, etc.
– With and without annotations
– Links to older, newer, and latest versions
– Actionability (“Click-through”)
– Persistence (validity into the future)
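To make the wish list concrete, here is one possible shape for a citation target that records which dataset, snapshot, granule, and view a citation points at. Everything in it is an assumption made for illustration: the class name, the field names, and the describe() output format come from no existing standard. The identifier and version are taken from the ICPSR example on an earlier slide, while the CSV view is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitationTarget:
    """Hypothetical record of exactly what a data citation refers to."""
    identifier: str                 # persistent identifier of the dataset
    version: Optional[str] = None   # snapshot, e.g. a version label or date
    granule: Optional[str] = None   # table, row, or cell within the dataset
    view: Optional[str] = None      # formatted view: XML, HTML, CSV, etc.
    annotated: bool = False         # cite the data with or without annotations

    def describe(self) -> str:
        """Render the target as a short human-readable string."""
        parts = [self.identifier]
        if self.version:
            parts.append(f"version {self.version}")
        if self.granule:
            parts.append(f"granule {self.granule}")
        if self.view:
            parts.append(f"view {self.view}")
        parts.append("with annotations" if self.annotated else "without annotations")
        return ", ".join(parts)


target = DataCitationTarget("doi:10.3886/ICPSR25381", version="v1", view="CSV")
print(target.describe())
# prints doi:10.3886/ICPSR25381, version v1, view CSV, without annotations
```

The record is frozen because a citation should keep pointing at the same thing. Links to older, newer, and latest versions would then live in the archive's metadata rather than in the citation itself, which is one design choice among several.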
Datasets and documents have much in common
                          Data                                    Document
Systematically organized  yes                                     yes
Hierarchical              yes                                     yes
Machine readable          Yes, with metadata for semantics help   Sort of, with schema structure (TEI, CCS docs)
Data citation wish list possibilities
We want it all, but might settle for initial partial solutions
All datasets? Well, maybe just archived datasets*
All levels of granularity? For any snapshot? All views?
Publisher-defined granules, versions, and views*
Plus older/newer version, and latest version?
Surrogate-based pointer to extant version chain*
With and without annotations? Annotation as publication*
What about actionability and persistence? Yes and yes*
(* Standards and archives needed for all)
Initiatives and outfits to watch
• DataCite initiative: to encourage data publishing via global data citation support: standards, persistent reference to datasets in regional archives
• Supplemental materials publishing standards for data, surrogates, and extended descriptions and methods, e.g., technical data application appendices
• Publishers: increased volume of submission
• Community standards (so many to choose from!): ORNL DAAC, Pangaea, GCMD, ESIP, GBIF, TDWG, OECD, NISO/NFAIS, IPYDIS, Dataverse, etc.
Data citation summary
Data citation helps publication and sharing, which helps preservation and re-use, which saves the planet
• Gives credit to data producers and data publishers
  – Vital incentive for data sharing and archiving
• Provides a link from traditional literature to data
  – Gives intellectual legitimacy to creation of data
• Research metrics for datasets
  – Sponsors want publication and retention numbers
• Need recipes and stuff, i.e., standards and archives