CDL UC Curation Center

Defining the Data Citation Problem in the DataNet Context

December 2009

John Kunze, University of California Curation Center, Oakland, CA

Robert Cook, Environmental Sciences Division, Oak Ridge National Laboratory, TN

Patricia Cruse, University of California Curation Center, Oakland, CA

Carol Tenopir, School of Information Sciences, University of Tennessee, Knoxville, TN

Todd Vision, Department of Biology, University of North Carolina, Chapel Hill, NC

William Michener, University Libraries, University of New Mexico, Albuquerque, NM

Data’s shameful neglect

“Research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly.”

Nature editorial, 10 September 2009

The scientific record is at risk

• Incompatible formats, models, semantics

• Poor preservation practice

• Dispersed sources

• Science needs this record to verify findings and test new hypotheses

• Record at risk means planet at risk

Collage: J. Callaway, USF

Data preservation is hard; start small with data publication

The risk is complex, with social and technical dimensions – can we start small?

• Insight: the data that drive much of the scientific journal literature are produced in islands of practice, resulting in unshared, incompatible datasets

• Hypothesis: establishing a system of data publishing will promote data sharing and re-use by providing standards and producer incentives

Publishing → Sharing → Use → Preservation

Data publishing challenges

• Datasets encompass everything
  – Data plus documents, images, audio, video, etc.
  – Tension between standardization and innovation

• Data is similar to software, but even more specialized
  – OK to maintain in-house, but tedious to prepare for release
  – Technical dependence complicates long-term maintenance
  – Internal consistency requirements, plus provenance

• Some built-in instability: long-term value of some data can depend on change, such as annotation

Data publication is hard; start small with data citation

Published data, from the outset, will call for citations

• Need links from journal articles to data used

Hypothesis: establishing simple, easy conventions for data citation will encourage its practice, hence data publishing, hence data preservation

data citation → data publishing → data preservation

Data citation leads to data set

Luyssaert, S., I. Inglima and M. Jung. 2009. Global Forest Ecosystem Structure and Function Data for Carbon Balance Research. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/949

http://dx.doi.org/10.3334/ORNLDAAC/949

http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=949

Leads often to one or more surrogates

If data set is archived, leads to data files
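The step from the `doi:` string in the citation to the click-through URL shown above can be sketched as follows. This is a minimal illustration, not part of the original slides: the function name `doi_to_url` is hypothetical, and the resolver prefix is simply the one that appears on this slide.

```python
from urllib.parse import quote

# Resolver prefix as it appears on the slide (an assumption that it stays valid).
RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi: str, resolver: str = RESOLVER) -> str:
    """Turn a citation-style 'doi:10.xxxx/...' string into an actionable URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode URL-unsafe characters but keep '/' as a path separator.
    return resolver + quote(doi, safe="/")


print(doi_to_url("doi:10.3334/ORNLDAAC/949"))
# prints http://dx.doi.org/10.3334/ORNLDAAC/949
```

The resolver then redirects to a surrogate (landing page), which is where the reader finds the data files if the data set is archived.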

Citation target

Small surrogate

Smaller surrogate

Smallest surrogate

Data citation examples

World Data Center for Paleoclimatology Data (NOAA)

Anderson, D.W., W.L. Prell, and N.J. Barratt. 1989. Estimates of sea surface temperature in the Coral Sea at the last glacial maximum. Paleoceanography 4(6):615-627. Data archived at the World Data Center for Paleoclimatology, Boulder, Colorado, USA.

Note: no identifier

Publishing Network for Geoscientific & Environmental Data in Germany

Nishioka, J et al. (2008): Profiles of iron concentration from GoFlow bottles during the CARUSO-EISENEX experiment, doi:10.1594/PANGAEA.701305, Supplement to: Nishioka, Jun; Takeda, Shigenobu; de Baar, Hein JW; Croot, Peter L; Boyé, Marie; Laan, Patrick; Timmermans, Klaas R (2005): Changes in the concentration of iron in different size fractions during an iron enrichment experiment in the open Southern Ocean, Marine Chemistry, 95(1-2), 51-63, doi:10.1016/j.marchem.2004.06.040

Note: 2 identifiers, 1 for publication, 1 for data
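One way to see the difference between these two citations is to look for identifiers mechanically. The sketch below is illustrative only: the regular expression is an assumed approximation of DOI syntax (not an official grammar), the name `find_dois` is hypothetical, and the second citation is abbreviated with "..." for space.

```python
import re

# Assumed approximation: "10.", a 4-9 digit registrant code, "/", then a
# suffix that runs until whitespace, comma, or semicolon.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s,;]+")


def find_dois(citation: str) -> list[str]:
    """Return DOI-like strings found in free-text citation prose."""
    # Drop a sentence-ending period that the suffix pattern may have swallowed.
    return [match.rstrip(".") for match in DOI_PATTERN.findall(citation)]


noaa = (
    "Anderson, D.W., W.L. Prell, and N.J. Barratt. 1989. Estimates of sea "
    "surface temperature in the Coral Sea at the last glacial maximum. "
    "Paleoceanography 4(6):615-627. Data archived at the World Data Center "
    "for Paleoclimatology, Boulder, Colorado, USA."
)
pangaea = (
    "Nishioka, J et al. (2008): Profiles of iron concentration from GoFlow "
    "bottles during the CARUSO-EISENEX experiment, "
    "doi:10.1594/PANGAEA.701305, Supplement to: ... Marine Chemistry, "
    "95(1-2), 51-63, doi:10.1016/j.marchem.2004.06.040"
)

print(find_dois(noaa))     # prints []
print(find_dois(pangaea))  # prints ['10.1594/PANGAEA.701305', '10.1016/j.marchem.2004.06.040']
```

Note that the pattern cannot tell which of the two identifiers points at the data and which at the article; that distinction lives only in the surrounding prose.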

More data citation examples

ICPSR

Kessler, Ronald C. National Comorbidity Survey: Baseline (NCS-1), 1990-1992 (Restricted Version) [Computer file]. ICPSR25381-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2009-05-11. doi:10.3886/ICPSR25381

Economic Modeling

Figure 3. Change of relative agricultural producer prices since 1998. Middle-income CIS show average for Russia, Kazakhstan, and Ukraine. …. Source: OECD, 2004 and CIS Statistics, 2003.

Note: archival data center?

Note: 2 organizations listed, but which of their 100s of datasets were used?

Contrasting citation styles

Some commonalities (who, when, where), but

• Prose is interspersed with metadata elements

• Standard citation format/recipe would be easy to read

• Not every citation had an actionable identifier

• Name of dataset and data subset used (what) unclear

• Archival commitment unclear

• Date of publication vs date of collection unclear

• One citation contained another citation (for publication)

What we want from data citation

• Precise identification of dataset
  – At level of version, file, table, etc., or groups thereof
  – So that readers can find and understand the data

• Credit to data producers and data publishers
  – Vital incentive for data sharing and archiving

• A link from the traditional literature to the data
  – Gives intellectual legitimacy to creation of data sets

• Research metrics for datasets
  – Sponsors want publication and retention numbers
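To make "precise identification" and "credit" concrete, a citation can be held as a structured record and rendered by a fixed recipe instead of free prose. The sketch below is hypothetical: the class name, the field names, and the rendering order are assumptions for illustration, not a published standard. The field values come from the ORNL DAAC citation shown earlier in the deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    creators: str                  # who: gives credit to data producers
    year: int                      # when: year of publication, not collection
    title: str                     # what: name of the dataset
    publisher: str                 # where: archive or data center
    identifier: str                # actionable identifier, e.g. a DOI
    version: Optional[str] = None  # optional snapshot label

    def render(self) -> str:
        """Render the record with one fixed, assumed recipe."""
        parts = [f"{self.creators} ({self.year}). {self.title}."]
        if self.version:
            parts.append(f"Version {self.version}.")
        parts.append(f"{self.publisher}.")
        parts.append(self.identifier)
        return " ".join(parts)


citation = DataCitation(
    creators="Luyssaert, S., I. Inglima and M. Jung",
    year=2009,
    title=("Global Forest Ecosystem Structure and Function Data "
           "for Carbon Balance Research"),
    publisher="Oak Ridge National Laboratory Distributed Active Archive Center",
    identifier="doi:10.3334/ORNLDAAC/949",
)
print(citation.render())
```

Because each element is a separate field, the same record can feed both a human-readable citation and the counts that sponsors ask for.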

Starter data citation wish list

– Any dataset, database, data file
– All levels of granularity (table, row, cell)
– For any snapshot (version, e.g., in time)
– Any formatted view: XML, HTML, CSV, etc.
– With and without annotations
– Links to older, newer, and latest versions
– Actionability (“Click-through”)
– Persistence (validity into the future)
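Several wish-list items (granularity, snapshot, formatted view) amount to qualifiers on a dataset identifier. A minimal sketch of such a target follows. Everything here is hypothetical: the class, the parameter names `version`, `granule`, and `view`, and the query-string encoding are invented for illustration and are not a syntax that any resolver is known to accept.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlencode


@dataclass(frozen=True)
class CitationTarget:
    dataset: str                   # identifier of the dataset as a whole
    version: Optional[str] = None  # snapshot; None means unspecified
    granule: Tuple[str, ...] = ()  # coarse-to-fine path, e.g. table, row, cell
    view: Optional[str] = None     # formatted view, e.g. "xml", "html", "csv"

    def qualifiers(self) -> str:
        """Encode the optional qualifiers as a query string (assumed syntax)."""
        params = {}
        if self.version:
            params["version"] = self.version
        if self.granule:
            params["granule"] = "/".join(self.granule)
        if self.view:
            params["view"] = self.view
        # Keep '/' readable inside the granule path.
        return urlencode(params, safe="/")


whole = CitationTarget(dataset="doi:10.3334/ORNLDAAC/949")
cell = CitationTarget(
    dataset="doi:10.3334/ORNLDAAC/949",
    version="1",                        # hypothetical version label
    granule=("table2", "row17", "c3"),  # hypothetical granule names
    view="csv",
)
print(repr(whole.qualifiers()))  # prints ''
print(cell.qualifiers())         # prints version=1&granule=table2/row17/c3&view=csv
```

The sketch covers only the first four wish-list items; annotations, version links, actionability, and persistence depend on the archive behind the identifier.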

Datasets and documents have much in common

                          Data                                    Document
Systematically organized  yes                                     yes
Hierarchical              yes                                     yes
Machine readable          Yes, with metadata for semantics help   Sort of, with schema structure (TEI, CCS docs)

Data citation wish list possibilities

We want it all, but might settle for initial partial solutions

• All datasets? Well, maybe just archived datasets*

• All levels of granularity? For any snapshot? All views? Publisher-defined granules, versions, and views*

• Plus older/newer version, and latest version? Surrogate-based pointer to extant version chain*

• With and without annotations? Annotation as publication*

• What about actionability and persistence? Yes and yes*

(* Standards and archives needed for all)
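The "surrogate-based pointer to extant version chain" idea can be sketched as a small record, kept by the surrogate, that lists the versions still available and answers older/newer/latest questions. The class and method names are hypothetical, and the version labels are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VersionChain:
    """Versions of one dataset that are still extant, oldest first."""
    versions: List[str] = field(default_factory=list)

    def latest(self) -> Optional[str]:
        return self.versions[-1] if self.versions else None

    def older(self, version: str) -> Optional[str]:
        """Return the version just before `version`, or None if it is the first."""
        i = self.versions.index(version)  # raises ValueError if unknown
        return self.versions[i - 1] if i > 0 else None

    def newer(self, version: str) -> Optional[str]:
        """Return the version just after `version`, or None if it is the latest."""
        i = self.versions.index(version)
        return self.versions[i + 1] if i + 1 < len(self.versions) else None


chain = VersionChain(["v1", "v2", "v3"])  # placeholder version labels
print(chain.latest())     # prints v3
print(chain.older("v2"))  # prints v1
print(chain.newer("v3"))  # prints None
```

A citation can then name one fixed version while the surrogate still lets a reader discover that something newer exists.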

Initiatives and outfits to watch

• DataCite initiative: to encourage data publishing via global data citation support: standards, persistent reference to datasets in regional archives

• Supplemental materials publishing standards for data, surrogates, and extended descriptions and methods, e.g., technical data application appendices

• Publishers: increased volume of submissions

• Community standards (so many to choose from!): ORNL DAAC, Pangaea, GCMD, ESIP, GBIF, TDWG, OECD, NISO/NFAIS, IPYDIS, Dataverse, etc.

Data citation summary

Data citation helps publication and sharing, which helps preservation and re-use, which saves the planet

• Gives credit to data producers and data publishers
  – Vital incentive for data sharing and archiving

• Provides a link from traditional literature to data
  – Gives intellectual legitimacy to creation of data

• Research metrics for datasets
  – Sponsors want publication and retention numbers

• Need recipes and stuff, i.e., standards and archives