
Because good research needs good data

Funded by:

© Digital Curation Centre, 2009. Licensed under Creative Commons BY-NC-SA 2.5 Scotland:

http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/

Perspectives on Digital Curation, Data and Publishing
Why, How, Where?

Kevin Ashley

Director, DCC

director@dcc.ac.uk

• High Heid Yin,

4th Bloomsbury Conference on e-publishing, UCL, London – 20100624 Kevin Ashley, DCC CC-BY-SA

DATA

PUBLISHING


TURN into a diagram

• Idea – funding – collection – analysis – publish


Overview

• Why care about citing and/or referencing data?
• Data is different – and that matters
• Approaches, and their strengths & weaknesses


WHY CARE? #1

Page 7: Because good research needs good data Funded by: Perspectives on Digital Curation, Data and Publishing Why, How, Where? Kevin Ashley Director, DCC director@dcc.ac.uk

4th Bloomsbury Conference on e-publishing, UCL, London – 20100624 Kevin Ashley, DCC CC-BY-SA

LIDAR & RADAR images of ice cloud – H. Russchenberg


WHY CARE? #2


[Chart, horizontal axis 2010 to 2020] University funding – the future is scary


“The Data Behind The Graph”

• Data in support of publication should be as accessible as the publication itself

• Allows challenge, replication, understanding
• Often undertaken by publishers, or ventures associated with them
• Sometimes ‘associated items’ to DOI of paper, sometimes objects with own DOI


Integrity goes further…

• The data that I publish is not always the data that I collected

• Sometimes, that matters


WHY CARE? #3


Impact

• Making data accessible increases citation rates
• Better for authors; better for publishers
• Piwowar, Day & Fridsma (2007):
  • 45% of studies make data accessible
  • They receive 85% of citations
• Caution: correlation is not causation
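
The two percentages on this slide imply a per-study citation ratio, which can be worked out directly. The sketch below is plain arithmetic on the slide's figures (45% of studies, 85% of citations); the function name is hypothetical, and, as the slide cautions, the result describes a correlation rather than a causal effect.

```python
def citation_ratio(share_of_studies: float, share_of_citations: float) -> float:
    """Mean citations per open-data study divided by mean citations per other study."""
    if not (0 < share_of_studies < 1 and 0 < share_of_citations < 1):
        raise ValueError("shares must be strictly between 0 and 1")
    # Citations per study, in arbitrary units, for each group
    open_rate = share_of_citations / share_of_studies
    other_rate = (1 - share_of_citations) / (1 - share_of_studies)
    return open_rate / other_rate


# Figures as quoted on the slide
ratio = citation_ratio(0.45, 0.85)
print(f"{ratio:.1f}")
```

On these figures, a study that shares its data is cited roughly seven times as often, on average, as one that does not.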


WHY CARE? #4


I ♥ your data!

I don’t ♥ what you said about it.


E.g. …

• Your data on rock types of Crete supported a publication in a geological journal

• Your data on rock types of Crete supports my theory about the sources of pigments used on Minoan pottery

• I won’t be publishing in a geological journal
• Your conclusions about events 1 billion years ago have no relevance for me


WHY CARE? #5


Understanding Biodiversity

• We don’t understand what drives it
• What helps, hinders speciation
• We believe it to be good
• No one project or data source is enough


Research on Biodiversity…

• Requires many different data sources
• Not all will be published
• Not all publications are for similar research reasons, so…
• Citing the publication is (often) irrelevant
• Some is research data, other government or reference data
• There are probably gaps that need filling


WHY CARE? #6


People’s lives may depend on it

• Watch Josh Sommer @ SAGE Bionetworks

http://sagecongress.org/WP/presentations


Data Is Different


Where can it happen

Global, international

Nationally

Institution

By Subject

Research Group


Approaches at publication

• Giving data digital object identifiers (DOI)
• E.g. DataCite
• Capturing data subsets at point of publication
• Freezing those subsets somewhere
• Publication-led approach
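
Capturing and freezing a subset at the point of publication can be sketched as follows. This is a minimal illustration, not DataCite's or any publisher's actual workflow: the `freeze_subset` helper, the row layout, the sample values and the use of a SHA-256 digest as a fixity check are all assumptions made for the example.

```python
import hashlib
import json


def freeze_subset(rows, predicate):
    """Select the rows a publication relies on; return (frozen_bytes, sha256_hex)."""
    subset = [row for row in rows if predicate(row)]
    # Canonical serialisation (sorted keys, fixed separators) so that the
    # same subset always produces exactly the same bytes and digest.
    frozen = json.dumps(subset, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return frozen, hashlib.sha256(frozen).hexdigest()


# Made-up rows standing in for a larger source dataset
rows = [
    {"row": 1, "site": "A", "value": 2.3},
    {"row": 2, "site": "B", "value": 1.0},
    {"row": 3, "site": "A", "value": 3.0},
]

frozen, digest = freeze_subset(rows, lambda r: r["site"] == "A")
```

The frozen bytes would be deposited somewhere stable and given an identifier; the digest lets a later reader confirm that what they retrieve is what was captured.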


Who are the actors

• Scholars, librarians, publishers…
• … aren’t enough
• They aren’t even the only people doing this now.
• They are a very big part of the answer, though
• Curation happens before, after, without, publication


One datacenter: NOAA

• General form for citing published World Data Center for Paleoclimatology Data:
  • Anderson, D.W., W.L. Prell, and N.J. Barratt. 1989. Estimates of sea surface temperature in the Coral Sea at the last glacial maximum. Paleoceanography 4(6):615-627. Data archived at the World Data Center for Paleoclimatology, Boulder, Colorado, USA.
• General form for citing unpublished World Data Center for Paleoclimatology Data:
  • Rind, D. 1994. General Circulation Model Output Data Set. IGBP PAGES/World Data Center for Paleoclimatology Data Contribution Series #1994-012. NOAA/NCDC Paleoclimatology Program, Boulder, Colorado, USA.
• Citation for Data archived via a Data Cooperative:
  • McAndrews, J.H. 1996. Martin Pond pollen record. In E.C. Grimm et al., editors, North American Pollen Database. IGBP PAGES/World Data Center for Paleoclimatology. NOAA/NCDC Paleoclimatology Program, Boulder, Colorado, USA.
• Citation for group of contributors that is too large to cite individually:
  • Contributors of the International Tree-Ring Data Bank, IGBP PAGES/World Data Center for Paleoclimatology, NOAA/NCDC Paleoclimatology Program, Boulder, Colorado, USA.
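
The "unpublished data" form above is regular enough to assemble from a handful of fields. The sketch below reproduces the Rind example from the slide; the `Contribution` class and `cite_unpublished` function are hypothetical names for illustration and are not part of any NOAA tool.

```python
from dataclasses import dataclass

ARCHIVE = "IGBP PAGES/World Data Center for Paleoclimatology"
PROGRAM = "NOAA/NCDC Paleoclimatology Program, Boulder, Colorado, USA"


@dataclass(frozen=True)
class Contribution:
    author: str          # already in "Surname, Initials." form
    year: int
    title: str
    series_number: str   # Data Contribution Series number, e.g. "1994-012"


def cite_unpublished(c: Contribution) -> str:
    """Format a citation in the slide's 'unpublished data' form."""
    return (
        f"{c.author} {c.year}. {c.title}. "
        f"{ARCHIVE} Data Contribution Series #{c.series_number}. "
        f"{PROGRAM}."
    )


rind = Contribution("Rind, D.", 1994, "General Circulation Model Output Data Set", "1994-012")
citation = cite_unpublished(rind)
print(citation)
```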


But data changes, and is big

• When many publications use bits of one dataset…

• When a dataset changes hourly…
• … and is petabyte-sized
• … snapshots don’t cut it
• They also lose context of original


[Diagram labels]

Publication A    Publication B    Publication C    Publication D

Data Object A    Data Object B    Data Object C    Data Object D

Original Source Data


Data Management vs Curation

• UC3 view
• “Curation value can be added… by enabling creative use, reuse, in whole or part or in aggregation…”
• Facilitate by:
  • Persistent citation & actionable reference
  • Discovery of content & contextual description
  • Annotation for enriched description


Citing big data accurately

• Don’t make copies – keep change records
• Create reference mechanisms that allow reference to a specific change point
• C.f. ‘Memento’ technique for referencing web pages
• Requires cooperation between curator & referencer
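
The idea of keeping change records and citing a change point, rather than storing copies, can be sketched with an append-only log. This is an illustration only and is not the Memento protocol; the `ChangeLog` class, its method names and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Append-only record of cell updates; nothing is copied or overwritten."""
    changes: list = field(default_factory=list)

    def record(self, row: int, column: str, value) -> int:
        """Store one change and return its change point (1-based sequence number)."""
        self.changes.append((row, column, value))
        return len(self.changes)

    def state_at(self, change_point: int) -> dict:
        """Rebuild the dataset as it stood immediately after the given change point."""
        if not 0 <= change_point <= len(self.changes):
            raise ValueError("unknown change point")
        state = {}
        for row, column, value in self.changes[:change_point]:
            state.setdefault(row, {})[column] = value
        return state


log = ChangeLog()
log.record(4, "F4", 0)
cited = log.record(5, "F3", "f")   # a paper cites the data as of this change point
log.record(4, "F4", 300)           # the curator later revises row 4
log.record(5, "F3", "F")           # and normalises row 5

as_cited = log.state_at(cited)
current = log.state_at(len(log.changes))
```

A reference then needs only the dataset's identifier plus the change point. The curator has to keep the log and honour such references, which is the cooperation the last bullet calls for.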


[Slide graphic: the same five-row table, shown nine times]

Row  F1   F2  F3  F4
1    2.3  Y   M   0
2    1    N   F   0
3    3    N   F   0
4    2    Y   M   300
5    4    N   f   0


On Citing Data

• Peter Buneman. How to cite curated databases and how to make them citable. In Proceedings of the 18th Conference on Scientific and Statistical Database Management, pages 195-203, July 2006

• Some serious computer science – some for a very general audience


Buneman’s desiderata

• Let C be a citation, <C> the thing cited
• D1: For any citation C, <C> shall be fixed
• D2: Any citable thing T should contain a C such that <C> = T
• D3: Databases should be citable at multiple levels of coarseness
• D4: If C and P are citations and <P> is coarser than <C>, then location info in P should be in C
• D5: Versioning is done at database level
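
One way to read D1 and D3 to D5 is as constraints on a small data structure. The sketch below is an interpretation for illustration, not Buneman's own formalism: the `Citation` class, the path-prefix test for "coarser" and the database, version and path values are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)        # immutable, in the spirit of D1: a citation never changes
class Citation:
    database: str
    version: str               # D5: a single version label for the whole database
    path: tuple = ()           # location within the database; () cites all of it (D3)

    def is_coarser_than(self, other: "Citation") -> bool:
        """True if this citation's target strictly contains the other's target."""
        return (
            self.database == other.database
            and self.version == other.version
            and len(self.path) < len(other.path)
            # D4: the coarser citation's location info appears in the finer one
            and other.path[: len(self.path)] == self.path
        )


whole = Citation("ExampleDB", "2010-06")
table = Citation("ExampleDB", "2010-06", ("rocks",))
entry = Citation("ExampleDB", "2010-06", ("rocks", "crete", "row-4"))
```

Here `whole` and `table` are both coarser than `entry`, and `entry` carries their location information as a prefix of its own path.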


Acknowledgements

• De Spullenmannen & Harmen G Zijp (http://www.spullenmannen.nl/)

• Jaxa.jp (satellite image)
• NOAA
• NDAD/Crown Copyright/Happy Computers Ltd


Summary

• Thinking of data solely as adjunct to publication is too narrow a view

• Current practice may not extend easily
• Data is often living – treat it as such
• There’s more to the world than scholarly research
• Hidden data is wasted data