Curation: making data suitable for re-use

Chris Rusbridge

Presentation at FIBS Seminar

a centre of expertise in data curation and preservation

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.




FIBS January 2007

Contents
• Science and digital curation
• What to do with your data: frontiers of practice
• Repository frontiers


Digital Curation Centre Mission
“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”

SDSS (Visual)

TWOMASS (Infrared)

Slide from Rajendra Bose

Slide from Rajendra Bose


New discovery…
• National Virtual Observatory
• Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”


Curation
• Data increasingly important as evidence
• Key part of the scholarly record
• Experimental verifiability (the basis of science)
• Allows additional interpretations
• Unrepeatable observations & experiments (particularly environmental in broadest sense)
• Legal, compliance & transactions
• Cultural resources


What kinds of data?
• Observations
  • e.g. UARS (Upper Atmosphere) Level 0: telemetry
  • UARS Level 1: measured physical parameters (post calibration?)
• Derived data
  • UARS Level 2: calculated geophysical? profiles
  • UARS Level 3: gridded, interpolated?
• Combined data
• Crafted data
  • e.g. annotated gene/protein databases
• Descriptive (meta)data


What to do with it?
• Keep as part of experiment
• Deposit in institutional or discipline repository
  • Possible time-limited embargoes
• Cite it
• “Publish” in support of articles


Internet Archaeology: publication with data (sadly, a preservation nightmare!)


What are the reusability issues?
• Data not neutral to hypothesis
• Hard to know the risks & pitfalls of a particular dataset
• Data not self-describing: hard to find appropriate data
• Hard to “understand” data once found
• Hard to use data once understood


What to do about it?
• Build curation/reusability into your workflow
  • Curation begins before creation
  • What’s easy at first becomes (impossibly) hard later
  • Describe your data (metadata)
  • Keep experimental parameters (technical, who, what, when, where etc)
  • Keep data descriptions (schemas, “representation information”, etc)
  • Keep data!
• Use standard/agreed formats for data
• Make ownership & restrictions clear
• Explain how to cite your data
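One way to act on this checklist is to write a small metadata "sidecar" next to each data file at the moment it is created. The following is a minimal Python sketch under stated assumptions, not a DCC-recommended schema: the helper names (`describe_dataset`, `write_sidecar`), the `.meta.json` suffix and all field names are hypothetical, chosen only to mirror the bullets above (parameters, data description, ownership, citation).

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def describe_dataset(data_path, creator, parameters, schema, licence, citation):
    """Build a metadata record for one data file (all field names are hypothetical)."""
    data = Path(data_path).read_bytes()
    return {
        "file": Path(data_path).name,
        # A checksum lets a later user confirm the data were not altered.
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "described_at": datetime.now(timezone.utc).isoformat(),
        "creator": creator,        # who
        "parameters": parameters,  # experimental parameters: what, when, where
        "schema": schema,          # data description, e.g. column names and units
        "licence": licence,        # ownership and restrictions
        "citation": citation,      # how to cite the data
    }


def write_sidecar(data_path, record):
    """Write the record as JSON next to the data file and return the sidecar path."""
    sidecar = Path(str(data_path) + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```

Writing the record when the file is created, rather than at deposit time, is the point of "curation begins before creation": the parameters are still to hand and cost almost nothing to capture.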


Data resource stages
• Curated data is created…
  • Observations? Fixed!
• Or acquired…
  • Data brought/bought from outside
  • Ingest
• Development
  • Derived, refined, combined, processed data
  • Potentially many stages


Context
• Data meaningless without context
  • Linkage
  • Metadata of many kinds
  • Workflow!
• Provenance
• Authenticity
• Computational lineage
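Computational lineage can be recorded by having every derived product keep a reference to the products and the process it came from. The following is a minimal Python sketch of that idea; the `Product` class and the `lineage` and `sources` helpers are hypothetical names introduced for illustration, and real provenance systems record far more (software versions, parameters, responsible parties).

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """One data product plus the minimum context needed to trace its origin."""
    name: str
    process: str = "observation"                # how the product was made
    inputs: list = field(default_factory=list)  # Products it was derived from


def lineage(product, depth=0):
    """Return indented lines describing how `product` was derived."""
    lines = ["  " * depth + f"{product.name} <- {product.process}"]
    for parent in product.inputs:
        lines.extend(lineage(parent, depth + 1))
    return lines


def sources(product):
    """Return the names of the original observations behind `product`."""
    if not product.inputs:
        return {product.name}
    found = set()
    for parent in product.inputs:
        found |= sources(parent)
    return found
```

For example, with `raw = Product("raw_telemetry")` and `cal = Product("calibrated", "calibration", [raw])`, calling `sources(cal)` walks back to the original observation, which is the question a re-user typically needs answered before trusting a derived dataset.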


[Figure: workflow diagram. Labels: Csat 8-day composite and subscene; Csat; E0; SST; 8-day composite and subscene; Pbopt calc; Ctot calc; Zeu calc; PPeu calc; PAR subscene; HRPT; NASA; University research group 1; research group 3; local decision-making body; University research group 2]

Slide from Rajendra Bose


Access and re-use
• Ethics and rights control access
  • Weak in expressing this long-term
• Collaboration tools
  • Annotation, discussion, review
  • Re-use leading to change and development
• “Publication”
  • Not just in “print”
  • Underlying data should be “published”, too
• Citation…


Citation needs…
• An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)
• Not important for original observations
  • Don’t mess with those data
• Less important for incremental datasets
  • Later stuff should not invalidate earlier
• Very important for revisable datasets
  • e.g. Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
  • e.g. Mapping… OS maps represent a huge database that changes on a daily basis
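To make "reference and access archived past states" concrete, the following Python sketch keeps every committed state of a revisable dataset and addresses each one by a hash of its content. This is a deliberately naive illustration, not the technique under development by Buneman et al: it stores full copies rather than archiving efficiently, and the `VersionedDataset` class and its method names are hypothetical.

```python
import hashlib
import json


class VersionedDataset:
    """Keeps every past state of a revisable dataset, addressed by content hash."""

    def __init__(self, name):
        self.name = name
        self._states = {}    # version id -> serialized state
        self._history = []   # version ids in commit order

    def commit(self, records):
        """Store a new state and return its version id."""
        payload = json.dumps(records, sort_keys=True)
        version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        self._states.setdefault(version, payload)
        self._history.append(version)
        return version

    def cite(self, version):
        """Return a citation string that pins one archived state."""
        if version not in self._states:
            raise KeyError(version)
        return f"{self.name}, version {version}"

    def resolve(self, version):
        """Return the records exactly as they were at the cited version."""
        return json.loads(self._states[version])
```

Because the version id is derived from the content, a citation keeps resolving to the same records even after curators revise the live dataset, which is the property a revisable dataset needs and an untouched set of original observations gets for free.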


Who are the curation players?


Curation: Individual
• “Small science 2-3 times more data than Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
  • May be inadequately protected
  • Liable for policy-led deletion on resignation
• Individual “knows” too much
  • Documentation/metadata unlikely to be adequate
• Future: gone!


Department: eCrystals
• Partnership with Institutional Repository
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• Future: likely to continue


Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• (Only 5 UK institutional repositories claim to hold data)
• Future: assured…


Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)


Discipline: Atmospheric Science
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Future: mostly dependent on grant funding (but strong commitment)


Discipline: Pharmacology
• International Scientific Union
• Attempting to build credit for data contributions
• Future: extremely limited funding


Discipline: Bio/Health
• UK PubMedCentral!
  • (you heard about this earlier)


Issues: Nature article 23 June 05
• Databases in Peril
  • 51 out of 89 biological databases contacted reported they were struggling financially
  • 7 have closed
  • Several being updated in owner’s spare time
  • (Notes that not all deserve long term support)
• [Nucleic Acids Research reports 858 databases in 2006!]
• Major issue: money


Publisher: Crystallography
• Publisher and Scientific Union
• Created key domain crystallographic standard (CIF)
• Strong motivator for deposit of structure data
• Consistent quality checks
• DOIs used for structure data
• Future: publishing business model

Slide from IUCr


National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Future: strong future commitment


National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Future: likely to pass eventually to The National Archives


3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Future: commitment from Mellon + publishers + subscriptions, good funding mix


3rd parties: Iron Mountain?
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Future: business case, viability, stock market…


Institutions & the network
• Institutions have fundamental sustainability
• Disciplines have domain knowledge advantage but sustainability is an issue
• Can we get the best of both?


Intersections…
Columns: Institution 1, Institution 2, Institution 3, etc
Discipline 1: X X
Discipline 2: X X
Discipline 3: X X
etc


Who are the curation players again?


BEWARE WEB 2.0!!!


Thank you