a centre of expertise in data curation and preservation
Funded by:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content that is the property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Curation: making data suitable for re-use
Chris Rusbridge
Presentation at FIBS Seminar
FIBS January 2007
Contents
• Science and digital curation
• What to do with your data: frontiers of practice
• Repository frontiers
Digital Curation Centre Mission
“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”
SDSS (Visual)
2MASS (Infrared)
Slide from Rajendra Bose
Slide from Rajendra Bose
New discovery…
• National Virtual Observatory
• Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
Curation
• Data increasingly important as evidence
• Key part of the scholarly record
• Experimental verifiability (the basis of science)
• Allows additional interpretations
• Unrepeatable observations & experiments (particularly environmental in broadest sense)
• Legal, compliance & transactions
• Cultural resources
What kinds of data?
• Observations
  • eg UARS (Upper Atmosphere) Level 0: telemetry
  • UARS Level 1: measured physical parameters (post calibration?)
• Derived data
  • UARS Level 2: calculated geophysical? profiles
  • UARS Level 3: gridded, interpolated?
• Combined data
• Crafted data
  • Eg annotated gene/protein databases
• Descriptive (meta)data
What to do with it?
• Keep as part of experiment
• Deposit in institutional or discipline repository
  • Possible time-limited embargoes
• Cite it
• “Publish” in support of articles
Internet Archaeology: publication with data (sadly, a preservation nightmare!)
What are the reusability issues?
• Data not neutral to hypothesis
• Hard to know the risks & pitfalls of a particular dataset
• Data not self-describing: hard to find appropriate data
• Hard to “understand” data once found
• Hard to use data once understood
What to do about it?
• Build curation/reusability into your workflow
  • Curation begins before creation
  • What’s easy at first becomes (impossibly) hard later
  • Describe your data (metadata)
  • Keep experimental parameters (technical, who, what, when, where etc)
  • Keep data descriptions (schemas, “representation information”, etc)
  • Keep data!
• Use standard/agreed formats for data
• Make ownership & restrictions clear
• Explain how to cite your data
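One way to act on the advice above is to write a small description file alongside each data file at the moment the data is created. The sketch below is illustrative only: the DatasetRecord class, its field names, and the example values are all assumptions made for this example and do not come from the DCC or from any metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DatasetRecord:
    """Hypothetical sidecar description stored next to a data file."""
    title: str
    creator: str        # who
    created: str        # when (ISO 8601 date)
    location: str       # where
    instrument: str     # technical parameters of the experiment
    file_format: str    # a standard/agreed format, e.g. "CSV"
    schema: dict = field(default_factory=dict)  # column name -> meaning and units
    owner: str = ""
    restrictions: str = ""
    how_to_cite: str = ""

    def missing_fields(self):
        """Return the names of fields left empty, so gaps are caught early."""
        return [name for name, value in asdict(self).items() if not value]


# All values below are invented for illustration.
record = DatasetRecord(
    title="Example tide-gauge readings",
    creator="A. Researcher",
    created="2007-01-15",
    location="Example Harbour",
    instrument="Pressure sensor, 1 Hz sampling",
    file_format="CSV",
    schema={"time": "UTC timestamp", "level_m": "water level in metres"},
    owner="Example University",
    restrictions="Embargoed until 2008-01-15",
)

sidecar = json.dumps(asdict(record), indent=2)
print(sidecar)
print("Still to fill in:", record.missing_fields())
```

Checking for empty fields at creation time reflects the point that what is easy at first becomes hard later: the citation guidance is flagged as missing while the person who can supply it is still around.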
Data resource stages
• Curated data is created…
  • Observations? Fixed!
• Or Acquired…
  • Data brought/bought from outside
  • Ingest
• Development
  • Derived, refined, combined, processed data
  • Potentially many stages
Context
• Data meaningless without context
• Linkage
• Metadata of many kinds
• Workflow!
• Provenance
• Authenticity
• Computational lineage
[Diagram, labels only. Data: Csat (8-day composite and subscene), E0, SST (8-day composite and subscene), PAR (subscene), HRPT. Calculations: Pbopt, Ctot, Zeu, PPeu. Parties: NASA, University research group 1, University research group 2, research group 3, local decision-making body.]
Slide from Rajendra Bose
Access and re-use
• Ethics and rights control access
  • Weak in expressing this long-term
• Collaboration tools
  • Annotation, discussion, review
  • Re-use leading to change and development
• “Publication”
  • Not just in “print”
  • Underlying data should be “published”, too
• Citation…
Citation needs…
• An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)
• Not important for original observations
  • Don’t mess with those data
• Less important for incremental datasets
  • Later stuff should not invalidate earlier
• Very important for revisable datasets
  • Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
  • Eg Mapping… OS maps represent a huge database that changes on a daily basis
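To make the idea of citing a past state of a revisable dataset concrete, here is a minimal sketch that archives each deposited state under a content hash and builds a citation string pinned to that state. It is not the scheme being developed by Buneman et al; the SnapshotArchive class, its methods, and the example records are hypothetical, and a real archive would need persistent storage and a far more efficient way of holding successive versions.

```python
import hashlib
import json


class SnapshotArchive:
    """Hypothetical in-memory archive of past states of a revisable dataset."""

    def __init__(self, name):
        self.name = name
        self._states = {}    # content digest -> archived state
        self._history = []   # (date, digest) in order of deposit

    def deposit(self, rows, date):
        """Archive the current state and return its digest."""
        canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        self._states[digest] = json.loads(canonical)  # store an independent copy
        self._history.append((date, digest))
        return digest

    def cite(self, digest):
        """Build a citation string that pins one archived state."""
        date = next(d for d, h in self._history if h == digest)
        return f"{self.name}, state of {date}, sha256:{digest[:12]}"

    def retrieve(self, digest):
        """Return the rows exactly as they were when that state was deposited."""
        return self._states[digest]


# Invented example: a curator revises an annotation between two deposits.
archive = SnapshotArchive("Example gene annotations")
v1 = archive.deposit([{"gene": "abc1", "function": "unknown"}], "2006-11-01")
v2 = archive.deposit([{"gene": "abc1", "function": "kinase"}], "2007-01-10")

print(archive.cite(v1))
print(archive.retrieve(v1))  # the earlier annotation is still reachable
```

A reader following the first citation gets the annotation as it stood when cited, even though the curated dataset has since changed.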
Who are the curation players?
Curation: Individual
• “Small science 2-3 times more data than Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
• May be inadequately protected
• Liable for policy-led deletion on resignation
• Individual “knows” too much
• Documentation/metadata unlikely to be adequate
• Future: gone!
Department: eCrystals
• Partnership with Institutional Repository
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• Future: likely to continue
Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• (Only 5 UK institutional repositories claim to hold data)
• Future: assured…
Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)
Discipline: Atmospheric Science
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Future: mostly dependent on grant funding (but strong commitment)
Discipline: Pharmacology
• International Scientific Union
• Attempting to build credit for data contributions
• Future: extremely limited funding
Discipline: Bio/Health
• UK PubMedCentral!
• (you heard about this earlier)
Issues: Nature article 23 June 05
• Databases in Peril
  • 51 out of 89 biological databases contacted reported they were struggling financially
  • 7 have closed
  • Several being updated in owner’s spare time
  • (Notes that not all deserve long term support)
• [Nucleic Acids Research reports 858 databases in 2006!]
• Major issue: money
Publisher: Crystallography
• Publisher and Scientific Union
• Created key domain crystallographic standard (CIF)
• Strong motivator for deposit of structure data
• Consistent quality checks
• DOIs used for structure data
• Future: publishing business model
Slide from IUCr
National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Future: strong future commitment
National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Future: likely to pass eventually to The National Archives
3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Future: commitment from Mellon + publishers + subscriptions, good funding mix
3rd Parties: Iron Mountain?
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Future: business case, viability, stock market…
Institutions & the network
• Institutions have fundamental sustainability
• Disciplines have domain knowledge advantage but sustainability is an issue
• Can we get the best of both?
Intersections…
Columns: Institution 1, Institution 2, Institution 3, etc
Discipline 1: X X
Discipline 2: X X
Discipline 3: X X
etc
Who are the curation players again?
BEWARE WEB 2.0!!!
Thank you