Digital | Curation | Centre
Digital Curation Centre www.dcc.ac.uk
Peter Burnhill, Michael Day, David Giaretta, Liz Lyon, Robin Rice, Bridget Robinson and Seamus Ross
Funded by:
Session Overview
1. Introduction & Briefing
2. Towards a Technical Model of Digital Curation: our R&D
3. Planning Delivery of Services & the Associates Network
1. Introduction & Briefing
• Background story on the DCC – ‘So who’s that new kid on the block?’
• What is digital curation anyway? – ‘adding value’ & ‘ensuring longevity’
• Aims & objectives for the DCC – ‘improving the quality of what is done’
• Our planning & our progress – timelines & deliverables
• How does this relate to the JISC Programme?
Background to the DCC (1)
• Two parallel policy concerns
1. Neglect of digital heritage, especially given investment in digitisation programmes
• JISC Continuing Access and Digital Preservation Strategy, 2002-2005
– eLib Programme, eLib3, Circular 5/97: Digital Preservation
• Digital Preservation Coalition formed in 2002
2. Differing data sharing practices in eScience, especially given huge data volumes
• Links between eScience Programme and JISC
• Report commissioned by JISC Cttee for Support of Research (Lord & Macdonald, May 2003)
– twin drivers: Digital Preservation & Continuing Access (e-Science)
– identified need for national digital curation centre
Interpretation of JISC policy
JISC plays 3 roles
1. promotes, supports & develops management & preservation of institutional and community digital materials for UK benefit
2. partner to Research Councils/AHRB & other national/international bodies
3. as organization, appropriate grant conditions for JISC-funded creation of digital resources; good practice for JISC created/managed materials
• “escalating scale and complexity of digital resources to be curated and the subsequent urgency of developing a critical mass of expertise, shared services and tools, for long-term digital preservation … require a step change in investment and approaches.”
– “Over the next three years a greater emphasis on development of production services and tools … needed to build on previous research studies and projects.”
• “Digital preservation remains a challenging area in which techniques, costs, and skills are still in development: advocacy, dissemination and training, to embed preservation needs as appropriate in JISC funding programmes.”
Interpreting the implementation plan
• Risk assessment studies, eg ePrints
– Calls to implement studies’ recommendations for services and integration of preservation activity & standards into repositories funded by JISC.
• Series of community calls to support records management and digital preservation in institutions - cf FOI compliance.
• Establish Digital Curation Centre to:
– Provide central focus of skilled staff & research
• links to wider network of development activity, researchers, & services
– Develop set of central services, standards, and tools
• for a range of distributed digital data centres & preservation services,
• across the Information Environment & Research Grid.
• JISC Partnership funding,
– eg Web-archiving study: jointly funded by JCIE and Wellcome Trust
• Digital Preservation Coalition as an independent entity with JISC membership and sector activity supported by JISC.
• National preservation of e-journals, through RLN/RSLG
Back to the DCC Background (2)
• JISC Circular 6/03, initially issued June 2003
– Call postponed, revised & re-issued with more significant research component
– Joint funding: JISC and e-Science Core Programme
– £750K pa (outreach, services & development); £250K pa (research)
– Unlikely that any single organisation could do what’s expected
– Expressions of Interest & Full Proposals from Consortia
– Final selection made in December 2003
– Negotiations & clarification in January 2004
Designation of DCC
• Task entrusted to Consortium of four institutional partners
– Universities of Edinburgh (lead), Glasgow & Bath together with CCLRC (Rutherford Appleton and Daresbury Laboratories)
– brought together through the National eScience Centre
• jointly managed by Universities of Edinburgh & Glasgow
• Two 3-year awards made:
– JISC funding started on 1st March 2004
– EPSRC grant funding starts on 1st September 2004
• Phase One set-up
– some ‘early deliverables’ of website & helpdesk
– preparation for full operation & launch of services in October
– planning formal opening for early November 2004
Responsibilities across the DCC
• Them with titles …
– Peter Burnhill, Director (Phase One) with Robin Rice, Phase One Project Co-ordinator
• (from EDINA & Data Library, University of Edinburgh)
– Peter Buneman, Research Director (& PI on EPSRC grant)
• Professor of Informatics, University of Edinburgh
– Liz Lyon, Associate Director (Community Support & Outreach)
• Director of UKOLN, University of Bath
– Seamus Ross, Associate Director (Service Definition & Delivery)
• Director of HATII [ERPANET], University of Glasgow
– David Giaretta, Associate Director (Development)
• Head of Astronomical Software & Services, CCLRC
• Two significant & well known ‘Ex Portfolio’ names
– Malcolm Atkinson, Director, NeSC
– Chris Rusbridge, Director, Information Services, UofGlasgow
functional management & collaboration
[Diagram. Labels: Industry; research collaborators; standards bodies; testbeds & tools; communities of practice: users; community support & outreach; research; development co-ordination; service definition & delivery; management & admin support; curation organisations eg DPC; Collaborative Associates Network of Data Organisations.]
What is this digital curation anyway?
The term Digital Curation is a new invention.
• Digital Data Curation Task Force - Report of Strategy Discussion Day (2002)
– citing Tony Hey citing use by Dr John Taylor, Director General of the Research Councils, to distinguish the actions involved in caring for digital data beyond its original use, from digital preservation. The concept’s reach extends beyond libraries.
• The e-Science Curation Report (2003) proposed the following distinctions:
– Curation: managing & promoting the use of data from point of creation, to ensure fit-for-contemporary-purpose, available for discovery & re-use.
• For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose.
• Higher levels of curation will involve maintaining links with annotation & with other published materials.
– Archiving: curation activity which ensures that data are properly selected and stored, can be accessed, and that their logical and physical integrity is maintained over time, including security and authenticity.
– Preservation: activity within archiving in which specific items of data are maintained over time so that they can still be accessed and understood through changes in technology.
Digital curation redefined ...
digital curation: ... digital objects and data, over their life-cycle, for current & future generations of use ...
= f(data curation & digital preservation)
• data curation [when high current/ongoing interest]
– actions needed to maintain and utilise digital data & research results over entire life-cycle
– data creation & management; adding value; generating new sources of information & knowledge, for use
• digital preservation [for longevity; fall off in interest]
– long-run technological/legal accessibility & usability
– storage, maintenance & accessibility of information content in digital material over the long-term, for use
• OAIS concept of designated community
Data curation in action
• Astronomy
– Integrating and analysing distributed data (AstroGrid)
– publishing multi-TB sky surveys (SuperCOSMOS & WFCAM)
– interoperability standards (IVO Alliance)
• BioInformatics
– data publishing: generic tools for XML export (EBI Biomart)
– annotation tools for massive data sets (Pubmed, VOTable)
– archiving tools for dynamic data sets (biological DBs)
• Environmental sciences
– spatio-temporal annotation (OS Mastermap/ Mouse Atlas)
• Document management
– Repository certification (RLG Task Force)
Digital preservation approaches
• Migration & Refreshment
• Emulation & Encapsulation
• Digital Archaeology & Rescue
• Document Format Specification
Robin Rice & Najla Semple, http://www.lib.ed.ac.uk/sites/digpres/
Communities of Practice: Social Sciences (IASSIST)
• History of sharing – economical in terms of both data collector and respondent
• Data about humans – problems of confidentiality confronted early on
• Mixed blessing of agreed proprietary formats (OSIRIS, SPSS, etc.) allows migration
• ‘Future-proofing’ - 30 years of data advocacy!
– Tradition of data archiving & data citation
– Building new data standards out of common experience
• data archivists, & data librarians: the new digital curators?
• www.iassistdata.org
Unifying Themes for DCC
• ‘data as evidence’
– for one or more designated communities
• ‘archival responsibility’
– at one or more institutional levels
– with institutional policies & individuals’ competence
• engage/discover communities of practice, to invoke/provoke good practices
– appraisal & retention/disposal
– logical & physical integrity: authenticity/security
• research problems in productive research domains
– eg Informatics, Law School
Aims & Objectives for the DCC
‘quality improvement in data curation & digital preservation’
– Initial focus: data as evidence for scholarly conclusions
– Wider remit: worlds of scholarly communication & eLearning
• twin aims: excellence in research & excellence in service
• need to bridge across communities:
– universities & research institutes
– scientific data tradition & document tradition
– multi-sectoral, international
We are all curators now ...
• The term “curation” builds on our understanding of the word “curator”
– who keeps something for the public good, value of which often needs to be brought out by the curator.
1. this open context implies more support for explicit policies with regard to data sharing, and it has major implications for structuring and tools.
2. the digital curator as ‘store-keeper’ closely linked to promoting new science, looking forward to identify new ways to serve present and future researchers.
• digital curator should take an active role in promoting and adding value to holdings
– manage the value of collection
– adding links and annotation to provide context
– recording provenance of changes made
Planning & Progress
• We must plan for the long term, with our 2020 Vision - 15yrs
– we have large territory, and large expectation
• multi-disciplinary, multi data type, multi tradition/profession
• national and international, but also local and hidden from view
• a lot is going on
– how to ensure that we do something sensible with the ££’s and the trust we have been given?
– who/what should we plan to affect/effect?
• policy-makers; ‘responsible curators’; (researchers?)
• how do we wish to be judged, and when?
• collaboration & win-win-win scenarios
foci of attention in set-up phase
• Users: client, peer and policy communities
– outreach & community support; service definition/delivery; development co-ordination; research agenda
– user requirements analysis: Leona Carpenter (Focus Groups)
• Consortium: ‘organisation’ from partner participation
– roles; commitment; norming/performing; operational communication; consortium agreement (IPR)
• Employers: institutional settings
– re-deployment/appointments; accommodation; commitment/reporting
-> Project Plan, as living document
Phase One Progress, March -
• weekly AccessGrid/telecon; two face2face meetings
– defining programme of deliverables; re-deploying & recruiting staff; planning appointment of full time director in time for Launch
• early ‘deliverables’:
– www.dcc.ac.uk with links, presentations & progress updates
– [email protected] for contacts & offers of collaboration
• project plan submitted to JISC, late May 2004
• defining R & D programme & services for delivery
– eg curation architecture; repository of tools & technical information
• engaging curators in existing community of practice
Towards a Technical Model of Digital Curation: our R&D
David Giaretta
What can we rely on in the Long Term
• The bits - BIT PRESERVATION
• Paper documents that people can read
– ISO standards
• The information we collect – either in the far future DCC or its successor
• Some kind of remote access
• Some kind of computers
• People?
Preservation “vs” Current Use
• There are already very many architectures to support immediate use of information
– Including JISC architecture
– Aim to support these
• Therefore chose to be guided by
– long-term preservation aspects
– to promote this we should emphasise “interoperability” and “automated use” as far as possible.
– based initially on OAIS Reference Model – but add other ideas later
– bear e-Science in mind
OAIS Reference Model – Functional Model
[Figure: OAIS functional model. Functional entities: Ingest; Data Management; Archival Storage; Access; Administration; Preservation Planning. External entities: PRODUCER; CONSUMER; MANAGEMENT. Flows: SIP; AIP; DIP; Descriptive Info; queries; result sets; orders.]
OAIS – Preservation Planning - key aspects
• Representation Net
• Designated Communities & Knowledge Base
Representation Net
Preservation Issues
Given a file or a stream of bits, how does one know what Representation Information is needed (this question applies to Representation Information itself as well as to the digital objects we are primarily interested in preserving and using); how does one know, for example, if this thing is in FITS format?
• Someone may simply “know” what it is and how to deal with it, i.e. the bits are within the Knowledge Base
• One may be able to recognise the format by looking for various types of patterns.
• One may feed the bits into all available interpreters to see which accept the data as valid
• Other means….
• The only safe way: have an associated label which points to the appropriate Representation Information
– Note this does not exclude the other methods e.g. for data rescue
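The two identification routes above (an attached label, with pattern recognition as a fallback for data rescue) can be sketched roughly as follows. This is an illustrative sketch only, not a DCC tool: the function name `identify`, the `SIGNATURES` table and the label string in the usage note are hypothetical. The leading byte patterns for FITS, PDF and PNG come from those formats' published specifications.

```python
from typing import Optional, Tuple

# Leading byte patterns for a few well-known formats (illustrative subset).
SIGNATURES = [
    ("FITS", b"SIMPLE  ="),          # first header card of a FITS file
    ("PDF", b"%PDF-"),               # PDF header
    ("PNG", b"\x89PNG\r\n\x1a\n"),   # PNG signature
]


def identify(data: bytes, label: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Return (identification, method) for a stream of bits.

    An attached label is trusted first; pattern matching is only a
    fallback, as one might use for data rescue.
    """
    if label is not None:
        return label, "label"
    for name, magic in SIGNATURES:
        if data.startswith(magic):
            return name, "pattern"
    return None, "unknown"
```

For example, `identify(b"%PDF-1.4\n", label="ri:pdf-1.4")` returns the label unchanged, while unlabelled bytes beginning `SIMPLE  =` are guessed to be FITS.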
High Level View
Example of use of Representation Information Labelling
Implications
• A label must be attached to each digital object as a necessary (but not sufficient) condition for long-term preservation – logical attachment or packaging TBD by the DCC.
• The label should at least identify Representation Information. For long-term preservation this label must therefore be a DCC persistent identifier.
– allow some normalisation
• In order for the Representation Information to be persistent, it should either be held with the data object itself or be part of a central repository – part of the DCC. Thus the DCC needs a DCC Representation Information Repository. This repository would include
– a Format Repository (covering structural information) *automated use would be supported by use of formal description languages such as EAST (ISO 15889, http://east.cnes.fr/ ) or DFDL (http://forge.gridforum.org/projects/dfdl-wg/)
– a Semantic Repository with, for example, Data Dictionaries and Ontologies
– Software Repository – with appropriate emulation capabilities
• Each piece of digital RI is also a digital object – which is understood either by the users’ Knowledge Base OR by further Representation Information. Therefore each piece of RI also has a label pointing to further RI.
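The recursion in the last bullet (each piece of RI carries a label pointing to further RI, until something in the user's Knowledge Base is reached) might be sketched as below. Everything here is hypothetical: the `ri:` label strings, the in-memory `REPOSITORY` dictionary standing in for a Representation Information Repository, and the function name `resolve_chain`.

```python
from typing import Dict, List, Optional, Set

# Hypothetical repository: each RI label maps to the label of the further RI
# needed to understand it, or None if no further RI is recorded.
REPOSITORY: Dict[str, Optional[str]] = {
    "ri:fits-standard": "ri:pdf-spec",   # the FITS standard is held as a PDF
    "ri:pdf-spec": "ri:english",         # the PDF specification is written in English
    "ri:english": None,
}


def resolve_chain(label: str, knowledge_base: Set[str],
                  repository: Dict[str, Optional[str]] = REPOSITORY) -> List[str]:
    """Follow labels until reaching RI the user already understands.

    Returns the RI labels that must be fetched, in order.
    """
    chain: List[str] = []
    seen: Set[str] = set()
    current: Optional[str] = label
    while current is not None and current not in knowledge_base:
        if current in seen:
            raise ValueError(f"cycle in Representation Net at {current}")
        if current not in repository:
            raise KeyError(f"no Representation Information held for {current}")
        seen.add(current)
        chain.append(current)
        current = repository[current]
    if current is None:
        # The chain ran out before reaching the Knowledge Base.
        raise LookupError("Representation Net does not reach the Knowledge Base")
    return chain
```

A user whose Knowledge Base contains `ri:english` would need to fetch `ri:fits-standard` and `ri:pdf-spec`; a user who already understands FITS needs nothing further.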
Designated Community
• Techniques must be created for
– defining a Knowledge Base
– linking a Knowledge Base to a Designated Community
– linking Representation Information to a Knowledge Base if possible
Representation Information (1)
• Structure – including Formats
– Distinguish
• formats which are used mainly for rendering – to be followed by human inspection, and
• formats used for automated processing
• Implications:
– Representation Information Repository should define selected file formats using EAST and DFDL
– Definitions should include scientific objects and humanities objects
Representation Information (2)
• Semantics
– Hard problem
• start with Data Dictionaries
– Implications: the Representation Information Repository should include Data Dictionaries, followed by more general semantics
Representation Information (3) Time Dependent Information
– Many, perhaps most, datasets change over time and the state at each particular moment in time may be important. It may be useful to break the issue into separate parts.
• at each moment in time we could, in principle, take a snapshot and store it. That snapshot has its associated Representation Net.
• efficient storage of a series of snapshots may lead one to store differences or include time tags in the data (see for example P. Buneman, S. Khanna, and Wang-Chiew Tan. On the Propagation of Deletions and Annotations through Views. Proc. 21st ACM Sym. on Principles of Database Systems.).
– Additional Representation Information would be needed which describes how to get to a particular time's snapshot from the efficiently encoded version.
– Also applies to ANNOTATION – who said what and when did they say it
– Implications:
• These are areas of active research within the consortium and the DCC should be able to provide
– advice and well tested tools for certain forms of efficient encoding of time dependent information
– advice on annotation
– identifiers and Representation, perhaps in the form of software, for the associated encodings
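As a rough illustration of storing differences rather than full snapshots, the sketch below keeps one base snapshot plus a time-ordered list of deltas. It is a minimal sketch under stated assumptions, not the encoding from the cited paper: records are flat key/value dictionaries, timestamps are integers, and the names `VersionedDataset`, `commit` and `snapshot_at` are hypothetical. The `snapshot_at` method plays the role of the additional Representation Information that says how to recover a given time's snapshot from the compact encoding.

```python
from typing import Any, Dict, List, Tuple

Snapshot = Dict[str, Any]
_DELETED = object()  # marker recorded in a delta when a key is removed


def diff(old: Snapshot, new: Snapshot) -> Dict[str, Any]:
    """Return only the keys that changed, were added, or were deleted."""
    delta = {k: v for k, v in new.items() if k not in old or old[k] != v}
    delta.update({k: _DELETED for k in old if k not in new})
    return delta


class VersionedDataset:
    def __init__(self, initial: Snapshot, timestamp: int) -> None:
        self._base = dict(initial)
        self._base_time = timestamp
        self._latest = dict(initial)
        self._deltas: List[Tuple[int, Dict[str, Any]]] = []

    def commit(self, new: Snapshot, timestamp: int) -> None:
        """Record the differences between the latest state and `new`."""
        last = self._deltas[-1][0] if self._deltas else self._base_time
        if timestamp <= last:
            raise ValueError("timestamps must strictly increase")
        self._deltas.append((timestamp, diff(self._latest, new)))
        self._latest = dict(new)

    def snapshot_at(self, timestamp: int) -> Snapshot:
        """Rebuild the state as of `timestamp` by replaying deltas."""
        if timestamp < self._base_time:
            raise ValueError("no data recorded before the base snapshot")
        state = dict(self._base)
        for t, delta in self._deltas:
            if t > timestamp:
                break
            for key, value in delta.items():
                if value is _DELETED:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state
```

Replaying every delta from the base is the simplest choice; a real archive might store periodic full snapshots to bound the replay cost.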
Representation Information (4)
• Actions and Processes (Behaviour?)
– Some information has, as an integral part of its content, an implicit or explicit process associated with it – this could be argued to be a type of semantics, however it is probably sufficiently different to need special classification. An example of this is a database or other time dependent or reactive system such as a Neural Net.
– Emulations – Universal Virtual Computer (UVC)
– Implications:
• Support Software emulation via a UVC (possibly based on JVM)
• Support time dependent or reactive systems
Persistent Ids
• Implications:
– Use of existing, or creation of new, infrastructure (standards, protocols, servers etc) for persistent IDs with adequate flexibility and longevity
• as part of the succession planning, agreement would be needed with appropriate organisation to act as backup and inheritor of DCC data.
Archival Information Package
Preservation Description Info
AIP implications – PDI
• define standard Preservation Metadata – based initially on OCLC work – including Michael Day’s work and also CCLRC work etc
• define adequate Packaging technique – almost certainly XML based
• define recommended tools and procedures for creating Fixity Information such as checksums and digests, together with associated Representation Information
• investigate authentication systems
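Fixity Information of the kind mentioned above could be produced with a standard digest, as in this sketch. The record layout and the names `fixity_record` and `verify` are assumptions made for illustration; only Python's standard `hashlib` module is used. Storing the algorithm name next to the digest is one simple way of carrying the associated Representation Information a later verifier needs.

```python
import hashlib
from typing import Dict


def fixity_record(data: bytes, algorithm: str = "sha256") -> Dict[str, str]:
    """Compute a digest and record which algorithm produced it."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {"algorithm": algorithm, "digest": digest}


def verify(data: bytes, record: Dict[str, str]) -> bool:
    """Recompute the digest with the recorded algorithm and compare."""
    return hashlib.new(record["algorithm"], data).hexdigest() == record["digest"]
```

A record made at ingest can be re-checked at any later audit: `verify(data, record)` returns `False` if even one bit of `data` has changed.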
Audit and Certification
• Implications:
– facilitate production of standard(s) on which a certification program can be based
– work to establish accreditation and certification body in preparation for offering audit and certification services
– audit, certification and accreditation are potential sources of long term funding for the DCC
– software certification will require testbeds and testing procedures.
• Hardware and software systems will need to be purchased, hired or borrowed. The DCC associates would be useful partners.
• We might expect Commercial software to be offered to us by the manufacturer for testing
• Testing commercial software could be fee based.
Implications for Research
• Research needed on Representation Information (Structure and Semantics) e.g.
– Investigate fundamental limitations of bit-level descriptions and existing tools.
– Contribute to DFDL definition
– Investigate capabilities needed to describe rendered format (including Word, PDF etc)
• Data Virtualisation – define Science objects and “Humanities” objects
• Research is needed to:
– Support Software emulation via a UVC (possibly based on JVM)
– Support time dependent or reactive systems
• Research is needed to provide a solid basis on which we can develop persistent IDs with adequate flexibility and longevity
• Research is needed to allow the DCC to:
– define standard Preservation Metadata – based initially on OCLC work
– define adequate Packaging technique – almost certainly XML based
– investigate authentication systems with a view to preparing recommendations for users and consider offering, for example, a (fee-based) key storage service.
• A rigorous theoretical basis must be put in place from which we can create techniques for:
– defining a Knowledge Base
– linking a Knowledge Base to a Designated Community
– linking Representation Information to a Knowledge Base if possible
Curation Manual
• Put in place quickly using international experts
• Updates annually
• Build to “curation encyclopaedia”
Document format specification
• Borrowed from the records management tradition: institutions create documents in standard or open formats, which are easier to preserve.
• Much easier to do in a strict records management environment with a published policy of retention schedules and a clear knowledge of internally produced records.
• Stipulating a specific file format is harder in a research environment where a wide range of digital materials are produced and have to be preserved.
• The move to DDI DTD in social science data world may be seen as an example of this preservation technique.
Services & Development
• Turns Research into ‘Products for Research’ that our communities can use with confidence
– tracking and testing tools and standards
• that are correct, usable, reliable, well documented
• e.g. for ingest, repository management, data exchange, ontologies
• working with tool developers wherever possible
• developing testbeds & interworking with other testbeds
– aim to gain leverage formats
• working with other projects worldwide
• using generic tools and techniques
– to develop strategies for emerging digital formats
– Metadata standards
• long-term viability of metadata
• Registries underpin, to provide basis of Advisory Service
[Figure: Level 1 curation (© Philip Lord, 2003). Labels: Scientist; Research Process; Primary data; Secondary (derived) data; Tertiary data for publication; Web Content; Patent data; Publication Process; Peer Review; Pre-prints & e-Prints; Primary publication; Secondary publication; Tertiary publication; Publication archives; Library - Peers - Public - Industry.]
[Figure: Level 2 curation (© Philip Lord, 2003). Labels: Scientist; Research Process; Primary data; Secondary (derived) data; Tertiary data for publication; Web Content; Patent data; Research based on data; Metadata; Archivist; Archived data; Publication Process; Peer Review; e-Prints; Primary publication; Secondary publication; Tertiary publication; Publication archives; Library - Peers - Public - Industry.]
[Figure: Level 3 curation (© Philip Lord, 2003). Labels: Scientist; Research Process; Primary data; Secondary (derived) data; Tertiary data for publication; Web Content; Patent data; Research based on data; Metadata; Curation; Curator; Curation Process; Data repositories; Archived data; Publication Process; Peer Review; e-Prints; Primary publication; Secondary publication; Tertiary publication; Publication archives; Library - Peers - Public - Industry.]
Faith in the medium
?
Faith in the technology