Requirements for Long-Term Preservation, David Giaretta, 1st October 2009, Helsinki


Page 1

Requirements for Long-Term Preservation

David Giaretta, 1st October 2009, Helsinki

Page 2

Digital Preservation…

Easy to do… …as long as you can provide money forever
Easy to test claims about repositories… …as long as you live a long time

Page 3

Digital Preservation

activities

Infrastructure

Information about users and practices

ISO standard: OAIS

ISO standard: OAIS update

ISO standards: Audit and Certification

Tools

Relationship to related work and community practices

Page 4

Alliance for Permanent Access

• The Alliance aims to develop a shared vision and framework for a sustainable organisational infrastructure for permanent access to scientific information

Members:
The British Library
European Organization for Nuclear Research [CERN]
CSC — IT Center for Science
Delegation of the Finnish Academies of Science and Letters
Deutsche Nationalbibliothek
Digital Preservation Coalition
European Science Foundation [ESF]
European Space Agency [ESA]
Helmholtz-Gemeinschaft Deutscher Forschungszentren
International Association of Scientific, Technical & Medical Publishers
Joint Information Systems Committee [JISC]
Koninklijke Bibliotheek
Max-Planck-Gesellschaft
NESTOR Kompetenznetzwerk
Nationale Coalitie Digitale Duurzaamheid [NCDD]
Portico
Science & Technology Facilities Council [STFC]

http://www.alliancepermanentaccess.org/

Page 5


http://www.alliancepermanentaccess.org/
PARSE.Insight

Page 6

Preservation is a Social activity

Sometimes our activities are personal: “preserve for your future self” [Australia]

In the short term for re-use by colleagues and other people

In the long term for re-use by future generations

Neeri 2009, 1-2 Oct 2009, Helsinki

Page 7

Definitions (OAIS)

Long Term Preservation: The act of maintaining information, Independently Understandable by a Designated Community, and with evidence supporting its Authenticity, over the Long Term.

Long Term: A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.


Not just BIT preservation

Not just rendering

Information not just DATA or Documents

Authenticity

Page 8

Things change/disappear

• Software
• Hardware
• Environment
– E.g. network links to related information
• People
– What is “common knowledge”

How can we ensure that the information trapped in the “bits” remains understandable despite all these changes?

Page 9

Just Format?

sfqsftfoubujpo jogpsnbujpo svmft
representation information rules

You have a file

JHOVE tells you it is WORD version 7

Format – necessary but not sufficient:

formats can be used for multiple purposes e.g. audio files used to store configuration parameters
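The scrambled string on this slide makes the point concrete: every character is intact, yet the text is unreadable without the rule that maps it back. A minimal sketch, assuming the rule is a shift of each letter by one position in the alphabet (which is what the slide's pair of strings suggests); the function name `shift_letters` is hypothetical:

```python
def shift_letters(text: str, offset: int) -> str:
    """Shift each ASCII letter by `offset` positions, wrapping within the alphabet."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        else:
            out.append(ch)  # spaces and punctuation are left untouched
    return "".join(out)


encoded = "sfqsftfoubujpo jogpsnbujpo svmft"

# The "representation information" here is the rule itself:
# shift every letter back by one position.
decoded = shift_letters(encoded, -1)
print(decoded)
```

Knowing the file format (plain text) is not enough to read the message; the decoding rule has to be preserved as well.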

Page 10

XML enough?

<family>
  <father>John</father>
  <mother>Mary</mother>
  <son>Paul</son>
</family>

<VOTABLE version="1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1" xmlns="http://www.ivoa.net/xml/VOTable/v1.1"><RESOURCE><TABLE name="6dfgs_E7_subset" nrows="875"><PARAM arraysize="*" datatype="char" name="Original Source"

value="http://www-wfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz"><DESCRIPTION>URL of data file used to create this table.</DESCRIPTION></PARAM><PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT

demo usage."/><FIELD arraysize="15" datatype="char" name="TARGET"><DESCRIPTION>Target name</DESCRIPTION></FIELD><FIELD arraysize="11" datatype="char" name="DEC" unit="DMS"><DATA><FITS><STREAM encoding='base64'>U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBmb3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAgICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAvIE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
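The VOTable fragment shows why a self-describing XML wrapper is not sufficient: the actual data sit in an opaque base64 stream. A small sketch, using only the first twelve characters of the STREAM content quoted above, shows that decoding the stream merely reveals another format that needs its own Representation Information (the variable names are illustrative):

```python
import base64

# First characters of the base64 STREAM content quoted on the slide.
stream_prefix = "U0lNUExFICA9"

payload = base64.b64decode(stream_prefix)
print(payload)

# FITS headers begin with the keyword SIMPLE. Seeing it tells us that a further
# specification (the FITS standard) is needed to interpret the remaining bytes.
looks_like_fits = payload.startswith(b"SIMPLE")
print(looks_like_fits)
```

The XML schema describes the container, but understanding the content still depends on the FITS standard and on the meaning of the individual columns.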

Page 11

Data…

Level 2 GOME Satellite instrument data

Page 12

Complex container objects


Page 13

Key OAIS Concepts

• Claiming “This is being preserved” is untestable
– Essentially meaningless
– Except “BIT PRESERVATION”
• How can we make it testable?
– Claim to be able to continue to “do something” with it
– Understand/use
– Need Representation Information
• Still meaningless…
– Things are too interrelated
– Representation Information potentially unlimited
– Designated Community
• Many other concepts identified
– Finer grained taxonomy than simply saying “Metadata”
– Allows one to ask if one has all the required types

Available from: http://public.ccsds.org/publications/archive/650x0b1.pdf

Page 14

Representation Information

The Information Model is key

Recursion ends at KNOWLEDGEBASE of the DESIGNATED COMMUNITY

(this knowledge will change over time and region)
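The recursion can be sketched as a walk over a network of Representation Information that stops wherever the Designated Community's knowledge base already covers an item. This is only an illustration: the dependency table, the two example communities and the function name `rep_info_to_collect` are assumptions, not part of OAIS.

```python
from typing import Dict, List, Set

# Hypothetical Representation Information network: each item lists the
# further items needed to interpret it. Names are illustrative only.
REP_INFO_NEEDS: Dict[str, List[str]] = {
    "FITS file": ["FITS standard"],
    "FITS standard": ["PDF standard", "English language"],
    "PDF standard": ["English language"],
    "English language": [],
}


def rep_info_to_collect(item: str, knowledge_base: Set[str]) -> List[str]:
    """Return the RepInfo items that must be kept alongside `item`.

    The recursion stops at anything the Designated Community already knows.
    """
    collected: List[str] = []
    visited: Set[str] = set()

    def visit(name: str) -> None:
        for needed in REP_INFO_NEEDS.get(name, []):
            if needed in knowledge_base or needed in visited:
                continue
            visited.add(needed)
            collected.append(needed)
            visit(needed)  # recurse into what this item itself depends on

    visit(item)
    return collected


astronomers = {"FITS standard", "English language"}
general_public = {"English language"}

print(rep_info_to_collect("FITS file", astronomers))
print(rep_info_to_collect("FITS file", general_public))
```

The same object needs more Representation Information for a community that knows less, and the answer has to be recomputed when the knowledge base changes.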

Page 15

OAIS Archival Information Package (AIP)

[Figure: OAIS Archival Information Package (AIP) diagram. Labels: Archival Package; Content; Package; Packaging; Data Object; Physical Object; Digital Object; Bit; Structure; Reference; Other; Provenance; Context; Fixity; Access Rights. Relationships: further described by; derived from; described by; delimited by; interpreted using; adds meaning to. Multiplicities: 1, 1...*]

Page 16

Representation Information Network


Page 17

Preservation and Re-use: Unfamiliar information

• Preservation
– Digitally encoded information which must be usable and understandable
– Unfamiliar because of separation in time
• E-Science/GRID/CyberInfrastructure for data
– Digitally encoded information which must be usable and understandable
– Unfamiliar because of separation in discipline or location, even if created yesterday
• Support automated usage where possible

Page 18

[Figure: diagram labels: Rep Info; /DISCIPLINE; Virtualisation]

Page 19

Insight: stakeholders

Research
• Research institutes (non-profit)
• Universities
• Academic libraries

Data management (preservation)
• Data centres (profit / non-profit)
• Libraries
• Archives

Funding/policy
• National funding organisations
• European funding
• Corporate funding

Publishing
• General (cross-community) publishers
• Specific (community) publishers

Page 20

Surveys to stakeholders

Research: Elsevier mailing list (35,000 people), ESF, MCFA, Eurodoc, ALLEA, YEAR, Digital Humanities Observatory, etc.

Data management (preservation): LIBER, DPE, DPC, NCDD, DCC, D-lib Magazine, PADI, JISC mailing lists, CASPAR, Planets, etc.

Funding/policy: ESF, Alliance for Permanent Access, national funding agencies

Publishing: International Association of STM publishers, Directory of Open Access Journals (DOAJ)

Page 21

Surveys to stakeholders

Research

1397 responses

Data management (preservation)

273 responses

Funding/policy

< responses

Publishing

186 responses

Page 22

Threats to preservation

1. Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved.

2. Lack of sustainable hardware, software or support of computer environment may make the information inaccessible.

3. Evidence may be lost because the origin and authenticity of the data may be uncertain.

4. Access and use restrictions (e.g. Digital Rights Management) may not be respected in the future.

5. Loss of ability to identify the location of data.

6. The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future.

7. The ones we trust to look after the digital holdings may let us down.

Page 23

Threats to preservation (R)

The ones we trust to look after the digital holdings may let us down

The current custodian of the data may cease to exist

Loss of ability to identify the location of data

Access and use restrictions may not be respected in the future

Evidence may be lost

Lack of sustainable hardware/software

Users may be unable to understand or use the data

Page 24

Threats and the requirement for a solution to each

Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
Requirement for solution: Ability to create and maintain adequate Representation Information

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment

Threat: Loss of ability to identify the location of data
Requirement for solution: An ID resolver which is really persistent

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation

Threat: The ones we trust to look after the digital holdings may let us down
Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term

Page 25

FUTURE

• Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved

• Non-maintainability of essential hardware, software or support environment may make the information inaccessible

• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity

• Access and use restrictions may not be respected in the future

• Loss of ability to identify the location of data

• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future

• The ones we trust to look after the digital holdings may let us down

Page 26

Roadmap

PARSE.Insight produced draft Preservation Infrastructure Roadmap

Now a SCIENCE DATA INFRASTRUCTURE ROADMAP after consultation with EU

Page 27

Infrastructures for preservation

Social / Legal / Financial / Organisational

Agreements / Trust / Standards

Costs / Benefits / Rewards

Technical components

Page 28

Lessons from other Infrastructures

Need to “grow”, “encourage”, “foster” rather than “build”

include organisational, financial, legal & marketing

Provide services rather than specific technologies

Tackle “choke points”

Various phases of development

Page 29

Encouraging Organisational and Social change

Policies: mandates for depositing research data and funding agencies' requirements.

Robust and reliable deposit places, where researchers can be sure their data will not get lost, be corrupted or misused, with correct access rights mechanisms.

Elements that increase comfort levels so that new users will know how to use and interpret the available data.

Communication and awareness around these issues.

Have publication of data as valued and as referenceable as publication of a paper in a journal.

Page 30

Repository Audit and Certification

• Standard for certification in OAIS Roadmap
• Initial work produced TRAC
• Now an official CCSDS Working Group
• Open virtual meetings, notes and documents: http://www.digitalrepositoryauditandcertification.org
• Draft standard submitted to CCSDS/ISO to form the basis of an international audit and certification process

Page 31


CASPAR Consortium

http://www.casparpreserves.eu

EU FP6 Integrated Project

Total spend approx. 16MEuro (8.8 MEuro from EU)

Started April 2006, for 42 months

Page 32

http://developers.casparpreserves.eu:8080

Page 33

Preservation Data Flows and Strategies

More strategies than just “emulate or transform”

Page 34

Creating an OAIS Archival Information Package

Page 35

Modules and Dependencies: defining the Designated Community

[Figure: dependency graph of modules. Nodes: README.txt; TEXT EDITOR; ENGLISH LANGUAGE; WINDOWS XP; FITS FILE; FITS STANDARD; PDF STANDARD; FITS JAVA s/w; JAVA VM; PDF s/w; FITS DICTIONARY; DICTIONARY SPECIFICATION; UNICODE SPECIFICATION; XML SPECIFICATION; MULTIMEDIA PERFORMANCE DATA; C3D; DirectX; MAX/MSP; 3D motion data files; 3D scene data files; motion to music mapping strategy]

Page 36

Modules and Dependencies: Examples (Semantic Web data)

[Figure: RDF/S namespaces ns1, ns2, ns3 and ns4 shown as modules and dependencies]

Page 37

Scenario: Intelligibility-aware Packaging

[Figure: dependency graph. Modules: FITS; FITS STANDARD; PDF STANDARD; FITS DICTIONARY; DICTIONARY SPECIFICATION; UNICODE SPECIFICATION; XML SPECIFICATION; C3D; DirectX; MAX/MSP; ZIP. Objects: o1, o2, o3. Profiles: P1, P2, P3]

• Gap(o2, P1) = ∅
• Gap(o2, P2) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION}
• Gap(o2, P3) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION, PDF_STANDARD, XML_SPECIFICATION, UNICODE_SPECIFICATION}
• Gap(o3, P3) = {ZIP}
• Gap(o3, ∅) = {ZIP, C3D, DirectX, MAX/MSP}
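The Gap(o, P) values on this slide can be read as a set difference: everything an object transitively depends on, minus everything a community profile already covers. A minimal sketch of that reading follows. The dependency edges and the contents of the profiles are assumptions chosen so that the results agree with the slide, since the arrows of the original figure are not recoverable; `closure` and `gap` are hypothetical names.

```python
from typing import Dict, Iterable, Set

# Assumed dependency edges (module -> modules it depends on).
DEPENDS_ON: Dict[str, Set[str]] = {
    "o2": {"FITS"},
    "o3": {"ZIP", "C3D", "DirectX", "MAX/MSP"},
    "FITS": {"FITS_STANDARD", "FITS_DICTIONARY"},
    "FITS_STANDARD": {"PDF_STANDARD"},
    "FITS_DICTIONARY": {"DICTIONARY_SPECIFICATION"},
    "DICTIONARY_SPECIFICATION": {"XML_SPECIFICATION"},
    "XML_SPECIFICATION": {"UNICODE_SPECIFICATION"},
}


def closure(modules: Iterable[str]) -> Set[str]:
    """Return the given modules plus everything they transitively depend on."""
    seen: Set[str] = set()
    stack = list(modules)
    while stack:
        module = stack.pop()
        if module in seen:
            continue
        seen.add(module)
        stack.extend(DEPENDS_ON.get(module, ()))
    return seen


def gap(obj: str, profile: Set[str]) -> Set[str]:
    """Modules required by `obj` that the profile does not already cover."""
    required = closure(DEPENDS_ON.get(obj, ()))
    return required - closure(profile)


# Assumed profiles: what each community is taken to understand already.
P1 = {"FITS"}
P2 = {"PDF_STANDARD", "XML_SPECIFICATION"}
P3 = {"C3D", "DirectX", "MAX/MSP"}

print(sorted(gap("o2", P2)))
print(sorted(gap("o3", P3)))
```

An intelligibility-aware packager would then add exactly the modules in the gap to the package delivered to that community.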

Page 38

Example: Identification of an Attribution Right

[Figure: rights diagram combining the Rights Ontology, CIDOC-CRM and FRBRoo. Labelled instances: E39. Actor (Kia Ng); CR20. Perform (Singleton); CR51. Attribution_Right (Singleton); LF1. Written_Norm (Art. X of Law Y); E7. Activity (Kia claiming authorship); CR. Activity_Type (To claim authorship). Other nodes: Activity of Improvisation on the Violin; Expression of the Improvisation on the Violin; Kia’s right to claim authorship; E72. Legal Object; F22. Self_contained_Expression; E7. Activity; F28. Expression_Creation; E30. Right; CR. Ownership Right. Properties: has_type; generates; is_documented_in; became_owner_of; is_on; created; carried_out; allows; performed_by; has_right_type. Groupings: Work’s Provenance; Legislation; Rights; Derived Property. Annotations: “100% recall, <100% precision” and “100% precision”]

Thanks to MetaWare

Page 39

Provenance: Performing Arts

Thanks to ULeeds and CNRS

Page 40

Authenticity


Page 41

Threats and requirements for solutions

Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
Requirement for solution: Ability to create and maintain adequate Representation Information

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment

Threat: Loss of ability to identify the location of data
Requirement for solution: An ID resolver which is really persistent

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation

Threat: The ones we trust to look after the digital holdings may let us down
Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term

Page 42

Threats and the CASPAR component addressing each

Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
CASPAR component: RepInfo toolkit, Packager and Registry, to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
CASPAR component: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
CASPAR component: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
CASPAR component: Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.

Threat: Loss of ability to identify the location of data
CASPAR component: Persistent Identifier system: such a system will allow objects to be located over time.

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
CASPAR component: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.

Threat: The ones we trust to look after the digital holdings may let us down
CASPAR component: The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.

Page 43

Conclusions

• Preservation
– Is a complex process
– Involves more than just bits and formats
– “Metadata” is too vague a term
• Transparency is vital
– What is being preserved
– For whom
– For how long
• OAIS is a good basis for preservation
• Recursion is an important concept in preservation
• Preservation threats must be countered by specific tools and shared infrastructure components

Page 44

Additional links

CASPAR: www.casparpreserves.eu
PARSE.Insight: www.parse-insight.eu
Alliance for Permanent Access: www.alliancepermanentaccess.eu
Digital Curation Centre: www.dcc.ac.uk
Audit and certification: wiki.digitalrepositoryauditandcertification.org
OAIS:
http://public.ccsds.org/publications/archive/650x0b1.pdf
http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Overview.aspx

Page 45

END

Page 46
Page 47

Summary

What is digital preservation?
Transparency
What is needed for digital preservation?

• Many strategies
– Need to be clear about the scope of each

• Document/rendered object?

• Scientific data – processed/combined to produce new results?

• Other?

– How are all of the threats being addressed?

• What exactly is being preserved?

• For whom is it being preserved?
– Designated Community must be specified

– Testability through understandability/usability

• How will it be handed on to future custodians

Page 48

Umbrella framework

Need to integrate in some sense many different

Systems Disciplines Funding Requirements

Projects producing preservation artefacts

Representation Information, Significant Properties, Provenance etc.

Page 49

About researchers

EU 44%, USA 33%, Other 23%

Per category

Page 50

Data spectrum (R)

Page 51

Cross-disciplinary use of research data

Page 52

Sharing of data (R)

Did you ever need digital research data gathered by other researchers that was not available?

Page 53

Sharing of data (R)

Do you presently make use of research data gathered by other researchers?

Page 54

Sharing of data (R)

Would you like to make use of research data gathered by other researchers?

Within discipline Outside discipline

Page 55

Sharing of data (R)

How open is your data?

Page 56

Sharing of data (R)

Which constraints do you see in making data open?

Page 57

Sharing of data (R)

How do you locate and access digital research data?

Page 58

Linking of data (R)

As a researcher, do you think it is useful to link underlying research data to formal literature?

Page 59

Linking of data (P)

Do you link references in your journals to underlying digital research data?

Page 60

Linking of data (P)

Do you as publisher charge separate fees when users want to access data associated with publications?

Page 61

Linking of data (P)

Can authors submit their underlying digital research data with their publication to the publisher?

Page 62

About funding

Who should pay for data preservation?
Researchers say: Government (national funding)
Data managers say: Government (national funding)
Publishers say: Government (national funding)

Who should pay for preservation of publications?
Researchers say: Government (national funding)
Data managers say: Government (national funding)
Publishers say: Government (national funding)

Page 63

Who should pay? (P)

For preservation of other research output