Digital Preservation
DAVID GIARETTA (APA)
FIRST PRELIDA WORKSHOP, TIRRENIA, JUNE 25TH-27TH, 2013


Outline

Fundamental demands
Fundamental concepts
Trust
OAIS and Linked Data


Fundamental demands


Preservation and value

Who pays?
Why?
What to preserve?
What value?


Examples

Books
Web
Science data
What are the differences?


Value

RIDING THE WAVE


Vision 2030

(2) Researchers and practitioners from any discipline are able to find, access and process the data they need. They can be confident in their ability to use and understand data and they can evaluate the degree to which the data can be trusted.

Create a robust, reliable, flexible, green, evolvable data framework with appropriate governance and long-term funding schemes to key services such as Persistent Identification and registries of metadata.

Propose a directive demanding that data descriptions and provenance are associated with public (and other) data.

Create a directive to set up a unified authentication and authorisation system.

Set Grand Challenges to aggregate domains.

Provide “forums” to define strategies at disciplinary and cross-disciplinary levels for metadata definition.

IMPACT IF ACHIEVED Dramatic progress in the efficiency of the scientific process, and rapid advances in our understanding of our complex world, enabling the best brains to thrive wherever they are.


Vision 2030

(3) Producers of data benefit from opening it to broad access and prefer to deposit their data with confidence in reliable repositories. A framework of repositories work to international standards, to ensure they are trustworthy.

Propose reliable metrics to assess the quality and impact of datasets. All agencies should recognise high quality data publication in career advancement.

Create instruments so long-term (rolling) EU and national funding is available for the maintenance and curation of significant datasets.

Help create and support international audit and certification processes.

Link funding of repositories at EU and national level to their evaluation.

Create the discipline of data scientist, to ensure curation and quality in all aspects of the system.

IMPACT IF ACHIEVED Data-rich society with information that can be used for new and unexpected purposes. Trustworthy information is useable now and for future generations.


Vision 2030

(4) Public funding rises, because funding bodies have confidence that their investments in research are paying back extra dividends to society, through increased use and re-use of publicly generated data.

EU and national agencies mandate that data management plans be created.

IMPACT IF ACHIEVED Funders have a strategic view of the value of data produced.


Vision 2030

(6) The public has access and can make creative use of the huge amount of data available; it can also contribute to the data store and enrich it. All can be adequately educated and prepared to benefit from this abundance of information.

Create non-specialist as well as specialist data access, visualisation, mining and research environments.

Create annotation services to collect views and derived results.

Create data recommender systems.

Embed data science in all training and academic qualifications.

Integrate into gaming and social networks.

IMPACT IF ACHIEVED Citizens get a better awareness of and confidence in sciences, and can play an active role in evidence based decision making and can question statements made in the media.


Vision 2030

(7) Policy makers can make decisions based on solid evidence, and can monitor the impacts of these decisions. Government becomes more trustworthy.

Policy makers are able to make decisions based on solid evidence, and can monitor the impacts of these decisions. Government becomes more trustworthy.

IMPACT IF ACHIEVED Policy decisions are evidence-based to bridge the gap between society and decision-making, and increase public confidence in political decisions.


Fundamental concepts

OAIS


Digital Preservation…

Easy to do… as long as you can provide money forever
Easy to test claims about repositories… as long as you live a long time


Preservation techniques

For each technique look for evidence – what evidence?
Must at least make sure we consider different types of data
◦ rendered vs non-rendered
◦ composite vs simple
◦ dynamic vs static
◦ active vs passive
Must look at all types of threats


Threats

Things change…
◦ Hardware
◦ Software
◦ Environment
◦ Tacit knowledge

Things become unfamiliar


Problems when preserving data

Preserve?
Preserve what?
For how long?
How to test?
Which people?
Which organisations?
How well?
• Metadata? – What kind? How much?


“A fundamental characteristic of our age is the rising tide of data – global, diverse, valuable and complex. In the realm of science, this is both an opportunity and a challenge.”

Report of the High-Level Group on Scientific Data, October 2010
“Riding the Wave: how Europe can gain from the rising tide of scientific data”

Rising tide of data…

Requirements

Who pays?
Why?


Rising tide of data…


Opportunities


Data contains numbers etc – need meaning


...to be combined and processed to get this

[Figure: data at Level 0, Level 1 and Level 2, linked by “Processing” and “Processing/combining” steps]


Preserving digitally encoded information

Ensure that digitally encoded information is understandable and usable over the long term
◦ Long term could start at just a few years
◦ Chain of preservation

Need to do something because things become “unfamiliar” over time
But the same techniques enable use of data which is “unfamiliar” right now


[Figure: OAIS functional model. Functional entities: Preservation Planning, Data Management, Archival Storage, Access, Ingest, Administration. External actors: PRODUCER, CONSUMER, MANAGEMENT. Flows: SIP, AIP, DIP, Descriptive Information, queries, query responses, orders.]

Lots of useful terminology


Key OAIS Concepts

Claiming “This is being preserved” is untestable
◦ Essentially meaningless
◦ Except “BIT PRESERVATION”

How can we make it testable?
◦ Claim to be able to continue to “do something” with it
◦ Understand/use
◦ Need Representation Information

Still meaningless…
◦ Things are too interrelated
◦ Representation Information potentially unlimited
◦ Need to define a Designated Community – those we guarantee can understand – so we can test


OAIS Information model: Representation Information

The Information Model is key

Recursion ends at KNOWLEDGEBASE of the DESIGNATED COMMUNITY

(this knowledge will change over time and region)

Does not demand that ALL Representation Information be collected at once.

A process which can be tested
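The recursion described on this slide can be sketched in code. This is a minimal illustration only, not part of OAIS: the function name `missing_rep_info`, the example network and the example Knowledge Bases are all assumptions invented for the sketch. It walks a Representation Information network and stops wherever the Designated Community's Knowledge Base already covers an item, returning what the repository would still need to collect.

```python
from typing import Dict, List, Set


def missing_rep_info(
    item: str,
    network: Dict[str, List[str]],
    knowledge_base: Set[str],
) -> Set[str]:
    """Return the RepInfo needed to understand `item` that the Designated
    Community does not already know (hypothetical helper, not an OAIS API)."""
    needed: Set[str] = set()
    seen: Set[str] = set()
    stack = list(network.get(item, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if dep in knowledge_base:
            # The recursion ends here: the community already understands this.
            continue
        needed.add(dep)
        stack.extend(network.get(dep, []))
    return needed


# Illustrative network only; these edges are assumptions, not a real archive.
network = {
    "data file": ["file format description", "data dictionary"],
    "file format description": ["PDF"],
    "data dictionary": ["dictionary specification"],
    "dictionary specification": ["XML"],
}

today = {"PDF", "XML"}
later = {"XML"}  # the community's knowledge changes over time

print(sorted(missing_rep_info("data file", network, today)))
print(sorted(missing_rep_info("data file", network, later)))
```

Re-running the same check against a changed Knowledge Base is what makes the process testable: a new gap (here, the PDF description) shows up as soon as the community no longer knows it.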


[Figure: Representation Network for GOCE Level 1 (N1 File Format). Nodes: GOCE Level 0, GOCE Level 0 Processor, Algorithm, GOCE N1 file description, GOCE N1 file Dictionary, GOCE N1 file standard, Dictionary specification, XML, PDF standard, PDF software.]


[Figure: OAIS Archival Information Package (AIP). Boxes: Archival Information Package, Package Description, Packaging Information, Content Information, Preservation Description Information, Data Object (Physical Object, Digital Object, Bit), Representation Information (Structure Information, Semantic Information, Other Representation Information), Reference Information, Provenance Information, Context Information, Fixity Information, Access Rights Information. Relation labels: “derived from”, “described by”, “delimited by”, “identifies”, “further described by”, “interpreted using”, “adds meaning to”.]


[Figure: Representation Information and Provenance, connected by “has” relations]


When things change

We need to:
◦ Know something has changed
◦ Identify the implications of that change
◦ Decide on the best course of action for preservation
◦ What RepInfo we need to fill the gaps
◦ Created by someone else or creating a new one
◦ If transformed: how to maintain data authenticity
◦ Alternatively: hand it over to another repository
◦ Make sure data continues to be usable


Transformation

Change the format e.g.
◦ Word → PDF/A
  ◦ PDF/A does not support macros
◦ GIF → JPEG2000
  ◦ Resolution/colour depth…
◦ Excel table → FITS file
  ◦ NB FITS does not support formulae
◦ Old EO or proprietary format → HDF
◦ Certainly need to change STRUCTURE RepInfo
◦ May need to change SEMANTIC RepInfo

Transformational Information Properties


Hand-over

Preservation requires funding
Funding for a dataset (or a repository) may stop
Need to be ready to hand over everything needed for preservation
◦ OAIS (ISO 14721) defines “Archival Information Package” (AIP).
◦ Issues:
  ◦ Storage naming conventions
  ◦ Representation Information
  ◦ Provenance
  ◦ Identifiers
  ◦ ….


Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
Requirement for solution: Ability to create and maintain adequate Representation Information
Tool or service: RepInfo toolkit, Packager and Registry – to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes
Tool or service: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object
Tool or service: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment
Tool or service: Packaging toolkit to package access rights policy into AIP

Threat: Loss of ability to identify the location of data
Requirement for solution: An ID resolver which is really persistent
Tool or service: Persistent Identifier system: such a system will allow objects to be located over time.

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation
Tool or service: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.

Threat: The ones we trust to look after the digital holdings may let us down
Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term
Tool or service: Certification toolkit to help repository manager capture evidence for ISO 16363 Audit and Certification


Infrastructure support

SCIDIP-ES
◦ Converting CASPAR prototypes into robust services


Trust


Vision 2030

(2) Researchers and practitioners from any discipline are able to find, access and process the data they need. They can be confident in their ability to use and understand data and they can evaluate the degree to which the data can be trusted.

Create a robust, reliable, flexible, green, evolvable data framework with appropriate governance and long-term funding schemes to key services such as Persistent Identification and registries of metadata.

Propose a directive demanding that data descriptions and provenance are associated with public (and other) data.

Create a directive to set up a unified authentication and authorisation system.

Set Grand Challenges to aggregate domains.

Provide “forums” to define strategies at disciplinary and cross-disciplinary levels for metadata definition.

IMPACT IF ACHIEVED Dramatic progress in the efficiency of the scientific process, and rapid advances in our understanding of our complex world, enabling the best brains to thrive wherever they are.


Trust

Issue: authenticity of data
Vision 2030: Scientists can establish the authenticity of the data they use
Short:
● Standardised system for provenance and related evidence in repositories.
● Standardised way to capture reputation of data producers and holders
Medium:
● Adoption of machine readable provenance in major repositories
● Capture of reputation of producers and holders (see Social networking)
Long:
● 80% of repositories of scientific data have adequate machine readable evidence
● Automated ways to evaluate evidence of authenticity

Issue: validity of data
Vision 2030: Users and systems will be able to evaluate the reputation of the data they use.
Short: Annotation system for datasets, with efforts to formalise annotation language
Medium: Ranking system to allow systems to produce rankings of levels of trust (akin to Page rank but based on reputation rather than links)
Long: Systems can choose datasets which are most trustworthy and can evaluate the risks involved in using less trusted data.

Issue: certification of repositories
Vision 2030: People can make a judgement about which repositories can be trusted
Short: International system of repository certification created
Medium: Certification demanded by EU and national funders
Long: 80% of major repositories of scientific data are certified

Issue: global trust issues
Vision 2030: Users can deal with the global datasets with the same confidence as European sources
Short: Discussions with US, China, etc
Medium: MOU with international agencies on common standards
Long: International agreement so that users have evidence of authenticity for world-wide scientific data

Issue: Complexity of the system
Vision 2030: People can trust that the ever more complex tangle of systems are doing the right thing
Short: Simplify interfaces and entanglement.
Medium: Move towards autonomic, self-configuring, self-healing, self-optimising and self-protecting systems, with appropriate monitoring.
Long: Systems have survived many generations of changes in technologies and architectures.


Reality check

What could jeopardise the vision: Lack of long term investment in critical components such as persistent identification
Counter by:
◦ Identify new funding mechanisms
◦ Identify new sources of funding
◦ Identify risks and benefits associated with digitally encoded information

What could jeopardise the vision: Lack of preparation
Counter by:
◦ Ensure the required research is done in advance

What could jeopardise the vision: Lack of willingness to co-operate across disciplines/funders/nations
Counter by:
◦ Apply subsidiarity principle so we do not step on researchers’ toes
◦ Take advantage of growing need of integration: within and across disciplines

What could jeopardise the vision: Lack of published data
Counter by:
◦ Provide ways for data producers to benefit from publishing their data

What could jeopardise the vision: Lack of trust
Counter by:
◦ Need ways of managing reputations
◦ Need ways of auditing and certifying repositories
◦ Need quality, impact, and trust metrics for datasets

What could jeopardise the vision: Not enough data experts
Counter by:
◦ Need to train data scientists and to make researchers aware of the importance of sharing their data

What could jeopardise the vision: The infrastructure is not used
Counter by:
◦ Work closely with real users and build according to their requirements
◦ Make data use interesting – for example integrating into games
◦ Use “data recommender” systems i.e. “you may also be interested in...”

What could jeopardise the vision: Too complex to work
Counter by:
◦ Do not aim for a single top down system
◦ Ensure effective governance and maintenance system (c.f. IETF)

What could jeopardise the vision: Lack of coherent data description allowing re-use of data
Counter by:
◦ Provide “forums” to define strategies at disciplinary and cross-disciplinary levels for metadata definition

From Riding the Wave


Trust issues

Has it been preserved properly?
Is it of high quality?
Has it been changed in some way?
Does the pointer get me to the right object?


Has it been preserved properly?

Can the repository be trusted?
Certification of various kinds
ISO 16363 certification should be available soon
Judged on the basis of evidence collected and examined


Is it of good quality?

More than one in ten scientists and doctors claim to have witnessed colleagues deliberately fabricating data in order to get their research published, a new poll has revealed.

The survey of almost 2,800 experts in Britain also found six per cent knew of possible research misconduct at their own institution that has not been properly investigated.

The poll for the hugely-respected British Medical Journal (BMJ)

http://www.dailymail.co.uk/sciencetech/article-2085814/Scientists-falsify-data-research-published-whistleblowers-bullied-keeping-quiet-claim-colleagues.html


Dirk Smeesters had spent several years of his career as a social psychologist at Erasmus University in Rotterdam studying how consumers behaved in different situations. Did colour have an effect on what they bought? How did death-related stories in the media affect how people picked products? And was it better to use supermodels in cosmetics adverts than average-looking women?

The questions are certainly intriguing, but unfortunately for anyone wanting truthful answers, some of Smeesters' work turned out to be fraudulent. The psychologist, who admitted "massaging" the data in some of his papers, resigned from his position in June after being investigated by his university, which had been tipped off by Uri Simonsohn from the University of Pennsylvania in Philadelphia. Simonsohn carried out an independent analysis of the data and was suspicious of how perfect many of Smeesters' results seemed when, statistically speaking, there should have been more variation in his measurements.

Dutch psychologist Diederik Stapel. He was found to have fabricated data for years and published it in at least 30 peer-reviewed papers, including a report in the journal Science about how untidy environments may encourage discrimination.

http://www.guardian.co.uk/science/2012/sep/13/scientific-research-fraud-bad-practice


Peer review of data… is difficult


Lessons from APARSEN

Data Quality

Cost Models for preservation

Preservation tools

Preservation services


Has it been changed in some way?

OAIS defines Authenticity as: The degree to which a person (or system) regards an object as what it is purported to be. Authenticity is judged on the basis of evidence.
Need to capture evidence – what evidence?


Authenticity evidence

Authenticity Model

Provenance capture
◦ How to deal with combinations of data
◦ How to deal with changes

Security and tampering with logs
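One common way to make tampering with a provenance log detectable is to chain the entries with cryptographic hashes. The slides do not prescribe this technique; the sketch below is an assumption offered for illustration, with hypothetical function names (`append_entry`, `verify_log`) and made-up events. Each entry stores the hash of the previous one, so altering an earlier entry breaks every later link.

```python
import hashlib
import json
from typing import Dict, List


def _digest(event: str, prev_hash: str) -> str:
    """Hash an event together with the hash of the preceding entry."""
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, str]], event: str) -> None:
    """Append an event, linking it to the last entry (hypothetical helper)."""
    prev_hash = log[-1]["hash"] if log else ""
    log.append(
        {"event": event, "prev": prev_hash, "hash": _digest(event, prev_hash)}
    )


def verify_log(log: List[Dict[str, str]]) -> bool:
    """Recompute every link; return False if any entry was altered."""
    prev_hash = ""
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


# Made-up events, for illustration only.
log: List[Dict[str, str]] = []
append_entry(log, "dataset ingested")
append_entry(log, "format transformed")
print(verify_log(log))

log[0]["event"] = "something else"  # simulate tampering with an old entry
print(verify_log(log))
```

A chain like this only shows that the log is internally consistent. Someone able to rewrite the whole log could recompute every hash, so the latest hash would need to be held somewhere independent for the evidence to carry weight.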


OAIS and Linked Data


Linked Open Data: Issues

Links – just another dataset?
Or do we have to view them as part of a huge “cloud”? Is that cloud just another dataset?
Is it just like archiving snapshots of the Web?
Snapshots? But at different times across the cloud

HTTP URIs – how persistent?
HTTP – how persistent?
RDF – how persistent?
What do the links mean?


OAIS-related issues

Designated community
Representation Information
Provenance
Rights
Authenticity
Trustability
Is it easier to “poison” the system?


OAIS / Linked Data questions

Can OAIS concepts be applied to the preservation of Linked Data?
Do existing concepts apply?
Are new concepts needed?
What new terminology is needed?


END

QUESTIONS?