National Digital Repository ® Preserving the imperfect: reflections from NDAD and elsewhere Kevin Ashley Head of Digital Archives Group ULCC


National Digital Repository®

Preserving the imperfect: reflections from NDAD and

elsewhere

Kevin Ashley

Head of Digital Archives Group

ULCC

2007-03-23 Presdb07 - Edinburgh 2


Overview

• Issues that arise when databases are records
• Informing (expensive, important) decisions
• Tensions between ideal formats and non-ideal data
• Representation mechanisms for access control and absent data
• Concentrating on R&D issues


What is NDAD?

• A service for UK government records which exist as ‘structured information’

• Contains data + contextual information
• Established in 1997 - service in March 1998
• First service by a national archive to provide online public access to preserved material
• Selection undertaken by National Archives and government departments
• Everything else at ULCC: under contract to TNA


Preservation

• Data transformed to canonical form - originals kept

• Paper documentation digitised
• Technical metadata produced or transformed
• Consistency checks applied:
    - For transformation process
    - Against original system
    - Against published information
    - Internal cross-checks
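A consistency check on the transformation process could look something like the sketch below. It assumes, purely for illustration, that the canonical form is CSV text; the slides do not say what NDAD's canonical form is, and the function names are hypothetical. The idea is to read the transformed data back and confirm it matches what went in, and to record a fingerprint so later copies can be compared.

```python
import csv
import hashlib
import io


def to_canonical(rows):
    """Write rows of strings as CSV text (the assumed canonical form)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def from_canonical(text):
    """Read the canonical CSV text back into a list of rows."""
    return [row for row in csv.reader(io.StringIO(text))]


def round_trip_ok(rows):
    """True when transforming and reading back reproduces the input rows."""
    return from_canonical(to_canonical(rows)) == [list(r) for r in rows]


def fingerprint(text):
    """SHA-256 of the canonical text, for later fixity comparisons."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


rows = [["10246", "Ashley, K", "179"], ["10579", 'Mickey "M" Mouse', "188"]]
canonical = to_canonical(rows)
```

Checks against the original system or against published information need external evidence (record counts, published totals) and are not covered by this sketch.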


Consequences

• Preservation far removed from creation
• Unlike actively curated systems: preservation and use can take place simultaneously
• Multiple use scenarios - more than views


Where are the problems?

Management


Perfect Preservation Formats?

• DDI: XML-based
    - Good for survey/social science data
    - Not so good for complex relational stuff
    - Likes clean data
• XML representations
    - More flexible
    - Not so good when data is unclean
• As SQL
    - Much metadata or needs another scheme
    - Useless for unclean data


How bad is bad?

• Data out of range is a quality problem, not a preservation problem (e.g. ‘Age’ of 230)

• But…
    - Age = -20?
    - Age = B0?
    - Age = Thursday?

• All present problems if ‘Age’ is a positive integer in our preservation schema

• Date = ‘31 Feb 2007’ is syntactically but not semantically valid
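The distinctions drawn on this slide can be made concrete with a small classifier. The sketch below is illustrative only: the function names and category labels are hypothetical, and the date layout ("31 Feb 2007") is assumed from the single example given. It separates values that fit a positive-integer schema (even if implausible) from values that do not, and separates a date that merely has the right shape from one that names a real day.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def classify_age(raw):
    """Classify a raw 'Age' value against a positive-integer schema."""
    text = raw.strip()
    if not re.fullmatch(r"[+-]?\d+", text):
        return "not an integer"
    if int(text) <= 0:
        return "integer, but not positive"
    # Implausible values such as 230 still fit the schema: that is a
    # data-quality question, not a preservation one.
    return "fits schema"


def classify_date(raw):
    """Classify a raw date written as 'D Mon YYYY'."""
    m = re.fullmatch(r"(\d{1,2}) ([A-Za-z]{3}) (\d{4})", raw.strip())
    if m is None or m.group(2) not in MONTHS:
        return "syntactically invalid"
    day, month, year = int(m.group(1)), MONTHS.index(m.group(2)) + 1, int(m.group(3))
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return "syntactically valid, semantically invalid"
    return "valid"
```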


More bad stuff

• Absent key fields or mandatory fields
• Encoded data that uses bad codes
    - If days of week are 1 - 7, what is day 9? Day X?
• ‘Encoded’ data which is stored translated
• 1 - 1 mappings that aren’t
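Two of these problems lend themselves to simple detection. The sketch below uses hypothetical function names and is not drawn from any NDAD tooling; it lists stored values that fall outside a documented code list, and reports where a supposedly one-to-one code/label mapping is not one-to-one.

```python
from collections import defaultdict


def unknown_codes(values, code_list):
    """Distinct stored values absent from the code list, in first-seen order."""
    seen = []
    for value in values:
        if value not in code_list and value not in seen:
            seen.append(value)
    return seen


def mapping_violations(pairs):
    """From (code, label) pairs, find codes with several labels and
    labels with several codes."""
    labels_by_code = defaultdict(set)
    codes_by_label = defaultdict(set)
    for code, label in pairs:
        labels_by_code[code].add(label)
        codes_by_label[label].add(code)
    many_labels = {c: sorted(ls) for c, ls in labels_by_code.items() if len(ls) > 1}
    many_codes = {l: sorted(cs) for l, cs in codes_by_label.items() if len(cs) > 1}
    return many_labels, many_codes


day_codes = {"1", "2", "3", "4", "5", "6", "7"}
stored = ["1", "9", "X", "7", "9"]
```

Here `unknown_codes(stored, day_codes)` reports `"9"` and `"X"`. The values are only reported, never altered, in keeping with the need to preserve errors.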


What’s the problem?

• Must preserve errors - their nature is informative
• Would like to understand original system behaviour with these errors
• Don’t want to use tools that force all fields to be text
• Want a datatype like ‘almost always integer’ or ‘often a date’ - and intelligent behaviour when it isn’t.
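One way to picture an ‘almost always integer’ datatype is a cell that always keeps the stored text and carries a parsed integer only when there is one. The sketch below is a hypothetical illustration (the `MostlyInt` name and its behaviour are assumptions, not an existing tool), and "integer" here means whatever Python's `int()` accepts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MostlyInt:
    raw: str                      # stored value, exactly as received
    value: Optional[int] = None   # parsed integer, or None if raw is not one

    @classmethod
    def parse(cls, raw):
        """Keep the raw text; attach an integer only when it parses."""
        try:
            return cls(raw, int(raw.strip()))
        except ValueError:
            return cls(raw, None)


def mean_of_valid(cells):
    """Average the cells that parsed; ignore, but do not discard, the rest."""
    numbers = [c.value for c in cells if c.value is not None]
    return sum(numbers) / len(numbers) if numbers else None


cells = [MostlyInt.parse(x) for x in ["34", "B0", "-20", "Thursday"]]
```

Numeric operations can then work on `value` while display and export use `raw`, so the erroneous entries survive unchanged.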


How does it get that way?

• Data validation often in application, not database
    - Isn’t always well-implemented
• People hack around the application
• Past migrations were poor


Missing and absent values

• Common occurrence in survey and experimental data

• Different types of ‘missing’:
    - No information
    - Known to be unreadable
    - Refused to answer
    - Subject didn’t know
• All mechanisms for representation ad-hoc
• Knowledge in application, not database
• Query engines don’t understand concept
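As an illustration of moving this knowledge out of the application and into the data model, the sketch below gives each kind of ‘missing’ an explicit type. The negative codes in `LEGACY_CODES` are invented for the example; real datasets use their own ad-hoc conventions, which is the problem the slide describes.

```python
from collections import Counter
from enum import Enum


class Missing(Enum):
    NO_INFORMATION = "no information"
    UNREADABLE = "known to be unreadable"
    REFUSED = "refused to answer"
    DID_NOT_KNOW = "subject didn't know"


# Hypothetical ad-hoc codes of the kind an application might use.
LEGACY_CODES = {
    "-9": Missing.NO_INFORMATION,
    "-8": Missing.UNREADABLE,
    "-7": Missing.REFUSED,
    "-6": Missing.DID_NOT_KNOW,
}


def decode(raw):
    """Map a stored string to a Missing reason or to an integer answer."""
    reason = LEGACY_CODES.get(raw.strip())
    return reason if reason is not None else int(raw)


def summarise(values):
    """Return (mean of present answers, count of each missing reason)."""
    present = [v for v in values if not isinstance(v, Missing)]
    reasons = Counter(v for v in values if isinstance(v, Missing))
    mean = sum(present) / len(present) if present else None
    return mean, reasons


answers = [decode(x) for x in ["41", "-7", "-9", "35", "-7"]]
```

A query that naively averaged the stored strings would include the codes as if they were answers; with typed reasons the average uses only real answers and the reasons can be counted separately.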


Access: restricted viewing

People
ID     Name    Fname   Office
10246  Ashley  Kevin   179
10579  Mouse   Mickey  188

Trips
ID     REG      Date       From        To
10246  X111ABC  1 Oct 98   Clapham     Deptford
10579  H179JKL  1 Apr 99   Land’s End  John O’Groats
56999  A217HGB  23 Dec 97  Poole       Sandy

Vehicles
REG      Make     Year  Colour
X111ABC  Yugo     1999  Grey
H179JKL  Trabant  1957  Brown

Not available until 2050
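The sketch below shows how a viewer might present the Trips table while one linked table is closed. The extracted slide does not make clear which part is "not available until 2050"; the code assumes, for illustration only, that it is the People table, and the names used (`PEOPLE_OPEN_FROM`, `view_trips`) are hypothetical. It also distinguishes a closed record from an absent one, since trip 56999 has no matching person.

```python
from datetime import date

# Assumption: the People table is the part closed until 2050.
PEOPLE_OPEN_FROM = date(2050, 1, 1)

PEOPLE = {
    10246: {"name": "Ashley", "fname": "Kevin", "office": 179},
    10579: {"name": "Mouse", "fname": "Mickey", "office": 188},
}

TRIPS = [
    {"id": 10246, "reg": "X111ABC", "from": "Clapham", "to": "Deptford"},
    {"id": 10579, "reg": "H179JKL", "from": "Land's End", "to": "John O'Groats"},
    {"id": 56999, "reg": "A217HGB", "from": "Poole", "to": "Sandy"},
]


def view_trips(today):
    """Return trip rows with the traveller shown, withheld, or marked absent."""
    people_open = today >= PEOPLE_OPEN_FROM
    rows = []
    for trip in TRIPS:
        person = PEOPLE.get(trip["id"])
        if person is None:
            who = "[no matching person record]"   # absent data
        elif not people_open:
            who = "[closed until 2050]"           # access restriction
        else:
            who = f'{person["fname"]} {person["name"]}'
        rows.append((who, trip["reg"], trip["from"], trip["to"]))
    return rows
```

Whether the ID column itself may be shown while People is closed is a policy decision; this sketch omits it from the view.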


Access - goal

• Duplicate original system
• Advanced analysis tools
• Simple viewing via a generic tool
• Multimedia datatypes
• Extensible via object-like design
• Traditional database systems not up to task without significant additional effort
• Hence much software home-grown


New issues from temporal GIS

• Temporal GIS allows one system to represent changing features and knowledge

• Queries like:
    - Which features are newer than feature X?
    - What did area Y look like 10 years ago?
    - What present-day names correspond to ‘Hetfelle’?
• In a preserved temporal GIS:
    - What would the answer to question 2 have been if I asked it 5 years ago?
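The last question needs two time axes: when something was true in the world, and when the system knew it. The sketch below is a minimal two-axis (bitemporal) model with invented features and years; it is not taken from any particular GIS. It holds the world year fixed and varies only the knowledge year, to show how the same question can have different answers depending on when it was asked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    feature: str
    valid_from: int             # year the feature appeared in the world
    valid_to: Optional[int]     # year it disappeared, None if still present
    recorded: int               # year the system recorded this assertion
    superseded: Optional[int]   # year the assertion was replaced, None if current


# Hypothetical data: the quarry's closure in 1996 was only recorded in 2003.
ASSERTIONS = [
    Assertion("mill", 1980, None, 1990, None),
    Assertion("bridge", 1995, None, 2004, None),
    Assertion("quarry", 1970, None, 1985, 2003),
    Assertion("quarry", 1970, 1996, 2003, None),
]


def snapshot(assertions, world_year, known_year):
    """Features present in world_year according to what was known in known_year."""
    return sorted(
        a.feature
        for a in assertions
        if a.valid_from <= world_year
        and (a.valid_to is None or world_year < a.valid_to)
        and a.recorded <= known_year
        and (a.superseded is None or known_year < a.superseded)
    )
```

With this data, the view of 1997 as known in 2007 contains the bridge and the mill, while the view of 1997 as known in 2002 contains the mill and the quarry.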


Inconsistencies and errors

• Schools census - 4 datasets per year for different school types

• But 1976 only has 3 - no nursery schools
• Further examination shows files have been merged
• Confirmation came from completed census forms held by schools - not by government department


Cornell’s DP model