Peter Buneman Research Director Digital Curation Centre and School of Informatics University of Edinburgh Funders: The Research Agenda Digital Curation

Peter Buneman, Research Director

Digital Curation Centre and School of Informatics, University of Edinburgh

Funders:

The Research Agenda

Digital Curation Centre: a centre of expertise in data curation and preservation


What is Digital Curation?

• Preserving stuff?
– Librarians and archivists
– Scientists (with huge amounts of regular experimental data)

• Publishing stuff?
– Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
– Scientists (with lots of complex annotated data)

Both communities call themselves “curators”, but at first sight they have almost orthogonal concerns.


Their concerns look orthogonal, but…

• Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?

• The “preservers” do more than preserve – they classify and annotate.
– Shouldn’t they publish (and preserve) their own work?

As you dig deeper you find that there is a lot of commonality.


Curated Databases are Central

Much/most scientific data is now in databases

• They often do not contain source experimental data. Sometimes just annotation/metadata

• They borrow extensively from, and refer to, other databases

• You are now judged by your databases as well as your (paper) publications!!

• These databases are built and maintained with a great deal of human or computational effort.

What makes a database?
– It has internal structure or it changes. Size alone doesn’t qualify


The Research Agenda

• Data integration and publishing
– Slowly coming to market. Publishing in community formats is a new twist

• Annotation
– Everybody agrees this is important. No-one understands it.

• Metadata extraction
– Semantic or otherwise, it’s a key part of annotation

• Archiving and Appraisal
– What do we do about databases – they change!

• Legal issues
– Can we at least help to clarify what is going on?

• Provenance and data quality
– Again, we don’t fully understand it.

• Organisational dynamics of repositories

• Economic analyses of curation

• Ontologies, performance, registries, structure evolution…


Archiving (preserving) databases

• How do you preserve something that changes every hour or minute?
– Important for the scientific record – someone might have cited your data at time t.

• Current practice
– Create versions (how often?)
– Log changes
– Use diffs
– Do nothing (common!)
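To make the "use diffs" option concrete, here is a minimal sketch of a diff repository: the first version is stored in full and each later version only as the edits against its predecessor. It assumes each version is a list of text lines; the names `make_delta`, `apply_delta` and `DiffRepository` are hypothetical, and the record contents in the usage note are made up for illustration.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Forward delta: the edit operations that turn `old` into `new` (lists of lines)."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    # Keep only the changed regions; unchanged lines are not stored.
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(old, delta):
    """Rebuild the next version from the previous one plus its delta."""
    out, pos = [], 0
    for i1, i2, replacement in delta:
        out.extend(old[pos:i1])   # copy the untouched lines before this change
        out.extend(replacement)   # add the new lines (empty for a pure deletion)
        pos = i2                  # skip the lines that were replaced or deleted
    out.extend(old[pos:])         # copy the untouched tail
    return out


class DiffRepository:
    """Stores the first version in full and every later version as a delta."""

    def __init__(self, first_version):
        self.base = list(first_version)
        self.latest = list(first_version)
        self.deltas = []

    def commit(self, new_version):
        self.deltas.append(make_delta(self.latest, new_version))
        self.latest = list(new_version)

    def checkout(self, n):
        """Return version n (0 is the base) by replaying the first n deltas."""
        lines = list(self.base)
        for delta in self.deltas[:n]:
            lines = apply_delta(lines, delta)
        return lines
```

For example, after `repo = DiffRepository(["id: 1", "name: alpha"])` and `repo.commit(["id: 1", "name: beta"])`, calling `repo.checkout(0)` returns the original lines. In this sketch, retrieving version n means replaying n deltas, and a line-based diff knows nothing about which record a changed line belongs to.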


A Sequence of Versions

[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

This relies on a deterministic / keyed model

Pushing time down
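One way to read "pushing time down" under a keyed model: instead of keeping a sequence of whole versions, merge them into a single tree in which each keyed node records the times at which it was present. The sketch below is an illustration under stated assumptions, not the archiver's actual implementation: versions are nested Python dicts whose dictionary keys play the role of keys, leaf values are strings, and the function names and sample records are hypothetical. A space-conscious implementation would omit a child's time stamps when they equal its parent's; this sketch stamps every node for simplicity.

```python
def merge(archive, version, t):
    """Merge the nested-dict `version` into `archive`, stamping each node with time t."""
    for key, value in version.items():
        node = archive.setdefault(key, {"times": set(), "children": {}, "values": {}})
        node["times"].add(t)
        if isinstance(value, dict):
            merge(node["children"], value, t)               # interior node: recurse
        else:
            node["values"].setdefault(value, set()).add(t)  # leaf: stamp the value
    return archive


def snapshot(archive, t):
    """Recover the version at time t by keeping only the nodes stamped with t."""
    out = {}
    for key, node in archive.items():
        if t not in node["times"]:
            continue
        leaf = [v for v, times in node["values"].items() if t in times]
        out[key] = leaf[0] if leaf else snapshot(node["children"], t)
    return out


def history(archive, path):
    """Temporal query: each value a keyed leaf has held, with the times it held it."""
    children, node = archive, None
    for key in path:
        node = children[key]
        children = node["children"]
    return {value: sorted(times) for value, times in node["values"].items()}


# Hypothetical records, two versions apart.
v1 = {"entry:100": {"name": "ABC", "locus": "1p"}}
v2 = {"entry:100": {"name": "ABC", "locus": "1q"}, "entry:200": {"name": "XYZ"}}

archive = {}
merge(archive, v1, 1)
merge(archive, v2, 2)
```

Here `snapshot(archive, 1)` rebuilds `v1`, and `history(archive, ["entry:100", "locus"])` reports which value the `locus` field held at which times. Because nodes are matched by key rather than by position, an unchanged record is stored once however many versions contain it.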


100 days of OMIM

[Figure: size (bytes × 10^6) over 100 days of OMIM. Curves: version; archive; inc diff; gzip(inc diff), the compressed inc diff; XMill(archive), the compressed archive.]

Uncompressed

• Archive size is
– 1.01 times diff repository size
– 1.04 times size of largest version

Compressed

• Archive size is between 0.94 and 1 times compressed diff repository size

• gzip: Unix compression tool

• XMill: XML compression tool


The Bottom Line

• Can archive a whole year of Swissprot or OMIM with less than 15% overhead (relative to the size of the current file)

• Retrieval is a linear scan

• Works well with compression: to less than 30% of the size of the current file

• Archive is an XML file

• Archive as often as you like! (Almost)

• Works well with indexing

• Permits temporal queries on objects


How do we cite data?

• A URL or citation to an article is already unsatisfactory.
– DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”

• The problem is much worse when you are citing something in a very large database.

• How do you use a citation to locate data?

• How do you ensure that the citation persists?
– Connections with DB archiving and DOIs


Location is typically informative?

• File and directory names that contain data/timit/train/dr1/fcjf0/sa1.wav

speaker-id: cjf0
sex: f
sentence-id: sa1
file-type: waveform
dialect-region: 1
type: training
corpus: timit

• Compound keys traditionally indicated location: BL MS Cotton Nero A.ix

Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.
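The TIMIT path can be read as a compound key, and the fields listed on the slide can be recovered from it mechanically. The sketch below assumes the layout corpus/type/dialect-region/speaker/file seen in the example path, with the first letter of the speaker directory giving the sex; the function name and the two lookup tables are hypothetical and only cover the values shown on the slide.

```python
from pathlib import PurePosixPath

# Lookup tables covering only the values that appear in the example (assumptions).
TYPE_NAMES = {"train": "training"}
FILE_TYPES = {".wav": "waveform"}


def parse_timit_path(path):
    """Split a path such as .../timit/train/dr1/fcjf0/sa1.wav into its fields."""
    parts = PurePosixPath(path).parts
    # Only the last five components matter; any leading directories are ignored.
    corpus, split, region, speaker, filename = parts[-5:]
    name = PurePosixPath(filename)
    return {
        "corpus": corpus,
        "type": TYPE_NAMES.get(split, split),
        "dialect-region": int(region[2:]),   # "dr1" -> 1
        "sex": speaker[0],                   # "fcjf0" -> "f"
        "speaker-id": speaker[1:],           # "fcjf0" -> "cjf0"
        "sentence-id": name.stem,            # "sa1.wav" -> "sa1"
        "file-type": FILE_TYPES.get(name.suffix, name.suffix.lstrip(".")),
    }


fields = parse_timit_path("timit/train/dr1/fcjf0/sa1.wav")
```

The resulting `fields` dictionary holds the seven values annotated on the slide, which is the sense in which the location itself carries data.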


Keys for XML

• Implicit keys are ubiquitous in scientific data formats (easily converted to XML)

• Some proposals for key specifications in XML work (DTD IDs, XML-Schema)

• “Deep citation” in digital libraries.

• Natural consequence of translating back from deterministic model to XML (node-labeled)

• Interactions with data models/formats


Relative keys

General form: Q{P1, ... , Pn }. Q’{P’1, ... , P’n’ } ...

Example: book{name}.chapter{number}.verse{number}

number specifies verse only within chapter

number specifies chapter only within book

Also: bible{}.book{name}.chapter{number}.verse{number}

empty key: at most one bible node
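A relative key of this form can be checked step by step: within each context node reached so far, no two children with the step's tag may share the same key values. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the key fields are stored as XML attributes (the slides do not say how they are represented), and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def parse_relative_key(spec):
    """'book{name}.chapter{number}' -> [('book', ['name']), ('chapter', ['number'])]"""
    steps = []
    for tag, fields in re.findall(r"(\w+)\{([^}]*)\}", spec):
        steps.append((tag, [f.strip() for f in fields.split(",") if f.strip()]))
    return steps


def check_relative_key(root, spec):
    """Return the paths of nodes that violate the key; an empty list means it holds."""
    wrapper = ET.Element("document")   # synthetic parent so the root is checked too
    wrapper.append(root)
    violations = []
    contexts = [(wrapper, "")]
    for tag, fields in parse_relative_key(spec):
        next_contexts = []
        for context, path in contexts:
            seen = set()               # key values already used within this context
            for child in context.findall(tag):
                key = tuple(child.get(f) for f in fields)
                here = f"{path}/{tag}{list(key)}"
                if key in seen:
                    violations.append(here)
                seen.add(key)
                next_contexts.append((child, here))
        contexts = next_contexts
    return violations


doc = ET.fromstring(
    '<bible><book name="Genesis">'
    '<chapter number="1"><verse number="1"/><verse number="2"/></chapter>'
    '<chapter number="2"><verse number="1"/></chapter>'
    '</book></bible>'
)
violations = check_relative_key(doc, "bible{}.book{name}.chapter{number}.verse{number}")
```

Verse number 1 appears twice in this document without violating the key, because the two verses sit in different chapters: uniqueness is only required relative to the enclosing chapter. With the empty key `bible{}`, every `bible` node has the same (empty) key value, so a second one in the same context would be reported.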


Keys and file formats

• Understanding and registering formats is only a first step

• The real issue is still integration and transformation.

• Keys and other constraints may help

Remember: structured files are databases!


Data exchange on the Web

All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, health-care, ...

XML Publishing:

• mapping relational data to XML

• conforming to the predefined DTD

[Diagram labels: DB1, DB2, XML, DTD, Q: XML view, Web]
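As a small illustration of XML publishing, the sketch below maps rows of a relational table to an XML view and checks that the result has the agreed shape. Everything specific in it is an assumption made for the example: the `product` table, the `catalogue` element names, and the hand-written `conforms` check, which stands in for real DTD validation because Python's standard library does not validate against DTDs.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical agreed structure, written as the DTD it stands in for:
#   <!ELEMENT catalogue (product*)>
#   <!ELEMENT product (name, price)>
#   <!ATTLIST product sku CDATA #REQUIRED>


def publish(conn):
    """The XML view: one <product> element per row of the product table."""
    root = ET.Element("catalogue")
    rows = conn.execute("SELECT sku, name, price FROM product ORDER BY sku")
    for sku, name, price in rows:
        product = ET.SubElement(root, "product", sku=sku)
        ET.SubElement(product, "name").text = name
        ET.SubElement(product, "price").text = f"{price:.2f}"
    return root


def conforms(root):
    """Structural check mirroring the content models in the comment above."""
    if root.tag != "catalogue":
        return False
    return all(
        p.tag == "product"
        and p.get("sku") is not None
        and [c.tag for c in p] == ["name", "price"]
        for p in root
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [("B2", "Gadget", 12.0), ("A1", "Widget", 9.5)],
)
view = publish(conn)
```

Calling `ET.tostring(view, encoding="unicode")` serialises the view for exchange. The design point is that the target structure is fixed in advance by the community, so the mapping query has to be written to fit it rather than the other way round.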


Progress report on DCC research (funding period: -2 weeks)

• Four new research fellows at Edinburgh:
– Mags McGinley (legal practice): IP, copyright in databases
– James Cheney (Cornell): programming languages, digital libraries, XML compression
– Tasos Kementsietsidis (Toronto): data integration, P2P databases
– Rajendra Bose (UCSB): earth sciences data, “workflow” provenance in scientific data

• At UKOLN
– Michael Day, metadata and interoperability

• At CCLRC
– Shoaib Sufi, data portals and metadata

• At Glasgow
– Position in metadata extraction advertised


Progress report on DCC research (continued)

• Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.

• Collaboration with
– biologists (EBI & Edinburgh) on data publishing, and
– astronomers (Edinburgh) on XML manipulation & representation of large data sets.

• First DCC research visitor (Michael Lesk)

• Work with partners in progress on
– annotation
– DOIs

Please join us!!!


DCC has research positions in databases, digital curation, XML, web technology, fundamentals.

Top-rated department. World-class database group. Good connections with logical foundations, scientific DBs and distributed computation (Grid).

Edinburgh is a great place!!

Contact Peter Buneman

[email protected]
