Transcript
Page 1: Peter Buneman Research Director Digital Curation Centre and School of Informatics University of Edinburgh Funders: The Research Agenda Digital Curation

Peter Buneman, Research Director

Digital Curation Centre and School of Informatics, University of Edinburgh

Funders:

The Research Agenda

Digital Curation Centre: a centre of expertise in data curation and preservation

Page 2:

What is Digital Curation?

• Preserving stuff?
  – Librarians and archivists
  – Scientists (with huge amounts of regular experimental data)

• Publishing stuff?
  – Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
  – Scientists (with lots of complex annotated data)

Both communities call themselves “curators” but at first sight they have almost orthogonal concerns

Page 3:

Their concerns look orthogonal, but…

• Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?

• The “preservers” do more than preserve – they classify and annotate.
  – Shouldn’t they publish (and preserve) their own work?

As you dig deeper you find that there is a lot of commonality.

Page 4:

Curated Databases are Central

Much/most scientific data is now in databases
• They often do not contain source experimental data. Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your databases as well as your (paper) publications!!
• These databases are built and maintained with a great deal of human or computational effort.

What makes a database?
  – It has internal structure or it changes. Size alone doesn’t qualify

Page 5:

The Research Agenda

• Data integration and publishing
  – Slowly coming to market. Publishing in community formats is a new twist
• Annotation
  – Everybody agrees this is important. No-one understands it.
• Metadata extraction
  – Semantic or otherwise, it’s a key part of annotation
• Archiving and Appraisal
  – What do we do about databases – they change!
• Legal issues
  – Can we at least help to clarify what is going on?
• Provenance and data quality
  – Again, we don’t fully understand it.
• Organisational dynamics of repositories
• Economic analyses of curation
• Ontologies, performance, registries, structure evolution…

Page 6:

Archiving (preserving) databases

• How do you preserve something that changes every hour or minute?
  – Important for the scientific record – someone might have cited your data at time t.

• Current practice
  – Create versions (how often?)
  – Log changes
  – Use diffs
  – Do nothing (common!)
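The "use diffs" practice listed above can be sketched as follows. This is a minimal illustration, not any particular database's mechanism: it assumes a version is a list of text lines, and the names `make_diff`, `apply_diff` and `DiffRepository` are hypothetical. Only the first version is stored in full; each later version is kept as the edits relative to its predecessor.

```python
from difflib import SequenceMatcher


def make_diff(old, new):
    """Return the edits (start, end, replacement) that turn `old` into `new`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_diff(old, edits):
    """Apply edits produced by make_diff to `old` and return the new list."""
    result, pos = [], 0
    for start, end, replacement in edits:
        result.extend(old[pos:start])  # unchanged lines before this edit
        result.extend(replacement)     # inserted or replacing lines
        pos = end                      # skip deleted or replaced lines
    result.extend(old[pos:])
    return result


class DiffRepository:
    """Keeps the first version plus one incremental diff per later version."""

    def __init__(self, first_version):
        self.base = list(first_version)
        self.latest = list(first_version)
        self.diffs = []

    def commit(self, new_version):
        self.diffs.append(make_diff(self.latest, new_version))
        self.latest = list(new_version)

    def checkout(self, n):
        """Rebuild version n (0 = first) by replaying the first n diffs."""
        version = list(self.base)
        for edits in self.diffs[:n]:
            version = apply_diff(version, edits)
        return version


# Made-up sample data: three versions of a tiny "database".
repo = DiffRepository(["gene A", "gene B", "gene C"])
repo.commit(["gene A", "gene B (revised)", "gene C", "gene D"])
repo.commit(["gene B (revised)", "gene C", "gene D"])
```

Retrieving an old version means replaying every diff up to it, so the cost of answering "what did the data look like at time t?" grows with the number of versions in between.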

Page 7:

A Sequence of Versions

Page 8:

[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

This relies on a deterministic / keyed model

Pushing time down

Page 9:

100 days of OMIM

[Figure: size (bytes × 10^6) over 100 days of OMIM. Series: archive, inc diff, version, compressed inc diff (gzip), compressed archive (XMill).]

Uncompressed

• Archive size is
  – 1.01 times diff repository size
  – 1.04 times size of largest version

Compressed

• archive size between 0.94 and 1 times compressed diff repository size

• gzip - Unix compression tool

• XMill - XML compression tool

Page 10:

The Bottom Line

• Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)

• Retrieval is a linear scan

• Works well with compression to less than 30% of current file. Archive is an XML file

• Archive as often as you like! (Almost)

• Works well with indexing

• Permits temporal queries on objects
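The properties above (one archive for many versions, retrieval by linear scan, temporal queries on objects) can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the archiver behind these figures: records are assumed to be identified by a key, the archive is a plain dictionary rather than an XML file, and the names `merge_version`, `snapshot` and `history` are hypothetical. Each stored value carries the list of versions in which it appears, so an unchanged record costs one extra version number instead of a full copy.

```python
def merge_version(archive, version_no, records):
    """Merge one database version into the archive.

    `archive` maps key -> list of [value, versions] entries and `records`
    maps key -> value for this version. A value already present for the
    key is not stored again; it only gains a version number.
    """
    for key, value in records.items():
        entries = archive.setdefault(key, [])
        for entry in entries:
            if entry[0] == value:
                entry[1].append(version_no)
                break
        else:
            entries.append([value, [version_no]])


def snapshot(archive, version_no):
    """Retrieve one version with a single linear scan of the archive."""
    return {
        key: value
        for key, entries in archive.items()
        for value, versions in entries
        if version_no in versions
    }


def history(archive, key):
    """Temporal query: the values one object took and the versions of each."""
    return [(value, list(versions)) for value, versions in archive.get(key, [])]


# Made-up sample data: three versions of a keyed database.
versions = [
    {"P1": "kinase", "P2": "ligase"},
    {"P1": "kinase", "P2": "ligase II", "P3": "helicase"},
    {"P1": "kinase", "P3": "helicase"},
]
archive = {}
for number, records in enumerate(versions, start=1):
    merge_version(archive, number, records)
```

Here `snapshot(archive, 2)` rebuilds the second version, and `history(archive, "P2")` shows when the object keyed `P2` changed and when it disappeared.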

Page 11:

How do we cite data?

• A URL or citation to an article is already unsatisfactory.
  – DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”
• The problem is much worse when you are citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
  – Connections with DB archiving and DOIs

Page 12:

• File and directory names that contain data:
  /timit/train/dr1/fcjf0/sa1.wav
  – corpus: timit
  – type: training
  – dialect-region: 1
  – sex: f
  – speaker-id: cjf0
  – sentence-id: sa1
  – file-type: waveform

• Compound keys traditionally indicated location: BL MS Cotton Nero A.ix

Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.

Location is typically informative?
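The TIMIT path above can be unpacked mechanically, which is what makes such a location informative. The sketch below assumes the layout is exactly corpus/split/dialect-region/speaker/file and that the first letter of the speaker directory encodes sex, as the slide's annotations suggest; the function name and the two lookup tables are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed mappings from path fragments to the labels used on the slide.
FILE_TYPES = {".wav": "waveform"}
SPLITS = {"train": "training"}


def parse_timit_path(path):
    """Recover the metadata encoded in a TIMIT-style file path."""
    corpus, split, region, speaker, filename = PurePosixPath(path).parts[-5:]
    name = PurePosixPath(filename)
    return {
        "corpus": corpus,
        "type": SPLITS.get(split, split),
        "dialect-region": int(region[2:]),  # "dr1" -> 1
        "sex": speaker[0],                  # "fcjf0" -> "f"
        "speaker-id": speaker[1:],          # "fcjf0" -> "cjf0"
        "sentence-id": name.stem,
        "file-type": FILE_TYPES.get(name.suffix, name.suffix.lstrip(".")),
    }


info = parse_timit_path("/timit/train/dr1/fcjf0/sa1.wav")
```

The path components together act as a compound key: each one only identifies something within the directory above it.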

Page 13:

Keys for XML

• Implicit keys are ubiquitous in scientific data formats (easily converted to XML)

• Some proposals for key specifications in XML work (DTD IDs, XML-Schema)

• “Deep citation” in digital libraries.

• Natural consequence of translating back from deterministic model to XML (node-labeled)

• Interactions with data models/formats

Page 14:

Relative keys

General form: Q{P1, ... , Pn }. Q’{P’1, ... , P’n’ } ...

Example: book{name}.chapter{number}.verse{number}

number specifies verse only within chapter

number specifies chapter only within book

Also: bible{}.book{name}.chapter{number}.verse{number}

empty key: at most one bible node
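A relative key such as the book/chapter/verse example can be checked and used for lookup with a short sketch. Two simplifying assumptions are made here that the slide does not state: key paths are treated as XML attribute names, and each step Q is a single child tag. The names `parse_key`, `satisfies` and `locate`, and the sample document, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

STEP = re.compile(r"(\w+)\{([^}]*)\}")


def parse_key(spec):
    """Parse 'book{name}.chapter{number}' into [('book', ['name']), ...]."""
    return [
        (tag, [p.strip() for p in paths.split(",") if p.strip()])
        for tag, paths in STEP.findall(spec)
    ]


def satisfies(context, steps):
    """True if each step's key values are unique among siblings under `context`."""
    if not steps:
        return True
    (tag, key_attrs), rest = steps[0], steps[1:]
    seen = set()
    for child in context.findall(tag):
        value = tuple(child.get(attr) for attr in key_attrs)
        if value in seen:  # two siblings share a key value
            return False
        seen.add(value)
        if not satisfies(child, rest):
            return False
    return True


def locate(context, steps, values):
    """Follow one key value per step; return the node identified, or None."""
    node = context
    for (tag, key_attrs), wanted in zip(steps, values):
        matches = [
            child for child in node.findall(tag)
            if tuple(child.get(attr) for attr in key_attrs) == tuple(wanted)
        ]
        if len(matches) != 1:
            return None
        node = matches[0]
    return node


# Made-up sample document; the verse text is a placeholder.
BIBLE = ET.fromstring("""
<bible>
  <book name="Genesis">
    <chapter number="1"><verse number="1">G1:1</verse></chapter>
  </book>
  <book name="Exodus">
    <chapter number="1"><verse number="1">E1:1</verse></chapter>
  </book>
</bible>
""")
steps = parse_key("book{name}.chapter{number}.verse{number}")
verse = locate(BIBLE, steps, [("Exodus",), ("1",), ("1",)])
```

Both books contain a chapter 1 and a verse 1 without violating the key, because each number is only required to be unique within its parent. An empty key such as `bible{}` yields the empty tuple for every node, so a second sibling with that tag would be rejected, matching "at most one bible node".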

Page 15:

Keys and file formats

• Understanding and registering formats is only a first step

• The real issue is still integration and transformation.

• Keys and other constraints may help

Remember: structured files are databases!

Page 16:

Data exchange on the Web

All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, health-care, ...

XML Publishing:
• mapping relational data to XML
• conforming to the predefined DTD

[Diagram labels: DB1, DB2; Q: XML view; XML; DTD; Web]
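The XML publishing step can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the patient/visit schema, the DTD shown in the comment, and the function name `publish`. The code builds the XML view from relational rows; it does not validate the result against the DTD, which would need a validating parser outside the standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical community DTD that the published document is meant to follow:
#   <!ELEMENT patients (patient*)>
#   <!ELEMENT patient (name, visit*)>
#   <!ATTLIST patient id CDATA #REQUIRED>
#   <!ELEMENT name (#PCDATA)>
#   <!ELEMENT visit (#PCDATA)>
#   <!ATTLIST visit date CDATA #REQUIRED>


def publish(conn):
    """Build the XML view: one <patient> per row with nested <visit> elements."""
    root = ET.Element("patients")
    for pid, name in conn.execute("SELECT id, name FROM patient ORDER BY id"):
        patient = ET.SubElement(root, "patient", id=str(pid))
        ET.SubElement(patient, "name").text = name
        visits = conn.execute(
            "SELECT date, note FROM visit WHERE patient_id = ? ORDER BY date",
            (pid,),
        )
        for date, note in visits:
            ET.SubElement(patient, "visit", date=date).text = note
    return root


# Made-up relational source data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visit (patient_id INTEGER, date TEXT, note TEXT);
    INSERT INTO patient VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO visit VALUES (1, '2004-11-01', 'check-up'),
                             (1, '2004-12-03', 'follow-up');
""")
xml_text = ET.tostring(publish(conn), encoding="unicode")
```

The nesting of the output is dictated by the agreed DTD rather than by the table layout, which is why the mapping has to be written per community format.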

Page 17:

Progress report on DCC research (funding period: -2 weeks)

• Four new research fellows at Edinburgh:
  – Mags McGinley (legal practice) IP, copyright in databases
  – James Cheney (Cornell) Programming Languages, Digital Libraries, XML compression
  – Tasos Kemensietsidis (Toronto) Data integration, P2P databases
  – Rajendra Bose (UCSB) Earth sciences data. “Workflow” provenance in scientific data.

• At UKOLN
  – Michael Day, metadata and Interoperability

• At CCLRC
  – Shoaib Sufi, data portals and metadata

• At Glasgow
  – Position in metadata extraction advertised

Page 18:

Progress report on DCC research (continued)

• Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.

• Collaboration with
  – biologists (EBI & Edinburgh) on data publishing and
  – astronomers (Edinburgh) on XML manipulation & representation of large data sets.

• First DCC research visitor (Michael Lesk)

• Work with partners in progress on
  – annotation
  – DOIs

Please join us!!!

Page 19:

DCC has research positions in databases, digital curation, XML, web technology, fundamentals.

Top-rated department. World-class database group. Good connections with logical foundations, scientific DBs, distributed computation (Grid)

Edinburgh is a great place!!

Contact Peter Buneman

[email protected]