The Research Agenda: Digital Curation
Peter Buneman, Research Director, Digital Curation Centre and School of Informatics, University of Edinburgh

  • What is Digital Curation?
      • Preserving stuff?
          • Librarians and archivists
          • Scientists (with huge amounts of regular experimental data)
      • Publishing stuff?
          • Publishers of reference data: compendia, dictionaries, bibliographies, gazetteers, etc.
          • Scientists (with lots of complex annotated data)
      • Both communities call themselves curators, but at first sight they have almost orthogonal concerns.

  • Their concerns look orthogonal, but
      • Shouldn't the publishers be concerned about the long-term usefulness of their findings?
      • The preservers do more than preserve: they classify and annotate. Shouldn't they publish (and preserve) their own work?
      • As you dig deeper you find that there is a lot of commonality.

  • Curated Databases are Central
      • Much/most scientific data is now in databases
      • They often do not contain source experimental data. Sometimes just annotation/metadata
      • They borrow extensively from, and refer to, other databases
      • You are now judged by your databases as well as your (paper) publications!!
      • These databases are built and maintained with a great deal of human or computational effort.
      • What makes a database?
          • It has internal structure or it changes.
          • Size alone doesn't qualify

  • The Research Agenda
      • Data integration and publishing
          • Slowly coming to market. Publishing in community formats is a new twist
      • Annotation
          • Everybody agrees this is important. No-one understands it.
      • Metadata extraction
          • Semantic or otherwise, it's a key part of annotation
      • Archiving and Appraisal
          • What do we do about databases? They change!
      • Legal issues
          • Can we at least help to clarify what is going on?
      • Provenance and data quality
          • Again, we don't fully understand it.
      • Organisational dynamics of repositories
      • Economic analyses of curation
      • Ontologies, performance, registries, structure evolution

  • Archiving (preserving) databases
      • How do you preserve something that changes every hour or minute?
      • Important for the scientific record: someone might have cited your data at time t.
      • Current practice
          • Create versions (how often?)
          • Log changes
          • Use diffs
          • Do nothing (common!)
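The "use diffs" practice can be sketched as follows. This is only an illustration built on Python's standard `difflib`, not the tooling of any particular database; the function names and the line-oriented record format are assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical helpers: a version is a list of text lines.

def make_delta(old, new):
    """Record only the edits needed to turn `old` into `new`."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def apply_delta(old, delta):
    """Replay a delta; edits are applied right to left so indices stay valid."""
    result = list(old)
    for i1, i2, replacement in reversed(delta):
        result[i1:i2] = replacement
    return result

def build_repository(versions):
    """Keep the first version in full, then one delta per later version."""
    deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]
    return versions[0], deltas

def checkout(base, deltas, t):
    """Rebuild version t (0-based) by replaying the first t deltas."""
    current = base
    for delta in deltas[:t]:
        current = apply_delta(current, delta)
    return current

# Made-up sample data, three successive versions of one record.
versions = [
    ["id: 1", "name: A"],
    ["id: 1", "name: B"],
    ["id: 1", "name: B", "note: revised"],
]
base, deltas = build_repository(versions)
```

Retrieving the data "at time t" means replaying every delta up to t, which is one reason the choice between full versions and diffs is a trade-off between space and retrieval cost.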

  • A Sequence of Versions

  • Pushing time down
      • [Driscoll, Sarnak, Sleator, Tarjan: Making Data Structures Persistent.]
      • This relies on a deterministic / keyed model
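One way to read "pushing time down" is that all versions are merged into a single keyed hierarchy, and each node records the versions in which it occurs, so an unchanged entry is stored once. The sketch below assumes a version is a nested Python dict whose keys identify children uniquely (the deterministic / keyed model); the `Node` class and function names are hypothetical. It stores a timestamp set on every node, which is a simplification.

```python
class Node:
    """One node of the merged archive (hypothetical structure)."""
    def __init__(self):
        self.times = set()   # versions in which this node exists
        self.children = {}   # key -> Node, for internal nodes
        self.values = {}     # leaf value -> set of versions holding that value

def merge(node, data, t):
    """Merge version t of a keyed hierarchy into the archive rooted at node."""
    node.times.add(t)
    if isinstance(data, dict):
        for key, sub in data.items():
            merge(node.children.setdefault(key, Node()), sub, t)
    else:
        node.values.setdefault(data, set()).add(t)

def snapshot(node, t):
    """Recover version t from the archive."""
    if any(t in child.times for child in node.children.values()):
        return {k: snapshot(c, t) for k, c in node.children.items() if t in c.times}
    for value, times in node.values.items():
        if t in times:
            return value
    return {}

def history(root, path):
    """Temporal query: the value of the leaf at `path`, per version."""
    node = root
    for key in path:
        node = node.children[key]
    return {t: value for value, times in node.values.items() for t in times}

# Made-up sample data: two versions of a small keyed database.
v0 = {"entry1": {"name": "A", "loc": "1p"}, "entry2": {"name": "B"}}
v1 = {"entry1": {"name": "A", "loc": "2q"}}
archive = Node()
merge(archive, v0, 0)
merge(archive, v1, 1)
```

Because children are found by key rather than by position, merging a new version never has to guess which old node a new node corresponds to.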

  • 100 days of OMIM
      • Uncompressed: archive size is 1.01 times diff repository size, 1.04 times size of largest version
      • Compressed: archive size is between 0.94 and 1 times compressed diff repository size
      • gzip: Unix compression tool
      • XMill: XML compression tool

  • The Bottom Line
      • Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
      • Retrieval is a linear scan
      • Works well with compression to less than 30% of current file.
      • Archive is an XML file
      • Archive as often as you like! (Almost)
      • Works well with indexing
      • Permits temporal queries on objects

  • How do we cite data?
      • A URL or citation to an article is already unsatisfactory.
          • DCC client complaint: "I spend a lot of time searching [electronic documents] for the part that is relevant to the citation."
      • The problem is much worse when you are citing something in a very large database.
      • How do you use a citation to locate data?
      • How do you ensure that the citation persists? Connections with DB archiving and DOIs

  • Location is typically informative?
      • File and directory names that contain data

  • Keys for XML
      • Implicit keys are ubiquitous in scientific data formats (easily converted to XML)
      • Some proposals for key specifications in XML work (DTD IDs, XML-Schema)
      • Deep citation in digital libraries.
      • Natural consequence of translating back from deterministic model to XML (node-labeled)
      • Interactions with data models/formats

  • Relative keys
      • General form: Q{P1, ..., Pn}.Q{P1, ..., Pn}. ...
      • Example: book{name}.chapter{number}.verse{number}
          • number specifies verse only within chapter
          • number specifies chapter only within book
      • Also: bible{}.book{name}.chapter{number}.verse{number}
          • empty key: at most one bible node
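A relative key such as bible{}.book{name}.chapter{number}.verse{number} can double as a citation: each step picks the one child that matches the key within its parent. The sketch below assumes key paths are child element names and uses Python's standard `xml.etree.ElementTree`; the sample document, the `doc` wrapper element and the `resolve` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the <doc> wrapper stands in for the document node.
SAMPLE = """
<doc>
  <bible>
    <book><name>Genesis</name>
      <chapter><number>1</number>
        <verse><number>1</number><text>verse text A</text></verse>
        <verse><number>2</number><text>verse text B</text></verse>
      </chapter>
    </book>
    <book><name>Exodus</name>
      <chapter><number>1</number>
        <verse><number>1</number><text>verse text C</text></verse>
      </chapter>
    </book>
  </bible>
</doc>
"""

def resolve(context, steps):
    """Follow steps [(tag, {key_path: value}), ...] downward from context.

    Each step must match exactly one child of the current node, which is
    what it means for the key to hold relative to that node.
    """
    for tag, key in steps:
        matches = [child for child in context.findall(tag)
                   if all(child.findtext(path) == value for path, value in key.items())]
        if len(matches) != 1:
            raise LookupError(f"{tag}{key} matches {len(matches)} nodes, expected 1")
        context = matches[0]
    return context

doc = ET.fromstring(SAMPLE)
citation = [
    ("bible", {}),                  # empty key: at most one bible node
    ("book", {"name": "Genesis"}),
    ("chapter", {"number": "1"}),   # number identifies a chapter only within its book
    ("verse", {"number": "2"}),     # number identifies a verse only within its chapter
]
verse = resolve(doc, citation)
```

Chapter number 1 occurs in both books, yet the citation is unambiguous because each key is checked only among the children of the node reached so far.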

  • Keys and file formats
      • Understanding and registering formats is only a first step
      • The real issue is still integration and transformation.
      • Keys and other constraints may help
      • Remember: structured files are databases!

  • Data exchange on the Web
      • All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, health-care, ...
      • XML Publishing:
          • mapping relational data to XML
          • conforming to the predefined DTD
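A minimal sketch of XML publishing, assuming a made-up e-commerce schema: two relational tables (`orders`, `items`) are mapped to a nested document whose shape follows a hypothetical agreed DTD, shown in the comment. Python's standard library does not validate against a DTD, so the code only constructs output in the intended shape; all table, column and element names are illustrative.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical community DTD the output is meant to follow:
#   <!ELEMENT orders (order*)>
#   <!ELEMENT order (customer, item+)>   <!ATTLIST order id CDATA #REQUIRED>
#   <!ELEMENT customer (#PCDATA)>
#   <!ELEMENT item (#PCDATA)>            <!ATTLIST item sku CDATA #REQUIRED>

def demo_connection():
    """Build a small in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT, qty INTEGER)")
    conn.execute("INSERT INTO orders VALUES (1, 'Ann')")
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                     [(1, "A1", 2), (1, "B7", 1)])
    return conn

def publish(conn):
    """Map the flat tables to the nested structure the DTD prescribes."""
    root = ET.Element("orders")
    orders = conn.execute("SELECT id, customer FROM orders ORDER BY id").fetchall()
    for order_id, customer in orders:
        order = ET.SubElement(root, "order", id=str(order_id))
        ET.SubElement(order, "customer").text = customer
        rows = conn.execute(
            "SELECT sku, qty FROM items WHERE order_id = ? ORDER BY sku",
            (order_id,),
        ).fetchall()
        for sku, qty in rows:
            ET.SubElement(order, "item", sku=sku).text = str(qty)
    return root

published = publish(demo_connection())
```

The join between the two tables becomes element nesting in the output; the relational keys (`id`, `order_id`) determine which items end up under which order.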

  • Progress report on DCC research (funding period: -2 weeks)
      • Four new research fellows at Edinburgh:
          • Mags McGinley (legal practice): IP, copyright in databases
          • James Cheney (Cornell): programming languages, digital libraries, XML compression
          • Tasos Kementsietsidis (Toronto): data integration, P2P databases
          • Rajendra Bose (UCSB): Earth sciences data, workflow provenance in scientific data
      • At UKOLN: Michael Day, metadata and interoperability
      • At CCLRC: Shoaib Sufi, data portals and metadata
      • At Glasgow: position in metadata extraction advertised

  • Progress report on DCC research (continued)
      • Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.
      • Collaboration with biologists (EBI & Edinburgh) on data publishing and astronomers (Edinburgh) on XML manipulation & representation of large data sets.
      • First DCC research visitor (Michael Lesk)
      • Work with partners in progress on annotation DOIs
      • Please join us!!!

  • DCC has research positions in databases, digital curation, XML, web technology, fundamentals.
      • Top-rated department. World-class database group.
      • Good connections with logical foundations, scientific DBs, distributed computation (Grid)
      • Edinburgh is a great place!!
      • Contact Peter Buneman


