Introducing Phelix

an open source XML database system

Lou Burnard (HCU, Oxford)

Jakob Fix (Independent consultant)

lou.burnard@oucs.ox.ac.uk
jakob@free.fr

Phelix topics

Why an XML database anyway?
Background: design goals and context
Architecture
Implementation and functionality
Future plans

Text and data form a continuum

It's <date>Wednesday 13 June</date>

It's <date value="20010613">Wednesday 13 June</date>

It's <dateStruct value="20010613"><day type="name">Wednesday</day> <day type="number">13</day> <month>June</month></dateStruct>

Text is not a special kind of data. Data is a special kind of text.
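As a minimal sketch of this continuum, the same marked-up sentence can be read back either as running text or as a typed value. Python's standard library is used purely for illustration; the `<p>` wrapper and the function names `as_text` and `as_data` are assumptions, not part of the slides.

```python
import datetime
import xml.etree.ElementTree as ET

# The <p> wrapper is an assumption, added only to make the fragment well-formed.
FRAGMENT = (
    '<p>It\'s <dateStruct value="20010613">'
    '<day type="name">Wednesday</day> '
    '<day type="number">13</day> '
    '<month>June</month></dateStruct></p>'
)


def as_text(xml_source):
    """Read the fragment as text: concatenate all character data, ignoring tags."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext())


def as_data(xml_source):
    """Read the fragment as data: pull typed values out of the markup."""
    date_struct = ET.fromstring(xml_source).find("dateStruct")
    value = datetime.datetime.strptime(date_struct.get("value"), "%Y%m%d").date()
    return {
        "date": value,
        "day": int(date_struct.findtext("day[@type='number']")),
        "month": date_struct.findtext("month"),
    }


print(as_text(FRAGMENT))
print(as_data(FRAGMENT))
```

Neither reading is privileged: the markup only makes the second one possible without discarding the first.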

XML databases: the claim

In the old days, text and data were different
Now, with XML, you can have your cake and eat it too
  text can become data
  data can become text

Datacentric and docucentric worlds converge

A case in point: descriptive bibliography

What’s the author’s name?
What translators are there?
Which 20th c French works have more than 400 pages?
List titles containing fewer than 6 words

Perec, Georges
Life - a user’s manual. Collins, 1988.
Translated from the French [La vie mode d’emploi] by David Bellos.
xviii+581 pp.
841.941 Literature - French - 20th century
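Once the entry is marked up, each of these questions becomes a simple lookup. The sketch below is only an illustration: the element and attribute names (`bibl`, `translator`, `pages`, and so on) are invented for this example and are not taken from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup of the catalogue entry; all tag names are assumptions.
RECORD = """
<bibl lang="fr" century="20">
  <author>Perec, Georges</author>
  <title>Life - a user's manual</title>
  <publisher>Collins</publisher>
  <date>1988</date>
  <note>Translated from the French
    [<title type="orig">La vie mode d'emploi</title>]
    by <translator>David Bellos</translator>.</note>
  <extent><pages type="prelims">xviii</pages>+<pages type="main">581</pages> pp.</extent>
  <class code="841.941">Literature - French - 20th century</class>
</bibl>
"""

bibl = ET.fromstring(RECORD)

# What's the author's name?
author = bibl.findtext("author")

# What translators are there?
translators = [t.text for t in bibl.iter("translator")]

# Is this a 20th c French work with more than 400 pages?
main_pages = int(bibl.findtext("extent/pages[@type='main']"))
long_french_20c = (
    bibl.get("lang") == "fr" and bibl.get("century") == "20" and main_pages > 400
)

# Does the title contain fewer than 6 words? (the dash is not counted as a word)
title_words = bibl.findtext("title").replace("-", " ").split()
short_title = len(title_words) < 6

print(author, translators, long_french_20c, short_title)
```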

… even more so, for mss

long tradition of descriptive belle lettriste approach

necessitated by cultural complexity and lack of standardization

mss are unique objects, sometimes very valuable, sometimes the reverse…

…and spread across many different locations

for example

Lithuania, Vilnius, Lietuvos nacionalinė Martyno Mažvydo biblioteka F101-26

Jurij Fiodorovič Chrebtovičia (юрьи федорович хребтовича) Žemės pardavimo raštas

Vilnius, 1538 rugpjūčio 23 (kitoje lapo pusėje prierašai lenkų kalba)

Incipit: Я юрьи федорович хребтовича.

Dvarionis Jurijus Fiodorovičius Chrebtovičia parduoda Kareiviškių dvarą su tarnais (išvardyti) ir Ininkovskio dykynę savo seseriai Hanai Martinovnai Chreptovičia Martinovajai Podcentkovskajai Kareiviškių dvarą Jurijui Chrebtovičia padovanojo jo senelė iždininkienė Hana Andriejevaja Aleksandrovičia . Dvarą Jurijus Chrebtovičia tvarkė atskirai nuo kitų tėvonijos valdų, turėdamas teisę šį dvarą valdyti savo nuožiūra.

Visos dvaro valdymo ir palikimo teisės perduodamos Jurijaus Chrebtovičia seseriai.

Dovanojimo raštu nustatyta, kad dešimtinė nuo Kareiviškių dvaro turi būti mokama Papiškių dvaro (karaliaus dvaro iždininko, Vilniaus laikytojo Ivano Andriejevičiaus valdos) Šv. Mikalojaus bažnyčiai

Aatskiras lapas: pergamentas 387+90 x 660 Textas rašytas per visą lapą, pusustavis, pereinantis į greitraštį.

Tekstas rašytas vieno raštininko aiškiu greitraščiu, tas pats raštininkas parašė F101-28

… and so on, for two more pages

The MASTER project

Partners:
  Centre for Technology in the Arts, De Montfort University
  Humanities Computing Unit, Oxford
  Arnamagnæan Institute, Copenhagen
  Institut de recherche et d'histoire des textes, Paris
  Royal Dutch Library, The Hague
  Czech National Library, Prague

Funded under EU Libraries Programme Sept 1998 - July 2001

The MASTER plan

Develop a European standard for manuscript description compatible with other relevant standards

Implement demonstrator systems allowing
  distributed data capture
  integrated searchability

Disseminate and concertate

Consequences

emphasis on distributed data, from many different partners

short time scale
support issues

The <msDescription> element

What is a manuscript description?
  a text (MLE “record”)
  a bit of a text (MLE “crystal”)
  a description of a text? (MLE “header field”)

The answer depends on whether you’re making
  a finding aid
  a catalogue raisonné
  a digital surrogate

http://www.hcu.ox.ac.uk/TEI/Master/Reference/

Collections of msDescriptions

may be output from legacy systems
may be created de novo
traditional DBMS functionality
  concurrent, multiple update
  referential integrity
  resilience

… simple file system is inadequate

“Referential integrity”

authority files for
  language codes
  bibliographic sources
  persons referenced
  classification scheme
  etc.

essential for distributed collection, desirable for single large system
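A minimal sketch of what such a check might look like: every attribute that points into an authority file is tested against the set of keys that file defines. The attribute names (`langKey`, `personKey`), the key values, and the function name are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority files, reduced to sets of valid keys.
AUTHORITY_FILES = {
    "langKey": {"lit", "pol", "lat"},
    "personKey": {"p1", "p2"},
}

# Hypothetical record; the last element carries a key no authority file defines.
RECORD = (
    "<msDescription>"
    '<textLang langKey="lit"/>'
    '<name personKey="p1"/>'
    '<note langKey="xx"/>'
    "</msDescription>"
)


def dangling_references(xml_source, authority_files):
    """Return (element, attribute, value) for every key missing from its authority file."""
    problems = []
    for elem in ET.fromstring(xml_source).iter():
        for attr, value in elem.attrib.items():
            allowed = authority_files.get(attr)
            if allowed is not None and value not in allowed:
                problems.append((elem.tag, attr, value))
    return problems


print(dangling_references(RECORD, AUTHORITY_FILES))
```

In a distributed collection the same check has to run against shared authority files, which is why a plain file system is not enough.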

Design goals for Phelix

Open source
Document repository functionality
No document editing or updating
Multi-document searching using XML structure
Adaptable and customizable

Implementation issues

Re-use existing tools where possible
Functionality above performance
Assumes networked academic environment
Single DTD system

Why use an RDBMS?

mature, stable, portable, scalable
widespread, easy to integrate
foreign key access (usually) optimized
… but assembling XML fragments is inherently slow (see http://www.cs.wisc.edu/niagara/papers/vldb00XML.pdf)

OODBMSs cost serious money and cannot be freely relicensed

What is modelled in the RDBMS?

Not the semantics of the data but its XML representation

XML serializes a tree structure
Phelix models the tree, not its meaning

XML fragments

The Phelix architecture

[Diagram. Components: user agent, server/s, scripting language, validating parser, XSLT engine, RDBMS. Flows: Query, XML fragments.]

Current Phelix implementation

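[Diagram: the Phelix architecture with its current components]

scripting language: PHP
validating parser: expat/rxp
XSLT engine: sablotron
RDBMS: mySQL
user agent: IE5, Opera
server: janus.oucs.ox.ac.uk/master
flows: Query, HTML, XML, XML fragments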

Current functionality

You can
  upload a valid XML msDescription, or an archive of them
  validate (some) content against external authority files
  publish them to other partners
  search all published msDescriptions
  view and download selected msDescriptions using your own stylesheet
  save and review query results

You cannot validate or modify existing records

Storage of XML documents

1. The parser decomposes the document tree into atomic nodes

  elements
  pcdata fragments
  attribute-value pairs

2. Each node is stored as a row in the DBMS

3. Relationships between nodes are represented by pointers (aka foreign keys)
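The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the Phelix code: the single `node` table and its column names are hypothetical, and Python with an in-memory SQLite database stands in for the PHP and mySQL used by the current implementation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one row per node, parent_id is the pointer (foreign key).
SCHEMA = """
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),
    position  INTEGER NOT NULL,   -- order among the parent's children
    kind      TEXT NOT NULL,      -- 'element', 'pcdata' or 'attribute'
    name      TEXT,               -- element or attribute name
    value     TEXT                -- pcdata or attribute value
)
"""


def add_node(conn, parent_id, position, kind, name=None, value=None):
    """Insert one node row and return its id."""
    cur = conn.execute(
        "INSERT INTO node (parent_id, position, kind, name, value) "
        "VALUES (?, ?, ?, ?, ?)",
        (parent_id, position, kind, name, value),
    )
    return cur.lastrowid


def store(conn, elem, parent_id=None, position=0):
    """Decompose an element into atomic nodes and store each one as a row."""
    elem_id = add_node(conn, parent_id, position, "element", name=elem.tag)
    pos = 0
    for attr_name, attr_value in elem.attrib.items():
        add_node(conn, elem_id, pos, "attribute", attr_name, attr_value)
        pos += 1
    if elem.text:
        add_node(conn, elem_id, pos, "pcdata", value=elem.text)
        pos += 1
    for child in elem:
        store(conn, child, elem_id, pos)
        pos += 1
        if child.tail:  # text following the child belongs to this element
            add_node(conn, elem_id, pos, "pcdata", value=child.tail)
            pos += 1
    return elem_id


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
doc = ET.fromstring('<p>It\'s <date value="20010613">Wednesday 13 June</date></p>')
store(conn, doc)
rows = conn.execute(
    "SELECT id, parent_id, kind, name, value FROM node ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

Storing attributes and pcdata in the same table as elements keeps the sketch short; a real schema might well split them into separate tables.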

XML Queries

XML Query or XPath? The choice was unclear at design time

QueryExpression and Query objects encapsulated in PHP layer
access to ancestors, parents, attributes, content
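To illustrate how access to ancestors, attributes and content can be expressed over nodes stored as rows, here is a hedged sketch. It does not reproduce the QueryExpression and Query objects of the PHP layer; the `node` table, the sample rows, the element names below `msDescription`, and the three helper functions are all assumptions, and SQLite stands in for mySQL.

```python
import sqlite3

# Hypothetical node rows: (id, parent_id, kind, name, value).
ROWS = [
    (1, None, "element", "msDescription", None),
    (2, 1, "element", "history", None),
    (3, 2, "element", "date", None),
    (4, 3, "attribute", "value", "15380823"),
    (5, 3, "pcdata", None, "23 August 1538"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "kind TEXT, name TEXT, value TEXT)"
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", ROWS)


def elements_with_attribute(conn, elem_name, attr_name, attr_value):
    """Ids of elements with the given name carrying the given attribute value."""
    sql = (
        "SELECT e.id FROM node e JOIN node a ON a.parent_id = e.id "
        "WHERE e.kind = 'element' AND e.name = ? "
        "AND a.kind = 'attribute' AND a.name = ? AND a.value = ? "
        "ORDER BY e.id"
    )
    return [r[0] for r in conn.execute(sql, (elem_name, attr_name, attr_value))]


def ancestors(conn, node_id):
    """Names of all ancestor elements, nearest first, by following parent pointers."""
    sql = """
    WITH RECURSIVE up(id, parent_id, name, depth) AS (
        SELECT p.id, p.parent_id, p.name, 1
        FROM node n JOIN node p ON p.id = n.parent_id
        WHERE n.id = ?
        UNION ALL
        SELECT p.id, p.parent_id, p.name, up.depth + 1
        FROM up JOIN node p ON p.id = up.parent_id
    )
    SELECT name FROM up ORDER BY depth
    """
    return [r[0] for r in conn.execute(sql, (node_id,))]


def content(conn, node_id):
    """Concatenated pcdata directly inside the given element."""
    sql = "SELECT value FROM node WHERE parent_id = ? AND kind = 'pcdata' ORDER BY id"
    return "".join(r[0] for r in conn.execute(sql, (node_id,)))


hits = elements_with_attribute(conn, "date", "value", "15380823")
print(hits, ancestors(conn, hits[0]), content(conn, hits[0]))
```

Each step up the tree costs another join, which is one reason assembling fragments from a relational store tends to be slow.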

Interfaces

There’s nothing as powerful as a good metaphor
  The Walkthrough
  The Basket
  Picking and Choosing

Designing the user interface is harder than designing the engine

Other interfaces

Sorry, no full DOM support (yet)
Nodes are stored in database
Interface customization layer
  form design
  user supplied stylesheets
… much remains to be done

Future plans

Initial spec: <msDescription>s
Additional support for bibliography, onomastics
Optional stylesheet interface
Generalized interface
Mapping Asia project
Free text search component?