Introducing Phelix
an open source XML database system
Lou Burnard (HCU, Oxford)
Jakob Fix (Independent consultant)
[email protected]@free.fr
Phelix topics
Why an XML database anyway?
Background: design goals and context
Architecture
Implementation and functionality
Future plans
Text and data form a continuum
It's <date value="20010613">Wednesday 13 June</date>
It's <dateStruct value="20010613"><day type="name">Wednesday</day><day type="number">13</day><month>June</month></dateStruct>
Text is not a special kind of data. Data is a special kind of text.
It's <date>Wednesday 13 June</date>
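A minimal sketch of the continuum, using the first date example above: the same marked-up sentence can be read as running text or as a typed value. The wrapping `<s>` element and the two function names are assumptions made for illustration only.

```python
from datetime import date
import xml.etree.ElementTree as ET

# The <s> wrapper is hypothetical; the <date> element is the one on the slide.
SENTENCE = '<s>It\'s <date value="20010613">Wednesday 13 June</date></s>'

def read_as_text(xml):
    """Ignore the markup and return the running prose."""
    return "".join(ET.fromstring(xml).itertext())

def read_as_data(xml):
    """Ignore the prose and return the normalized value as a date object."""
    value = ET.fromstring(xml).find("date").get("value")
    return date(int(value[:4]), int(value[4:6]), int(value[6:8]))

print(read_as_text(SENTENCE))
print(read_as_data(SENTENCE).isoformat())
```

The text reading keeps the words a human wrote; the data reading keeps only what a program can sort and compare.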
XML databases: the claim
In the old days, text and data were different
Now, with XML, you can have your cake and eat it too
text can become data
data can become text
Datacentric and docucentric worlds converge
A case in point: descriptive bibliography
What’s the author’s name?
What translators are there?
Which 20th c French works have more than 400 pages?
List titles containing fewer than 6 words
Perec, Georges Life - a users manual. Collins, 1988. Translated from the French [La vie mode d’emploi] by David Bellos. xviii+581 pp. 841.941 Literature - French - 20th century
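As a rough illustration of how markup turns this record into something queryable, the sketch below encodes it in XML and answers three of the questions. The element and attribute names (`bibl`, `extent`, `pages`, `translator` and so on) are hypothetical, chosen for this example rather than taken from any particular DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical encoding of the catalogue record shown above.
RECORD = """
<bibl>
  <author>Perec, Georges</author>
  <title>Life - a users manual</title>
  <publisher>Collins</publisher> <date>1988</date>
  <note>Translated from the French [<title type="original">La vie mode d'emploi</title>]
    by <translator>David Bellos</translator></note>
  <extent pages="581">xviii+581 pp.</extent>
  <class code="841.941">Literature - French - 20th century</class>
</bibl>
"""

bibl = ET.fromstring(RECORD)

def author(b):
    return b.findtext("author")

def translators(b):
    # iter() searches the whole subtree, so the nested <translator> is found.
    return [t.text for t in b.iter("translator")]

def has_more_pages_than(b, n):
    return int(b.find("extent").get("pages")) > n

def title_word_count(b):
    # find/findtext only look at direct children, so this is the main title.
    return len(re.findall(r"\w+", b.findtext("title")))

print(author(bibl), translators(bibl), has_more_pages_than(bibl, 400))
```

None of these questions can be answered reliably from the unmarked string without guessing where one field ends and the next begins.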
… even more so, for mss
long tradition of descriptive belle lettriste approach
necessitated by cultural complexity and lack of standardization
mss are unique objects, sometimes very valuable, sometimes the reverse…
…and spread across many different locations
for example
Lithuania, Vilnius, Lietuvos nacionalinė Martyno Mažvydo biblioteka F101-26
Jurij Fiodorovič Chrebtovičia (юрьи федорович хребтовича) Žemės pardavimo raštas
Vilnius, 1538 rugpjūčio 23 (kitoje lapo pusėje prierašai lenkų kalba)
Incipit: Я юрьи федорович хребтовича.
Dvarionis Jurijus Fiodorovičius Chrebtovičia parduoda Kareiviškių dvarą su tarnais (išvardyti) ir Ininkovskio dykynę savo seseriai Hanai Martinovnai Chreptovičia Martinovajai Podcentkovskajai Kareiviškių dvarą Jurijui Chrebtovičia padovanojo jo senelė iždininkienė Hana Andriejevaja Aleksandrovičia . Dvarą Jurijus Chrebtovičia tvarkė atskirai nuo kitų tėvonijos valdų, turėdamas teisę šį dvarą valdyti savo nuožiūra.
Visos dvaro valdymo ir palikimo teisės perduodamos Jurijaus Chrebtovičia seseriai.
Dovanojimo raštu nustatyta, kad dešimtinė nuo Kareiviškių dvaro turi būti mokama Papiškių dvaro (karaliaus dvaro iždininko, Vilniaus laikytojo Ivano Andriejevičiaus valdos) Šv. Mikalojaus bažnyčiai
Atskiras lapas: pergamentas 387+90 x 660
Tekstas rašytas per visą lapą, pusustavis, pereinantis į greitraštį.
Tekstas rašytas vieno raštininko aiškiu greitraščiu, tas pats raštininkas parašė F101-28
… and so on, for two more pages
The MASTER project
Partners:
Centre for Technology in the Arts, De Montfort University
Humanities Computing Unit, Oxford
Arnamagnæan Institute, Copenhagen
Institut de recherche et d'histoire des textes, Paris
Royal Dutch Library, The Hague
Czech National Library, Prague
Funded under EU Libraries Programme Sept 1998 - July 2001
The MASTER plan
Develop a European standard for manuscript description compatible with other relevant standards
Implement demonstrator systems allowing
distributed data capture
integrated searchability
Disseminate and concertate
Consequences
emphasis on distributed data, from many different partners
short time scale
support issues
The <msDescription> element
What is a manuscript description?
a text (MLE “record”)
a bit of a text (MLE “crystal”)
a description of a text? (MLE “header field”)
The answer depends on whether you’re making
a finding aid
a catalogue raisonné
a digital surrogate
http://www.hcu.ox.ac.uk/TEI/Master/Reference/
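To make the idea concrete, here is a much reduced sketch of what an `<msDescription>` for the Vilnius charter shown earlier might look like, with a few lines of Python that derive a one-line finding-aid entry from it. The child element names are assumptions modelled loosely on the MASTER draft and should be checked against the reference above; `finding_aid_entry` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed structure; only <msDescription> itself is named on the slide.
MS = """
<msDescription>
  <msIdentifier>
    <country>Lithuania</country>
    <settlement>Vilnius</settlement>
    <repository>Lietuvos nacionalinė Martyno Mažvydo biblioteka</repository>
    <idno>F101-26</idno>
  </msIdentifier>
  <msHeading>
    <title>Žemės pardavimo raštas</title>
    <origPlace>Vilnius</origPlace>
    <origDate>1538</origDate>
  </msHeading>
</msDescription>
"""

def finding_aid_entry(xml):
    """Reduce a full description to location, shelfmark and title."""
    ms = ET.fromstring(xml)
    ident = ms.find("msIdentifier")
    where = ", ".join(ident.findtext(tag)
                      for tag in ("country", "settlement", "repository"))
    return f"{where} {ident.findtext('idno')}: {ms.findtext('msHeading/title')}"

print(finding_aid_entry(MS))
```

A catalogue raisonné or a digital surrogate would draw on far more of the description; the finding aid needs only the identifier and heading.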
Collections of msDescriptions
may be output from legacy systems
may be created de novo
require traditional DBMS functionality:
concurrent, multiple update
referential integrity
resilience
… simple file system is inadequate
“Referential integrity”
authority files for
language codes
bibliographic sources
persons referenced
classification scheme
etc.
essential for distributed collection, desirable for single large system
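A minimal sketch of such a check, assuming that references to authority files are carried in attributes: every attribute whose name matches a known authority list must hold a value present in that list. The attribute names `langKey` and `personKey`, the codes, and the function `dangling_references` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority files, reduced to sets of permitted keys.
AUTHORITY = {
    "langKey": {"LIT", "POL", "CHU"},
    "personKey": {"P001", "P002"},
}

def dangling_references(xml, authority):
    """Return (element, attribute, value) for each reference with no authority entry."""
    problems = []
    for elem in ET.fromstring(xml).iter():
        for attr, value in elem.attrib.items():
            if attr in authority and value not in authority[attr]:
                problems.append((elem.tag, attr, value))
    return problems

# Hypothetical description fragment: the second person key is not in the authority file.
DOC = """<msDescription>
  <textLang langKey="LIT"/>
  <name personKey="P001">Jurij Chrebtovičia</name>
  <name personKey="P999">Hana Aleksandrovičia</name>
</msDescription>"""

print(dangling_references(DOC, AUTHORITY))
```

In a distributed collection the authority lists would be shared between partners, which is why a record that passes locally can still fail on upload.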
Design goals for Phelix
Open source
Document repository functionality
No document editing or updating
Multi-document searching using XML structure
Adaptable and customizable
Implementation issues
Re-use existing tools where possible
Functionality above performance
Assumes networked academic environment
Single DTD system
Why use an RDBMS?
mature, stable, portable, scalable
widespread, easy to integrate
foreign key access (usually) optimized
… but assembling XML fragments is inherently slow (see http://www.cs.wisc.edu/niagara/papers/vldb00XML.pdf)
OODBMS cost serious money and cannot be freely relicenced
What is modelled in the RDBMS?
Not the semantics of the data but its XML representation
XML serializes a tree structure
Phelix models the tree, not its meaning
XML fragments
The Phelix architecture
[Diagram. Components: user agent; server/s running a scripting language, a validating parser, an XSLT engine and an RDBMS. Flows labelled: Query, XML fragments.]
Current Phelix implementation
[Diagram. Components: user agent (IE5, Opera); server janus.oucs.ox.ac.uk/master running PHP, expat/rxp, sablotron and mySQL. Flows labelled: Query, HTML, XML, XML fragments.]
Current functionality
You can
upload a valid XML msDescription, or an archive of them
validate (some) content against external authority files
publish them to other partners
search all published msDescriptions
view and download selected msDescriptions using your own stylesheet
save and review query results
You cannot
validate or modify existing records
Storage of XML documents
1. The parser decomposes the document tree into atomic nodes
elements
pcdata fragments
attribute-value pairs
2. Each node is stored as a row in the DBMS
3. Relationships between nodes are represented by pointers (aka foreign keys)
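The three steps above can be sketched as follows. This is not the Phelix schema, which the slides do not show: the single `node` table, its column names and the `store` function are assumptions, and SQLite stands in for the RDBMS so that the example runs with the Python standard library alone.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed schema: one row per node, with a pointer to the containing element.
SCHEMA = """
CREATE TABLE node (
    id     INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES node(id),  -- foreign key to the parent element
    kind   TEXT NOT NULL,                -- 'element', 'attribute' or 'pcdata'
    name   TEXT,                         -- element or attribute name
    value  TEXT                          -- attribute value or text content
);
"""

def add_node(conn, parent, kind, name=None, value=None):
    cur = conn.execute(
        "INSERT INTO node (parent, kind, name, value) VALUES (?, ?, ?, ?)",
        (parent, kind, name, value))
    return cur.lastrowid

def store(conn, elem, parent=None):
    """Store one element and, recursively, everything below it; return its row id."""
    elem_id = add_node(conn, parent, "element", elem.tag)
    for name, value in elem.attrib.items():
        add_node(conn, elem_id, "attribute", name, value)
    if elem.text and elem.text.strip():
        add_node(conn, elem_id, "pcdata", value=elem.text)
    for child in elem:
        store(conn, child, elem_id)
        # Text following a child element still belongs to the current element.
        if child.tail and child.tail.strip():
            add_node(conn, elem_id, "pcdata", value=child.tail)
    return elem_id

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
doc = ET.fromstring(
    '<dateStruct value="20010613"><day type="name">Wednesday</day>'
    '<day type="number">13</day><month>June</month></dateStruct>')
store(conn, doc)
for row in conn.execute("SELECT id, parent, kind, name, value FROM node ORDER BY id"):
    print(row)
```

Because rows are inserted in document order, sorting elements and pcdata by `id` recovers the original sequence, which is what reassembling an XML fragment requires.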
XML Queries
XML Query or XPath? The choice was unclear at design time
QueryExpression and Query objects encapsulated in PHP layer
access to ancestors, parents, attributes, content
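As a hedged illustration of the four kinds of access, the sketch below runs them as SQL over a node table with one row per element, attribute or pcdata fragment and a parent pointer, as the storage slide describes. It is written in Python with SQLite rather than PHP with mySQL, and the table layout and function names are assumptions; the actual QueryExpression and Query classes are not shown in the slides.

```python
import sqlite3

def demo_db():
    """Hand-built rows for the dateStruct example; layout is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER,"
                 " kind TEXT, name TEXT, value TEXT)")
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", [
        (1, None, "element", "dateStruct", None),
        (2, 1, "attribute", "value", "20010613"),
        (3, 1, "element", "day", None),
        (4, 3, "attribute", "type", "name"),
        (5, 3, "pcdata", None, "Wednesday"),
        (6, 1, "element", "day", None),
        (7, 6, "attribute", "type", "number"),
        (8, 6, "pcdata", None, "13"),
        (9, 1, "element", "month", None),
        (10, 9, "pcdata", None, "June"),
    ])
    return conn

def find(conn, name, attr=None, value=None):
    """Ids of elements called `name`, optionally filtered on one attribute."""
    if attr is None:
        sql = "SELECT id FROM node WHERE kind='element' AND name=? ORDER BY id"
        args = (name,)
    else:
        sql = ("SELECT e.id FROM node e JOIN node a ON a.parent = e.id "
               "WHERE e.kind='element' AND e.name=? "
               "AND a.kind='attribute' AND a.name=? AND a.value=? ORDER BY e.id")
        args = (name, attr, value)
    return [r[0] for r in conn.execute(sql, args)]

def attributes(conn, node_id):
    return dict(conn.execute(
        "SELECT name, value FROM node WHERE parent=? AND kind='attribute'", (node_id,)))

def content(conn, node_id):
    rows = conn.execute(
        "SELECT value FROM node WHERE parent=? AND kind='pcdata' ORDER BY id", (node_id,))
    return "".join(r[0] for r in rows)

def ancestors(conn, node_id):
    """Element names from the parent up to the root, via a recursive query."""
    sql = """
    WITH RECURSIVE up(id, parent, name, depth) AS (
        SELECT n.id, n.parent, n.name, 0 FROM node n
        WHERE n.id = (SELECT parent FROM node WHERE id = ?)
        UNION ALL
        SELECT n.id, n.parent, n.name, up.depth + 1
        FROM node n JOIN up ON n.id = up.parent
    )
    SELECT name FROM up ORDER BY depth
    """
    return [r[0] for r in conn.execute(sql, (node_id,))]
```

Each ancestor step is another lookup through the parent pointer, which gives a feel for why reassembling deep fragments from rows is slower than reading them from a file.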
Interfaces
There’s nothing as powerful as a good metaphor
The Walkthrough
The Basket
Picking and Choosing
Designing the user interface is harder than designing the engine
Other interfaces
Sorry, no full DOM support (yet)
Nodes are stored in database
Interface customization layer
form design
user supplied stylesheets
… much remains to be done
Future plans
Initial spec: <msDescription>s
Additional support for bibliography, onomastics
Optional stylesheet interface
Generalized interface
Mapping Asia project
Free text search component?