XML Data: From Research to Standards

ICDE 2001, Heidelberg, Germany. D. Florescu, J. Siméon


Daniela Florescu, Propel

Jérôme Siméon, Bell Laboratories


Data and the Web: A bit of history

• Research:
> 1950’s: Lisp [McCarthy]
> 1960’s: Tree languages [Büchi]
> 1970’s: Relational DBs [Codd]
> 1990: Graphlog [Univ. Toronto]
> 1994: O2 extensions [INRIA]
> 1995: Tsimmis & OEM [Stanford]
> 1995: UnQL [UPenn]

Need to handle irregular Web data. Use graph data models.

• Internet industry:
> 1957: Sputnik launches ARPA
> 1972: First demonstration of ARPANET
> 1989: Number of hosts breaks 100,000
> 1991: CERN releases the World Wide Web

HTML as the support for information.

> 1997: 20 million hosts, 1 million Web sites
> 1998: W3C releases XML to represent information on the Web

XML provides a syntax for irregular textual Web information.


The secret of HTML success

• Everybody can write it:
> HTML is simple
> HTML is textual: it is human readable, you can use any editor, ...

• Everybody can read it:
> HTML is portable on any platform
> The browser is the universal application

• It connects pieces of information together:
> Through hypertext links


But new applications = new needs

• Infomediaries:
– Search engines
– Web portals
– Digital libraries
– Virtual enterprises

• Electronic services:
– On-line catalogs and procurement
– Comparison shoppers
– Market places

• Scientific applications
• Manufacturing engineering
etc.

More than HTML: data on the Web

More than the browser: applications on the Web


The Secret of XML Popularity

It looks like HTML...
> Simple, familiar, easy to learn, human-readable
> Universal and portable
> Supported by the W3C: trusted and quickly adopted by the industry

…but it’s more than HTML!
> Flexible: you can represent any information
> Extensible: you can represent it the way you want!

    <book>
      <title> Foundations… </title>
      <author> Abiteboul </author>
      <author> Hull </author>
      <author> Vianu </author>
      <publisher> Addison Wesley </publisher>
      <year> 1995 </year>
    </book>
    …


XML Is Only the Beginning...

• How do you build applications ?
> There is an urgent need for XML tools

• Designing XML tools is a data management problem:
> XML 1.0 to describe structured documents
~ Syntax for trees
> XML data models to describe the information content
~ Data model for trees
> XML schemas to describe the structure of information
~ Data definition language for trees
> XML languages to describe information processing
~ Data manipulation language for trees


About the Tutorial

• XML through database glasses

• Contains:
> Up-to-date information about standards
> Relationship with research
> Convergences and divergences

• Divided into 4 parts:
1. Introduction to XML 1.0
2. Data models
3. Schema languages
4. Query languages

Please, please, please, ask questions!


Part I: XML 1.0


About the W3C

• Membership organization

• Different types of groups inside the W3C:
– Working groups
– Interest groups
– Coordination groups

• Status of W3C documents:
– Note
– Working draft
– Last Call
– Candidate/proposed recommendation
– Recommendation ~ Standard


XML activities inside W3C

• Core XML
> eXtensible Markup Language (XML 1.0), namespaces, Infoset

• XML Linking
> XML Pointer Language (XPointer), XML Linking language

• XML Schema

• XML Query
> XML Data Model, Algebra and Query Language

• Document Object Model

• XSL
> XPath
> XSLT/XSL: Transformation and stylesheet language


XML 1.0: Well-formed documents

    <book year="1967">
      <title>The politics of experience</title>
      <author>R.D. Laing</author>
      <ref isbn="1341-1444-555"/>
      <section>
        The great and true Amphibian, whose nature is disposed to…..
        <title>Persons and experience</title>
        Even facts become...
      </section>
      …
    </book>

• An XML document is composed of:
> markup: elements, attributes
> text: #PCDATA, CDATA

• Well-formed document:
> follows the XML lexical conventions
> contains properly nested elements with a single root element
> can contain empty elements, mixed text and elements
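The well-formedness rules listed above can be tried out by handing a few fragments to a non-validating parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the tutorial.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested, single root, one empty element, mixed text and elements.
good = ('<book year="1967"><title>The politics of experience</title>'
        '<ref isbn="1341-1444-555"/>'
        '<section>Some text <title>Persons and experience</title> more text'
        '</section></book>')

# Improper nesting: <book> is closed before <title>.
bad_nesting = '<book><title>The politics of experience</book></title>'

# More than one root element.
two_roots = '<book/><book/>'

print(is_well_formed(good), is_well_formed(bad_nesting), is_well_formed(two_roots))
```

Note that this only checks well-formedness; `ElementTree` does not validate against a DTD.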


XML 1.0: Valid documents

    <?xml version="1.0"?>
    <!DOCTYPE book [
      <!ELEMENT book (title, author*, publisher?, section+)>
      <!ATTLIST book year CDATA #IMPLIED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT section (#PCDATA | title | section)*>
    ]>
    ...

• A valid XML document conforms to a Document Type Definition (DTD):
> grammar for the document
> constraints on the structure of elements, attributes, entities, notations...
> a DTD is optional

(We will see more about DTDs in the schema part of the tutorial)


Some additional features

• General entities &myentity;
> Declared as part of XML 1.0 or in a DTD
> Used to escape characters, as macros for pieces of documents
    &amp; = &
> An XML document contains Unicode characters
    &lt; = &#60; = <

• Parameter entities %myentity;
> Declared in a DTD, used as macros for pieces of DTDs

    <!ENTITY % macro "publisher (#PCDATA)">
    …
    <!ELEMENT %macro;>
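Predefined entities and numeric character references are expanded by the parser, so the application only ever sees the characters they stand for. A small sketch with Python's standard `xml.etree.ElementTree`; the element names `t`, `a`, `b`, `c` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# &amp; and &lt; are predefined entities; &#60; is a numeric character
# reference for the same character as &lt;.
doc = '<t><a>War &amp; Peace</a><b>&lt;</b><c>&#60;</c></t>'
root = ET.fromstring(doc)

title = root.find('a').text
lt_named = root.find('b').text
lt_numeric = root.find('c').text

print(title)
print(lt_named == lt_numeric == '<')
```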


Even more additional features

• Namespaces mynames:name
> a set of names identified by a URI
> tags and attribute names become qualified names (QName)

    <myns:section xmlns:myns="http://caravel.inria.fr/mySchema">
      <myns:title>Persons and experience</myns:title>
    </myns:section>

• Processing instructions
> to embed processing in a document (e.g. Java applet in HTML)

• Comments
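To see what a qualified name looks like to an application, the namespace example can be parsed with Python's standard `xml.etree.ElementTree`, which exposes each name as `{namespace-URI}local-name`. This is only a sketch; the prefix `any` used for the lookup is an arbitrary choice made here to show that the prefix itself carries no meaning.

```python
import xml.etree.ElementTree as ET

doc = ('<myns:section xmlns:myns="http://caravel.inria.fr/mySchema">'
       '<myns:title>Persons and experience</myns:title>'
       '</myns:section>')
root = ET.fromstring(doc)

# Lookups are done by namespace URI; the prefix is only a local shorthand.
NS = {'any': 'http://caravel.inria.fr/mySchema'}
title = root.find('any:title', NS)

print(root.tag)
print(title.text)
```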




Part II: Data Model


Why a data model for XML ?

For old & well-known (but good!) reasons

• As a support for physical/logical independence
> XML can be stored in files, a native XML repository, a relational database
> XML can be virtual, as a view of a repository, integrated sources
> XML can be in memory, using data structures in C, C++, Java, etc.
> XML can be streamed between processes

• To describe information content of XML documents
> to agree and reason about information content, preservation

• To define semantics of operations:
> equality, etc.


But XML has specifics

• Serialization syntax

• Some information exists only after schema validation
> price is not a string but a decimal value
> refs is not a string but a list of references

    <xsd:attribute name="price" type="xsd:decimal"/>
    <xsd:attribute name="bookid" type="xsd:ID"/>
    <xsd:attribute name="refs" type="xsd:IDREFS"/>

    <book bookid="b1" price="10.50">
      <title>War &amp; Peace</title>
      <author>Tolstoi</author>
      <biblio refs="b1 b2 b3"/>
    </book>

• One more motivation for a data model: to isolate the user from syntactic details of XML
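The gap between what a plain parser reports and what is known after schema validation can be sketched in a few lines of Python. The parser below (standard `xml.etree.ElementTree`) only returns strings; the "typed" view is simulated by hand, under the assumption that `price` is declared as `xsd:decimal` and `refs` as `xsd:IDREFS`. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

doc = ('<book bookid="b1" price="10.50">'
       '<title>War &amp; Peace</title><author>Tolstoi</author>'
       '<biblio refs="b1 b2 b3"/></book>')
book = ET.fromstring(doc)

# Without a schema, every attribute value is just a string.
raw_price = book.get('price')
raw_refs = book.find('biblio').get('refs')

# Hand-written stand-in for the post-validation view.
typed_price = Decimal(raw_price)   # xsd:decimal
typed_refs = raw_refs.split()      # xsd:IDREFS: whitespace-separated list

print(repr(raw_price), typed_price == Decimal('10.5'))
print(typed_refs)
```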


Existing data models

• Graph and tree models used in research

• Document Object Model (DOM)
> status: recommendation
> programmatic interface for XML (with an object-oriented flavor)

• XML Information Set (Infoset)
> describes the information content exported by XML processors
> can be generated after parsing or after validation

• XML languages’ data models:
> required for language semantics
> XPath: the recommendation has its own data model
> XML Query Data Model: working draft


Semistructured model

• Graph based, unordered, edge-labeled (here OEM)
> But XML is ordered, tree based
> Node-labeled seems more natural (e.g., like in DOM)

[Figure: OEM graph of the bibliography example. Objects: &b0, &b1, &b2, &b3. Edge labels: Bib, book, title, author, price, publisher, biblio, refs, references. Leaf values: “War & Peace”, “Tolstoi”, 10.50.]


Ordered model

• Node-labeled, ordered trees, with references (YAT)
> But what about attributes (unordered!), namespaces, processing instructions, etc. ?

[Figure: ordered, node-labeled tree of the bibliography example. Node identifiers: b0: (bib), b1:, b2:, b3: (book). Node labels: bib, book, title, author, price, biblio, refs. Leaf values: “War & Peace”, “Tolstoi”, 10.50, and the references &b1 &b2 &b3 under refs. The remaining books are elided.]


XML Infoset

• Specifies a description of information in a well-formed XML document

• Abstract way to think about XML data

• Other processors (e.g. XML Schema) can contribute information

Here is an example in a made-up syntax:

    b1 = Element [ local name = “book”;
           children = [ Element [ local name = “title” ... ];
                        Element [ local name = “author” ... ];
                        ... ]
           attributes = [ Attribute [ local name = “price”;
                            children = [ Character [ code = ‘1’ ];
                                         Character [ code = ‘0’ ];
                                         Character [ code = ‘.’ ];
                                         Character [ code = ‘5’ ];
                                         Character [ code = ‘0’ ] ];
                            attribute type = “xsd:decimal” ]
                          ... ] ]


XML Query Data Model

• A node-labeled, tree model with references
> Very close to XPath data model

• Generated after validation
> provides also pointers to schema information

• Uses a functional notation
> no explicit data structure

• Defines a mapping from post-schema validated Infoset to XML Query Data Model
> preserves original infoset (e.g., characters)


XML Query Data Model

• Nodes
    Node = DocNode | ElemNode | AttrNode | ValueNode
         | NSNode | PINode | CommentNode | InfoItemNode

• XML Schema primitive types
    string, boolean, ID, IDREF, decimal, QName, ...

• Collections
    sequence  [T]
    bag       {T}
    union     T1 | T2

• References
    ref(T)


Constructors & accessors

• Attribute constructor
    attrNode : (QNameValue, ValueNode) -> AttrNode
    ValueNode = StringValue | DecimalValue | ...
    qnameValue : (uriReference | null, string) -> QNameValue

• Attribute accessors
    name : AttrNode -> QNameValue
    value : AttrNode -> ValueNode
    type : AttrNode -> ElemNode

• Example: <book price="10.50"/>
    A1 = attrNode(qnameValue(null, “price”), decimalValue(10.50))
    name(A1) = qnameValue(null, “price”)
    value(A1) = decimalValue(10.50)
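The functional style of the data model (constructors and accessors, no exposed data structure) can be mimicked in Python as a rough sketch. The classes below are an assumption made for the illustration: the data model itself prescribes only the function signatures, `null` is rendered as `None`, and the `type` accessor is left out because it needs schema information.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Union

@dataclass(frozen=True)
class QNameValue:
    uri: Optional[str]      # None plays the role of "null"
    local: str

@dataclass(frozen=True)
class StringValue:
    v: str

@dataclass(frozen=True)
class DecimalValue:
    v: Decimal

ValueNode = Union[StringValue, DecimalValue]

@dataclass(frozen=True)
class AttrNode:
    _name: QNameValue
    _value: ValueNode

# Constructors
def qnameValue(uri: Optional[str], local: str) -> QNameValue:
    return QNameValue(uri, local)

def decimalValue(lexical: str) -> DecimalValue:
    return DecimalValue(Decimal(lexical))

def attrNode(qname: QNameValue, val: ValueNode) -> AttrNode:
    return AttrNode(qname, val)

# Accessors
def name(a: AttrNode) -> QNameValue:
    return a._name

def value(a: AttrNode) -> ValueNode:
    return a._value

# <book price="10.50"/>
A1 = attrNode(qnameValue(None, "price"), decimalValue("10.50"))
print(name(A1), value(A1))
```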


XML Data Model: Conclusion

• Research focuses on simple formal models

• Many standards related to the need for a data model

• XML Query Data Model reconciles both worlds
> Complete with respect to XML
> Simple design with a clear connection to a formal model: ordered trees, node-labeled, with references
> Clear relationship to other W3C standards: mapping to XML Infoset; based on XPath + typed values and unordered collections
> Less clear relationship with DOM


Part III: Data Definition Language


Why a DDL for XML ?

For old & well-known (but good!) reasons

• As an ontology & modeling tool:
> to describe the structure of information: entities, relationships...
> to share common descriptions between actors/applications
> to guide query formulation and application development

• For error detection & safety:
> to verify that documents comply with what the application expects
> to make sure that the application accesses valid data
> to enforce safe operations (e.g., don’t do float arithmetic on trees!)
> to check that compositions of operations make sense

• For performance:
> to design storage (saving space, improving clustering, etc.)
> to process queries (algebraic laws, rewriting path expressions, etc.)


But XML deals with new needs

• XML data created from legacy repositories
> Need to capture schemas from heterogeneous sources
– Relational schemas: Simple but with integrity constraints
– Object-oriented schemas: Typed references, Inheritance...
– Document grammars: Regular expressions, mixed text and structure

• XML used on the Web, for data exchange
> Need to remain flexible
– Web sources: From strict schemas to well-formed documents (smooooothly........)
– Many applications use the same information: We should be able to type the same document in multiple ways


Existing schema languages

• DTDs (W3C recommendation as part of XML 1.0)
> powerful for documents: regular expressions, mixes of text and structure
> limited for other applications: cannot capture relational or object schemas

• XML Schema (Candidate recommendation)
> Many new features: data types, forms of subtyping, etc.
> More powerful but quite complex

• Schemas for unordered semistructured models:
> Data guides, Graph schemas, using Datalog
> Used for optimization, schema inference from data

• Schemas for ordered tree models
> Regular tree grammars, YAT, lotos, XDuce, Relax, TRex etc.
> Used for optimization, type checking and inference from queries


DDL Roadmap

3.1. Describing atomic values
> integer, string, float, date, images, etc.

3.2. Describing structures
> elements: tag-coupled approach vs. tag-decoupled approach
> attributes

3.3. More semantics
> identity, references, relationships intra or inter documents
> isa: notion of inheritance...

3.4. Simplifying schema reuse
> import/export abilities
> refinement of existing descriptions


Values in XML: easy ?

• DTD says it’s easy:
Recipe: #PCDATA = string
        CDATA = other strings, ...
I.e.: Everything is a string
Unfortunately: Strings are not a panacea...

• Database research says it’s easy:
Recipe: Take a data model with atomic types
        Each value is in a different type...
I.e.: Don’t deal with syntax but data model
Unfortunately: XML = file = syntax


Values in XML: many issues...

• Addressing numerous needs:
> float, string, int, date, URI, telephone number, gif, applet, etc.

• Living with XML 1.0 syntax
> The same lexical representation can correspond to several values

    <book><title>Haystacks at Chailly</title><author>Monet</author>
          <date>1865</date><price>1865</price></book>

> The same value can have several lexical representations

    <book><ref>Monet1865</ref><in_stock>true</in_stock></book>
    <book><ref>Monet1865</ref><in_stock>1</in_stock></book>

> binary formats (images, etc.) must be serialized in a portable way

• Compatible with other standards

• Compatible with internationalization
> World Wide Web!


XML Schema Part 2: Datatypes

• Defines 14 built-in types (basic types)
> general purpose types
> types for compatibility with DTDs

• Relies on other existing standards whenever possible
> IEEE 754-1985 for floats
> UCS [ISO 10646] & Unicode for internationalization
> ISO 8601 for dates

• Gives the ability to define new types (derived types)

• Single lexical representation for many values ?
> document is interpreted with respect to a given schema
> if no schema, the value is given the type string


Datatypes: base types

• Base types cover essential needs
> “classic” values: string, boolean, float, double, decimal
> temporal values: timeDuration, recurringDuration
> binary values: binary
> Web-related types: uriReference, QName
> DTD types: ID, IDREF, ENTITY, NOTATION

• One value for several syntaxes
> Each base type has a set of values (value space)
> Values may have several lexical representations (lexical space)
> Equality and order are defined in terms of the value space
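The distinction between lexical space and value space can be made concrete with a short sketch. It uses Python's `Decimal` as a stand-in for the decimal value space and a hand-written table for the boolean lexical forms; the function name `parse_boolean` is an assumption made for the illustration.

```python
from decimal import Decimal

def parse_boolean(lexical: str) -> bool:
    """Map the lexical forms true, false, 1, 0 to a boolean value."""
    forms = {'true': True, '1': True, 'false': False, '0': False}
    return forms[lexical.strip()]

# Several lexical representations, one value.
same_bool = parse_boolean('true') == parse_boolean('1')
same_decimal = Decimal('12') == Decimal('12.00')

# Comparing the strings themselves would give a different answer:
# equality is decided in the value space.
different_strings = '12' != '12.00'

print(same_bool, same_decimal, different_strings)
```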


Base types: examples

Datatype           Examples                   Notes
string             Victor Hugo
boolean            true, false, 1, 0
float              12, 12.00, 1.2E-2, INF     m × 2^e where m < 2^24, -149 <= e <= 104
double             12, 12.00, 1.2E-2, INF     m × 2^e where m < 2^53, -1075 <= e <= 970
decimal            0, -0, 1.23, 123.4         Arbitrary precision
timeDuration       P29Y2MT1H30M1.3S           29 years, 2 months, 1 hour, 30 minutes, 1.3 seconds
recurringDuration  --08-29T19:05:00           August 29th at 7:05pm every year
uriReference       http://www.w3.org/


Datatypes: facets

• Each base type has facets (read: properties)

• Some facets are fundamental
> equality, order
> bounded, cardinality, numeric

• Some facets are constraining
> length, minLength, maxLength: for string, binary or lists
> maxInclusive, maxExclusive, minInclusive, minExclusive
> precision, scale: for decimal numbers
> encoding: hex or base64 for binary
> enumeration, pattern
> duration, period


Datatypes: derived types

• One can derive types by restriction of facets

    <simpleType name="integer" base="xsd:decimal">
      <scale value="0"/>
    </simpleType>

    <simpleType name="int" base="xsd:integer">
      <maxInclusive value="2147483647"/>
      <minInclusive value="-2147483648"/>
    </simpleType>

• One can derive types by list

    <simpleType name="IDREFS" base="xsd:IDREF" derivedBy="xsd:list"/>

• XML Schema offers predefined derived types
> integer, nonPositiveInteger, int, date, year, century, timeInstant, language, etc.
> IDREFS, NMTOKENS, etc.


Now you can practice...

> Using a range facet

    <simpleType name="auctionprice" base="xsd:decimal">
      <minInclusive value="10"/>
    </simpleType>

> Using an enumeration facet

    <simpleType name="booktype" base="xsd:string">
      <xsd:enumeration value="Book"/>
      <xsd:enumeration value="Collection"/>
      ...

> Using a pattern facet

    <xsd:simpleType name="isbn" base="xsd:string">
      <xsd:pattern value="ISBN \d{10}"/>
    </xsd:simpleType>

> Using a list type

    <xsd:simpleType name="auctions" base="xsd:auctionprice" derivedBy="xsd:list"/>

> etc.
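What each of these four facets accepts or rejects can be sketched as plain predicates over lexical forms. This is a hand-written approximation in Python, not an XML Schema processor; the function names are hypothetical, and the enumeration is assumed to contain only the two values shown on the slide (the original list is elided).

```python
import re
from decimal import Decimal, InvalidOperation

def is_auctionprice(lexical: str) -> bool:
    """Range facet: a decimal with minInclusive = 10."""
    try:
        return Decimal(lexical) >= Decimal('10')
    except InvalidOperation:
        return False

def is_booktype(lexical: str) -> bool:
    """Enumeration facet (only the two values visible in the example)."""
    return lexical in {'Book', 'Collection'}

ISBN_PATTERN = re.compile(r'ISBN \d{10}')

def is_isbn(lexical: str) -> bool:
    """Pattern facet: the pattern must match the whole string."""
    return ISBN_PATTERN.fullmatch(lexical) is not None

def is_auctions(lexical: str) -> bool:
    """List type: whitespace-separated items, each an auctionprice."""
    return all(is_auctionprice(item) for item in lexical.split())

print(is_auctionprice('10'), is_auctionprice('9.99'))
print(is_booktype('Book'), is_booktype('Magazine'))
print(is_isbn('ISBN 0201537710'), is_isbn('ISBN 123'))
print(is_auctions('10 12.5 30'), is_auctions('10 5'))
```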


Describing Values: Conclusion

• Not addressed in research

• XML Schema Part 2: Datatypes does a good job
> Quite complete
> Deals with complex requirements (e.g., internationalization)

• Defines values but not operations!
> Needed by XPath, XQuery…


Describing XML structures

• element names
> with the names themselves: book, title, etc.
> possibly with wildcards: ~ = any tag, !a = not a, etc.

• element children
> using regular expressions

• element attributes
> unordered attribute-value pairs

• Main question: types vs. element names
> does the element name determine the type ?
> tag-coupled types vs. tag-decoupled types


Coupled types

• Approach taken by DTDs
> two elements with the same name always have the same type
> children = regular expression over elements

    <!ELEMENT book (title, author+, price, publisher, section, conclusion?)>
    <!ELEMENT title (#PCDATA)>
    ....
    <!ELEMENT author (name, affiliation)>
    <!ELEMENT name (first, last)>
    <!ELEMENT first (#PCDATA)>
    ....
    <!ELEMENT publisher (name, address)>
    ...

• Properties
> easy to parse: => no depth look-ahead
> no closure under union, no local names allowed
> cannot express relational, object-oriented schemas


Decoupled types

• Approach taken by YAT, XDuce, lotos, etc.
> types are decoupled from element names
> children are defined by regular expressions over types
> different types can have the same tag

    type Book = book [ Title, Author+, Price, Publisher, Section, Conclusion? ]
    type Title = title [ String ]
    type Author = author [ Name, Affiliation ]
    type Name = name [ first [ String ], last [ String ] ]
    ...
    type Publisher = publisher [ PName, Address ]
    type PName = name [ String ]

• Properties
> equivalent to regular tree grammars
> closure under intersection, complement, union...
> more precise type for documents and queries
> harder to parse (might require look-ahead and backtracking)
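The key point, that two types (`Name` and `PName`) can share the tag `name`, can be checked with a tiny matcher. The sketch below is an illustrative Python encoding chosen here, not the YAT or XDuce implementation: values are strings or `(tag, children)` pairs, types are nested tuples, and matching simply explores all split points, which is the look-ahead and backtracking mentioned above. It assumes type recursion only occurs underneath an element.

```python
# Types: ('str',) | ('ref', name) | ('elem', tag, type)
#      | ('seq', t1, t2, ...) | ('alt', t1, t2, ...) | ('star', t)
SCHEMA = {
    'Name':  ('elem', 'name', ('seq', ('elem', 'first', ('str',)),
                                      ('elem', 'last', ('str',)))),
    'PName': ('elem', 'name', ('str',)),
}

def ends(t, items, i, schema):
    """Positions j such that items[i:j] matches type t."""
    kind = t[0]
    if kind == 'str':
        ok = i < len(items) and isinstance(items[i], str)
        return {i + 1} if ok else set()
    if kind == 'ref':
        return ends(schema[t[1]], items, i, schema)
    if kind == 'elem':
        if i < len(items) and isinstance(items[i], tuple):
            tag, children = items[i]
            # The element matches if its tag agrees and ALL children match.
            if tag == t[1] and len(children) in ends(t[2], children, 0, schema):
                return {i + 1}
        return set()
    if kind == 'seq':
        positions = {i}
        for part in t[1:]:
            positions = {j for p in positions
                         for j in ends(part, items, p, schema)}
        return positions
    if kind == 'alt':
        return set().union(*(ends(part, items, i, schema) for part in t[1:]))
    if kind == 'star':
        seen, frontier = {i}, {i}
        while frontier:
            frontier = {j for p in frontier
                        for j in ends(t[1], items, p, schema)} - seen
            seen |= frontier
        return seen
    raise ValueError(f'unknown type constructor {kind!r}')

def has_type(value, type_name, schema=SCHEMA):
    return 1 in ends(('ref', type_name), [value], 0, schema)

author_name = ('name', [('first', ['Serge']), ('last', ['Abiteboul'])])
publisher_name = ('name', ['Addison Wesley'])

print(has_type(author_name, 'Name'), has_type(author_name, 'PName'))
print(has_type(publisher_name, 'PName'), has_type(publisher_name, 'Name'))
```

A DTD cannot make this distinction: with tag-coupled types there is exactly one declaration for `name`.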


Decoupled types cont’d

• They are simple to define
> basic entities: datatypes, tags, type names
> one construct: types

    schema ::= type type_name = type
               .........
    type   ::= String | Boolean | ...   (* datatypes *)
             | type_name                (* type name *)
             | tag [ type ]             (* element *)
             | ~ [ type ]               (* element with wildcard *)
             | type, type               (* sequence *)
             | type | type              (* union *)
             | type*                    (* Kleene star *)


Decoupled types cont’d

• They can easily describe mixed content

    type Section = section [ title [ String ], Body ]
    type Body = content [ (b [ Body ] | footnote [ String ] | Section | String)* ]

• They can easily describe all well-formed documents

    type UrScalar = (String | Boolean | Float | Double ...)
    type UrTree = UrScalar | ~[ UrTree* ]

• They support a notion of subtyping via inclusion

    type Body2 = content [ String, (b [ String ] | footnote [ String ] | String)*, Section* ]
    Body2 <: Body <: UrTree

> all documents of type Body2 are also of type Body and UrTree

• But they can be ambiguous

    type Section2 = section [ title [ String ], Body2*, Body* ]

> deciding between Body and Body2 can be expensive


Decoupled types & full XML

• How do you describe attributes ?

    type Book = book [ @isbn [ String ], Title, Author+, Price, Publisher, Section, Conclusion? ]

> but attributes are unordered, without duplicates
> they do not interact with the children of the element
> they cannot contain complex values

• How do you describe references ?
> Like in object schemas [Cluet et al 1998]:

    type Author = author [ name [ first [ String ], last [ String ] ] ]
    type Book = book [ title [ String ], &Author+, &Publisher ]
    type Publisher = publisher [ name [ String ] ]

> but it’s even harder to parse because of cycles [Beeri, Milo 1999]

• How do you deal with XML specifics ?
> entities, processing instructions, namespaces, serialization, etc.


What about XML Schema ?

• Tries to get the expressive power of decoupled types + the ease of parsing of coupled types

• Advanced features: “subtyping”, constraints...

• Deals with all the specifics of XML

• XML Schema syntax is in XML

Results in a pretty complex specification

    <xsd:element name="book">
      <xsd:complexType>
        <xsd:element name="title" type="xsd:string"/>
        <xsd:element name="author" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:element name="first" type="xsd:string"/>
            <xsd:element name="last" type="xsd:string"/>
          </xsd:complexType>
        </xsd:element>
        ………
      </xsd:complexType>
    </xsd:element>


Element & attribute declarations

• Element decl. ~ associate element names to types
> have a name and their content is described by a type

• Attribute decl. ~ associate attribute names to types
> have a name and contain an atomic value
> can be required or optional
> can only appear inside elements (through complex types)

Each declaration is followed by the equivalent type notation:

    <xsd:element name="title" type="xsd:string"/>
        ~ title [ String ]

    <xsd:element name="affiliation" type="publisher"/>
        ~ affiliation [ Publisher ]

    <xsd:attribute name="price"/>
        ~ @price [ String ]?

    <xsd:attribute name="auctionhistory" type="auctions" use="required"/>
        ~ @auctionhistory [ Auctions ]
          type Auctions = Decimal*


Model groups

• Define content models (i.e., the type for the children of an element)
~ equivalent to regular expressions over elements

    <xsd:sequence>
      <xsd:element name="title" type="Title"/>
      <xsd:element name="price" type="Price"/>
    </xsd:sequence>
        ~ title[Title], price[Price]

    <xsd:choice>
      <xsd:element name="publisher" type="Publisher"/>
      <xsd:element name="editor" type="Author"/>
    </xsd:choice>
        ~ ( publisher[Publisher] | editor[Author] )

    <xsd:sequence minOccurs="0" maxOccurs="unbounded">
      <xsd:element name="book" type="Book"/>
    </xsd:sequence>
        ~ book[ Book ]*

    <xsd:all>
      <xsd:element name="title" type="Title"/>
      <xsd:element name="price" type="Price"/>
    </xsd:all>
        ~ (title[Title], price[Price]) | (price[Price], title[Title])


Complex type definitions

> they contain a content model and attribute declarations
> they can be empty
> they can be recursive
> they can be mixed (i.e., strings + sub-elements)

    <xsd:complexType name="Book">
      <xsd:sequence>
        <xsd:element name="title" type="xsd:string"/>
        <xsd:element name="author" maxOccurs="unbounded" type="AuthorName"/>
      </xsd:sequence>
      <xsd:attribute name="isbn" type="xsd:string"/>
    </xsd:complexType>
        ~ type Book = @isbn [ String ], title [ String ], author [ Name ]+

    <xsd:complexType name="RefBib" content="empty">
      <xsd:attribute name="refto" type="xsd:IDREF"/>
    </xsd:complexType>
        ~ type RefBib = @refto [ &UrTree ]

    <xsd:complexType name="Body" content="mixed">
      <xsd:element name="b" type="Body" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:complexType>
        ~ type Body = (b[Body] | String)*


Some feature interactions

• Local element restrictions
> local elements with the same name can have different types

    <xsd:element name="author">
      <xsd:complexType>
        <xsd:element name="name" type="AuthorName"/>
      </xsd:complexType>
    </xsd:element>
        ~ type Author = author [ name [ AuthorName ] ]

    <xsd:element name="publisher">
      <xsd:complexType>
        <xsd:element name="name" type="xsd:string"/>
        ...
      </xsd:complexType>
    </xsd:element>
        ~ type Publisher = publisher [ name [ String ] ]

> but they must have the same type among siblings, which rules out:

    <xsd:complexType name="Names">
      <xsd:element name="name" type="AuthorName"/>
      <xsd:element name="name" type="xsd:string" minOccurs="0"/>
    </xsd:complexType>
        ~ type Names = name [ AuthorName ], name [ String ]?

• To be simple or not to be simple...

    <internationalPrice currency="EU">423.46</internationalPrice>

> requires a complexType defined by extension over decimals


Describing Structures: Conclusion

• Research: formal models with good properties

• XML Schema Part 1: Structures is complex
> Deals with XML syntactic aspects
> Focuses on validation
> Many features with complex interactions

• Need for some middle ground
> We need to reason about schemas (e.g., for typing)
> XML Schema: Formalism has just been released


Integrity constraints

• Come from relational
> practical view-point: key & foreign key constraints
> theoretical view-point: functional & inclusion dependencies
> studied in depth in the literature

    Book (isbn, title, price, publisher)
        isbn is a key for the relation Book
    Author (authorid, first, last, affiliation)
        authorid and (first, last) are both keys for the relation Author
    Wrote (isbn, authorid)
        isbn and authorid are foreign keys to Book and Author

• Many useful applications of ICs
> used to preserve information when mapping ER model to relational
> used for safety and verification (e.g., controlling updates)
> used for optimization (e.g., dropping useless joins)

• Reasoning about ICs is hard:
> implication of functional + inclusion dependencies is undecidable
> etc.


ID/IDREF mechanism in DTDs

• Very simple ICs to model identity and references

• ID attributes must have distinct values
> they identify elements uniquely in a document
> but they are not exactly like keys: publishers’ stickers and books’ isbns must be different

• IDREF attributes must have values from ID attributes
> they can capture references to other elements
> but: they allow refs to point to publishers!

    <!ELEMENT book (title, author+, price, publisher, section, bibliography?)>
    <!ATTLIST book isbn ID #REQUIRED>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT publisher (name, address)>
    <!ATTLIST publisher sticker ID #REQUIRED>
    <!ELEMENT bibliography EMPTY>
    <!ATTLIST bibliography refs IDREFS #IMPLIED>
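The two weaknesses noted above (a single ID space for the whole document, and untyped IDREFs) can be reproduced with a hand-written check. The sketch below uses Python's standard `xml.etree.ElementTree`, which does not read DTDs, so the list of ID attributes is spelled out by hand; the sample instance is abridged and hypothetical, and it is not valid against the full DTD.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<books>'
    '<book isbn="b1"><publisher sticker="p1"/>'
    '<bibliography refs="b2 p1"/></book>'
    '<book isbn="b2"><publisher sticker="p2"/></book>'
    '</books>')

# ID attributes as declared in the DTD: (element name, attribute name).
ID_ATTRS = [('book', 'isbn'), ('publisher', 'sticker')]

# All IDs live in ONE space: an isbn may not equal a sticker.
ids = {}
for tag, attr in ID_ATTRS:
    for elem in doc.iter(tag):
        id_value = elem.get(attr)
        if id_value in ids:
            raise ValueError(f'duplicate ID {id_value!r}')
        ids[id_value] = tag

# IDREFS only requires each token to be SOME ID in the document,
# so a bibliography may legally point at a publisher.
refs = doc.find('book/bibliography').get('refs').split()
targets = [ids[r] for r in refs]
print(targets)
```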


Adding constraints to DTDs

• We can replace IDs by real keys:

    book.isbn -> book
        isbn is a key for the relation book
    publisher.sticker -> publisher
        sticker is a key for the relation publisher
    author.authorid -> author
        authorid is a key for the relation author
    wrote.isbn, wrote.authorid -> wrote
        isbn and authorid are a key for the relation wrote

• We can replace IDREFs by real foreign keys:

    biblio.refs <= book.isbn
        refs is a multi-valued foreign key from biblio to book
    wrote.isbn <= book.isbn
        isbn is a foreign key from wrote to book
    wrote.authorid <= author.authorid
        authorid is a foreign key from wrote to author

> Reasoning about simple ICs for XML is possible [FanSimeon 2000]
> Reasoning about ICs with DTDs is very hard [FanLibkin 2001]


Constraints in XML Schema

• XML Schema can define powerful constraints
> Using XPath expressions

• One can define keys:

    <key name="Isbn">
      <selector>books/book</selector>
      <field>@isbn</field>
    </key>

    <key name="Publisher">
      <selector>books/book/publisher</selector>
      <field>@sticker</field>
    </key>

> the selector gives the collection on which the constraint applies

• One can define foreign keys:

    <keyref refer="Isbn">
      <selector>books/book/biblio</selector>
      <field>@refs</field>
    </keyref>

• Many open issues
> is XPath too powerful for reasoning (predicates, function calls ?)
> which notion of equality is used ?
> interaction between ICs and structural constraints ?
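The selector/field reading of these constraints can be approximated with the limited path support in Python's standard `xml.etree.ElementTree`. The sketch below is not an XML Schema validator: the helper names are hypothetical, the sample document is invented, selectors are written relative to the `books` root element, and each `refs` attribute is assumed to hold a single value.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<books>'
    '<book isbn="b1"><publisher sticker="p1"/><biblio refs="b2"/></book>'
    '<book isbn="b2"><publisher sticker="p1"/><biblio refs="p1"/></book>'
    '</books>')

def field_values(context, selector, field):
    """Attribute `field` of every element picked by `selector`."""
    return [e.get(field) for e in context.findall(selector)]

def is_key(values):
    """A key: the field is present everywhere and has no duplicates."""
    return None not in values and len(values) == len(set(values))

def satisfies_keyref(values, key_values):
    """A keyref: every value must occur among the referenced key's values."""
    return all(v in key_values for v in values)

isbns = field_values(root, 'book', 'isbn')
stickers = field_values(root, 'book/publisher', 'sticker')
refs = field_values(root, 'book/biblio', 'refs')

isbn_is_key = is_key(isbns)               # b1, b2 are distinct
sticker_is_key = is_key(stickers)         # p1 appears twice
refs_ok = satisfies_keyref(refs, set(isbns))   # "p1" is not an isbn

print(isbn_is_key, sticker_is_key, refs_ok)
```

Unlike IDREF, the keyref is scoped to one key, so a reference to a publisher sticker is rejected.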


Unified Constraint Model

• Based on XML Query Algebra type system
• Key/Foreign Key domains are defined by Types
• Very simple path expression for key components
> Powerful: relational keys/fkeys, object references, ID/IDREFs
> Close to relational approach
> Simple enough to reason about satisfiability

[Fan Kuper Simeon 2001]

type Book = book [ title [ String ], Author*, publisher [ Publisher ] … ]

type Author = author [ name [ String ], wrote [ String* ] ]

key book = Book [| ./title/data() |]

fkey authorbooks = Authors [| ./wrote/data() |] references book


Reusing schemas

• Many benefits
> sharing existing definitions
> faster development

• Traditional techniques for schema reuse:
> some notion of import and the ability to resolve name conflicts

    Import Person, Company from StdClass

    class Person
        tuple(name : tuple(first : string, last : string))
    class Company
        tuple(name : string)

> inheritance, based on subtyping

    class Author inherit Person
        tuple(affiliation : Publisher)
    class Publisher inherit Company
        tuple(address : string)

    tuple(first:string, last:string, affiliation:Publisher) <: tuple(first:string, last:string)
    tuple(name:string, address:string) <: tuple(name:string)

• We need means to access schemas over the Web


Reusing XML Schemas

• Means to import types from other schemas
> access and import through URIs
> name conflict resolution based on namespaces

    <schema xmlns="http://www.w3.org/1999/XMLSchema"
            xmlns:html="http://www.w3.org/1999/xhtml"
            targetNamespace="uri:mybiblio"
            xmlns:my="uri:mybiblio">

• Mechanisms for limited “inheritance” or subtyping
> notions of extension and restriction
> abstract types and “equivalence classes”


Extension

• Extension allows adding new fields to a complex type

    <complexType name="ContactAuthor" base="Author" derivedBy="extension">
      <element name="telephone" type="xsd:string"/>
    </complexType>

• Now you can use both types
> but you might need to mark the data with xsi:type attributes
> you cannot export the document without its type anymore...

    <author xsi:type="Author">
      <name><first>Serge</first><last>Abiteboul</last></name>
      <affiliation>INRIA</affiliation>
    </author>
    <author xsi:type="ContactAuthor">
      <name><first>Jerome</first><last>Simeon</last></name>
      <affiliation>Bell Laboratories</affiliation>
      <telephone>+1 908 582 5473</telephone>
    </author>


Restriction

• Restricts the scope of a type definition

• 5x5 table across schema features to define restriction

• Spirit is to allow:
> smaller datatypes
> narrowed range for sequences: t{n,m} < t{n’,m’} iff n >= n’ && m <= m’
> reduced alternative: t1 < (t1|t2)
> propagation of restriction: t1 < t1’ implies t1 < (t1’|t2)

    <xsd:element name="book2" base="book" derivedBy="restriction">
      <xsd:complexType>
        <xsd:element name="title" type="xsd:string"/>
        <xsd:element name="author" minOccurs="2" maxOccurs="10">
        .......
      </xsd:complexType>
    </xsd:element>


“Equivalence classes”

• Allow defining elements that can be used in place of other elements
> allow an element named contact to be used whenever an author element is expected
> the corresponding type can be a derived type
> of course, “equivalence classes” are not based on equivalence

    <element name="contact" type="ContactAuthor" equivClass="author"/>

    <author>
      <name><first>Serge</first><last>Abiteboul</last></name>
      <affiliation>INRIA</affiliation>
    </author>
    <contact>
      <name><first>Jerome</first><last>Simeon</last></name>
      <affiliation>Bell Laboratories</affiliation>
      <telephone>+1 908 582 5473</telephone>
    </contact>


Some short-comings

• Restriction is very syntactic
> the following two types are not restrictions of one another!

    <xsd:sequence>
      <xsd:element name="a" type="A"/>
      <xsd:sequence>
        <xsd:element name="b" type="B"/>
        <xsd:element name="c" type="C"/>
      </xsd:sequence>
    </xsd:sequence>
        ~ a[A], (b[B], c[C])

    <xsd:sequence>
      <xsd:sequence>
        <xsd:element name="a" type="A"/>
        <xsd:element name="b" type="B"/>
      </xsd:sequence>
      <xsd:element name="c" type="C"/>
    </xsd:sequence>
        ~ (a[A], b[B]), c[C]

• Restriction and extension are not possible together:

    Person1 = person [ name [ UrTree ], age [ Integer ] ]
    Person2 = person [ name [ String ], age [ Integer ], address [ Address ] ]


Subtyping: Conclusion

• Subtyping and inheritance in programming languages

• By-name subtyping in XML Schema: relies on user declaration

• Structural subtyping in XDuce: relies on set inclusion

• Subsumption for semistructured data [Buneman et al 1997] and for XML [Kuper Simeon 2001] proposes a trade-off between by-name and structural subtyping

Still an open problem


XML DDL: Conclusion

• Much research work with interesting and complementary properties

• Complete but complex XML Schema specification...

• Yet no approach that reconciles all of the above

• And still some difficult problems to solve:
> concrete integrity constraint language that is tractable
> syntactic vs. semantic notion of subtyping ?
> use of types for language typing
> use of types for query processing
> use of types for storage


Part IV: XML Query Languages


Plan of the rest of the talk

• Querying XML: problem definition

• Previous query languages for XML and graph-based data

• XQuery as a “standard” query language for XML
– Syntax and semantics
– Functionalities and expressive power
– Open issues

• Other desirable features for XQuery

• Research problems related to XML data management

• Conclusion


In search of a query language...

• What do we call a query language?

The language used to describe, in a declarative fashion, the mapping from an input instance of the data model to an output instance of the data model.

What data model for XML ?


XML data models

• XML is just a syntax and did not have any standard data model for many years (and still doesn’t!)

• Graph data models were used to model irregular data even before XML

• All query languages for graph-based data models are relevant to XML

• XQuery data model (www.w3c.org/TR/query-datamodel)
– First formal and complete data model for XML
– Used in the formal semantic specification of XQuery


XML example
<book year="1967">
<title>The politics of experience</title>
<author>R.D. Laing</author>
<ref isbn="1341-1444-555"/>
<section>





XML data model in a slide
• An instance of the data model = a forest of nodes

• Eight types of nodes:
– Document, element, attribute, value, namespace, processing-instruction, comment, reference nodes

• Each type of node has accessors (e.g. name(element)) and constructors (e.g. comment("this is a comment"))

• Nodes have an optional (unique) parent

• Nodes have an identity that can be queried and preserved

• Support for ordered and unordered collections

• No support for nested collections

• Document order can be queried and preserved

• Data model instances are described and constrained by a type system


XML query language requirements (1)

1. Select portions of an XML document

2. Copy portions of a document while preserving the hierarchy and the order of the nodes

3. Combine (join) two documents

4. Construct new documents

5. Navigate irregular or unknown documents


XML query language requirements (2)

6. Formulate predicates on the tag names and attribute names

7. Query and preserve the nodes' global topological order

8. Apply aggregation and sorting functions

9. Apply existential and universal quantifiers

10. Apply full-text predicates and text operations


Relevant query languages
• Query languages for graph data
– e.g. GOOD, GraphLog, Clean

• Query languages/scripting languages for the Web
– e.g. WebSQL, WebOQL, WebL

• Query languages for semi-structured data
– e.g. MSL, UnQL, StruQL, YATL

• Research query languages for XML
– e.g. XML-QL, Lorel, XML-GL, Quilt, XDuce

• Industry query languages for XML
– e.g. XQL, OQL extensions to query SGML documents

• Standard processing languages for XML (W3C standards)
– e.g. XPath, XSLT

• Standard W3C XML Query Language: XQuery

“XML Query Languages: Experiences and Exemplars”, M. Fernandez, J. Simeon, P. Wadler
“Comparative Analysis of Five XML Query Languages”, Angela Bonifati, Stefano Ceri


XQuery
• Current working drafts inside the W3C: www.w3c.org/XML/Query

• Basis of the future “standard” XML query language

• XQuery will have (a) a human-readable (non-XML) syntax and (b) an XML syntax (ABQL)

• XML Algebra:
– Formal data model, type system
– Formal semantics for the query language

Caveat: many features and design decisions are stable; some will change


XQuery as a functional language
• XQuery:
– consumes an instance of the XML data model as input
– produces an instance of the XML data model as output

• XQuery is a functional language (like OQL)
• XQuery is a strongly typed language
• A query is an expression
• Static semantics:
– Given an expression, computes the type of the result

• Dynamic semantics:
– Given an expression and an environment, determines the resulting value

• Environment binds functions and variables


XQuery expressions
• Constants (all XML Schema atomic types)
– "string literal", 1345.46E23, etc.

• Variables
– $x, $y

• XPath expressions (for navigation)
– $x/girls, $y/*, $x/@name

• Expression OP Expression
– 1 + 3, true and false, $x/girls union $x/boys

• f(exp1, ..., expN)
– descendents($x)

• FLWR expressions (for iteration)
• SORTBY expressions
• Quantified expressions
• Conditional expressions
• XML node constructors (elements, attributes, etc.)


XQuery functions and operators

• Arithmetic operators
– +, -, *, div

• Logical operators
– And, Not, Or

• Collection-oriented operators
– Union, intersection, difference, empty(), distinct()

• XML-specific functions
– Document(), name(), value(), string(), etc.

• Work in progress
• Many open semantic issues: what is the semantics of a + operator when the input is not a value of a numerical type but a list of strings? See the type coercion problem later on.


Navigation using XPath
• General syntax:
expression ‘/’ step

• Step:
axis ‘::’ nodeTest

• The axis controls the direction
– ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

• Node test by
– Name (e.g. employee, myNS:employee, *:employee, myNS:*, *:*)
– Type (e.g. node(), comment(), text())

• Examples of path expressions

document("employees.xml")/child::employee

$x/parent::*

$x/ancestor::*/descendant::comment()


Semantics of path expressions

• Semantics of path expressions in XPath 1.0
(1) Ordered forests of nodes as input, ordered forests of nodes as output
(2) For each root node in the input forest, select the nodes in the same document that lie on the given axis; among those, select and return the ones that satisfy the node test.
(3) No duplicates are allowed in the output
(4) Output nodes are ordered by document order
(5) Nodes preserve their identity

• No type error for $book/firstname

• A list of lists is automatically flattened







Shortcuts in XPath (1)
• Axis is not mandatory
– By default it is child
$x/child::person -> $x/person

• Shorthands for common axes
– Descendants
$x/descendant::comment() -> $x//comment()
– Parent
$x/parent::* -> $x/..
– Attribute
$x/attribute::name -> $x/@name
– Self
$x/self::* -> $x/.
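The abbreviated forms can be tried out with Python's standard `xml.etree.ElementTree`, which implements only a small subset of XPath (abbreviated child steps, `//`, `..`, `.` and simple predicates). This is a minimal sketch; the sample document and its element names are invented for illustration and are not part of the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption of this sketch).
doc = ET.fromstring(
    "<bib>"
    "<book year='1967'><title>The politics of experience</title>"
    "<author>R.D. Laing</author></book>"
    "<book year='1995'><title>Foundations of Databases</title>"
    "<author>Abiteboul</author><author>Hull</author></book>"
    "</bib>"
)

# The child axis is the default: child::book is written book.
books = doc.findall("book")

# Descendant shorthand: .//author selects authors at any depth.
authors = [a.text for a in doc.findall(".//author")]

# Parent shorthand: .//title/.. selects the parent of every title.
parents = [p.tag for p in doc.findall(".//title/..")]

# ElementTree supports @name only inside predicates; values are read with .get().
years = [b.get("year") for b in doc.findall("book[@year]")]
```

ElementTree has no verbose axis syntax, so only the right-hand sides of the rewritings above can be written this way.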


Shortcuts in XPath (2)
• Implicit root node
$root/department -> /department
$root -> /
where $root is implicitly bound to the current document node

• Implicit current node
$self/title -> ./title
$self/title -> title
where $self is implicitly bound to the ‘current’ node
(eliminates the need for an explicit variable declaration in second-order operators like sortby and filter predicates)


Iteration
• Syntax:
for variable in expression0 return expression1

• Examples:
» for $y in document("books.xml")/book return $y/authors
» for $x in //text() return value($x)
» for $z in ( for $y in //book return $y/authors ) return $z
» for $z in //book return ( for $y in $z/authors return $y )

• Semantics:
– bind the variable to each root node of the forest returned by expression0; for each such binding evaluate expression1; concatenate the resulting forests.
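The iteration semantics amounts to a "flat map". The following sketch models forests as Python lists; the function name `for_expr` and the sample data are assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def for_expr(items: Iterable[T], body: Callable[[T], List[U]]) -> List[U]:
    """Model of: for $v in items return body($v)."""
    result: List[U] = []
    for v in items:              # bind the variable to each item in turn
        result.extend(body(v))   # concatenate the partial results (no nesting)
    return result


# Invented data standing in for //book and $y/authors.
books = [{"title": "A", "authors": ["x", "y"]}, {"title": "B", "authors": ["z"]}]
authors = for_expr(books, lambda b: b["authors"])
```

Because the results are concatenated rather than nested, iterating again over `authors` gives the same list, which is what the third example on the slide expresses.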


Local variable declaration

• Syntax:
let variable := expression1 return expression2

• Examples:
» let $y := document("books.xml")/book return count($y)
» let $a := f(2) return $a + $a

• Semantics:
– Evaluate expression1 and add a binding of the variable with this value to the current environment; evaluate expression2 in this environment; remove the local variable from the environment.

• Usage:
– Avoid repeating common sub-expressions
– Split large expressions into smaller, more manageable sub-expressions.


Conditional expressions

• Syntax:
if expression1 then expression2 else expression3

• Examples:
» if $book/year < 1980 then "old book" else "new book"
» if count($company//employee) > 200 then BigCompanyTaxCalculation($company) else SmallCompanyTaxCalculation($company)

• Semantics:
– If expression1 evaluates to true then return the result of the evaluation of expression2, else return the result of the evaluation of expression3.


FLWR expressions
• Syntactic sugar that combines FOR, LET, IF
• Syntax:
( ( for (for_variable_binding)+ ) | ( let (let_variable_binding)+ ) | ( where expression ) )+ return expression
for_variable_binding := variable IN expression
let_variable_binding := variable := expression

• Example:
for $x in //employee, $y in //department
let $z := $x/name
where $x/@department = $y/name
return $z


FLWR example
• FLWR expression:
for $x in //employee, $y in //department
let $z := $x/name
where $x/@department = $y/name
return $z

• Syntactic sugar for:
for $x in //employee return
  ( for $y in //department return
    ( let $z := $x/name return
      if ( $x/@department = $y/name ) then $z else [] /* empty list */ ) )
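The equivalence between the FLWR form and its expansion can be checked on a small in-memory model. In this sketch employees and departments are plain dictionaries; all names and data are invented for illustration.

```python
# Invented sample data standing in for //employee and //department.
employees = [
    {"name": "Ann", "department": "R&D"},
    {"name": "Bob", "department": "Sales"},
    {"name": "Eve", "department": "Legal"},
]
departments = [{"name": "R&D"}, {"name": "Sales"}]


def flwr(employees, departments):
    """for $x ..., $y ... let $z ... where ... return $z, as a comprehension."""
    return [
        z
        for x in employees
        for y in departments
        for z in [x["name"]]                  # let $z := $x/name
        if x["department"] == y["name"]       # where clause
    ]


def desugared(employees, departments):
    """The nested for / let / if form from the slide."""
    result = []
    for x in employees:
        for y in departments:
            z = x["name"]
            # "if ... then $z else []": the empty list contributes nothing
            result.extend([z] if x["department"] == y["name"] else [])
    return result
```

Both functions return the names of the employees whose department appears in the department list.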


Filter predicates
• Syntactic sugar that simplifies some FLWR expressions

• Syntax:
expression1 [ expression2 ]
where expression2 is allowed to use the $self implicit variable (or the equivalent . )

• Semantics:
– if expression2 is of type boolean, shorthand for
for $self in expression1 where expression2 return $self
– if expression2 is of type integer N, return the Nth root element of the forest returned by expression1


Filter predicates (2)
• Filtering by predicate:
» //employee [./name/firstname = "jerome"]
» //book [price < 25]
» //book [count(author [@sex = "female"]) > 0]

• Filtering by position:
» /book[3]
» /book[3]/author[1]
» /book[3]/author[1 to 4]

• Same syntax, different semantics based on the type of the expression!
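The type-dependent meaning of the bracket can be made explicit in a small sketch. The helper `apply_predicate` is hypothetical: it takes an integer for a positional filter, a `range` for a filter like `[1 to 4]`, and a function of the implicit current node for a Boolean filter.

```python
def apply_predicate(items, predicate):
    """Model of expression1[expression2] on a list of items."""
    if isinstance(predicate, int):
        # Positional filter: positions are 1-based; out of range yields nothing.
        return items[predicate - 1:predicate] if predicate >= 1 else []
    if isinstance(predicate, range):
        # Range of positions, e.g. range(1, 5) for "1 to 4".
        return [x for pos, x in enumerate(items, start=1) if pos in predicate]
    # Boolean filter: keep the items for which the predicate holds.
    return [x for x in items if predicate(x)]


# Invented data standing in for //book.
books = [
    {"title": "A", "price": 30},
    {"title": "B", "price": 20},
    {"title": "C", "price": 10},
]
cheap = apply_predicate(books, lambda b: b["price"] < 25)   # //book[price < 25]
third = apply_predicate(books, 3)                            # /book[3]
first_two = apply_predicate(books, range(1, 3))              # /book[1 to 2]
```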


Quantifiers
• Syntax:
some variable in expression1 satisfies expression2
every variable in expression1 satisfies expression2

• Examples:
» some $x in //book satisfies $x/price < 200
» //book[some $x in author satisfies $x/@sex = "female"]
» for $x in //department where every $y in $x/employee satisfies $y/salary > 1000 return $x/manager/name
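In a host language the two quantifiers correspond to the usual existential and universal tests. A sketch with Python's built-in `any` and `all` on invented data follows; the behaviour on an empty input shown here is Python's, and it is only an assumption that the draft language makes the same choice.

```python
# Invented data standing in for //book and //department.
books = [{"title": "A", "price": 150}, {"title": "B", "price": 250}]
departments = [
    {"manager": "Kim", "salaries": [1200, 3000]},
    {"manager": "Lee", "salaries": [900, 5000]},
    {"manager": "Max", "salaries": []},
]

# some $x in //book satisfies $x/price < 200
some_cheap = any(b["price"] < 200 for b in books)

# every $x in //book satisfies $x/price < 200
every_cheap = all(b["price"] < 200 for b in books)

# for $x in //department where every $y in $x/employee satisfies $y/salary > 1000 ...
well_paid = [d["manager"] for d in departments if all(s > 1000 for s in d["salaries"])]
```

Note that the department without employees qualifies, since a universal test over an empty list holds vacuously in Python.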


Sorting
• Syntax:
expression0 SORTBY ‘(’ expression1 [ ASCENDING | DESCENDING ], …, expressionK [ ASCENDING | DESCENDING ] ‘)’

• Semantics:
– Second-order operator
– Stable sort using the comparison function defined on the domains 1..K
– The implicit self variable is allowed in expression1, …, expressionK

• Examples:
» //employee sortby (./name/firstname)
» //person sortby (./income descending, ./name ascending)
» for $x in //departments where count($x/employee) > 2000 return $x sortby (revenue)


Global (document) order queries

• Syntax:
expression1 ( before | after ) expression2

• Semantics:
– return all the roots of the first forest that are located before (resp. after) at least one root node in the second forest according to the global topological order of the document

• Examples:
– //incision before //anesthesia[1]
– //paragraph after //section[name = "introduction"] before //paragraph[contains("Xquery")]


Element constructors (1)
• Normal XML elements:
<section title="Introduction"> This is the introduction of the book entitled <title>Data on the Web</title> written by <author>Dan Suciu</author> <author>Peter Buneman</author> <author>Serge Abiteboul</author>. </section>

• XML elements with dynamically computed data:
<section title = $s/title> "This is the introduction of the book entitled ", $s/ancestor::book/title, " written by ", for $a in $s/ancestor::book/author return <author> concat($a/firstname, $a/lastname) </author> </section>


Element constructors (2)
• Example: “For each book with an author, return the book and its authors; for each book with an editor return the book’s title and the editor’s affiliation”.

<bibliography>
for $x in //book return
  if (empty($x/author))
  then <book> $x/title, $x/editor/affiliation </book>
  else <book> $x/title, $x/author </book>
</bibliography>

Attention to the deep copy semantics!


Constructing other types of nodes

• Eight types of nodes:
– Document, elements, attributes, values, references, namespaces, comments, processing-instructions

• Elements are constructed using an XML notation

• All the others use specific functions
– comment("Please look at this issue!")
– makeAttribute("age", 25)


FILTER
• Example: “Retrieve the table of contents of a specific book”

filter(document("input.xml")//book[@ISBN=10],
  //book | //section | //title | //section/title/text() )

• Copy from the input document only the book elements, the section elements, the section titles and their text content (but not their children)
• For the copied nodes, preserve their relative order and their hierarchical structure.
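The projection performed by FILTER can be sketched as a recursive copy that keeps selected nodes and promotes the kept descendants of dropped nodes. This is an illustration on `xml.etree.ElementTree`, not the operator's definition: the function `filter_tree`, its two predicate parameters and the sample document are assumptions, and for simplicity the sketch keeps the text of every title element.

```python
import xml.etree.ElementTree as ET


def filter_tree(node, keep, keep_text):
    """Return the list of copied subtrees obtained from `node`.

    keep(node): whether the element itself is copied.
    keep_text(node): whether the element's own leading text is copied.
    """
    kept_children = []
    for child in node:                       # document order is preserved
        kept_children.extend(filter_tree(child, keep, keep_text))
    if not keep(node):
        return kept_children                 # drop the node, promote what is kept below it
    copy = ET.Element(node.tag)              # new node; attributes are not copied
    if keep_text(node):
        copy.text = node.text
    copy.extend(kept_children)
    return [copy]


# Invented sample document modelled on the book example.
doc = ET.fromstring(
    "<book ISBN='10'><title>The politics of experience</title>"
    "<author><lastname>Laing</lastname></author>"
    "<section><title>Persons and experience</title>The great and true Amphibian"
    "<section>Exploitation must not</section></section></book>"
)
toc = filter_tree(
    doc,
    keep=lambda n: n.tag in {"book", "section", "title"},
    keep_text=lambda n: n.tag == "title",
)[0]
```

The result contains only book, section and title elements, nested as in the input.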


FILTER example

Input:
<?xml version="1.0"?>
<bib>
…………………………….
<book ISBN="10" year="1967">
  <title>The politics of experience</title>
  <author><firstname>R.D.</firstname><lastname>Laing</lastname></author>
  <section>
    <title>Persons and experience</title>
    The great and true Amphibian
    <section>
      Exploitation must not ....
    </section>
  </section>
</book>
………………………..
</bib>

Output:
<?xml version="1.0"?>
<book>
  <title>The politics of experience</title>
  <section>
    <title>Persons and experience</title>
    <section> ..................... </section>
  </section>
</book>


Dealing with node identity

• All nodes in the data model have node identity

• Node identity is preserved through queries:
– All the constructs in XQuery preserve node identity, except
– the element constructor, which makes copies of the input nodes and generates new nodes with new identity

• Two nodes can be compared using the identity equality operator (‘==’)
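The difference between navigation (identity preserved) and construction (deep copy, new identity) has a direct analogue in Python object identity. A minimal sketch on an invented document, where `is` plays the role of the identity operator:

```python
import copy
import xml.etree.ElementTree as ET

doc = ET.fromstring("<bib><book><title>Data on the Web</title></book></bib>")

# Two different navigations reach the same node: identity is preserved.
via_path = doc.find("book/title")
via_descendant = doc.find(".//title")
same_node = via_path is via_descendant

# Building a new element around the title copies it: new identity, equal content.
wrapper = ET.Element("result")
wrapper.append(copy.deepcopy(via_path))
copied = wrapper.find("title")
```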


XQueries
• … until now we have talked about expressions

• What is a query?

• An XQuery is defined as:
– A list of context definitions
– A list of function definitions
– A main expression

• The result of the query is the result of the evaluation of the main expression

• Context definition:
– Namespace definitions


Local function declarations
• Syntax:
function functionName ‘(’ parameterList ‘)’ return dataType ‘{’ expression ‘}’

• Example:
function total_cost($x myNS:component) return xsd:float
{
  if (simpleComponent($x))
  then $x/price/data()
  else sum(for $y in $x/* return total_cost($y))
}

total_cost(/component[1])

• Functions can be recursive; no restrictions on the type of the recursion

• Functions obey the “implicit mapping rule”
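The recursive `total_cost` function translates almost literally into a host language. The sketch below assumes, as a hypothetical document layout, that a compound component nests `component` children and that a simple component carries a `price` child; neither the layout nor the sample data comes from the talk.

```python
import xml.etree.ElementTree as ET


def total_cost(component):
    """Price of a simple component, else the sum over its sub-components."""
    subcomponents = component.findall("component")
    if not subcomponents:                    # plays the role of simpleComponent($x)
        return float(component.findtext("price"))
    return sum(total_cost(c) for c in subcomponents)


# Invented sample: one simple part and one assembly made of two parts.
root = ET.fromstring(
    "<component>"
    "<component><price>2.5</price></component>"
    "<component>"
    "<component><price>1</price></component>"
    "<component><price>4</price></component>"
    "</component>"
    "</component>"
)
cost = total_cost(root)
```

Termination relies on the input being a finite tree, which is the programmer's responsibility in the general case.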


Static semantics for path expressions

“Retrieve the titles of all the books.”

• Input:
type Bib = bib [ Book* ]
type Book = book [ title [ String ], year [ Integer ], author [ String ]* ]

• Query:
document("bib0.xml")/book/title

• Result:
<title>Data on the Web</title> <title>Foundations of Databases</title> : title[String]*


Static semantics for the iteration

Example: “Retrieve all the books written before 1967.”

• Query:
for $v in document("bib0.xml")/book return if $v/year < 1967 then $v else []

• Result:
<book>…..</book> <book>…..</book> : book[ title [ String ], year [ Integer ], author [ String ]* ]








Joins
• Example: “For each book found at both amazon.com and bn.com, list the title of the book and the price from each vendor”.

<book-with-prices>
for $a in document("amazon.xml")/book,
    $b in document("bn.xml")/book
where $b/isbn = $a/isbn
return
  <book> $a/title, <price-amazon>$a/price</price-amazon>, <price-bn>$b/price</price-bn> </book>
</book-with-prices>


Left-outer joins
• Example: “For each book found at amazon.com, list the title of the book and its price. If the book also appears at bn.com, also list the bn price”.

<book-with-prices>
for $a in document("amazon.xml")/book return
  <book> $a/title,
    <price-amazon>$a/price</price-amazon>,
    for $b in document("bn.xml")/book
    where $b/isbn = $a/isbn
    return <price-bn>$b/price</price-bn>
  </book>
</book-with-prices>


Full-outer joins
• Example: “For each book found at either amazon.com or bn.com, list its price(s).”

let $allISBNs := distinct(document("amazon.xml")/book/isbn union document("bn.xml")/book/isbn)
return
<book-with-prices>
for $isbn in $allISBNs return
  <book>
    ( for $a in document("amazon.xml")/book where $a/isbn = $isbn return <price-amazon>$a/price</price-amazon> ),
    ( for $b in document("bn.xml")/book where $b/isbn = $isbn return <price-bn>$b/price</price-bn> )
  </book>
</book-with-prices>
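The three join variants differ only in which ISBNs drive the iteration. A compact sketch with two dictionaries mapping ISBN to price; the data are invented, and `None` stands for a price element that is absent from the result.

```python
# Invented catalogs: isbn -> price.
amazon = {"111": 30.0, "222": 45.0}
bn = {"222": 40.0, "333": 12.0}

# Join: iterate over pairs that agree on the ISBN.
inner = [(isbn, amazon[isbn], bn[isbn]) for isbn in amazon if isbn in bn]

# Left-outer join: iterate over the left input; the right price is optional.
left_outer = [(isbn, price, bn.get(isbn)) for isbn, price in amazon.items()]

# Full-outer join: iterate over the distinct union of the ISBNs.
all_isbns = sorted(set(amazon) | set(bn))
full_outer = [(isbn, amazon.get(isbn), bn.get(isbn)) for isbn in all_isbns]
```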


Group-by and Having
• Example: “For each author with more than 10 books, list the name of the author and the first 10 books that he/she wrote”.

for $a in distinct(//author)
let $books := //book[author = $a]
where count($books) > 10
return <result> $a/name, $books[1 to 10] </result>
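The same group-then-filter pattern can be sketched in Python. The function `prolific_authors`, its parameters and the sample data are assumptions for illustration; the thresholds are parameters so that the behaviour can be tried on a small input.

```python
from collections import defaultdict


def prolific_authors(books, min_books=10, limit=10):
    """Authors with more than `min_books` books, with their first `limit` titles."""
    by_author = defaultdict(list)
    for book in books:                        # group: one list of titles per author
        for author in book["authors"]:
            by_author[author].append(book["title"])
    return {
        author: titles[:limit]                # $books[1 to 10]
        for author, titles in by_author.items()
        if len(titles) > min_books            # having: count($books) > 10
    }


# Invented data standing in for //book.
books = [
    {"title": "T1", "authors": ["Ann", "Bob"]},
    {"title": "T2", "authors": ["Ann"]},
    {"title": "T3", "authors": ["Ann"]},
]
result = prolific_authors(books, min_books=1, limit=2)
```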


Views and parameterized views

• Support for views is a must

• Views are supported via functions

• Non-parameterized views are functions with no arguments; parameterized views are functions with at least one argument

• Xquery can support recursive views (unrestricted form of recursion)

• Termination is ensured by the programmer


Open issues
• Three-valued logic:
– XML Schema supports elements with nil content
– XQuery has to deal with the absence of information

• Extensibility:
– Some functions will be written in programming languages other than XQuery
– How are those functions declared and invoked in XQuery?

• Exceptions and exception handling mechanisms:
– What is the semantics of a query in case of exceptions?
– What is the semantics of Boolean operators in case of exceptions?
– How should we raise and catch exceptions?

• Type coercion rules:
– XML has no mandatory schema; does this imply that data should be converted on the fly to the types expected by the operators?
– E.g. lists to singletons, strings to floats, floats to strings


Implicit type casting in XPath 1.0
• Data model has 4 types:
– untyped set, string, integer, Boolean

• The evaluation uses implicit type casting rules:

/person [ child/age = 19 ]       implicit existential quantifier
/person [ child/age + 1 = 20 ]   the age of the first child equals 19
/book[@year]                     implicit existential quantifier
/book[@year+1-1]                 two type conversions: string->int, int->Boolean; will return a book written in 1999 if it happens that this is the 1999th book in the document
/book[title=""]                  empty set to string conversion; returns also the books without a <title> subelement
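The first two rules can be made concrete with a small model in which a node set is the list of its nodes' string values in document order. The helper names are hypothetical, and Python's `float` accepts a few more spellings than XPath's number conversion, so this is only a sketch of the idea.

```python
import math


def to_number(text):
    """String to number; anything unparsable becomes NaN."""
    try:
        return float(text)
    except ValueError:
        return math.nan


def nodeset_equals_number(values, n):
    """node-set = number: true if SOME node's value equals n."""
    return any(to_number(v) == n for v in values)


def nodeset_to_number(values):
    """Arithmetic on a node set uses the value of its FIRST node only."""
    return to_number(values[0]) if values else math.nan


# Invented string values of child/age for one person, in document order.
ages = ["12", "19"]

existential = nodeset_equals_number(ages, 19)      # child/age = 19
first_only = nodeset_to_number(ages) + 1 == 20     # child/age + 1 = 20
```

For this person the first test succeeds because some child is 19, while the second fails because only the first child (aged 12) takes part in the addition.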








XML patterns and pattern matching

• UnQL, XML-QL, YATL
• Example:
– “Retrieve the titles of the books written by Laing before 1967”

WHERE <bib>
        <book year=$y ISBN=$isbn>
          <title> $t </title>
          <author> <lastname>Laing</lastname> </author>
        </book>
      </bib> in "bib.xml", $y < 1967
CONSTRUCT <resultBook ISBN=$isbn> <resultTitle> $t </resultTitle> </resultBook>

• No distinction between For and Where
• Pattern matching semantics


Skolem functions
• UnQL, XML-QL, Lorel
• Example:
– “Retrieve the titles of all the books, grouped by year of publication”

WHERE <bib>
        <book year=$y>
          <title> $t </title>
        </book>
      </bib> in "bib.xml"
CONSTRUCT <groupPerYear id=F($y)> <resultTitle> $t </resultTitle> </groupPerYear>
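A Skolem function returns the same fresh identifier every time it is called with the same arguments, so all bindings with the same year contribute to one output node. A minimal sketch; the class `Skolem`, the identifier format and the data are assumptions for illustration.

```python
class Skolem:
    """F(args) yields one identifier per distinct argument tuple."""

    def __init__(self, name):
        self.name = name
        self.ids = {}

    def __call__(self, *args):
        if args not in self.ids:                       # first time: create a new id
            self.ids[args] = f"{self.name}{len(self.ids) + 1}"
        return self.ids[args]                          # afterwards: reuse it


F = Skolem("F")

# Invented (year, title) bindings produced by the WHERE clause.
bindings = [(1967, "A"), (1995, "B"), (1967, "C")]

groups = {}
for year, title in bindings:
    # Bindings that map to the same id are merged into the same group.
    groups.setdefault(F(year), []).append(title)
```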


Vertical regular expressions
• UnQL, XML-QL, Lorel, YATL
• Example:
– “Retrieve the titles of all the sections or chapters”

WHERE <bib>
        <book>
          <(section | chapter)*>
            <title> $t </title>
          </>
        </book>
      </bib> in "bib.xml"
CONSTRUCT <resultTitle> $t </resultTitle>


Horizontal regular expressions

• YATL

• A Tree Pattern = type expression without union, and with annotated variables ($v)

• Example: ”Retrieve the first author after the book title”

• Process DTDs like: <!ELEMENT bib (title, author+)*>

• Example: “Create a bibliography for each author”

book[ title [ String ] book($b) [ title [ $t ], author[String]+, +author [ $a ]+, UrTree* ] _ ]

MAKE $a
MATCH book WITH book [ _ , title , _, author[$a] , *author, _ ]

MAKE *($a) bib [ author [ $a ], *title [ $t ] ]
MATCH bib WITH bib[*(title [ $t ], +author [ $a ] )]


XML-related research problems (1)

• Update languages for XML

• XML views of object-relational databases

• Storing XML data in object-relational DBMSs– new challenges for the traditional DBMSs and for SQL

• Alternative storage methods for XML data

• Indexing XML

• Query processing algorithms for XQuery

• Efficient (streamed) processing of XML transformations

• Mixing structured search with full-text search

• Distributed execution of XML queries

• XML benchmarks


XML-related research problems (2)
• XML-based information mediation

• XML data cleaning

• XML data compression

• XML-based information brokering

• XML-based workflow systems

• XML scripting languages

and many more...


Conclusion
• XML is the lingua franca of the Web
• XML is the next big challenge for the database community
• Large quantities of a new type of data
– textual, irregular, self-organizing, distributed, replicated, etc.

• Many orders of magnitude larger:
– the volume of XML data
– the number of XML data repositories

• We now have good-quality standards:
– XML data model, XML schemas, XML query and transformation languages

• Very clear need from the industry
• Extraordinary opportunity for database research!


XSLT (1)
• Paper:

– “XSL Transformations (XSLT)”, W3C recommendation

• XML to XML rule based transformation language

• An XSLT program is an XML document itself

[Figure: a bib tree with book nodes (title “The divided self”, author “R.D. Laing”, publisher “Pantheon Books”), shown three times; labels: data, transformation, result; DOM, XML, HTML]


XSLT(2)

• An XSLT program is a valid XML document containing:
– elements in the <xsl:> namespace (i.e. the XSLT statements)
– elements in other namespaces (i.e. the user-defined data)

• The result of the evaluation of an XSLT program on an input XML document := the XSLT document where each <xsl:> element has been replaced with the result of its “evaluation”

• Uses XPath as a sublanguage

• Used mostly as a stylesheet language


XSLT programs

• An XSLT program
– is an element of type <xsl:stylesheet>

1. XSL elements describing rewriting rules
– <xsl:template>

2. XSL elements describing rule execution control
– <xsl:apply-templates>
– <xsl:call-template>

3. XSL elements describing instructions
– <xsl:element>, <xsl:attribute>, <xsl:for-each>, <xsl:if>, <xsl:copy>, <xsl:copy-of>, <xsl:sort>, <xsl:value-of>, etc.


XSLT processing model
• Process an XML document (procedure PD):
1. Apply the procedure PL (below) to a list with a single node: the root of the document

• Process a list L of nodes (procedure PL):
1. Process each node N (procedure P below) in the list (with current node=N and current list=L)
2. Return the concatenation (in the right order) of the partial results

PL([x1, x2, …, xn]) = [ P(x1), P(x2), …, P(xn) ]

• Process a node N (procedure P):
1. Find all templates applicable to the node N
2. Find the “best” template among them
3. Instantiate the content of the template
4. Return this result
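The three procedures can be sketched as mutually recursive functions. In this illustration a template is encoded as a triple (match predicate, priority, body); that encoding, the fallback used when no template applies (recurse on the children, in the spirit of the built-in rule for elements) and the sample data are all assumptions of the sketch, not the XSLT specification.

```python
import xml.etree.ElementTree as ET


def process_document(root, templates):            # procedure PD
    return process_list([root], templates)


def process_list(nodes, templates):               # procedure PL
    result = []
    for node in nodes:
        result.extend(process_node(node, templates))   # concatenate in order
    return result


def process_node(node, templates):                # procedure P
    applicable = [t for t in templates if t[0](node)]
    if not applicable:
        # Assumed fallback: recurse on the children.
        return process_list(list(node), templates)
    _, _, body = max(applicable, key=lambda t: t[1])   # "best" = highest priority
    return body(node, templates)                       # instantiate the template


# Hypothetical templates: print section titles, with a special case for "S2".
templates = [
    (lambda n: n.tag == "section", 0,
     lambda n, ts: ["Section " + n.findtext("title", "")]
     + process_list(n.findall("section"), ts)),
    (lambda n: n.tag == "section" and n.findtext("title") == "S2", 1,
     lambda n, ts: ["Last section"]),
]

doc = ET.fromstring(
    "<bib><book><title>T</title>"
    "<section><title>S1</title><section><title>S1.1</title></section></section>"
    "<section><title>S2</title></section>"
    "</book></bib>"
)
output = process_document(doc, templates)
```

The body of the first template calls `process_list` on the sub-sections, which is the role played by a recursive apply-templates call.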


<xsl:template>
• Basic XSLT concept: describes a rewriting rule

• It has:
– attributes to describe the acceptable input
– content to describe the output

• Attributes:
– match: XPath expression describing the elements to which this template applies
– name: the name of the template rule
– priority: guides the choice of the best template to apply

• The content is a legal XML fragment with:
– Elements from the xsl namespace
– Other elements (user data)


<xsl:template> example
<xsl:template name="myTemplate" match="book[title]">
  <resultBook>
    <xsl:attribute name="resultYear">
      <xsl:value-of select="./@year"/>
    </xsl:attribute>
    The title of this book is
    <resultTitle>
      <xsl:value-of select="./title"/>
    </resultTitle>
    and it was....
  </resultBook>
</xsl:template>


Instantiating an <xsl:template>

• ... on a node N:
» returns the content of the template where the <xsl:> elements from the content of the template have been replaced with the result of their “evaluation” (with the current node=N)

» Two types of <xsl:> elements in the content:

1. Instruction elements
» <xsl:copy>, <xsl:copy-of>, <xsl:value-of>, <xsl:for-each>
» return a certain list of nodes according to their particular semantics

2. Rule control elements
» <xsl:apply-templates>, <xsl:call-template>
» recursive calls to the rule engine (see below)

• Maps an XML node into a list of XML nodes









Example of instantiation

Input XML:
<book ISBN="10" year="1967">
  <title>The politics of experience</title>
  <author>R.D.Laing</author>
  <section> The great and tr
    <title>Persons and experience</title>
    <section> Exploitation must not been….
    </section>
  </section>
</book>

Output XML:
<resultBook resultYear="1967"> The title of this book is <resultTitle> The politics of experience </resultTitle> and it was …. </resultBook>


Recursive <xsl:template><xsl:template name=“myTemplate” match=“book[title]”

> <resultBook>

<xsl:attribute name=resultYear><xsl:value-of select=“./@year”/>

</xsl:attribute> <resultTitle>


<xsl:apply-templates select="./section"/> </resultBook></xsl:template>

Invokes the procedure PL with current list= “./section”.


Recursive calls
• <xsl:apply-templates>
– invokes recursively the procedure PL
– the argument is a new list of nodes
» explicitly specified in the select attribute
» by default, the list of children of the current node

<xsl:apply-templates select="./section"/>

• <xsl:call-template>
– triggers the instantiation of a specific template identified by name
– does not change the context node and the context list

<xsl:call-template name="myTemplate"/>


XSLT execution control
<xsl:stylesheet>
  <xsl:template name="myTemplate">
    <xsl:apply-templates select="./ancestor::book"/>
  </xsl:template>

  <xsl:template match="section">
    This is a section of the book <xsl:call-template name="myTemplate"/> and its name is <xsl:value-of select="./title"/>.
  </xsl:template>

  <xsl:template match="book">
    <xsl:value-of select="./title"/>
  </xsl:template>

  <xsl:template match="/">
    <xsl:apply-templates select="//section[title]"/>
  </xsl:template>
</xsl:stylesheet>


Built-in templates

<xsl:template match="*|/">
  <xsl:apply-templates select="./node()"/>
</xsl:template>
if element: apply recursively on the children

<xsl:template match="@*|text()">
  <xsl:value-of select="."/>
</xsl:template>
if text node or attribute: print the content

<xsl:template match="processing-instruction()|comment()"/>
if processing instruction or comment: ignore (do nothing)


TOC of a certain book

<xsl:template match="/">
  <xsl:apply-templates select="//book[@ISBN=10]"/>
</xsl:template>

<xsl:template match="book">
  <xsl:apply-templates select="./section"/>
</xsl:template>

<xsl:template match="section">
  Section <xsl:value-of select="title"/>
  <xsl:apply-templates select="./section"/>
</xsl:template>


XSLT

• Like XQuery, it describes general XML to XML transformations

• Built-in processing model

• Full recursion

• Possible to write non-terminating programs even on trees

• XSLT vs. XQuery
– same expressive power
– differences: programming style, XML vs. non-XML syntax

• Could be considered as a query language

• Is it “declarative” ?


Part IV
Data Manipulation Language


Query languages for XML
• problem definition

• overview of different approaches

• overview of representative research languages – query languages for semistructured data

– research and industry query languages for XML

• status of the XML Query Working Group– XML Query Algebra (working draft)

– XQuery: a query language for XML (working draft)




XML vs. graph-based models
• XML document content could be modeled as a graph
– components (elements, attributes) in a hierarchical structure

• ...but XML is more complicated than that
– several distinct types of nodes
» text, elements, attributes, comments, processing instructions, etc.
– some parts are ordered (e.g. children of an element) and some other parts are not ordered (e.g. attributes)
– in the absence of a DTD or schema, the document is a tree; otherwise it could be a graph

• We will consider not only XML query languages, but also query languages for graph-based data


Some relevant query languages
• Query languages for graph data
e.g. GOOD, GraphLog, Clean

• Query languages for the Web
e.g. WebSQL, WebOQL

• Query languages for semi-structured data
e.g. MSL, UnQL, StruQL

• Research query languages for XML
e.g. XML-QL, Lorel, YATL, XML-GL, Quilt, XDuce

• Industry query languages for XML
e.g. XQL, OQL extensions to query SGML documents

• Standard processing languages for XML (W3C standards)
e.g. XPath, XSLT

“XML Query Languages: Experiences and Exemplars”, M. Fernandez, J. Simeon, P. Wadler

“Comparative Analysis of Five XML Query Languages”, Angela Bonifati, Stefano Ceri


XML languages: the big picture

[Figure: query languages placed by expressive power (Navigation & selection; SPJ+RegExp; SPJ+RegExpr+grouping; OQL+RegExpr; OQL+conditional+full recursion) against data model (Simple graphs; Idealized XML data model; Real XML). Languages shown: UnQL (1), XML-QL (2), Lorel (3), YATL (4), XPath (5), XQuery (6), XSLT (7)]


DDL Roadmap
3.1. XPath
> Building block for several other languages

3.2. XQuery and the XML Query Algebra
> Both working drafts
> Design based on requirements and use cases

3.3. Other languages and features
> XML-QL, Lorel, YATL, XDuce, etc.
> Focusing on specific features

3.4. XSLT
> Already a W3C recommendation
> Already widely used


XPath: Overview
• Syntax for XML document navigation and node selection

• Papers:
– “XML Path Language (XPath)”, W3C recommendation

• Building block for other W3C activities:
– XSL Transformations (XSLT)
– XML Link (XLink)
– XML Pointer (XPointer)
– XML Query (XQuery)


XPath Expressions
• A query is an expression (Location Path)
– describes a single navigation path in an XML document

• A query simply selects a list of nodes from the input document

• A Location Path consists of:
– a context node
– a series of Location Steps separated by /

• A verbose Location Step consists of:
– an axis, a node test, a list of predicates

document("bib.xml") / child::book [./attribute::ISBN=10] / descendant::section [position()=1]


XPath
• Location step:
– an axis, a node test, a list of predicates

• 13 axes:
– ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

• Node test:
– name test (e.g. section, *, myNs:myTag)
– type test (e.g. text(), comment(), node())

document("bib.xml") / child::bib / child::* [./attribute::ISBN=10] / descendant::section [position()=1] / child::comment()


XPath abbreviated syntax

book               CN/child::book
book/@ISBN         CN/child::book/attribute::ISBN
section[1]         CN/child::section[position()=1]
.                  CN
..                 CN/parent::*
../text()          CN/parent::*/child::text()
//section          ROOT/descendant-or-self::section
/section           ROOT/child::section
//                 ROOT/descendant-or-self::*
//section[last()]  ROOT/descendant-or-self::section[position()=last()]

//section [5] [title="introduction"]
//section [title="introduction"] [5]


Semantic aspects of XPath
• Data model has 4 types:
– untyped set, string, integer, Boolean

• The evaluation uses implicit type casting rules:

/person [ child/age = 19 ]       implicit existential quantifier
/person [ child/age + 1 = 20 ]   the age of the first child equals 19
/book[@year]                     implicit existential quantifier
/book[@year+1-1]                 two type conversions: string->int, int->Boolean; will return a book written in 1999 if it happens that this is the 1999th book in the document
/book[title=""]                  empty set to string conversion; returns also the books without a <title> sub-element

preceding::foo[1] and (preceding::foo)[1] are not the same


XML Query Working Group

• XML Query Requirements (WD)– What should be achieved with the language

• XML Query Use Cases (WD)– Many examples of queries for a lot of applications

• XML Query Algebra (WD)– Formal basis for the language(s)

• XQuery : a traditional syntax (WD)

• An XML syntax (Not here yet)


XML Query Requirements

• Declarative

• Expressive (joins, manipulation of documents, etc)
– Supporting both database applications and document applications

• Formally specified
– Precise semantics

• Two syntaxes: ‘user-readable’ and XML

• Should allow updates in the future


XML Query Use Cases

• Illustrate the Query language with examples
– Access to relational databases
– Access to documents
– Full-text queries
– Recursive queries
– Queries that use references
– Metadata querying
– Etc.

• Decide what XQuery should and should not do
– Make 80/20 cut

• 'Benchmark' for the language design
– Important queries should be easy to write


XML Query Algebra

• Based on XML Query Data Model

• 'Minimal' set of operations

• Static semantics (type checking)
– Can infer the type of your query

• Dynamic semantics (result of the query)

• Expressive enough to support XQuery
– Iteration (and join)
– Navigation
– Functions with full recursion

• Contains a tutorial on types and expressions


Static semantics for path expressions

”Retrieve the titles of all the books.”

• Input:
    type Bib = bib [ Book* ]
    type Book = book [ title [ String ], year [ Integer ], author [ String ]* ]

• Query:
    document("bib0.xml")/book/title

• Result:
    <title>Data on the Web</title>
    <title>Foundations of Databases</title>
    : title[String]*


Static semantics for the iteration

Example: ”Retrieve all the books written before 1967.”

• Query:
    for $v in document("bib0.xml")/book return
      if $v/year < 1967 then $v else []

• Result:
    <book>…..</book>
    <book>…..</book>
    : book[ title [ String ], year [ Integer ], author [ String ]* ]


XQuery

• First Working Draft in February

• Coming from work on Quilt– Already a number of test implementations

• Supports XML Query use cases

• Draft of semantics on top of the XML Query Algebra

• Test parsers are available


XQuery

• Data model:
– the XML Query working group data model

• Language description:
– borrows features from OQL, XML-QL, Lorel, XQL, ML
– like ML, OQL, Lorel: it is a functional language
– includes a subset of XPath as a sub-language
– like ML, it uses IF-THEN-ELSE and LET constructs
– like YATL, it uses local function definitions
– like XQL, it uses BEFORE and AFTER operators (global topological order of the XML document)
– new FILTER operator to do projection while preserving the hierarchy and the order


XQuery

• A query := a list of local function definitions + the main expression to evaluate

• An XQuery expression:
– constant (all XML Schema atomic types)
– variable
– f(exp1, ..., expN)
  » +, -, and, or, union, intersection, etc
– LET var := expr1 in expr2
– XPath expression (for navigation)
– FLWR expression
– SORT expr1 by expr2
– XML node constructors (elements, attributes, etc)


XPath in XQuery

• Query1: ”Retrieve the titles of all the books written before 1967.”

document(“bib.xml”)//book[@year<1967]/title

• An XPath expression is an XQuery expression
• Returns the selected forest of the input document
• XPath queries can be used as building blocks for more complex expressions


FLWR expressions

• Query1: "Retrieve the titles of the books written by Laing before 1967, together with their reviews."

    FOR $b in document("bib.xml")//book[@year<1967],
        $r in document("reviews.xml")//review
    WHERE $b/authors/lastname="Laing" and $b/@ISBN=$r/@ISBN
    RETURN
      <resultBook ISBN=$b/@ISBN>
        <title> $b/title/text() </title>,
        $r
      </resultBook>
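The FOR / WHERE / RETURN structure reads like nested loops with a filter. The following Python sketch mimics the query with the standard `xml.etree.ElementTree` module; the inline sample documents and the function name `flwr` are assumptions made for the illustration, and this is not how an XQuery engine would evaluate the join.

```python
import xml.etree.ElementTree as ET

# Made-up sample inputs standing in for bib.xml and reviews.xml.
BIB = """<bib>
  <book ISBN="10" year="1960"><title>The divided self</title>
    <authors><lastname>Laing</lastname></authors></book>
  <book ISBN="11" year="1967"><title>The politics of experience</title>
    <authors><lastname>Laing</lastname></authors></book>
  <book ISBN="12" year="1950"><title>Another book</title>
    <authors><lastname>Smith</lastname></authors></book>
</bib>"""
REVIEWS = """<reviews>
  <review ISBN="10">A classic.</review>
  <review ISBN="12">Fine.</review>
</reviews>"""


def flwr(bib, reviews):
    """Nested-loop reading of the FOR / WHERE / RETURN query."""
    results = []
    for b in bib.iter("book"):                        # FOR $b in ...//book
        if int(b.get("year")) >= 1967:                # predicate [@year<1967]
            continue
        lastnames = [n.text for n in b.findall("authors/lastname")]
        for r in reviews.iter("review"):              # FOR $r in ...//review
            # WHERE: author test plus the join condition on ISBN
            if "Laing" in lastnames and b.get("ISBN") == r.get("ISBN"):
                out = ET.Element("resultBook", ISBN=b.get("ISBN"))  # RETURN
                title = ET.SubElement(out, "title")
                title.text = b.findtext("title")
                out.append(r)
                results.append(out)
    return results


result_books = flwr(ET.fromstring(BIB), ET.fromstring(REVIEWS))
```

Because the review variable is bound in the FOR clause, a book without a matching review produces no result at all, which is the inner-join behavior of this form of the query.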


Local variables

• Query1: "Retrieve the titles of the books written by Laing before 1967 together with their reviews."

    FOR $b in document("input.xml")//book[@year<1967]
    LET $R := document("input.xml")//review[@ISBN=$b/@ISBN]
    WHERE $b/authors/lastname="Laing"
    RETURN
      <resultBook ISBN=$b/@ISBN>
        <resultTitle> $b/title/text() </resultTitle>
        <bookReviews> $R </bookReviews>
      </resultBook>


Global order operators

• Query4: "Retrieve the titles of the first 4 sections (and of their subsections) of a specific book."

    LET $b := /bib/book[@ISBN=10] IN
      $b//section/title BEFORE $b/section[5]

– $b : the book with ISBN = 10
– $b//section/title : the list of all the titles of the sections of the book $b
– $b/section[5] : the fifth section of the book $b
– result : the list of all the titles that appear before the fifth section (in the global topological order of the document)


FILTER

• Query1: "Retrieve the table of contents of a specific book"

    document("input.xml")//book[@ISBN=10]
      FILTER //book | //section | //title | //section/title/text()

• Erase all the nodes from the input document except the book element, the section elements, the section titles and their text content
• For the remaining nodes, preserve their relative order and their hierarchical structure.


FILTER example

Input:

    <?xml version="1.0"?>
    <bib>
      …………………………….
      <book ISBN="10" year="1967">
        <title>The politics of experience</title>
        <author>
          <firstname>R.D.</firstname>
          <lastname>Laing</lastname>
        </author>
        <section>
          <title>Persons and experience</title>
          The great and true Amphibian
          <section>
            Exploitation must not ....
          </section>
        </section>
      </book>
      ………………………..
    </bib>

Output:

    <?xml version="1.0"?>
    <book>
      <title>The politics of experience</title>
      <section>
        <title>Persons and experience</title>
        <section> ..................... </section>
      </section>
    </book>
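A projection that preserves order and hierarchy can be sketched as a recursive copy. The Python below is a rough approximation of FILTER, not its specified semantics: the function name `filter_tree` and the rule "keep an element if its tag is in a given set, keep text only for titles" are assumptions for the illustration, and kept descendants of an erased node are promoted to the nearest kept ancestor.

```python
import xml.etree.ElementTree as ET

# The book from the example slide, with elided text shortened.
BOOK = """<book ISBN="10" year="1967">
  <title>The politics of experience</title>
  <author><firstname>R.D.</firstname><lastname>Laing</lastname></author>
  <section>
    <title>Persons and experience</title>
    The great and true Amphibian
    <section>Exploitation must not ....</section>
  </section>
</book>"""


def filter_tree(node, keep):
    """Return the filtered copies that replace `node`, in document order."""
    kept_children = []
    for child in node:
        kept_children.extend(filter_tree(child, keep))
    if node.tag not in keep:
        # Erased node: its kept descendants move up to the nearest kept ancestor.
        return kept_children
    copy = ET.Element(node.tag)
    if node.tag == "title":
        copy.text = node.text  # text content is kept for titles only
    copy.extend(kept_children)
    return [copy]


toc = filter_tree(ET.fromstring(BOOK), {"book", "section", "title"})[0]
```

The result keeps the book, its title, and the nested sections in their original order, while the author subtree and the section text are dropped.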


XQuery: conclusion

• XQuery design goals:– learn from previous experience– keep it simple– make sure it is useful– make sure it is semantically clean :)

• Still many issues:
– Which additional features to add (full regular expressions, text operators, etc)
– Relationship with XPath
– Relationship with XML Query Algebra
– Relationship with XML Schema


UnQL(1)

• Authors:
– P. Buneman, D. Suciu, M. Fernandez

• Papers:
– "UnQL: A Query Language and Algebra for Semistructured Data Based on Structural Recursion", P. Buneman, M. Fernandez and D. Suciu, VLDB Journal 9(1), 2000.
– More information at: http://www.research.att.com/~suciu/unql-home.html


UnQL(2)

• Initial data model:– trees with labeled edges and labeled leaves

• A query = a function– takes a tree as input and returns a tree as output

• Language description:– based on structural recursion

“The form of the program follows the form of the data.”


UnQL tree data model

• 4 constructs to build a tree
(1) the empty set is a tree (with no nodes and no edges)
(2) if V is a value then {V} is a tree (leaf node)
(3) if T is a tree and L is a label then {L:T} is a tree (edge construction)
(4) if T1 and T2 are trees then T1 U T2 is a tree (union)

    {book : {title: "The divided self"} {author: "R.D.Laing"} {publisher: "Pantheon Books"}}

[Figure: edge-labeled tree rooted at bib, with book edges leading to title, author and publisher edges and the leaves "The divided self", "R.D. Laing" and "Pantheon Books"]


UnQL query language

• A query = a function
• A function = an ordered set of rules
• A rule:
– left-hand side:
  » a pattern: when the rule has to be applied
– right-hand side:
  » an expression that describes how to create the resulting tree

• 4 types of patterns

    F({"a"}) = {"A"}
    F({"b": T}) = {"B": F(T)}

• Syntactic restrictions on the expression in the right-hand side guarantee nice behavior


UnQL in action (3)

• Query1: "Retrieve the titles of all the books".

Specific rule:

    F({L:T}) = if L="title" then {"result":T} else F(T)

Rules fixed in the language:

    F(T1 U T2) = F(T1) U F(T2)
    F({}) = {}

[Figure: edge-labeled tree rooted at bib, with book edges leading to title, author and publisher edges and the leaves "The divided self", "R.D. Laing" and "Pantheon Books"]
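Structural recursion translates almost directly into a recursive function over the tree. In the Python sketch below, the encoding is an assumption made for the illustration: a tree is a list, an edge {L:T} is a tuple `(label, subtree)`, a leaf {V} is a bare string, union is list concatenation, and the function name `titles` is hypothetical.

```python
def titles(tree):
    """F from Query1: collect the subtrees found under "title" edges."""
    out = []
    # Iterating over the list plays the role of F(T1 U T2) = F(T1) U F(T2);
    # an empty list gives F({}) = {}.
    for item in tree:
        if isinstance(item, tuple):
            label, subtree = item
            if label == "title":
                out.append(("result", subtree))   # {"result": T}
            else:
                out.extend(titles(subtree))       # F(T)
        # Leaf values contribute nothing to this query.
    return out


# Sample database in the encoding described above.
db = [("bib", [
    ("book", [("title", ["The divided self"]),
              ("author", ["R.D. Laing"]),
              ("publisher", ["Pantheon Books"])]),
    ("book", [("title", ["The politics of experience"])]),
])]

found = titles(db)
```

The shape of the function mirrors the shape of the data: one case per tree constructor, with recursion only on subtrees.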


UnQL in action (4)

• Query2: "Copy the document while translating the edge labels into French and omitting the sections and their descendants."

Rules fixed in the language:

    F(T1 U T2) = F(T1) U F(T2)
    F({}) = {}

Specific rules:

    F({"book":T}) = {"livre": F(T)}
    F({"title":T}) = {"titre": F(T)}
    F({"year":T}) = {"annee": F(T)}
    F({L:T}) = {}
    F({V}) = {V}

[Figure: F maps a book edge over a subtree T to a livre edge over F(T)]


Alternative SELECT-WHERE syntax

• Query2: "Copy the books written before 1967 while translating the edge labels into French and omitting the sections and their descendants."

    SELECT {livre: {titre: T} {annee: Y}}                  /* output tree pattern */
    WHERE  {bib: {book: {title: T} {year: Y}}} in db,      /* input tree pattern */
           Y < 1967

• Can be translated into the previous formalism

[Figure: table of variable bindings with columns T and Y]


Vertical regular expressions

• Introduced by POQL (INRIA)

• Query4: "Retrieve the books that have a section or a chapter entitled 'Persons and experience'"

    SELECT {title: T}
    WHERE  {bib: {book: {title: T}
                        {(section|chapter)*.title: "Persons and experience"}}} in db

• Any regular expression can be expressed using structural recursion


Cyclic data in UnQL

[Figure: graph of book nodes with title, author and publisher edges and leaves such as "The divided self", "R.D. Laing" and "Western studies"; citation edges between the books form a cycle]

• Normal evaluation would create infinite loops

• Two (equivalent) solutions:
– memoization (do not visit the same node twice)
– bulk semantics (apply the function on each edge in parallel and group the resulting graph at the end)

    F({"title":T}) = {"result":T}
    F({L:T}) = F(T)
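The memoization option can be sketched in Python as a traversal that remembers visited nodes. The graph encoding (a dict from node id to a list of `(label, target)` edges), the node ids, and the function name `collect_titles` are assumptions for this illustration; the bulk semantics is not shown.

```python
def collect_titles(graph, root):
    """Follow every edge once; report the targets of "title" edges."""
    seen = set()
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:          # memoization: never visit a node twice
            continue
        seen.add(node)
        for label, target in graph.get(node, []):
            if label == "title":
                results.append(target)     # F({"title":T}) = {"result":T}
            else:
                stack.append(target)       # F({L:T}) = F(T)
    return results


# Two books that cite each other, which makes the graph cyclic.
graph = {
    "bib": [("book", "b1"), ("book", "b2")],
    "b1": [("title", "The divided self"), ("citation", "b2")],
    "b2": [("title", "Western studies"), ("citation", "b1")],
}

cyclic_titles = sorted(collect_titles(graph, "bib"))
```

Without the `seen` set the two citation edges would send the traversal around the cycle forever; with it, each node is expanded exactly once.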


UnQL: final conclusion

• Structural recursion as a programming style
• Defined on trees but also on cyclic data
• Well defined semantics
• Well studied properties
– expressive power (FO+TC)
– computable in PTIME
– compositional: q1 o q2 = q3
– allows for traditional optimization
– structural recursion guarantees termination even for cyclic data

• Very interesting study but not usable as such for XML. XML is not a simple graph.


XML-QL(1)

• Authors:
– A. Deutsch, M. Fernandez, D. Florescu, A. Levy, D. Suciu

• Papers:
– "XML-QL: a Query Language for XML", A. Deutsch, M. Fernandez, D. Florescu, A. Levy, D. Suciu, Proc. Int. Conf. of WWW, 1999.

• Implementation:
– available at http://www.research.att.com/~mff/xmlql/doc
– home-grown main memory XML data repository
– query optimizer and execution engine


XML-QL(2)

• Data model:
– node and edge labeled graph (elements & attributes)
– a (totally) ordered or a (totally) unordered graph

• Language description:
– WHERE clause to bind variables and to test predicates
– CONSTRUCT clause to create new XML structures

• Features:
– as UnQL: XML patterns for both the WHERE clause and the CONSTRUCT clause
– as UnQL: regular expressions for navigation
– in addition: joins on multiple input sources
– in addition: Skolem functions to create nested structures


XML patterns

• Query1: "Retrieve the titles of the books written by Laing before 1967"

    WHERE <bib>
            <book year=$y ISBN=$isbn>
              <title> $t </title>
              <author> <lastname>Laing</lastname> </author>
            </book>
          </bib> in "bib.xml",
          $y < 1967
    CONSTRUCT <resultBook ISBN=$isbn>
                <resultTitle> $t </resultTitle>
              </resultBook>

[Figure: table of variable bindings with columns $y, $isbn and $t]


Joins in XML-QL

• Query2: "Retrieve all the reviews about books written by Laing".

    WHERE <bib>
            <book ISBN=$i>
              <author> <lastname>Laing</lastname> </author>
            </book>
          </bib> in "bib.xml",
          <reviews>
            <review ISBN=$i> </review> ELEMENT_AS $e
          </reviews> in "reviews.xml"
    CONSTRUCT $e


Outer-joins in XML-QL

• Using nested queries
• Query3: "Retrieve the titles of the books written by Laing before 1967, together with their reviews (if any)."

    WHERE <bib>
            <book year=$y ISBN=$i>
              <title> $t </title>
              <author> <lastname>Laing</lastname> </author>
            </book>
          </bib> in "bib.xml",
          $y < 1967
    CONSTRUCT <resultBook ISBN=$i>
                <title> $t </title>,
                ( WHERE <reviews>
                          <review ISBN=$i> </review> ELEMENT_AS $r
                        </reviews> in "reviews.xml"
                  CONSTRUCT $r )
              </resultBook>

• Outer-join semantics: a book without reviews still produces a resultBook element.


Meta-data queries

• Query4: "Which kind of elements can be found in the content of the element corresponding to the book with isbn=10?"

    WHERE <bib>
            <book ISBN="10"> <$tagName> </> </book>
          </bib> in "bib.xml",
    CONSTRUCT <result> $tagName </result>


Fusion using Skolem functions

• Fusion introduced by MSL (TSIMMIS)
• Query5: "Retrieve the titles of all the books, grouped first by year and then by publisher".

    WHERE <bib>
            <book year=$y>
              <title> $t </title>
              <publisher> $p </publisher>
            </book>
          </bib>
    CONSTRUCT <bookPerYear id=F1($y)>
                <bookPerYear&Publisher id=F2($y,$p)>
                  <bookTitle> $t </bookTitle>
                </bookPerYear&Publisher>
              </bookPerYear>

• Automatic fusion of all the bookPerYear elements with the same id attribute

[Figure: table of variable bindings with columns $y, $p and $t]
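Fusion by Skolem identifier behaves like grouping with dictionary keys: every binding that yields the same id contributes to the same element. In the Python sketch below, the tuple keys `("F1", y)` and `("F2", y, p)` stand in for the Skolem terms, and the binding table and the function name `fuse` are assumptions for the illustration.

```python
def fuse(bindings):
    """Group ($y, $p, $t) bindings the way F1($y) and F2($y, $p) fuse elements."""
    per_year = {}
    for y, p, t in bindings:
        # Same F1($y) id -> same bookPerYear element.
        year_elem = per_year.setdefault(("F1", y), {})
        # Same F2($y, $p) id -> same bookPerYear&Publisher element.
        titles_elem = year_elem.setdefault(("F2", y, p), [])
        titles_elem.append(t)
    return per_year


# Sample binding table: one row per match of the WHERE pattern.
bindings = [
    (1960, "Pantheon Books", "The divided self"),
    (1960, "Pantheon Books", "Another 1960 title"),
    (1960, "Penguin", "A third title"),
    (1967, "Pantheon Books", "The politics of experience"),
]

grouped = fuse(bindings)
```

A plain dictionary sidesteps the ordering question that XML raises: it records which titles belong together but says nothing about where a fused element should appear in the serialized document.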


Skolem functions issues• Query5: ”Retrieve the titles of the books published by

“Pantheon Books”, grouped by year and by publisher”. WHERE



<bookPerYear id=F1($y) > <bookPerYear&Publisher id=F2($p) >


</bookPerYear>

Creates graphs with cycles and sharing.Several possible XML serializations.


Skolem functions issues• Query5: ”Retrieve the titles of the books published by

“Pantheon Books”, grouped by year and by publisher”. WHERE



<bookPerYear id=F1($y) > <newElement> We have an order problem </newElement> <bookPerYear&Publisher id=F2($y, $p) >


</bookPerYear>

Creates graphs with cycles and sharing.Several possible XML serializations.


XML-QL: final conclusion

• Advantages:
– XML templates look very familiar
– can express selection, projection, join, grouping
– can construct deeply nested XML elements

• Limitations:
– problems with the semantics of Skolem functions:
  » order
  » nested Skolem functions
– preserving structure and hierarchy is difficult
– no disjunction, aggregates, quantifiers, etc.
– data model ignores some important XML details


Lorel

• Authors:
– S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener

• Paper:
– "The Lorel Query Language for Semistructured Data", S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener, Journal of Digital Libraries, 1(1), 1997
– Semistructured data (OEM), reconverted to XML

• Lorel is an extension of OQL for OEM:
– functional language
– applies type coercion (relaxes the strong typing constraint of OQL)
– performs path navigation with full regular expressions
– adds an XML element creation operator
– adds Skolem functions for grouping


OQL-like queries for XML

• Query1: "Retrieve the books written by Laing before 1967."

    SELECT xml(result: $b)
    FROM $b in bib.book
    WHERE $b.author.lastname?="Laing" and $b.@year<1967

• UnQL & XML-QL vs. Lorel:
– No more patterns and pattern matching but path expressions.
– Different syntax. Equivalent expressive power.


Type coercion

• Query1: "Retrieve the books written by Laing before 1967."

    FROM $b in bib.book
    WHERE $b.author.lastname="Laing" and $b.@year<1967

With the coercions made explicit, the same query reads:

    FROM $b in bib.book
    WHERE exists $l in $b.author.lastname?: $l="Laing" and
          real($b.@year) < real(1967)


Type coercion in Lorel

• Basic comparison operators for atomic types
– conversion to the most general type (real)

• Coercion for equality
– "set=value" => existential quantifier
– "set=atomic object" => existential quantifier
– "set, value=complex object" => false
– complex object equality defined recursively
– example: price="12.5" verifies price<13 but not price<"013"

• Traditional operators lose their convenient properties (transitivity, distributivity, etc)

• Problem for query processing!
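The price example can be replayed in Python. The two helper functions are assumptions that approximate the coercion rule for this one case only (two strings are compared as strings, anything else is converted to a real); they are not Lorel's full coercion table.

```python
def coerce_less(a, b):
    """a < b: string comparison for two strings, numeric comparison otherwise."""
    if isinstance(a, str) and isinstance(b, str):
        return a < b
    return float(a) < float(b)


def coerce_equal(a, b):
    """a = b under the same coercion rule."""
    if isinstance(a, str) and isinstance(b, str):
        return a == b
    return float(a) == float(b)


price = "12.5"
below_number = coerce_less(price, 13)        # numeric comparison: 12.5 < 13
below_string = coerce_less(price, "013")     # string comparison: "1" sorts after "0"
same_value = coerce_equal(13, "013")         # numeric comparison: 13 = 13
```

Here 13 and "013" compare as equal, yet the price is below one and not below the other, so equals can no longer be substituted for equals. An optimizer that reorders or rewrites such predicates has to account for this.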


Lorel: final conclusion

• Extends OQL in the following way:
– relaxes the strong typing constraint (type coercion)
– adds regular path expressions for the navigation
– adds Skolem functions

• Advantages:
– builds on a powerful and well defined language (OQL)
– type coercion deals with irregular data

• Limitations:
– type coercion is not always good
– data model ignores some important XML details


YATL

• Authors: Jérôme Siméon, Sophie Cluet

• Papers:
– "Your Mediators Need Data Conversion!", Sigmod'1998
– "The New YATL: Design and Specifications", INRIA 1999

• Initial goal: data conversion and integration

• Data model: ordered trees, references, node-labeled

• Language description:
– like OQL & Lorel: functional language
– like others: database iterator (make...match...where)
– like others: Skolem functions to manipulate references
– pattern matching with horizontal regular expressions
– local functions with full recursion and case expressions for conversions

• Implementation: v1 INRIA in 1998 & v2 Bell Labs in 2000


Tree patterns in YATL

• Query1: "Retrieve the titles of the books published in 1967 by 'Pantheon Books'."

    MAKE result [ $t ]
    MATCH "bib.xml" WITH book[ @year[$y],
                               title[$t],
                               publisher[$p] ]
    WHERE $p = "Pantheon Books" and $y = 1967

Different semantics for matching:
• no additional children allowed in a book
• the cardinality of each @year, title and publisher has to be respected
• the order of @year, title and publisher has to be respected


Tree patterns in YATL

• Query1: "Retrieve the titles of the books published in 1967 by 'Pantheon Books'."

    MAKE result [ $t ]
    MATCH input.xml WITH book[ _, @year[$y], _,
                               title[$t], _,
                               publisher[$p], _ ]
    WHERE $p = "Pantheon Books" and $y = 1967

Different semantics for the patterns:
• DO allow additional children in a book (the _ wildcards)
• the cardinality of each @year, title and publisher has to be respected
• the order of @year, title and publisher has to be respected


Horizontal regular expressions

• A Tree Pattern = type expression without union, and with annotated variables ($v)

    Type:    book[ title[String], author[String]+, UrTree* ]
    Pattern: book($b)[ title[$t], +author[$a], _ ]

• Query: "Retrieve the first author after the book title".

    MAKE $a
    MATCH book WITH book[ _, title, _, author[$a], *author, _ ]

• Process DTDs like: <!ELEMENT bib' (title, author+)*>

Ex: "Create a bibliography for each author"

    MAKE *($a) bib[ author[$a], *title[$t] ]
    MATCH bib' WITH bib[ *(title[$t], +author[$a]) ]


Recursive functions

• Query1: "Retrieve the table of contents of a book."

• Problem: how to enforce termination?!

    define function toc($b) =
      case $b of
      | title[$t]        -> title[$t]
      | section[*$child] -> section[ *toc($child) ]
      | _[*$child]       -> [ *toc($child) ];

    toc(bib/book);


YATL: final conclusion

• YATL design goals:
– Orthogonal constructs + functional glue
– Regular expressions = XML types = YATL primitive operation
– Recursion and case statement: very expressive to support queries, conversion and integration
– Efficient on the classical database queries

• Open issues:
– no termination!
– optimization of recursion and case?


XSLT(1)

• Paper:
– "XSL Transformations (XSLT)", W3C recommendation

• XML to XML rule based transformation language

• An XSLT program is an XML document itself

[Figure: three copies of the bib tree (book, title, author, publisher, with the leaves "The divided self", "R.D. Laing" and "Pantheon Books"), labeled data, transformation and result, together with the labels DOM, XML and HTML]


XSLT(2)

• An XSLT program is a valid XML document containing:
– elements in the <xsl:> namespace (i.e. the XSLT statements)
– elements in other namespaces (i.e. the user-defined data)

• The result of the evaluation of an XSLT program on an input XML document := the XSLT document where each <xsl:> element has been replaced with the result of its "evaluation"

• Uses XPath as a sublanguage

• Used mostly as a stylesheet language


XSLT programs

• An XSLT program
– is an element of type <xsl:stylesheet>

1. XSL elements describing rewriting rules
– <xsl:template>

2. XSL elements describing rule execution control
– <xsl:apply-templates>
– <xsl:call-template>

3. XSL elements describing instructions
– <xsl:element>, <xsl:attribute>, <xsl:for-each>, <xsl:if>, <xsl:copy>, <xsl:copy-of>, <xsl:sort>, <xsl:value-of>, etc


XSLT processing model

• Process an XML document (procedure PD):
1. Apply the procedure PL (below) to a list with a single node: the root of the document

• Process a list L of nodes (procedure PL):
1. Process each node N (procedure P below) in the list (with current node=N and current list=L)
2. Return the concatenation (in the right order) of the partial results

    PL([x1, x2, …, xn]) = [ P(x1), P(x2), …, P(xn) ]

• Process a node N (procedure P):
1. Find all applicable templates to the node N
2. Find the "best" template among them
3. Instantiate the content of the template
4. Return this result
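The three procedures fit in a few lines of Python. This is a sketch of the control flow only, under stated assumptions: a template is a dict with hypothetical keys `match`, `priority` and `instantiate`, the sample document and the template `toc_section` are made up, and a node with no applicable template simply recurses into its children (a simplification of the built-in rules that ignores text nodes).

```python
import xml.etree.ElementTree as ET


def process_document(root, templates):
    """PD: apply PL to a list holding the single root node."""
    return process_list([root], templates)


def process_list(nodes, templates):
    """PL: process each node and concatenate the partial results in order."""
    out = []
    for node in nodes:
        out.extend(process_node(node, templates))
    return out


def process_node(node, templates):
    """P: find applicable templates, pick the best one, instantiate it."""
    applicable = [t for t in templates if t["match"](node)]
    if not applicable:
        # Simplified built-in rule: recurse on the children.
        return process_list(list(node), templates)
    best = max(applicable, key=lambda t: t["priority"])
    return best["instantiate"](node, templates)


def toc_section(node, templates):
    """Template body: one output line, then a recursive call on sub-sections."""
    line = "Section " + node.findtext("title", default="")
    return [line] + process_list(node.findall("section"), templates)


TEMPLATES = [{"match": lambda n: n.tag == "section",
              "priority": 0,
              "instantiate": toc_section}]

DOC = ET.fromstring(
    "<book><title>T</title>"
    "<section><title>A</title><section><title>A.1</title></section></section>"
    "<section><title>B</title></section></book>")

toc_lines = process_document(DOC, TEMPLATES)
```

The recursion is driven by the templates themselves: `toc_section` decides which node list is processed next, which is the role `<xsl:apply-templates>` plays in a stylesheet.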


<xsl:template>

• Basic XSLT concept: describes a rewriting rule

• It has:
– attributes to describe the acceptable input
– content to describe the output

• Attributes:
– match: XPath expression describing the elements to which this template applies
– name: the name of the template rule
– priority: guides the choice of the best template to apply

• The content is a legal XML fragment with:
– Elements from the xsl namespace
– Other elements (user data)









Instantiating an <xsl:template>

• ... on a node N:
» returns the content of the template where the <xsl:> elements from the content of the template have been replaced with the result of their "evaluation" (with the current node=N)
» Two types of <xsl:> elements in the content:

1. Instruction elements
» <xsl:copy>, <xsl:copy-of>, <xsl:value-of>, <xsl:for-each>
» return a certain list of nodes according to their particular semantics

2. Rule control elements
» <xsl:apply-templates>, <xsl:call-template>
» recursive calls to the rule engine (see below)

• Maps an XML node into a list of XML nodes









Example of instantiation

Input XML:

    <book ISBN="10" year="1967">
      <title>The politics of experience</title>
      <author>R.D.Laing</author>
      <section> The great and tr
        <title>Persons and experience</title>
        <section> Exploitation must not been….
        </section>
      </section>
    </book>

Output XML:

    <resultBook resultYear="1967">
      The title of this book is
      <resultTitle> The politics of experience </resultTitle>
      and it was ….
    </resultBook>


Recursive <xsl:template>

    <xsl:template name="myTemplate" match="book[title]">
      <resultBook>
        <xsl:attribute name="resultYear">
          <xsl:value-of select="./@year"/>
        </xsl:attribute>
        <resultTitle>


        <xsl:apply-templates select="./section"/>
      </resultBook>
    </xsl:template>

<xsl:apply-templates> invokes the procedure PL with current list = "./section".


Recursive calls

• <xsl:apply-templates>
– invokes recursively the procedure PL
– the argument is a new list of nodes
  » explicitly specified in the select attribute
  » by default is the list of children of the current node

    <xsl:apply-templates select="./section"/>

• <xsl:call-template>
– triggers the instantiation of a specific template identified by name
– does not change the context node and the context list

    <xsl:call-template name="myTemplate"/>


XSLT execution control

    <xsl:stylesheet>

      <xsl:template name="myTemplate">
        <xsl:apply-templates select="./ancestor::book"/>
      </xsl:template>

      <xsl:template match="section">
        This is a section of the book
        <xsl:call-template name="myTemplate"/>
        and its name is
        <xsl:value-of select="./title"/>.
      </xsl:template>

      <xsl:template match="book">
        <xsl:value-of select="./title"/>
      </xsl:template>

      <xsl:template match="/">
        <xsl:apply-templates select="//section[title]"/>
      </xsl:template>

    </xsl:stylesheet>


Built-in templates

    <xsl:template match="*|/">
      <xsl:apply-templates select="./node()"/>
    </xsl:template>

– if element: apply recursively on the children

    <xsl:template match="@*|text()">
      <xsl:value-of select="."/>
    </xsl:template>

– if text node or attribute: print the content

    <xsl:template match="processing-instruction()|comment()"/>

– if processing instruction or comment: ignore (do nothing)


TOC of a certain book

    <xsl:template match="/">
      <xsl:apply-templates select="//book[@ISBN=10]"/>
    </xsl:template>

    <xsl:template match="book">
      <xsl:apply-templates select="./section"/>
    </xsl:template>

    <xsl:template match="section">
      Section <xsl:value-of select="title"/>
      <xsl:apply-templates select="./section"/>
    </xsl:template>


XSLT: final conclusion

• Describes general XML to XML transformations

• Built-in processing model

• Full recursion (not only structural recursion like UnQL!)

• Possible to write non-terminating programs even on trees

• XSLT vs. Quilt
– equivalent expressive power
– differences: programming style, XML vs. non-XML syntax

• Could be considered as a query language

• Is it “declarative” ? Should it be a QL candidate?


XML-related research problems(1)

• Update languages for XML

• XML views of object-relational databases

• Storing XML data in object-relational DBMSs– new challenges for the traditional DBMSs

• Alternative storage methods for XML data

• Indexing XML

• Query processing algorithms for XML data

• Mixing structured search with full-text search

• XML benchmarks


XML-related research problems(2)

• Distributed execution of XML queries

• XML-based information mediation

• XML data cleaning

• XML data compression

• Efficient (streamed) processing of XML transformations

• XML-based information brokering

• XML-based workflow systems

and many more...


Conclusion

• XML is the lingua franca of the Web
• XML is the next big challenge for the database community
• Large quantities of a new type of data
– textual, irregular, self-organizing, distributed, replicated, etc.

• Many orders of magnitude larger:
– the volume of XML data
– the number of XML data repositories

• The need for such a technology is here
• The solutions are not here!
• Myriad of standards and products issued from industry

What is the role of the research?


Typeswitch

• Goal:
– control the evaluation using the type of a certain expression

• Syntax:

    typeswitch expression0 '[' as variable ']'
      case type1 return expression1
      ………..
      case typeK return expressionK
      else return expressionK+1

• Semantics:
– compute the dynamic type of expression0
– if the dynamic type of expression0 and the type typeI of a case clause have a non-empty intersection, the entire expression evaluates to the result of expressionI
– if no case clause satisfies this requirement, return the result of expressionK+1


Typeswitch (2)

• Example:

    for $x in /department[name="operations"]/personnel/*
    return typeswitch $x
      case manager return $x/salary + 1000
      case regular_employee return $x/salary
      else error
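The same dispatch on the dynamic type can be written in Python with `isinstance`. The classes `Manager` and `RegularEmployee`, the sample salaries, and the choice to test the cases in order are assumptions for the illustration; Python classes only approximate the XML types the draft has in mind.

```python
from dataclasses import dataclass


@dataclass
class Manager:
    salary: int


@dataclass
class RegularEmployee:
    salary: int


def adjusted_salary(x):
    """typeswitch $x: pick a branch from the dynamic type of x."""
    if isinstance(x, Manager):              # case manager
        return x.salary + 1000
    if isinstance(x, RegularEmployee):      # case regular_employee
        return x.salary
    raise TypeError(f"unexpected personnel type: {type(x).__name__}")  # else error


# Sample personnel list for the "operations" department.
personnel = [Manager(5000), RegularEmployee(3000)]
salaries = [adjusted_salary(p) for p in personnel]
```

As in the slide's example, a value that matches no case raises an error instead of silently yielding an empty result.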
