Relational to XML Transformations

2005 rel-xml-i 1

Relational to XML Transformations

Background & Issues Preliminaries Execution strategiesThe SilkRoute System

2005 rel-xml-i 2

Background & Issues

• XML – a de-facto standard for data exchange (B2B)

• Business data is will be stored in relational db’s :

reliability, optimized query processing, established applications

Need efficient generation of XML data from relational db’s

Subject was investigated in several projects (early 2000’s) :

• Xperanto (IBM Almaden)

• SilkRoute (AT&T)

• PRATA (Bell Labs/Lucent)

2005 rel-xml-i 3

Issues :

Relations are flat, unordered, tuples are pure data

XML is nested, tagged, and ordered

• Need a language/interface to specify needed data and its form

• Which parts of the work should be performed

Inside/outside the relational engine

Relational engines are very good at

• Optimizing SQL queries (efficient execution)

• Sorting

But do not deal with tagging, do not generate XML

2005 rel-xml-i 4

Xperanto :

The IBM team had access to DB2 internals

• Extended SQL by a few primitives to generate XML, Implemented these in the DB2 relational engine

• Analyzed space of execution strategies, using simulations, concluded that 2 strategies, both doing almost all work as one (extended) SQL query is best

But:

We do not have access to db internals

The interesting part is their analysis/simulations

2005 rel-xml-i 5

SilkRoute :

• Relational db is presented as an XML view (standard transformation)

• Desired XML specified in XML query language– Initial version: home-brewed XML query language

– Last version: QXuery

Main idea:

• Query composition with the query that define the standard view allow to generate the data by SQL queries + tagging

• Found that one big SQL query is not always best

2005 rel-xml-i 6

PRATA :

• Use DTD’s as description of desired XML, & a generalization of attribute grammars with query actions to specify the needed data

• Can handle recursion in DTD’s (former approaches cannot)

• Can still optimize to use a small number of SQL queries for data generation

2005 rel-xml-i 7

Comment :

We are now back in the GAV approach

• View: standard XML view of relational

• Desired XML: query on this view

• Main idea: query composition

Complications:

• XML data is tagged, relational is not

• Nested data

• Different nesting in view and query target Need to change structure

May need fusion

2005 rel-xml-i 8

Execution strategies

The issue:

Data is stored in relational tables, can be retrieved with one/few/many SQL queries

• Which approach is more efficient?

• How can the approach be implemented, assuming the transformation is in some XML-ish l query language

2005 rel-xml-i 9

Xperanto execution strategies :

Base example :

2005 rel-xml-i 10

The source relational schema:

2005 rel-xml-i 11

Space of evaluation strategies :

• Early/late tagging

• Early/late structuring (to form the nested XML structure)

• All work inside the engine, or (at least part) outside the engine

Some combinations are meaningless, e.g.

early tagging/late structuring

2005 rel-xml-i 12

1st strategy : early tagging, early structuring, outside the engine

An application issues a sequence of SQL queries, matching the structure of the result e.g:

For each customer do

1. retrieve root – customer info – cust. name & id retrieved, tagged & output

2. retrieve, tag, & output customer account info

3. retrieve, tag, & output customer purchase orders

4. for each PO, retrieve, tag, output items, then payment info

Early structuring : queries follow structure of generated doc

Early tagging : each element is tagged when retrieved

Outside the engine : obvious

2005 rel-xml-i 13

Shortcomings :

• Many small granularity queries – several queries per “object” – serious performance problems

• Performs a nested loop join – a fixed join order and join strategy – the relational engine might explore others

2005 rel-xml-i 14

2nd strategy: Early structuring, tagging inside the machine :

For this, augment the db engine with

• New data type : xml

• Constructors for the kinds of elements in the document, e.g.

2005 rel-xml-i 15

Now, can express query as:

XMLAGG aggregates several XML fragments into one

For example, two accounts for the customer

2005 rel-xml-i 16

The XML fragments have variable size represented as

Character large objects (CLOBs)

Problems:

• CLOBS are stored separately of their tuples, hence may need separate fetches

• Each XML constructor copies the inpuyt CLOBs to form its output CLOB – a lot of copying

Advantage :

One large query, rather than many small ones

Still nested loop join, but possibly engine can select another strategy (?)

2005 rel-xml-i 17

3rd strategy: Late structuring & tagging :

If both structuring and tagging are done late (possibly outside engine), we can separate process into two

• Content creation – retrieve the data from the db, inside the engine

• Structuring and tagging – the 1st possibly inside, the 2nd outside

2005 rel-xml-i 18

Contents creation – outer join approach :

Select cust.*, acct.*, porder.*, pay.*, item.*

From Customer cust

left join Account acct on cust.id = acc.custId

left join PurchaseOrder porder on cust.id = porder.custId

left join Item item on porder.id = item.poId

left join payment pay on porder.id = pay,poId

Left join: a customer should occur in result even if no account, or has an account but no purchase orders, etc.

Result for a customer w/o some fields is padded with nulls

Join is performed for each path in tree (root to leaf) in some order

Disadvantages : ??

2005 rel-xml-i 19

Contents creation – (unsorted) path outer union approach :Select cust.*, acct.*, type =1

From Customer cust left join Account acct on cust.id = acc.custId

Outer union

Select cust.*, porder.*, item.*, type = 2

From Customer cust left join PurchaseOrder porder on cust.id = porder.poId

left join Item item on porder.id = item.poId

Outer union

Select cust.*, porder.*, pay.*, type = 3

From Customer cust left join PurchaseOrder porder on cust.id = porder.poId

left join payment pay on porder.id = pay,poId

Outer union: pads with nulls, like the left/right join, but does not duplicate data (as much)

Each sub-query is a join for one leaf-to-root path

Note: a sub-query repeated twice

2005 rel-xml-i 20

Contents creation – (unsorted) node outer union approach :

The previous strategy still ahs some redundnacy:

A parents info is replicated with the descendents

Can avoid by using id’s of parents in the descendents

2005 rel-xml-i 21

Structuring & tagging (for unsorted outer union) :

Use a hash table, with hash key type and ancestor id’s for an element to in the XML tree

hash-based tagger

For each tuple in relational result, that defines a node in tree, find out (hash) if parent is present:

Yes – just add this new node (tag)

No – add nodes for all the missing ancestors along the path to root (hashing repeatedly for shorter paths)

Main disadvantage :

For large outputs, main memory shortage

Also, does not necessarily satisfy required order – may need sorting

2005 rel-xml-i 22

Last strategy : early structuring & late tagging :

The idea: order the relational contents in the same order it needs to appear in the (flattened) XML file

Then, tagging (& nesting) can be performed in constant space

• All info about a node X appears before/with the information of its children

• The info of X & its descendents appears together (no mixing with descendents of other nodes)

• The children are ordered as required by the XML def

2005 rel-xml-i 23

Contents creation – sorted path/node outer union approach :

Same as the outer union approach, but add a final sort step

Relational engines are sorting experts, including external sorting

• Sort on id fields, with id of higher nodes preceding those of lower nodes

In example: CustId, AcctId, PoId, ItemId, PayementId

• Nulls should be accounted for in sorting, and null values should precede non-nulls

2005 rel-xml-i 24

Performance comparisons (fig. 13):

• Outer join performs badly – too much data redundancy (not shown)

• One large query is better than many small ones (stored proc.)

• Inside the engine outperforms out the engine (for similar strategies)

outside the engine needs to copy data and bind to external variables --- binding out time is a significant component for all approaches

Binding out

In black

Documents

Relational to XML Transformations