XML vs. Relational

Mapping between XML and relational –By Guy Karlebach

Based on XQuery from the Experts, Katz et al. Chapter 6

For the foreseeable future, most business data will continue to be stored in relational database system .

To exploit the benefits of XML a need for mapping arises.

XML vs. Relational

Relational data is flat and untagged, whereas XML data contains tags and is

hierarchical.

The order of rows in a relational table

is unimportant, in contrast to XML documents.

XML vs. Relational

Tags: For example, separating an article to subtopics .

<introduction title=“…”< , <methods<, <results … <

In a relational database each of these tags may be a different table or column .

XML vs. RelationalHierarchy: For example, describing the chain of command in the army

<General<…

< Colonel<…

< Lieutenant<…

< Sergeant <…

< Private<

In a relational database this hierarchy may be represented by using foreign keys, though this may be confusing – for example if an XML parent points

only to one child and several parents may point to the same row child in the DB .

A possible model for the mapping application

End application XML Processor Relational DB

XML Acts

XML Result

DB Result

DB Dialect

Types of End Applications which are XML – oriented and

interact with a relational DBRead-Only applications:

1 .Complete document emission –

For example, an application that visualizes the data of XML files that are stored in the database ,

which needs XML tags that instruct it through the visualization process (BSML documents has a

visualization section.(

The emitted data can be further processed, e.g. using XSLT.


interact with a relational DB2.Document fragment emission -

Doing this by further processing emitted documents is not efficient .

Therefore we need a way to ‘push the selectioncriterion into the database .’

The database already has an innate selectioncapability .

An example for this is an application that visualizes parts ofXML documents which are stored in a relational DB.


interact with a relational DB

3.Document XML Querying & transformation –

Query for XML documents/fragments

Add markup

Change the relational output.


interact with a relational DBExample: XML for Tables.

<EmployeesTable{<for $i in xp:table(“EXAMPLE_DB",“EMPLOYEE")/EMPLOYEE/row,

$p in xp:table(“EXAMPLE_DB", EMPLOYER")/EMPLOYER/row[DEPART_ID = $I/DEPART_ID]return

<row{<< employeeId<

${ i/ID}/< employeeId<

< managerId<${ p/ID}

/< managerId</<}row<

/<}EmployeesTable<


interact with a relational DBConsider the following XQuery:

for $x in doc (“purchases.xml”) / PurchaseOrder / originator

where $x / contactAddress / zip-four = “98102-1234”

return $x / contactName

Translation to SQL:

SELECT contactName

FROM purchaseorder , originator

WHERE purchaseorder.id = originator.poid AND originator.zip4 = “98102 ”


interact with a relational DBThe translation procedure can be described in the following manner:

1.Start with the SQL query that would be used to emit the entire PurchaseOrder element list.

) SELECT PurchaseOrder.originator, PurchaseOrder.id(… ,

2.Translate the XQuery selection conditions into equivalent SQL selection conditions.

3.Prune the SQL query so that it will reference only the tables and columns necessary either to evaluate the selection conditions or to return required

results.


interact with a relational DB

Writing applications:

4.Update of XML data In the DB –

Change of data in the DB according to an XML locating information

) such as XPath expressions.(

A more complex version if this type of application would make

changes in the XML schema of the XML data which is stored in the

relational DB.

Sources of the application’s input data

There are two possibilities:

1 .The original format of the data is relational. In this case the application needs a ‘virtual’ XML view of an originally relational data.

2.The original format of the data is XML. This is possible in two different scenarios:

a. The input data is currently XML – e.g. shredding.

b. The input data used to be XML but was mapped into a DB,

and the application has access to the relational form.

In the following discussion we will refer to the way (XML or relational)in which the data is stored.

Relational view of data which is stored as XML – concepts

There are 3 different ways to view XML as relational data:

1 .LOB –

A field in a DB table contains XML data .

In this case the mapping back to XML is trivial)Easy implementation of a complete

document emission application.(

No need to know the XML schema of the datawhich is received for storage .


2 .Composed –

The XML data is separated among different table fields and possibly different tables .

The process in which XML documents are entered into a relational DB in a composed form is called shredding.

When XML data is stored in a relational DB in a composed form a table field can ‘belong’ to more than one XML document.

E.g. an author that created many documents.


Knowing the XML schema/DTD helps composed viewing of input XML data because it allows to define mapping rules between the XML form and the relational form .

We will later discuss automatic creation of such mappings .

The process in which composed relational data is transformed to create XML documents or elements is called composition.

Complete document emission is not trivial, but on the other hand the DB mechanisms such as querying can be used on the data.

Composed and LOB representations can be combined.


3 .XML native data type –

A possible functionality of the DB itself. Fields in the

DB can be defined as XML, and the DB may allow

special operations on this data type .

Example: (Native XML data type in green)

SELECT…

FROM …

WHERE TABLE.FIELD1=“Year/Region=‘middle east”’

XML view of data which is stored as relational – concepts

The translation from relational to XML is divided into two types:

1 .Implicit translation –

The user of the translation mechanism has no control

over the results of the translation .

For example :

A constant XML structure that mirrors the DB tables .

< TableEmployee<<Column1 …<


2 .Explicit translation – The user of the translation mechanism can affect its results .

This effect may include:

a. Choosing tables and columns that will participate in the

translation .

b. Mapping between rows/columns in the DB and XML elements or

attributes .

c. Naming XML elements in the translation result.

d. Adding elements or attributes with constant values, or wrapper

elements. E.g., adding <time< to <hours< …<minutes< …<seconds<

e. Defining the result of the translation as a function of the DB content.

e.g. creating a <date< child of <meeting< only if the date of the

meeting has not occurred yet .


Does explicit translation of relational data to XML allow data base update through XML interface ?

Querying through XML interface is understandably possible,but when it comes to update giving the user the power of explicit translation may causedifficulties.

For example: If the user can define in the result a text node which is the concatenation of two database table fields, how will this one new field be used to update the original

database fields upon its change ?

Of course, rules can be defined, but the task gets more complicated and less safe.

We will review more aspects of explicit translation in the next two slides.

XML view of data which is stored as relational – Join Queries

Joining relational DB tables that are viewed as XML captures the structures of the join query. For example:

]A JOIN B JOIN C [ U [ A JOIN D ]

<A<

content of A…

< B< content of B …

< C< content of C… </C <

/< B<

< D< content of D …</D<

/<A<

XML view of data which is stored as relational – Group Queries

Grouping queries allow the user to produce more compact XML data. The elements by which the grouping is made become common XML elements in their group, as is demonstrated below:

<ElementForGroupA<< GroupingColumnOfGroupA<

1996 < GroupingColumnOfGroupA</

< GroupingColumnOfGroupA< books

/< GroupingColumnOfGroupA< …

< RowInGroupA<< column<

middle east /< column <

< column< 20,000,000

< column<…

Composition Techniques

We will discuss three general types of composition techniques:

1.Default mappings

2.Extended SQL

3.Annotated schemas

1.Default Mappings

This is a mirroring of the database tables’ structure as XML. This is an implicit translation method. Example:

<TableOrder<

< ColumnDate<

..

/< ColumnDate<

…


Advantage of default mappings: Uncomplicated update commands.

Disadvantage: Many applications require a specific XML format. This demands a XML

format other then the one which represents a table, and sometimes also data selection.

Further transforming ability can be added, but that dispels the update advantage.

2.Extended SQL

The idea is to add XQuery capabilities to SQL. An example for this implementation type is the SQL/XML language.

SELECT e.id, XMLELEMENT)

NAME “Emp,”

XMLATTRIBUTES( e.name AS “name”),

e.hire, e.dept) AS “result”

FROM EMPLOYEE e


XMLELEMENT is a DB’s native XML data type, and it stores the XML result of the query in a field called “result” in the relational table which is the result of the query.

Following is an example that creates a more complex hierarchy and also prevents redundancy by a grouping command:

SELECT XMLELEMENT)

NAME “Department,”

XMLELEMENT)

NAME “departmentName,”

e.dept

(

XMLELEMENT)

NAME “families,”

XMLAGG)

XMLELEMENT)

NAME “family” , e.lastname

(

(

(

( AS “families_in_departments”

FROM EMPLOYEE e

GROUP BY e.dept


Result:

<department<

< name< physics </name<

< families <

< family<

Smith

/< family<

< family<

Thompson

/< family<

/< families<

/<department<

<department<

…

Composition Techniques3.Annotated schemas

This composition technique is based on providing an XML document or a DTD or XML schema which contains instructions for the composition process.

Example: Annotated XSD Schema Language

The default of this language is XML schema without any annotations, which means that each complex element matches a table in the DB and each simple element in it matches a field in that table .

Annotations which are added to the schema override the default composition instructions, and also define the composition in the case of complex elements within complex elements that appear in

the schema .


In the following slide:Blue shows the relational tables from which the XML elements draw valuesGreen shows which elements should appear ‘together’ in the resultLight Blue matches between the attribute and the column for its content in the relational ‘Orders’ table. If it is lacking in a simple element/attribute definition the mappings looks for a relational table field that matches the attribute/simple element name.

Composition TechniquesAnnotated schema example:

<xs:element name=“Cusomer” type=“CustomerType” sql:relation=“Customers”<

< xs:complexType name=“CustomerType <”

< xs:attribute name=“OrderID” type=“xs:integer” sql:field=“ID” </

) … more fields as simple elements or attributes … (

< xs:sequence<

< xs:element name=“Order” maxOccurs=“unbounded” sql:relation=“Orders”<

) … fields as simple elements or attributes… (

< xs:annotation<

< xs:appinfo<

<sql:relationship parent=“Customers” parent-key=“CustomerID”

child=“Orders” child-key=“CustomerID</”

/<xs:appinfo<

/< xs:annotation<

/< xs:element<

/< xs:sequence<

/< xs:complexType <

/<xs:element<

Composition TechniquesRemarks:

1.parent-key and child-key can contain a list of values in the case where more than one column creates the relational table’s key.

2.The mapping marked in green allows a transitive mapping, e.g. coupling of a Customer element with an element from a relational table which ‘Orders’ points to but

‘ Customers’ doesn’t.3.The advantage of this mapping is that in addition to composition it also defines XML to

relational update or shredding . The SQL/XML XML output, on the other hand , is not restricted to a certain

schema and therefore the opposite mapping is not obvious.4.There is a limited way to apply selection to value-retrieval :

<xs:element name=“BillTo” type=“xs:string” sql:relation=“Addresses” sql:field=“StreeAddress” sql:limit-field=“AddressType” sql:limit-value=“billing”</

The AddressType column’s value must be ‘billing’ in order to use the StreetAddress column .


There used to be a tendency to use languages which their sole purpose was accepting input in one form (relational or XML) and transforming it into the other.

This approach is no longer common because

it requires learning a new language that has no other use ,

and it did not add expression ability beyond the other mechanisms abilities.

Composition – Automatic generation of XML schema from relational data

Usually useful for data which was originally relational (the case of an uncareful shredding also exists).

Motivation:

1 .Our originally relational data enjoys the benefits of XML

2 .Error prevention

3 .Reduces workload

4 .Provides a methodology different DB systems can conform

Based on “Schema Conversion Methods between XML and Relational Models, Dongwon Lee et al.


Own House #

Name Id

#1 1 Bob 10

#2 1 Bob 20

#3 2 Bob 10

#4 3 Bob 10

#5 4 Pete 10

#6 4 Pete 20

#7 5 Pete 20

nest

Own House #

Name

Id

#1 {1,2,3}Bob 10

#2 1 Bob 20

#3 4 Pete 10

#4 {4,5 }Pete 20


Method 1:Try nesting according to each column, and continue nesting in the tables which are the nesting opertaions’ results. Choose the longest of the nesting series.

If only one nesting operation succeeded create only one XML element from the relational table)each field will be either an attribute or a child element .(

Otherwise,create an XML element for the relational table, and:For each column by which nesting was performed:Create a child element of the table element.

If that column was nullable in the relational DB schema it will create an XML elementof multiplicity.*

Otherwise (not nullable) it will appear as an XML element of multiplicity.+

For each column by which the nesting was not performed:

All the columns which we didn’t nest by will appear as child elements of multiplicity 1 or 0 ,using the same considerations ( nullable or not).

Note that this method does not provide a mechanism for linking between the tables’ elements.


Method 2:Based only on the relational schema and not the content of the relational tables.

a. Create a directed graph where each element is a vertex and there is a directed edge AB if B has a foreign key (or a referencing field) that points to A.

b. Define basic vertexes as vertexes that satisfy one of the following: 1 .There are no edges that point to them

2 .All the edges that point to them have matching edges that point to the pointing edge’s source.

c. From each basic vertex perform BFS on the graph. Every discovery of a new vertex creates a child element of the basic node.

The multiplicity of the child element is set to ? If the edge represents a key element ) that is, a foreign key in the child that point to the parent and also acts as the child’s

only key field ) and otherwise it is set to .* d. If the BFS reaches a vertex a second time instead of creating a child element for it

create an attribute IDREF if the edge represents a key and IDREFS otherwise.) We use IDREF/S in order avoid data redundancy.(

Shredding – Planning of the relational DB schema

Creating a target relational schema is the first step in any shredding procedure. Some considerations:

1.Adding shredding information beyond the content of the shredded document: Examples of Information that is not contained in shredded XML documents:

table keys, or foreign keys.

Information about composition can also be added to the shredded document before shredding:

For example adding document-unique IDs + a document ID to elements and then assigning these IDs to a special field in the DB. This way we produce a field

which is a unique key among all the documents which are shredded in a relatively simple procedure (the ID attribute type won’t work here, for

instance .(

If we want to support update of the DB through incoming XML documents (for example a doctor who updates a clinical document which was previously shredded), we have

to keep composition enabling information .


2 .Spreading/Uniting XML elements in relational tables:

Creating a single table for several XML elements from the same XML schema/DTD is called Inlining.

In other cases it might make sense to create separate tables for certain elements, for example when a child element appears with cardinality 0.∞..

Creating a single table for several XML elements from different XML schemas/DTDs which resemble one another is called consolidation .

The benefits of consolidation are:a. Saving space b. Simplifies the DB schema c. Allows integration of data sources for research (e.g. BSML and HapMap) .


3 .Supporting the flexibility of XML documents:The structure that an XML schema/DTD defines is more flexible than that of a relational database.Some examples :

a. The maximal length of strings may not be limited. This requires estimation of the data size and using datatypes such as CLOB when these are needed. This may cause space waste.

b. XML elements which are defined in the XML schema/DTD may not appear in the XML document or may appear a changing number of times .

c. XML schema may use xs:any which allows the occurrence of arbitrary elements (for example the Inclusion of any well-formed XML code).

d. The case of an XML input that does not have a schema or that has a changing schema is still more complicated. Possible solutions:

Learning . Creating document-specific schemas and crossing them.

Making a wide use of LOB representation . Using an edge table .


Edge table:

A single relational table describes the whole XML document:

parentId The id of the row of the XML parent element of the element which was mapped to this row

childId The id of this row (which contains a single element/attribute)

Name The name of the element which is contained in this row in the XML document

Value The non-element content of the element which is stored in this row

.

.

.


4 .Prevention of data redundancy :

Shredding that involves large number of XML documents can result in data

redundancy.

Suppose that we have a repository of XML documents that describe the people who

live in a certain very long street.

Each document contains some data about one neighbor: his house number in the

street and the house numbers of all the people who live in a region of 50 houses

around his house.

When all the data will enter the database keeping only the house number of the one

neighbor that lives north to that person is a worthwhile consideration (a solution that

keeps all the house numbers requires a connecting person__neighbor table .(

Shredding – Automatic generation of relational DB schema

Motivation:1 .Error prevention

2 .Reduces workload3 .Provides a methodology different DB systems can conform

Of course, not always an automatic algorithm captures the semantics of the XML model.We will see an example of creating a relational DB schema from a DTD

Definition: Top nodes nodes in the XML tree that satisfy one of the following conditions:

1 .Root2 .A child of cardinality * or+

3 .A node that may be its own descendant (recursion)4 .A node that can appear between two nodes that appear recursively one inside the

other

Based on “Schema Conversion Methods between XML and Relational Models, Dongwon Lee et al.

Shredding – Automatic generation of relational DB schema

The algorithm: For each top node:

1 .create a table 2 .perform inlining of all the element and attribute nodes that can be reached from that

node, contain PCDATA , and they are not top nodes.

Explanation : The root table is the base table .

Children of unbounded cardinality should get their own table because the number of fields to contain them all cannot be predetermined.

The same goes for elements which can appear recursively and the elements described as type 4 in the previous slide.

The rest of the nodes can be ‘safely’ inlined.

A shredding tool that uses XPATH

XML Data Mediator is a tool for bi-directional data conversion between XML and structured data formats such as relational.

It is an explicit mapping tool.

It replaces the programming effort by the simpler effort of writing a script that describes the relationships between the XML constructs and the corresponding RDBMS constructs .

XML Data Mediator can be used as a stand-alone utility, or it can be integrated as a library in other applications .


?<xml version=“1.0<?”<xrt:xrt xmlns::xrt=http://www.xrt.org/xrt xmlns:xsl=http://www.w3.org/1999/XSL/Transform<

< xrt:xml2rdbms<< xrt:locator xrt:name=“EX2-store</”

< xrt:transaction xrt:tid=“level1<” --!< First transaction <--

< xrt:sqlinsert xrt:sid=“upsert-YEARS_HIGHER_THAN_PRECEDING” xrt:tablename=“SEMINAR. YEARS_HIGHER_THAN_PRECEDING”

xrt:template-name=“table_years_higher_than_preceding<”< xrt:key xrt:name=“YEAR_NAME” xrt:type=“integer</”

< xrt:column xrt:name=“REGION” xrt:type=“string</”< xrt:column xrt:name=“PARENT_ID” xrt:type=“string</”

< xrt:column xrt:name=“NODE_ID” xrt:type=“string</”/< xrt:sqlinsert<

/< xrt:transaction</< xrt:xml2rdbms<

< xrt:xml2xml<< xsl:stylesheet version=“1.0” xmlns:xsl=http://www.w3.org/1999/XSL/Transform xmlns:xrt=http://www.xrt.org/xrt<

< xsl:import href=“someDirectory/myFile.xsl </”< xsl:template name=“mainTemplate<”

< xsl:call-template name=“table_years_higher_than_preceding</”/< xsl:template<

/< xsl:stylesheet<< xrt:xml2xml<

http://www.xrt.org/xrt

http://www.w3.org/1999/XSL/Transform





myFile.xsl:

?<xml version=“1.0<?”<xsl:stylesheet version=“1.0” xmlns:xsl=http://www.w3.org/1999/XSL/Transform

xmlns:xrt=http://www.xrt.org/xrt <

<xsl:template name=“table_years_higher_than_preceding<”< xsl:for-each select=

“ salesdata/year[preceding-

sibling::year[position()=1]/region[name='west']/sales[@unit='millions']< region[name='west']/sales[@unit='millions']< “ [

< xsl:element name=“xrt:record<”< xsl:element name=“xrt:column<”

< xsl:attribute name=“xrt:name”<YEAR_NAME</xsl:attribute<< xsl:attribute name=“xrt:value”<<xsl:value-of select=“theyear”/<</xsl:attribute<

/< xst:element<< xsl:element name=“xrt:column<”

< xsl:atttribute name=“xrt:name”<REGION</xsl:attribute<< xsl:attribute name=“xrt:value”<<xsl:value-of select=“region”/<</xsl:attribute<

/< xsl:element<< xsl:element name=“xrt:column<”

< xsl:atttribute name=“xrt:name”<PARENT_ID</xsl:attribute<< xsl:attribute name=“xrt:value”<<xsl:value-of select=“generate-id(..)”/<</xsl:attribute<

/< xsl:element<<xsl:element name=“xrt:column<”

< xsl:atttribute name=“xrt:name”<NODE_ID</xsl:attribute<< xsl:attribute name=“xrt:value”<<xsl:value-of select=“region”/<generate-id(.)</xsl:attribute<

/< xsl:element<

/< xsl:element</< xsl:for-each</<xsl:template<

/<xsl:stylesheet<




Performance Analysis - compositionWe will separate the work into two conceptual stages:

1 .Performing an SQL query that will “prepare” the required data2‘ .Editing’ the query results by adding tags and hierarchy.

A major bottleneck in composition performance is creating XML hierarchy through JOIN queries. We will present 3 techniques to deal with this problem:

Simple Join:The most straightforward approach.

Perform one JOIN between all the relational tables that we need in the output document (for instance, using parent key id’s).

Edit the relational result into XML hierarchy.

The relational result of the query should be arranged so that child elements will appearafter their parents .

Passing over the results will indicate a new level in hierarchy wherever there’s a key column change .

Performance Analysis - composition

Experiment Patient Parent id Weight

Weight Change

Bob Group1 75

Weight Change

Bob Group1 76

Simple join requires a key column for each non-leaf element.Here’s an example where it doesn’t happen:

)The solution here would be assigning a different key column to each row of the EXPERIMENT table(


Another requirement of the simple join approach is that the hierarchy in the target XML document be linear.

To understand this, suppose that every XML element in the target document had two child elements. In this case the preliminary DB query will consist of 3*Non-leaf Elements JOIN operations.

For example, suppose we want to produce the following markup from the relational tables A,B1,B2,C1,C2,D1,D1:

<A<

< B1<

< C1</

< C2</

/< B1<

< B2<

< D1</

< D2</

/< B2<

/<A<

The preliminary query will be :

A JOIN [(B1 JOIN C1) JOIN (B1 JOIN C2)] JOIN A JOIN [(B2 JOIN D1) JOIN (B2 JOIN D2)]

because every element in the result table must be preceded by its parent element’s id .


Dependent Join:In this approach we define first an SQL query for the table that contains the root element of the XML document.

Using the resulting table we create a query for each child element, and so on recursively.

)remember that we know in advance which tables we should join but we don’t know in advance that each join will result in a new child element. When there are no more child elements in a certain path

we have to ‘climb up’ back .(

At each recursive step we add the current elements to the XML document .


For example:<A<

< B1<< C1</< C2</

/< B1<< B2<< D1</< D2</

/< B2</<A<

Step1: (A JOIN B1) JOIN (A JOIN B2) create markupStep2: (B1 JOIN C1) JOIN (B1 JOIN C2) create markup..Experimentally, the dependent join approach shows poor performance in large mappings.


Sorted outer join (SOU):This is an improvement on the simple join approach which removes the limitation on

hierarchy .

The idea is to deal in one query with several linear hierarchies which will result in a non-linear hierarchy .

First we compose many ‘legal’ simple join queries. We achieve this by giving each one such tables that its result will contain the keys of only one path from the root to one

of the leafs .

Next we perform a query which is the UNION of all these query results .

The result is a table in which each row contains a linear path from the root to a leaf.

Processing this table creates the XML document. We still require that each row will have a unique key field.


For example:<A<

< B1<< C1</< C2</

/< B1<< B2<< D1</< D2</

/< B2</<A<

The query:Step1: (A JOIN B1 JOIN C1) result1 Step2: (A JOIN B1 JOIN C2) result2 Step3: (A JOIN B2 JOIN D1) result3

Step4: (A JOIN B2 JOIN D2) result4Step5: result1 U result2 U result3 U result4

)It is also important to notice that in retrieving the a path like A-B-C, we will preserve in the result the keys of all three tables but only the value columns of C.(

Documents

XML vs. Relational