Translating XSLT programs to efficient SQL queries

Translating XSLT Programs to Efficient SQL Queries

Group 5 Guided By:

Keyur Patel (1100101106013) Dr Kalpdrum Passi

Shubham Shah (1100101106037)

Manav Sharma ( 110010107056 )

Ruturaj Raval (110090107036)1

Introduction to the reason, why XSLT originated

• It is originally designed to serve as a stylesheet, to map XML to HTML

• Nowadays, increasingly used in querying and transforming XML data

2

What is the need of translation (XSLT to SQL)

• Most of the XML data used in enterprises originates from relational databases.

• This will not change in near future

• Reason

Transactional guarantee offered by relational database

3

XSLT to SQL

• XML languages(like XML-QL XQuery and not XSLT) were been translated from a long time

• Translation from XSLT to SQL is more complex, since it is not query language

• Reason

Gap between followings

• XSLT - functional and recursive paradigm

• SQL - declarative paradigm

It is possible to construct such XSLT programs that have no SQL equivalent

4

Solution (Other than translation)

• XSLT interpreter can construct XML elements on demand, by issuing SQL query per XML element

• One can formulate SQL query to retrieve an XML element with a given ID

• Implementation of this algorithm will end up analyzing the entire XML documents most of the time

• It slows down interpretation due to multiple queries, issued for single XSLT program

5

XML view Beers/Drinkers/Bars

6

Solution

• Solution given further generates one query for single XSLT program than pushing entire computation inside relational engine

• Example<xsl:template match="*">

<xsl:apply-template/>

</xsl:template>

<xsl:template match=“drinker[name==’Smith’]">

<xsl:value-of select=“age/text()"/>

</xsl:template>

Generated SQL query

SELECT drinker.ageFROM drinkerWHERE drinker.name = "Smith"

7

XPath

• XPath is a component of XSLT, and the translation to SQL must handle it.

• For example the XPath /doc/drinkers/name/text() returns all drinkers.

• The equivalent SQL is

SELECT drinkers.name

FROM drinkers

8

Variables and parameters in XSLT

• In XSLT one can bind some value to a parameter in one part of

the tree, then use it in some other part

• In SQL this becomes a join operation, correlating two tables

9

Other XSLT Constructs

• The translation also supports if-[else], for-each, and case constructs.

• The for-each construct is equivalent to iteration using separate template rules.

• The case construct is equivalent to multiple if statements.

10

Challenges

• To map from a functional programming style to a declarative style.

• To cope with general recursion, both at the XPath level and in XSLTtemplates.

• Parameters add another source of complexities, and they typicallyneed to be converted into joins between values from different parts ofthe XML tree.

11

Challenges

• We have to address the quality of the generated SQL queries.

• SQL queries tend to be redundant and have unnecessary joins, typically self-joins .

• An optimizer for eliminating redundant joins is difficult to implement since the general problem, called query minimization, is NP-complete

• Commercial databases systems do not do query minimization because it is expensive and users do not write SQL queries that require minimization.

• In this case automatically generated SQL queries however, it is all too easy to overshoot, and create too many joins.

12

13

ARCHITECTURE

• An XML view is defined over the relational database using a View Tree .

• The XML view typically consists of the entire database, but can also be a subset to export a subset view of the relational database. It can also include redundant information.

• Once the View Tree has been defined, the system accepts XSLT programs over the virtual XML view, and translates them to SQL in several steps.

14

15

• First the parser translates the XSLT program into an Intermediate representation (IR). The IR is a DAG (Directed Acyclic Graph) of templates with a unique root template.

• Each leaf node contributes to the program’s result, and each path from the root to a leaf node corresponds to a SQL query: the final SQL query is a union of all such queries.

• Each such path is translated first into a Query Tree (QTree) by the QTree generator.

• A QTree represents multiple, possible overlapping navigations through the XML document, together with selection, join, and aggregate conditions at various nodes.

16

• The SQL generator plus optimizer takes a QTree as input, and generates an equivalent SQL query using the XML schema, RDB schema, and View Tree.

• The querier has an easy task; it takes the generated SQL query and gets the resulting tuples from the RDB.

• The result tuples are passed onto the tagger, which produces the output for the user in a format dictated by the original query.

17

18

TRANSLATION

-First the XSLT program is parsed into an internal representation (IR) that reflects the semantics of the program in a functional style.

-The QTree, which is an abstract representation of the paths traversed by the program on the View Tree. A QTree represents a single such path traversal, and is a useful intermediate representation for purposes of translating XML tree traversals into SQL.

-Then describe the representation of the XML view over relational data (the View Tree), and

-Finally show how we combine information from the QTree and View Tree to generate an equivalent SQL Query.

Example: which retrieves the number of beers every drinker likes in common with Brian.

XML view Beers/Drinkers/Bars

20

Parser

-The output of the parser is an Intermediate Representation (IR) of the XSLT query. Besides the strictly syntactic parsing, this module also performs a sequence of transformations to generate the IR.

-First it converts the XSLT program into a functional representation, in which each template mode is expressed as a function.

-We add extra functions to represent the built-in XSLT template rules ,then we “match” the resulting program against the XML Schema. During the match all wildcards (*) are instantiated and only valid navigations are retained.

-The end result for our running example is the IR . In this case the result is a single call graph.

The various stages leading to IR generation for the query

QTree

-The QTree is a simulation from the XML schema, and succinctly describes the computation being done by the query.

-The QTree abstraction captures the three components of an XML query: (a)the path taken by the query in the XML document, (b)the conditions placed on the nodes or data values along the path (c)the parameters passed between function calls.

-Corresponding to the three elements of the XML query above, a QTree has the following components.1. Tree2. Condition set3. Mapping for parameters

The logic encapsulated by the XSLT program is as follows:

1. start at the “root” node.

2. traverse down to a “drinkers” named Brian; “./beers/beername” is passed as a parameter at this point.

3. starting from the “root”, traverse down to “drinkers” again.

4. traverse one level down to “name”, and perform an aggregation on the node-set “beers[...]/”

The View Tree

-The View Tree defines a mapping from the XML schema to the relational tables.

-The View Tree defines a SQL query for each node in the XML schema.

1. drinkers(name,age,astrosign) = Drinkers(name,age)Astrosign(name,astrosign)

2. Beers(beerName,price,drinkerName)=Drinkers(drinkerName),Beers(beerName,price)Likes(drinkerName,beerName)

-Tree for the beers/drinkers/bars schema. The rule heads (e.g., Drinkers) denote the table name, and the arguments denote the column name. Same argument in two tables represents a join on that value

.

SQL Generation

-This section explains how we generate SQL from a QTree using the View Tree.

-A QTree represents a traversal of nodes in the original query and constraints placed by query on these nodes. The idea is to generate the SQL query clauses corresponding to those traversals and constraints. This is a three step process.First, nodes of the QTree are bound to instances of relational tables.

-Second, the appropriate WHERE constraints are generated using the binding in the first step. Intuitively, the first step generates the FROM part of the SQL query and join constraints due to tree traversal. The second step generates all explicitly specified constraints.

-Finally, the bindings for the return nodes are used to generate the SELECT part.

Eliminating Join Conditions on Intersecting Paths

-For any two paths in the QTree , nodes that lie on both paths must have the same value.

-One simple approach would be two bind the two paths independently and then for each common node add equality conditions to represent the fact that values from both paths are the same.

OPTIMIZATIONS

• Capable to generate inefficient queries

• Joins and nested queries

• Unnesting sub-queries, eliminates joins

• Most of the optimizers are general purpose SQL query rewritings

• Three major reasons• Optimizations are specific to kind of SQL queries• Optimizer did not preserve any semantic in general• Optimizer did not perform any commercial database, indeed as experienced

with the popular one• Semantics are preserved for XSLT to SQL translations

30

Nested queries

• Optimization is applied to the predicate of the form a and b, where b is subquery node-set

• Applied to expression when conjunction exist

• By default SQL query will generate b from algorithm

• Query can be unnested on base of node set b

• Three possibilities

31

• b is a singleton set• Simplest case• Unnesting of query will not affect multiplicity• To determine the node set to be a singleton• Atrosign in Xpath expression is a node-set• Node can have multiple values relative to its parent (by specifying a

‘*’). If no node in the QTree for the node-set has a ‘*’ in the XML schema, then it must be a singleton set

• Unnesting subquery representing singleton setQuery: Find name of drinkers which are 'Leo‘

XPath: /drinkers['Leo' == astrosign]Unoptimized: SELECT drinkers.name,

FROM drinkersWHERE 'Leo' in (SELECT astrosign.signFROM astrosignWHERE astrosign.drinker = drinkers.name)

Optimized: SELECT drinkers.name,FROM drinkers, astrosignWHERE 'Leo' = astrosign.sign ANDastrosign.drinker = drinkers.name

32

• b has no duplicates• Unnesting subquery having no duplicates

Query: The subquery

Unoptimized: SELECT likes.beer

FROM likes

WHERE likes.drinker = drinkers.name

likes.beer IN (SELECT likes.beer

FROM likes

WHERE likes.drinker= drinkers2.name)

Optimized: SELECT likes.beer

FROM likes, likes2

WHERE likes.drinker = drinkers.name AND

likes.beer = likes2.beer AND

likes.drinker = drinkers2.name

33

• No duplicates to subquery then, ‘true’ will be generated for all values in set b

• Unnesting of query without changing multiplicity

• To determine no duplicate exist, follow test is performed

• If the QTree for the nodeset has no node with a ‘*’ except at the leaf node, then it is a distinct set

• The intuition is that the siblings nodes (with the same parent) in the document are unique

• So if there is a ‘*’ at an edge other than the leaf edge, uniqueness of the leaves returned by the query is not guaranteed

• b can have duplicates• When a node-set can have duplicates for example, //beers,

unnesting the query might change the semantics

• Multiplicity of the resultant query will change if the condition a in b evaluates to true more than once

• Do not unnest such a query

34

Unnesting Aggregate Operators UsingGROUP-BY

• Optimization unnest subquery that uses aggregation by using GROUP-BY at the outer level

• Optimization is for form op b

• B is node set and op is aggregation operators like sum, min, max, count, avg

• Nested query is evaluated once for every iteration

• Same semantics on unnesting of query and GROUP-BY on all iterations

• Adding of same key in form of clause of outer query

• Aggregate condition is moved to HAVING clause

• SQL must contain all query of GROUP-BY fields

• All fields in SELECT field are added to GROUP-BY clause

35

• If outer query existing with GROUP-BY then above optimization can NOT be applied

• Qtree is used once

• Apply optimization at very first time as we can do first time

• Illustrating GROUP-BY Optimization• Query: Q1

Q2 (used below) is the optimized version Unoptimized SQL for Q1:

SELECT drinkers.name, count(Q2)FROM drinkers, drinkers2WHERE drinkers2.name = 'Brian'

Optimized SQL for Q1:SELECT drinkers.name, count(likes2.drinker)FROM drinkers, drinkers2, likes, likes2WHERE drinkers2.name = 'Brian'likes.drinker = drinkers.name ANDlikes.beer = likes2.beer ANDlikes.drinker = drinkers2.nameGROUP BY drinkers.name, drinkers2.name

36

QTree Reductions

• In optimization Qtree is transformed itself

• Long paths with unreferenced intermediate nodes are shortened

• QTree Reduction for query //beers/name

37

RELATED WORK

• SilkRoute is an XML publishing system that defines an XML view over a relational database, then accepts XML-QL queries over the view and translates them into SQL. The XML view is defined by a View Tree, an abstraction that we borrowed for our translation. Both XML-QL and SQL are declarative languages, which makes the translation somewhat simpler than for XSLT. A translation from XQuery to SQL is described in and uses a different approach based on an intermediate representation of SQL

• A generic technique for processing structurally recursive queries in bulk mode is described in. Instead of using a generic technique, we leveraged the information present in the XML schema

• This elimination is related to the query pruning described in SQL query block unnesting for an intermediate language has been discussed in the context of the Starburst system

38

CONCLUSIONS

• We have described an algorithm that translates XSLT into SQL

• By necessity our system only applies to a fragment of XSLT for which the translation is possible. The full language can express programs which have no SQL equivalent: in such cases the program needs to be split into smaller pieces that can be translated into SQL

• Our translation is based on a representation of the XSLT program as a query tree, which encodes all possible navigations of the program through the XML tree. We described a number of optimization techniques that greatly improve the quality of the generated SQL queries. We also validated our system experimentally on the TPC-H benchmark

39

40

Education

Translating XSLT programs to efficient SQL queries