XML in Relational DBcstyng/webdb.07/lectures/lesson5.pdfXML Data Storage - Relational DB 5...

Preview:

Citation preview

XML Data Storage - Relational DB 1

XML in Relational DB

XML Data Storage - Relational DB 2

A Sample File<SigmodRecord>

<issue><volume>11</volume><number>1</number><articles>

<article><title>Annotated Bibliography on Data Design.</title><initPage>45</initPage><endPage>77</endPage><authors>

<author position="00">Anthony I. Wasserman</author><author position="01">Karen Botnich</author>

</authors></article>

</articles></issue></SigmodRecord>

XML Data Storage - Relational DB 3

Corresponding Document TreeRoot

1

2

7 8 9

42

208207 209 210

10481047

5232 5237Anthony I

Wasserman

11 1

Annotated Bibliographyon Data Design. 45 77

KarenBotnich

SigmodRecord

issue

number articlesvolume

article

endPage authorsinitPagetitle

author author

@position@position

00 01

element node

text node

attribute node

XML Data Storage - Relational DB 4

Two Mapping Approaches

• Structure-mapping approach– static DTD– well-defined semantic constraints

• Model-mapping approach– dynamic DTD– ambiguous semantic constraints

XML Data Storage - Relational DB 5

Structure-mapping approach

• Design database schema based on DTD

• Different schemas for different DTDs

• Slides partially from a VLDB 2002 Tutorial

Relational DB

DTD

XML Doc

XMLQuery

XML

XML Data Storage - Relational DB 6

A Sample DTD<!ELEMENT book (booktitle, author)<!ELEMENT booktitle (#PCDATA)><!ELEMENT author (name, address)>

<!ATTLIST author id ID #REQUIRED>

<!ELEMENT name (firstname?, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address ANY><!ELEMENT article (title, author*, contactauthor?)><!ELEMENT title (#PCDATA)><!ELEMENT contactauthor EMPTY>

<!ATTLIST contactauthor authorID IDREF IMPLIED><!ELEMENT monograph (title, author, editor)><!ELEMENT editor (monograph*)>

<!ATTLIST editor name CDATA #REQUIRED>

XML Data Storage - Relational DB 7

A DTD Graphbook

booktitle

author

monograph

title

contactauthor

authorID

editor

*

nameaddress

?

firstname lastname

?

authorid

article

*

name

12 elements

XML Data Storage - Relational DB 8

Basic Inlining• An element graph is created for each element in the DTD

graph• Relations created for:

– Root of an element graph– Set value children (under a *)– Nodes with a back pointer (recursion)

• All other descendants inlined• Every element is made into as comprehensive a relation as

possible• Set-valued attributes are handled with separate relations

and foreign keys

XML Data Storage - Relational DB 9

Basic Inlining• An element graph is constructed by

– Do a DFS traversal of DTD graph, starting at element node for which we are constructing relations

– Each node is marked as “visited” the first time it is reached and is unmarked once all its children have been traversed

– If an unmarked node in DTD graph is reach during DFS, a new node bearing the same name is created in the element graph

– A regular edge is created from the most recently created node in the element graph with the same names as the DFS parent of the current DTD node to newly created node

– If an attempt is made to traverse an already marked DTD, then a backpointer edge is added from the most recently created node in the element graph to the most recently created node in the element graph of the same name as the marked DTD node

editor

name

XML Data Storage - Relational DB 10

Basic Inlining

• A relation is created for root of element of graph

• All element’s descendents are inlined into that relation except– Children below a “*” node

are made into separate relations – this corresponds to creating a new relation for a set-valued child

– Each node having a backpointer edge pointing to it is made into a separate relation

XML Data Storage - Relational DB 11

Basic: Relational Schemabook (bookID: integer, book.booktitle : string, book.author.name.firstname: string, book.author.name.lastname: string,

book.author.address: string, author.authorid: string)booktitle (booktitleID: integer, booktitle: string)article (articleID: integer, article.contactauthor.authorid: string, article.title: string)article.author (article.authorID: integer, article.author.parentID: integer, article.author.name.firstname: string,

article.author.name.lastname: string, article.author.address: string, article.author.authorid: string)contactauthor (contactauthorID: integer, contactauthor.authorid: string)title (titleID: integer, title: string)monograph (monographID: integer, monograph.parentID: integer, monograph.title: string,

monograph.editor.name: string, monograph.author.name.firstname: string, monograph.author.name.lastname: string, monograph.author.address: string, monograph.author.authorid: string)

editor (editorID: integer, editor.parentID: integer, editor.name: string)editor.monograph (editor.monographID: integer, editor.monograph.parentID: integer, editor.monograph.title: string,

editor.monograph.author.name.firstname: string, editor.monograph.author.name.lastname: string, editor.monograph.author.address: string, editor.monograph.author.authorid: string)

author (authorID: integer, author.name.firstname: string, author.name.lastname: string, author.address: string,aauthor.authorid: string)

name (nameID: integer, name.firstname: string, name.lastname: string)firstname (firstnameID: integer, firstname: string)lastname (lastnameID: integer, lastname: string)address (addressID: integer, address: string)

XML Data Storage - Relational DB 12

Fragmentation Problem<!ELEMENT author (name, address)><!ATTLIST author id ID #REQUIRED><!ELEMENT name (firstname?, lastname)>

<!ELEMENT firstname (#PCDATA)>

<!ELEMENT lastname (#PCDATA)>

<!ELEMENT address ANY>

author (authorID: integer, id: string)

name (nameID: integer, authorID: integer)

firstname (firstnameID: integer, nameID: integer, value: string)

lastname (lastnameID: integer, nameID: integer, value: string)

address (addressID: integer, authorID: integer, value: string)

Each element is a table

XML Data Storage - Relational DB 13

Problems in Basic In-lining• Creating a large amount of relations

– Separate relational schema for each element as root (minor)

– Unrolling recursive strongly connected components (major)

• Consuming great storage space when where are many null able(optional) attributes

• Too many joins– E.g. author becomes scattered (i.e., it appears

once for each possible instantiation from a root). Query: list all the books authored by Louis

XML Data Storage - Relational DB 14

Complexity of DTDs• DTD element specifications can be of

arbitrary complexity• <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)> is valid!• Fortunately, can simplify DTD for

translation purposes:– Key observations: not necessary to regenerate

DTD from relational schema– XML query languages do not query over

complex structures of elements

XML Data Storage - Relational DB 15

Simplification of DTDs

(e1, e2)* e1*, e2*(e1, e2)? e1?, e2?(e1|e2) e1?, e2?

e1** e1*e1*? e1*e1?* e1*e1?? e1?

..., a*, ..., a*, ... a*, ...

..., a*, ..., a?, ... a*, ...

..., a?, ..., a*, ... a*, ...

..., a?, ..., a?, ... a*, ……, ...a, …, a, … a*, …

Flattening Transformations Simplification Transformations

Grouping Transformations

<!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>

<!ELEMENT a (b*, c?, e*, f*)>

XML Data Storage - Relational DB 16

Shared Inlining• Idea

– Inline as many sub-elements as possible– However, to avoid scattering, do not inline an

element if it is shared– I.e. do not inline only if it is a shared, recursive or set

sub-element.

• Approach– Necessary and Sufficient Condition for shared/

recursive element: e.g. authorIn-degree >= 2 in DTD graph

XML Data Storage - Relational DB 17

Shared Inlining• Technique:

– Node with in-degree 1 in the DTD graph: inline

• Special cases: * and recursion– Node with in-degree 0: create a separate

relation, because it cannot be inlined– Node with in-degree greater than 1: create

a separate relation, because it is shared

XML Data Storage - Relational DB 18

Issues with Sharing Elements

• Parent of elements not fixed at schema level– Isroot: boolean variable

• Need to store type and ids of parents (or if there are no parents)– parentCODE field (type of parent)– parentID field (id of parent)– Not foreign key relationship

XML Data Storage - Relational DB 19

Shared: Relational Schemabook (bookID: integer, book.booktitle.isroot: boolean, book.booktitle : string)article (articleID: integer, article.contactauthor.isroot: boolean,

article.contactauthor.authorid: string)monograph (monographID: integer, monograph.parentID: integer,

monograph.parentCODE: integer, monograph.editor.isroot: boolean,monograph.editor.name: string)

title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)author (authorID: integer, author.parentID: integer, author.parentCODE: integer,

author.name.isroot: boolean, author.name.firstname.isroot: :boolean,author.name.firstname: string, author.name.lastname.isroot: boolean,author.name.lastname: string, author.address.isroot: boolean,author.address: string, author.authorid: string)

XML Data Storage - Relational DB 20

Shared Inlining

• Advantages– Reduces number of joins for queries like “get

the first and last names of an author”– Efficient for queries such as “list all authors

with name Jack”• Problems

– Sharing whenever possible implies extra joins for path expressions. E.g. “Article with a given title name”

XML Data Storage - Relational DB 21

Converting XML Queries to SQL

• Converting queries with simple path expressions– Relation(s) corresponding

to start of root path expression(s) are identified and added to from clause of SQL query

– If necessary, path expressions are translated to joins among relations

XML Data Storage - Relational DB 22

Converting Relational Results to XML

• Simple Structuring– Constructing such

results from a RDBMS is natural and efficient

– Requires attaching appropriate tags for each tuple

XML Data Storage - Relational DB 23

Model-mapping approach

• Fixed database schema for different XML documents without the support of DTD– Edge (Florescu and Kossmann, 1999)– XRel (YoshiKawa, M. and Amagasa, T., 2001)– XParent (H. Jiang et al., 2002)– iNode (Lau & Ng, 2002)

• Support well-formed but non-DTD XML documents

XML Data Storage - Relational DB 24

Edge

• One-table schema– Edge(Source, Ordinal, Target, Label, Flag, Value)

• Each row corresponds to one edge in data graph

• Require table joins for edge connections• Cannot determine the parent-child

relationship directly

XML Data Storage - Relational DB 25

The Document TreeRoot

1

2

7 8 9

42

208207 209 210

10481047

5232 5237Anthony I

Wasserman

11 1

Annotated Bibliographyon Data Design. 45 77

KarenBotnich

SigmodRecord

issue

number articlesvolume

article

endPage authorsinitPagetitle

author author

@position@position

00 01

element node

text node

attribute node

XML Data Storage - Relational DB 26

EdgeSrc Ord Tgt Label Flag Value0 1 1 SigmodRecord ref1 1 2 issue ref2 1 3 volume val 112 2 4 number val 12 3 5 articles ref5 1 6 article ref6 1 7 title val Annotated ….6 2 8 initPage val 456 3 9 endPage val 776 4 10 authors ref10 1 11 author val Anthony I …11 1 12 @position val 0010 2 13 author val Karen Botnich13 1 14 @position val 01

XML Data Storage - Relational DB 27

XParent

• XParent– uses LabelPath and DataPath tables to maintain

the structural information of the XML documents.

– The DataPath table is also used to keep the parent-child relationship instead of using region as in XRel.

XML Data Storage - Relational DB 28

XParent

• Four-table schema– LabelPath(PathID, Len, Path)– DataPath(Pid, Cid)– Element(PathID, Ordinal, Did)– Data(PathID, Did, Ordinal, Value)

• Determine the parent-child relationship by using equijoin

XML Data Storage - Relational DB 29

The Document TreeRoot

1

2

7 8 9

42

208207 209 210

10481047

5232 5237Anthony I

Wasserman

11 1

Annotated Bibliographyon Data Design. 45 77

KarenBotnich

SigmodRecord

issue

number articlesvolume

article

endPage authorsinitPagetitle

author author

@position@position

00 01

element node

text node

attribute node

XML Data Storage - Relational DB 30

XParentPID PathLen PathExp1 1 ./SigmodRecord2 2 ./SigmodRecord./issue3 3 ./SigmodRecord./issue./volume4 3 ./SigmodRecord./issue./number5 3 ./SigmodRecord./issue./articles6 4 ./SigmodRecord./issue./articles./article7 5 ./SigmodRecord./issue./articles./article./title8 5 ./SigmodRecord./issue./articles./article./initPage9 5 ./SigmodRecord./issue./articles./article./endPage10 5 ./SigmodRecord./issue./articles./article./authors11 6 ./SigmodRecord./issue./articles./article./authors./author12 7 ./SigmodRecord./issue./articles./article./authors./author./@position PID CID

1 22 32 42 55 65 156 76 86 96 1010 1110 1210 1310 14

LabelPath Table

DataPath Table

XML Data Storage - Relational DB 31

XParentPathID Ordinal Did1 1 12 1 23 1 34 1 45 1 56 1 67 1 78 1 89 1 910 1 1011 1 1112 1 1211 2 1312 1 14 PathID Did Ordinal Value

3 3 1 114 4 1 17 7 1 Annotated Bibliography on Data Design.8 8 1 459 9 1 7712 12 1 0011 11 1 Anthony I. Wasserman12 14 1 0111 13 2 Karen Botnich

Element Table

Data Table

XML Data Storage - Relational DB 32

INode

• Model-mapping approach• Node-oriented approach• Based on a node numbering scheme• Using the virtual node concept to calculate

the id for each node (Lee et al., 1996)

XML Data Storage - Relational DB 33

Key Features

1. Does not need to concatenate the edges to form a simple path for query processing

2. Reduces the database size by using a table to store all simple path expressions

3. The parent-child relationship is embedded in the NodeID

4. Retrieve the parent-child relationship by calculation instead of using equijoins orθ–joins

XML Data Storage - Relational DB 34

INode Numbering Scheme (1/4)

• Given a node n, denote the path of n asa1, a2, a3, …, akwhere k is the depth of node n

and 1 ≤ ai ≤ nc

1

32

1198 10 12 13

4

765

1,1 1,2 1,3

1,1,1

1( )⎥⎦⎥

⎢⎣⎢ +

−= 12)(

kiiParent

real node virtual node

XML Data Storage - Relational DB 35

INode Numbering Scheme (2/4)

• Based on the Parent(i) function, we have a path function as follow:

⎪⎪⎩

⎪⎪⎨

>++−

≤=

∑ ∑

=

=−

−−

=

2,1

2,)( 2

1

3

0

22

2

1

kannan

kapathf

i

k

iik

ic

kci

kc

ii

Where nc = maximum number of child nodes for a node

k = path length of the node

ai = value in the path at the position i

XML Data Storage - Relational DB 36

INode Numbering Scheme (3/4)

• With the path function, we embed the document-id in the node-id with the following function, where i is the number of decimal places for document-id

i

docidpathfnodeid10

)( +=

XML Data Storage - Relational DB 37

INode Numbering Scheme (4/4)

• Given the node-ids of i and j, where i is the ancestor of j, and the depth difference between two nodes as d, we can confirm the ancestor-descendent relationship with the following function

⎣ ⎦jjdc

d

i

icj

i nodeidnodeidn

nnodeidnodeid −+

⎥⎥⎥⎥

⎢⎢⎢⎢

+−−

=∑−

= 12

1

1

XML Data Storage - Relational DB 38

Table Mapping

• Three-table schema– Path(PathID, PathExp)– Element(NodeID, PathID, Ordinal, Value)– Attribute(NodeID, PathID, Value)

• Determine the parent-child relationship by calculation

XML Data Storage - Relational DB 39

Querying - EdgeQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author

select authors.valuefrom edge sigmodrecord, edge issue, edge articles, edge article,

edge authors, edge author, edge endpagewhere sigmodreacord.label = ‘SigmodRecord’

and issue.label = ‘issue’and articles.label = ‘articles’and article.label = ‘article’and authors.label = ‘authors’and author.label = ‘author’and endpage.label = ‘endPage’and sigmodrecord.source = ‘0’and sigmodrecord.target = issue.sourceand issue.target = articles.sourceand articles.target = article.sourceand article.target = authors.sourceand endpages.source = authors.sourceand author.source = authors.targetand endpage.value = ‘77’

XML Data Storage - Relational DB 40

Querying - XParentQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author

select d1.valuefrom labelpath lp1, labelpath lp2, datapath dp1, datapath dp2, data d1, data d2

labelpath lp3, datapath dp3where lp1.path = ‘./SigmodRecord./issue./articles./article./authors/author'

and lp2.path = ‘./SigmodRecord./issue./articles./article./endPage’lp3.path = ‘./SigmodRecord./issue./articles./article./authors'and d1.pathid = lp1.pathidand d2.pathid = lp2.pathidand d1.did = dp1.cidand d2.did = dp2.cidand dp3.pid = dp2.pidand dp1.pid = dp3.cidand d2.value = ‘77’

XML Data Storage - Relational DB 41

Query Processing - iNode

Q1: /SigmodRecord/issue/articles/article[endPage=77]/authors

select e1.valuefrom path p1, path p2, element e1, element e2where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'

and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'and e1.pathid = p1.pathidand e2.pathid = p2.pathidand mod(e1.nodeid, 1) = mod(e2.nodeid, 1)and floor(((e1.nodeid – 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)and e2.value = '77'

XML Data Storage - Relational DB 42

Discussion• Edge

– easy to maintain because it uses a singe table schema– Since it only maintains edges individually, it needs a

large number of equijoins to check the edge-connections.

• XRel– uses a table to store all simple path expressions. This

improves the query performance and performs regular path expressions easily.

– it can identify the containment relationship by using θ–joins. However, θ–join is more costly than an equijoin.

XML Data Storage - Relational DB 43

Discussion

• XParent– uses a table to maintain the parent-child relationship

instead of using the concept of region. Therefore, it can use equijoins to test the relationship and the performance can be increased.

– For some complex queries that require checking the ancestor-descendent relationship, XParent requires a large number of equijoins to check for the relationship.

– Although it can use another table, called Ancestor, to keep the ancestor-descendent relationship, it requires more storage space.

XML Data Storage - Relational DB 44

End of Lecture

Recommended