XML in Relational DBcstyng/webdb.07/lectures/lesson5.pdfXML Data Storage - Relational DB 5...

XML Data Storage - Relational DB 1

XML in Relational DB

A Sample File<SigmodRecord>

<article><title>Annotated Bibliography on Data Design.</title><initPage>45</initPage><endPage>77</endPage><authors>

<author position="00">Anthony I. Wasserman</author><author position="01">Karen Botnich</author>

</authors></article>

</articles></issue></SigmodRecord>

Corresponding Document TreeRoot

208207 209 210

10481047

5232 5237Anthony I

Wasserman

Annotated Bibliographyon Data Design. 45 77

KarenBotnich

SigmodRecord

number articlesvolume

article

endPage authorsinitPagetitle

author author

@position@position

element node

text node

attribute node

Two Mapping Approaches

• Structure-mapping approach– static DTD– well-defined semantic constraints

• Model-mapping approach– dynamic DTD– ambiguous semantic constraints

Structure-mapping approach

• Design database schema based on DTD

• Different schemas for different DTDs

• Slides partially from a VLDB 2002 Tutorial

Relational DB

XML Doc

XMLQuery

A Sample DTD<!ELEMENT book (booktitle, author)<!ELEMENT booktitle (#PCDATA)><!ELEMENT author (name, address)>

<!ATTLIST author id ID #REQUIRED>

<!ELEMENT name (firstname?, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address ANY><!ELEMENT article (title, author*, contactauthor?)><!ELEMENT title (#PCDATA)><!ELEMENT contactauthor EMPTY>

<!ATTLIST contactauthor authorID IDREF IMPLIED><!ELEMENT monograph (title, author, editor)><!ELEMENT editor (monograph*)>

<!ATTLIST editor name CDATA #REQUIRED>

A DTD Graphbook

booktitle

author

monograph

contactauthor

authorID

editor

nameaddress

firstname lastname

authorid

article

12 elements

Basic Inlining• An element graph is created for each element in the DTD

graph• Relations created for:

– Root of an element graph– Set value children (under a *)– Nodes with a back pointer (recursion)

• All other descendants inlined• Every element is made into as comprehensive a relation as

possible• Set-valued attributes are handled with separate relations

and foreign keys

Basic Inlining• An element graph is constructed by

– Do a DFS traversal of DTD graph, starting at element node for which we are constructing relations

– Each node is marked as “visited” the first time it is reached and is unmarked once all its children have been traversed

– If an unmarked node in DTD graph is reach during DFS, a new node bearing the same name is created in the element graph

– A regular edge is created from the most recently created node in the element graph with the same names as the DFS parent of the current DTD node to newly created node

– If an attempt is made to traverse an already marked DTD, then a backpointer edge is added from the most recently created node in the element graph to the most recently created node in the element graph of the same name as the marked DTD node

editor

Basic Inlining

• A relation is created for root of element of graph

• All element’s descendents are inlined into that relation except– Children below a “*” node

are made into separate relations – this corresponds to creating a new relation for a set-valued child

– Each node having a backpointer edge pointing to it is made into a separate relation

Basic: Relational Schemabook (bookID: integer, book.booktitle : string, book.author.name.firstname: string, book.author.name.lastname: string,

book.author.address: string, author.authorid: string)booktitle (booktitleID: integer, booktitle: string)article (articleID: integer, article.contactauthor.authorid: string, article.title: string)article.author (article.authorID: integer, article.author.parentID: integer, article.author.name.firstname: string,

article.author.name.lastname: string, article.author.address: string, article.author.authorid: string)contactauthor (contactauthorID: integer, contactauthor.authorid: string)title (titleID: integer, title: string)monograph (monographID: integer, monograph.parentID: integer, monograph.title: string,

monograph.editor.name: string, monograph.author.name.firstname: string, monograph.author.name.lastname: string, monograph.author.address: string, monograph.author.authorid: string)

editor (editorID: integer, editor.parentID: integer, editor.name: string)editor.monograph (editor.monographID: integer, editor.monograph.parentID: integer, editor.monograph.title: string,

editor.monograph.author.name.firstname: string, editor.monograph.author.name.lastname: string, editor.monograph.author.address: string, editor.monograph.author.authorid: string)

author (authorID: integer, author.name.firstname: string, author.name.lastname: string, author.address: string,aauthor.authorid: string)

name (nameID: integer, name.firstname: string, name.lastname: string)firstname (firstnameID: integer, firstname: string)lastname (lastnameID: integer, lastname: string)address (addressID: integer, address: string)

Fragmentation Problem<!ELEMENT author (name, address)><!ATTLIST author id ID #REQUIRED><!ELEMENT name (firstname?, lastname)>

<!ELEMENT firstname (#PCDATA)>

<!ELEMENT lastname (#PCDATA)>

<!ELEMENT address ANY>

author (authorID: integer, id: string)

name (nameID: integer, authorID: integer)

firstname (firstnameID: integer, nameID: integer, value: string)

lastname (lastnameID: integer, nameID: integer, value: string)

address (addressID: integer, authorID: integer, value: string)

Each element is a table

Problems in Basic In-lining• Creating a large amount of relations

– Separate relational schema for each element as root (minor)

– Unrolling recursive strongly connected components (major)

• Consuming great storage space when where are many null able(optional) attributes

• Too many joins– E.g. author becomes scattered (i.e., it appears

once for each possible instantiation from a root). Query: list all the books authored by Louis

Complexity of DTDs• DTD element specifications can be of

arbitrary complexity• <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)> is valid!• Fortunately, can simplify DTD for

translation purposes:– Key observations: not necessary to regenerate

DTD from relational schema– XML query languages do not query over

complex structures of elements

Simplification of DTDs

(e1, e2)* e1*, e2*(e1, e2)? e1?, e2?(e1|e2) e1?, e2?

e1** e1*e1*? e1*e1?* e1*e1?? e1?

..., a*, ..., a*, ... a*, ...

..., a*, ..., a?, ... a*, ...

..., a?, ..., a*, ... a*, ...

..., a?, ..., a?, ... a*, ……, ...a, …, a, … a*, …

Flattening Transformations Simplification Transformations

Grouping Transformations

<!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>

<!ELEMENT a (b*, c?, e*, f*)>

Shared Inlining• Idea

– Inline as many sub-elements as possible– However, to avoid scattering, do not inline an

element if it is shared– I.e. do not inline only if it is a shared, recursive or set

sub-element.

• Approach– Necessary and Sufficient Condition for shared/

recursive element: e.g. authorIn-degree >= 2 in DTD graph

Shared Inlining• Technique:

– Node with in-degree 1 in the DTD graph: inline

• Special cases: * and recursion– Node with in-degree 0: create a separate

relation, because it cannot be inlined– Node with in-degree greater than 1: create

a separate relation, because it is shared

Issues with Sharing Elements

• Parent of elements not fixed at schema level– Isroot: boolean variable

• Need to store type and ids of parents (or if there are no parents)– parentCODE field (type of parent)– parentID field (id of parent)– Not foreign key relationship

Shared: Relational Schemabook (bookID: integer, book.booktitle.isroot: boolean, book.booktitle : string)article (articleID: integer, article.contactauthor.isroot: boolean,

article.contactauthor.authorid: string)monograph (monographID: integer, monograph.parentID: integer,

monograph.parentCODE: integer, monograph.editor.isroot: boolean,monograph.editor.name: string)

title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)author (authorID: integer, author.parentID: integer, author.parentCODE: integer,

author.name.isroot: boolean, author.name.firstname.isroot: :boolean,author.name.firstname: string, author.name.lastname.isroot: boolean,author.name.lastname: string, author.address.isroot: boolean,author.address: string, author.authorid: string)

Shared Inlining

• Advantages– Reduces number of joins for queries like “get

the first and last names of an author”– Efficient for queries such as “list all authors

with name Jack”• Problems

– Sharing whenever possible implies extra joins for path expressions. E.g. “Article with a given title name”

Converting XML Queries to SQL

• Converting queries with simple path expressions– Relation(s) corresponding

to start of root path expression(s) are identified and added to from clause of SQL query

– If necessary, path expressions are translated to joins among relations

Converting Relational Results to XML

• Simple Structuring– Constructing such

results from a RDBMS is natural and efficient

– Requires attaching appropriate tags for each tuple

Model-mapping approach

• Fixed database schema for different XML documents without the support of DTD– Edge (Florescu and Kossmann, 1999)– XRel (YoshiKawa, M. and Amagasa, T., 2001)– XParent (H. Jiang et al., 2002)– iNode (Lau & Ng, 2002)

• Support well-formed but non-DTD XML documents

• One-table schema– Edge(Source, Ordinal, Target, Label, Flag, Value)

• Each row corresponds to one edge in data graph

• Require table joins for edge connections• Cannot determine the parent-child

relationship directly

The Document TreeRoot

208207 209 210

10481047

5232 5237Anthony I

Wasserman

KarenBotnich

SigmodRecord

article

author author

@position@position

element node

text node

attribute node

EdgeSrc Ord Tgt Label Flag Value0 1 1 SigmodRecord ref1 1 2 issue ref2 1 3 volume val 112 2 4 number val 12 3 5 articles ref5 1 6 article ref6 1 7 title val Annotated ….6 2 8 initPage val 456 3 9 endPage val 776 4 10 authors ref10 1 11 author val Anthony I …11 1 12 @position val 0010 2 13 author val Karen Botnich13 1 14 @position val 01

XParent

• XParent– uses LabelPath and DataPath tables to maintain

the structural information of the XML documents.

– The DataPath table is also used to keep the parent-child relationship instead of using region as in XRel.

XParent

• Four-table schema– LabelPath(PathID, Len, Path)– DataPath(Pid, Cid)– Element(PathID, Ordinal, Did)– Data(PathID, Did, Ordinal, Value)

• Determine the parent-child relationship by using equijoin

The Document TreeRoot

208207 209 210

10481047

5232 5237Anthony I

Wasserman

KarenBotnich

SigmodRecord

article

author author

@position@position

element node

text node

attribute node

XParentPID PathLen PathExp1 1 ./SigmodRecord2 2 ./SigmodRecord./issue3 3 ./SigmodRecord./issue./volume4 3 ./SigmodRecord./issue./number5 3 ./SigmodRecord./issue./articles6 4 ./SigmodRecord./issue./articles./article7 5 ./SigmodRecord./issue./articles./article./title8 5 ./SigmodRecord./issue./articles./article./initPage9 5 ./SigmodRecord./issue./articles./article./endPage10 5 ./SigmodRecord./issue./articles./article./authors11 6 ./SigmodRecord./issue./articles./article./authors./author12 7 ./SigmodRecord./issue./articles./article./authors./author./@position PID CID

1 22 32 42 55 65 156 76 86 96 1010 1110 1210 1310 14

LabelPath Table

DataPath Table

XParentPathID Ordinal Did1 1 12 1 23 1 34 1 45 1 56 1 67 1 78 1 89 1 910 1 1011 1 1112 1 1211 2 1312 1 14 PathID Did Ordinal Value

3 3 1 114 4 1 17 7 1 Annotated Bibliography on Data Design.8 8 1 459 9 1 7712 12 1 0011 11 1 Anthony I. Wasserman12 14 1 0111 13 2 Karen Botnich

Element Table

Data Table

• Model-mapping approach• Node-oriented approach• Based on a node numbering scheme• Using the virtual node concept to calculate

the id for each node (Lee et al., 1996)

Key Features

1. Does not need to concatenate the edges to form a simple path for query processing

2. Reduces the database size by using a table to store all simple path expressions

3. The parent-child relationship is embedded in the NodeID

4. Retrieve the parent-child relationship by calculation instead of using equijoins orθ–joins

INode Numbering Scheme (1/4)

• Given a node n, denote the path of n asa1, a2, a3, …, akwhere k is the depth of node n

and 1 ≤ ai ≤ nc

1198 10 12 13

1,1 1,2 1,3

1( )⎥⎦⎥

⎢⎣⎢ +

−= 12)(

kiiParent

real node virtual node

• Based on the Parent(i) function, we have a path function as follow:

⎪⎪⎩

⎪⎪⎨

>++−

∑ ∑

−−

2,)( 2

kannan

kapathf

Where nc = maximum number of child nodes for a node

k = path length of the node

ai = value in the path at the position i

• With the path function, we embed the document-id in the node-id with the following function, where i is the number of decimal places for document-id

docidpathfnodeid10

• Given the node-ids of i and j, where i is the ancestor of j, and the depth difference between two nodes as d, we can confirm the ancestor-descendent relationship with the following function

⎣ ⎦jjdc

i nodeidnodeidn

nnodeidnodeid −+

⎥⎥⎥⎥

⎢⎢⎢⎢

+−−

=∑−

Table Mapping

• Three-table schema– Path(PathID, PathExp)– Element(NodeID, PathID, Ordinal, Value)– Attribute(NodeID, PathID, Value)

• Determine the parent-child relationship by calculation

Querying - EdgeQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author

select authors.valuefrom edge sigmodrecord, edge issue, edge articles, edge article,

edge authors, edge author, edge endpagewhere sigmodreacord.label = ‘SigmodRecord’

and issue.label = ‘issue’and articles.label = ‘articles’and article.label = ‘article’and authors.label = ‘authors’and author.label = ‘author’and endpage.label = ‘endPage’and sigmodrecord.source = ‘0’and sigmodrecord.target = issue.sourceand issue.target = articles.sourceand articles.target = article.sourceand article.target = authors.sourceand endpages.source = authors.sourceand author.source = authors.targetand endpage.value = ‘77’

Querying - XParentQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author

select d1.valuefrom labelpath lp1, labelpath lp2, datapath dp1, datapath dp2, data d1, data d2

labelpath lp3, datapath dp3where lp1.path = ‘./SigmodRecord./issue./articles./article./authors/author'

and lp2.path = ‘./SigmodRecord./issue./articles./article./endPage’lp3.path = ‘./SigmodRecord./issue./articles./article./authors'and d1.pathid = lp1.pathidand d2.pathid = lp2.pathidand d1.did = dp1.cidand d2.did = dp2.cidand dp3.pid = dp2.pidand dp1.pid = dp3.cidand d2.value = ‘77’

Query Processing - iNode

Q1: /SigmodRecord/issue/articles/article[endPage=77]/authors

select e1.valuefrom path p1, path p2, element e1, element e2where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'

and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'and e1.pathid = p1.pathidand e2.pathid = p2.pathidand mod(e1.nodeid, 1) = mod(e2.nodeid, 1)and floor(((e1.nodeid – 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)and e2.value = '77'

Discussion• Edge

– easy to maintain because it uses a singe table schema– Since it only maintains edges individually, it needs a

large number of equijoins to check the edge-connections.

• XRel– uses a table to store all simple path expressions. This

improves the query performance and performs regular path expressions easily.

– it can identify the containment relationship by using θ–joins. However, θ–join is more costly than an equijoin.

Discussion

• XParent– uses a table to maintain the parent-child relationship

instead of using the concept of region. Therefore, it can use equijoins to test the relationship and the performance can be increased.

– For some complex queries that require checking the ancestor-descendent relationship, XParent requires a large number of equijoins to check for the relationship.

– Although it can use another table, called Ancestor, to keep the ancestor-descendent relationship, it requires more storage space.

End of Lecture

XML in Relational DBcstyng/webdb.07/lectures/lesson5.pdfXML Data Storage - Relational DB 5...

Documents

RDBMS and SQL Relational Algebravvtesh.co.in/teaching/dbms2019/Lecture2-Relational...Tuple Relational Calculus Domain Relational Calculus Popular Language SQL Relational Algebra Relation

Oracle WebDB Creating and Managing Components - User Guidecourses.ischool.berkeley.edu/.../webdb-components-guide.pdf · 1999-11-23 · Oracle WebDB Creating and Managing Components

Chapter 3 relational model 3.1 An Informal Look at the Relational ModelAn Informal Look at the Relational Model 3.2 The Relational AlgebraThe Relational

Chapter 6: Formal Relational Query Languages. 6.2 Chapter 6: Formal Relational Query Languages Relational Algebra Tuple Relational Calculus Domain Relational

1 The Relational Data Model, Relational Constraints, and The Relational Algebra

Chapter 2: Relational Model - WordPress.com · Chapter 2: Relational Model Structure of Relational Databases Fundamental Relational-Algebra-Operations Additional Relational-Algebra-Operations

Universitas Syiah Kuala - informatika.unsyiah.ac.idinformatika.unsyiah.ac.id/irvanizam/teaching/webdb/report-webdb...Universitas Syiah Kuala : Program Studi Matematika Database Perpustakaan

1 Relational Query Languages Relational Algebra (procedural) Relational Calculus (non-procedural)

Relational Model & Relational Algebra

Relational Model and Relational Algebra

Relational Algebra Lecture 2. Relational Model Basic Notions Fundamental Relational Algebra Operations Additional Relational Algebra Operations Extended

Chapter 3: Relational Model - Bucknell Universityxmeng/Course/CS379/db-relational.pdf · Chapter 3: Relational Model Structure of Relational Databases Relational Algebra Tuple Relational

Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)

3.1Database System Concepts Chapter 3: Relational Model Structure of Relational Databases Relational Algebra Tuple Relational Calculus Domain Relational

Relational Algebra and Relational Calculus

Relational Model. 2 Structure of Relational Databases Fundamental Relational-Algebra-Operations Additional Relational-Algebra-Operations Extended Relational-Algebra-Operations

Relational vs. Non-Relational

Nilesh Bansal and Nick Koudas WebDB 2007 SEARCHING THE BLOGOSPHERE Nilesh Bansal Nick Koudas University of Toronto

Relational Model dww-database system. Outline Relational Model Concepts Relational Model Constraints & Relational Database Schemas Update Operation &

Oracle Webdb