XML: Data Driving Business? Laks V.S.Lakshmanan, IIT Bombay and Concordia University


Page 1: XML: Data Driving Business?

XML: Data Driving Business?

Laks V.S.Lakshmanan,

IIT Bombay and Concordia University

Page 2: XML: Data Driving Business?

XML : Data Model

• What is an XML Document?
– Linearization of a tree structure
– Every node of the tree can have several character strings associated
– Info content of the document is the tree structure together with the character strings

Is XML just a syntax for data interchange and serialization?
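A minimal Python sketch of the tree view described above, using the standard `xml.etree.ElementTree` module. The sample document and the helper name `flatten` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Illustrative document (content assumed for this sketch).
DOC = (
    "<book><title>Data on the Web</title>"
    "<author>Abiteboul</author><author>Buneman</author></book>"
)

def flatten(node, depth=0):
    """Return (depth, tag, text) triples in document order."""
    rows = [(depth, node.tag, (node.text or "").strip())]
    for child in node:
        rows.extend(flatten(child, depth + 1))
    return rows

root = ET.fromstring(DOC)                        # linear text -> tree
rows = flatten(root)                             # tree structure plus strings
linear = ET.tostring(root, encoding="unicode")   # tree -> linear text again
```

Parsing and serializing are inverse views of the same information: the tree shape together with the character strings attached to its nodes.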

Page 3: XML: Data Driving Business?

XML: Data Model

Types of nodes

Element Eg. <p a1="A1" . . . an="An">c1 . . . cm</p>

Document Eg. <!DOCTYPE name [markedupdeclarations]>

Processing instruction Eg. <?xml version="1.0"?>

Comment Eg. <!--This is a comment-->

Atomic data Eg. <Data>

Page 4: XML: Data Driving Business?

What is a DTD?

• Document Type Definition(DTD) serves as grammar

• A document type definition specifies:

– the elements that are permissible in a document of this type

– for each element, the possible attributes, their range of values and defaults

– for each element, the structure of its contents, including:

• which element can occur and in what order

• whether text characters can occur

Page 5: XML: Data Driving Business?

Example of a DTD

Eg:

<!DOCTYPE Bookslist [
<!ELEMENT Bookslist (book)*>
<!ELEMENT book (title, author*, publisher)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
]>
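The content models of this DTD can be checked with a few lines of Python. This is only a sketch, not a real DTD validator: the standard library does not validate against DTDs, so each content model is hand-translated into a regular expression over child tag names. The table `CONTENT_MODELS` and the function `conforms` are names invented for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the Bookslist DTD, hand-written as regular expressions
# over space-terminated child tag names (an encoding assumed for this sketch).
CONTENT_MODELS = {
    "Bookslist": r"(book )*",
    "book": r"title (author )*publisher ",
    "title": r"",       # #PCDATA: no child elements
    "author": r"",
    "publisher": r"",
}

def conforms(elem):
    """True if elem and all its descendants match their content models."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:                      # element not declared in the DTD
        return False
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in elem)
```

For example, a `book` whose `publisher` precedes its `title` is rejected, because the order of children matters in a DTD content model.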

Page 6: XML: Data Driving Business?

XML and DTD

• Well formed documents
– Tags should be nested properly and attributes should be unique.

• Valid documents
– Well formed documents that conform to a Document Type Definition (DTD)

• DTDs are used to
– Constrain structure

– Declare entities

– Provide some default values for attributes
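Well-formedness can be tested without any DTD. A minimal sketch in Python, relying on the fact that the standard `xml.etree.ElementTree` parser raises `ParseError` on input that is not well formed; the helper name `is_well_formed` is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if text parses as XML (proper nesting, unique attributes)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Validity is a stronger property: a document can pass this check and still violate the structure a DTD prescribes.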

Page 7: XML: Data Driving Business?

DTD Limitations

• too document oriented
• too simple and too complicated at the same time
• too limited to represent complex structures
• IDREFs are not typed
• No notion of inheritance/sub-typing
• too many ways to represent the same thing
• names are global, not local

Page 8: XML: Data Driving Business?

DTD vs. Database Schema

• Order is of significance in DTD and not in DB
• DTD does not provide for data types
• DTD cannot specify keys

Page 9: XML: Data Driving Business?

XMLSchema

• Why XMLSchema?
– Based on XML syntax
– Can be parsed and manipulated like any XML document
– Supports variety of data types
– Allows extension of vocabularies and inheritance from elements
– Provides namespace integration
– Provides logical grouping of attributes

Page 10: XML: Data Driving Business?

XMLSchema: An example

<datatype name="PriceType">
<basetype name="decimal"/>
<minExclusive>0.00</minExclusive>
<scale>2</scale>
</datatype>

<element name="price" type="PriceType"></element>

<element name='Person'> ... </element>

<element name='Employee'>

<refines name='Person'/> ...

</element>

Page 11: XML: Data Driving Business?

XMLSchema vs. DTD

                        DTD             XMLSchema
Syntax                  Specialized     Same as XML
Compactness             Compact         Verbose
Data types              Strings         Variety of types
Data model              Closed          Open
Namespace integration   Primitive       Full fledged
Attribute grouping      Not supported   Supported

Page 12: XML: Data Driving Business?

XML Data

• Superset of XMLSchema

• Can express database relationships too.

• Eg:

<elementType id="booktable">
<element id="titleID" type="#title"/>
<element type="#author"/>
<element type="#pages"/>
<key id="bookkey"> <keyPart href="#titleID"/> </key>
</elementType>

Page 13: XML: Data Driving Business?

Semistructured data

• Data that is neither raw nor very strictly typed like in databases

• Examples of semistructured data
– HTML file with one entry per restaurant that provides info on prices, addresses, styles
– BibTeX files
– Genome and scientific databases
– Online documentation

Page 14: XML: Data Driving Business?

Semistructured data: Main aspects

• Structure
– Irregular
– Implicit
– Partial

• Schema
– Very large
– Rapidly evolving
– Distinction between data and schema is blurred

Page 15: XML: Data Driving Business?

Semistructured data: Data model

• Object Exchange Model (OEM)
– Lightweight and flexible
– Data representation

• As a graph with objects as vertices and labels on edges

• Each object has a unique object identifier

• Some objects are atomic, e.g., integer, real,…

• Complex objects have value as set of object references
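The OEM description above maps onto a small data structure. The following Python sketch is one possible encoding, not the representation used by any particular system: the class name `OEMObject`, the list-of-edges layout and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Tuple, Union

_oids = count(1)  # source of unique object identifiers

@dataclass
class OEMObject:
    # Atomic value, or a list of (label, OEMObject) edges for a complex object.
    value: Union[int, float, str, List[Tuple[str, "OEMObject"]]]
    oid: int = field(default_factory=lambda: next(_oids))

    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)

    def children(self, label: str) -> List["OEMObject"]:
        """Objects reached from this one over edges carrying the given label."""
        if self.is_atomic():
            return []
        return [obj for lbl, obj in self.value if lbl == label]

# A tiny graph: biblio -book-> book -title-> "...", book -year-> 1999
title = OEMObject("Data on the Web")
year = OEMObject(1999)
book = OEMObject([("title", title), ("year", year)])
biblio = OEMObject([("book", book)])
```

Labels live on the edges rather than on the objects, which is why the same object could be reached under different labels from different parents.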

Page 16: XML: Data Driving Business?

OEM: An example

Page 17: XML: Data Driving Business?

Semistructured data: Query Languages

• Lorel
– Based on OQL
– Eg.,

• Select author:X

from biblio.book.author X

• Computes the set of book authors

• Forms a new node and connects it with edges labelled author to nodes resulting from evaluation of the path expression
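The evaluation of such a path expression can be sketched in Python. This is not Lorel itself: the nested-list encoding of the data (a complex object is a list of `(label, value)` pairs), the helper names `follow` and `eval_path`, and the sample bibliography are assumptions of this sketch.

```python
def follow(objects, label):
    """One step of a path: all children reached over edges named label."""
    out = []
    for obj in objects:
        if isinstance(obj, list):              # only complex objects have edges
            out.extend(child for lbl, child in obj if lbl == label)
    return out

def eval_path(root, path):
    """Evaluate a dotted path such as 'biblio.book.author' from root."""
    current = [root]
    for label in path.split("."):
        current = follow(current, label)
    return current

# Illustrative data only.
db = [("biblio", [
    ("book", [("title", "Data on the Web"),
              ("author", "Abiteboul"), ("author", "Buneman")]),
    ("book", [("title", "XML Query"), ("author", "Mary")]),
])]

authors = eval_path(db, "biblio.book.author")
# "select author:X" builds a new node with one author edge per result.
answer = [("author", a) for a in authors]
```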

Page 18: XML: Data Driving Business?

Lorel: Salient features

• Coercion
• Forces comparison operators to handle comparisons between objects of different types, such as between string and integer

• Eg.

Select row:X

from biblio.paper X

where X.year=1998

Comment:

==>Year could have been string or integer
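A rough Python sketch of a coercing equality test. Lorel's actual coercion rules are richer than this; the function `coerce_eq` and its fallback order (numeric first, then string) are assumptions made to illustrate the idea.

```python
def coerce_eq(left, right):
    """Equality that tolerates a string on one side and a number on the other."""
    if type(left) is type(right):
        return left == right
    try:
        return float(left) == float(right)    # try a numeric comparison first
    except (TypeError, ValueError):
        return str(left) == str(right)        # otherwise compare as strings
```

With such an operator, `X.year = 1998` matches a paper whether its year was stored as the integer 1998 or as the string "1998".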

Page 19: XML: Data Driving Business?

Lorel: Salient Features

• Path expressions

• Data model allows arbitrary nesting

• Queries should hence be able to probe arbitrary depth

• Provided by path expressions

• Eg.

select title:t

from chapter(.section)* s, s.title t

where t like "*XML*"
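The path `chapter(.section)*` reaches the chapter itself and every section nested below it at any depth. A Python sketch of that traversal over an XML tree follows; the helper names, the use of `fnmatch` for the `like` pattern and the sample chapter are assumptions of this sketch.

```python
import fnmatch
import xml.etree.ElementTree as ET

def section_closure(chapter):
    """chapter(.section)*: the chapter plus sections nested at any depth."""
    found = [chapter]
    for sec in chapter.findall("section"):
        found.extend(section_closure(sec))
    return found

def titles_like(root, pattern):
    """Titles t of every s in chapter(.section)* with t like pattern."""
    out = []
    for chapter in root.iter("chapter"):
        for s in section_closure(chapter):
            for t in s.findall("title"):
                if fnmatch.fnmatchcase(t.text or "", pattern):
                    out.append(t.text)
    return out

# Illustrative chapter with three levels of nested sections.
CHAPTER = ET.fromstring(
    "<chapter><title>XML basics</title>"
    "<section><title>Why XML matters</title>"
    "<section><title>History</title>"
    "<section><title>SGML to XML</title></section>"
    "</section></section></chapter>"
)
matches = titles_like(CHAPTER, "*XML*")
```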

Page 20: XML: Data Driving Business?

UnQL

• Based on Edge labeled Graph Model
• Coercion not supported
• More precise knowledge of data needed
• Pattern Usage
– Eg.

Select title: X

where {biblio: {paper: {title: X, year:Y}}}

in db, Y>1998

Page 21: XML: Data Driving Business?

UnQL

• Path variables
– Can use path too as data
– Eg.

Select @P

from db1 @P.X

where matches(".*(U|u)biquitin.*",X)

==>To determine where string “ubiquitin” appears in db1

Page 22: XML: Data Driving Business?

Semistructured vs. XML

• Both are schema-less, self-describing
• XML is ordered and semistructured data is not
• XML can mix text and elements
• XML has lots of other stuff: entities, processing instructions, comments

Page 23: XML: Data Driving Business?

Requirements of an XML Query Language

• XML Output
• Server-side processing
• Query operations
– Selection, Extraction, Reduction, Restructuring, Combination
• No schema required
• Exploit available schema
• Preserve order and association
• Programmatic Manipulation

Page 24: XML: Data Driving Business?

Requirements of an XML Query Language

• XML representation
• Mutual embedding with XML
• XLink and XPointer cognizant
• Support for new data types
• Suitable for metadata

Page 25: XML: Data Driving Business?

XML Query Languages

• XQL

• XML-QL

• Quilt

Page 26: XML: Data Driving Business?

XQL

• Simple expressions
– //product[@maker='BSA'] : All products with attribute maker 'BSA'
• Filters
– author/address[@type='email'] : Address nodes with attribute type as email
• Subscripts
– section[1,3 to 5] : Nodes with position 1,3,4,5
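The first two expressions have close counterparts in the limited XPath subset of Python's `xml.etree.ElementTree`, which gives a quick way to try them. This is a sketch under assumptions: the sample catalog is invented, the subscript list is emulated with ordinary indexing because ElementTree has no such syntax, and the positions are read as 1-based, as on the slide.

```python
import xml.etree.ElementTree as ET

# Illustrative data only.
DOC = ET.fromstring("""
<catalog>
  <product maker="BSA"><name>Bantam</name></product>
  <product maker="Triumph"><name>Bonneville</name></product>
  <author>
    <address type="email">a@example.org</address>
    <address type="postal">1 Main St</address>
  </author>
  <section>s1</section><section>s2</section><section>s3</section>
  <section>s4</section><section>s5</section>
</catalog>
""")

# //product[@maker='BSA']
bsa = DOC.findall(".//product[@maker='BSA']")

# author/address[@type='email']
emails = DOC.findall("author/address[@type='email']")

# section[1,3 to 5], emulated with 1-based positions
sections = DOC.findall("section")
picked = [sections[i - 1] for i in (1, 3, 4, 5)]
```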

Page 27: XML: Data Driving Business?

XQL

• Supports boolean and set operators
– q1 and q2
– q1 union q2
• Grouping
– //invoice{q1} : Groups the results of q1 by invoice
• Sequence
– a before b
• Others : node(), text(), ...

Page 28: XML: Data Driving Business?

XQL: Limitations

• Flattening
– Not possible, as the results of patterns and filters are not modeled by an intermediate relation
• Restructuring
– As flattening is not permitted, results cannot be restructured
• Tag variables
– Not supported
• Sorting

Page 29: XML: Data Driving Business?

XML Query Languages

• XQL

• XML-QL

• Quilt

Page 30: XML: Data Driving Business?

XML-QL

• Simple examples

WHERE <book>
<publisher> <name>Addison-Wesley</name> </publisher>
<title> $t</title>
<author> $a</author>
</book> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
<author>$a</author>
<title>$t</title>
</result>
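To make the pattern-match-then-construct reading of this query concrete, here is a rough Python equivalent. It is only a sketch: the bibliography is invented and held in a string rather than fetched from the URL in the query, and `run_query` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative bibliography (contents assumed for this sketch).
BIB = """<bib>
  <book><publisher><name>Addison-Wesley</name></publisher>
    <title>Book One</title><author>Author A</author><author>Author B</author></book>
  <book><publisher><name>Other Press</name></publisher>
    <title>Book Two</title><author>Author C</author></book>
</bib>"""

def run_query(bib):
    """WHERE part binds ($t, $a) pairs; CONSTRUCT builds one result per pair."""
    results = []
    for book in bib.iter("book"):
        names = [n.text for n in book.findall("publisher/name")]
        if "Addison-Wesley" not in names:
            continue
        for t in book.findall("title"):
            for a in book.findall("author"):
                result = ET.Element("result")
                ET.SubElement(result, "author").text = a.text
                ET.SubElement(result, "title").text = t.text
                results.append(result)
    return results

results = run_query(ET.fromstring(BIB))
```

Each binding of the variables yields its own `result` element, so a book with two authors contributes two results.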

Page 31: XML: Data Driving Business?

XML-QL

• Grouping

WHERE <book> $p </> IN "www.a.b.c/bib.xml",
<title> $t </>,
<publisher> <name>Addison-Wesley</> </publisher> IN $p
CONSTRUCT <result>
<title> $t </>
WHERE <author> $a </> IN $p
CONSTRUCT <author> $a</>
</>

==> Groups by title.

Page 32: XML: Data Driving Business?

XML-QL

• Tag variables

WHERE <$p>
<title> $t </title>
<year>1995 </>
<$e> Smith </>
</> IN "www.a.b.c/bib.xml",
$e IN {author, editor}
CONSTRUCT <$p>
<title> $t </title>
<$e> Smith </>
</>

==> List of books where Smith could be either author or editor

Page 33: XML: Data Driving Business?

XML-QL

• Regular Path Expressions

WHERE <part*>
<name>$r</>
<brand>Ford</>
</> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>$r</>

==> Gets list of names of parts irrespective of the nesting of parts in the document.

Page 34: XML: Data Driving Business?

XML-QL

• Skolem functions

WHERE <$>
<author> <firstname> $fn </> <lastname> $ln </> </>
<title> $t </>
</> IN "www.a.b.c/bib.xml"
CONSTRUCT <person ID=PersonID($fn, $ln)>
<firstname> $fn </>
<lastname> $ln </>
<publicationtitle> $t </>
</>

==> PersonID is a Skolem function

Generates a new id for each distinct value of ($fn,$ln); otherwise appends to the existing node.

Page 35: XML: Data Driving Business?

XML-QL

• Allows integrating data from multiple sources

• Can query order as well

• Provides for embedding query within data

• Allows function definitions

• Is relationally complete

Page 36: XML: Data Driving Business?

XML-QL

• Is everything fine?
– Pattern specifications are too verbose
– Result of the WHERE clause is a relation composed of scalar values
• So cannot preserve information about hierarchy and sequence
• Can hence not handle hierarchy and sequence related queries

Page 37: XML: Data Driving Business?

XML Query Languages

• XQL

• XML-QL

• Quilt

Page 38: XML: Data Driving Business?

Quilt

• Combines strengths of XML-QL and XQL

• Derives ability to navigate and select nodes based on sequence from XQL

• Binding of variables done like in XML-QL

Page 39: XML: Data Driving Business?

Quilt

• An example

FOR $b in //book

WHERE exists($b/title) AND

NOT exists($b/author)

RETURN $b/title

==> Lists the titles of those books which do not have author info
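The FOR / WHERE / RETURN steps of this query can be mirrored line by line in Python. This is a sketch with invented sample data, not an implementation of Quilt; `titles_without_author` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative data only.
DOC = (
    "<bib>"
    "<book><title>Anonymous Tales</title></book>"
    "<book><title>Data on the Web</title><author>Abiteboul</author></book>"
    "<book><author>Nobody</author></book>"
    "</bib>"
)

def titles_without_author(root):
    out = []
    for b in root.iter("book"):                        # FOR $b IN //book
        has_title = b.find("title") is not None        # exists($b/title)
        has_author = b.find("author") is not None      # exists($b/author)
        if has_title and not has_author:               # WHERE ... AND NOT ...
            out.extend(b.findall("title"))             # RETURN $b/title
    return out

titles = [t.text for t in titles_without_author(ET.fromstring(DOC))]
```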

Page 40: XML: Data Driving Business?

Quilt

Flow of data in a Quilt expression:

XML Input -> FOR/LET -> Tuples of bound variables -> WHERE -> Tuples selected -> RETURN -> XML Output

Page 41: XML: Data Driving Business?

Quilt: Filtering Documents

• Need to preserve the relationships among selected elements

• Eg: [Figure: a tree with nodes labelled A, B and C, and the result of applying filter = A|B]

Page 42: XML: Data Driving Business?

Quilt

• Can perform Sorting

• Aggregation provided

• Allows recursive functions

Page 43: XML: Data Driving Business?

Quilt: The real power of it

• Sample document

<section>

<section.title>Procedure</section.title>
The patient was taken to the operating room where she was placed in a supine position and
<Anesthesia>induced under general anesthesia. </Anesthesia>
<Prep> <action>Foley catheter was placed to decompress the bladder</action> and the abdomen was then prepped and draped in sterile fashion. </Prep>
<Incision> A curvilinear incision was made <Geography>in the midline immediately infraumbilical</Geography> and the subcutaneous tissue was divided <Instrument>using electrocautery.</Instrument> </Incision>
The fascia was identified and <action>#2 0 Maxon stay sutures were placed on each side of the midline.</action>
<Incision> The fascia was divided using <Instrument>electrocautery</Instrument> and the peritoneum was entered. </Incision>
<Observation>The small bowel was identified</Observation> and <action> the <Instrument>Hasson trocar</Instrument> </action>

:

</section>

Page 44: XML: Data Driving Business?

Quilt: The real power of it

• In each section with title "Procedure", what Instruments were used in the second Incision?

FOR $s IN //section[section.title="Procedure"]

RETURN ($s//Incision)[2]/Instrument

• In each section with title "Procedure", what are the first two instruments to be used?

FOR $s IN //section[section.title="Procedure"]

RETURN ($s//Instrument)[1-2]

Page 45: XML: Data Driving Business?

Quilt: The real power of it

• In the first procedure, what happened between the first incision and the second incision?

FOR $proc IN //section[section.title="Procedure"][1],

$bet IN $proc//((* AFTER ($proc//Incision)[1]) BEFORE ($proc//Incision)[2])

RETURN $bet

Page 46: XML: Data Driving Business?

XML Storage

• Text files
– Simple
– Would require special purpose query processor

• Relational databases
– Ternary relations [Florescu et al]
– Inlining methods [Shanmugasundaram et al]
– STORED [Mary Fernandez]

Page 47: XML: Data Driving Business?

XML Storage

• Object Oriented databases [Sophie Cluet et al]

• Native storage

Page 48: XML: Data Driving Business?

XML Storage

• Using Ternary relations

• Edge labels are maintained in a table with the object ids that the edge connects

• Values of leaf nodes are stored using yet another table

Page 49: XML: Data Driving Business?

Store XML in Ternary Relation

[Figure: data graph. &o1 has a paper edge to &o2; &o2 has title, author, author and year edges to &o3 (“The Calculus”), &o4 (“…”), &o5 (“…”) and &o6 (“1986”).]

Ref

Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6

Val

Node    Value
&o3     The Calculus
&o4     …
&o5     …
&o6     1986
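The two tables can be produced mechanically from a document. The following Python sketch shreds a small XML tree into `Ref` and `Val` rows; the function name `shred`, the `&oN` numbering scheme and the author names are assumptions made for illustration (the slide elides the author values).

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(root_elem):
    """Return (ref, val): edge rows (source, label, dest) and leaf value rows."""
    ids = count(1)
    ref, val = [], []

    def visit(elem, parent_oid):
        oid = f"&o{next(ids)}"
        ref.append((parent_oid, elem.tag, oid))       # one row per edge
        if len(elem) == 0:                            # leaf: store its text
            val.append((oid, (elem.text or "").strip()))
        for child in elem:
            visit(child, oid)

    root_oid = f"&o{next(ids)}"                       # object above the root edge
    visit(root_elem, root_oid)
    return ref, val

# Hypothetical author names stand in for the elided values.
PAPER = ET.fromstring(
    "<paper><title>The Calculus</title>"
    "<author>Author A</author><author>Author B</author>"
    "<year>1986</year></paper>"
)
ref, val = shred(PAPER)
```

A path query then becomes a chain of self-joins on the edge table, which is the main cost of this generic mapping.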

Page 50: XML: Data Driving Business?

XML Storage

• DTDs converted into DTD graph

• Inlining methods
– Basic inlining
– Shared inlining
– Hybrid inlining

Page 51: XML: Data Driving Business?

Corresponding DTD graph

Page 52: XML: Data Driving Business?

Element graph for Editor Element

Page 53: XML: Data Driving Business?

XML Storage

• Basic inlining

• For each node in the DTD graph a relation is created

• Creates a large number of relations

Page 54: XML: Data Driving Business?

Relations created using Basic inlining

Page 55: XML: Data Driving Business?

XML Storage

• Shared inlining

• Create relations for elements with in-degree > 1

• An element node is represented in exactly one relation

• For mutually recursive elements, make one of them a separate relation

Page 56: XML: Data Driving Business?

Relations created using shared inlining

Page 57: XML: Data Driving Business?

XML Storage

• Hybrid inlining

• Inlines elements with in-degree > 1 that are not recursive or reached through a “*” node

Page 58: XML: Data Driving Business?

Relations created using hybrid inlining

Page 59: XML: Data Driving Business?

XML Storage

• STORED

• Uses a query language to specify mappings.

• Mappings are generated using mining algorithms

• Nonconforming data is stored in overflow graphs.

Page 60: XML: Data Driving Business?

XML Storage

• STORED (contd.)

• Given a data instance D, a STORED query is generated automatically.

FROM Audit.taxpayer:$X{name:$N, phone:$P1,

optional{phone:$P2}}

STORE R1($X,$N,$P1,$P2)

• Given relational mappings, generate explicit overflow mappings so that the query is lossless.

Page 61: XML: Data Driving Business?

XML Storage

• Object oriented method

• Using DTD a hierarchy of the elements is obtained

• Each element is now modeled as a class

• For handling “*” of DTD a list of objects is maintained

• To handle union types(Eg., phone|email) new class can be introduced

Page 62: XML: Data Driving Business?

XML Storage

• eXcelon way

– eXcelon XML Data Engine is a high performance XML data management engine

– Based on ObjectStore DBMS

– When XML data gets parsed in eXcelon, it is represented in XMLStore as discrete XML elements.

– The hierarchical structure of XML is therefore preserved in its persistent representation

Page 63: XML: Data Driving Business?

XML Algebra

Why yet another algebra?

– Structure of data
• Deeply structured
• Exact structure not specific

– Recursion
• Structurally recursive

Proposed Algebra: Too much stress on type conformance

Page 64: XML: Data Driving Business?

XML Algebra

• Sample Data

<bib>

<book>

<title>Data on the Web</title>

<year>1999</year>

<author>Abiteboul</author>

<author>Buneman</author>

</book>

<book>

<title> XML Query</title>

<year>2000</year>

<author>Mary</author>

</book>

</bib>

Page 65: XML: Data Driving Business?

XML Algebra

type Bib = bib [ Book{0,*} ]

type Book = book [
  title [ String ],
  year [ Integer ],
  author [ String ]{1,*}
]

let bib0 : Bib = bib [
  book [
    title [ "Data on the Web" ], year [ 1999 ],
    author [ "Abiteboul" ], author [ "Buneman" ]
  ],
  book [
    title [ "XML Query" ], year [ 2000 ],
    author [ "Mary" ]
  ]
]

Page 66: XML: Data Driving Business?

XML Algebra

• Projection

Eg: project book ( children (bib0) )

– Allows a more convenient notation as well (similar to XPath notation)
– Eg. bib0/book/author

==> author [ "Abiteboul" ],
    author [ "Buneman" ],
    author [ "Mary" ]

: author [ String ] {0,*}

Page 67: XML: Data Driving Business?

XML Algebra

• Selection

Eg: for b <- bib0/book in
    where value(b/year) < 2000 then b

==> book [
      title [ "Data on the Web" ],
      year [ 1999 ],
      author [ "Abiteboul" ],
      author [ "Buneman" ]
    ]

: Book{0,*}

Page 68: XML: Data Driving Business?

XML Algebra

• Join:

type Reviews =
  reviews [
    book [
      title [ String ],
      review [ String ]
    ]{0,*}
  ]

let review0 : Reviews =
  reviews [
    book [ title [ "XML Query" ],
           review [ "A fine book" ]
    ],
    book [ title [ "Data on the Web" ],
           review [ "This is great" ]
    ]
  ]

Page 69: XML: Data Driving Business?

XML Algebra

• Join

for b <- bib0/book in
for r <- review0/book in
where value(b/title) = value(r/title) then
book [ b/title, b/author, r/review ]

==> book [
      title [ "Data on the Web" ],
      author [ "Abiteboul" ], author [ "Buneman" ],
      review [ "This is great" ]
    ],

Page 70: XML: Data Driving Business?

XML Algebra

• Join

    book [
      title [ "XML Query" ],
      author [ "Mary" ],
      review [ "A fine book" ]
    ]

: book [
    title [ String ],
    author [ String ]{1,*},
    review [ String ]
  ]{0,*}
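The join is a pair of nested iterations with an equality test on the title, which a Python comprehension expresses directly. This sketch encodes books and reviews as plain dictionaries (an assumption of the sketch, not the algebra's data model) and assumes the titles are spelled identically in both sources.

```python
# bib0 and review0 re-encoded as Python data for illustration.
bib0 = [
    {"title": "Data on the Web", "author": ["Abiteboul", "Buneman"]},
    {"title": "XML Query", "author": ["Mary"]},
]
review0 = [
    {"title": "XML Query", "review": "A fine book"},
    {"title": "Data on the Web", "review": "This is great"},
]

def join_books(books, reviews):
    """for b in books, for r in reviews, where the titles are equal."""
    return [
        {"title": b["title"], "author": b["author"], "review": r["review"]}
        for b in books
        for r in reviews
        if b["title"] == r["title"]
    ]

joined = join_books(bib0, review0)
```

The result keeps the order of the outer iteration over `bib0`, which matters because order is part of the XML data model.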

Page 71: XML: Data Driving Business?

XML Algebra

• Querying Order

– Index function pairs an integer index with each element in a forest

– Eg: index(book0/author)

==> pair [ fst [ 1 ], snd [ author [ "Abiteboul" ] ] ],
    pair [ fst [ 2 ], snd [ author [ "Buneman" ] ] ],
    pair [ fst [ 3 ], snd [ author [ "Suciu" ] ] ]

: pair [ fst [ Integer ], snd [ author [ String ] ] ]{1,*}

Page 72: XML: Data Driving Business?

XML Algebra

• Aggregation

– Has five built-in aggregation functions: avg, count, max, min and sum

– Eg:

for b <- bib0/book in
where count(b/author) >= 2 then b/title

==> title [ "Data on the Web" ]

: title{0,*}
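The same aggregation can be tried in Python. In this sketch the five aggregation functions are mapped to Python built-ins and `statistics.mean`; that mapping, the dictionary `AGGREGATES` and the dictionary encoding of `bib0` are assumptions made for illustration.

```python
from statistics import mean

# bib0 re-encoded as Python data for illustration.
bib0 = [
    {"title": "Data on the Web", "year": 1999, "author": ["Abiteboul", "Buneman"]},
    {"title": "XML Query", "year": 2000, "author": ["Mary"]},
]

# The five built-in aggregation functions, mapped to Python equivalents.
AGGREGATES = {"avg": mean, "count": len, "max": max, "min": min, "sum": sum}

# for b in bib0/book where count(b/author) >= 2 then b/title
titles = [b["title"] for b in bib0 if AGGREGATES["count"](b["author"]) >= 2]

years = [b["year"] for b in bib0]
latest = AGGREGATES["max"](years)
```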

Page 73: XML: Data Driving Business?

XML Algebra

• Additional Features

– Structural Recursion
• To define documents with recursive structure, recursive types are used

– Sorting
• sort(pairs)

– Grouping
• Group(pairs)

Page 74: XML: Data Driving Business?

Kweelt

• Is a framework to query XML Data

• An implementation of Quilt

• Architecture :

Page 75: XML: Data Driving Business?

XML Indexing

[Figure: Semistructured Data. A data graph with root node 1, nodes 2 to 6 below it on edges labelled t, and nodes 7 to 13 below those on edges labelled a, b, c and d]

Page 76: XML: Data Driving Business?

XML Indexing

• Data guides (Used in Lore)

• Data guide is a concise and accurate summary of the data graph

[Figure: Data Guide of the data graph, with edge labels t, a, b, c and d]
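One way to build such a summary is a subset construction over the data graph, in the same way a nondeterministic automaton is made deterministic: every label path of the data appears exactly once in the guide, and each guide node stands for the set of data nodes reachable by that path. The Python sketch below follows that idea; the function `build_dataguide` and the small example graph are assumptions of this sketch and do not reproduce the graph in the figure.

```python
def build_dataguide(edges, root):
    """edges: node -> list of (label, child). Returns (start, guide), where
    guide maps a set of data nodes to {label: set of data nodes}."""
    start = frozenset([root])
    guide = {}
    todo = [start]
    while todo:
        state = todo.pop()
        if state in guide:                 # already expanded (handles cycles)
            continue
        by_label = {}
        for node in state:
            for label, child in edges.get(node, []):
                by_label.setdefault(label, set()).add(child)
        guide[state] = {lbl: frozenset(t) for lbl, t in by_label.items()}
        todo.extend(guide[state].values())
    return start, guide

# Hypothetical data graph: two t children that share the label a.
EDGES = {
    1: [("t", 2), ("t", 3)],
    2: [("a", 4), ("b", 5)],
    3: [("a", 6)],
}
start, guide = build_dataguide(EDGES, 1)
```

In the example the two `t` edges collapse into one guide edge, and so do the two `a` edges below them, which is what makes the guide concise.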

Page 77: XML: Data Driving Business?

XML Indexing

• T-Index

[Figure: T-Index of the data graph, with edge labels t, a, b, c and d]

Page 78: XML: Data Driving Business?

Challenges

• Storage issues
– Relational or native?

• Query optimization
– Query plan?

• Other than queries…say triggers?

• Updates to data

• Mining of XML data