
Keys for XML

Peter Buneman, Susan Davidson, Wenfei Fan

Carmem Hara, Wang-Chiew Tan

University of Pennsylvania

Temple University

Universidade Federal do Parana, Brazil

Jonathan Mamou

Keys in DB design

Essential part of DB design
Invariant connection between the tuple and the real-world entity
Important for updates
– Guarantee that an update will affect precisely one tuple

Keys in XML

XML documents are to do (at least) double duty as databases
Examination of existing DTDs reveals a number of cases in which some element or attribute is described in comments as a “unique identifier”
Various key specifications exist: the XML standard, XML-Data, XML Schema

Components: XML vs. relational DB

<db>
  <student>
    <name> Smith </name>
    <course> Math </course>
    <grade> B </grade>
  </student>
  <student>
    <name> Jones </name>
    <course> Math </course>
    <grade> A+ </grade>
  </student>
  <student>
    <name> Smith </name>
    <course> CS </course>
    <grade> A- </grade>
  </student>
</db>

name    course  grade
Smith   Math    B
Jones   Math    A+
Smith   CS      A-

Components: XML vs. relational DB (cont’d)

DB: if 2 tuples agree on their name and course attributes, they agree everywhere
XML: if 2 elements agree on the name and course subelements, then they are the same element
Node identification? Equality?

Nodes - Value Equality

name is a key for person nodes
name may have a complex structure: first-name, last-name

[Figure: tree under db with nodes dept, company, government, university and employee; employees carry @id and name, where name is either the text “Bill Clinton” or has subelements firstName “Bill” and lastName “Clinton”.]

Hierarchical structure

Hierarchically structured databases, e.g. scientific data formats
Top-level key to identify components of a document
Secondary key to identify sub-components
– Book/chapter/section
– Bible/book/chapter/verse

Absolute and relative keys

In an XML document, how to identify a book? A chapter? A section?

[Figure: document tree rooted at db; book nodes (titled “XML” and “SGML”) contain numbered chapter nodes, which contain numbered section nodes. Numbers such as “1” repeat across chapters and sections.]

XML standard - ID attribute

<!ATTLIST book title ID #REQUIRED>
<!ATTLIST chapter number ID #REQUIRED>
<!ATTLIST section number ID #REQUIRED>

Internal “pointers” rather than keys
Scoping: an ID attribute is unique within the entire document rather than among a designated set of elements
– can’t express relative keys, e.g., for chapters/sections
Limited to attributes rather than elements
Unary: at most one “key” can be defined, in terms of a single attribute
Value equality: on text (string) only
Defined as an attribute type: keys must come with a DTD

XML Data

Introduces a notion of keys explicitly

<elementType id="booktable">
  <element id="titleID" type="#title"/>
  <element type="#author"/>
  <element type="#pages"/>
  <key id="bookkey">
    <keyPart href="#titleID"/>
  </key>
</elementType>

BUT
– Can only be defined for element types rather than for certain collections of elements, e.g. books, articles, …

XPath

Possible to specify interesting fragments of a document
Syntax similar to navigating directories in a file system

//   arbitrary path
.    empty path
/    document root, path concatenator
*    any single node name

XPath example

Select BBB elements which have any attribute

<AAA>
  <BBB id = "b1"/>
  <BBB id = "b2"/>
  <BBB name = "bbb"/>
  <BBB/>
</AAA>

//BBB[@*]

XPath example (cont’d)

Select all ancestors of GGG elements

<AAA>
  <BBB></BBB>
  <XXX>
    <DDD>
      <FFF>
        <GGG></GGG>
      </FFF>
    </DDD>
  </XXX>
  <CCC></CCC>
</AAA>

//GGG/ancestor::*

XML-Schema

<element name="book">
  <complexType>
    <sequence>
      <element name="title" type="string"/>
      <element name="chapters" maxOccurs="unbounded">
        <complexType> ... </complexType>
      </element>
    </sequence>
  </complexType>
  <key name="k">
    <selector xpath="."/>
    <field xpath="title"/>
  </key>
</element>

XML Schema (cont’d)

Allows keys to be specified in terms of XPath expressions, BUT
– XPath is a relatively complex language (moves down, sideways and upwards; predicates and functions can be embedded)
– Equivalence/containment of XPath expressions is unresolved: no efficient way to tell whether two keys are equivalent
– Value equality: restricted to text
– Relative keys not addressed
– Structural requirement: key paths must exist and be unique

A new key constraint language for XML

Powerful enough to express absolute and relative keys
Simple enough to be reasoned about efficiently
– equivalence/containment
– consistency (satisfiability)
– implication (keys derived from others)
Capturing the semistructured nature of XML data
– independent of any types/schema
– no structural requirements: tolerating missing/multiple key paths

Outline

Node addresses – testing whether 2 nodes are the same node
Value equality – testing whether 2 nodes have the same value
Path expression language
Absolute key
Key inference
Relative key
Strong key
Some issues

Tree representation

DOM (Document Object Model)
Document is a hierarchical structure of nodes
– Element nodes
– Attribute nodes
– Text nodes

Tree representation (cont’d)

<db>
  <composer>
    <name> J.S. Bach </name>
    <born> 1685 </born>
    <work num="BWV82">
      <title> Ich habe genug </title>
    </work>
    <work num="BWV552"></work>
  </composer>
  <composer period="baroque">
    <name> G.F. Handel </name>
    <work num="HWV19">
      <title> Art Thou Troubled? </title>
    </work>
  </composer>
</db>

Tree representation (cont’d)

[Figure: tree of the composer document. db has two composer children (edges 1 and 2). The first composer has name “J.S. Bach” (1), born “1685” (2) and two work nodes (3, 4) with @num “BWV82” and “BWV552”; the first work has title “Ich habe genug”. The second composer has @period “baroque”, name “G.F. Handel” (1) and a work (2) with @num “HWV19” and title “Art Thou Troubled?”.]

Tree representation (cont’d)

Attribute node: name + text, terminal
Text node: text, terminal
Element node:
– name, may have children
– Text and element children held in an array
  • Index in the array determined by the order of the subelement in the document
– Attribute children held in a dictionary
  • Name of the attribute used as the index
Edge labels uniquely identify children

Node Address

A path of edge labels from the root uniquely identifies a node: <l1#…#ln>
– <1#2#1>, <1#3#@num>
An attribute node can only occur at the end of a node address
Order of attributes is unimportant
Order of subelements specified by their indexes
Address of a subnode relative to a node
– Any subnode of a node with address <a> will have a node address of the form <a#b>, where <b> is the address of the subnode relative to <a>
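The addressing scheme can be sketched in Python. This is only an illustration, not code from the paper: the function names node_addresses and show are made up here, addresses are modelled as tuples of edge labels, and whitespace-only text is assumed not to form a text node.

```python
import xml.etree.ElementTree as ET

def node_addresses(xml_text):
    """Map each node address (tuple of edge labels) to a (kind, content) pair."""
    table = {}

    def visit(elem, addr):
        table[addr] = ("element", elem.tag)
        # Attribute children live in a dictionary: the edge label is the attribute name.
        for name, text in elem.attrib.items():
            table[addr + ("@" + name,)] = ("attribute", text)
        # Text and element children share one array, indexed from 1 in document order.
        index = 0
        if elem.text and elem.text.strip():
            index += 1
            table[addr + (index,)] = ("text", elem.text.strip())
        for child in elem:
            index += 1
            visit(child, addr + (index,))
            if child.tail and child.tail.strip():  # text that follows a subelement
                index += 1
                table[addr + (index,)] = ("text", child.tail.strip())

    visit(ET.fromstring(xml_text), ())
    return table

def show(addr):
    """Render an address in the <l1#...#ln> notation of the slides."""
    return "<" + "#".join(str(label) for label in addr) + ">"

DOC = """<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"></work>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

addresses = node_addresses(DOC)
print(show((1, 2, 1)), addresses[(1, 2, 1)])            # the text node under born
print(show((1, 3, "@num")), addresses[(1, 3, "@num")])  # attribute of the first work
```

The root db gets the empty address, so the address of any subnode relative to a node is simply the remaining suffix of the tuple.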

Value Equality

Value of a node
1. A set S of relative addresses of its subnodes
2. A partial function from S to names
3. A partial function from S to texts
2 nodes are value-equal if they agree on 1, 2, 3
Notation: a =v b
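A minimal sketch of this definition in Python, under stated assumptions: the helper names value and value_equal are hypothetical, the three components are folded into one dictionary from relative address to (name, text), and whitespace-only text is ignored. The sample document is an assumed reading of the person example, not taken verbatim from the slides.

```python
import xml.etree.ElementTree as ET

def value(elem):
    """Value of a node: relative address -> (name or None, text or None)."""
    val = {}

    def visit(e, addr):
        val[addr] = (e.tag, None)                       # element: name only
        for name, text in e.attrib.items():
            val[addr + ("@" + name,)] = (name, text)    # attribute: name and text
        index = 0
        if e.text and e.text.strip():
            index += 1
            val[addr + (index,)] = (None, e.text.strip())  # text node: text only
        for child in e:
            index += 1
            visit(child, addr + (index,))
            if child.tail and child.tail.strip():
                index += 1
                val[addr + (index,)] = (None, child.tail.strip())

    visit(elem, ())
    return val

def value_equal(a, b):
    """a =v b: same relative addresses, same names, same texts."""
    return value(a) == value(b)

doc = ET.fromstring(
    "<db>"
    "<person phone='234-5678'><name><firstName>George</firstName>"
    "<lastName>Bush</lastName></name></person>"
    "<person phone='123-4567'><name><firstName>George</firstName>"
    "<lastName>Bush</lastName></name></person>"
    "</db>")
p1, p2 = doc.findall("person")

print(set(value(p1.find("name"))))                  # the set S for a name node
print(value_equal(p1.find("name"), p2.find("name")))
print(value_equal(p1, p2))                          # the phone attributes differ
```

The two name nodes are distinct nodes (different addresses) yet value-equal, which is exactly the distinction the key definition relies on.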

Value Equality (example)

S = {., <1>, <2>, <1#1>, <2#1>}

[Figure: db with four person nodes. Two persons have different @phone values (“234-5678” and “123-4567”) but value-equal name subtrees: name with firstName “George” (1) and lastName “Bush” (2).]

Path expressions

How to identify nodes in a tree?
Expression involving node names (tags + attributes) that describes a set of paths in the document tree
– XPath (XML-Schema)
– Regular expressions (semistructured data)

Regular Path Expressions

[Figure: db with children emps and depts; emps contains emp nodes and depts contains a dept with a mgr; name leaves hold “Mary”, “John” and “Bill”.]

In the normal syntax of regular expressions:
db.emps.emp
db.(depts.dept.mgr | emps.emp)
db._*.name

Language for path expression

2 necessary properties
– Concatenation operation, which has no uniform presentation in XPath
  • Concatenate a/b with /c/d: a/b//c/d
– A path should only move down the tree
  • XPath navigation axes can also move sideways and upwards

Language for path expression

Empty path “ε” (“.”)
Node name (tag/attribute name)
Wild card “_”, single node name (“*”)
Arbitrary path “_*” (“//”)
Concatenation of paths P, Q is P.Q (“/”)
Notation
– n[P]: set of nodes (node addresses) reached by starting at node n and following a path that conforms to P
– [P] := root[P]

Examples

Simple path
– <2#2>[title] = {<2#2#1>}
– [composer.work] = {<1#3>, <1#4>, <2#2>}
Complex path
– <2#2>[_*] = {<2#2>, <2#2#1>, <2#2#1#1>, <2#2#@num>}
– [composer._] = {<1#1>, <1#2>, <1#3>, <1#4>, <2#1>, <2#2>}
– [_*.num] = {<1#3#@num>, <1#4#@num>, <2#2#@num>}
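The notation n[P] can be sketched as a small evaluator over node addresses. The names build and reach are hypothetical, the empty path is written as the empty string, and two assumptions are made that the slides do not spell out: “_” is taken to match any single child (attributes and text nodes included), and “_*” yields a node together with all of its descendants.

```python
import xml.etree.ElementTree as ET

def build(xml_text):
    """Return {address: name}; text nodes get the name None."""
    names = {}

    def visit(e, addr):
        names[addr] = e.tag
        for attr in e.attrib:
            names[addr + ("@" + attr,)] = attr
        i = 0
        if e.text and e.text.strip():
            i += 1
            names[addr + (i,)] = None
        for child in e:
            i += 1
            visit(child, addr + (i,))
            if child.tail and child.tail.strip():
                i += 1
                names[addr + (i,)] = None

    visit(ET.fromstring(xml_text), ())
    return names

def reach(names, start, path):
    """n[P]: addresses reached from `start` by a path conforming to P."""
    current = {start}
    for step in (path.split(".") if path else []):
        if step == "_*":   # arbitrary path: each node plus all its descendants
            current = {a for a in names for n in current if a[:len(n)] == n}
        else:              # exactly one edge down, matched by name or wild card
            current = {a for a in names for n in current
                       if len(a) == len(n) + 1 and a[:len(n)] == n
                       and (step == "_" or names[a] == step)}
    return current

DOC = """<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"></work>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>"""

names = build(DOC)
print(reach(names, (), "composer.work"))
print(reach(names, (2, 2), "_*"))
print(reach(names, (), "_*.num"))
```

Because every step either stays put or moves one edge down, a path in this language can never leave the subtree of its start node, which is the second property required of the path language.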


Absolute key

Key specification

Necessary to specify
– Set on which we are defining the key (relation)
– “Attributes” (set of column names)
Pair (Q, {P1, …, Pn})
– Target path Q: path expression identifying the target set on which the key constraint is to hold
– Key paths {P1, …, Pn}: set of simple path expressions

Key specification (cont’d)

– Target path Q
– Key paths {P1, …, Pn}
For any node n in [Q], there is a set of nodes n[Pi] found by following Pi from n (may be empty)
Examples
1. (person.employees, {name.firstname, name.lastname})
2. (composer, {name})
3. (composer, {born})

Formal Definition

A node n satisfies a key specification (Q, {P1, …, Pk}) iff for any n1, n2 in n[Q],
if for all i, 1 <= i <= k, there exist z1 in n1[Pi] and z2 in n2[Pi] such that z1 =v z2,
then n1 = n2.

Value equality: z1 =v z2
Node equality: n1 = n2; 2 nodes are equal if they have the same node address
The values associated with key paths uniquely identify a node in the target set
A constraint on the data, not part of the schema

Remarks

For any n1, n2 in [Q], if Pi is missing at either n1 or n2, then n1[Pi] and n2[Pi] are by definition disjoint

Multiple nodes

<db>
  <A> <B> 1 </B> </A>
  <A> <B> 1 </B> <B> 2 </B> </A>
</db>

Key (A, {B}) with respect to the root.
The document does not satisfy the key: the two A nodes each have a B subnode with value 1.
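The formal definition can be checked mechanically. The sketch below is an illustration under simplifying assumptions, not the authors' algorithm: follow, value and satisfies are hypothetical names, key paths only traverse subelements (no attributes), and a node's value ignores text that follows a subelement.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

def follow(node, path):
    """Element nodes reached from `node`; '' is the empty path, '_' and '_*' are wild cards."""
    nodes = [node]
    for step in (path.split(".") if path else []):
        if step == "_*":
            nodes = [d for n in nodes for d in n.iter()]   # n itself and all descendants
        else:
            nodes = [c for n in nodes for c in n if step == "_" or c.tag == step]
    return nodes

def value(node):
    """Simplified node value: tag, attributes, text, ordered subelement values."""
    return (node.tag, tuple(sorted(node.attrib.items())),
            (node.text or "").strip(), tuple(value(c) for c in node))

def satisfies(node, target, key_paths):
    """True iff `node` satisfies the key (target, key_paths)."""
    for n1, n2 in combinations(follow(node, target), 2):
        if n1 is n2:
            continue
        # n1 and n2 agree on Pi if some z1 in n1[Pi] is value-equal to some z2 in n2[Pi].
        if all({value(z) for z in follow(n1, p)} & {value(z) for z in follow(n2, p)}
               for p in key_paths):
            return False   # two distinct target nodes agree on every key path
    return True

multi = ET.fromstring("<db><A><B>1</B></A><A><B>1</B><B>2</B></A></db>")
students = ET.fromstring(
    "<db>"
    "<student><name>Smith</name><course>Math</course><grade>B</grade></student>"
    "<student><name>Jones</name><course>Math</course><grade>A+</grade></student>"
    "<student><name>Smith</name><course>CS</course><grade>A-</grade></student>"
    "</db>")

print(satisfies(multi, "A", ["B"]))
print(satisfies(students, "student", ["name", "course"]))
print(satisfies(students, "student", ["name"]))
```

With an empty set of key paths every pair of distinct target nodes agrees vacuously, so the check only passes when the target set has at most one node.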

Example of keys

(_*.person, {id})
– Any 2 person elements are disjoint on their id fields
(person, {ε})
– Any 2 person nodes immediately under the root have different values
(employee, {})
– Empty key: there is at most one employee under the root
(_*, {id})
– Any 2 nodes are disjoint on their id fields up to value-equality
– Semantics of the ID attribute in the XML standard

XML vs. relational

XML: paths that define keys
– Need not exist (null-valued keys)
– Do not have to be unique
– Key paths specify a set of addresses within a document

Relational DB
– Key values cannot be null, must exist
– Have to be unique
– 1NF requires each component of every tuple to be an atomic value, not a set

Remarks

Equivalence of 2 path expressions is decidable
Given a definition of equality on trees, do we need more than one key path in a key specification?
– All key attributes would have to be represented as subnodes of some node
– This node would have to be constrained to contain only those subnodes
– Too restrictive: unnecessary interference between key specifications and data models
Allow a (possibly empty) set of nodes at the end of each key path
– How to require each of the key paths to exist and to be unique?

Remarks (cont’d)

Language of path expressions
– Need something more powerful to express Q, e.g. (person.(mother | father)*, {id})
– A person element followed by zero or more father or mother elements
Provisional language of path expressions
– Its exact form does not get in the way of the theory

Key inference

In relational DB
– Infer some keys from the presence of others
If (Q, S) is a key and S ⊆ S’, then so is (Q, S’)
– Counterpart of the relational inference rule
If (Q.Q’, {P}) is a key, then so is (Q, {Q’.P})
– Tree-like structure: if a node is identified in a tree then its ancestors are also determined
– I.e. if a key path P uniquely identifies a node n in [Q.Q’], then Q’.P is a key path for the ancestor of n in [Q]

Key Inference (cont’d)

If (Q, S) is a key and Q’ ⊆ Q, then (Q’, S) is also a key
– Any key of the set [Q] is also a key for any subset of [Q]
For any finite set Σ of keys, there exists a (finite) XML document satisfying Σ
– Key paths may be missing, e.g. (_*, {id})
  • If the key path were required to exist at all nodes specified by the target path, the XML document would have to be infinite to satisfy the key
– Only holds in the absence of DTDs

Key Inference

Key K = (X, {})
DTD D: <!ELEMENT foo (X, X)>

[Figure: foo nodes with X children.]

No XML document both conforms to D and satisfies K
DTDs interact with XML key constraints


Relative Key

Relative key - Motivation

Motivated by scientific data formats: hierarchical structure, large set of entries at the top level
Protein sequence database Swiss-Prot
– Accession number (key) for each entry
– Within each entry, a sequence of citations, each identified by a number 1, 2, 3, …
Linguistic database – recording of speech
– Data sets held in files
– Metadata provided by directory structure
– /timit/train/dr1/fcjf0/sa1.wav
– TIMIT corpus, training set, dialect region 1, female speaker, speaker-ID "cjf0", sentence text "sa1", speech waveform file

An absolute key for books

An absolute key to identify a book: (book, {title})
– target path: book, starting from the root and identifying a collection of books
– key path: title; its value uniquely identifies a book
– absolute: defined on the entire document

[Figure: document tree rooted at db; book nodes (titled “XML” and “SGML”) contain numbered chapter nodes, which contain numbered section nodes.]

Relative key - definition

Like the key of a weak entity set in DB
– Studios(name, address)
– Crews(number)
A document satisfies a relative key specification (Q, (Q’, S)) iff for all nodes n in [Q], n satisfies the key (Q’, S)
Absolute keys are a special case of relative keys
– (Q’, S) is equivalent to (ε, (Q’, S))

A relative key for chapters

A relative key: (book, (chapter, {number}))
A chapter number uniquely identifies a chapter within a book!
– context path: book
– target path: chapter, starting at a book
– key path: number
– relative: defined on sub-documents, relative to the context

[Figure: document tree rooted at db; book nodes (titled “XML” and “SGML”) contain numbered chapter nodes, which contain numbered section nodes.]

Absolute/Relative Key

What is the difference between
– Absolute key (book.chapter, {number})
– Relative key (book, (chapter, {number}))

[Figure: document tree rooted at db; book nodes (titled “XML” and “SGML”) contain numbered chapter nodes, which contain numbered section nodes.]
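One way to see the difference is to evaluate both keys on a small document in which two books each have chapters numbered 1 and 10. The sketch is illustrative only: the helper names are hypothetical, paths are plain dotted tag names, and the sample document is a simplified stand-in for the tree in the figure.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

def follow(node, path):
    """Subelements reached from `node` by a dotted path of tag names ('' = empty path)."""
    nodes = [node]
    for step in (path.split(".") if path else []):
        nodes = [child for n in nodes for child in n if child.tag == step]
    return nodes

def value(node):
    """Simplified node value: tag, attributes, text, ordered subelement values."""
    return (node.tag, tuple(sorted(node.attrib.items())),
            (node.text or "").strip(), tuple(value(c) for c in node))

def satisfies(node, target, key_paths):
    """Absolute key (target, key_paths) evaluated at `node`."""
    for n1, n2 in combinations(follow(node, target), 2):
        if all({value(z) for z in follow(n1, p)} & {value(z) for z in follow(n2, p)}
               for p in key_paths):
            return False
    return True

def satisfies_relative(root, context, target, key_paths):
    """Relative key (context, (target, key_paths)): the key must hold at every context node."""
    return all(satisfies(n, target, key_paths) for n in follow(root, context))

doc = ET.fromstring("""<db>
  <book><title>XML</title>
    <chapter><number>1</number></chapter>
    <chapter><number>10</number></chapter>
  </book>
  <book><title>SGML</title>
    <chapter><number>1</number></chapter>
    <chapter><number>10</number></chapter>
  </book>
</db>""")

absolute_ok = satisfies(doc, "book.chapter", ["number"])
relative_ok = satisfies_relative(doc, "book", "chapter", ["number"])
print(absolute_ok, relative_ok)
```

The absolute key fails because chapter 1 of “XML” and chapter 1 of “SGML” agree on number across the whole document; the relative key holds because numbers only need to be unique inside each book.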

A relative key for sections

Key: (book.chapter, (section, {number}))
A section number uniquely identifies a section within a particular chapter of a particular book!
– relative to the chapter containing the section, and to the book containing the chapter

[Figure: document tree rooted at db; book nodes (titled “XML” and “SGML”) contain numbered chapter nodes, which contain numbered section nodes.]

Transitivity of relative keys

A relative key such as (bible.book.chapter, (verse, {number})) does not by itself uniquely identify a particular verse in the bible
Book name, chapter number, verse number → verse

“immediately precedes” relation

(Q1, (Q’1, S1)) immediately precedes (Q2, (Q’2, S2)) if Q2 = Q1.Q’1
– (bible, (book, {name})) immediately precedes (bible.book, (chapter, {number}))
– Any absolute key immediately precedes itself

“precede” relation

Precedes is the transitive closure of the immediately precedes relation
– Qn = Q1.Q’1…Q’n-1

(bible, (book, {name})),

(bible.book,(chapter, {number})),

(bible.book.chapter,(verse, {number}))

Transitivity of relative keys

A set Σ of relative keys is transitive if for any relative key K1 = (Q1, (Q’1, S1)) in Σ there is a key K2 = (ε, (Q’2, S2)) in Σ which precedes K1
Any transitive set of relative keys must contain some absolute key
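The transitivity test amounts to a reachability check over the “immediately precedes” relation, starting from the absolute keys. A minimal sketch, assuming keys are written as (context, target, key_paths) tuples of dotted strings with ε written as the empty string, and assuming paths are compared syntactically rather than up to equivalence; the function names are hypothetical.

```python
def concat(p, q):
    """Path concatenation P.Q, where '' stands for the empty path."""
    return ".".join(part for part in (p, q) if part)

def immediately_precedes(k1, k2):
    """(Q1,(Q'1,S1)) immediately precedes (Q2,(Q'2,S2)) if Q2 = Q1.Q'1."""
    context1, target1, _ = k1
    context2, _, _ = k2
    if k1 == k2 and context1 == "":
        return True                      # any absolute key immediately precedes itself
    return context2 == concat(context1, target1)

def is_transitive(keys):
    """True iff every key is preceded by some absolute key of the set."""
    reached = {k for k in keys if k[0] == ""}   # the absolute keys
    frontier = set(reached)
    while frontier:
        frontier = {k2 for k1 in frontier for k2 in keys
                    if immediately_precedes(k1, k2)} - reached
        reached |= frontier
    return reached == set(keys)

bible = [
    ("", "bible.book", ("name",)),
    ("bible.book", "chapter", ("number",)),
    ("bible.book.chapter", "verse", ("number",)),
]
verses_only = [("bible.book.chapter", "verse", ("number",))]

print(is_transitive(bible))
print(is_transitive(verses_only))
```

A set made up only of the verse key is not transitive: nothing in it identifies the chapter or the book that provide the context.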


Transitivity of relative keys - example

TRANSITIVE SET

(ε,(bible.book, {name}))

(bible.book,(chapter, {number}))

(bible.book.chapter,(verse, {number}))

Insertion-friendly relative keys

Transitive key specification
(ε, (university, {name}))
(university, (dept.employee, {emp-id}))

Identify an employee: university name + emp-id
Add an employee: specify a dept for the employee
No way to identify a dept
– Many ways to add an employee!!!

Insertion-friendly relative keys (cont’d)

Insert an element in the “keyed” part of the document unambiguously by specifying where to insert the element using keys
A set Σ of relative keys is insertion-friendly if it is transitive and whenever (Q1, (Q’1.n, S1)) ∈ Σ, there is a relative key (Q2, (Q’2, S2)) ∈ Σ where |Q’2| > 0 and Q1.Q’1 = Q2.Q’2
– n is a node name
Every element with a prefix along the path Q1.Q’1 can be identified through some keys


Insertion-friendly relative keys (cont’d)

(ε, (university, {name}))

(university, (dept, {dept-name}))

(university, (dept.employee, {emp-id}))

n = employee

Insertion-friendly relative keys (cont’d)

(ε, (university, {name}))
(university, (dept, {dept-name}))
(university, (dept.employee, {emp-id}))

Nothing about the dept is necessary to identify employees!!!
Anomaly like those that occur in relational databases that are not in second normal form
Employees should not be children of department nodes, but only of university nodes
Linkage between employees and departments should be expressed through a foreign key

Notation for relative key

If a system of relative keys is transitive, it forms a hierarchical structure
– Create a compressed syntax for such systems
Basic syntactic form
Q1{P1, …, Pk1}.Q2{P1, …, Pk2}. … .Qn{P1, …, Pkn}

Notation for relative key (cont’d)

bible{}.book{name}.chapter{number}.verse{number}
– (ε, (bible, {}))
– (bible, (book, {name}))
– (bible.book, (chapter, {number}))
– (bible.book.chapter, (verse, {number}))

company{name}[.employee{id}, .department{name}]
– company{name}.employee{id}
– company{name}.department{name}
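Expanding the linear form of the compressed syntax into its relative keys is mechanical. A small sketch, assuming only the unbranched form Q1{…}.Q2{…}…Qn{…} (the bracketed branching form is not handled); the function name expand and the tuple layout (context, target, key_paths) are choices made for this illustration, with ε written as the empty string.

```python
import re

def expand(spec):
    """Expand 'Q1{P1,...}.Q2{...}...' into a list of (context, target, key_paths)."""
    keys, context = [], ""
    # Each match is one "target{key paths}" segment; the leading "." is stripped below.
    for target, paths in re.findall(r"([^{}]+)\{([^{}]*)\}", spec):
        target = target.strip(". ")
        key_paths = tuple(p.strip() for p in paths.split(",") if p.strip())
        keys.append((context, target, key_paths))
        # The next key is relative to everything matched so far.
        context = target if not context else context + "." + target
    return keys

for key in expand("bible{}.book{name}.chapter{number}.verse{number}"):
    print(key)
```

Each segment contributes one relative key whose context is the concatenation of all earlier targets, so the resulting set is transitive by construction.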

Notation for relative key

Compact and understandable
Ensures the internal consistency of the document
Tells others how to cite a component of our document
Our document has a structured “core”


Strong keys

Stronger definitions of keys

Requirements imposed by a key in relational DB:
– Uniqueness of a key
– Existence of a key
Key paths exist and are unique (for 1 <= i <= k, n[Pi] contains exactly one node)
– name is unique at <1>
– work and num are not unique at this node

Stronger definitions of keys (cont’d)

A node n satisfies a strong key specification (Q, {P1, …, Pk}) if
– For all n’ in n[Q] and for all Pi, Pi exists and is unique at n’
– For any n1, n2 in n[Q], if for all i, n1[Pi] =v n2[Pi], then n1 = n2
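The two conditions of a strong key translate directly into a check. This is a sketch under the same simplifications as before rather than a reference implementation: the helper names are hypothetical, paths are dotted tag names over subelements only, and a node's value ignores text that follows a subelement.

```python
import xml.etree.ElementTree as ET

def follow(node, path):
    """Subelements reached from `node` by a dotted path of tag names ('' = empty path)."""
    nodes = [node]
    for step in (path.split(".") if path else []):
        nodes = [child for n in nodes for child in n if child.tag == step]
    return nodes

def value(node):
    """Simplified node value: tag, attributes, text, ordered subelement values."""
    return (node.tag, tuple(sorted(node.attrib.items())),
            (node.text or "").strip(), tuple(value(c) for c in node))

def satisfies_strong(node, target, key_paths):
    """Strong key: every key path exists exactly once, and key values are unique."""
    targets = follow(node, target)
    # Condition 1: each Pi exists and is unique at every target node.
    if any(len(follow(t, p)) != 1 for t in targets for p in key_paths):
        return False
    # Condition 2: no two target nodes carry value-equal key values.
    seen = set()
    for t in targets:
        signature = tuple(value(follow(t, p)[0]) for p in key_paths)
        if signature in seen:
            return False
        seen.add(signature)
    return True

doc = ET.fromstring("""<db>
  <composer>
    <name>J.S. Bach</name><born>1685</born>
    <work num="BWV82"><title>Ich habe genug</title></work>
    <work num="BWV552"></work>
  </composer>
  <composer period="baroque">
    <name>G.F. Handel</name>
    <work num="HWV19"><title>Art Thou Troubled?</title></work>
  </composer>
</db>""")

print(satisfies_strong(doc, "composer", ["name"]))
print(satisfies_strong(doc, "composer", ["work"]))   # two work nodes under the first composer
print(satisfies_strong(doc, "composer", ["born"]))   # born is missing at the second composer
```

Only the first call passes: name exists exactly once at each composer and the two names differ, whereas work is not unique and born does not always exist.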

Stronger definitions of keys (cont’d)

(_*.person, {id})
– Any 2 person elements have a unique id and differ on those elements
(person, {ε})
– Unchanged
(employees, {})
– Unchanged

Stronger definitions of keys (cont’d)

(_*, {k})
– Every element has a key k, including elements whose name is k
Finite satisfiability?
– Imposes an infinite chain of k nodes
– No finite document satisfies it
Because of the requirement that key paths exist
– Structural constraint


Relative Strong Key

A document satisfies a strong relative key specification (Q, (Q’,S)) iff for all nodes n in [Q], n satisfies the strong key (Q’,S)


“Unconstrained” XML : Node names as key values


Node names as key values

Key specification must cover the practical cases without using definitions that are too complex to allow any kind of reasoning about keys

Issue in “unconstrained” XML: interchanging structure (the names) with data (their values)

“Unconstrained” XML

<db>
  <parts>
    <widget> <id> 123 </id> <w> 1.5 </w> </widget>
    <widget> <id> 234 </id> <w> 2.5 </w> </widget>
    <gadget> <id> 123 </id> <w> 3.2 </w> </gadget>
  </parts>
</db>

Alternative representation

<db>
  <parts>
    <part> <type> widget </type> <id> 123 </id> <w> 1.5 </w> </part>
    <part> <type> widget </type> <id> 234 </id> <w> 2.5 </w> </part>
    <part> <type> gadget </type> <id> 123 </id> <w> 3.2 </w> </part>
  </parts>
</db>

Node names as key values (cont’d)

“Unconstrained” XML
– Type of a part is expressed in the tag
– Key constraint: parts{}[.widget{id}, .gadget{id}]
Alternative XML representation
– Type expressed as an attribute or subelement of a part element
– Key constraint: parts{}[.part{type, id}]

Introducing a new part type

Introduce a thingy
“Unconstrained”
– Change key specification
– parts{}[.widget{id}, .gadget{id}, .thingy{id}]
Alternative
– No change: parts{}[.part{type, id}]
Ability to interchange structure and data is supposed to be one of the strong points of semistructured data and XML

Solution

Add a “virtual” subelement node-name to each named node, whose value consists of the node name
Key: parts{}._{node-name, id}
Does not alter any of the properties expected to hold for keys
Accounts for any practical use of tag names in keys


Conclusion

A new key constraint language for XML:

– independent of any schema specifications for XML

– powerful enough to express absolute and relative keys

– simple enough to be reasoned about efficiently

In contrast to their relational counterparts:

– XML keys are more complex

– the analyses of XML keys are far more intricate

References

Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, and Wang-Chiew Tan. Keys for XML. WWW10, 2001. http://db.cis.upenn.edu/DL/xmlkeys.ps

Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, and Wang-Chiew Tan. Reasoning about keys for XML. Technical Report MS-CIS-00-26, University of Pennsylvania, 2000. http://db.cis.upenn.edu/DL/absolute-full.ps