Master Informatique 10/9/2007 1
Typing semistructured data
Slides courtesy of Serge Abiteboul
Web Data Management
Typing semistructured data
Organization
• Motivations
• Automata
  – Automata on words
  – Ranked tree automata
  – Unranked tree automata
  – Automata and monadic second-order logic
  – Automata – to compute
• XML typing: DTD, XML schema
• Graphs and bisimulation
XML typing
• Not compulsory
• Simplify writing software for XML
  – Improve interoperability between programs
• Improve storage and performance
• Ease of querying: data guide
• Simplify data protection
  – Reject illegal updates, like relational dependencies
Improve storage
[Figure: lower-bound schema graph with nodes Root, Company, Employee, and string; edge labels company, person, works-for, c.e.o., address, name, managed-by]
Relational tables derived from the schema:
• Company (oid, name, address, c.e.o.)
• Employee (oid, name, managed-by, works-for)
Store rest in overflow graph
Lower-bound schema
Improve performance
[Figure: schema graph for Bib with labels paper, book, year, journal, title, address, author, zip, city, street, lastname, firstname; leaf types int and string]
Query as written:
select X.title
from Bib._ X
where X.*.zip = "12345"
Equivalent query, rewritten using the type:
select X.title
from Bib.book X
where X.address.zip = "12345"
Type checking
• Who checks
  – XML editor: check that the data conforms to its type
  – XML exchange, e.g., with a Web service
    • Server: when delivering the data
    • Client/application: when receiving it
• Dynamic verification: after the data is produced
• Static verification: verification of the program that generates the data
Static verification
• Input: input type T and code of function f
  – f is XQuery, XPath, XSLT, etc.
• Verification of T'
  – Is it true that for all $d \models T$, $f(d) \models T'$?
• Type inference
  – Find the smallest T' such that for all $d \models T$, $f(d) \models T'$
• Quickly becomes undecidable because of “joins”
Example
for $p in doc("parts.xml")//part[color="red"]
return <part>
         <name>{$p/name/text()}</name>
         <desc>{$p/desc/node()}</desc>
       </part>
Result type: (part (name (string) desc (any)))*
If the type of parts.xml//part/desc is string: (part (name (string) desc (string)))*
Difficulty
for $X in Input, $Y in Input do { print(<b/>) }
Input: <a/> <a/>
Result: <b/> <b/> <b/> <b/>
Problem: $\{ b^i \mid i = n^2, n \ge 0 \}$ cannot be described in XML schema
There is no « best » result
  – $b^*$
  – $\varepsilon + b\, b^*$
  – $\varepsilon + b + b^4 b^*$
  – $\varepsilon + b + b^4 + b^9 b^*$
  – …
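A minimal Python sketch of the nested loop above, assuming the input is simply a list of elements (the function name `run` is hypothetical). It illustrates why the output sizes are exactly the perfect squares, a set of lengths that no regular expression over b can describe exactly.

```python
def run(inputs):
    """Mimic: for $X in Input, $Y in Input do { print(<b/>) }."""
    # One <b/> is produced for every pair (X, Y) of input elements.
    return ["<b/>" for _x in inputs for _y in inputs]


if __name__ == "__main__":
    for n in range(5):
        # The number of <b/> elements is n * n for an input of size n.
        print(n, len(run(["<a/>"] * n)))
```

Any regular type for the result has to over-approximate this set, which is why each candidate in the list above can be refined by a more precise one.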
Why tree automata?
• XML = unranked trees
• No theory for XML
• Rich theory for strings: automata
• Extends to a rich theory for ranked trees: tree automata
  – Nice algorithms
  – Nice theorems
  – Can this carry over to unranked trees and XML?
• Yes!
From strings to trees
[Figure: a word, a binary tree, and an unranked tree over the labels a and b]
• Word: finite state automata
• Binary tree: ranked tree automata
• Unranked tree (no bound on the number of children): unranked tree automata
Why not then use unranked tree automata?
• Missing practical gadgets
• Complexity of verification
  – Goal: typing at reasonable cost
Finite state automata on words
$A = (\Sigma, Q, q_0, F, \delta)$
• Alphabet: $\Sigma$
• States: $Q$
• Initial state: $q_0 \in Q$
• Accepting states: $F \subseteq Q$
• Transitions: $\delta: \Sigma \times Q \to P(Q)$
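The definition above can be sketched in Python by tracking the set of states reachable after each symbol. The automaton used here (words over {a, b} that end with "ab") is a hypothetical example, not the one from the slides, and the names `accepts` and `DELTA` are assumptions.

```python
def accepts(delta, q0, final, word):
    """Run a nondeterministic automaton by tracking all reachable states."""
    current = {q0}
    for symbol in word:
        # Follow every transition labeled `symbol` from every current state.
        current = {r for q in current for r in delta.get((symbol, q), set())}
    # The word is accepted if at least one run ends in an accepting state.
    return bool(current & final)


# Hypothetical automaton: delta maps (symbol, state) to a set of states.
DELTA = {
    ("a", "q0"): {"q0", "q1"},  # nondeterministic choice on a
    ("b", "q0"): {"q0"},
    ("b", "q1"): {"q2"},
}

if __name__ == "__main__":
    print(accepts(DELTA, "q0", {"q2"}, "aab"))
    print(accepts(DELTA, "q0", {"q2"}, "aba"))
```

Tracking a set of states rather than a single state is the same idea that underlies determinization: each reachable set becomes one state of the deterministic automaton.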
Nondeterministic automaton: Example
$\Sigma = \{a, b\}$, $Q = \{q_0, q_1, q_2, q_3\}$, $F = \{q_2\}$
[Figure: transition table, including $a, q_0 \to q_0, q_1$, and runs on two input words over {a, b}, one rejected (KO) and one accepted (OK)]
Reminder
• Deterministic
  – No ε-transition such as $\varepsilon, q_0 \to q$
  – No alternative transitions such as $a, q_0 \to q_0, q_1$
• Determinization
  – It is possible to obtain an equivalent deterministic automaton
  – State of new automaton = set of states of the original one
  – Possible exponential blow-up
• Minimization
• Limitations: cannot do
  – $\{ a^n b^n \mid n \in \mathbb{N} \}$
  – Context-free languages
• Essential tool, e.g., lexical analysis
Reminder (2)
• L(A) = set of words accepted by automaton A
• Regular languages
• Can be described by regular expressions, e.g. a(b+c)*d
• Closed under complement: $\Sigma^* - L(A)$
• Closed under union and intersection: $L(A) \cup L(B)$, $L(A) \cap L(B)$
  – Product automaton with states (s,s') where s is from A and s' is from B
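The product construction can be sketched as follows for two complete deterministic automata. The representation (a triple of start state, accepting states, and a transition dictionary keyed by (state, symbol)) and the two sample automata are assumptions made for illustration.

```python
from itertools import product


def product_dfa(dfa_a, dfa_b, alphabet):
    """Product of two complete DFAs; the result accepts L(A) ∩ L(B)."""
    (start_a, final_a, trans_a), (start_b, final_b, trans_b) = dfa_a, dfa_b
    states_a = {s for (s, _) in trans_a}
    states_b = {s for (s, _) in trans_b}
    # A product state (s, s2) moves component-wise on each symbol.
    trans = {
        ((s, s2), c): (trans_a[(s, c)], trans_b[(s2, c)])
        for s, s2 in product(states_a, states_b)
        for c in alphabet
    }
    # Accept when both components accept; use "or" instead for the union.
    final = {(s, s2) for s in final_a for s2 in final_b}
    return (start_a, start_b), final, trans


def run(dfa, word):
    start, final, trans = dfa
    state = start
    for c in word:
        state = trans[(state, c)]
    return state in final


# Hypothetical automata: A accepts words with an even number of a's,
# B accepts words that end with b.
A = ("even", {"even"}, {("even", "a"): "odd", ("even", "b"): "even",
                        ("odd", "a"): "even", ("odd", "b"): "odd"})
B = ("no", {"yes"}, {("no", "a"): "no", ("no", "b"): "yes",
                     ("yes", "a"): "no", ("yes", "b"): "yes"})
BOTH = product_dfa(A, B, "ab")
```

For example, `run(BOTH, "aab")` checks both conditions in a single pass over the word.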
Automata on words versus trees
[Figure: the word a b b a next to a tree with nodes labeled a and b]
• Words: left to right or right to left, no difference
• Trees: bottom up or top down, differences
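Binary tree automata
• Parallel evaluation
• $A = (\Sigma, Q, F, \delta)$
• For leaves: $\delta: \Sigma \to P(Q)$
• For other nodes: $\delta: \Sigma \times Q \times Q \to P(Q)$
[Figure: bottom-up run on a binary tree with nodes labeled a and b; a node labeled b whose children are in states q and q' moves to state q'']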
Bottom-up tree automata
• Bottom-up: if a node labeled a has its children in states q, q', then the node moves nondeterministically to state r or r': $a, q, q' \to r, r'$
• Accepts if the root is in some state in F
• Not deterministic if there are alternatives or ε-transitions: $a, q, q' \to \{r, r'\}$; $\varepsilon, r \to r'$
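Example: deterministic bottom-up
$\Sigma = \{\wedge, \vee, 0, 1\}$, $Q = \{q_0, q_1\}$, $F = \{q_1\}$
Leaves: $0 \to q_0$, $1 \to q_1$
$\wedge, q_1, q_1 \to q_1$
$\wedge, q_0, q_1 \to q_0$; $\wedge, q_1, q_0 \to q_0$; $\wedge, q_0, q_0 \to q_0$
$\vee, q_0, q_0 \to q_0$
$\vee, q_0, q_1 \to q_1$; $\vee, q_1, q_0 \to q_1$; $\vee, q_1, q_1 \to q_1$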
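Boolean circuit evaluation
$\vee, q_0, q_1 \to q_1$
$\vee, q_1, q_0 \to q_1$
$\vee, q_1, q_1 \to q_1$
$\vee, q_0, q_0 \to q_0$
$\wedge, q_0, q_1 \to q_0$
$\wedge, q_1, q_0 \to q_0$
$\wedge, q_0, q_0 \to q_0$
$\wedge, q_1, q_1 \to q_1$
Leaves: $0 \to q_0$, $1 \to q_1$
[Figure: bottom-up run on a Boolean circuit tree with ∧ and ∨ inner nodes and 0/1 leaves; the root ends in state $q_1$, so the tree is accepted (OK)]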
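The deterministic bottom-up automaton for Boolean circuits can be sketched in Python as a recursive evaluation. The tuple encoding of trees and the names `LEAF`, `DELTA`, `state`, and `accepts` are assumptions made for this illustration.

```python
# A tree is a leaf "0" / "1" or a triple (label, left, right)
# with label "and" or "or".
LEAF = {"0": "q0", "1": "q1"}
DELTA = {
    ("and", "q1", "q1"): "q1", ("and", "q0", "q1"): "q0",
    ("and", "q1", "q0"): "q0", ("and", "q0", "q0"): "q0",
    ("or", "q1", "q1"): "q1", ("or", "q0", "q1"): "q1",
    ("or", "q1", "q0"): "q1", ("or", "q0", "q0"): "q0",
}
FINAL = {"q1"}


def state(tree):
    """Compute the unique state reached at the root, bottom-up."""
    if isinstance(tree, str):
        return LEAF[tree]
    label, left, right = tree
    # The state of a node depends only on its label and its children's states.
    return DELTA[(label, state(left), state(right))]


def accepts(tree):
    return state(tree) in FINAL
```

Because the automaton is deterministic, one state per node is enough; a nondeterministic automaton would carry a set of states at each node instead.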
Regular tree language = set of trees accepted by a bottom-up tree automaton
Regular tree languages
The following are equivalent
  – L is a regular tree language
  – L is accepted by a nondeterministic bottom-up automaton
  – L is accepted by a deterministic bottom-up automaton
  – L is accepted by a nondeterministic top-down automaton
Deterministic top-down is weaker
Top-down tree automata
• Top-down: if a node labeled a is in state q'', then its left child moves to state q and its right child to state q': $a, q'' \to q, q'$
• Accepts if all leaves are in states in F
• Not deterministic if there are alternatives: $a, q'' \to (q, q'), (r, r')$
Why is deterministic top-down weaker?
• Consider the language
  – L = { f(a,b), f(b,a) }
• It can be accepted by a bottom-up TA
  – Exercise: write a BUTA A such that L = L(A)
• Suppose that B is a deterministic top-down TA with L = L(B)
  – Exercise: show that B also accepts f(a,a)
  – A contradiction
Fact: no deterministic top-down tree automaton accepts L
Ranked tree automata: Properties
• Like for words
• Determinization
• Minimization
• Closed under
  – Complement
  – Intersection
  – Union
But…
• XML documents are unranked: book (intro, section*, conclusion)
Unranked tree automata
$\wedge$: $t \to t$, $tt \to t$, $ttt \to t$, …
$\wedge$: $f \to f$, $tf \to f$, $ft \to f$, …
$\vee$: $t \to t$, $tf \to t$, $ft \to t$, …
$\vee$: $f \to f$, $ff \to f$, $fff \to f$, …
Issue: represent an infinite set of transitions
Solution: a regular language
Unranked tree automata (2)
• Rule: $a, L(Q) \to r_1, \dots, r_m$
• Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}
$\vee, Or_0 \to f$ where $Or_0 = f^+$
$\vee, Or_1 \to t$ where $Or_1 = (t+f)^* t (t+f)^*$
$\wedge, And_0 \to f$ where $And_0 = (t+f)^* f (t+f)^*$
$\wedge, And_1 \to t$ where $And_1 = t^+$
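A sketch of such rules in Python, using the standard `re` module to represent each regular language over the states t and f. The tree encoding (a leaf "0"/"1" or a pair of label and child list), the leaf mapping, and the names `RULES` and `state` are assumptions for illustration.

```python
import re

# Each rule: (label, regular expression over the children's states, result).
RULES = [
    ("or", r"f+", "f"),
    ("or", r"[tf]*t[tf]*", "t"),
    ("and", r"[tf]*f[tf]*", "f"),
    ("and", r"t+", "t"),
]
LEAF = {"0": "f", "1": "t"}  # assumed states for the leaves


def state(tree):
    """tree is a leaf "0"/"1" or a pair (label, [children])."""
    if isinstance(tree, str):
        return LEAF[tree]
    label, children = tree
    # The word formed by the states of the children, left to right.
    word = "".join(state(child) for child in children)
    for rule_label, language, result in RULES:
        if rule_label == label and re.fullmatch(language, word):
            return result
    raise ValueError(f"no rule for {label} with children states {word!r}")
```

A finite list of rules now covers nodes with any number of children, which is the point of describing the transitions by regular languages.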
Building on ranked trees
[Figure: an unranked tree over the labels a and b, and its encoding as a ranked tree]
Ranked tree: FirstChild-NextSibling
• F: encoding into a ranked tree
• F is a bijection
• $F^{-1}$: decoding
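The FirstChild-NextSibling encoding and its inverse can be sketched as two short recursive functions. The representations are assumptions: an unranked tree is a pair (label, list of children), and a binary node is a triple (label, first child, next sibling) with `None` for a missing branch.

```python
def encode(forest):
    """Encode a list of sibling trees as one binary tree (None if empty)."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    # Left branch: first child; right branch: next sibling.
    return (label, encode(children), encode(rest))


def decode(binary):
    """Inverse of encode: rebuild the list of sibling trees."""
    if binary is None:
        return []
    label, first_child, next_sibling = binary
    return [(label, decode(first_child))] + decode(next_sibling)


TREE = ("a", [("b", []), ("b", [("a", [])]), ("b", [])])
```

Encoding a single tree means calling `encode([TREE])`; decoding returns a one-element list, so `decode(encode([TREE]))[0]` gives back the original tree.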
Building on bottom-up ranked trees (2)
• For each unranked TA A, there is a ranked TA accepting F(L(A))
• For each ranked TA A, there is an unranked TA accepting $F^{-1}(L(A))$
• Both are easy to construct
Consequence: unranked TA are closed under union, intersection, complement
Determinization
• Determinization always possible for bottom-up
• Can we use the FirstChild-NextSibling encoding?
  – No: it does not preserve determinism
• Deterministic: for all $a \in \Sigma$ and $w \in Q^*$, there exists a unique rule $(a, L)$ such that $w \in L$
Top-down?
• This is more delicate
• Transition: $\delta(a,q) = A(a,q)$
  – The state of the automaton A(a,q) when reading the labels of the children of a node labeled a determines the states of the children of that node
  – Accepts if all the leaves are in accepting states
Boolean circuit evaluation
[Figure: top-down run on a Boolean circuit tree with ∧ and ∨ inner nodes and 0/1 leaves; every node is annotated with state $q_0$ or $q_1$]
A tree is accepted if, for some possible run, the states of all leaves are final
Automata
Automata and monadic second-order logic
Monadic second-order logic
• Representation of a tree as a logical structure
E(1,2), E(1,3), …, E(3,9)
S(2,3), S(3,4), S(4,5), …, S(8,9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)
[Figure: the tree with nodes numbered 1 to 9, each labeled a or b as listed above]
Monadic second-order logic
E(1,2), E(1,3), …, E(3,9)
S(2,3), S(3,4), S(4,5), …, S(8,9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)
MSO syntax
$\varphi ::= x = y \mid E(x,y) \mid S(x,y) \mid a(x) \mid \dots$
$\exists x\, \varphi$
$x \in X$ (X is a set variable)
$\exists X\, \varphi$ (quantification over a set variable)
Example of MSO
• Each a node has a b-descendant
• This corresponds to the formula
  For each node x labeled a: each set X that (i) contains x and that (ii) is closed under descendant contains some y labeled b
$\forall x\, \big( a(x) \to \forall X\, ( \text{(i)} \wedge \text{(ii)} \to \exists y\, ( X(y) \wedge b(y) ) ) \big)$ where
(i) $X(x)$
(ii) $\forall y, z\, ( E(y,z) \wedge X(y) \to X(z) )$
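The property expressed by this formula can be checked directly on a finite tree. The sketch below relies on one observation: every set X that contains x and is closed under E includes the smallest such set, so it suffices to test that smallest set. The function names and the sample tree are hypothetical.

```python
def closure(edges, x):
    """Smallest set that contains x and is closed under the relation E."""
    result, todo = {x}, [x]
    while todo:
        y = todo.pop()
        for (parent, child) in edges:
            if parent == y and child not in result:
                result.add(child)
                todo.append(child)
    return result


def each_a_has_b_descendant(edges, label):
    """For each node x labeled a, the closed set around x contains a b."""
    return all(
        any(label[y] == "b" for y in closure(edges, x))
        for x in label
        if label[x] == "a"
    )


# Hypothetical tree: node 1 has children 2 and 3, node 3 has child 4.
EDGES = {(1, 2), (1, 3), (3, 4)}
```

With the labeling `{1: "a", 2: "b", 3: "b", 4: "a"}` the property fails, because node 4 is labeled a and has no descendant; relabeling node 4 as b makes it hold.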
Bridge
Theorem: for a set L of trees, the following are equivalent
1. L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata
2. L = {T | T satisfies $\varphi$} for some MSO formula $\varphi$, i.e. L is definable in MSO
DTD
• Describe the children of a node with label a by a regular expression
• Syntax:
<!ELEMENT populationdata (continent*) >
<!ELEMENT continent (name, country*) >
<!ELEMENT country (name, province*)>
<!ELEMENT province (name, city*) >
<!ELEMENT city (name, pop) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT pop (#PCDATA) >
DTD and determinism
• Regular expressions in DTD should be deterministic
  – Complicated definition
• Intuition: the corresponding automaton should be deterministic
  – (a+b)*a is not
  – When reading <a>, one cannot tell whether it is an a from (a+b) or the final a
  – (b*a)(b*a)* is an equivalent expression that is deterministic
Very efficient validation
• It suffices to verify, for each node labeled a, that the word formed by the labels of its children is accepted by the finite state automaton $A_a$
• Possible to type check the document while scanning it, e.g. with a SAX parser
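A minimal sketch of this check in Python, using `xml.etree.ElementTree` from the standard library and one regular expression per element name in place of the automaton $A_a$. The content models are those of the population DTD shown earlier; the encoding of a child word as tags separated by spaces and the names `MODELS` and `valid` are assumptions. Text content is not checked.

```python
import re
import xml.etree.ElementTree as ET

# Content models as regular expressions over child tags; each tag is
# followed by one space. #PCDATA elements have no element children.
MODELS = {
    "populationdata": r"(continent )*",
    "continent": r"name (country )*",
    "country": r"name (province )*",
    "province": r"name (city )*",
    "city": r"name pop ",
    "name": r"",
    "pop": r"",
}


def valid(element):
    """Check the children word of every node against its content model."""
    word = "".join(child.tag + " " for child in element)
    model = MODELS.get(element.tag)
    if model is None or not re.fullmatch(model, word):
        return False
    return all(valid(child) for child in element)


GOOD = ET.fromstring(
    "<populationdata><continent><name>Europe</name>"
    "<country><name>France</name></country></continent></populationdata>"
)
BAD = ET.fromstring(
    "<populationdata><continent>"
    "<country><name>France</name></country></continent></populationdata>"
)
```

This version loads the whole document first; a streaming variant would run the same per-element checks from SAX start and end events, keeping one automaton state per open element.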
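Very efficient validation (2)
<!ELEMENT a (b, c) >
<!ELEMENT b (d+) >
[Figure: the tree a(b(d, d), c); the automaton $A_a$ with transitions s -b-> t -c-> u; the automaton $A_b$ with transitions s' -d-> t' and a d-loop on t']
Document: <a><b><d/><d/></b><c/></a>
[Figure: while the document is scanned, the run visits the states s, s', t', t, and u, and ends with Accept]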
Warning
• The previous example can be checked with a simple automaton on words
• But not the following one: <!ELEMENT part (part*) >
• A stack is needed to accept <a>…<a></a>…</a>, that is, n opening tags <a> followed by n closing tags </a>
Some bad news for DTD
• Not closed under union
DTD1 …
<!ELEMENT used (ad*) >
<!ELEMENT ad (year, brand) >
DTD2 …
<!ELEMENT new (ad*) >
<!ELEMENT ad (brand) >
• L(DTD1) ∪ L(DTD2) cannot be described by a DTD but can easily be described by a tree automaton
  – Problem: the type of ad depends on its parent
• Also not closed under complement
• Limited expressive power
Car example continued
• The best DTD we can choose does not distinguish between ads for used and new cars
  – <!ELEMENT ad (year?, brand) >
[Figure: tree with root Car and children Used and New; the ad under Used has Brand "Renault" and Year "2008", the ad under New has Brand "BMW"]
Decoupled types in XML schema
• Each type corresponds to a label, but not conversely
car: [car] (used + new)*
used: [used] (ad1*)
new: [new] (ad2*)
ad1: [ad] (year, brand)
ad2: [ad] (brand)
• Tags are shown in brackets, type names before the colon (green and blue on the original slide)
• Nice closure properties
• Many other « gadgets » in XML schemas
XML Schema
• Often criticized: unnecessarily complicated
• Boosted by Web services
• Richer than DTD: decoupled types
• Close to deterministic top-down tree automata
• XML schemas are extensible
• Many other useful functionalities
  – Namespaces
  – Atomic types
  – Integrity constraints, etc.
An XML schema is an XML document
• Since it uses XML syntax, it can be processed with XML tools
  – Editor
  – Type checker
  – Etc.
• The type of all XML schemas can be described with an XML schema
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.net-language.com">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="character" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="friend-of" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="since" type="xs:date"/>
              <xs:element name="qualification" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
Simple elements and atomic types
Definition: <xs:element name="xxx" type="yyy"/>
with common types: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time
Examples
<xs:element name="lastname" type="xs:string"/>
<xs:element name="age" type="xs:integer"/>
<xs:element name="dateborn" type="xs:date"/>
Instances of such elements
<lastname>Refsnes</lastname>
<age>34</age>
<dateborn>1968-03-27</dateborn>
Attributes
Definition: <xs:attribute name="xxx" type="yyy"/>
Example
<xs:attribute name="lang" type="xs:string"/>
Instance of such an attribute
<lastname lang="EN">Smith</lastname>
Complex elements
• Empty element
<product pid="1345"/>
• Contains only other elements
<employee>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
</employee>
• Contains only text
<food type="dessert">Ice cream</food>
• Contains both elements and text
<description> It happened on <date lang="norwegian"> 03.03.99</date> .... </description>
Restriction of simple elements
<xs:element name="age">
  <xs:simpleType>
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
Other restrictions: enumerated types, patterns, etc.
Restriction on complex elements
<xs:element name="person">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Possible to name a type
<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Only the "employee" element can use the specified complex type (<sequence> indicates an order on child elements)
Alternative
<xs:element name="employee" type="personinfo"/>
<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
Other gadgets
• Import of types associated with a namespace
  – <import namespace="http:// ..." schemaLocation="http:// ..."/>
• Possible to include an existing schema
  – <include schemaLocation="http:// ..."/>
• Possible to extend/redefine an existing schema
  – <redefine schemaLocation="http:// ..."> .... Extensions ... </redefine>
Example: a DTD
<!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
<!ATTLIST EMAIL
LANGUAGE (Western|Greek|Latin|Universal) "Western"
ENCRYPTED CDATA #IMPLIED
PRIORITY (NORMAL|LOW|HIGH) "NORMAL">
<!ELEMENT TO (#PCDATA)>
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT CC (#PCDATA)>
<!ELEMENT BCC (#PCDATA)>
<!ATTLIST BCC
HIDDEN CDATA #FIXED "TRUE">
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!ENTITY SIGNATURE "Bill">
The same in a variant of XML schema (more verbose)
<?xml version="1.0" ?>
<Schema name="email" xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<AttributeType name="language"
dt:type="enumeration" dt:values="Western Greek Latin Universal" />
<AttributeType name="encrypted" />
<AttributeType name="priority" dt:type="enumeration" dt:values="NORMAL LOW HIGH" />
<AttributeType name="hidden" default="true" />
<ElementType name="to" content="textOnly" />
<ElementType name="from" content="textOnly" />
<ElementType name="cc" content="textOnly" />
<ElementType name="bcc" content="mixed">
<attribute type="hidden" required="yes" />
</ElementType>
<ElementType name="subject" content="textOnly" />
<ElementType name="body" content="textOnly" />
<ElementType name="email" content="eltOnly">
<attribute type="language" default="Western" />
<attribute type="encrypted" />
<attribute type="priority" default="NORMAL" />
<element type="to" minOccurs="1" maxOccurs="*" />
<element type="from" minOccurs="1" maxOccurs="1" />
<element type="cc" minOccurs="0" maxOccurs="*" />
<element type="bcc" minOccurs="0" maxOccurs="*" />
<element type="subject" minOccurs="0" maxOccurs="1" />
<element type="body" minOccurs="0" maxOccurs="1" />
</ElementType>
</Schema>
Where to place XML schemas
• Some bizarre restriction
  – Inside an element, no two types with the same tag
• Closer to DTDs than to tree automata
• Efficient type validation
[Figure: nested regions labeled Tree automata, Deterministic top-down tree automata, XML schema, and DTD]
Exercise: coupled vs decoupled
• Write a realistic DTD1 for new cars
  – With make, model, engine…
• Write a realistic DTD2 for used cars
  – Also year, miles, zipcode
• Write an XML schema for L(DTD1) ∪ L(DTD2)
  – Using a decoupled schema
Another use of automata: XPATH $x in //a/b
Example: //a/b
[Figure, animated over several slides: evaluation of //a/b on a tree with nodes labeled a and b. The NFA for //a/b is turned into a DFA with states (0), (01), and (02). During the traversal of the tree a stack of DFA states is maintained, for instance (0)(01)(01)(02); a state is pushed when a node is entered and popped when it is left. Each node reached in state (02) is bound to $x.]
Determinization: exponential blow up
//a/*/*/b
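The blow-up can be observed with a small subset construction. The sketch below assumes an NFA for the path pattern //a followed by k wildcard steps and a final b step, read over the alphabet {a, b}; the state numbering and the function name `dfa_size` are assumptions made for illustration.

```python
def dfa_size(k, alphabet="ab"):
    """Count the reachable DFA states for //a/*/.../*/b with k wildcards.

    NFA states: 0 (initial, loops on every label), 1 (after the a step),
    2..k+1 (after each wildcard step), k+2 (after the final b step).
    """
    last = k + 2

    def step(states, label):
        nxt = set()
        for q in states:
            if q == 0:
                nxt.add(0)            # '//' : stay at any depth
                if label == "a":
                    nxt.add(1)        # matched the a step
            elif q <= k:
                nxt.add(q + 1)        # '*' : any label
            elif q == k + 1 and label == "b":
                nxt.add(last)         # matched the final b step
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        current = todo.pop()
        for label in alphabet:
            target = step(current, label)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)


if __name__ == "__main__":
    for k in range(6):
        print(k, dfa_size(k))
```

For k = 0 the construction yields the three states (0), (01), and (02) used in the //a/b example; each additional wildcard forces the DFA to remember one more position at which an a was seen.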
Proposal: k-pebble transducers
[Figure: k-pebble transducer, with the label "stack"]
[Milo, Suciu, Vianu]
k-pebble transducers: result
[Figure: tree with root and nodes labeled a, b, and c]
Capture a core aspect of XQuery but not the data management part
Graph
• Graph semistructured data
• Graph simulation
• Graph bisimulation
• Data guides
Semistructured data = Labeled graph
• Possibly a root (in red)
[Figure: labeled graph with root &r, company node &c, and nodes &p1 to &p8; edge labels company, employee, worksfor, manages, managedby]
Rooted graph
• OEM = Object Exchange Model
• With ID-IDREF, XML is a graph model as well
• Labeled (rooted) graph (E,r)
  – Set N of nodes
  – A finite ternary relation $E \subseteq N \times N \times Label$
  – E(s,t,l) = there is an edge from s to t labeled l
  – r is a node in the graph
Equality revisited
• {1,2,2,1,5} = {1,2,5}
  – Ignores the order
• For trees, if we ignore the order of siblings and use a “set” semantics:
[Figure: an equality between trees over the labels a, b, c, d that differ only in sibling order and in duplicated subtrees]
Simulation
A simulation of (E,r) with (E',r') is a relation R between the nodes of E and E' such that
1. R(r,r')
2. if R(s,s') and E(s,t,l) for some l, then there exists t' with R(t,t') and E'(s',t',l)
(we simulate a move in E by a move in E')
Bisimulation
Given E, E' and a relation R, R is a bisimulation if R is a simulation of E with E' and $R^{-1}$ is a simulation of E' with E
Examples
[Figure: three rooted graphs G, G', G'' with edges labeled a and d. The figure marks one pair of graphs as related by a bisimulation and another pair as not related by a bisimulation.]
They all have the same paths from the root
A more complex example of graph bisimulation
[Figure, animated over several slides under the title "Graph bisimulation": a data graph with root, employees e1 to e4, projects p1 to p9, and nodes c1 and c2, with edge labels employee, project, programmer, statistician, leads, workson, and consults, and string values "exercise", "lecture", "finance", "adminstr.", "PR", "undergrad", "grad", "postgrad", "web". Next to it, a type graph with nodes t1, t2, and STRING_, with edge labels employee, projects, and programmer | statistician. The relation R between data nodes and type nodes is built step by step.]
Computing bisimulation in ptime
• Start with R = N × N' (for N, N' the sets of nodes)
• While there exists (x,x') in R that violates the definition of bisimulation, remove (x,x') from R
• This computes the maximal bisimulation in ptime
(Note: this maximal bisimulation exists because ∅ is a bisimulation, and if R1, R2 are bisimulations, R1 ∪ R2 is also one)
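The refinement loop can be sketched in Python as follows. Graphs are given as sets of (source, target, label) triples; this representation, the function name `max_bisimulation`, and the sample graphs are assumptions for illustration, and the root condition is left to the caller.

```python
def max_bisimulation(edges1, edges2):
    """Largest relation in which related nodes can mimic each other's moves."""
    nodes1 = {n for (s, t, _) in edges1 for n in (s, t)}
    nodes2 = {n for (s, t, _) in edges2 for n in (s, t)}
    rel = {(x, y) for x in nodes1 for y in nodes2}  # start with N x N'

    def ok(x, y):
        # Every move of x must be matched by a move of y with the same label.
        forth = all(
            any((t, t2) in rel
                for (s2, t2, l2) in edges2 if s2 == y and l2 == l)
            for (s, t, l) in edges1 if s == x
        )
        # And conversely, every move of y must be matched by a move of x.
        back = all(
            any((t, t2) in rel
                for (s, t, l) in edges1 if s == x and l == l2)
            for (s2, t2, l2) in edges2 if s2 == y
        )
        return forth and back

    changed = True
    while changed:
        changed = False
        for pair in list(rel):
            if not ok(*pair):
                rel.discard(pair)  # the pair violates the definition
                changed = True
    return rel


# Hypothetical graphs with edges labeled a and d.
G1 = {("r", "x", "a"), ("x", "y", "d")}
G2 = {("r2", "x1", "a"), ("r2", "x2", "a"),
      ("x1", "y1", "d"), ("x2", "y2", "d")}
G3 = {("r3", "u", "a"), ("r3", "v", "a"), ("u", "w", "d")}
```

Two rooted graphs are bisimilar when the pair of their roots survives the loop. At most |N| · |N'| pairs can be removed and each pass is polynomial, which gives the ptime bound.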
What does this have to do with typing?
• Take a very complex graph E
• How do you describe it?
• By a “smaller” graph T that is in bisimulation with E
• There may be several bisimulations, with more and more detail
Rough bisimulation
• Root: &r
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c
[Figure: graph over these classes with edges labeled company, employee, manages, managedby, worksfor]
More precise one
• Root: &r
• Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c
[Figure: graph over these classes with edges labeled company, employee, manages, managedby, worksfor]
Other “typing”: data guide
• See the graph as an automaton with the root as the start state and all states accepting
• This automaton accepts all the paths from the root
• Obtain an equivalent, minimal, deterministic automaton
  – This is the data guide for the graph
  – It can be used for describing the data
  – It can be used to support graphical query interfaces
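The determinization step can be sketched as a subset construction over the data graph. Edges are (source, target, label) triples; the function name `data_guide` and the sample graph are assumptions. The result is deterministic but not necessarily minimal, so a minimization pass would still be needed to obtain the data guide as defined above.

```python
def data_guide(edges, root):
    """Map each reachable set of nodes to its outgoing labeled transitions."""
    start = frozenset({root})
    guide, todo = {}, [start]
    while todo:
        current = todo.pop()
        if current in guide:
            continue
        # Group the targets of all edges leaving the current set by label.
        by_label = {}
        for (source, target, label) in edges:
            if source in current:
                by_label.setdefault(label, set()).add(target)
        guide[current] = {l: frozenset(ts) for l, ts in by_label.items()}
        todo.extend(guide[current].values())
    return guide


# Hypothetical fragment of an employee/project graph.
EDGES = {
    ("root", "e1", "employee"), ("root", "e2", "employee"),
    ("e1", "p1", "workson"), ("e2", "p1", "workson"),
    ("e2", "p2", "leads"),
}
GUIDE = data_guide(EDGES, "root")
```

Each label path from the root leads to exactly one state of the guide, namely the set of data nodes reachable by that path.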
Data guide
[Figure: the data graph (root; employees e1 to e4; projects p1 to p9; nodes c1 and c2; string values "exercise", "lecture", "finance", "adminstr.", "PR", "undergrad", "grad", "postgrad", "web") next to its data guide]
Data guide states reached from {root}:
• programmer: {c1}
• statistician: {c2}
• project: {p1,p2,p3,p4,p5,p6,p7,p8,p9}
• employee: {e1,e2,e3,e4}
Further states, reached through workson, leads, consults, and employee edges: {p1,p3}, {p2,p4}, {p1,p3,p5,p7}, {p4,p6}, {p4}, {e1,e2}, {e2,e3}, {p1,p3,p5,p7,p9}, {p2,p4,p6,p8}, {p4,p9}
• Gives all the paths from the root
• Automata minimization
What you should remember
• Tree automata = theoretical foundation for XML
• Bottom-up tree automata are nice
• Top-down and determinism together: limitations
• XML documents do not have to be typed
• Typing may be very useful for XML
  – In particular for software managing XML data
• DTD: simple but limited
• XML Schema: more expressive but still limited
• Graph data: bisimulation is the answer