A Type System for a Semistructured and XML Data Base Management System
Ph. D. Thesis Proposal
Dario Colazzo
Thesis Goals
- Formal development and study of a type system for XML querying
- Implementation of a concrete type system for an XML database management system: the Xtasy system
Presentation outline
- Semistructured data and XML
- Data models
- Type languages: DTD, XML Schema
- Querying XML data: Tequyla
- Processing XML data: XDuce
- Thesis goals
Semistructured data
- Irregular and unstable structure
- Self-describing representation
- No separate schema information: few guarantees of reliability and efficiency of applications
OEM graph
[Figure: OEM graph rooted at addrbook with two person nodes. Edge labels: name, addr, age, email, first, second. Leaf values: "Dario Colazzo", "Pisa", 30, "Carlo", "Sartiani".]
XML syntax
<addrbook>
  <person>
    <name>Dario Colazzo</name>
    <addr>Pisa</addr>
  </person>
  <person>
    <name>
      <first>Carlo</first>
      <second>Sartiani</second>
    </name>
    <addr>Pisa</addr>
    <email>[email protected]</email>
  </person>
</addrbook>
Attributes and element reference
<db>
  <state id="01">
    <name>Italy</name>
    <code>IT</code>
  </state>
  ...
  <city region="Toscana" state-of="01">
    <name>Pisa</name>
    <code>PI</code>
  </city>
</db>
XML Query Data Model
- Based on forests of node-labeled trees (a set of documents)
- Several kinds of nodes: element node, attribute node, value node
- Identifier and reference attributes are modeled as general attributes
XML Tree
[Figure: the addrbook data as an XML tree with two person subtrees (name, first, second, addr, age, email). Legend: element node, attribute node, value node.]
XML schema languages
- Document Type Declarations (DTD): schemas as grammars for documents, built from regular type expressions
- XML Schema: closer to traditional type languages
DTD
- Regular type expressions:
    T | U      union
    T, U       sequence
    T*         zero or more
    T?         zero or one
    X = T[X]   recursive definitions
- coupled-tag element declarations
- global definitions
- only one base type: string (PCDATA)
- no type reuse
DTD, example
<!DOCTYPE addrbook [
  <!ELEMENT addrbook (person*)>
  <!ELEMENT person (name, addr, tel?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT addr (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
]>
Here person* means zero or more person elements, and tel? means zero or one tel element.
XML Schema
- decoupled-tag: elements and types may be defined separately
- local definitions
- base types: integers, string, decimal, ...
- type reuse: type refinement, and type extension with subtyping
XML Schema, example
<xsd:complexType name="person">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="age" type="xsd:ageType"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:complexType name="newPerson" base="typeOfPerson" derivedBy="extension">
  <xsd:element name="car" type="xsd:string"/>
</xsd:complexType>
Querying XML data
- XML querying is based on the use of patterns to select portions of a document
- Untyped query languages: XQL, XML-QL, Quilt
- Typed: Tequyla, XDuce (functional language)
- Forthcoming W3C query language: probably Quilt
Tequyla
- SQL-like query language
- free nesting of queries
- typed: query correctness, query typing
- Currently: only non-algorithmic definitions, and weak subtyping
Tequyla queries
- The body of a Tequyla query is a from clause composed of XPath patterns
- x=addressbook.xml;
    binds x to the root element of addressbook.xml
- y in x//person/addr
    starting from the root (x), searches for a person element at an arbitrary depth (//), then for an addr sub-element (/), and finally binds the node found to y
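The two bindings above can be tried out with any library that supports XPath-like paths. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, which implements a limited XPath subset. It is not Tequyla itself, and the inline document (a cut-down stand-in for addressbook.xml) and the helper `text_of` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample content standing in for addressbook.xml.
ADDRESSBOOK = """
<addrbook>
  <person><name>Dario Colazzo</name><addr>Pisa</addr></person>
  <person>
    <name><first>Carlo</first><second>Sartiani</second></name>
    <addr>Pisa</addr>
  </person>
</addrbook>
"""

# x = addressbook.xml : bind x to the root element of the document.
x = ET.fromstring(ADDRESSBOOK)

# y in x//person/addr : a person element at any depth below x,
# then an addr child of that person.
ys = x.findall(".//person/addr")


def text_of(node):
    """Join the non-blank text fragments found anywhere below a node."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


addresses = [text_of(y) for y in ys]
print(addresses)
```

Each element of `ys` plays the role of one binding of y; a where clause would then filter these bindings.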
A Tequyla query
Q = from x=addressbook.xml;
         y in x//person/addr;
         z in x//person/name;
    where y="Pisa"
    select nome[z]
(the patterns in the from clause are XPath expressions)
XDuce
- Typed functional language
- Regular expression types
- Type-based pattern language
XDuce schema
A schema is a set of type definitions:
E = {
  Addressbook = addrbook[(Name, Addr, Tel?)*]
  Name = name[String]
  Addr = addr[String]
  Tel = tel[String]
}
An XDuce function: telephone list
Consider T = (Name, Addr, Tel?) in
fun mkTelList : T* --> (Name, Tel)* =
    name[n], addr[a], tel[t], rest:T* --> name[n], tel[t], mkTelList(rest)
  | name[n], addr[a], rest:T* --> mkTelList(rest)
  | () --> ()
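For readers unfamiliar with XDuce pattern matching, the same recursion can be approximated in Python. This is only a sketch: the `Entry` record and the use of `None` for a missing Tel are assumptions introduced here, and Python offers none of the static guarantees of XDuce's regular expression types.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    """One (Name, Addr, Tel?) group; tel is None when the Tel is absent."""
    name: str
    addr: str
    tel: Optional[str] = None


def mk_tel_list(entries: List[Entry]) -> List[Tuple[str, str]]:
    # () --> ()
    if not entries:
        return []
    head, *rest = entries
    # name[n], addr[a], tel[t], rest --> name[n], tel[t], mkTelList(rest)
    if head.tel is not None:
        return [(head.name, head.tel)] + mk_tel_list(rest)
    # name[n], addr[a], rest --> mkTelList(rest)
    return mk_tel_list(rest)


book = [Entry("Dario", "Pisa", "050-1234"), Entry("Carlo", "Pisa")]
print(mk_tel_list(book))
```

Entries without a telephone number are dropped, which is why the result type in the XDuce version is (Name, Tel)* rather than (Name, Tel?)*.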
XDuce subtyping: language inclusion
- XDuce provides a simple but rather powerful notion of subtyping based on inclusion between sets of values
- Examples:
    Name, Addr <: Name, Addr, Tel?
    Name, Addr, Tel <: Name, Addr, Tel?
- XML Schema extension subtyping is not captured
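Subtyping as language inclusion can be illustrated with ordinary regular expressions. The sketch below is a bounded test, not the decision procedure used by XDuce: it abbreviates Name, Addr and Tel to the letters n, a and t (an assumed encoding) and only checks words up to a fixed length.

```python
import re
from itertools import product


def included(sub: str, sup: str, alphabet: str = "nat", max_len: int = 4) -> bool:
    """Bounded check that every word matched by `sub` is matched by `sup`."""
    for length in range(max_len + 1):
        for word in map("".join, product(alphabet, repeat=length)):
            if re.fullmatch(sub, word) and not re.fullmatch(sup, word):
                return False  # `word` is a counterexample to inclusion
    return True


# Name, Addr <: Name, Addr, Tel?
print(included("na", "nat?"))
# Name, Addr, Tel <: Name, Addr, Tel?
print(included("nat", "nat?"))
# The converse fails: "nat" belongs to the right-hand type only.
print(included("nat?", "na"))
```

A complete check would compare automata rather than enumerate words; the enumeration is used here only because it is short and easy to follow.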
Xtasy type system
Type language
- As expressive as DTD and XML Schema
- Base types
- Attributes and id/idref types
- Type refinement and extension
- Local type definitions
- Unordered sequence types
Schema extraction and schema inference
- For untyped data, a schema will be inferred according to the XML Schema style
- For typed XML data, the schema will be converted into the internal schema representation
- Type inference for query results
Data conformity
- An algorithm will be defined to check data conformity to a schema
- The problem is EXPTIME-complete
- Optimization techniques exist
- Further ones have to be found to deal with unordered sequence types and id/idref types
Query correctness
- Only type-correct queries will be executed
- Type correctness is based on a successful match between the structural requirements of the query and the type of the data to be queried
Correct queries, an example (1/2)
Consider
E = {
  Addressbook = addrbook[Person*]
  Person = person[Name, Addr, Tel?]
  Name = name[String]
  Addr = addr[String]
  Tel = tel[String]
}
Correct queries, an example (2/2)
A correct query:
Q = from x=addressbook.xml;
         y in x//person/addr;
         z in x//person/name;
    where y="Pisa"
    select nome[z]
Correctness & union types
Consider:
Q' = from x=addressbook.xml;
          y in x//person/addr;
          z in x//person/tel;
     where y="Pisa"
     select results[z]
Should we consider this query correct?
Correctness & union types: existential approach
- The previous query is considered correct
- The user will be warned about optional elements required by patterns
Total approach
- The previous query is considered not correct
- Too severe a discipline
- A lot of queries with non-empty results would be cut off
Type equivalences
- Several type equivalence laws will be considered
- In particular: (T | U), S = (T, S) | (U, S)
- Useful to simplify schema definitions
Subtyping
- A subtype relation E <: E' will be defined such that: if a query Q is correct wrt E' then it is also correct wrt E
- Type extension will be supported: if E is an extension of E' then E <: E'
Parametric polymorphism (1/3)
Used in some functional languages (e.g. ML and Haskell) to define generic functions, for example:
function Sort (t: Type; L: List t; Ord: t × t → Bool): List t
begin ... end.
It will allow us to define generic queries
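The signature of Sort can be mirrored with Python type variables. This is a sketch under stated assumptions: the insertion-sort body is chosen here only to make the example runnable (the slide elides the body), and `ord_` is assumed to return True when its first argument may precede the second.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")  # plays the role of the type parameter t


def sort(items: List[T], ord_: Callable[[T, T], bool]) -> List[T]:
    """Generic insertion sort: works for any element type T."""
    result: List[T] = []
    for item in items:
        i = 0
        # Skip every element that may precede the new item.
        while i < len(result) and ord_(result[i], item):
            i += 1
        result.insert(i, item)
    return result


print(sort([3, 1, 2], lambda a, b: a <= b))
print(sort(["pisa", "bari"], lambda a, b: a <= b))
```

The same function body serves integers and strings because nothing in it depends on the element type, which is the property the slides want to carry over to generic queries.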
Parametric polymorphism (2/3)
Parametric types fit well the description of irregular data structures
For example
E(t) = {
  Addressbook = addrbook[(Name, Addr, Tel?)*]
  Name = name[String]
  Addr = addr[t]
  Tel = tel[String]
}
The content of addr elements can have, for example, a street and a city sub-element
Parametric polymorphism (3/3)
A generic query:
Q = t: Type; a: E(t) .
    from x=a;
         y in x//person/addr;
         z in x//person/name;
    where z="dario"
    select indirizzo[y]
More precise typing: the type Any* is different from t*
Conclusions
The type system will provide:
- union types
- reference types
- recursive types
- subtyping
- parametric polymorphism
Progress report
Presentation outline
- Proposal
- What has been done
- Ongoing and future work
Thesis Goals
- Formal development and study of a type system for XML querying
- The query language is an abstract version of XQuery (W3C)
- The type language is expressive enough to capture the essence of current standards
XQuery type system
- Only result analysis: the XQuery type system is defined to determine and check, at query-analysis time, the output type of a query on documents conforming to an expected input type.
- Query correctness is neither defined nor checked (only some ideas).
What has been done
We have:
- formally defined the notion of query type correctness
- defined a type system to statically check it and to perform result analysis; the rules define a terminating algorithm
- introduced an approach to recursive types that is an alternative to the XQuery one
Observations
- Our type system also performs query analysis and, in this respect, differs from the XQuery approach
- So far, we have considered a type system featuring product, union and recursive types
- We have found that these type mechanisms are sufficient to make the study interesting and (as we will see) rather subtle.
Observations
- We have discovered that for particular queries (fortunately infrequent ones) the type system is not able to capture exactly the semantic characterization of correctness
- We have introduced a further notion of correctness, path covering, and provided rules to check this property
Papers
- A first definition of the type system can be found in "A Typed Text Retrieval Query Language for XML Documents", Journal of the American Society for Information Science and Technology (JASIS), Special Issue, 2001
- In "Types for Correctness of Queries over Semistructured Data", the system has been improved by a finer notion of query correctness and by the notion of path covering. The work will be submitted to the WebDB 2002 workshop
Tequyla (or µXQuery)
- SQL-like query language
- free nesting of queries
- typed: type conformance of data, query correctness, query typing (result analysis)
Tequyla queries
- The body of a Tequyla query is a from clause composed of XPath patterns
- x=addressbook.xml;
    binds x to the root element of addressbook.xml
- y in x//person/addr
    starting from the root (x), searches for a person element at an arbitrary depth (//), then for an addr sub-element (/), and finally binds the node found to y
Types
T, U ::= ()       empty sequence
       | B        atomic type (char, int, ...)
       | T + U    union
       | T; U     sequence
       | l[T]     element type
       | X        type name
Type environments: type definitions + type bindings for the free variables of a query
E ::= ()
    | X = T, E
    | x: X, E
A type environment
E =
  Addressbook = addrbook[Person*],
  Person = person[Name, Addr, (Tel + EMail)],
  Name = name[String],
  Addr = addr[String],
  Tel = tel[String],
  EMail = email[String],
  x: Addressbook
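To make the type grammar and the environment above concrete, here is a minimal sketch of a conformance check, i.e. a test that a forest belongs to a type. The tuple encoding of types and forests is an assumption made for this example, the `star` case is added because the environment uses Person* even though the grammar does not list it, and the naive split enumeration is exponential. It is not the algorithm developed in the thesis, and it assumes recursion through type names is guarded by an element constructor.

```python
# Assumed encoding of types:
#   ("empty",)  ("base", python_type)  ("union", T, U)  ("seq", T, U)
#   ("elem", label, T)  ("name", X)  ("star", T)
# A forest is a list whose items are atomic values or (label, forest) pairs.


def conforms(forest, t, env):
    """Return True if `forest` is a value of type `t` under definitions `env`."""
    kind = t[0]
    if kind == "empty":
        return forest == []
    if kind == "base":
        return len(forest) == 1 and isinstance(forest[0], t[1])
    if kind == "union":
        return conforms(forest, t[1], env) or conforms(forest, t[2], env)
    if kind == "seq":
        # Try every way of splitting the forest into a prefix and a suffix.
        return any(
            conforms(forest[:i], t[1], env) and conforms(forest[i:], t[2], env)
            for i in range(len(forest) + 1)
        )
    if kind == "star":
        if not forest:
            return True
        # A non-empty first chunk guarantees termination.
        return any(
            conforms(forest[:i], t[1], env) and conforms(forest[i:], t, env)
            for i in range(1, len(forest) + 1)
        )
    if kind == "elem":
        if len(forest) != 1 or not isinstance(forest[0], tuple):
            return False
        label, children = forest[0]
        return label == t[1] and conforms(children, t[2], env)
    if kind == "name":
        return conforms(forest, env[t[1]], env)
    raise ValueError(f"unknown type constructor: {kind}")


def seq(*types):
    """Right-nested sequence T1; T2; ...; Tn."""
    result = types[-1]
    for t in reversed(types[:-1]):
        result = ("seq", t, result)
    return result


STRING = ("base", str)
ENV = {
    "Addressbook": ("elem", "addrbook", ("star", ("name", "Person"))),
    "Person": ("elem", "person", seq(
        ("name", "Name"), ("name", "Addr"),
        ("union", ("name", "Tel"), ("name", "EMail")))),
    "Name": ("elem", "name", STRING),
    "Addr": ("elem", "addr", STRING),
    "Tel": ("elem", "tel", STRING),
    "EMail": ("elem", "email", STRING),
}

good = [("addrbook", [("person", [
    ("name", ["Dario Colazzo"]), ("addr", ["Pisa"]), ("tel", ["050"])])])]
bad = [("addrbook", [("person", [
    ("name", ["Dario Colazzo"]), ("addr", ["Pisa"])])])]

print(conforms(good, ("name", "Addressbook"), ENV))
print(conforms(bad, ("name", "Addressbook"), ENV))
```

The second document is rejected because Person requires either a tel or an email element after addr.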
A correct query
Q ::=
  from y in x//person/addr;
       z in x//person/name;
  where y="Pisa"
  select nome[z]
(the patterns in the from clause are XPath expressions)
An incorrect query
Q ::=
  from x=addressbook.xml;
       y in x//person/address;
       z in x/name;
  where y="Pisa"
  select nome[z]
Queries:
Q1, Q2 ::= ()
         | VB
         | l[Q]
         | Q1; Q2
         | from x=Q1 select Q2
         | from x in Q1 select Q2
         | x
         | Q p
Observe: no where clauses.
Some notation
Given s = {x1 = f1, ..., xn = fn}:
- s :: E means that xi = fi ∈ s iff xi: T ∈ E and fi ∈ T
- E |-- Q means that each free variable x in Q is typed in E (x: T ∈ E)
Definition of correctness: first step
Given a query Q, a schema E for its free variables, and s :: E, either
1. [[Q]]s = <f, F>, or
2. [[Q]]s = <f, NF>
Essentially, in s, Q correctly returns a forest f (case 1) if, for each subquery Q' p of Q, the path p finds a match in the forest returned by Q'
Query correctness
Given a query Q and E such that E |-- Q:
- Q is strongly correct iff for each s :: E, [[Q]]s = <f, F>
- Q is weakly correct iff there exists s :: E such that [[Q]]s = <f, F>
- Q is incorrect iff for each s :: E, [[Q]]s = <f, NF>
Example: strongly correct query
Consider the type environment
  X = a[Y], Y = b[Int] + c[Int], x: X
and the query
  x(/b+/c)
Example: weakly correct query
Consider the query
  x/b
Only some instances of type X contain the path /b
  X = a[Y], Y = b[Int] + c[Int], x: X
Example: incorrect query
Consider the query
  x/d
No instance of type X contains the path /d
  X = a[Y], Y = b[Int] + c[Int], x: X
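The three examples can be reproduced by brute force, following the semantic definitions directly rather than the type rules. The sketch below is an illustration only: it enumerates the instances of X by hand, abstracts every Int to the single value 0, and represents a path step as a set of alternative labels. All of these are assumptions of this example, and the thesis type system works on types, not on enumerated instances.

```python
# Instances of X = a[Y], Y = b[Int] + c[Int], with Int abstracted to 0.
INSTANCES_OF_X = [
    ("a", [("b", [0])]),
    ("a", [("c", [0])]),
]


def step_matches(tree, alternatives):
    """True if some child of the root carries one of the alternative labels."""
    _, children = tree
    return any(isinstance(c, tuple) and c[0] in alternatives for c in children)


def classify(alternatives, instances=INSTANCES_OF_X):
    """Classify the query x(/l1+...+/ln) over all bindings of x."""
    outcomes = [step_matches(tree, alternatives) for tree in instances]
    if all(outcomes):
        return "strongly correct"  # the path matches in every instance
    if any(outcomes):
        return "weakly correct"    # the path matches in some instance
    return "incorrect"             # the path matches in no instance


print(classify({"b", "c"}))  # x(/b+/c)
print(classify({"b"}))       # x/b
print(classify({"d"}))       # x/d
```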
Type system
To check correctness and to infer the type of query results we have defined a set of rules that:
- define an algorithm: determinism + termination
- in some cases (// + guarded recursion) deal with recursion differently from the XQuery type system
- infer context-free types
- do not rely on any notion of type inclusion: only matching between paths and types
Some properties
Given E |-- Q, if the system returns
  E |-- Q : <T, θ> with θ ∈ {s, w, i}
then
- [[Q]] ⊆ [[T]], and
- if θ = s (respectively θ = i) then Q is strongly correct (respectively incorrect)
- if θ = w then in most cases Q is weakly correct, but in some cases Q is strongly correct or, even worse, incorrect
Weak correctness problem (1)
Unsoundness for the case θ = w (and incorrect queries) is due to particular queries where two different path expressions start from the same root (x) and traverse two "disjoint" branches
Example:
  x/b; x/c
where
  x: X, X = a[Y], Y = b[Int] + c[Int]
Observations
Observe that the problem does not arise for
  x/b; x/b   or   x/b; y/c
where
  x: X, y: X, X = a[Y], Y = b[Int] + c[Int]
Both queries are weakly correct, as inferred by the type system
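The difference between the three queries becomes visible when every substitution for the free variables is enumerated and all paths of the query are required to match at once. The sketch below does exactly that. It is a semantic illustration under assumptions (hand-enumerated instances of X, Int abstracted to 0, one-step paths written as (variable, label) pairs), not the rule system of the thesis.

```python
from itertools import product

# Instances of X = a[Y], Y = b[Int] + c[Int], with Int abstracted to 0.
INSTANCES_OF_X = [
    ("a", [("b", [0])]),
    ("a", [("c", [0])]),
]


def has_child(tree, label):
    """True if the root of `tree` has a child element with this label."""
    _, children = tree
    return any(isinstance(c, tuple) and c[0] == label for c in children)


def classify(paths, variables):
    """paths: list of (variable, label); every variable has type X."""
    outcomes = []
    for combo in product(INSTANCES_OF_X, repeat=len(variables)):
        s = dict(zip(variables, combo))  # one substitution s :: E
        outcomes.append(all(has_child(s[v], label) for v, label in paths))
    if all(outcomes):
        return "strongly correct"
    if any(outcomes):
        return "weakly correct"
    return "incorrect"


print(classify([("x", "b"), ("x", "c")], ["x"]))       # x/b; x/c
print(classify([("x", "b"), ("x", "b")], ["x"]))       # x/b; x/b
print(classify([("x", "b"), ("y", "c")], ["x", "y"]))  # x/b; y/c
```

For x/b; x/c no single binding of x satisfies both paths, so the query is incorrect even though each path taken alone is weakly correct; this is the case in which an answer of w is unsound.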
Strong correctness problem
Consider the strongly correct query
  x(/b+/c)
where
  x: X, X = a[Y], Y = b[Int] + c[Int]
In this case the type system infers: <b[Int]? + c[Int]?, w>
Solution
- We have a possible solution for these problems
- It is based on a different representation of union types
- Currently we are working on the definition of simple rules that implement this approach
Path covering
- In strong correctness we require that for each alternative path in the input type there is a path selection in the query
- In path covering we require that each alternative expressed in the query appears in the input type
Path covering, examples
Consider
  X = a[Y], Y = b[Int] + c[Int], x: X
and the query
  x(/b+/c+/d)
This path selection is not path-covered wrt X: the path /d is superfluous
The same holds for x(/b+/d), while both x(/b+/c) and x(/b) are path-covered
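For one-step selections like the ones above, path covering reduces to a set inclusion between the labels used by the query and the labels offered by the input type. The sketch below states that reduction for this special case only; the function names and the flat set representation are assumptions of the example, and the actual rules work on full types and paths.

```python
# Child labels that the type X = a[Y], Y = b[Int] + c[Int] allows under a.
TYPE_ALTERNATIVES = {"b", "c"}


def superfluous_paths(query_alternatives, type_alternatives=TYPE_ALTERNATIVES):
    """Alternatives in the query that never occur in the input type."""
    return set(query_alternatives) - set(type_alternatives)


def is_path_covered(query_alternatives, type_alternatives=TYPE_ALTERNATIVES):
    """Path-covered: every alternative in the query appears in the type."""
    return not superfluous_paths(query_alternatives, type_alternatives)


for query in (["b", "c", "d"], ["b", "d"], ["b", "c"], ["b"]):
    print(query, is_path_covered(query), sorted(superfluous_paths(query)))
```

Returning the superfluous alternatives, and not just a yes/no answer, is what lets a checker tell the programmer which paths to drop.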
Path covering
- It is useful for programmers, as they are statically informed about extra paths that may inefficiently attempt to match input data
- Moreover, they can improve and simplify their queries by eliminating superfluous paths or by substituting them with actually occurring ones
Path covering
- The type system defined for correctness has been easily extended to check path covering
- The system constitutes a formal framework where several other notions of correctness can be defined and compared
Ongoing and future work
Currently we are working on:
- the definition of (simple) rules that solve the unsoundness problems previously outlined
- the formal proofs of properties of the current system
In the next months we will:
- complete the formal development of both systems: the one for query correctness and the one for path covering
- extend the language with where clauses