A Type System for a Semistructured and XML Data Base Management System Ph. D. Thesis Proposal Dario Colazzo

A Type System for a Semistructured and XML Data Base Management System

Ph. D. Thesis Proposal

Dario Colazzo


Thesis Goals

  • Formal development and study of a type system for XML querying
  • Implementation of a concrete type system for an XML data base management system: the Xtasy system


Presentation outline

  • Semistructured data and XML
  • Data models
  • Type languages: DTD, XML Schema
  • Querying XML data: Tequyla
  • Processing XML data: XDuce
  • Thesis goals


Semistructured data

  • Irregular and unstable structure
  • Self-describing representation
  • No separate schema information: few guarantees of reliability and efficiency of applications


OEM graph

[Figure: OEM graph rooted at addrbook, with two person nodes. Edge labels: name, first, second, addr, age, email. Leaf values: "Dario Colazzo", "Carlo", "Sartiani", "Pisa", 30, [email protected].]


XML syntax

<addrbook>
  <person>
    <name>Dario Colazzo</name>
    <addr>Pisa</addr>
  </person>
  <person>
    <name>
      <first>Carlo</first>
      <second>Sartiani</second>
    </name>
    <addr>Pisa</addr>
    <email>[email protected]</email>
  </person>
</addrbook>


Attributes and element reference

<db>
  <state id="01">
    <name>Italy</name>
    <code>IT</code>
  </state>
  .......
  <city region="Toscana" state-of="01">
    <name>Italy</name>
    <code>PI</code>
  </city>
</db>


XML Query Data Model

  • Based on a forest of node-labeled trees (a set of documents)
  • Several kinds of nodes:
      • element node
      • attribute node
      • value node
  • Identifier and reference attributes are modeled as general attributes


XML Tree

[Figure: the addrbook document drawn as a node-labeled tree. Legend: element node, attribute node, value node. Node labels: addrbook, person, name, first, second, addr, age, email. Values: "Dario Colazzo", "Carlo", "Sartiani", "Pisa", 30, [email protected].]


XML schema languages

  • Document Type Definitions (DTDs): schemas as grammars for documents, written with regular type expressions
  • XML Schemas: closer to traditional type languages


DTD

  • Regular type expressions:
      T | U       union
      T, U        sequence
      T*          zero or more
      T?          zero or one
      X = T[X]    recursive definitions
  • Coupled-tag element declarations
  • Global definitions
  • Only one base type: string (PCDATA)
  • No type reuse


DTD, example

<!DOCTYPE addrbook [
  <!ELEMENT addrbook (person*)>
  <!ELEMENT person (name, addr, tel?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT addr (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
]>

Here person* means zero or more person elements, and tel? means zero or one tel element.
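
The reading of a DTD as a grammar can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real DTD validator: the content models of the example are transcribed by hand into regular expressions over space-terminated child tag names, text content is ignored, and the names CONTENT_MODELS and conforms are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written transcription (an assumption of this sketch) of the content
# models in the DTD example. Each model is a regular expression over the
# sequence of child tag names, every name being followed by one space.
# Elements declared as (#PCDATA) have no element children.
CONTENT_MODELS = {
    "addrbook": r"(person )*",       # person*
    "person": r"name addr (tel )?",  # name, addr, tel?
    "name": r"",
    "addr": r"",
    "tel": r"",
}


def conforms(element):
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # undeclared element
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False  # children do not match the content model
    return all(conforms(child) for child in element)


valid = ET.fromstring(
    "<addrbook><person><name>Dario Colazzo</name><addr>Pisa</addr></person></addrbook>"
)
swapped = ET.fromstring(
    "<addrbook><person><addr>Pisa</addr><name>Dario Colazzo</name></person></addrbook>"
)
```

With these definitions, conforms(valid) holds, while conforms(swapped) fails because addr comes before name.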


XML Schema

  • Decoupled-tag: elements and types may be defined separately
  • Local definitions
  • Base types: integers, string, decimal, ...
  • Type reuse:
      • type refining
      • type extension with subtyping


XML Schema, example

<xsd:complexType name="person">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string" />
    <xsd:element name="age" type="xsd:ageType" />
  </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="newPerson" base="typeOfPerson" derivedBy="extension">
  <xsd:element name="car" type="xsd:string" />
</xsd:complexType>


Querying XML data

  • XML querying is based on the use of patterns to select portions of documents
  • Untyped query languages:
      • XQL
      • XML-QL
      • Quilt
  • Typed:
      • Tequyla
      • XDuce (functional language)
  • Forthcoming W3C query language? Probably Quilt


Tequyla

  • SQL-like query language
  • Free nesting of queries
  • Typed:
      • query correctness
      • query typing
  • Currently: only non-algorithmic definitions, and weak subtyping


Tequyla queries

  • The body of a Tequyla query is a from clause composed of XPath patterns
  • x=addressbook.xml;
    binds x to the root element of addressbook.xml
  • y in x//person/addr
    starting from the root (x), searches for a person element at an arbitrary depth (//), then for an addr sub-element (/), and finally binds the node found to y
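
The two bindings can be imitated with the limited XPath support of Python's standard xml.etree.ElementTree module. This is a rough analogue and not Tequyla itself: the document is inlined as a string instead of being read from addressbook.xml, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# A small inline document standing in for addressbook.xml (an assumption of
# this sketch; the slides read the document from a file).
ADDRESSBOOK = """
<addrbook>
  <person><name>Dario Colazzo</name><addr>Pisa</addr></person>
  <person>
    <name><first>Carlo</first><second>Sartiani</second></name>
    <addr>Pisa</addr>
  </person>
</addrbook>
"""

# x=addressbook.xml: bind x to the root element of the document.
x = ET.fromstring(ADDRESSBOOK)

# y in x//person/addr: person elements at arbitrary depth below x,
# then their addr children. Each element found is one binding for y.
bindings_for_y = x.findall(".//person/addr")
addresses = [y.text for y in bindings_for_y]
```

Each element of bindings_for_y plays the role of one binding of y; the where and select clauses would then filter and rebuild the result from these bindings.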


A Tequyla query

Q = from x=addressbook.xml;
         y in x//person/addr;
         z in x//person/name;
    where y="Pisa"
    select nome[z]

The patterns in the from clause are XPath expressions.


XDuce

  • Typed functional language
  • Regular expression types
  • Type-based pattern language


XDuce schema

A schema is a set of type definitions:

E = {
  Addressbook = addrbook[(Name, Addr, Tel?)*]
  Name = name[String]
  Addr = addr[String]
  Tel = tel[String]
}


An XDuce function: telephone list

Consider T = (Name, Addr, Tel?) in

fun mkTelList : T* --> (Name, Tel)* =
    name[n], addr[a], tel[t], rest: T* --> name[n], tel[t], mkTelList(rest)
  | name[n], addr[a], rest: T*         --> mkTelList(rest)
  | ()                                 --> ()
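
The three pattern-matching cases can be mirrored in Python. This is a hand translation under assumptions, not XDuce: a value of type T* is represented as a flat list of (label, text) pairs, and the name mk_tel_list and the sample data are hypothetical.

```python
def mk_tel_list(seq):
    """Keep name and tel of every entry that has a tel; drop the others.

    `seq` is a flat list of (label, text) pairs of the shape
    (name, addr, tel?)*, mirroring the XDuce type T*.
    """
    if not seq:
        return []  # case: () --> ()
    if len(seq) < 2 or seq[0][0] != "name" or seq[1][0] != "addr":
        raise ValueError("input does not have type (Name, Addr, Tel?)*")
    name, rest = seq[0], seq[2:]
    if rest and rest[0][0] == "tel":
        # case: name[n], addr[a], tel[t], rest
        return [name, rest[0]] + mk_tel_list(rest[1:])
    # case: name[n], addr[a], rest
    return mk_tel_list(rest)


entries = [
    ("name", "Dario"), ("addr", "Pisa"), ("tel", "123"),
    ("name", "Carlo"), ("addr", "Pisa"),
]
tel_list = mk_tel_list(entries)
```

In XDuce the shape error is excluded statically by the type T*; the ValueError above is only the dynamic counterpart of that guarantee.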


XDuce subtyping: language inclusion

  • XDuce provides a simple but rather powerful notion of subtyping, based on inclusion between sets of values
  • Examples:
      Name, Addr <: Name, Addr, Tel?
      Name, Addr, Tel <: Name, Addr, Tel?
  • XML Schema extension subtyping is not captured
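
Subtyping as language inclusion can be illustrated with a bounded brute-force test in Python. This is a sketch under assumptions and not XDuce's algorithm: each type is encoded by hand as a regular expression over the letters N (Name), A (Addr) and T (Tel), and only sequences up to a fixed length are tried, so a positive answer is evidence rather than a proof.

```python
import re
from itertools import product

ALPHABET = "NAT"  # N = Name, A = Addr, T = Tel (encoding chosen for this sketch)


def included_up_to(sub, sup, max_len=4):
    """Check that every word of length <= max_len matched by `sub`
    is also matched by `sup`."""
    for length in range(max_len + 1):
        for letters in product(ALPHABET, repeat=length):
            word = "".join(letters)
            if re.fullmatch(sub, word) and not re.fullmatch(sup, word):
                return False  # counterexample found
    return True


first_example = included_up_to("NA", "NAT?")    # Name, Addr      <: Name, Addr, Tel?
second_example = included_up_to("NAT", "NAT?")  # Name, Addr, Tel <: Name, Addr, Tel?
converse = included_up_to("NAT?", "NA")         # fails on the word NAT
```

A real implementation would decide inclusion between tree automata instead of enumerating words.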


Xtasy type system


Type language

  • As expressive as DTD and XML Schema
  • Base types
  • Attributes and id/idref types
  • Type refining and extension
  • Local type definitions
  • Unordered sequence types


Schema extraction and schema inference

  • For untyped data, a schema will be inferred in the XML Schema style
  • For typed XML data, the schema will be converted into the internal schema representation
  • Type inference for query results


Data conformity

  • An algorithm will be defined to check data conformity to a schema
  • The problem is EXPTIME-complete
  • Optimization techniques exist
  • Further ones have to be found to deal with unordered sequence types and id/idref types


Query correctness

  • Only type-correct queries will be executed
  • Type correctness is based on successful matching between the query's structural requirements and the type of the data to be queried


Correct queries, an example (1/2)

Consider

E = {
  Addressbook = addrbook[Person*]
  Person = person[Name, Addr, Tel?]
  Name = name[String]
  Addr = addr[String]
  Tel = tel[String]
}


Correct queries, an example (2/2)

A correct query:

Q = from x=addressbook.xml;
         y in x//person/addr;
         z in x//person/name;
    where y="Pisa"
    select nome[z]


Correctness & union types

Consider:

Q’ = from x=addressbook.xml;
          y in x//person/addr;
          z in x//person/tel;
     where y="Pisa"
     select results[z]

Should we consider this query correct?


Correctness & union types: existential approach

  • The previous query is considered correct
  • The user will be warned about optional elements required by patterns


Total approach

  • The previous query is considered not correct
  • Too severe a discipline
  • A lot of queries with non-empty results would be cut off


Type equivalences

  • Several type equivalence laws will be considered
  • In particular:
      (T | U), S = (T, S) | (U, S)
  • Useful to simplify schema definitions


Subtyping

  • A subtype relation E <: E’ will be defined such that:
      • if a query Q is correct wrt E’ then it is also correct wrt E
  • Type extension will be supported:
      • if E is an extension of E’ then E <: E’


Parametric polymorphism (1/3)

Used in some functional languages (e.g. ML and Haskell) to define generic functions, for example:

function Sort (t: Type; L: List t; Ord: t × t → Bool): List t
begin ..... end.

It will allow us to define generic queries
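
For comparison, the same generic Sort can be written with Python type variables. This is only an illustrative sketch: the names sort and before are hypothetical, and the ordering parameter stands in for Ord.

```python
from functools import cmp_to_key
from typing import Callable, List, TypeVar

T = TypeVar("T")  # plays the role of the type parameter t


def sort(items: List[T], before: Callable[[T, T], bool]) -> List[T]:
    """Sort `items` of any element type T using the ordering `before`."""

    def compare(a: T, b: T) -> int:
        if before(a, b):
            return -1
        if before(b, a):
            return 1
        return 0

    return sorted(items, key=cmp_to_key(compare))


numbers = sort([3, 1, 2], lambda a, b: a < b)
words = sort(["pisa", "addr", "name"], lambda a, b: a < b)
```

The same definition is reused at type int and at type str, which is the kind of reuse generic queries aim for.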


Parametric polymorphism (2/3)

Parametric types fit well with the description of irregular data structures

For example

E(t) = {
  Addressbook = addrbook[(Name, Addr, Tel?)*]
  Name = name[String]
  Addr = addr[t]
  Tel = tel[String]
}

The content of addr elements can have, for example, a street and a city sub-element


Parametric polymorphism (3/3)

A generic query:

Q = t: Type; a: E(t) .
    from x=a;
         y in x//person/addr;
         z in x//person/name;
    where z="dario"
    select indirizzo[y]

More precise typing: the type Any* is different from t*


Conclusions

The type system will provide:

  • union types
  • reference types
  • recursive types
  • subtyping
  • parametric polymorphism


Progress report


Presentation outline

  • Proposal
  • What has been done
  • Ongoing and future work


Thesis Goals

  • Formal development and study of a type system for XML querying
  • The query language is an abstract version of XQuery (W3C)
  • The type language is expressive enough to capture the essence of current standards


XQuery type system

  • Only result analysis: the XQuery type system is defined to determine and check, at query-analysis time, the output type of a query on documents conforming to an expected input type.
  • Query correctness is not defined and checked (only some ideas).


What has been done

We have:

  • formally defined the notion of query type correctness
  • defined a type system to statically check it and to perform result analysis; the rules define a terminating algorithm
  • introduced an approach to recursive types that is an alternative to XQuery's


Observations

  • Our type system also performs query analysis and, in this respect, presents some differences wrt the XQuery approach
  • So far, we have considered a type system featuring product, union and recursive types
  • We have discovered that these type mechanisms are sufficient to make the study interesting and (as we will see) rather subtle.


Observations

  • We have discovered that for particular queries (fortunately not frequent ones) the type system is not able to exactly capture the semantic characterization of correctness
  • We have introduced a further notion of correctness, path covering, and provided rules to check this property


Papers

  • A first definition of the type system can be found in A Typed Text Retrieval Query Language for XML Documents, Journal of the American Society for Information Science and Technology (JASIS), Special Issue 2001
  • In Types for Correctness of Queries over Semistructured Data, the system has been improved by a finer notion of query correctness and by the notion of path covering. The work will be submitted to the WebDB2002 workshop


Tequyla (or µXQuery)

  • SQL-like query language
  • Free nesting of queries
  • Typed:
      • type conformance of data
      • query correctness
      • query typing (result analysis)


Tequyla queries

  • The body of a Tequyla query is a from clause composed of XPath patterns
  • x=addressbook.xml;
    binds x to the root element of addressbook.xml
  • y in x//person/addr
    starting from the root (x), searches for a person element at an arbitrary depth (//), then for an addr sub-element (/), and finally binds the node found to y


Types

T, U ::= ()       empty sequence
       | B        atomic type (char, int, …)
       | T + U    union
       | T; U     sequence
       | l[T]     element type
       | X        type name

Type environments: type definitions + type bindings for query free variables

E ::= ()
    | X=T, E
    | x:X, E


A type environment

E = Addressbook = addrbook[Person*],
    Person = person[Name, Addr, (Tel + EMail)],
    Name = name[String],
    Addr = addr[String],
    Tel = tel[String],
    EMail = email[String],
    x: Addressbook


A correct query

Q ::= from y in x//person/addr;
           z in x//person/name;
      where y="Pisa"
      select nome[z]

The patterns in the from clause are XPath expressions.


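An incorrect query

Q ::= from x=addressbook.xml;
           y in x//person/address;
           z in x/name;
      where y="Pisa"
      select nome[z]

Neither pattern can match: the type environment defines addr, not address, and name is not a child of the root addrbook element.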


Queries:

Q1, Q2 ::= ()
         | vB
         | l[Q]
         | Q1; Q2
         | from x=Q1 select Q2
         | from x in Q1 select Q2
         | x
         | Q p

Observe: no where clauses.


Some notation

  • Given s = {x1 = f1, ..., xn = fn}, s::E means that xi = fi ∈ s iff xi: T ∈ E and fi ∈ T
  • E |-- Q means that each free variable x in Q is typed in E (x: T ∈ E)


Definition of correctness: first step

Given a query Q, a schema E for its free variables, and s::E, either:

1. [[Q]]s = <f, F>, or
2. [[Q]]s = <f, NF>

Essentially, in s, Q correctly returns a forest f (case 1.) if, for each sub-query Q’ p in Q, the path p finds a match with the forest returned by Q’


Query correctness

Given a query Q and E s.t. E |-- Q:

  • Q is strongly correct iff for each s::E, [[Q]]s = <f, F>
  • Q is weakly correct iff there exists s::E such that [[Q]]s = <f, F>
  • Q is incorrect iff for each s::E, [[Q]]s = <f, NF>


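Example: strongly correct query

Consider the type environment

X=a[Y], Y=b[Int]+c[Int], x: X

and the query

x(/b+/c)

Every instance of type X contains the path /b or the path /c, so the selection always finds a match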


Example: weakly correct query

Consider the query

x/b

Only some instances of type X contain the path /b

X=a[Y], Y=b[Int]+c[Int], x: X


Example: incorrect query

Consider the query

x/d

No instance of type X contains the path /d

X=a[Y], Y=b[Int]+c[Int], x: X
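
The three notions of correctness can be checked by brute force on this small environment. The Python sketch below is a semantic illustration under assumptions, not the type system of the thesis: types are encoded as nested tuples, Int is represented by a single sample value, the type is assumed non-recursive so that its structural instances can be enumerated, and the query is a single child step with alternative labels. All names are hypothetical.

```python
# Types: ("elem", label, content) | ("union", left, right) | ("int",)
X = ("elem", "a", ("union", ("elem", "b", ("int",)), ("elem", "c", ("int",))))


def instances(t):
    """Enumerate the structural instances of a non-recursive type."""
    kind = t[0]
    if kind == "int":
        return [0]  # one representative integer value
    if kind == "union":
        return instances(t[1]) + instances(t[2])
    if kind == "elem":
        return [(t[1], content) for content in instances(t[2])]
    raise ValueError("unknown type constructor: %r" % (kind,))


def matches(tree, labels):
    """True if the child of the root element carries one of `labels`."""
    _, content = tree
    return isinstance(content, tuple) and content[0] in labels


def classify(t, labels):
    """Classify the selection x(/l1+...+/ln) for x of type t."""
    outcomes = [matches(tree, labels) for tree in instances(t)]
    if all(outcomes):
        return "strongly correct"
    if any(outcomes):
        return "weakly correct"
    return "incorrect"


strong = classify(X, {"b", "c"})  # x(/b+/c)
weak = classify(X, {"b"})         # x/b
wrong = classify(X, {"d"})        # x/d
```

Enumerating instances is only possible because X is finite up to base values; the type system has to reach the same verdicts from the type alone.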


Type system

To check correctness and to infer the type of query results we have defined a set of rules that:

  • define an algorithm: determinism + termination
  • deal with recursion in a different way wrt the XQuery type system in some cases (// + guarded recursion)
  • infer context-free types
  • do not rely on any notion of type inclusion: only matching between paths and types


Some properties

Given E |-- Q, if the system returns

    E |-- Q : <T, θ>   with θ ∈ {s, w, i}

then

    [[Q]] ⊆ [[T]]

and

  • θ = s / i ⇒ Q is strongly correct / incorrect
  • if θ = w then in most cases Q is weakly correct, but in some cases Q is strongly correct or, even worse, incorrect


Weak correctness problem (1)

Unsoundness for the case θ=w (and incorrect queries) is due to particular queries where two different paths start from the same root (x) and traverse two "disjoint" paths

Example: x/b; x/c where

x: X, X=a[Y], Y=b[Int]+c[Int]


Observations

Observe that the problem does not arise for

x/b; x/b   or   x/b; y/c

where x: X, y: X, X=a[Y], Y=b[Int]+c[Int]

Both queries are weakly correct, as inferred by the type system
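
The difference between x/b; x/c and the two unproblematic queries can be seen by enumerating substitutions. The Python sketch below is only a semantic illustration under assumptions: the two structural instances of X are written by hand as the set of child labels under the root a, a query is a list of (variable, label) steps that must all match, and the name classify is hypothetical.

```python
from itertools import product

# The two structural instances of X = a[Y], Y = b[Int] + c[Int], each given
# as the set of child labels found under the root element a.
INSTANCES = [{"b"}, {"c"}]


def classify(query):
    """Classify a query given as a list of (variable, child label) steps.

    Every variable has type X; a substitution picks one instance per variable.
    """
    variables = sorted({var for var, _ in query})
    outcomes = []
    for choice in product(INSTANCES, repeat=len(variables)):
        env = dict(zip(variables, choice))
        outcomes.append(all(label in env[var] for var, label in query))
    if all(outcomes):
        return "strongly correct"
    if any(outcomes):
        return "weakly correct"
    return "incorrect"


same_root_disjoint = classify([("x", "b"), ("x", "c")])  # x/b; x/c
same_root_same_path = classify([("x", "b"), ("x", "b")])  # x/b; x/b
different_roots = classify([("x", "b"), ("y", "c")])      # x/b; y/c
```

No single instance bound to x contains both b and c, so x/b; x/c is semantically incorrect, whereas a rule that examines each path separately would see two weakly correct selections.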


Strong correctness problem

Consider the strongly correct query

x(/b+/c)

where x: X, X=a[Y], Y=b[Int]+c[Int]

In this case the type system infers: <b[Int]? + c[Int]?, w>


Solution

  • We have a possible solution for these problems
  • It is based on a different representation of union types
  • Currently we are working on the definition of simple rules that implement this approach


Path covering

  • In strong correctness we require that for each alternative path in the input type there is a path selection in the query
  • In the notion of path covering we require that each alternative expressed in the query appears in the input type


Path covering, examples

Consider X=a[Y], Y=b[Int]+c[Int], x: X

and the query

x(/b+/c+/d)

This path selection is not path-covered wrt X: the path /d is superfluous

The same holds for x(/b+/d), while both x(/b+/c) and x(/b) are path-covered
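
For single-step selections such as these, strong correctness and path covering are two opposite set inclusions. The Python sketch below illustrates this under assumptions: the alternatives of the input type and of the query are given directly as lists of child labels, and the function name analyse is hypothetical. It is not the rule system of the thesis.

```python
def analyse(type_alternatives, query_alternatives):
    """Compare the child labels allowed by the type with those in the query."""
    available = set(type_alternatives)
    selected = set(query_alternatives)
    return {
        # every alternative of the type is selected by the query
        "strongly_correct": available <= selected,
        # every alternative of the query appears in the type
        "path_covered": selected <= available,
        # query alternatives that can never match
        "superfluous": sorted(selected - available),
    }


TYPE_X = ["b", "c"]  # child alternatives of X = a[Y], Y = b[Int] + c[Int]

q1 = analyse(TYPE_X, ["b", "c", "d"])  # x(/b+/c+/d)
q2 = analyse(TYPE_X, ["b", "d"])       # x(/b+/d)
q3 = analyse(TYPE_X, ["b", "c"])       # x(/b+/c)
q4 = analyse(TYPE_X, ["b"])            # x(/b)
```

The superfluous list is what a programmer would be shown: for x(/b+/c+/d) and x(/b+/d) it contains d, for the other two queries it is empty.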


Path covering

  • It is useful for programmers, as they are statically informed about extra paths that may inefficiently attempt to match input data
  • Moreover, they can improve and simplify their queries by eliminating superfluous paths or by substituting them with actually occurring ones


Path covering

  • The type system defined for correctness has been easily extended to check path covering
  • The system constitutes a formal framework where several other notions of correctness can be defined and compared


Ongoing and future work

Currently we are working on:

  • the definition of (simple) rules that solve the unsoundness problems previously outlined
  • the formal proofs of properties of the current system

In the next months we will:

  • complete the formal development of both the system for query correctness and the system for path covering
  • extend the language with where clauses