Spring 2005 Daria Barger – DB Seminar 1 Efficient Incremental Validation of XML Documents Denilson Barbosa Alberto O.Mendelson Leonid Libkin Laurent Mignet Marcelo Arenas Presented by Daria Barger




Outline

• Introduction
• Types of constraints
• Update operations
• Incremental validation
• Experiments
• Conclusions
• Future work


Introduction

The problems of storing and querying XML documents have attracted a great deal of interest.

Other aspects of XML data management, however, have not yet been satisfactorily explored.

Among them is the problem of checking that documents are valid with respect to their specifications, and that they remain valid after updates.


DTD

One popular form of XML document specification is the Document Type Definition (DTD).

A DTD D is a grammar that defines a set of documents L(D).

Each document in L(D) is said to be valid with respect to D.


The Validation Problem

The validation problem is:

Given a DTD D and an XML document X, is it the case that $X \in L(D)$?

The incremental validation problem is:

Let U be some update operation.

Given $X \in L(D)$, is it the case that $U(X) \in L(D)$?


Validation of structural constraints

Elements are declared in a DTD by rules of the form: <!ELEMENT e c>, where e is the element name and c is its content model.

<?xml version="1.0"?>
<!ELEMENT db (person*)>
<!ELEMENT person (name, dep, email, tel*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT dep (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>

Content model: a regular expression E over element names. The element is valid iff the string formed by concatenating the names of its child elements belongs to L(E), the language denoted by E.

Content model: #PCDATA. Validation can be done trivially.
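The content-model check can be illustrated with a minimal Python sketch. It is only an illustration of full (non-incremental) validation: the encoding of content models as Python regular expressions over space-terminated tag names, and the names `CONTENT_MODELS`, `children_word` and `is_valid`, are assumptions made here, not part of the presentation.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the example DTD, encoded as regexes over child tag
# names, each name followed by one space (an encoding chosen for this sketch).
CONTENT_MODELS = {
    "db": r"(person )*",
    "person": r"name dep email (tel )*",
}


def children_word(elem):
    """Concatenate the tag names of the element's children."""
    return "".join(child.tag + " " for child in elem)


def is_valid(elem):
    """Check elem and, recursively, its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        # Any other tag is treated as #PCDATA here: no element children allowed.
        return len(elem) == 0
    if re.fullmatch(model, children_word(elem)) is None:
        return False
    return all(is_valid(child) for child in elem)


good = ET.fromstring(
    "<db><person><name>A</name><dep>CS</dep><email>a@x</email>"
    "<tel>1</tel><tel>2</tel></person></db>"
)
# The second document omits the mandatory dep element.
bad = ET.fromstring("<db><person><name>A</name><email>a@x</email></person></db>")
```

Here `is_valid(good)` holds while `is_valid(bad)` does not, because the children of `person` no longer spell a word in the language of its content model.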


Validation of attributes

Attribute validation is trivial, except for the ID and IDREF attribute types.

A valid XML document must satisfy:
• The values of all ID attributes are unique.
• The value of each IDREF attribute is equal to the value of some ID attribute.
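The two conditions can be checked from scratch with a single pass that collects the ID values, followed by a second pass over the references. The sketch below assumes, purely for illustration, that the attribute named `id` is declared as ID and the attribute named `ref` is declared as IDREF; a real validator would read these declarations from the DTD.

```python
import xml.etree.ElementTree as ET


def check_id_idref(root, id_attr="id", idref_attr="ref"):
    """Return True iff ID values are unique and every IDREF hits some ID.

    id_attr and idref_attr are hypothetical attribute names used only in
    this sketch.
    """
    seen = set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:          # duplicate ID value
                return False
            seen.add(value)
    for elem in root.iter():
        target = elem.get(idref_attr)
        if target is not None and target not in seen:
            return False               # dangling reference
    return True


ok_doc = ET.fromstring('<db><p id="1"/><p id="2" ref="1"/></db>')
dup_doc = ET.fromstring('<db><p id="1"/><p id="1"/></db>')
dangling_doc = ET.fromstring('<db><p id="1" ref="7"/></db>')
```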


1-unambiguous regular expressions

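The specification of XML DTDs restricts the regular expressions used for defining element content to be 1-unambiguous (deterministic).

Marking: each symbol occurrence in a regular expression E receives a distinct subscript, giving the marked expression E′; for instance, two occurrences of b become b1 and b2.

Position: a subscripted symbol in E′.
For a given position x, $\chi(x)$ denotes the corresponding (unmarked) symbol in Σ.
For example: $pos(E') = \{a, b_1, b_2, c\}$ and $\chi(b_1) = b$.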


1-unambiguous regular expressions

A regular expression E is 1-unambiguous if and only if for all words u, v, w over the subscripted alphabet pos(E) and all x, y in pos(E), the conditions $uxv, uyw \in L(E')$ and $x \neq y$ imply $\chi(x) \neq \chi(y)$.

Which regular expression is deterministic?
– (ab)|(ac)
– a(b|c)
– a(a+b)*ac


The Glushkov automaton for Regular Expressions

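For a regular expression E, the Glushkov automaton is $G_E = (Q, \Sigma, \delta, q_i, F)$, where:

1. $Q = pos(E) \cup \{q_i\}$
2. For $a \in \Sigma$, let $\delta(q_i, a) = \{x \mid x \in first(E),\ \chi(x) = a\}$
3. For $x \in pos(E)$, $a \in \Sigma$, let $\delta(x, a) = \{y \mid y \in follow(E, x),\ \chi(y) = a\}$
4. $F = last(E) \cup \{q_i\}$ if $\varepsilon \in L(E)$, and $F = last(E)$ otherwise.

first(E): set of positions that appear as the first symbol of some word in L(E′)

last(E): set of positions that appear as the last symbol of some word in L(E′)

follow(E, x): set of positions that appear immediately after position x in some word in L(E′)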
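The construction can be sketched in Python as follows. The tuple-based expression trees, the function names and the string `"q_i"` used for the initial state are choices made for this sketch only. An expression is deterministic exactly when every entry of the transition table built this way contains a single position, which gives a mechanical way to examine the three expressions from the earlier question (reading `+` there as union).

```python
from itertools import count

# Expression trees: ("sym", a), ("cat", l, r), ("alt", l, r), ("star", e).


def mark(e, counter=None):
    """Replace every symbol by a position (symbol, index)."""
    counter = counter if counter is not None else count(1)
    if e[0] == "sym":
        return ("pos", (e[1], next(counter)))
    return (e[0],) + tuple(mark(sub, counter) for sub in e[1:])


def analyse(e, follow):
    """Return (nullable, first, last) of a marked tree and fill follow."""
    kind = e[0]
    if kind == "pos":
        follow.setdefault(e[1], set())
        return False, {e[1]}, {e[1]}
    if kind == "star":
        _, first, last = analyse(e[1], follow)
        for x in last:                       # loop back to the start
            follow[x] |= first
        return True, first, last
    n1, f1, l1 = analyse(e[1], follow)
    n2, f2, l2 = analyse(e[2], follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for x in l1:                             # concatenation
        follow[x] |= f2
    return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())


def glushkov(e):
    """Return (delta, final); delta maps (state, symbol) to a set of positions."""
    follow = {}
    nullable, first, last = analyse(mark(e), follow)
    delta = {}
    for src, targets in [("q_i", first)] + list(follow.items()):
        for y in targets:
            delta.setdefault((src, y[0]), set()).add(y)
    final = set(last) | ({"q_i"} if nullable else set())
    return delta, final


def is_deterministic(e):
    delta, _ = glushkov(e)
    return all(len(targets) == 1 for targets in delta.values())


a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
E1 = ("alt", ("cat", a, b), ("cat", a, c))                    # (ab)|(ac)
E2 = ("cat", a, ("alt", b, c))                                # a(b|c)
E3 = ("cat", ("cat", ("cat", a, ("star", ("alt", a, b))), a), c)  # a(a|b)*ac
```

Under this sketch only `E2` is reported as deterministic: in `E1` two positions labelled a can start a word, and in `E3` two positions labelled a can follow the first a.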


Update operations

Append(p,y) - insert element y as the last child of element p.

[Figure: a tree before and after Append(p, y); y is added as the last child of p.]


Update operations (2)

InsertBefore(x,y) – insert element y as the immediate left sibling of element x. (This operation is not defined if x is the root of the document.)

[Figure: a tree before and after InsertBefore(x, y); y is added immediately to the left of x.]


Update operations (3)

Delete(x) – delete element x from the document. Note that if x is the root of the document the operation is trivially valid.

[Figure: a tree before and after Delete(x); the subtree rooted at x is removed.]
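The three update operations can be mimicked on an in-memory tree with Python's `xml.etree.ElementTree`. This is only a sketch of the tree manipulation itself, with no validation; the helper `find_parent` is an assumption of the sketch, needed because `ElementTree` elements do not store parent pointers.

```python
import xml.etree.ElementTree as ET


def find_parent(root, x):
    """Return the parent of x under root, or None if x is the root."""
    for candidate in root.iter():
        if any(child is x for child in candidate):
            return candidate
    return None


def append(p, y):
    """Append(p, y): y becomes the last child of p."""
    p.append(y)


def insert_before(root, x, y):
    """InsertBefore(x, y): y becomes the immediate left sibling of x."""
    parent = find_parent(root, x)
    if parent is None:
        raise ValueError("InsertBefore is not defined for the root")
    parent.insert(list(parent).index(x), y)


def delete(root, x):
    """Delete(x): remove the subtree rooted at x (root deletion not handled)."""
    parent = find_parent(root, x)
    if parent is None:
        raise ValueError("deleting the root is left to the caller in this sketch")
    parent.remove(x)


def tags(elem):
    return [child.tag for child in elem]


doc = ET.fromstring("<p><a/><b/></p>")
append(doc, ET.Element("y"))
after_append = tags(doc)
insert_before(doc, doc[1], ET.Element("z"))
after_insert = tags(doc)
delete(doc, doc[0])
after_delete = tags(doc)
```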


Observation

The incremental validation concerns only the content of the element where the update takes place. For example, after an Append(p,y) operation only the content of p needs to be revalidated.


The approach

Together with the i-th child of p we store the value of $\hat{\delta}(w_1 \ldots w_i)$ for the automaton that validates the content model of p.

This requires auxiliary storage of size O(n log d), where n is the size of the XML document and d is the size of the DTD.

[Figure: element p with children w_1, w_2, w_3, …, w_k; each child w_i is annotated with the stored state $\hat{\delta}(w_1 \ldots w_i)$.]


Append at the end

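Append(p,y) operation

If $\delta(\hat{\delta}(w_1 \ldots w_k), y) \in F$, then the operation succeeds.

The check takes $O(\log n + \log d)$ time.

[Figure: element p with children w_1, w_2, w_3, …, w_k and the new last child y; each w_i is annotated with $\hat{\delta}(w_1 \ldots w_i)$.]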
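A minimal Python sketch of the stored-state idea for appends is given below. The class name `ChildStates`, the dictionary-based transition table and the example automaton for the content model (name, dep, email, tel*) are assumptions of this sketch. It uses plain Python lists and dictionaries, so it does not reproduce the index structures behind the logarithmic bounds; it only shows that an append needs the last stored state and a single transition.

```python
class ChildStates:
    """Children of one element p plus the state reached after each prefix."""

    def __init__(self, delta, start, final):
        self.delta = delta        # dict: (state, label) -> state
        self.final = final        # set of accepting states
        self.children = []        # labels w_1 ... w_k
        self.states = [start]     # states[i] = state after w_1 ... w_i

    def load(self, labels):
        """Record an already valid child sequence and its prefix states."""
        for label in labels:
            self.children.append(label)
            self.states.append(self.delta[(self.states[-1], label)])

    def append(self, y):
        """Append(p, y): succeed iff delta(last stored state, y) is accepting."""
        nxt = self.delta.get((self.states[-1], y))
        if nxt not in self.final:     # also covers a missing transition
            return False
        self.children.append(y)
        self.states.append(nxt)
        return True


# Hand-written automaton for the content model (name, dep, email, tel*).
PERSON_DELTA = {
    ("q_i", "name"): "name",
    ("name", "dep"): "dep",
    ("dep", "email"): "email",
    ("email", "tel"): "tel",
    ("tel", "tel"): "tel",
}
person = ChildStates(PERSON_DELTA, "q_i", {"email", "tel"})
person.load(["name", "dep", "email"])
appended_tel = person.append("tel")     # allowed by tel*
appended_name = person.append("name")   # rejected: no such transition
```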


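Arbitrary insertions and deletions

Delete(x) operation, where x is the i-th child $w_i$ of p:

We should revalidate $w_1 \ldots w_k$ starting from $\delta(\hat{\delta}(w_1 \ldots w_{i-1}), w_{i+1})$.

Problem: Complexity

$O(|w| \log n + \log d)$ time, where $|w| = O(n)$.

[Figure: element p with children w_1, w_2, …, w_i, …, w_k; each child w_j is annotated with the stored state $\hat{\delta}(w_1 \ldots w_j)$.]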


1,2 Conflict Free Regular Expression

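Possible solution:

$\hat{\delta}(w_1 \ldots w_{i-1} w_{i+1}) = \hat{\delta}(w_1 \ldots w_{i-1} w_i w_{i+1})$

If the state reached after $w_{i+1}$ is the same with and without $w_i$, the states stored for the remaining children are still correct and nothing else needs to be recomputed.

This condition does not always hold, e.g.:

Let's consider E = a(b1*|cb2*)

w = acb…b. All b's match state b2.

Delete c from w, obtaining w' = ab…b.

Now all b's match state b1.

We should revalidate the entire string.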
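The effect can be reproduced with a small Python sketch of deletion over stored prefix states. The function `delete_child`, its return shape and the two hand-written automata are assumptions of this sketch. Revalidation stops as soon as a recomputed state equals the state stored for the same child before the deletion, which is the situation described by the condition on this slide.

```python
def delete_child(delta, final, children, states, i):
    """Delete children[i]; states[j] is the stored state after children[:j].

    Returns (valid, new_children, new_states, recomputed), where recomputed
    counts the transitions that had to be evaluated again.
    """
    new_children = children[:i] + children[i + 1:]
    new_states = states[:i + 1]          # children before the deleted one
    recomputed = 0
    for j in range(i, len(new_children)):
        nxt = delta.get((new_states[-1], new_children[j]))
        recomputed += 1
        if nxt is None:                  # no transition: content is invalid
            return False, children, states, recomputed
        new_states.append(nxt)
        if nxt == states[j + 2]:         # same state as before the deletion
            new_states.extend(states[j + 3:])
            break
    return new_states[-1] in final, new_children, new_states, recomputed


# E = a(b1*|c b2*): every b flips from b2 to b1 once c is deleted.
FLIP_DELTA = {
    ("q_i", "a"): "a", ("a", "b"): "b1", ("a", "c"): "c",
    ("b1", "b"): "b1", ("c", "b"): "b2", ("b2", "b"): "b2",
}
FLIP_FINAL = {"a", "b1", "c", "b2"}
flip = delete_child(FLIP_DELTA, FLIP_FINAL, list("acbbb"),
                    ["q_i", "a", "c", "b2", "b2", "b2"], 1)

# E = a(b|c)*d: each symbol has one position, so one recomputation suffices.
STABLE_DELTA = {("q_i", "a"): "a"}
for source in "abc":
    for symbol in "bcd":
        STABLE_DELTA[(source, symbol)] = symbol
stable = delete_child(STABLE_DELTA, {"d"}, list("abcbd"),
                      ["q_i", "a", "b", "c", "b", "d"], 1)
```

In the first call every stored state after the deleted c has to be rewritten; in the second call the walk stops after one transition.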


Definition of 1,2 Conflict-free

Let E be a regular expression over the alphabet Σ.

follow(E, x): the set of positions in E that can follow x in some path through E.

Define

$follow_2(E, x) = \{\, y \in pos(E) \mid \exists z \in follow(E, x) \text{ such that } y \in follow(E, z) \,\}$

E is a 1,2 conflict-free regular expression if:

1) E is deterministic

2) For every $x, y, z \in pos(E) \cup \{q_i\}$, if $y \in follow(E, x)$ and $z \in follow_2(E, x)$, then $\chi(y) = \chi(z) \Rightarrow y = z$
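The definition can be tested mechanically once the follow sets are known. The Python sketch below takes the follow sets as a dictionary written out by hand, with the first positions stored under the key `"q_i"`; that representation, the helper `chi` (which strips a trailing numeric subscript) and the two example tables are assumptions of this sketch.

```python
def chi(position):
    """Unmarked symbol of a position such as 'b1' (subscript stripped)."""
    return position.rstrip("0123456789")


def follow2(follow, x):
    """Positions reachable from x in exactly two follow steps."""
    return {y for z in follow[x] for y in follow[z]}


def is_deterministic(follow):
    """No state has two distinct successors carrying the same symbol."""
    for successors in follow.values():
        symbols = [chi(y) for y in successors]
        if len(symbols) != len(set(symbols)):
            return False
    return True


def is_12_conflict_free(follow):
    if not is_deterministic(follow):
        return False
    for x in follow:
        two_steps = follow2(follow, x)
        for y in follow[x]:
            for z in two_steps:
                if chi(y) == chi(z) and y != z:
                    return False
    return True


# Follow sets written by hand for E = a(b1*|c b2*).
FLIPPING = {"q_i": {"a"}, "a": {"b1", "c"}, "b1": {"b1"},
            "c": {"b2"}, "b2": {"b2"}}
# Follow sets written by hand for E = a(b|c)*d.
STABLE = {"q_i": {"a"}, "a": {"b", "c", "d"}, "b": {"b", "c", "d"},
          "c": {"b", "c", "d"}, "d": set()}
```

For `FLIPPING` the check fails at x = a, since b1 is in follow(E, a) and b2 is in follow2(E, a); `STABLE` passes because no symbol is repeated.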


Restricted forms of DTD

1,2 Conflict Free DTD:
• There is no "flipping" between automata states after the update.
• The per update complexity for a 1,2 Conflict Free DTD is O(log n + log d) time and O(n log d) auxiliary space.

Conflict-free DTD:
• No repeated symbols.
• The per update complexity is O(log n + log d) time and constant auxiliary space.


Incremental validation of ID and IDREF for adding element

Append(p,y) and InsertBefore(x,y) operations require checking that no two ID attributes have the same value and that every IDREF attribute in y refers to an ID value that exists in the document.

The complexity: O(|y| log n) time and linear auxiliary space, where |y| is the size of the added subtree.


Incremental validation of ID and IDREF for deleting element

After a Delete(x) operation we have to check that the subtree rooted at x contains no node whose ID attribute is referenced by some other node that is not a descendant of x.

[Figure: a small tree with nodes a, b and c illustrating a reference into a deleted subtree.]

Checking the reference counter on delete requires O(log n) time.

Updating the reference counter when an IDREF attribute is inserted or removed requires O(h log n) time.
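The ID and IDREF checks for insertions and deletions can be sketched in Python with a set of known ID values and a counter of references per ID. This is a simplified stand-in, not the reference-counter structure of the presentation: the class `IdIndex`, the attribute names `id` and `ref`, and the use of a plain `Counter` are assumptions of this sketch, and it scans the affected subtree on every check instead of achieving the stated bounds.

```python
from collections import Counter
import xml.etree.ElementTree as ET


class IdIndex:
    """Known ID values and the number of IDREFs pointing at each of them."""

    def __init__(self, root, id_attr="id", ref_attr="ref"):
        self.id_attr, self.ref_attr = id_attr, ref_attr   # hypothetical names
        self.ids = set(self._values(root, id_attr))
        self.refs = Counter(self._values(root, ref_attr))

    @staticmethod
    def _values(tree, attr):
        return [e.get(attr) for e in tree.iter() if e.get(attr) is not None]

    def can_insert(self, y):
        """IDs in y must be new and unique; IDREFs in y must resolve."""
        new_ids = self._values(y, self.id_attr)
        if len(set(new_ids)) != len(new_ids) or self.ids & set(new_ids):
            return False
        known = self.ids | set(new_ids)
        return all(r in known for r in self._values(y, self.ref_attr))

    def can_delete(self, x):
        """Every reference to an ID inside x must itself come from inside x."""
        inner = Counter(self._values(x, self.ref_attr))
        return all(self.refs[i] == inner[i]
                   for i in self._values(x, self.id_attr))


doc = ET.fromstring('<db><p id="1"/><q ref="1"/></db>')
index = IdIndex(doc)
p, q = doc[0], doc[1]
```

With this document, deleting `p` would leave the reference in `q` dangling, so `index.can_delete(p)` is false, while `index.can_delete(q)` is true.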


Valid Insertion

[Plot: time in microseconds (axis ticks 100, 10000, 1e+06, 1e+08) against document size (64K, 512K, 4M, 32M, 256M, 2G) for Incr CF, Incr 1,2 CF, Incr Arb, Full Arb and Full CF.]


Valid Deletion

[Plot: time in microseconds (axis ticks 100, 10000, 1e+06, 1e+08) against document size (64K, 512K, 4M, 32M, 256M, 2G) for Incr CF, Incr 1,2 CF, Incr Arb, Full Arb and Full CF.]


Invalid Deletion

[Plot: time in microseconds (axis ticks 10, 100, 1000) against document size (64K, 512K, 4M, 32M, 256M, 2G) for Incr CF, Incr 1,2 CF, Incr Arb, Full Arb and Full CF.]


Conclusions

1. Handled insertion and deletion of subtrees (not leaf nodes only).

2. Validated ID and IDREF attributes.

3. Characterized a class of DTDs, which appears to capture most real-life DTDs, that admits a logarithmic-time, constant-space incremental validation algorithm.

4. Conducted experiments showing that the method is practical for large data documents and behaves much better than full revalidation.


Future Work

Handling complex updates that involve several insertions and deletions as a single transaction.