Page 1: The correction of XML data

Michel de Rougemont
Université Paris II & LRI
mdr@lri.fr
http://www.lri.fr/~mdr

Japan, June 2003

1. Approximation and Edit Distance
2. Testers and Correctors
3. Correcting regular binary trees
4. Applications to XML: practical corrector
5. Relative value of documents

Page 2: Approximation

1. Relations: $\mathrm{Dist}(R,S) = \#\{x : R(x) \neq S(x)\}$, and $R \sim_\varepsilon S$ if $\mathrm{Dist}(R,S) < \varepsilon \cdot n^{\mathrm{arity}(R)}$.

2. Words: Edit-distance.

3. Trees: Tree-Edit-Distance, the minimum number of deletions and insertions.

[Figure: a left-deletion and a left-insertion on a tree.]
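As a small illustration of item 1, here is a minimal Python sketch of the distance between two relations and of the closeness test. The function names, the sample relations and the threshold 0.25 are assumptions made for the example; they do not come from the talk.

```python
def relation_distance(r, s):
    """Number of tuples on which the two relations disagree."""
    return len(set(r) ^ set(s))


def is_eps_close(r, s, n, arity, eps):
    """True when Dist(R,S) < eps * n**arity, for a domain of size n."""
    return relation_distance(r, s) < eps * n ** arity


# Two binary relations over the domain {0, 1, 2, 3} (n = 4, arity = 2).
R = {(0, 1), (1, 2), (2, 3)}
S = {(0, 1), (1, 2), (3, 0)}
print(relation_distance(R, S))         # 2: they differ on (2, 3) and (3, 0)
print(is_eps_close(R, S, 4, 2, 0.25))  # True, since 2 < 0.25 * 16
```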

Page 3: Classical Tree-Edit-Distance

Binary trees: p-Distance allows permutation.

Dist(T1,T2) = 2, p-Dist(T1,T2) = 1

$\mathrm{Dist}(T,L) = \min_{T' \in L} \mathrm{Dist}(T,T')$

[Figure: three binary trees over the labels a to f, related by a deletion and an insertion.]

Page 4: Approximate satisfiability

1. Satisfiability: Tree |= F

2. Approximate satisfiability: $\mathrm{Tree} \models_\varepsilon F$

[Figure: image on a class K of trees, showing the trees that satisfy F and the trees far from F.]

Page 5: Logic, testers, correctors

A Tester decides $\models_\varepsilon$ for a formula F.

A Corrector takes a tree T close to a language L and finds T’ in L close to T.

This is possible if F follows a simple logic.

Theorem. There is a linear-time corrector for regular binary trees and a constant distance.

Given a tree T, k-close to a regular language L, we find in linear time T’ in L, c.k-close to T.

General problem: given a language L defined in some logic, find a corrector.

Theorem (implicit in Alon et al., FOCS2000). There is a linear-time corrector for regular words and distance $\varepsilon \cdot n$.

Application to Model-Checking (LICS2002).

Page 6: Simple example

Tester for 0+ 1* 0+

000000011111110000010000: probably accepted
011110000000110111: rejected with high probability

Types of segments:

[Figure: the word 0000011111000111110001100 with some of its segments highlighted.]

Corrector for 0+ 1* 0+: 00000001111110000100000 (one position is marked *)
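A deterministic corrector for this language can be sketched as a three-phase dynamic program. This is only an illustration, not the corrector of the talk: it assumes Hamming distance (symbol replacement only, same length), and the function name is hypothetical.

```python
def correct_0p_1s_0p(word):
    """Return (distance, corrected), where corrected is a closest word of the
    same length in 0+ 1* 0+ under Hamming distance. Needs len(word) >= 2."""
    n = len(word)
    if n < 2:
        raise ValueError("0+ 1* 0+ has no word of length < 2")
    inf = float("inf")
    symbol = "010"                 # symbol written in phase 0, 1 and 2
    # cost[p]: cheapest rewriting of the prefix read so far, ending in phase p
    cost = [int(word[0] != "0"), inf, inf]   # the first symbol is a leading 0
    back = []                      # back[i-1][p]: phase chosen at position i-1
    for i in range(1, n):
        new, choice = [inf] * 3, [0] * 3
        for p in range(3):
            # a phase is entered from itself or from any earlier phase
            prev = min(range(p + 1), key=lambda q: cost[q])
            new[p] = cost[prev] + int(word[i] != symbol[p])
            choice[p] = prev
        cost = new
        back.append(choice)
    # the last symbol must be a trailing 0, so the run ends in phase 2
    phases = [2]
    for choice in reversed(back):
        phases.append(choice[phases[-1]])
    phases.reverse()
    return cost[2], "".join(symbol[p] for p in phases)


# The word shown on the slide: the isolated 1 is the only symbol to change.
print(correct_0p_1s_0p("00000001111110000100000"))
# (1, '00000001111110000000000')
```

The program runs in time linear in the length of the word, since each position is examined for a constant number of phases.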

Page 7: Regular Trees

• Tree-automata
• Logical definability on trees
• Tree grammar
• Regular expression

r(a,b(a,b(a,b(a,b(a,b(a,b)....)

r(a(a,b(a,b(a(a,b),b)....),b)

Page 8: Tree automata

$A = (Q, q_0, \delta, q_1)$

• (q0,q0) -> q1
• (q0,q1) -> q1
• (q1,q1) -> q2
• (q1,q0) -> q2
• (q2,-) -> q2
• (-,q2) -> q2

[Figure: two runs of the automaton on binary trees, one using only the states q0 and q1, the other reaching q2.]
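The transition table above can be run bottom-up on a binary tree. The following Python sketch makes assumptions the slide does not state explicitly: every leaf starts in q0, q1 is the accepting state, q2 is a sink, and node labels are ignored. The tree encoding (None for a leaf, a pair for an internal node) is also a choice made for the example.

```python
LEAF = None  # a leaf; an internal node is a pair (left, right)

# Transition table from the slide; "-" (any state) is expanded explicitly.
STATES = ("q0", "q1", "q2")
DELTA = {
    ("q0", "q0"): "q1",
    ("q0", "q1"): "q1",
    ("q1", "q1"): "q2",
    ("q1", "q0"): "q2",
}
for q in STATES:
    DELTA[("q2", q)] = "q2"
    DELTA[(q, "q2")] = "q2"


def run(tree):
    """State reached at the root by the bottom-up run."""
    if tree is LEAF:
        return "q0"                      # assumption: leaves start in q0
    left, right = tree
    return DELTA[(run(left), run(right))]


def accepts(tree):
    return run(tree) == "q1"             # assumption: q1 is accepting


right_comb = (LEAF, (LEAF, (LEAF, LEAF)))   # left child is always a leaf
left_heavy = ((LEAF, LEAF), LEAF)           # left child is an internal node
print(accepts(right_comb), accepts(left_heavy))  # True False
```

Under these assumptions the automaton accepts exactly the trees whose left children are all leaves, which is the comb shape of the first term on the "Regular Trees" slide.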

Page 9: Feasible and infeasible subtrees

Definition: a subtree t is feasible for L if there are subtrees (for its leaves) which reach states (q1...ql) such that the state of the root q=t(q1...ql) can reach an accepting state (in the automaton for L).

A subtree is infeasible if it is not feasible.

[Figure: a feasible subtree and an infeasible subtree.]
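For complete subtrees, feasibility reduces to asking whether the state at the root can still reach an accepting state. The sketch below computes those states by two fixpoint iterations. It is a simplification of the definition above (subtrees with open leaves are not handled), and the automaton encoding, the leaf state q0 and the accepting state q1 are assumptions.

```python
def reachable_states(delta, leaf_state):
    """States that label the root of at least one tree."""
    reach = {leaf_state}
    changed = True
    while changed:
        changed = False
        for (left, right), q in delta.items():
            if left in reach and right in reach and q not in reach:
                reach.add(q)
                changed = True
    return reach


def live_states(delta, leaf_state, accepting):
    """Reachable states from which an accepting state can still be reached."""
    reach = reachable_states(delta, leaf_state)
    live = set(accepting) & reach
    changed = True
    while changed:
        changed = False
        for (left, right), q in delta.items():
            if q in live and left in reach and right in reach:
                for s in (left, right):
                    if s not in live:
                        live.add(s)
                        changed = True
    return live


# The automaton of the "Tree automata" slide, with "-" expanded to all states.
STATES = ("q0", "q1", "q2")
DELTA = {("q0", "q0"): "q1", ("q0", "q1"): "q1",
         ("q1", "q1"): "q2", ("q1", "q0"): "q2"}
for q in STATES:
    DELTA[("q2", q)] = "q2"
    DELTA[(q, "q2")] = "q2"

LIVE = live_states(DELTA, "q0", {"q1"})
print(sorted(LIVE))  # ['q0', 'q1']: q2 is a sink
```

A complete subtree whose root state falls outside the live set is infeasible in this simplified sense.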

Page 10: Infeasible subtrees

Fact. If $\mathrm{Distance}(T,L) > \varepsilon \cdot n$ then the number of infeasible subtrees of length a is O(n).

Fact. If the distance is small, there are few infeasible subtrees.

Intuition: make local corrections at the roots of the infeasible subtrees.

Page 11: Structure of the corrector

Phase 1 (bottom-up): marking of *-nodes, the roots of infeasible subtrees.

Phase 2 (top-down): recursive analysis of the *-subtrees to make the root accept.

Phase 3: local corrections.

[Figure: a tree whose nodes carry the states q0 and q1.]

Page 12: Phase 1: bottom-up marking

Definitions:
1. A terminal *-node is the first sink node of a run.
2. A *-subtree of a node v is the subtree whose root is v, reaching leaves or *-nodes.
3. A node v is a *-node if its state is a sink node when all possible reachable states replace the *-nodes of its *-subtree.
4. Compute the size of the subtrees.

[Figure: a tree with *-nodes.] Runs with all possible reachable states (q,q’) reach a sink.

O(n) procedure.

Lemma 1: If Dist(T,L) < k, there are at most k *-nodes.
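Definition 1 can be illustrated with a single bottom-up pass that records the first nodes where the run enters a sink state. This sketch covers terminal *-nodes only, not the full marking of definition 3. The tree encoding, the leaf state q0 and the sink q2 are assumptions carried over from the "Tree automata" slide.

```python
def terminal_star_nodes(tree, delta, leaf_state, sinks):
    """Paths (strings over 'L' and 'R') of the first nodes in a sink state."""
    marked = []

    def visit(node, path):
        if node is None:                      # a leaf
            return leaf_state
        left = visit(node[0], path + "L")
        right = visit(node[1], path + "R")
        state = delta[(left, right)]
        if state in sinks and left not in sinks and right not in sinks:
            marked.append(path)               # the run first hits a sink here
        return state

    visit(tree, "")
    return marked


# Automaton of the "Tree automata" slide, with "-" expanded to all states.
STATES = ("q0", "q1", "q2")
DELTA = {("q0", "q0"): "q1", ("q0", "q1"): "q1",
         ("q1", "q1"): "q2", ("q1", "q0"): "q2"}
for q in STATES:
    DELTA[("q2", q)] = "q2"
    DELTA[(q, "q2")] = "q2"

# The right child of the root has an internal node as its left child.
tree = (None, ((None, None), None))
print(terminal_star_nodes(tree, DELTA, "q0", {"q2"}))  # ['R']
```

Each node is visited once, so the pass is linear in the size of the tree, in line with the O(n) bound stated on the slide.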

Page 13: Phase 2: top-down possible states

Let (q,q’) be a possible choice at the top *-subtree.

Let q’’ be a possible state for the *-node of the left *-subtree.

[Figure: a *-node whose children are in states q1 and q2; the state q’’ is needed instead of *.]

Correction needed.

Page 14: Case analysis of the automaton

Case 1: One essentially-connected component.

Case 2: General case, many components.

Page 15: Case 1: one component

Lemma: if (q1,q2,q’’) are in the same connected component, there is a finite subtree t which can correct.

Case a: there is a transition (q,q’) -> q’’ with both q,q’ in C: there is a finite tree t1 from q1 to q, a finite tree t2 from q2 to q’, and the correction is:

[Figure: the node q’’ over (q1, q2) is corrected to q’’ over (q, q’), with t1 inserted above q1 and t2 inserted above q2.]

Page 16: Case 1: b and c

Case b: there is a transition (q,q’) -> q’’ with one of q or q’ being q0. Suppose q=q0: the correction uses t2 and cuts the left branch.

[Figure: the node q’’ over (q1, q2) is corrected to q’’ over (q0, q’), with t2 inserted above q2.]

Case c: there is a transition (q0,q0) -> q’’. The correction cuts both branches.

[Figure: the node q’’ over (q1, q2) is corrected to q’’ over (q0, q0).]

Page 17: Correction rules

[Figure: a *-node with children in states q1 and q2; q’’ instead of *.]

q1   q2   q        q’        q’’   Action
          q in C   q’ in C   q’’   Insert, Insert
          q0       q’        q’’   Cut, Insert

Page 18: Case 2: many components

Hypothesis: q1 in Ci, q2 in Cj, q’’ in Ck.

Case a: P such that Ci < Ck and Cj < Ck. Find t1 and t2 as in case 1.a.

[Figure: the node q’’ over (q1, q2) is corrected to q’’ over (q, q’), with t1 inserted above q1 and t2 inserted above q2.]

Page 19: Case 2: b and c

Case b, c: P such that Ci > Ck and Cj < Ck. Find t2 and let Cp = inf(Ci,Ck). Cut the left branch until Cp.

Case d: P such that Ci > Ck and Cj > Ck. Let Cp = inf(Ci,Ck). Cut the left branch until Cp. Let Cq = inf(Cj,Ck). Cut the right branch until Cq.

[Figure: for cases b and c, the node q’’ over (q1, q2) is corrected to q’’ over (q’’, q’), with t2 inserted above q2; for case d, it is corrected to q’’ over (q’’, q’’).]

Page 20: Correction rules

[Figure: a *-node with children in states q1 and q2; q’’ instead of *.]

q1 (C1)    q2 (C2)    q (C)    q’ (C’)   q’’ (C’’)   Action
C1 < C’’   C2 < C’’   C1 < C   C2 < C’   q’’         Insert, Insert
…          …          …        …         …           …

Page 21: Analysis of the corrector

Fact 1: finitely many insertions.

Fact 2: deletions are less predictable.

Lemma: If the cut is large, then the distance must be large.

General Corrector:

1. Do the inductive marking bottom-up.

2. Apply the recursive analysis of compatible states top-down.

3. For each transition (q,q’) -> q’’, apply the correction, compute the distance and select the rule with the smallest distance.

4. Select the * states with minimum Dist.

The procedure is O(n), exponential in k and size(Q).

Page 22: General result

Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k.

Proof: the number of *-nodes is less than k.

Case 1: no *-node: no correction.

Case 2: at least one *-node. Looking at all possible k-variations will correct the errors in the *-subtree and decrease the number of *-nodes.

Page 23: Unranked trees: XML

Labelled trees of large degree. Structure given by a « grammar », or DTD.

Generalization of automata:
1. Unranked tree automaton
2. Tree-walking automaton

Method: code an unranked labelled tree with a binary labelled tree.

Advantage: the correction table is FINITE.

Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k.
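One standard way to code an unranked tree as a binary tree is the first-child / next-sibling encoding, sketched below in Python. The talk does not say which encoding it uses, so this particular choice, the tuple representation and the function names are assumptions.

```python
def encode(tree):
    """Unranked tree (label, [children]) -> binary tree (label, left, right).
    left encodes the first child, right encodes the next sibling."""
    def enc(siblings):
        if not siblings:
            return None
        (label, children), rest = siblings[0], siblings[1:]
        return (label, enc(children), enc(rest))
    return enc([tree])


def decode(binary):
    """Inverse of encode."""
    def dec(node):
        if node is None:
            return []
        label, first_child, next_sibling = node
        return [(label, dec(first_child))] + dec(next_sibling)
    return dec(binary)[0]


book = ("book", [("chapter", [("title", []), ("para", [])]),
                 ("title", []),
                 ("author", [])])
binary = encode(book)
print(binary[0], binary[1][0], binary[1][2][0])  # book chapter title
print(decode(binary) == book)                    # True
```

Whatever the degree of the original nodes, every node of the encoded tree has at most two children, which is what keeps a correction table over pairs of states finite.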

Page 24: Applications to XML

DTD

<?xml version='1.0' ?>
<!ELEMENT book (chapter*,title,author)>
<!ELEMENT chapter (title,para*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
<!ELEMENT author (#PCDATA)>

Binary Normal Form

l -> l1, a
l1 -> c1, t
c1 -> c, c1
c1 -> -
c -> t, p1
p1 -> p, p1
p1 -> -
a -> data
t -> data
p -> data

Page 25: XML tree decomposition

XML file transformed into a binary labelled tree.

Page 26: XML file with errors

Page 27: Corrected XML file

No ambiguities on the possible states of q’’.

Immediate correction!

Page 28: XML Correction rules

[Figure: a *-node with children in states q1 and q2; q’’ instead of *.]

q1   q2   q   q’   q’’   Action
-    p1   t   p1   c     Insert, Link
…    …    -   -    -     Delete, Delete

Page 29: Java Implementation

Parser: Xerces. Tree structure: DOM.

DTD: d -> (c,b,a) or (f,b,a); c -> (a,b,b); f -> (a,b,b,a)

[Figure: a tree rooted at d with children *, b, a; the *-node has children a, b, c.]

Phase 1: look at the parent node of the *-node. Propose tags for * (c or f).

Phase 2: for each proposal, compute the distance.

*=c, distance=1, replacing c with b.

*=f, distance=2, replacing c with b and adding an a leaf.

Choose the first solution.
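The choice between the two proposals can be reproduced with a word edit distance on the labels of the children. The sketch is in Python rather than the Java, Xerces and DOM setting of the talk, it compares only the direct children of the *-node, and the names CONTENT and best_tag are hypothetical.

```python
def edit_distance(xs, ys):
    """Levenshtein distance (insert, delete, replace) between two sequences."""
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        cur = [i]
        for j, y in enumerate(ys, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # replace x by y
        prev = cur
    return prev[-1]


# Content models of the two candidate tags, taken from the slide's DTD.
CONTENT = {"c": ["a", "b", "b"], "f": ["a", "b", "b", "a"]}


def best_tag(children, candidates):
    """Candidate tag whose content model is closest to the observed children."""
    return min(candidates,
               key=lambda tag: edit_distance(children, CONTENT[tag]))


observed = ["a", "b", "c"]                      # children of the *-node
print(edit_distance(observed, CONTENT["c"]))    # 1
print(edit_distance(observed, CONTENT["f"]))    # 2
print(best_tag(observed, ["c", "f"]))           # c
```

The two distances match the values on the slide: one replacement for c, one replacement plus one added leaf for f.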

Page 30: Relative value of documents

• Given a DTD, mark the Web documents as follows:
– Infinity if they are far
– Dist(Document,DTD)=i

• Provides a relative valued landscape. Works for boolean combinations.

• Generalize to
– Min{ Dist(D,DTD’) : DTD’ … DTD }

Page 31: Distance on words and trees

• On words, how can one compute
– Dist(w,w’), a P-problem
– Is it possible in less than O(n)?
• Yes, STOC 2003
– Dist(w,L) and Dist(L,L’)

• Given two trees, how can one compute:
– Dist(T,T’): P on ordered trees and NP-complete on unordered trees
– p-Dist(T,T’): NP-complete.

Page 32: Conclusion

• Testers and Correctors
– Testers for approximate verification
– Correctors

• Trees
– Regular trees are testable
– If T is at distance less than k, then we can correct it.

• Theoretical algorithms

• Practical algorithms

Page 33: Testers, Correctors and formal verification

Two different views of logical verification, and their combination:

1. Formal verification. How can we check if a program satisfies a specification? Logical proof: theorem proving, model checking.

2. Design a tester for the specification (closer to practice: Windows 95 to XP!) (Blum & Kannan).

3. Combine the two approaches to approximately verify a specification (LICS 2002, Sylvain’s thesis).

Page 34: Testers

Self-testers and correctors for Linear Algebra: Blum & Kannan 1985s

Testers for graph properties, k-colorability: Goldreich et al. 1995s

$\Sigma_2$ graph properties have testers: Alon et al. 1999

Regular languages have testers: Alon et al. 2000s

Testers for regular tree languages (Mdr and Magniez)

Corrector for regular trees!

[Figure: the region F and the regions far from F, one of them indexed by k.]

Page 35: Blum’s Checker and Tester

Checker for f (Blum, Kannan, ~1990)

[Figure: the checker C queries the program P on an input x, reads its output y, and answers Correct or Buggy.]

A checker is a probabilistic program C with an oracle P such that for all x,k:

if P=f, C(x,k) = Correct

if P(x) != f(x), Prob[C(x,k) = Buggy] > 1 - (1/2)^k

Page 36: Self-testing

• Distance d(f,g) = |{x : f(x) != g(x)}| / |D|

• A self-tester for f is a probabilistic program $T(P,\varepsilon)$ such that:
– If d(P,f)=0, then $T(P,\varepsilon)$=Correct
– If d(P,f) > $\varepsilon$, then $T(P,\varepsilon)$=Buggy

• Corrector. Division(x,y): Majority { x.r / y.r : r random }
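The division corrector can be sketched as a majority vote over randomly rescaled queries. Everything named below is an assumption made for the example: the program buggy_divide with its single planted error, the number of trials, and the range of the random multipliers.

```python
import random
from collections import Counter
from fractions import Fraction


def buggy_divide(x, y):
    """Hypothetical program P: exact division, except on one input pair."""
    if (x, y) == (6, 3):
        return Fraction(0)               # planted bug
    return Fraction(x, y)


def self_correct_divide(program, x, y, trials=15, rng=random):
    """Majority vote over program(x*r, y*r) for random multipliers r >= 2."""
    votes = Counter()
    for _ in range(trials):
        r = rng.randint(2, 1000)
        votes[program(x * r, y * r)] += 1
    return votes.most_common(1)[0][0]


print(buggy_divide(6, 3))                       # 0: the program is wrong here
print(self_correct_divide(buggy_divide, 6, 3))  # 2: the vote avoids the bug
```

The vote works because (x.r)/(y.r) equals x/y for every r, so each query lands on a random input where the program is correct with high probability when d(P,f) is small.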

Page 37: Property testing on graphs

2-Colorability (G bipartite)

K is the set of graphs $G = (D, E)$.

H: random subgraph of G.

If G is bipartite, then Prob[H is bipartite] = 1.

If G is $\varepsilon$-far from bipartite, then Prob[H is non-bipartite] > 2/3.
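A tester of this shape samples a few vertices and checks the induced subgraph exactly. The Python sketch below shows the one-sided behaviour only. The sample size needed for the 2/3 guarantee is not derived here, and the graph encoding (integer vertices, edges as two-element frozensets) and function names are assumptions.

```python
import random
from collections import deque


def is_bipartite(vertices, edges):
    """2-colour by BFS; edges is a set of two-element frozensets."""
    adj = {v: [] for v in vertices}
    for e in edges:
        u, v = tuple(e)
        adj[u].append(v)
        adj[v].append(u)
    colour = {}
    for start in vertices:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False
    return True


def bipartite_tester(vertices, edges, sample_size, rng=random):
    """Accept iff the subgraph H induced by a random vertex sample is bipartite."""
    k = min(sample_size, len(vertices))
    sample = set(rng.sample(sorted(vertices), k))
    induced = {e for e in edges if e <= sample}
    return is_bipartite(sample, induced)


V = set(range(10))
K55 = {frozenset((i, j)) for i in range(5) for j in range(5, 10)}
K10 = {frozenset((i, j)) for i in range(10) for j in range(i + 1, 10)}
print(bipartite_tester(V, K55, 4))  # True: a bipartite G is never rejected
print(bipartite_tester(V, K10, 3))  # False: any 3 vertices of K10 form a triangle
```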

Page 38: Property testing on graphs

3-Colorability

H: random subgraph of G.

If G is 3-colorable, then Prob[H is 3-colorable] = 1.

If G is $\varepsilon$-far from 3-colorable, then Prob[H is non 3-colorable] > 2/3.

Generalization to k-colorability.

Page 39: Property testing and descriptive complexity

• Which graph (and matrix) properties have testers?
– Alon et al., STOC 99: Sigma 2 testers

• Compression: does U satisfy g? Does V satisfy g? (U and V are $\varepsilon$-equivalent.)

Page 40: Property testing on words

F: 0*1*

K is the set of words $W = (D, U, <)$.

H: random subword of the word W.

If W |= F, then Prob[H |= F’] = 1.

If W is $\varepsilon$-far from F, then Prob[H |= not F’] > 2/3.

Page 41: A testable regular property

How can we verify F: 0*1*?

distance(w,w’) = Hamming distance

Word W: 000011110111 .....

H: random subword of W, checked against F’.

If W |= F, then Prob[H |= F’] = 1.

If W is $\varepsilon$-far from F, then Prob[H |= not F’] > 2/3.

If W is far from F, many occurrences of 10 appear in W. Repeating the test will detect one with high probability.
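A minimal Python sketch of such a tester follows. It assumes that the sample is a random set of positions read in order, and that F’ is simply F applied to the sampled symbols. The talk's F’ may be defined differently, and the sample size needed for the 2/3 bound is not derived here.

```python
import random


def subword_tester(word, sample_size, rng=random):
    """Accept iff the symbols at random positions, read in order, lie in 0*1*."""
    k = min(sample_size, len(word))
    positions = sorted(rng.sample(range(len(word)), k))
    sampled = "".join(word[i] for i in positions)
    # a binary word is in 0*1* exactly when no 1 is directly followed by a 0
    return "10" not in sampled


def repeated_tester(word, sample_size, rounds, rng=random):
    """Reject as soon as one round rejects; accept otherwise."""
    return all(subword_tester(word, sample_size, rng) for _ in range(rounds))


good = "0" * 20 + "1" * 20
print(repeated_tester(good, 5, 10))  # True: words of 0*1* are never rejected
print(subword_tester("1100", 4))     # False: the whole word is sampled
```

The error is one-sided: any subsequence of a word in 0*1* is again in 0*1*, so only words outside the language can be rejected.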

Page 42: Regular properties are testable

Theorem. Regular languages are testable.

N. Alon, M. Krivelevich, I. Newman, M. Szegedy, FOCS 99.

General idea: if a word is far from a regular language, it contains many subwords which are infeasible and can be detected.

Theorem. Dyck languages are not testable.