japan june 2003 1
The correction of XML data
Université Paris II & LRI
Michel de Rougemont
http://www.lri.fr/~mdr
1. Approximation and Edit Distance
2. Testers and Correctors
3. Correcting regular binary trees
4. Applications to XML. Practical corrector
5. Relative value of documents
Approximation
1. Relations: $\mathrm{Dist}(R,S) = \#\{x : R(x) \neq S(x)\}$, and $R \sim S$ if $\mathrm{Dist}(R,S) < \varepsilon \cdot n^{\mathrm{arity}(R)}$
2. Edit-distance
3. Trees: Tree-Edit-Distance, Min # Deletions, Insertions
[Figure: Left-deletion and Left-insertion on a tree]
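The distance on relations can be sketched in a few lines. This is a minimal illustration, assuming relations are stored as Python sets of tuples over a domain of size n, and reading the closeness threshold as ε · n^arity; the names `dist` and `is_close` are hypothetical and do not come from the talk.

```python
def dist(r, s):
    """Number of tuples x on which R(x) and S(x) disagree."""
    return len(r ^ s)  # symmetric difference of the two tuple sets


def is_close(r, s, n, arity, eps):
    """True when Dist(R,S) < eps * n**arity (assumed reading of the threshold)."""
    return dist(r, s) < eps * n ** arity


# Two binary relations over the domain {0, 1, 2}
R = {(0, 1), (1, 2)}
S = {(0, 1), (2, 2)}
print(dist(R, S))                               # 2
print(is_close(R, S, n=3, arity=2, eps=0.25))   # True, since 2 < 2.25
```

The threshold scales with the number of possible tuples, so the same ε means the same fraction of disagreement whatever the arity.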
Classical Tree-Edit-Distance
Binary trees: p-Distance allows permutation
[Figure: three trees over the labels a, b, c, d, e, f, related by a Deletion and an Insertion]
$\mathrm{Dist}(T_1,T_2) = 2$, $\text{p-Dist}(T_1,T_2) = 1$
$\mathrm{Dist}(T,L) = \min_{T' \in L} \mathrm{Dist}(T,T')$
Approximate satisfiability
1. Satisfiability: Tree $\models$ F
2. Approximate satisfiability: Tree $\models_\varepsilon$ F
[Figure: image on a class K of trees, showing the trees that satisfy F and the trees that are $\varepsilon$-far from F]
Logic, testers, correctors
A Tester decides $\models_\varepsilon$ for a formula F.
A Corrector takes a tree T close to a language L and finds a tree T’ in L close to T.
This is possible if F follows a simple logic.
General problem: given a language L defined in some logic, find a corrector.
Theorem. There is a linear-time corrector for regular binary trees, up to a constant factor in the distance: given a tree T that is k-close to a regular language L, we find in linear time a tree T’ in L that is c.k-close to T.
Theorem (implicit in Alon et al., FOCS 2000). There is a linear-time corrector for regular words and distance $\varepsilon \cdot n$.
Application to Model-Checking (LICS 2002).
Simple example
Tester for 0+ 1* 0+
000000011111110000010000: probably accepted
011110000000110111: rejected with high probability
Types of segments:
[Figure: the word 0000011111000111110001100 cut into segments, each labelled with its type]
Corrector for 0+ 1* 0+
[Figure: the word 00000001111110000100000 with one position marked *]
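A corrector for this language can be sketched with a small dynamic program. The sketch below assumes Hamming distance (letters may be flipped, not inserted or deleted) and returns a nearest word of 0+ 1* 0+; the function name `correct` and the three-phase encoding are illustrative choices, not the construction used in the talk.

```python
def correct(word):
    """Return (distance, nearest word in 0+ 1* 0+) for a binary string."""
    n = len(word)
    if n < 2:
        raise ValueError("every word of 0+ 1* 0+ has length at least 2")
    emit = "010"            # letter written in phase 0 (0+), 1 (1*), 2 (0+)
    inf = float("inf")
    # cost[i][s]: fewest flips in word[:i+1] when position i is in phase s
    cost = [[inf] * 3 for _ in range(n)]
    back = [[0] * 3 for _ in range(n)]
    cost[0][0] = int(word[0] != "0")      # the first letter is in phase 0
    for i in range(1, n):
        for s in range(3):
            flip = int(word[i] != emit[s])
            for p in range(s + 1):        # stay in s, or advance from p < s
                if cost[i - 1][p] + flip < cost[i][s]:
                    cost[i][s] = cost[i - 1][p] + flip
                    back[i][s] = p
    # The last letter must be in phase 2; walk the back-pointers to the start.
    s, letters = 2, []
    for i in range(n - 1, -1, -1):
        letters.append(emit[s])
        s = back[i][s]
    return cost[n - 1][2], "".join(reversed(letters))


print(correct("00000001111110000100000"))
# (1, '00000001111110000000000')
```

On the word from the slide, a single flip of the isolated 1 is enough, which is the kind of local correction the corrector is meant to find.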
Regular Trees
• Tree-automata
• Logical definability on trees
• Tree grammar
• Regular expression
r(a,b(a,b(a,b(a,b(a,b(a,b)....)
r(a(a,b(a,b(a(a,b),b)....),b)
Tree automata
$A = (Q, q_0, \delta, q_1)$
• (q0,q0) → q1
• (q0,q1) → q1
• (q1,q1) → q2
• (q1,q0) → q2
• (q2,-) → q2
• (-,q2) → q2
[Figure: runs of the automaton on two trees, with the nodes labelled by the states q0, q1 and q2]
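A bottom-up run of such an automaton can be sketched as follows. The sketch assumes that leaves are read in state q0, that q1 is the accepting state and that q2 is a sink; the tree encoding (`None` for a leaf, a pair for an internal node) and the function names are illustrative.

```python
# Transitions of the example automaton; any pair not listed goes to the sink.
DELTA = {
    ("q0", "q0"): "q1",
    ("q0", "q1"): "q1",
    ("q1", "q1"): "q2",
    ("q1", "q0"): "q2",
}
SINK = "q2"
ACCEPTING = {"q1"}       # assumption: q1 is the accepting state


def run(tree):
    """State reached at the root: None is a leaf, (left, right) a node."""
    if tree is None:
        return "q0"      # assumption: every leaf is read in state q0
    left, right = tree
    return DELTA.get((run(left), run(right)), SINK)


def accepts(tree):
    return run(tree) in ACCEPTING


right_comb = (None, (None, (None, None)))
print(accepts(right_comb))              # True
print(accepts(((None, None), None)))    # False: (q1,q0) leads to q2
```

With these transitions the accepted trees are the right combs, where every left child is a leaf.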
Feasible and infeasible subtrees
Definition: a subtree t is feasible for L if there are subtrees (for its leaves) which reach states (q1...ql) such that the state of the root, q = t(q1...ql), can reach an accepting state (in the automaton for L).
A subtree is infeasible if it is not feasible.
[Figure: a feasible subtree and an infeasible subtree]
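For a complete subtree (one with no open leaves), feasibility reduces to asking whether the state at its root can still reach an accepting state. A minimal sketch of that check, assuming a binary automaton given as a dictionary of transitions and a known set of reachable states; this is a simplification of the definition above, and the names are hypothetical.

```python
def coreachable(delta, accepting, reachable):
    """States from which an accepting state can be reached in some context.

    delta maps (left_state, right_state) to the parent state; reachable is
    the set of states that some tree can produce (used for the sibling).
    """
    good = set(accepting)
    changed = True
    while changed:                      # fixpoint: at most |Q| rounds
        changed = False
        for (left, right), parent in delta.items():
            if parent not in good:
                continue
            for state, sibling in ((left, right), (right, left)):
                if state not in good and sibling in reachable:
                    good.add(state)
                    changed = True
    return good


delta = {("q0", "q0"): "q1", ("q0", "q1"): "q1",
         ("q1", "q1"): "q2", ("q1", "q0"): "q2"}
print(sorted(coreachable(delta, {"q1"}, {"q0", "q1", "q2"})))
# ['q0', 'q1']
```

In this example a complete subtree is infeasible exactly when its root state is the sink q2.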
Infeasible subtrees
Fact. If $\mathrm{Distance}(T,L) > \varepsilon \cdot n$, then the number of infeasible subtrees of length a is O(n).
Fact. If the distance is small, there are few infeasible subtrees.
Intuition: make local corrections at the roots of the infeasible subtrees.
Structure of the corrector
Phase 1 (bottom-up): marking of the *-nodes, the roots of infeasible subtrees.
Phase 2 (top-down): recursive analysis of the *-subtrees to make the root accept.
Phase 3: local corrections.
[Figure: a tree with states q0 and q1]
Phase 1: bottom-up marking
Definitions:
1. A terminal *-node is the first sink node of a run.
2. The *-subtree of a node v is the subtree rooted at v whose leaves are leaves of the tree or *-nodes.
3. A node v is a *-node if its state is a sink when the *-nodes of its *-subtree are replaced by all possible reachable states.
4. Compute the size of the subtrees.
[Figure: a tree with three *-nodes; runs with all possible reachable states (q,q’) reach a sink]
O(n) procedure.
Lemma 1: If Dist(T,L) < k, there are at most k *-nodes.
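The marking phase can be sketched as a single bottom-up pass that carries sets of possible states. This is one reading of the definitions, assuming the example automaton with leaves in state q0 and sink q2, and assuming that a *-node is replaced by every reachable non-sink state when its ancestors are evaluated; it is an illustration, not the exact procedure of the talk.

```python
# Transitions of the example automaton; any pair not listed goes to the sink.
DELTA = {("q0", "q0"): "q1", ("q0", "q1"): "q1",
         ("q1", "q1"): "q2", ("q1", "q0"): "q2"}
SINK = "q2"
LIVE = {"q0", "q1"}      # assumed: the reachable states other than the sink


def mark(tree, path=(), stars=None):
    """Return (possible states at this node, paths of the *-nodes found).

    A tree is None (leaf) or a pair (left, right); a path is a tuple of
    0/1 choices from the root.
    """
    if stars is None:
        stars = []
    if tree is None:                      # a leaf is read in state q0
        return {"q0"}, stars
    left, _ = mark(tree[0], path + (0,), stars)
    right, _ = mark(tree[1], path + (1,), stars)
    states = {DELTA.get((l, r), SINK) for l in left for r in right} - {SINK}
    if not states:                        # every run reaches the sink here
        stars.append(path)                # mark the node as a *-node
        return set(LIVE), stars           # above it, try every live state
    return states, stars


tree = (None, ((None, None), None))       # the right child cannot be accepted
print(mark(tree))                         # ({'q1'}, [(1,)])
```

Each node is visited once and the state sets have bounded size, so the pass is linear in the size of the tree for a fixed automaton.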
Phase 2: top-down possible states
Let (q,q’) be a possible choice at the top *-subtree.
Let q’’ be a possible state for the *-node of the left *-subtree.
[Figure: a *-node whose children reach the states q1 and q2; the node takes state q’’ instead of *]
Correction needed.
Case analysis of the automaton
Case 1: One essentially-connected component.
Case 2: General case, many components.
Case 1: one component
Lemma: if q1, q2 and q’’ are in the same connected component, there is a finite subtree t which can correct.
Case a: there is a transition (q,q’) to q’’ with both q and q’ in C. There is a finite tree t1 from q1 to q and a finite tree t2 from q2 to q’, and the correction inserts t1 and t2.
[Figure: a node in state q’’ with children in states q1 and q2; after the correction, t1 leads from q1 to q and t2 leads from q2 to q’ below q’’]
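Case 1: b and c
Case b: there is a transition (q,q’) to q’’ with one of q or q’ being q0. Suppose q = q0. The correction uses t2 and cuts the left branch.
[Figure: a node in state q’’ with children q1 and q2; after the correction, the left child is a leaf in state q0 and t2 leads from q2 to q’]
Case c: there is a transition (q0,q0) to q’’. The correction cuts both branches.
[Figure: a node in state q’’ with children q1 and q2; after the correction, both children are leaves in state q0]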
Correction rules (q’’ instead of *)
q1 | q2 | q      | q’      | q’’ | Action
   |    | q in C | q’ in C | q’’ | Insert, Insert
   |    | q0     | q’      | q’’ | Cut, Insert
[Figure: a *-node with children in states q1 and q2]
Case 2: many components
Hypothesis: q1 in Ci, q2 in Cj, q’’ in Ck
Case a: P such that Ci < Ck and Cj < Ck. Find t1 and t2 as in case 1.a.
[Figure: a node in state q’’ with children q1 and q2; after the correction, t1 leads from q1 to q and t2 leads from q2 to q’ below q’’]
Case 2: b and c
Case b,c: P such that Ci > Ck and Cj < Ck. Find t2 and let Cp = inf(Ci,Ck). Cut the left branch until Cp.
Case d: P such that Ci > Ck and Cj > Ck. Let Cp = inf(Ci,Ck) and cut the left branch until Cp. Let Cq = inf(Cj,Ck) and cut the right branch until Cq.
[Figure: the corrections for these cases on a node in state q’’ with children q1 and q2]
Correction rules (q’’ instead of *)
q1 in C1 | q2 in C2 | q in C | q’ in C’ | q’’ in C’’ | Action
C1 < C’’ | C2 < C’’ | C1 < C | C2 < C’  | q’’        | Insert, Insert
…        | …        | …      | …        | …          | …
[Figure: a *-node with children in states q1 and q2]
Analysis of the corrector
Fact 1: finitely many insertions.
Fact 2: deletions are less predictable.
Lemma: If the cut is large, then the distance must be large.
General Corrector:
1. Do the inductive marking bottom-up.
2. Apply the recursive analysis of compatible states top-down.
3. For each transition (q,q’) -> q’’, apply the correction, compute the distance, and select the rule with the smallest distance.
4. Select the * states with minimum Dist.
The procedure is O(n), exponential in k and size(Q).
General result
Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k.
Proof: # *-nodes < k.
Case 1: no *-node: no correction.
Case 2: at least one *-node. Looking at all possible k-variations will correct the errors in the *-subtree and decrease the number of *-nodes.
Unranked trees: XML
Labelled trees of large degree. Structure given by a « grammar », or DTD.
Generalization of automata:
1. Unranked tree automaton
2. Tree-walking automaton
Method: code an unranked labelled tree with a binary labelled tree.
Advantage: the correction table is FINITE.
Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k.
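One standard way to code an unranked tree as a binary tree is the first-child / next-sibling encoding, sketched below with Python's `xml.etree.ElementTree`. This is an assumption made for illustration: the talk derives its encoding from a binary normal form of the DTD, which may differ. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode(siblings):
    """Encode a list of sibling elements as a binary tree.

    The result is None or a triple (label, left, right), where left encodes
    the children of the first element and right encodes its next siblings.
    """
    if not siblings:
        return None
    head, rest = siblings[0], siblings[1:]
    return (head.tag, encode(list(head)), encode(rest))


def to_binary(xml_text):
    return encode([ET.fromstring(xml_text)])


doc = "<book><chapter><title/></chapter><title/><author/></book>"
print(to_binary(doc))
```

Whatever the number of children of a node, every node of the encoded tree has at most two successors, which is what keeps a correction table over pairs of states finite.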
Applications to XML
DTD
<?xml version='1.0' ?>
<!ELEMENT book (chapter*,title,author)>
<!ELEMENT chapter (title,para*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
<!ELEMENT author (#PCDATA)>
Binary Normal Form
l -> l1, a
l1 -> c1, t
c1 -> c, c1
c1 -> -
c -> t, p1
p1 -> p, p1
p1 -> -
a -> data
t -> data
p -> data
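Each content model of this DTD is a regular expression over child tags, so validity can be checked element by element. A minimal sketch, assuming the five element declarations above and ignoring text content; the `CONTENT` table and the function name are illustrative, and this is a plain validity check rather than the corrector itself.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the DTD, as regular expressions over child tags.
# Each child tag is followed by one space so that tags cannot run together.
CONTENT = {
    "book": r"(chapter )*title author ",
    "chapter": r"title (para )*",
    "title": r"",       # #PCDATA: no child elements
    "para": r"",
    "author": r"",
}


def invalid_elements(xml_text):
    """Return the tags of the elements whose children break the content model."""
    bad = []
    for elem in ET.fromstring(xml_text).iter():
        children = "".join(child.tag + " " for child in elem)
        model = CONTENT.get(elem.tag)
        if model is None or re.fullmatch(model, children) is None:
            bad.append(elem.tag)
    return bad


good = ("<book><chapter><title>t</title><para>p</para></chapter>"
        "<title>T</title><author>A</author></book>")
print(invalid_elements(good))   # []
```

An element reported by this check is a natural place to start a correction, since its list of children is not in the language of its content model.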
XML tree decomposition
XML file transformed into a binary labelled tree.
XML file with errors
Corrected XML file
No ambiguities on the possible states of q’’
Immediate correction!
XML Correction rules (q’’ instead of *)
q1 | q2 | q | q’ | q’’ | Action
-  | p1 | t | p1 | c   | Insert, Link
…  | …  | - | -  | -   | Delete, Delete
[Figure: a *-node with children in states q1 and q2]
Java Implementation
Parser: Xerces. Tree structure: DOM.
DTD: d: (c,b,a) or (f,b,a); c: (a,b,b); f: (a,b,b,a)
[Figure: a tree with root d and children *, b, a; the *-node has children a, b, c]
Phase 1: look at the parent node of the *-node. Propose tags for * (c or f).
Phase 2: for each proposal, compute the distance.
*=c, distance=1, replacing c with b.
*=f, distance=2, replacing c with b and adding an a leaf.
Choose the 1st solution.
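The two phases on this example can be sketched as follows. The sketch is in Python rather than Java, it compares only the children of the *-node with each candidate content model using the classical edit distance on sequences, and the names `MODELS` and `best_tag` are hypothetical; the actual implementation works on DOM trees.

```python
def edit_distance(a, b):
    """Classical edit distance between two sequences (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete x
                           cur[j - 1] + 1,             # insert y
                           prev[j - 1] + (x != y)))    # replace x by y
        prev = cur
    return prev[-1]


# Content models from the example DTD: c is (a,b,b) and f is (a,b,b,a).
MODELS = {"c": ["a", "b", "b"], "f": ["a", "b", "b", "a"]}


def best_tag(children, candidates):
    """Score each proposed tag for the *-node and return (tag, distance)."""
    scores = {tag: edit_distance(children, MODELS[tag]) for tag in candidates}
    tag = min(scores, key=scores.get)
    return tag, scores[tag]


# The *-node has children a, b, c; its parent d allows c or f in that position.
print(best_tag(["a", "b", "c"], ["c", "f"]))   # ('c', 1)
```

The distances 1 and 2 match the two proposals on the slide, and the smaller one is selected.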
Relative value of documents
• Given a DTD, mark the Web documents as follows:
– Infinity if they are far
– Dist(Document,DTD) = i otherwise
• Provides a relative valued landscape. Works for boolean combinations.
• Generalize to:
– Min{ Dist(D,DTD’) : DTD’ […] DTD }
Distance on words and trees
• On words, how can one compute:
– Dist(w,w’), a problem in P. Is it possible in less than O(n)? Yes, STOC 2003.
– Dist(w,L) and Dist(L,L’)
• Given two trees, how can one compute:
– Dist(T,T’): in P on ordered trees and NP-complete on unordered trees
– p-Dist(T,T’): NP-complete.
Conclusion
• Testers and Correctors
– Testers for approximate verification
– Correctors
• Trees
– Regular trees are testable
– If T is at distance less than k, then we can correct it.
• Theoretical algorithms
• Practical algorithms
Testers, Correctors and formal verification
Two different views of logical verification:
1. Formal verification. How can we check if a program satisfies a specification? Logical proof: theorem proving, model checking.
2. Design a tester for the specification (closer to practice: Windows 95 to XP!) (Blum & Kannan)
3. Combine the two approaches to approximately verify a specification (LICS 2002, Sylvain’s thesis)
Testers
Self-testers and correctors for Linear Algebra: Blum & Kannan, 1985s
Testers for graph properties, k-colorability: Goldreich et al., 1995s
Graph properties have testers: Alon et al., 1999
Regular languages have testers: Alon et al., 2000s
Testers for regular tree languages (Mdr and Magniez)
Corrector for regular trees!
[Figure: F and the regions that are far from F]
Blum’s Checker and Tester
Checker for f (Blum, Kannan, ~1990)
[Figure: a checker C with an oracle P, input x and output y; C answers Correct or Buggy]
A checker is a probabilistic program C with an oracle P such that for all x, k:
– if P = f, then C(x,k) = Correct
– if P(x) != f(x), then $\mathrm{Prob}[C(x,k) = \text{Buggy}] > 1 - (1/2)^k$
Self-testing
• Distance d(f,g) = |{x : f(x) != g(x)}| / |D|
• A self-tester for f is a probabilistic program T(P, ε) such that:
– If d(P,f) = 0, then T(P, ε) = Correct
– If d(P,f) > ε, then T(P, ε) = Buggy
• Corrector. Division(x,y): Majority { x.r / y.r : r random }
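The division corrector can be illustrated on a finite field, where x.r / y.r equals x / y for every nonzero r. The sketch below assumes arithmetic modulo the prime 101 and a hypothetical buggy program that is wrong on about 4% of its inputs; all names are illustrative, and `pow(y, -1, m)` needs Python 3.8 or later.

```python
import random
from collections import Counter

P_MOD = 101   # hypothetical prime; all arithmetic is in the field Z/101Z


def true_div(x, y):
    """Exact division x / y modulo P_MOD (y must be nonzero)."""
    return x * pow(y, -1, P_MOD) % P_MOD


def buggy_div(x, y):
    """Hypothetical program P: wrong whenever x is a multiple of 25."""
    answer = true_div(x, y)
    return (answer + 1) % P_MOD if x % 25 == 0 else answer


def corrected_div(program, x, y, trials=15, rng=None):
    """Majority vote over program(x.r, y.r) for random nonzero r."""
    rng = rng or random.Random()
    votes = Counter()
    for _ in range(trials):
        r = rng.randrange(1, P_MOD)
        votes[program(x * r % P_MOD, y * r % P_MOD)] += 1
    return votes.most_common(1)[0][0]


print(buggy_div(25, 3) == true_div(25, 3))   # False: P is wrong on this input
```

Multiplying by a random r moves a fixed nonzero input to a uniformly random one, so the program is wrong on each call with small probability and the majority is correct with high probability.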
Property testing on graphs
2-Colorability
K is the set of G = (D,E).
[Figure: a graph G and a random subgraph H; bipartite means 2-colorable]
G bipartite ⇒ Prob[H is bipartite] = 1
G is ε-far from bipartite ⇒ Prob[H is non-bipartite] > 2/3
Property testing on graphs
3-Colorability
[Figure: a graph G and a random subgraph H]
G 3-colorable ⇒ Prob[H is 3-colorable] = 1
G is ε-far from 3-colorable ⇒ Prob[H is non 3-colorable] > 2/3
Generalization to k-colorability
Property testing and descriptive complexity
• Which graph (and matrix) properties have testers?
– Alon et al., STOC 99: Sigma 2 testers
• Compression.
[Figure: does U satisfy g? Does V satisfy g? U and V are ε-equivalent]
Property testing on words
F: 0*1*
K is the set of W = (D, U, <).
[Figure: a word W and a random subword H]
W |= F ⇒ Prob[H |= F’] = 1
W is ε-far from F ⇒ Prob[H |= not F’] > 2/3
A testable regular property
How can we verify F: 0*1*?
distance(w,w’) = Hamming distance
[Figure: a word W, for example 000011110111 ....., and a random subword H tested against F’]
W |= F ⇒ Prob[H |= F’] = 1
W is ε-far from F ⇒ Prob[H |= not F’] > 2/3
Many occurrences of 10 appear in W. Repeating the test will detect it with high probability.
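A tester for 0*1* can be sketched by sampling a few positions and checking the letters read at those positions, in order. This is a minimal illustration, assuming the sampled subword H is tested against 0*1* itself; the sample size and the function name are arbitrary choices, not values from the talk.

```python
import random
import re


def accepts_sample(word, samples=30, rng=None):
    """Sample positions of word and test the induced subword against 0*1*."""
    rng = rng or random.Random()
    count = min(samples, len(word))
    positions = sorted(rng.sample(range(len(word)), count))
    subword = "".join(word[i] for i in positions)
    return re.fullmatch(r"0*1*", subword) is not None


in_language = "0" * 50 + "1" * 50
far_word = "10" * 200    # needs many flips to reach 0*1*
print(accepts_sample(in_language))   # True: the error is one-sided
```

A word of the language is always accepted, because every subword of a word in 0*1* is again in 0*1*. A word that is far from the language contains many pairs of positions holding a 1 before a 0, so a sample of constant size finds one with high probability.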
Regular properties are testable
Theorem. Regular languages are testable. N. Alon, M. Krivelevich, I. Newman, M. Szegedy, FOCS 99.
General idea: if a word is far from a regular language, it contains many subwords which are infeasible and can be detected.
Theorem. Dyck languages are not testable.