28
Performance Analysis of Algorithms to Reason about XML Keys Emir Mu˜ noz Jim´ enez University of Santiago de Chile Yahoo! Labs LatAm Joint work with F. Ferrarotti, S. Hartmann, S. Link, M. Marin DEXA 2012 @ Vienna, Austria, 3rd September 2012 Introduction RW & Motivation XML Keys Results Conclusion 1/24

DEXA 2012 Talk

Embed Size (px)

Citation preview

Page 1: DEXA 2012 Talk

Performance Analysis of Algorithms toReason about XML Keys

Emir Munoz Jimenez

University of Santiago de Chile

Yahoo! Labs LatAm

Joint work with F. Ferrarotti, S. Hartmann, S. Link, M. Marin

DEXA 2012 @ Vienna, Austria, 3rd September 2012

Introduction RW & Motivation XML Keys Results Conclusion 1/24

Page 2: DEXA 2012 Talk

Contribution

An efficient implementation of an algorithm that decides theimplication problem for a tractable and expressive class ofXML keys.

Performance Analysis:

Implication problem over large sets of XML Keys.Non-redundant covers of XML keys and validation of XMLdocuments.

Our experiments show that reasoning about expressive notionsof XML keys can be done efficiently in practice and scales well.

Introduction RW & Motivation XML Keys Results Conclusion 2/24

Page 3: DEXA 2012 Talk

Outline

1 Introduction

2 Related Work & Motivation

3 Reasoning about XML Keys

4 Results

5 Conclusion

Introduction RW & Motivation XML Keys Results Conclusion 3/24

Page 4: DEXA 2012 Talk

Introduction

Keys

XML: de-facto standard for Web data exchange andintegration.

XML ↔ data storage ↔ data processing.

Both industry and academia have long since recognized theimportance of keys in XML data management.

Several notions of XML keys have been proposed. Mostinfluential is due to Buneman et al. who defined keys on thebasis of an XML tree model similar to the one suggested byDOM and XPath.

Introduction RW & Motivation XML Keys Results Conclusion 4/24

Page 5: DEXA 2012 Talk

IntroductionXML Tree Example (1/2)

Example (An XML tree representing an XML document)

E

E

E

A

E E

AA A

E E E

A

MemberTeam

rol

ATeam

Member

rol

A

LeaderTeam

rol

E E

E

SSmith

E

SDexter

E

SCooper

E

E

S

William

E

SBell

E

ATeam

Member

rol

E

E

SDexter

E

E

SCooper

A

E

E

SWilliam

E

SBell

E

A

E

E

SDavis

E

SBrook

E

SJohn

project

db

team

Riders

teamtname

Coyotes

pname

Phobia MediaLow

pname

project

tname

employeeemployeeemployee

name name

lnamefnamelnamefname

name

fname lname

employee

name

fname lname

employeerol

TeamMember

name

fname lname

employee

TeamLeader

rol

name

fname lname

Introduction RW & Motivation XML Keys Results Conclusion 5/24

Page 6: DEXA 2012 Talk

IntroductionXML Tree Example (2/2)

Some Constraints: (gathered as business rules)

(a) A project node is identified by pname, no matter where theproject node appears in the document.

(b) A team node can be identified by tname relatively to aproject node.

(c) Within any given subtree rooted at team, an employee nodeis identified by name.

Introduction RW & Motivation XML Keys Results Conclusion 6/24

Page 7: DEXA 2012 Talk

IntroductionXML Tree Example (2/2)

Some Constraints: (gathered as business rules)

(a) A project node is identified by pname, no matter where theproject node appears in the document. Absolute key

(b) A team node can be identified by tname relatively to aproject node. Relative key

(c) Within any given subtree rooted at team, an employee nodeis identified by name. Relative key

Introduction RW & Motivation XML Keys Results Conclusion 6/24

Page 8: DEXA 2012 Talk

Related Work & Motivation

Keys are important in many perennial tasks in databasemanagement.

The hope is that keys will turn out to be equally beneficial forXML.

In practice, expressive yet tractable notions of XML keys havebeen ignored so far.

We initiate an empirical study of the fragment of XML keyswith nonempty sets of simple key paths.

One of the most fundamental questions on keys is:Can we decide if a new key holds given a set of known keys?(Logical Implication Problem.)

Introduction RW & Motivation XML Keys Results Conclusion 7/24

Page 9: DEXA 2012 Talk

Related Work & MotivationKey implication example

Some Constraints: (gathered as business rules)

(a) A project node is identified by pname, no matter where theproject node appears in the document.

(b) A team node can be identified by tname relatively to aproject node.

(c) Within any given subtree rooted at team, an employee nodeis identified by name.

Suppose, the consideration of a further key:

(d) A project node can be identified in the document by its childnodes pname and team.

Introduction RW & Motivation XML Keys Results Conclusion 8/24

Page 10: DEXA 2012 Talk

Related Work & MotivationKey implication example

Some Constraints: (gathered as business rules)

(a) A project node is identified by pname, no matter where theproject node appears in the document.

(b) A team node can be identified by tname relatively to aproject node.

(c) Within any given subtree rooted at team, an employee nodeis identified by name.

Suppose, the consideration of a further key:

(d) A project node can be identified in the document by its childnodes pname and team. key (a) actually implies (d)!!because (d) is a superkey of (a).

Introduction RW & Motivation XML Keys Results Conclusion 8/24

Page 11: DEXA 2012 Talk

Keys for XMLDefinitions (1/2)

Definition (Value equality)

u is value equal to v (u =v v) if the trees rooted at u and v areisomorphic by an isomorphism that is the identity on string values.(e.g., element nodes employee for “Dexter Cooper”)

E

E

E

A

E E

AA A

E E E

A

MemberTeam

rol

ATeam

Member

rol

A

LeaderTeam

rol

E E

E

SSmith

E

SDexter

E

SCooper

E

E

S

William

E

SBell

E

ATeam

Member

rol

E

E

SDexter

E

E

SCooper

A

E

E

SWilliam

E

SBell

E

A

E

E

SDavis

E

SBrook

E

SJohn

project

db

team

Riders

teamtname

Coyotes

pname

Phobia MediaLow

pname

project

tname

employeeemployeeemployee

name name

lnamefnamelnamefname

name

fname lname

employee

name

fname lname

employeerol

TeamMember

name

fname lname

employee

TeamLeader

rol

name

fname lname

Introduction RW & Motivation XML Keys Results Conclusion 9/24

Page 12: DEXA 2012 Talk

Keys for XMLDefinitions (2/2)

Some Constraints: (in class K of XML keys)

(a) A project node is identified by pname, no matter where theproject node appears in the document.

(ε, (project , {pname}))

(b) A team node can be identified by tname relatively to aproject node.

(project , (team, {tname}))

(c) Within any give subtree rooted at team, an employee node isidentified by name.

( ∗.team, (employee , {name}))

(d) A project node can be identified by its child nodes pnameand team.

(ε, (project , {pname, team}))

Introduction RW & Motivation XML Keys Results Conclusion 10/24

Page 13: DEXA 2012 Talk

Deciding XML Key ImplicationNotions

We say that Σ implies ϕ, denoted by Σ |= ϕ, iff every finiteXML tree T that satisfies all σ ∈ Σ also satisfies ϕ.

Implication Problem for a class C of XML keys: given anyΣ ∪ {ϕ} in C, decide whether Σ |= ϕ.

e.g., let be Σ = {(a), (b), (c)} and ϕ = (d), then Σ |= ϕ istrue.

e.g., let be Σ = {(a), (b)} and ϕ = (c), then Σ |= ϕ is false.

Hartmann & Link characterize the implication problem for theclass K of XML keys in terms of the of reachability problemfor fixed nodes in a suitable digraph.

Introduction RW & Motivation XML Keys Results Conclusion 11/24

Page 14: DEXA 2012 Talk

Deciding XML Key ImplicationMini-trees and Witness Graphs (Hartmann & Link, 2009)

����������������

����������������

�������������������������

�������������������������

��������������������

������������������������������������

����������������

E

S

E

dbE

E public

national

project

E

S

E project

E

S

E

S

(1)

(3)

E

E

E national

public

(2)

yearpname

pname year

××

p

p1 p2

p′

rϕ = qϕ

r′ϕ

rϕ1 r

ϕ2

q′ϕq′ϕ

xϕ1

xϕ1

xϕ2

xϕ2

v′1v′1

v′2v′2

w1

w1

w′1

w′1

(a) Mini tree TΣ,ϕ

db

E

E

E

E E

SS

public

national

project

E

yearpname

××

q′ϕ

(b) Witness graph GΣ,ϕ

ϕ = (ε, (public. ∗.project , {pname.S , year .S}))

Introduction RW & Motivation XML Keys Results Conclusion 12/24

Page 15: DEXA 2012 Talk

An Efficient ImplementationData structures and implementation

Suitable data structures for Mini-tree & Witness-graph.

db

E

E

E

E E

SS

public

national

project

E

yearpname

××

q′ϕ

(c) Witness graph GΣ,ϕ

Node 1

Node 2

vertexEleNode 1

edgeEle

nodeEle

(public)

Node 3 Node 0

vertexEleNode 2

edgeEle edgeEle

nodeEle

(national)

Node 4 Node 6 Node 1

vertexEleNode 3

edgeEle edgeEle edgeEle

nodeEle

(project)

(S)

Node 7

(db)vertexEle

edgeEle

nodeEleNode 0

L

.....

(d) Adj. list for GΣ,ϕ

Introduction RW & Motivation XML Keys Results Conclusion 13/24

Page 16: DEXA 2012 Talk

XML Key Reasoning for Document ValidationApplications

Fast algorithms for the validation of XML documents againstkeys are crucial to ensure the consistency and semanticcorrectness of data stored or exchanged between applications.

We use our implementation of the implication problem tocompute non-redundant covers for sets of XML keys.

Which in turn can be used to significantly speed up theprocess of XML document validation against sets of XMLkeys.

Introduction RW & Motivation XML Keys Results Conclusion 14/24

Page 17: DEXA 2012 Talk

XML Key Reasoning for Document ValidationCover Sets for XML Keys

Fact

Σ is non-redundant if there is no key ψ ∈ Σ such thatΣ− {ψ} |= ψ.

Algorithm 1: Non-redundant Cover for XML keysInput : Finite set Σ of XML keysOutput: A non-redundant cover for Σ

Θ = Σ;1

foreach key ψ ∈ Σ do2

if Θ− {ψ} |= ψ then3

Θ = Θ− {ψ};4

return Θ;5

6

This set can be computed in O(|Σ| × (max{|ψ| : ψ ∈ Σ})2) time.

Introduction RW & Motivation XML Keys Results Conclusion 15/24

Page 18: DEXA 2012 Talk

Experimental ResultsPerformance Analysis

We analyze:

(i) The scalability of the implication problem.(ii) The viability of computing non-redundant cover sets to speed

up the validation of XML documents against XML keys.

For (i) we generated large sets of XML keys in the followingtwo systematic ways:

1. using a manually defined set of 5 to 10 XML keys as seeds, wecomputed new implied keys by successively applying theinference rules from the axiomatization presented byHartmann & Link (2009).

2. we defined some non-implied XML keys, adding witness edgeskeeping qϕ not reachable from q′

ϕ.

For (ii) we generated sets of keys following the same strategy.

Introduction RW & Motivation XML Keys Results Conclusion 16/24

Page 19: DEXA 2012 Talk

Experimental Results – Datasets (1/2)

Table: XML Documents.

Doc ID Document No. ofElements

No. ofAttributes

Size Max.Depth

AverageDepth

Doc1 321gone.xml 311 0 23 KB 5 3.76527Doc2 yahoo.xml 342 0 24 KB 5 3.76608Doc3 dblp.xml 29,494 3,247 1.6 MB 6 2.90228Doc4 nasa.xml 476,646 56,317 23 MB 8 5.58314Doc5 SigmodRecord.xml 11,526 3,737 476 KB 6 5.14107Doc6 mondial-3.0.xml 22,423 47,423 1 MB 5 3.59274

Introduction RW & Motivation XML Keys Results Conclusion 17/24

Page 20: DEXA 2012 Talk

Experimental Results – Datasets (2/2)

SigmodRecord (seed keys)(ε, (issue, {volume.S, number.S}))

(ε, ( ∗.issue, {volume.S, number.S})) Interaction rule

(issue, (articles,{article.title.S})) (Q, (Q′, {P.P1, . . . , P.Pk})),

(issue.articles, (article, {title.S})) (Q.Q′, (P, {P1, . . . , Pk}))

(issue, (articles.article,{initPage.S, endPage.S})) (Q, (Q′.P, {P1, . . . , Pk}))

(ε, (issue.articles.article.authors.author,{position})) (issue, (articles.article,{title.S}))

Sigmod.xml<SigmodRecord>

<issue>

<volume>13</volume>

<number>2</number>

<articles>

<article>

<title>Deadlock Detection is Cheap.</title>

<initPage>19</initPage>

<endPage>34</endPage>

<authors>

<author position="00">Rakesh Agrawal</author>

<author position="01">Michael J. Carey</author>

<author position="02">David J. DeWitt</author>

</authors>

</article> ...

</articles>

</issue> ...

</SigmodRecord>

Introduction RW & Motivation XML Keys Results Conclusion 18/24

Page 21: DEXA 2012 Talk

Experimental ResultsStrategies to generate implied and non-implied XML keys

If we apply the interaction rule to

- (issue, (articles, {article.title.S}))

- (issue.articles, (article, {title.S}))

We derived the implied key

- (issue, (articles.article, {title.S}))

We applied the interaction,context-target, subnodes, context-pathcontainment, target-path containment,subnodes-epsilon and prefix-epsilon ruleswhenever possible.

Non-implied keys

E

E

E

E

E

E

E

E

S

E

S

author

db

conference

issue

articles

article

lastfirst

q′ϕ

ℓ0 = content

Introduction RW & Motivation XML Keys Results Conclusion 19/24

Page 22: DEXA 2012 Talk

Results for Implication ProblemDeciding implication of XML keys

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

0 20 40 60 80 100 120 140

abs-absabs-relrel-absrel-rel

mix-absmix-rel

Tim

e[m

s]

Size of Σ

(e) XML key implication, all cases.

0

0.5

1

1.5

2

2.5

0 20 40 60 80 100 120 140

mix-relwildcard

Tim

e[m

s]

Size of Σ

(f) Effect of wildcards presence.

Figure: Performance of the Algorithm for the Implication of XML Keys(Σ |= ϕ).

Introduction RW & Motivation XML Keys Results Conclusion 20/24

Page 23: DEXA 2012 Talk

Results for Validation ProblemDocument Validation (1/2)

XML doc. Key Set Time[ms]

321gone & Processed Keys: 23 3.458yahoo Discarded keys: 15

(Doc1 & 2) Cover set: 8 keys

DBLP Processed Keys: 36 12.757(Doc3) Discarded keys: 24

Cover set: 12 keys

nasa Processed Keys: 35 9.23(Doc4) Discarded keys: 28

Cover set size: 7 keys

Sigmod Processed Keys: 24 5.294Record Discarded Keys: 19(Doc5) Cover set: 5 keys

mondial Processed Keys: 26 4.342(Doc6) Discarded Keys: 16

Cover set: 10

0

5000

10000

15000

20000

25000

30000

35000

Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

Runnin

g T

ime [m

s]

XML documents

952ms 1329ms

Full-setCover-set

>30min >25min 1.88min 5min

(a) Non-redundant Cover Sets. (b) Validation Against Cover Sets.

Figure: Non-redundant Cover Sets of XML keys and Validation of XMLDocuments

Introduction RW & Motivation XML Keys Results Conclusion 21/24

Page 24: DEXA 2012 Talk

Results for Validation ProblemDocument Validation (2/2)

There are some very expensive XPath queries.

SigmodRecord

(issue, (articles.article, {title.S})) - there are 1503 nodesin the XPath query “//issue/articles/article”.(ε, (issue.articles.article.authors.author, {S})) - there are3737 nodes in the XPath query“//issue/articles/article/authors/author”.

Introduction RW & Motivation XML Keys Results Conclusion 22/24

Page 25: DEXA 2012 Talk

Main conclusion

This work was motivated by two objectives:

1 To demonstrate that there are expressive classes of XML keys that can bereasoned about efficiently.

2 To show that our observations on the problem of deciding implication hasimmediate consequences for other perennial task in XML databasemanagement.

[1.1] We studied a fragment of XML keys with nonempty sets of simple keypaths.

[1.1.1] Presenting an efficient implementation for the implication problem.

[1.1.2] Showing through experiments that the proposed algorithm runs fast inpractice and scales well.

[2.1] We studied the problem of validating an XML document against a set ofXML keys.

[2.2] Presenting an optimization method for this validation via thecomputation of non-redundant cover sets of XML keys.

[2.3] Our experiments show that enormous time savings can be achieved inpractice.

Introduction RW & Motivation XML Keys Results Conclusion 23/24

Page 26: DEXA 2012 Talk

Questions?

THANKS!

Emir Munoz Jimenez– [email protected]

Introduction RW & Motivation XML Keys Results Conclusion 24/24

Page 27: DEXA 2012 Talk

Axiomatization

Let be PL from the grammar Q→ ℓ | ε | Q.Q | ∗

where ℓ ∈ L is a label, ε is the empty word, “.” is the concatoperator, and “ ∗” is the wildcard.

For nodes v and v′ of an XML tree T , the value intersectionof v[[Q]] and v′[[Q]] is given byv[[Q]] ∩v v

′[[Q]] = {(w,w′) | w ∈ v[[Q]], w′ ∈ v′[[Q]], w =v w′}

We define semantic closure by: Σ∗ = {ϕ ∈ C | Σ |= ϕ}

and also syntactic closure by: Σ+ℜ = {ϕ | Σ ⊢ℜ ϕ}

A set of rules ℜ is sound (complete) if Σ+ℜ ⊆ Σ∗(Σ∗ ⊆ Σ+

ℜ)

A sound and complete set of rules is called axiomatization

1/2

Page 28: DEXA 2012 Talk

Deciding XML Key ImplicationMini-trees and Witness Graphs (2/2)

Theorem (Hartman & Link, 2009)

Let Σ ∪ {ϕ} be a finite set of keys in the class K. We have Σ |= ϕ

if and only if qϕ is reachable from q′ϕ in GΣ,ϕ.

Algorithm 2: XML key implication in K

Input : finite set of XML keys Σ ∪ {ϕ} in KOutput: yes, if Σ |= ϕ; no, otherwise

Construct GΣ,ϕ for Σ and ϕ;1

if qϕ is reachable from q′ϕ in G then return yes; else return no; end if2

2/2