Approximate XML Query Answers
Presenter: Hongyu Guo
Authors: N. Polyzotis, M. Garofalakis, Y. Ioannidis
Outline of this talk
  Motivation
  TreeSketch Approach
  Experimental Results
  Contributions and Limitations
Motivations: need fast feedback
  XML is the de facto standard for data exchange
  Users need to explore large XML data sets and get fast feedback from complex XML queries
  Fast "on-line" response conflicts with query execution cost
XML Query Challenges
  Queries involve complex traversals of the XML data hierarchy
  Complex queries over massive tree-structured data are very expensive
  Approaches: optimize the query or optimize the data structure
  When exact results are not required, we can instead return approximate query answers
Approximate Query Answers
  Obtain an approximation to the true result
  Currently employed successfully in relational systems
  Use the approximate result to get timely feedback
  [Figure: a query evaluated over the XML data returns the XML result R; the same query evaluated over the synopsis returns the approximate XML result R']
TreeSketch Approach: a technique used to return fast, approximate results
Data and Query Model: some background on the XML document
  [Figure: example XML document tree. Tag legend: a = author, n = name, b = book, p = paper, y = year, k = keyword, t = title]
Data and Query Process: twig query, query tree, and nested result tree
Basic Query Scenario
  [Figure: a twig query is evaluated over the XML data to give the true nesting tree, and over the synopsis to give the approximate nesting tree]
  Key idea: return fast, accurate feedback
Approximate Query Answers: two key problems
  How to construct concise XML synopses that capture the statistical traits of the true data
  How to produce approximate query answers over the synopsis efficiently
TreeSketch Construction: construction algorithm
  Step 1: Given an XML tree T, build a graph synopsis in which each node represents a set of same-tag elements (a large tree)
  Step 2: Compress the synopsis by merging nodes with similar sub-structures (i.e. clustering of the XML elements)
  Step 3: Repeat Step 2 until the predefined space budget constraint is met
  Step 4: Return the TreeSketch synopsis
  [Figure: sequence of synopses, from the perfect synopsis down to one that fits the space budget]
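Steps 2 to 4 can be sketched as a greedy loop. This is only an illustration under stated assumptions, not the authors' algorithm: the slides do not say how the pair to merge is chosen, so `merge_cost` below uses a size-weighted squared distance between mean child counts as a stand-in. The space budget is approximated by a maximum number of synopsis nodes, and child counts are keyed by tag rather than by synopsis node to keep the code short.

```python
from itertools import combinations

def merge_cost(a, b):
    """Assumed criterion: size-weighted squared distance between mean child counts."""
    tags = set(a["avg"]) | set(b["avg"])
    dist = sum((a["avg"].get(t, 0.0) - b["avg"].get(t, 0.0)) ** 2 for t in tags)
    return dist * a["size"] * b["size"] / (a["size"] + b["size"])

def merge(a, b):
    """Merge two same-tag clusters; child counts become size-weighted means."""
    size = a["size"] + b["size"]
    tags = set(a["avg"]) | set(b["avg"])
    avg = {t: (a["size"] * a["avg"].get(t, 0.0)
               + b["size"] * b["avg"].get(t, 0.0)) / size for t in tags}
    return {"tag": a["tag"], "size": size, "avg": avg}

def build_treesketch(clusters, budget):
    """Merge the cheapest same-tag pair until at most `budget` nodes remain."""
    clusters = list(clusters)
    while len(clusters) > budget:
        pairs = [(merge_cost(a, b), i, j)
                 for (i, a), (j, b) in combinations(enumerate(clusters), 2)
                 if a["tag"] == b["tag"]]
        if not pairs:  # no same-tag pair left to merge
            break
        _, i, j = min(pairs)
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Step 1 output for the example used later in the talk (counts taken from the slides).
graph_synopsis = [
    {"tag": "R", "size": 1, "avg": {"P": 1.0}},
    {"tag": "P", "size": 1, "avg": {"S": 2.0}},
    {"tag": "S", "size": 2, "avg": {"F": 2.0}},
    {"tag": "F", "size": 2, "avg": {"E": 1.0, "C": 1.0}},
    {"tag": "F", "size": 2, "avg": {"C": 1.0}},
    {"tag": "E", "size": 2, "avg": {}},
    {"tag": "C", "size": 4, "avg": {}},
]
sketch = build_treesketch(graph_synopsis, budget=6)
f_node = next(c for c in sketch if c["tag"] == "F")
print(f_node)
```

With a budget of six nodes, the only same-tag pair is the two F clusters, so they are merged into a single F cluster of size 4.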
More Discussions of the construction procedure
  Graph synopsis construction
    Use a node to represent a set of same-tag elements
    Queries can be answered with zero error
    The size can become very large: it can easily be on the order of the original document size
  TreeSketch synopsis construction
    Compress the synopsis by merging nodes
    Bottom-up merging clustering algorithm
  Key technique for compression: clustering
    Based on structure
    Model accuracy depends on the quality of the clustering
    Tight clusters: accurate synopsis, but large model
    Loose clusters: less accuracy, but small model
Construction Example: count same-tag elements
  [Figure: XML document with elements r, p1, s2, s3, f4, f5, f6, f7, e8, c9, e10, c11, c12, c13, and its graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]
  Synopsis node: set of elements of the same tag
  Synopsis edge: document edge(s)
Construction Example: calculate the number of children per element
  Calculate the mean number of children for each edge
  Count[r, p]: mean number of children in p per element in r
  [Figure: graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4); every edge has count 1 except the edge from P to S, whose count is 2 = 2 / 1]
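The edge counts can be reproduced with a few lines of code. The sketch below is illustrative only: the slides do not show the parent-child edges of the example document, so `PARENT` is a hypothetical document chosen to be consistent with the node and edge counts on the slide. `NODE_OF` is an assumed assignment of elements to synopsis nodes, and `F1`/`F2` are made-up names for the two F(2) nodes.

```python
from collections import Counter

# Hypothetical document (child -> parent), consistent with the slide's counts.
PARENT = {
    "p1": "r",
    "s2": "p1", "s3": "p1",
    "f4": "s2", "f6": "s2", "f5": "s3", "f7": "s3",
    "e8": "f4", "c9": "f4", "e10": "f5", "c11": "f5",
    "c12": "f6", "c13": "f7",
}

# Assumed assignment of document elements to synopsis nodes.
NODE_OF = {
    "r": "R", "p1": "P", "s2": "S", "s3": "S",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "e8": "E", "e10": "E",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}

def graph_synopsis(parent, node_of):
    """Return (node counts, edge counts) for the given element partition."""
    # Node count: number of document elements assigned to each synopsis node.
    node_count = Counter(node_of.values())
    # Number of document edges that go from an element of u to an element of v.
    edge_children = Counter((node_of[p], node_of[c]) for c, p in parent.items())
    # Count[u, v]: mean number of children in v per element in u.
    edge_count = {(u, v): n / node_count[u] for (u, v), n in edge_children.items()}
    return dict(node_count), edge_count

nodes, edges = graph_synopsis(PARENT, NODE_OF)
print(nodes)
print(edges)
```

For the edge from P to S, two S elements divided by one P element gives the count 2 = 2 / 1 shown on the slide.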
Merging Nodes: less space budget
  [Figure: the graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4) next to the more concise TreeSketch synopsis obtained by merging the two F(2) nodes into F(4). In the TreeSketch synopsis the edge counts are 1 (R to P), 2 (P to S), 2 (S to F), 1 (F to C) and 0.5 (F to E)]
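The counts after the merge follow from simple bookkeeping, sketched below. The update rule is an assumption inferred from the numbers on the slides: node counts add up, counts on edges into the merged node add up, and counts on edges out of it become size-weighted means. `F1` and `F2` are hypothetical names for the two F(2) nodes.

```python
def merge_nodes(nodes, edges, a, b, merged):
    """Merge synopsis nodes a and b into a new node named `merged`."""
    size = nodes[a] + nodes[b]
    new_nodes = {n: c for n, c in nodes.items() if n not in (a, b)}
    new_nodes[merged] = size
    new_edges = {}
    for (u, v), count in edges.items():
        # Edges leaving a or b are weighted by the source's share of the merged
        # node, so the result is a mean over all merged elements.
        weight = nodes[u] / size if u in (a, b) else 1.0
        u2 = merged if u in (a, b) else u
        v2 = merged if v in (a, b) else v
        new_edges[(u2, v2)] = new_edges.get((u2, v2), 0.0) + weight * count
    return new_nodes, new_edges

# Graph synopsis from the construction example.
nodes = {"R": 1, "P": 1, "S": 2, "F1": 2, "F2": 2, "E": 2, "C": 4}
edges = {
    ("R", "P"): 1.0, ("P", "S"): 2.0,
    ("S", "F1"): 1.0, ("S", "F2"): 1.0,
    ("F1", "E"): 1.0, ("F1", "C"): 1.0, ("F2", "C"): 1.0,
}

sketch_nodes, sketch_edges = merge_nodes(nodes, edges, "F1", "F2", "F")
print(sketch_nodes)
print(sketch_edges)
```

The count of 0.5 on the edge from F to E records that only half of the four merged F elements have an E child. That information loss is the price of the smaller synopsis.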
Compute Approximate Answers: more like the traditional way
  Travel down the tree
  Match a pattern in the structure and return a sub-tree
  TreeSketch: fast response
    Concise synopsis
    Keeps statistical information
      Node: number of same-tag elements
      Edge: number of children per element
Compute Approximate Answers: example
  [Figure: twig query with nodes q0, q1 (//section), q2 (.//equation) and q3 (.//caption), evaluated over the TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4). In the approximate nesting tree, R has S with count 1x2 = 2, and each S has E with count 1x1 = 1 and C with count 1x1 + 1x1 = 2]
  Approximate results with structure
    1) Take advantage of the concise structure
    2) and the statistical data
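The products and sums in the example can be reproduced by walking the synopsis and multiplying edge counts along each path. This is a minimal sketch, assuming an acyclic synopsis and descendant steps only. It is not the authors' evaluation algorithm, and `F1`/`F2` are hypothetical names for the two F(2) nodes.

```python
# Synopsis from the example slide: tag of each node and mean child count per edge.
TAG_OF = {"R": "R", "P": "P", "S": "S", "F1": "F", "F2": "F", "E": "E", "C": "C"}
EDGE_COUNT = {
    ("R", "P"): 1.0, ("P", "S"): 2.0,
    ("S", "F1"): 1.0, ("S", "F2"): 1.0,
    ("F1", "E"): 1.0, ("F1", "C"): 1.0, ("F2", "C"): 1.0,
}

def descendants_per_element(start, tag):
    """Estimated number of `tag` descendants per element of synopsis node `start`.

    Multiplies edge counts along each downward path and sums over all paths.
    Assumes the synopsis has no cycles.
    """
    total = 0.0
    for (u, v), count in EDGE_COUNT.items():
        if u == start:
            hit = 1.0 if TAG_OF[v] == tag else 0.0
            total += count * (hit + descendants_per_element(v, tag))
    return total

sections_per_root = descendants_per_element("R", "S")      # 1 x 2
equations_per_section = descendants_per_element("S", "E")  # 1 x 1
captions_per_section = descendants_per_element("S", "C")   # 1 x 1 + 1 x 1
print(sections_per_root, equations_per_section, captions_per_section)
```

The three estimates match the counts in the approximate nesting tree on the slide: 2 sections, and per section 1 equation and 2 captions.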
Experimental Setup
  Focus on
    the quality of the approximate answers generated
    the efficiency of the construction process
  Data sets: XMark, DBLP, IMDB, SwissProt
  Workload: 1000 random twig queries
Evaluation Methods
  Error: distance between R' and R
  Popular metric: tree-edit distance
    Min-cost sequence of operations that transforms R' into R
    Argument against it: it does not capture structural similarity
  New evaluation metric: ESD (Element Simulation Distance)
    Calculates the number of children for each edge in the tree to capture the complete structure of the tree
    Models how well the structures of two trees match each other
    "Degree" of simulation between two trees
    Average ESD is used for evaluation
Experimental Results: approximate answers, compared with twig-XSketches
Experimental Results: relative errors
  < 5%, i.e. 95% accuracy
Contributions and Limitations: strengths and weaknesses
TreeSketch Approach: in this paper
  Proposes an effective XML-summarization mechanism
  Captures the complete tree structure of large XML data
  Experimental results: produces fast and accurate approximate query answers
  Author claim: the first work to address the timely problem of producing approximate tree-structured answers for complex XML queries
  Comparison with the related work: 2 options
    Either compute the exact answer to a path query: expensive
    Or use an approach such as twig-XSketch, which does not capture the complete tree structure of the underlying XML database
Limitations: nice research, next steps for further investigation
  Difficult to optimize some pre-defined parameters, such as the space budget
    which is directly related to the accuracy of the approximate query answers
    too large a budget affects the efficiency; too small a budget affects the quality of the answers
    the right value depends on the query, the data set, and the computing resources
  An incremental model construction process is expected
    XML data usually grows incrementally, so the synopsis model needs to be constructed incrementally
  More experiments or some real applications are needed to justify the scalability of this technique
Thank You / Merci