Many databases today are text-rich, comprising not only structured, but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing as well as other tasks that can be solved through such queries can be optimized. Existing work on selectivity estimation focuses either on string or on structured query predicates alone. Further, probabilistic models proposed to incorporate dependencies between predicates are focused on the relational setting. In this work, we propose a template-based probabilistic model, which enables selectivity estimation for general graph-structured data. Our probabilistic model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimations can be obtained for queries over text-rich graph-structured data, which may contain structured and string predicates (hybrid queries). In our experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising the efficiency.
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu
Institute of Applied Informatics and Formal Description Methods (AIFB)
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Andreas Wagner, Veli Bicer, and Duc Thanh Tran
EDBT/ICDT’13
Introduction and Motivation
Selectivity Estimation for Text-Rich Data Graphs
Evaluation Results
INTRODUCTION & MOTIVATION
Text-Rich Data-Graphs and Hybrid Queries
Increasing amount of semi-structured, text-rich data:
Structured data with unstructured texts (e.g., [1]).
Unstructured data annotated with structured information (e.g., [2]).
[1] DBpedia – A Crystallization Point for the Web of Data.
[2] http://webdatacommons.org.
Text-Rich Data-Graphs and Hybrid Queries (2)
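Focus of our work: conjunctive, hybrid queries.
[Figure: example query over variables ?x and ?y with a relation edge and an attribute edge to a “keyword” node. Relation predicates are structured query predicates; attribute predicates with keywords are unstructured (“string”) query predicates.]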
Problem Definition
Problem: Efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q.
Decompose the problem: sel(Q) = R(Q) * P(Q) [5].
R(Q): upper-bound cardinality of the result set.
P(Q): probability that Q has a non-empty result.
Correlations between query predicates (data elements) make the approximation of P(Q) hard.
[Figure: query over ?x and ?y with relation, attribute, and “keyword” elements; correlations exist among these elements.]
Correlations make estimations that rely on “independence assumptions” error-prone!
[5] Selectivity estimation using probabilistic models.
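To illustrate why an independence assumption can mislead, here is a minimal sketch on a hypothetical toy data set (the entities, the class names, and the keywords are all made up for illustration). It compares sel(Q) = R(Q) * P(Q) when P(Q) is taken as a product of marginals against the value obtained from the true joint probability.

```python
from fractions import Fraction

# Hypothetical toy data: each entity is a (class, title keyword) pair.
entities = [
    ("movie", "holiday"), ("movie", "holiday"), ("movie", "holiday"),
    ("person", "audrey"), ("person", "audrey"), ("person", "holiday"),
]

def prob(pred):
    """Fraction of entities that satisfy a predicate."""
    return Fraction(sum(1 for e in entities if pred(e)), len(entities))

p_movie = prob(lambda e: e[0] == "movie")            # structured predicate
p_holiday = prob(lambda e: e[1] == "holiday")        # string predicate
p_joint = prob(lambda e: e == ("movie", "holiday"))  # both predicates together

upper_bound = len(entities)  # R(Q) for this single-variable query
sel_independent = upper_bound * p_movie * p_holiday  # assumes independence
sel_exact = upper_bound * p_joint                    # uses the joint probability

print(sel_independent, sel_exact)
```

In this toy data the class and the keyword are correlated, so the independence-based estimate falls below the exact result size.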
Contributions
Previous work focuses either on structured or on unstructured query constraints.
We introduce a uniform model (BN+) for hybrid queries:
Instance of a template-based BN, well-suited for graph-structured data.
Extends the BN with string synopses for the estimation of string predicates.
[Figure: query over ?x and ?y with relation, attribute, and “keyword” elements and the correlations among them.]
Related work on structured query constraints: join samples [3], graph synopses [4], PRMs [5,6], …
Related work on unstructured query constraints: fuzzy string matching [7,8], extraction operators [9,10], …
SELECTIVITY ESTIMATION FOR TEXT-RICH DATA GRAPHS
Preliminaries (1) – Data and Query Model
[Figure: example data graph and example query. Legend: entity node, class node, attribute value node, attribute edge, relation edge, bag of n-grams, keyword node, string predicate (contains), relation predicate.]
Preliminaries (2) – Bayesian Networks (1)
A Bayesian network (BN) provides a means for capturing joint probability distributions (e.g., P(Q)).
A BN comprises a network structure and parameters.
Nodes = random variables.
Edges = dependencies.
Recall: sel(Q) = R(Q) * P(Q)
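For reference, the standard factorization that a BN encodes can be written as follows. The symbols X_1, …, X_n (the random variables) and Pa(X_i) (the parents of X_i in the network structure) are notation introduced here, not taken from the slides.

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

The network structure determines each parent set, and the parameters are the conditional distributions on the right-hand side.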
Preliminaries (3) – Bayesian Networks (2)
A BN comprises a network structure and parameters.
Preliminaries (4) – Bayesian Networks (3)
Template-based BNs: templates and template factors [16].
A template is a function X(α1,…,αk); each argument αi is a placeholder that is instantiated to obtain random variables.
Example: the entity skeleton for Xperson is {p1, p2, p3}, so Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)}.
Template factors define probability distributions shared by all instantiated random variables of a given template (e.g., one factor shared by all instantiations of XdirectedBy).
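The instantiation step can be sketched in a few lines. This is only an illustration: the function name `instantiate`, the dictionary representation, and the probability values are assumptions, not part of the original model description.

```python
# Hypothetical entity skeleton for the template X_person.
skeleton_person = ["p1", "p2", "p3"]

# One template factor, shared by every instantiation of X_person.
# The probabilities are made up for illustration.
template_factor_person = {True: 0.3, False: 0.7}

def instantiate(template_name, skeleton, factor):
    """Create one ground random variable per skeleton entry, all sharing `factor`."""
    return {f"X_{template_name}({entity})": factor for entity in skeleton}

ground = instantiate("person", skeleton_person, template_factor_person)
print(list(ground))
```

Because every ground variable points at the same factor object, the model stores one distribution per template rather than one per entity.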
Template-Based BN for Graph-structured Data
We define a template for each …
Attribute a: Xa(α1). Entity skeleton: all entities having attribute a.
Class c: Xc(α1). Entity skeleton: all entities belonging to class c.
Relation r: Xr(α1,α2). Entity skeleton: all pairs of “source” and “target” entities having relation r.
Advantages:
Template representation is compact.
Dynamic partitioning based on entity skeletons.
Cf. PRMs [5,6], …
[Figure: template for attribute title, template for class person, template for relation spouse.]
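As a rough sketch of how the entity skeletons above could be derived from a data graph, the following groups entities by the template they instantiate. The triple encoding, the `type` predicate, and the set of attribute names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical data graph as (subject, predicate, object) triples.
triples = [
    ("m1", "type", "movie"),
    ("m1", "title", "Roman Holiday"),
    ("p1", "type", "person"),
    ("p1", "name", "Audrey Hepburn"),
    ("m1", "starring", "p1"),
    ("p2", "type", "person"),
    ("p1", "spouse", "p2"),
]
attributes = {"title", "name"}  # assumption: attribute edges are known up front

def skeletons(triples):
    """Group entities by the class, attribute, or relation template they instantiate."""
    by_template = defaultdict(set)
    for s, p, o in triples:
        if p == "type":
            by_template[("class", o)].add(s)          # X_c(a1)
        elif p in attributes:
            by_template[("attribute", p)].add(s)      # X_a(a1)
        else:
            by_template[("relation", p)].add((s, o))  # X_r(a1, a2)
    return dict(by_template)

sk = skeletons(triples)
print(sk[("class", "person")], sk[("relation", "spouse")])
```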
Integration of String Synopses (1)
Problem: Large sample space Ω for attribute-based templates (e.g., the entire n-gram space as Ω).
To compactly represent Ω, a large set of strings, we use string synopses (e.g., [7,8,9,10]), as developed for fuzzy string matching [7,8] and extraction operators [9,10].
Intuitively, for an attribute-based template, a string synopsis:
a) decides how to “compactly represent” Ω;
b) computes probabilities for strings given its compact space.
Some synopses can even “guess” probabilities for unknown strings.
Integration of String Synopses (2)
In this work, we use n-gram-based synopses [10].
Consider, e.g., the top-k n-gram synopsis [10]: compute n-gram counts and store only the top-k n-grams.
Probabilities for known n-grams are exact.
Probabilities for omitted n-grams are estimated via heuristics based on the known n-grams.
[10] Selectivity estimation for extraction operators over text data.
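A top-k n-gram synopsis might look roughly like the sketch below. The word-level tokenization and, in particular, the fallback heuristic (using the smallest stored count, which upper-bounds the count of any omitted n-gram) are stand-ins chosen for this example; they are not the heuristics of [10].

```python
from collections import Counter

def ngrams(text, n=1):
    """Word-level n-grams of a string (an assumption; [10] may tokenize differently)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_topk_synopsis(values, k, n=1):
    """Keep exact counts only for the k most frequent n-grams of an attribute."""
    # Count in how many attribute values each n-gram occurs.
    counts = Counter(g for v in values for g in set(ngrams(v, n)))
    top = dict(counts.most_common(k))
    has_omitted = len(counts) > len(top)
    # Stand-in heuristic: no omitted n-gram can be more frequent than the
    # least frequent stored one, so use that count as a (pessimistic) guess.
    fallback = min(top.values()) if top and has_omitted else 0
    return {"top": top, "fallback": fallback, "total": len(values)}

def probability(synopsis, gram):
    """P(value contains gram): exact for stored n-grams, heuristic otherwise."""
    count = synopsis["top"].get(gram, synopsis["fallback"])
    return count / synopsis["total"]

titles = ["Roman Holiday", "Holiday Inn", "Roman Empire", "Summer Holiday"]
syn = build_topk_synopsis(titles, k=2)
print(probability(syn, "holiday"), probability(syn, "inn"))
```

Only `k` counts plus one fallback number are stored, which is where the space saving comes from.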
Learning of BN+ (1): Structure (1)
Simplify structure via product approximation using trees [11,12].
Fixed structure assumption:
a) Two templates X1 and X2 are conditionally independent given their parents if they do not share a common entity in their skeletons.
b) Each class template Xc has no parent.
c) Each relation template Xr is independent of any class template Xc, given its parents.
[11] Approximating discrete probability distributions with dependence trees.
A similar technique has recently been applied for “Lightweight PRMs” [6].
Learning of BN+ (2): Structure (2)
Using the fixed structure allows structure learning to be decomposed:
“Local” correlations between attribute/class templates (e.g., Xmovie → Xtitle).
Reduce the network structure to capture only the “most important” correlations via a maximal spanning forest.
Relation templates connect different trees.
Overall, the network structure is determined by “overlapping” entity skeletons and the fixed structure assumption.
[Figure: template model.]
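The spanning-forest step can be sketched in the spirit of the dependence-tree approximation [11]: weight each pair of variables by empirical mutual information and keep a maximum spanning forest. The row-based data layout, the variable names, and the `min_weight` cut-off are assumptions of this sketch; edge orientation and the fixed structure assumption are not modelled here.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, a, b):
    """Empirical mutual information between columns a and b of `rows`."""
    n = len(rows)
    pa = Counter(r[a] for r in rows)
    pb = Counter(r[b] for r in rows)
    pab = Counter((r[a], r[b]) for r in rows)
    return sum(
        (c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )

def maximum_spanning_forest(rows, variables, min_weight=1e-9):
    """Kruskal's algorithm over mutual-information weights (undirected edges)."""
    edges = sorted(
        ((mutual_information(rows, a, b), a, b)
         for a, b in combinations(variables, 2)),
        reverse=True,
    )
    root = {v: v for v in variables}

    def find(v):
        while root[v] != v:
            v = root[v]
        return v

    forest = []
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        # Skip (near-)independent pairs and edges that would close a cycle.
        if weight > min_weight and ra != rb:
            root[ra] = rb
            forest.append((a, b))
    return forest

# Hypothetical observations: title depends on movie, rating is independent.
rows = [
    {"movie": True, "title": "holiday", "rating": "high"},
    {"movie": True, "title": "holiday", "rating": "low"},
    {"movie": False, "title": "audrey", "rating": "high"},
    {"movie": False, "title": "audrey", "rating": "low"},
]
print(maximum_spanning_forest(rows, ["movie", "title", "rating"]))
```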
Learning of BN+ (3): Parameters
Based on the learned structure, parameters are learned via collecting sufficient statistics (i.e., frequency counts).
Speed up parameter learning by:
Using queries to obtain sufficient statistics.
Using caching during structure/parameter learning.
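Parameter learning from frequency counts can be sketched as a maximum-likelihood estimate of a conditional probability table. The function name `learn_cpt` and the row layout are hypothetical; in practice the counts would come from queries against the data, as the slide notes.

```python
from collections import Counter

def learn_cpt(rows, child, parent):
    """Maximum-likelihood estimate of P(child | parent) from frequency counts."""
    joint = Counter((r[parent], r[child]) for r in rows)  # sufficient statistics
    marginal = Counter(r[parent] for r in rows)
    return {(p, c): n / marginal[p] for (p, c), n in joint.items()}

# Hypothetical observations for the edge X_movie -> X_title.
rows = (
    [{"movie": True, "title": "holiday"}] * 3
    + [{"movie": True, "title": "empire"}]
    + [{"movie": False, "title": "holiday"}]
)
cpt = learn_cpt(rows, child="title", parent="movie")
print(cpt[(True, "holiday")])
```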
Estimating P(Q) using BN+ (1)
At runtime, templates are instantiated to construct a query-specific ground BN.
[Figure: query, template model, and the resulting query-specific ground BN.]
Assignment is a string synopsis element.
Estimating P(Q) using BN+ (2)
Given a query-specific ground BN, we use inference to obtain the joint probability P(Q).
“Correction” using string synopsis.
Recall: sel(Q) = R(Q) * P(Q)
[Figure: query-specific ground BN.]
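The runtime steps can be sketched for a two-predicate query. Everything concrete here is an assumption for illustration: the factor values, the tuple encoding of the ground network, and the upper bound R(Q) = 1000. The string-synopsis “correction” mentioned on the slide is not modelled.

```python
import math

# Hypothetical template factors; every number here is illustrative.
# Keys are (value of the variable, value of its parent or None).
factors = {
    "X_movie": {(True, None): 0.4},
    # Sample-space elements of X_title are string synopsis elements (1-grams).
    "X_title": {("holiday", True): 0.05},
}

def ground_bn(query_vars):
    """Instantiate one ground variable per (template, query variable, parent)."""
    return [(f"{t}({v})", t, parent) for t, v, parent in query_vars]

def joint_probability(ground, assignment):
    """Product of P(variable | parent) over the ground network."""
    p = 1.0
    for name, template, parent in ground:
        parent_value = assignment[parent] if parent else None
        p *= factors[template][(assignment[name], parent_value)]
    return p

# Query: ?x type movie . ?x title contains "holiday"
ground = ground_bn([("X_movie", "?x", None), ("X_title", "?x", "X_movie(?x)")])
assignment = {"X_movie(?x)": True, "X_title(?x)": "holiday"}
p_q = joint_probability(ground, assignment)
sel = 1000 * p_q  # sel(Q) = R(Q) * P(Q), with an assumed R(Q) = 1000
print(p_q, sel)
```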
EVALUATION
Evaluation (1) – Setting
Data: IMDB [14] and DBLP [15].
IMDB featured more correlations than DBLP.
Different results between DBLP and IMDB show the “relative benefit”.
Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.
Systems: We used n-gram-based string synopses [10]:
random samples of 1-grams,
top-k 1-grams,
stratified bloom filters on 1-grams.
String predicates were integrated via either (1) an independence assumption (ind) or (2) a conditional independence assumption (bn).
[13] Spark2: Top-k keyword query in relational databases.
[14] A framework for evaluating database keyword search strategies.
Evaluation (2) – Setting (2)
Synopsis size: Overall synopsis size depends mainly on string synopsis size.
Synopsis sizes ∈ {2, 4, 20, 40} MByte of memory.
Metrics:
Efficiency: selectivity estimation time.
Effectiveness: multiplicative error [17].
[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
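A common way to define the multiplicative error is the ratio of the larger to the smaller of the exact and estimated selectivity. The sketch below assumes that definition; the `floor` parameter is an addition of this example to avoid division by zero and is not taken from the slides.

```python
def multiplicative_error(exact, estimate, floor=1.0):
    """max/min ratio of exact and estimated selectivity; 1.0 is a perfect estimate."""
    a = max(exact, floor)     # clamp so empty results do not divide by zero
    b = max(estimate, floor)
    return max(a, b) / min(a, b)

print(multiplicative_error(100, 25), multiplicative_error(25, 100))
```

The measure is symmetric, so over- and underestimation by the same factor are penalized equally.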
Evaluation (3) – Effectiveness – IMDB
Evaluation (4) – Effectiveness – DBLP
Evaluation (5) – Efficiency
CONCLUSION
Conclusion
We tackled the problem of selectivity estimation for conjunctive, hybrid queries.
We propose a template-based BN, which is well-suited for graph-structured data.
For string predicates, we further propose the integration of string synopses into this model.
Experiments showed that:
If there are correlations between structured and unstructured data elements, the accuracy of selectivity estimation can be greatly improved via BN+.
BN caused no overhead in terms of efficiency.
QUESTIONS
Slides @ Slideshare …
Paper @ www.aifb.kit.edu …
REFERENCES
References
[1] C. Bizer et al. DBpedia – A crystallization point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, issue 7, pages 154–165, 2009.
[2] http://webdatacommons.org/
[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275–286, 1999.
[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD, pages 205–216, 2006.
[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001.
[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852–863, 2011.
[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.
[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.
References (2)
[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033–1044, 2007.
[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE, pages 685–696, 2011.
[11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001.
[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12):1763–1780, 2011.
[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729–738, 2010.
[15] http://knoesis.org/swetodblp/
[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press, 2009.
[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.