Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu

Institute of Applied Informatics and Formal Description Methods (AIFB)


Andreas Wagner, Veli Bicer, and Duc Thanh Tran

EDBT/ICDT’13



Introduction and Motivation

Selectivity Estimation for Text-Rich Data Graphs

Evaluation Results



INTRODUCTION & MOTIVATION



Text-Rich Data-Graphs and Hybrid Queries

Increasing amount of semi-structured, text-rich data:

Structured data with unstructured texts (e.g., [1]).

Unstructured data annotated with structured information (e.g., [2]).

[1] DBpedia – A Crystallization Point for the Web of Data.

[2] http://webdatacommons.org.



Text-Rich Data-Graphs and Hybrid Queries (2)

Focus of our work: conjunctive, hybrid queries

Figure: hybrid query with variables ?x and ?y, relation and attribute predicates, and a “keyword”.

Structure: structured query predicates.

Text: unstructured query predicates, i.e., “string” (query) predicates.



Problem Definition

Problem: Efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q.

Decompose problem: sel(Q) = R(Q) * P(Q), [5].

R(Q): upper-bound cardinality for result set.

P(Q): probability of Q having a non-empty result.

Correlations between query predicates (data elements) make the approximation of P(Q) hard.

Figure: hybrid query pattern (?x, ?y, relation, attribute, “keyword”); correlations exist among the relation, attribute, and “keyword” predicates.

Correlations make estimations relying on “independence assumptions” error-prone!
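The effect of the independence assumption can be illustrated on a toy example. The following is a minimal sketch, assuming an invented table of eight entities and a single relation/keyword pair; the data and all variable names are hypothetical and not taken from the paper.

```python
# Toy data (invented): each row is one entity with its relation label and a
# flag telling whether its text contains the query keyword.
rows = [
    ("directedBy", True),
    ("directedBy", True),
    ("directedBy", True),
    ("starring", False),
    ("starring", False),
    ("starring", False),
    ("starring", False),
    ("starring", True),
]


def prob(pred):
    """Fraction of rows satisfying the predicate."""
    return sum(1 for r in rows if pred(r)) / len(rows)


p_rel = prob(lambda r: r[0] == "directedBy")              # 3/8
p_kw = prob(lambda r: r[1])                               # 4/8
p_joint = prob(lambda r: r[0] == "directedBy" and r[1])   # 3/8

upper_bound = len(rows)                    # plays the role of R(Q)
sel_true = upper_bound * p_joint           # sel(Q) = R(Q) * P(Q)
sel_ind = upper_bound * p_rel * p_kw       # P(Q) under independence

print(sel_true, sel_ind)
```

In this toy data the keyword occurs in every directedBy row, so the independence-based estimate is half of the true result size.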

[5] Selectivity estimation using probabilistic models.



Contributions

Previous work focuses either on structured or on unstructured query constraints.

We introduce a uniform model (BN+) for hybrid queries:

An instance of a template-based BN, well-suited for graph-structured data.

A BN extended with string synopses for the estimation of string predicates.

Figure: hybrid query pattern (?x, ?y, relation, attribute, “keyword”) with correlations among its predicates.

Structured query constraints:

- Graph synopses [3]
- Join samples [4]
- PRMs [5,6]
- …

Unstructured query constraints:

- Fuzzy string matching [7,8]
- Extraction operators [9,10]
- …



SELECTIVITY ESTIMATION FOR TEXT-RICH DATA GRAPHS



Preliminaries (1) – Data and Query Model

Figure: example data graph and query graph.

Data: entity nodes, class nodes, attribute value nodes (each a bag of n-grams), attribute edges, relation edges.

Query: relation predicates, string predicates (contains), keyword nodes.
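The data and query model can be sketched in a few lines. This is only an illustration under assumptions: n-grams are taken to be lowercase word n-grams, the graph is held in plain Python containers, and the names `ngram_bag`, `class_nodes`, `attribute_values`, `relation_edges`, and `count_results` are hypothetical.

```python
def ngram_bag(text, n=1):
    """Bag of word n-grams of an attribute value (word level is an assumption)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Class nodes: entity -> class (invented example data).
class_nodes = {"m1": "Movie", "m2": "Movie", "p1": "Person", "p2": "Person"}

# Attribute edges: (entity, attribute) -> attribute value node as a bag of n-grams.
attribute_values = {
    ("m1", "title"): ngram_bag("Roman Holiday"),
    ("m2", "title"): ngram_bag("Sabrina"),
    ("p1", "name"): ngram_bag("Audrey Hepburn"),
    ("p2", "name"): ngram_bag("Gregory Peck"),
}

# Relation edges: (source entity, relation, target entity).
relation_edges = [
    ("m1", "starring", "p1"),
    ("m1", "starring", "p2"),
    ("m2", "starring", "p1"),
]


def count_results(relation, attribute, keyword):
    """Exact result size of the query ?x relation ?y, ?y attribute contains keyword."""
    return sum(
        1
        for _src, rel, tgt in relation_edges
        if rel == relation and keyword in attribute_values.get((tgt, attribute), [])
    )


print(count_results("starring", "name", "audrey"))
```

Brute-force counting like this gives the true result size that a selectivity estimate tries to approximate without touching the data.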



Preliminaries (2) – Bayesian Networks (1)

A Bayesian network (BN) provides a means of capturing joint probability distributions (e.g., P(Q)).

A BN comprises a network structure and parameters.

Nodes = random variables.

Edges = dependencies.

Recall: sel(Q) = R(Q) * P(Q)
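How a BN encodes a joint distribution can be sketched as the product of each node's conditional probability given its parents. The two-node network, its probabilities, and the names `parents`, `cpt`, and `joint` below are hypothetical and chosen only for illustration.

```python
# Structure: each variable maps to its tuple of parents (edge movie -> title).
parents = {"movie": (), "title": ("movie",)}

# Parameters: conditional probability tables keyed by the parent assignment.
# All numbers are invented.
cpt = {
    "movie": {(): {True: 0.25, False: 0.75}},
    "title": {
        (True,): {"holiday": 0.5, "<other>": 0.5},
        (False,): {"holiday": 0.0, "<other>": 1.0},
    },
}


def joint(assignment):
    """P(assignment) as the product of P(variable | parents) over all nodes."""
    p = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p *= cpt[var][key][assignment[var]]
    return p


print(joint({"movie": True, "title": "holiday"}))
```

Only the per-node tables are stored; the full joint table over all variables is never materialized.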






Template-based BNs: templates and template factors [16].

A template is a function X(α1,…,αk); each argument αi is a placeholder that is instantiated to obtain random variables.

Example: with the entity skeleton {p1, p2, p3} for Xperson, instantiation yields Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)}.

Template factors define probability distributions shared by all instantiated random variables of a given template (e.g., one factor shared by all instantiations of XdirectedBy).
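Template instantiation and factor sharing can be sketched as follows. The `Template` class, the string encoding of random variables, and the factor values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """A template X_name(a1, ..., ak) with k = arity placeholders."""
    name: str
    arity: int

    def instantiate(self, skeleton):
        """Create one random variable (as a label) per entry of the entity skeleton."""
        variables = []
        for args in skeleton:
            if len(args) != self.arity:
                raise ValueError(f"expected {self.arity} arguments, got {len(args)}")
            variables.append(f"X{self.name}({', '.join(args)})")
        return variables


person = Template("person", 1)
skeleton = [("p1",), ("p2",), ("p3",)]
variables = person.instantiate(skeleton)

# One template factor (invented numbers) shared by every instantiated variable.
template_factor = {"person": {True: 0.25, False: 0.75}}
factor_of = {v: template_factor["person"] for v in variables}

print(variables)
```

Because every instantiated variable points to the same factor object, the number of parameters depends on the number of templates, not on the number of entities.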



Template-Based BN for Graph-structured Data

We define a template for each …

Attribute a, Xa(α1). Entity skeleton: all entities having attribute a.

Class c, Xc(α1). Entity skeleton: all entities belonging to class c.

Relation r, Xr(α1,α2). Entity skeleton: all pairs of “source” and “target” entities having relation r.

Advantages (cf. PRMs [5,6], …):

Template representation is compact.

Dynamic partitioning based on entity skeletons.

Examples: template for attribute title, template for class person, template for relation spouse.
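Deriving the entity skeletons for attribute, class, and relation templates from a data graph can be sketched as a single pass over its edges. The tagged edge format `(kind, subject, predicate, object)` and the function name `entity_skeletons` are assumptions for this illustration.

```python
from collections import defaultdict

# Invented data graph; each edge is tagged with its kind.
edges = [
    ("class", "p1", "type", "person"),
    ("class", "p2", "type", "person"),
    ("class", "m1", "type", "movie"),
    ("attribute", "m1", "title", "Roman Holiday"),
    ("relation", "p1", "spouse", "p2"),
]


def entity_skeletons(edges):
    """Map each template, identified by (kind, name), to its entity skeleton."""
    skeletons = defaultdict(list)
    for kind, subj, pred, obj in edges:
        if kind == "class":
            skeletons[("class", obj)].append((subj,))        # entities of class c
        elif kind == "attribute":
            skeletons[("attribute", pred)].append((subj,))   # entities having attribute a
        elif kind == "relation":
            skeletons[("relation", pred)].append((subj, obj))  # (source, target) pairs
        else:
            raise ValueError(f"unknown edge kind: {kind}")
    return dict(skeletons)


skeletons = entity_skeletons(edges)
print(skeletons[("relation", "spouse")])
```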



Integration of String Synopses (1)

Problem: Large sample space for attribute-based templates.

To compactly represent the sample space Ω, a large set of strings, we use string synopses (e.g., [7,8,9,10]), such as those for fuzzy string matching [7,8] and extraction operators [9,10].

Intuitively, for an attribute-based template a string synopsis does the following:

a) Decides how to “compactly represent” Ω.

b) Computes probabilities for strings given its compact space.

Some synopses can even “guess” probabilities for unknown strings.

Figure: entire n-gram space as Ω.



Integration of String Synopses (2)

In this work, we use n-gram-based synopses [10].

Consider, e.g., the top-k n-gram synopsis [10].

Compute n-gram counts and store only the top-k n-grams.

Probabilities for known n-grams are exact.

Probabilities for omitted n-grams are estimated via heuristics based on the known n-grams.
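A top-k n-gram synopsis can be sketched as follows. This is a simplified illustration, not the synopsis of [10]: it assumes word 1-grams, defines the probability of an n-gram as its share of all n-gram occurrences, and uses an assumed heuristic that spreads the omitted probability mass uniformly over the omitted n-grams. The class name `TopKSynopsis` is hypothetical.

```python
from collections import Counter


class TopKSynopsis:
    """Keeps exact counts for the k most frequent 1-grams only."""

    def __init__(self, strings, k):
        counts = Counter(tok for s in strings for tok in s.lower().split())
        self.total = sum(counts.values())
        self.top = dict(counts.most_common(k))
        omitted = len(counts) - len(self.top)
        omitted_mass = self.total - sum(self.top.values())
        # Assumed heuristic: every omitted (or unseen) 1-gram receives the
        # average probability of the omitted 1-grams.
        self.fallback = omitted_mass / omitted / self.total if omitted else 0.0

    def probability(self, gram):
        if gram in self.top:
            return self.top[gram] / self.total   # exact for stored n-grams
        return self.fallback                     # heuristic guess otherwise


synopsis = TopKSynopsis(
    ["roman holiday", "holiday inn", "holiday", "roman empire"], k=2
)
print(synopsis.probability("holiday"), synopsis.probability("inn"))
```

Only k counts, the total, and one fallback value are stored, so the memory footprint is controlled by k.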

[10] Selectivity estimation for extraction operators over text data.



Learning of BN+ (1): Structure (1)

Simplify structure via product approximation using trees [11,12].

Fixed Structure Assumption:

a) Two templates X1 and X2 are conditionally independent given their parents if they do not share a common entity in their skeletons.

b) Each class template Xc has no parent.

c) Each relation template Xr is independent of any class template Xc, given its parents.

[11] Approximating discrete probability distributions with dependence trees.

A similar technique has recently been applied for “Lightweight PRMs” [6].



Learning of BN+ (2): Structure (2)

Using a fixed structure allows us to decompose structure learning:

“Local” correlations between attribute and class templates (e.g., Xmovie → Xtitle).

Reduce the network structure to capture only the “most important” correlations via a maximal spanning forest.

Relation templates connect different trees.

Overall, the network structure is determined by “overlapping” entity skeletons and the fixed structure assumption.
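The maximal spanning forest step can be sketched in the spirit of dependence trees [11]: weight each pair of variables by its empirical mutual information and keep the heaviest edges that do not close a cycle. The threshold `min_weight` for dropping uncorrelated pairs, the toy columns, and the function names are assumptions of this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two equally long value lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def max_spanning_forest(columns, min_weight=1e-9):
    """Kruskal's algorithm on mutual-information weights, heaviest edge first."""
    names = list(columns)
    weighted = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(names, 2)),
        reverse=True,
    )
    root = {v: v for v in names}

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]   # path compression
            v = root[v]
        return v

    forest = []
    for weight, a, b in weighted:
        ra, rb = find(a), find(b)
        if weight > min_weight and ra != rb:   # skip uncorrelated pairs and cycles
            root[ra] = rb
            forest.append((a, b))
    return forest


# Invented observations: movie and title are perfectly correlated, year is not.
columns = {"movie": [1, 1, 0, 0], "title": [1, 1, 0, 0], "year": [1, 0, 1, 0]}
forest = max_spanning_forest(columns)
print(forest)
```

In this toy input the variable year stays isolated, which is why the result is a forest rather than a single tree.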




Learning of BN+ (3): Parameters

Based on the learned structure, parameters are learned via collecting sufficient statistics (i.e., frequency counts).

Speed up parameter learning via:

Using queries to obtain sufficient statistics.

Using caching during structure / parameter learning.
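Learning parameters from frequency counts can be sketched as a maximum-likelihood estimate of a conditional probability table. The sample format (one dictionary per observation) and the function name `learn_cpt` are assumptions; query-based collection of the counts and caching are not modeled here.

```python
from collections import Counter, defaultdict


def learn_cpt(samples, child, parent):
    """Estimate P(child | parent) from frequency counts (sufficient statistics)."""
    pair_counts = Counter((s[parent], s[child]) for s in samples)
    parent_counts = Counter(s[parent] for s in samples)
    cpt = defaultdict(dict)
    for (parent_value, child_value), count in pair_counts.items():
        cpt[parent_value][child_value] = count / parent_counts[parent_value]
    return dict(cpt)


# Invented observations for the edge Xmovie -> Xtitle.
samples = [
    {"movie": True, "title": "holiday"},
    {"movie": True, "title": "holiday"},
    {"movie": True, "title": "holiday"},
    {"movie": True, "title": "empire"},
    {"movie": False, "title": "<none>"},
]
cpt = learn_cpt(samples, child="title", parent="movie")
print(cpt[True])
```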



Estimating P(Q) using BN+ (1)

At runtime, templates are instantiated to construct a query-specific ground BN.

Figure: the query and the template model together yield the query-specific ground BN.

Assignment is a string synopsis element.



Estimating P(Q) using BN+ (2)

Given a query-specific ground BN, we use inference to obtain the joint probability P(Q).

“Correction” using string synopsis.

Recall: sel(Q) = R(Q) * P(Q)
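Inference on a small ground BN can be sketched by enumeration: variables not constrained by the query are summed out, and the resulting P(Q) is scaled by R(Q). The two-variable network, its probabilities, the value of R(Q), and the function name `query_probability` are invented for illustration; the string synopsis “correction” is not modeled.

```python
# Ground BN for one query variable: Xclass -> Xtitle (invented parameters).
p_class = {"movie": 0.25, "series": 0.75}
p_title_given_class = {
    "movie": {"holiday": 0.5},
    "series": {"holiday": 0.25},
}


def query_probability(evidence):
    """P(evidence); the class variable is summed out when the query leaves it open."""
    total = 0.0
    for cls, p_cls in p_class.items():
        if "class" in evidence and evidence["class"] != cls:
            continue
        total += p_cls * p_title_given_class[cls].get(evidence["title"], 0.0)
    return total


upper_bound = 16                                   # hypothetical R(Q)
p_q = query_probability({"title": "holiday"})      # class is marginalized
selectivity = upper_bound * p_q                    # sel(Q) = R(Q) * P(Q)
print(p_q, selectivity)
```

Enumeration is exponential in the number of open variables; it is used here only because the toy network is tiny.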




EVALUATION



Evaluation (1) – Setting

Data: IMDB [14] and DBLP [15].

IMDB featured more correlations than DBLP.

Differences in results between DBLP and IMDB show the “relative benefit”.

Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.

Systems: We used n-gram-based string synopses [10]:

random samples of 1-grams,

top-k 1-grams,

stratified Bloom filters on 1-grams.

String predicates were integrated via either (1) an independence (ind) or (2) a conditional independence (bn) assumption.

[13] Spark2: Top-k keyword query in relational databases.

[14] A framework for evaluating database keyword search strategies.



Evaluation (2) – Setting (2)

Synopsis size: Overall synopsis size depends mainly on string synopsis size.

Synopsis sizes ∈ {2, 4, 20, 40} MByte of memory.

Metrics:

Efficiency: selectivity estimation time.

Effectiveness: multiplicative error [17].

[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
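The effectiveness metric can be sketched with a commonly used definition of the multiplicative error, the larger of the estimated and the actual selectivity divided by the smaller. Whether the evaluation used exactly this form is an assumption, as is the `floor` parameter, which only guards against division by zero.

```python
def multiplicative_error(estimated, actual, floor=1.0):
    """max(est, act) / min(est, act), with both values clamped to at least `floor`."""
    est = max(estimated, floor)
    act = max(actual, floor)
    return max(est, act) / min(est, act)


print(multiplicative_error(50, 100))
print(multiplicative_error(100, 50))
```

The measure is symmetric: over- and underestimation by the same factor yield the same error, and a perfect estimate yields 1.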



Evaluation (3) – Effectiveness – IMDB



Evaluation (4) – Effectiveness – DBLP



Evaluation (5) – Efficiency



CONCLUSION



Conclusion

We tackled the problem of selectivity estimation for conjunctive, hybrid queries.

We propose a template-based BN, which is well-suited for graph-structured data.

For string predicates, we further propose the integration of string synopses into this model.

Experiments showed that:

If there are correlations between un-/structured data elements, the accuracy of selectivity estimation can be greatly improved via BN+.

BN caused no overhead in terms of efficiency.



QUESTIONS

Slides @ Slideshare …

Paper @ www.aifb.kit.edu …



REFERENCES



References

[1] C. Bizer et al. DBpedia – A crystallization point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[2] http://webdatacommons.org/

[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275–286, 1999.

[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD, pages 205–216, 2006.

[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001.

[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852–863, 2011.

[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.

[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.


References (2)

[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033–1044, 2007.

[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE, pages 685–696, 2011.

[11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001.

[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12):1763–1780, 2011.

[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729–738, 2010.

[15] http://knoesis.org/swetodblp/

[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press, 2009.

[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.
