Keyword Search over Relational Tables and Streams


ALEXANDER MARKOWETZ, University of Bonn
YIN YANG and DIMITRIS PAPADIAS, Hong Kong University of Science and Technology

Doklea Meci (A.M 2152), May 2012

University of Crete, Department of Computer Science

THE CHALLENGES OF ACCESSING STRUCTURED DATA

Query languages: numerous complex SQL statements

Schemas: complex or nontrivial schema

R-KWS queries:
  replace numerous complex SQL statements
  liberate users from studying a database schema
  allow querying for terms in unknown locations (tables/attributes)

INTRODUCTION

KeyWord Search (KWS)
  each document/Web page constitutes one unit of information
  a document is a result if it contains a subset of the query's keywords
  has been applied to relational DBMS
  allows data retrieval without SQL

Relational-Keyword Search (R-KWS)
  the basic unit of information is a record/tuple
  queries cannot be answered by inspecting records individually
  results have to be constructed by joining tuples
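
A minimal sketch of why R-KWS results have to be built by joining tuples. The two tables, their column names, and the query terms are hypothetical and serve only as an illustration: no single tuple contains both keywords, but one joined pair does.

```python
# Hypothetical toy tables; "aid" is an assumed foreign key from papers to authors.
authors = [{"aid": 1, "name": "Smith"}, {"aid": 2, "name": "Jones"}]
papers = [
    {"pid": 10, "aid": 1, "title": "XML indexing"},
    {"pid": 11, "aid": 2, "title": "Stream joins"},
]
query = {"smith", "xml"}


def terms(row):
    """Lower-cased words appearing in any attribute of a tuple."""
    return {word.lower() for value in row.values() for word in str(value).split()}


# Inspecting tuples one by one: no tuple holds all query keywords.
single_hits = [row for row in authors + papers if query <= terms(row)]

# Joining tuples along the foreign key: the combined tuple pair covers the query.
joined_hits = [
    (a, p)
    for a in authors
    for p in papers
    if a["aid"] == p["aid"] and query <= terms(a) | terms(p)
]
```

Here `single_hits` stays empty, while `joined_hits` holds the pair formed by author 1 and paper 10.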

OUTLINE

Introduction
Relational Keyword Search on Tables
  Graph-Based Processing
  Operator-Based Processing
Optimizations for Continuous GB
  Predecessor-KL
  Time-KL
Optimizations for Continuous OB
  Operator Mesh
  Demand-Driven Operator Execution
  Partial-Mesh
Experimental Evaluation
  Snapshot R-KWS Queries over Tables
  Continuous R-KWS Queries over Streams
  Summary of Experimental Evaluation
Conclusion

RELATIONAL KEYWORD SEARCH ON TABLES

Goal: methods for graph-based (GB) and operator-based (OB) processing that
  avoid the shortcomings of prior systems
  improve the performance of R-KWS in conventional databases

GRAPH-BASED PROCESSING

Basic Idea: given an inverted index I (on disk), GB traverses an undirected data graph G (in memory), searching for MTJNT (Minimal Total Join Networks of Tuples) results

Join Networks of Tuples (JNT) are connected acyclic components of G

A JNT is total if it contains all query keywords; it is called a Minimal Total JNT (MTJNT) iff it is total and it is impossible to remove any node and find the remainder to be a total JNT

GSEARCH ALGORITHM

Basic Idea: the algorithm enumerates all possible trees in G rooted at sn

Result: each reported tree corresponds to an MTJNT

GSEARCH ALGORITHM

GSearch maintains a queue Q of trees, each constituting a fraction of a potential MTJNT

Every tree is de-queued and expanded by adding one new node, resulting in a new tree

The new tree falls into one of three categories:
  It forms an MTJNT, and is included in the result set
  It has the potential to become an MTJNT, and is inserted in Q to be expanded later
  None of the previous, and the tree can be safely discarded

The algorithm terminates when Q becomes empty
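
The queue-driven enumeration can be sketched as follows. This is a simplified illustration, not the authors' GSearch: the function name `gsearch`, the dictionary-based graph, the size bound `t_max`, and the `seen` set used to suppress duplicate trees are all assumptions made for this sketch. A tree counts as total when it covers every query keyword, and as minimal when no leaf can be removed without losing a keyword (removing an inner node would disconnect the tree).

```python
from collections import deque


def gsearch(graph, keywords_of, query, root, t_max):
    """Return all MTJNTs, as (nodes, edges) pairs, that contain `root`.

    graph:       dict node -> iterable of neighbouring nodes (undirected)
    keywords_of: dict node -> set of keywords stored in that tuple
    t_max:       assumed upper bound on the number of nodes in a result
    """
    query = frozenset(query)

    def is_total(nodes):
        found = set()
        for n in nodes:
            found |= keywords_of.get(n, set()) & query
        return found == query

    def is_minimal(nodes, edges):
        degree = {n: 0 for n in nodes}
        for edge in edges:
            for n in edge:
                degree[n] += 1
        leaves = [n for n in nodes if degree[n] <= 1]
        return not any(is_total(nodes - {leaf}) for leaf in leaves)

    start = (frozenset([root]), frozenset())
    if is_total(start[0]):
        return [start]
    queue, seen, results = deque([start]), {start}, []
    while queue:
        nodes, edges = queue.popleft()
        if len(nodes) >= t_max:
            continue  # expanding would exceed the size bound
        for u in nodes:
            for v in graph.get(u, ()):
                if v in nodes:
                    continue  # adding v again would create a cycle
                tree = (nodes | {v}, edges | {frozenset((u, v))})
                if tree in seen:
                    continue  # same tree already built in another order
                seen.add(tree)
                if is_total(tree[0]):
                    if is_minimal(*tree):
                        results.append(tree)  # category 1: an MTJNT
                    # total but not minimal: discarded (category 3)
                else:
                    queue.append(tree)  # category 2: expanded later
    return results


# Hypothetical data graph: s1 - t1 - s2, with one keyword at each end.
graph = {"s1": ["t1"], "t1": ["s1", "s2"], "s2": ["t1"]}
keywords_of = {"s1": {"a"}, "s2": {"b"}}
results = gsearch(graph, keywords_of, {"a", "b"}, "s1", t_max=3)
```

A total tree is never expanded further, since any larger tree containing it would have a removable leaf and so could not be minimal.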

GSEARCH ALGORITHM

GSearch computes the set of MTJNTs containing node sn, and so GB answers an R-KWS query q correctly, completely, and without duplicates.

OPERATOR-BASED PROCESSING

Basic Idea: Query processing relies on Candidate Networks (CN)

Candidate Networks (CN) are projections of MTJNT onto the expanded schema
  a tuple s of relation S maps to node S{K} of the expanded schema EG(q), iff s contains all keywords in K, but does not contain any other term in q\K

An MTJNT projects to a unique CN
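
A small sketch of the projection, under assumed data structures: each tuple is given as a (relation, text) pair, keyword matching is plain lower-cased word lookup, and a CN is represented as a sorted list of labelled edges. The function names, relation names, and sample tuples are hypothetical.

```python
def schema_node(relation, text, query):
    """Label S{K}: K is exactly the set of query keywords the tuple contains."""
    words = set(text.lower().split())
    matched = sorted(k for k in query if k in words)
    return f"{relation}{{{','.join(matched)}}}"


def project_to_cn(tuples, edges, query):
    """Map every tuple of a JNT to its schema node and relabel the edges."""
    label = {tid: schema_node(rel, text, query) for tid, (rel, text) in tuples.items()}
    return sorted(tuple(sorted((label[a], label[b]))) for a, b in edges)


query = {"smith", "xml"}
edges = [("a", "w"), ("w", "p")]

# Two different tuple networks (hypothetical data) ...
jnt_1 = {"a": ("Author", "Smith"), "w": ("Writes", ""), "p": ("Paper", "XML indexing")}
jnt_2 = {"a": ("Author", "J Smith"), "w": ("Writes", ""), "p": ("Paper", "Streaming XML")}

# ... project onto the same candidate network.
cn_1 = project_to_cn(jnt_1, edges, query)
cn_2 = project_to_cn(jnt_2, edges, query)
```

Both networks map to the single CN Author{smith} - Writes{} - Paper{xml}, which illustrates why many MTJNTs can share one CN while each MTJNT has exactly one.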


OPTIMIZATIONS FOR CONTINUOUS GB

Basic Idea: Keyword labeling, a simple and effective method to summarize reachable keywords for a given node.

It improves performance by avoiding unnecessary calls to GSearch and constraining graph traversals.

A keyword label (KL) of format [k, h], stored at node n, indicates a path of h edges in the data graph, connecting n to an occurrence of keyword k.

EXAMPLE

For a keyword k, the label s:[k, 2] corresponds to a path connecting s to an occurrence of k, via 2 edges

BENEFITS OF A MIN-COMPLETE LABELING

GSearch(G, q, s) is called only if node s can reach all query terms, that is, only if s stores a KL for every k ∈ q.
In any other case, s is guaranteed not to participate in an MTJNT.

KL-aware GSearch algorithm: a tree is inserted into Q iff there exists a set NL of labels that meets the criteria below:
  The KL in NL can reach all missing keywords
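
A minimal sketch of how such labels could be computed and used as a gate before calling GSearch. It is an assumed implementation: one bounded breadth-first search per keyword, a hypothetical hop limit `max_hops`, and the convention that a node containing the keyword itself gets a label with h = 0.

```python
from collections import deque


def min_labels(graph, keywords_of, max_hops):
    """labels[n][k] = fewest edges from n to a node containing k (if <= max_hops)."""
    labels = {n: {} for n in graph}
    all_keywords = set().union(*keywords_of.values())
    for k in all_keywords:
        # Multi-source BFS starting from every occurrence of keyword k.
        frontier = deque((n, 0) for n in graph if k in keywords_of.get(n, ()))
        for n, _ in frontier:
            labels[n][k] = 0
        while frontier:
            n, h = frontier.popleft()
            if h == max_hops:
                continue
            for m in graph[n]:
                if k not in labels[m]:
                    labels[m][k] = h + 1
                    frontier.append((m, h + 1))
    return labels


def may_participate(labels, node, query):
    """Gate: the node must hold a KL for every query keyword."""
    return all(k in labels[node] for k in query)


# Hypothetical path graph a - b - c - d with keywords at both ends.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
keywords_of = {"a": {"k1"}, "d": {"k2"}}
labels = min_labels(graph, keywords_of, max_hops=2)
```

With a limit of two hops, nodes b and c hold labels for both keywords and pass the gate, while a and d cannot reach the opposite keyword and are skipped.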

EXAMPLE - INTERMEDIATE TREES ABANDONED BY KL-AWARE GSEARCH (Tmax = 9)

The tree is lacking a keyword
New nodes can only be added to a node that reaches the missing keyword in four hops, the shortest path to it
2nd criterion not satisfied: the tree already has size 6, and 6 + 4 > 9, so it is abandoned

PREDECESSOR-KL IMPLEMENTATION

Basic Idea: A predecessor-KL is a triplet of the form [k, h, p]
  it indicates a path of length h, connecting n to an occurrence of keyword k
  p is n's predecessor on that path

Every node n must contain a predecessor-KL [k, h, p] for the shortest path leading from n through p to the occurrence of k

An arriving tuple s can itself contain a keyword, or create new paths between keywords and nodes
  both cases require KL insertions and updates
  each path has a bounded number of edges

PREDECESSOR-KL EXAMPLE

A node must keep both KLs when they represent the shortest paths via two different predecessors

When both paths share the same predecessor, it suffices to keep one KL through that node

TIME-KL

Basic Idea: More efficient labeling that does not require explicit removal

A time-KL is a triplet [k, h, t] indicating a path of length h to an occurrence of keyword k, which exists until time t

KL [k, h1, t1] dominates another [k, h2, t2] iff (h1 ≤ h2 and t1 ≥ t2)

Result: the graph contains all KL that are not dominated by others
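
A sketch of the dominance test and of keeping only non-dominated labels at a node. It reads the rule as: a label dominates another for the same keyword when its path is no longer and it expires no sooner. The tuple layout `(k, h, t)`, the function names, and the expiry times 30 and 25 are assumptions; only the expiry time 21 and the hop counts echo the deck's example.

```python
def dominates(a, b):
    """a dominates b: same keyword, path no longer, expiry no earlier."""
    (ka, ha, ta), (kb, hb, tb) = a, b
    return ka == kb and ha <= hb and ta >= tb and (ha, ta) != (hb, tb)


def insert_label(labels, new):
    """Keep the set of labels free of dominated entries."""
    if any(old == new or dominates(old, new) for old in labels):
        return labels
    return [old for old in labels if not dominates(new, old)] + [new]


def current_label(labels, k, now):
    """Shortest still-valid label for k; expired labels are simply ignored."""
    live = [lab for lab in labels if lab[0] == k and lab[2] > now]
    return min(live, key=lambda lab: lab[1], default=None)


labels = []
for lab in [("k", 2, 30), ("k", 1, 25), ("k", 3, 21)]:
    labels = insert_label(labels, lab)
```

The third label is never stored because the first is both shorter and longer-lived. No explicit removal is needed: once time 25 passes, `current_label` falls back from the one-hop label to the two-hop label.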

TIME-KL EXAMPLE

1) a path to the keyword via 2 hops
2) a path to the keyword via 1 hop
3) a path to the keyword via 3 hops, through a node that expires at 21

Result:

(1) and (2) must be stored, as each indicates the shortest path for some period of time.

(3) is not recorded, as it is longer and expires sooner than the other two


OPTIMIZATIONS FOR CONTINUOUS OB

Basic Idea: On static tables, if a selection on a table (e.g., T{}) returns no tuples, all operator trees using this input can be discarded immediately

For data streams, this is not permissible: even though the selection T{} does not currently produce tuples, it may do so in the future, and all operator trees must thus be maintained.

Solution: optimizations that enable efficient OB R-KWS over data streams

OPERATOR MESH (1/3)

Basic Idea: sharing common subexpressions
  all operator trees are integrated into an operator mesh, reducing CPU cost (for evaluating joins) as well as memory overhead (for intermediate results).

The number of clusters in the mesh depends on |SR|, the number of streaming relations, and |K|, the number of query keywords
Each cluster contains the operator trees for all CN (Candidate Networks) discovered from a certain root node
The entire operator mesh has one leaf/source for each node of the extended schema

Maximum depth of the mesh is bounded
Number of edges depends on the schema complexity
Different clusters are interconnected only through their source operators
Joins from different clusters do not connect directly

OPERATOR MESH EXAMPLE

The figure shows the shared execution of four operator trees

OPERATOR MESH EXAMPLE

Algorithm:

The first node in a cluster corresponds to the root node, from which CNGen starts

Whenever the algorithm generates a new tree t' from t (by adding a new child c to a parent p), a join t'.op is added to the mesh

The left child of t'.op is t.op (the operator that was inserted when t was created)

The right child is the source of c

For each tree t in CNGen, a pointer is maintained to the corresponding operator t.op, to decide where to place subsequent joins when t is expanded

The algorithm is initialized with the first tree's op pointing to the source of the root node
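
The sharing of common subexpressions can be sketched as below. This is an assumed, simplified data structure rather than the authors' implementation: operators are plain tuples, a join is identified by its left operator together with the (parent, child) pair that was added, and the schema node names are hypothetical.

```python
class OperatorMesh:
    """Stores each source and each join once, however many trees use it."""

    def __init__(self):
        self.sources = {}
        self.joins = {}

    def source(self, schema_node):
        return self.sources.setdefault(schema_node, ("source", schema_node))

    def add_tree(self, root, expansions):
        """Register one operator tree and return its topmost operator.

        root:       schema node the tree generation starts from
        expansions: (parent, child) pairs in the order the tree was grown
        """
        op = self.source(root)  # the first tree's op points to the root source
        for parent, child in expansions:
            key = (op, parent, child)
            if key not in self.joins:
                # left input: operator of the smaller tree; right input: source of child
                self.joins[key] = ("join", op, self.source(child), parent)
            op = self.joins[key]
        return op


mesh = OperatorMesh()
trees = [
    [("S{k1}", "T{}"), ("T{}", "U{k2}")],
    [("S{k1}", "T{}"), ("T{}", "V{k2}")],
    [("S{k1}", "T{}")],
    [("S{k1}", "V{k2}")],
]
tops = [mesh.add_tree("S{k1}", expansions) for expansions in trees]
```

Evaluated separately, the four trees would need six joins; in the mesh the join of S{k1} with T{} is created once and reused, so only four joins exist.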

PROBLEMS WITH OPERATOR MESH APPROACH

Example:

Assume tuples arrive from S{} and T{}, while V{}, U{, } and V{, } are empty

None of the joins above the join of S{} and T{} requires its output, because they do not receive right input

Worst case:

The results of that join expire before the arrival of any tuples from V{}, U{, } or V{, }

The join has wasted CPU and memory, without any contribution to the query

DEMAND-DRIVEN OPERATOR EXECUTION (2/3)

The mesh is maintained in main memory throughout the lifespan of the query.

A join is considered to be either
  running: the operator processes input
  sleeping: the operator ignores input

A join operator is sent to sleep if:
  it has no input from the right child (a source), or
  all its parents are sleeping

Sending operators to sleep does not affect the result's correctness or completeness because either:
  the operator cannot produce output, or
  its output would not be consumed

DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE

The figure shows the state diagram for a join operator

States are characterized by two binary flags:
  d, indicating that at least one parent operator is running, and
  r, specifying that the operator's right input is not empty.

An operator only runs in the topmost state (d/r)

When it leaves this state (transition 2 or 3) it goes to sleep (or halts), to wake up (or restart) later (transitions 9 and 10)

Operators exchange messages regarding their state, in order to ensure that all d and r flags are up-to-date.

A join operator communicates changes (running/sleeping) to its left child, which adjusts its d flag
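
The two flags and the message passing to the left child can be sketched as follows. The class and method names are hypothetical, and the `top_level` switch (operators that deliver results directly are always considered in demand) is an assumption added so the example has a starting point for demand.

```python
class JoinOp:
    def __init__(self, name, left=None, top_level=False):
        self.name = name
        self.left = left            # left child join, or None if the left input is a source
        self.top_level = top_level  # assumption: result-producing joins are always demanded
        self.running_parents = 0    # counter behind the d flag
        self.r = False              # right input currently non-empty
        self._running = False

    @property
    def d(self):
        return self.top_level or self.running_parents > 0

    @property
    def running(self):
        return self.d and self.r    # an operator runs only when both flags are set

    def set_right_input(self, non_empty):
        self.r = non_empty
        self._sync()

    def _parent_changed(self, delta):
        self.running_parents += delta
        self._sync()

    def _sync(self):
        """On a running/sleeping change, notify the left child so it updates d."""
        if self.running != self._running:
            self._running = self.running
            if self.left is not None:
                self.left._parent_changed(+1 if self._running else -1)


lower = JoinOp("lower")
upper_a = JoinOp("upper_a", left=lower, top_level=True)
upper_b = JoinOp("upper_b", left=lower, top_level=True)
lower.set_right_input(True)     # has right input, but nobody consumes its output yet
asleep_without_demand = not lower.running
upper_a.set_right_input(True)   # first parent wakes up, so lower starts running
upper_b.set_right_input(True)
upper_a.set_right_input(False)  # one parent sleeps; lower still has a running parent
still_running = lower.running
upper_b.set_right_input(False)  # last parent sleeps; lower goes to sleep too
```

The counter of running parents is what lets an operator stay awake while at least one consumer remains, and fall asleep as soon as the count reaches zero.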

DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE

Assume U{, } stops producing output

Result:
  the join fed by U{, } turns off its r flag and goes to sleep (transition 2)
  it calls its left child, which decreases its counter of running parents
  no further action is needed for the left child, as it still has other running parents

DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE

If T{}, V{, } dry up too, the joins they feed go to sleep

When the left child decreases its counter to zero (rParents = 0), it goes to sleep as well (transition 3)

EXAMPLE - ONLY TWO JOIN OPERATORS ARE RUNNING

One join does not generate results, due to lack of left input

When T{} begins producing output, it causes the join above it to adjust its r flag, wake up (transition 9), and call Pstart on its left child

That operator restarts and informs its own child in turn

EXAMPLE - ALL JOINS RUN AGAIN, EXCEPT TWO

Note: this method is not restricted to keyword search; it can equally benefit other data stream applications.

PARTIAL-MESH (3/3)

Basic Idea: A Partial-Mesh (PM) is built at runtime and breaks the distinction between
  operator initialization
  tuple processing

The method maintains relatively few active operators in memory

It is each operator's responsibility to create its parents before it can produce output

An operator destroys its parents (and other operators up the tree) if it cannot supply them with input

In large meshes, many operators are idle
Their absence does not affect the result's completeness, but dramatically reduces memory consumption

PARTIAL-MESH EXAMPLE

When the leftmost source S{} first produces output, it creates its direct parents

When one of these parents generates results, it creates its own parents in turn

PARTIAL-MESH EXAMPLE

When an operator outputs a first tuple t and instantiates a parent, this parent immediately probes t against T{}
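
A rough sketch of the lazy construction, under strong simplifications: operators form a tree rather than a shared mesh, a hypothetical lookup table `parents_of` stands in for the logic that decides which parents to create, and "probing" is reduced to handing the tuple to the new parent. All names are illustrative.

```python
class LazyOp:
    def __init__(self, name, parents_of):
        self.name = name
        self.parents_of = parents_of  # assumed lookup: operator name -> names of its parents
        self.parents = None           # parents do not exist until the first output
        self.inbox = []               # tuples received from below

    def emit(self, tup):
        """Create the parents on first output, then pass the tuple up."""
        if self.parents is None:
            names = self.parents_of.get(self.name, [])
            self.parents = [LazyOp(p, self.parents_of) for p in names]
        for parent in self.parents:
            parent.inbox.append(tup)  # a fresh parent sees the tuple right away

    def dry_up(self):
        """No more input to offer: drop the parents and everything above them."""
        self.parents = None

    def size(self):
        """Number of operators currently held in memory, this one included."""
        return 1 + sum(p.size() for p in (self.parents or []))


parents_of = {"S{}": ["j1", "j2"], "j1": ["j3"]}
source = LazyOp("S{}", parents_of)
size_before = source.size()   # only the source exists
source.emit("t")              # first output creates j1 and j2
size_after_first = source.size()
source.parents[0].emit("u")   # j1 produces a result and creates j3
size_after_second = source.size()
source.dry_up()               # the source can no longer supply its parents
size_after_dry = source.size()
```

Memory use follows the data: the structure grows from one operator to four as results appear, and shrinks back to one when the source dries up.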

PARTIAL-MESH ALGORITHM

Basic Idea: TreeGen is an algorithm for reconstructing the tree that corresponds to an operator
  it decides which parents to create

The algorithm checks the join condition of the operator

If a source is joined with a node of the tree, the new tree is generated by adding that source as the rightmost child of that node

PARTIAL-MESH EXAMPLES OF TREEGEN

TreeGen(S{}) returns a tree that contains a single node S{}

Each parent is inserted in the mesh and connected to its left and right inputs

A call of TreeGen on a join operator returns the tree of that operator

The expansion of this tree reveals the parents of the operator


SNAPSHOT R-KWS QUERIES OVER TABLES (1/3)

Comparing GB and OB implementation:

Experiments are focused on tables Part (0.2M entries), Supplier (10K), PartSupp (0.8M), Customer (150K), Orders (1.5M), and LineItem (6M)

Two tables can join if and only if there is a foreign-key to primary-key relationship between them

The length of join sequences is restricted to Tmax, which ranges between 4 and 6.

EXAMPLE - SEVEN SETS OF R-KWS QUERIES QS 1 - QS 7

QS 1, QS 2: people's or companies' names (denoted as PeopleName), which appear in the columns Customer.Name, Supplier.Name, and Orders.Clerk (retrieve connections between multiple people)

QS 3, QS 4: terms from the name of a part, for example, "ivory", from the Part.Name attribute

QS 5, QS 6: years, which are present in LineItem.ShipDate, LineItem.CommitDate, LineItem.ReceiptDate, Orders.OrderDate

QS 7: terms from Part.Brand, Part.Mfgr, Part.Size, and Part.Container

EXAMPLE - PROCESSING TIME FOR QUERIES QS 1 - QS 7

The figure depicts the total runtime (y-axis) of GB and OB, and the result set cardinality |R| (below the x-axis), for the seven query sets

It reports the median values after setting Tmax to 4, 5, and 6.

SNAPSHOT R-KWS QUERIES OVER TABLES - CONCLUSION

GB
  (+) For conventional tables, GB is more efficient than OB
  (+) GSearch avoids duplicate results, which reduces the total cost
  (+) GB is preferable for datasets with frequent updates
  (-) Not efficient for queries involving numerous keywords and/or a large value of Tmax
  (-) Consumes a large amount of main memory to store the data graph
  Conclusion: On servers dedicated to R-KWS queries, GB is the best choice due to its high performance

OB
  (+) OB utilizes the functionality provided by a DBMS and, thus, can answer R-KWS queries using much less memory than GB
  Conclusion: On servers running multiple applications and only answering R-KWS queries infrequently, OB might be preferable due to its low memory footprint

CONTINUOUS R-KWS QUERIES OVER STREAMS (2/2)

CONTINUOUS R-KWS QUERIES OVER STREAMS - CONCLUSION

Full Mesh (FM) is usually the most CPU-efficient method for a single query

GB and Partial Mesh (PM) are more economical in terms of memory consumption


CONCLUSION - ADVANTAGES OF R-KWS

R-KWS handles broad query tasks whose complexity does not permit handcoded structured queries

It presents considerable algorithmic challenges, because query processing has to explore a vast search space

The authors address these challenges through a series of contributions:
  they provide R-KWS semantics that are well defined and easily extensible to streaming environments
  they develop GB and OB processing techniques that match these semantics and remedy problems encountered in previous systems
  they adapt their framework to relational streams, and propose a wide range of optimizations
  they support their claims through an extensive set of experiments

CONCLUSION - FUTURE WORK

The authors plan to further improve R-KWS performance by means of indexing

They intend to integrate ranking into continuous R-KWS query processing

Example: if there is a sudden burst of results, it may be desirable to report only the top-k answers for the affected period.
