KEYWORD SEARCH OVER RELATIONAL TABLES AND STREAMS ALEXANDER MARKOWETZ University of Bonn YIN YANG and DIMITRIS PAPADIAS Hong Kong University of Science and Technology Doklea Meci (A.M 2152) May 2012 University Of Crete Department Of Computer Science 1

Page 1: Keyword  Search over  Relational  Tables and  Streams

KEYWORD SEARCH OVER RELATIONAL TABLES AND STREAMS

ALEXANDER MARKOWETZ, University of Bonn
YIN YANG and DIMITRIS PAPADIAS, Hong Kong University of Science and Technology

Doklea Meci (A.M 2152), May 2012

University of Crete, Department of Computer Science

Page 2: Keyword  Search over  Relational  Tables and  Streams

THE CHALLENGES OF ACCESSING STRUCTURED DATA

Query languages: numerous complex SQL statements
Schemas: complex or nontrivial schema
R-KWS queries:
  replace numerous complex SQL statements
  liberate users from studying a database schema
  allow querying for terms in unknown locations (tables/attributes)

Page 3: Keyword  Search over  Relational  Tables and  Streams

INTRODUCTION

KeyWord Search (KWS)
  each document/Web page constitutes one unit of information
  a document is a result if it contains a subset of the query’s keywords
  has been applied to relational DBMS
  allows data retrieval without SQL
Relational-Keyword Search (R-KWS)
  the basic unit of information is a record/tuple
  queries cannot be answered by inspecting records individually
  results have to be constructed by joining tuples
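The point that results have to be constructed by joining tuples can be illustrated with a small sketch. The two tables, their columns, and the sample rows below are hypothetical and only serve the illustration; they are not taken from the slides.

```python
import sqlite3

# Hypothetical two-table database: the keywords "Alice" and "Carol" live in
# different tuples that are connected by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        clerk TEXT
    );
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 'Carol'), (11, 2, 'Dave');
""")

keywords = ("Alice", "Carol")


def single_tuple_results(table, words):
    """Tuples of `table` that contain every keyword on their own."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [r for r in rows if all(any(w == str(v) for v in r) for w in words)]


# Inspecting tuples individually finds nothing in either table, but joining
# along the foreign key yields one connected result.
joined = conn.execute(
    """
    SELECT c.name, o.id, o.clerk
    FROM customer AS c JOIN orders AS o ON o.customer_id = c.id
    WHERE c.name = ? AND o.clerk = ?
    """,
    keywords,
).fetchall()
```

An R-KWS system would discover such join paths automatically instead of requiring the user to write the SQL by hand.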

Page 4: Keyword  Search over  Relational  Tables and  Streams

OUTLINE

Introduction
Relational Keyword Search on Tables
  Graph-Based Processing
  Operator-Based Processing
Optimizations for Continuous GB
  Predecessor-KL
  Time-KL
Optimizations for Continuous OB
  Operator Mesh
  Demand-Driven Operator Execution
  Partial-Mesh
Experimental Evaluation
  Snapshot R-KWS Queries over Tables
  Continuous R-KWS Queries over Streams
  Summary of Experimental Evaluation
Conclusion

Page 5: Keyword  Search over  Relational  Tables and  Streams

RELATIONAL KEYWORD SEARCH ON TABLES

Goal: methods for GB and OB processing that
  avoid the shortcomings of prior systems
  improve performance of R-KWS in conventional databases

Page 6: Keyword  Search over  Relational  Tables and  Streams

GRAPH-BASED PROCESSING

Basic Idea:
  given an inverted index I (on disk), GB processing traverses an undirected data graph G (in memory), searching for MTJNT (Minimal Total Join Networks of Tuples) results
  Join Networks of Tuples (JNT) are connected acyclic components of G
  a JNT is total if it contains all query keywords
  a JNT is called a Minimal Total JNT (MTJNT) iff it is total and it is impossible to remove any node and find the remainder to be total
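The MTJNT definition can be made concrete with a small check. This is a minimal sketch under stated assumptions: the input is already a valid tree (connected and acyclic), "total" means that the nodes together contain every query keyword, and only leaf removals are tested because removing an inner node would disconnect the tree. The function and parameter names are hypothetical.

```python
def is_total(nodes, keywords_of, query):
    """True if the tuples in `nodes` together contain every query keyword."""
    covered = set().union(*(keywords_of.get(n, set()) for n in nodes))
    return set(query) <= covered


def is_mtjnt(nodes, edges, keywords_of, query):
    """Check minimality and totality of a tree given as node set plus edge list."""
    nodes = set(nodes)
    if not is_total(nodes, keywords_of, query):
        return False
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Only leaves (degree <= 1) can be removed without disconnecting the tree.
    leaves = [n for n in nodes if degree[n] <= 1]
    # Minimal: no leaf can be dropped while still covering all keywords.
    return not any(is_total(nodes - {leaf}, keywords_of, query) for leaf in leaves)
```

For example, with `keywords_of = {"a": {"k1"}, "c": {"k2"}}` the path a-b-c is an MTJNT for the query `{"k1", "k2"}`, while a-b-c-d is not, because the keyword-free leaf d can be removed.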

Page 7: Keyword  Search over  Relational  Tables and  Streams

GSEARCH ALGORITHM

Basic Idea: the algorithm enumerates all possible trees in G rooted at sn
Result: a tree that corresponds to an MTJNT

Page 8: Keyword  Search over  Relational  Tables and  Streams

GSEARCH ALGORITHM

GSearch maintains a queue Q of trees, each constituting a fraction of a potential MTJNT
Every tree is de-queued and expanded by adding one new node, resulting in a new tree
The new tree falls into one of three categories:
  It forms an MTJNT, and is included in the result set
  It has the potential to become an MTJNT, and is inserted in Q to be expanded later
  It is neither of the above, and can be safely discarded
The algorithm terminates when Q becomes empty
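The queue-based enumeration can be sketched as follows. This is a simplified illustration, not the authors' GSearch: it avoids duplicates with a `seen` set rather than a canonical expansion order, it assumes `t_max` bounds the number of nodes in a tree, and all names are hypothetical.

```python
from collections import deque


def covers(nodes, keywords_of, query):
    found = set().union(*(keywords_of.get(n, set()) for n in nodes))
    return set(query) <= found


def minimal(nodes, edges, keywords_of, query):
    degree = {n: 0 for n in nodes}
    for edge in edges:
        for n in edge:
            degree[n] += 1
    leaves = [n for n in nodes if degree[n] <= 1]
    return not any(covers(nodes - {leaf}, keywords_of, query) for leaf in leaves)


def gsearch(graph, keywords_of, query, root, t_max):
    """Enumerate MTJNT containing `root`; trees are (node set, edge set) pairs."""
    results, queue, seen = [], deque(), set()

    def classify(nodes, edges):
        tree = (nodes, edges)
        if tree in seen:
            return
        seen.add(tree)
        if covers(nodes, keywords_of, query):
            # Category 1 if minimal; a total but non-minimal tree is discarded,
            # because every extension of it stays non-minimal.
            if minimal(nodes, edges, keywords_of, query):
                results.append(tree)
        elif len(nodes) < t_max:
            queue.append(tree)  # category 2: may still grow into an MTJNT
        # Otherwise category 3: too large and not total, so discard.

    classify(frozenset([root]), frozenset())
    while queue:
        nodes, edges = queue.popleft()
        for u in nodes:
            for v in graph.get(u, ()):
                if v not in nodes:
                    classify(nodes | {v}, edges | {frozenset((u, v))})
    return results
```

Here `graph` is an adjacency mapping of the undirected data graph, so each edge must be listed in both directions.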

Page 9: Keyword  Search over  Relational  Tables and  Streams

GSEARCH ALGORITHM

GSearch computes the set of MTJNT containing node sn, and so GB answers an R-KWS query q correctly, completely, and without duplicates.

Page 10: Keyword  Search over  Relational  Tables and  Streams

OPERATOR-BASED PROCESSING

Basic Idea: query processing relies on Candidate Networks (CN)

Candidate Networks (CN) are projections of MTJNT onto the expanded schema
  a tuple s of relation S maps to node S{K} of EG(q) iff s contains all keywords in K but does not contain any term in q\K
  an MTJNT projects to a unique CN
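The mapping of a tuple to its expanded-schema node can be sketched in a few lines. The sketch assumes that a tuple's text is tokenized on whitespace and matched case-insensitively, which the slides do not specify; the function name is hypothetical.

```python
def expanded_schema_node(relation, tuple_text, query):
    """Return (relation, K), where K is exactly the set of query keywords in the tuple."""
    terms = set(tuple_text.lower().split())
    matched = frozenset(k for k in query if k.lower() in terms)
    return relation, matched
```

A tuple of relation S containing "ivory" but not "red" maps to S{ivory} for the query {ivory, red}; a tuple containing neither maps to the keyword-free node S{}.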

Page 11: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE

Page 12: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE

Page 13: Keyword  Search over  Relational  Tables and  Streams

OUTLINE

Introduction
Relational Keyword Search on Tables
  Graph-Based Processing
  Operator-Based Processing
Optimizations for Continuous GB
  Predecessor-KL
  Time-KL
Optimizations for Continuous OB
  Operator Mesh
  Demand-Driven Operator Execution
  Partial-Mesh
Experimental Evaluation
  Snapshot R-KWS Queries over Tables
  Continuous R-KWS Queries over Streams
  Summary of Experimental Evaluation
Conclusion

Page 14: Keyword  Search over  Relational  Tables and  Streams

OPTIMIZATIONS FOR CONTINUOUS GB

Basic Idea: keyword labeling, a simple and effective method to summarize reachable keywords for a given node

Improves performance by avoiding unnecessary calls to GSearch and constraining graph traversals.

A keyword label (KL) of format [k, h], stored at node n, indicates a path of h edges in the data graph, connecting n to an occurrence of keyword k.
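Keyword labels of this kind can be computed with a bounded breadth-first search from every keyword occurrence. The sketch below is one possible way to do so on a static graph, not the incremental maintenance used for streams; it assumes labels have the form [k, h], and `max_hops` is a hypothetical bound on path length. The helper `worth_searching` shows how labels let a system skip a search from nodes that cannot reach all query keywords.

```python
from collections import deque


def min_keyword_labels(graph, keywords_of, max_hops):
    """labels[n][k] = fewest edges from n to a node containing k, if <= max_hops."""
    labels = {n: {} for n in graph}
    for k in set().union(*keywords_of.values()):
        # Multi-source BFS starting at every node that contains keyword k.
        frontier = deque((n, 0) for n, ks in keywords_of.items() if k in ks)
        for n, _ in frontier:
            labels.setdefault(n, {})[k] = 0
        while frontier:
            n, h = frontier.popleft()
            if h == max_hops:
                continue
            for m in graph.get(n, ()):
                if k not in labels.setdefault(m, {}):
                    labels[m][k] = h + 1
                    frontier.append((m, h + 1))
    return labels


def worth_searching(labels, node, query):
    """A search from `node` is only useful if it stores a label for every keyword."""
    return all(k in labels.get(node, {}) for k in query)
```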

Page 15: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE

s:[ ,2] corresponds to the path connecting s to an occurrence of , via 2 edges

Page 16: Keyword  Search over  Relational  Tables and  Streams

BENEFITS OF A MIN-COMPLETE LABELING

GSearch(G, q, s) is called only if node s can reach all query terms, that is, only if s stores a KL for every k ∈ q. In any other case, s is guaranteed not to participate in an MTJNT.

KL-aware GSearch Algorithm:
  Inserts a tree into Q iff there exists a set NL of labels meeting the criteria below:
  The KL in NL can reach all missing keywords; that is, NL

Page 17: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE - INTERMEDIATE TREES ABANDONED BY KL-AWARE GSEARCH ( = 9)

lacking keyword
new nodes can only be added to node
can reach in four hops, the shortest path to
2nd criterion not satisfied: while = 6; + 4 fails (6 + 4)

Page 18: Keyword  Search over  Relational  Tables and  Streams

PREDECESSOR-KL IMPLEMENTATION

Basic Idea:
  A predecessor-KL is a triplet of the form [k, h, p]
    it indicates a path of length h, connecting n to an occurrence of keyword k
    p is n’s predecessor
  Every node n must contain a predecessor-KL [k, h, p] for the shortest path leading from n through p to the occurrence of k
  An arriving tuple s can itself contain a keyword, or create new paths between keywords and nodes
    both cases require KL insertions and updates
    each path contains at most edges

Page 19: Keyword  Search over  Relational  Tables and  Streams

PREDECESSOR-KL EXAMPLE

must keep both KL [] , KL[,1, ]: they represent the shortest path via predecessors and
both paths (to and ) share the same predecessor
suffices to keep KL [] through node
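One way to picture the per-node bookkeeping is a small store that keeps, for each (keyword, predecessor) pair, only the shortest known path. This is a sketch under assumptions: the class and method names are hypothetical, and `drop_predecessor` models the explicit removal that would be needed when a neighbouring tuple expires.

```python
class PredecessorLabels:
    """Predecessor-KL entries [k, h, p] stored at a single node."""

    def __init__(self):
        self._hops = {}  # (keyword, predecessor) -> shortest known hop count

    def offer(self, keyword, hops, predecessor):
        """Record a path; return True if it is new or shorter than the stored one."""
        key = (keyword, predecessor)
        if key in self._hops and self._hops[key] <= hops:
            return False
        self._hops[key] = hops
        return True

    def drop_predecessor(self, predecessor):
        """Remove every label that leads through an expired neighbour."""
        self._hops = {key: h for key, h in self._hops.items() if key[1] != predecessor}

    def shortest(self, keyword):
        candidates = [h for (k, _), h in self._hops.items() if k == keyword]
        return min(candidates) if candidates else None
```

Two paths to the same keyword through different predecessors are both kept, whereas two paths through the same predecessor collapse to the shorter one.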

Page 20: Keyword  Search over  Relational  Tables and  Streams

TIME-KL

Basic Idea: a more efficient labeling that does not require explicit removal
  A time-KL is a triplet [k, h, t] indicating a path of length h to an occurrence of keyword k, which exists until time t
  KL [k, h1, t1] dominates another [k, h2, t2] iff (h1 ≤ h2 and t1 ≥ t2)

Result: the graph contains all KL that are not dominated by others

Page 21: Keyword  Search over  Relational  Tables and  Streams

TIME-KL EXAMPLE

1) is connected to in via 2 hops
2) is connected to in via 1 hop
3) is connected to in via 3 hops and node expires at 21

Result:
  (1) and (2) must be stored as each indicates the shortest path for some period of time.
  (3) is not recorded as it expires sooner than the other two
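The rule for which time-KL to keep can be sketched as a small dominance filter. The sketch assumes that one label dominates another for the same keyword when it has no more hops and expires no earlier. Labels are written as (hops, expiry) pairs; the expiry times 30 and 25 used in the usage note are hypothetical, and only the value 21 appears in the example above.

```python
def dominates(a, b):
    """True if label a = (hops, expiry) is at least as short and lives at least as long as b."""
    return a[0] <= b[0] and a[1] >= b[1]


def add_time_label(labels, new):
    """Return the non-dominated labels for one keyword after offering `new`."""
    if any(dominates(old, new) for old in labels):
        return labels  # the new label adds nothing
    return [old for old in labels if not dominates(new, old)] + [new]
```

Offering (2, 30), (1, 25), and (3, 21) in turn keeps the first two, since each is the shortest path for some period, and rejects the third, which is both longer and shorter-lived than (2, 30).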

Page 22: Keyword  Search over  Relational  Tables and  Streams

OUTLINE

Introduction
Relational Keyword Search on Tables
  Graph-Based Processing
  Operator-Based Processing
Optimizations for Continuous GB
  Predecessor-KL
  Time-KL
Optimizations for Continuous OB
  Operator Mesh
  Demand-Driven Operator Execution
  Partial-Mesh
Experimental Evaluation
  Snapshot R-KWS Queries over Tables
  Continuous R-KWS Queries over Streams
  Summary of Experimental Evaluation
Conclusion

Page 23: Keyword  Search over  Relational  Tables and  Streams

OPTIMIZATIONS FOR CONTINUOUS OB

Basic Idea:
  If a selection on a table (e.g., T{}) returns no tuples, all operator trees using this input can be discarded immediately
  For data streams, this is not permissible: even though the selection T{} does not currently produce tuples, it may do so in the future, and all operator trees must thus be maintained

Solution: optimizations that enable efficient OB R-KWS over data streams

Page 24: Keyword  Search over  Relational  Tables and  Streams

OPERATOR MESH (1/3)

Basic Idea: sharing common subexpressions
  all operator trees are integrated into an operator mesh, reducing CPU cost (for evaluating joins) as well as memory overhead (for intermediate results)

The mesh has |SR|* clusters
  |SR| is the number of streaming relations
  |K| is the number of query keywords
Each cluster contains the operator trees for all CN (Candidate Networks) discovered from a certain
The entire operator mesh has |SR|* leaves/sources, one for each node of the extended schema
Maximum depth of the mesh is +1
Number of edges depends on the schema complexity
Different clusters are interconnected only through their source operators
Joins from different clusters do not connect directly

Page 25: Keyword  Search over  Relational  Tables and  Streams

OPERATOR MESH EXAMPLE

The figure shows the shared execution of four operator trees

Page 26: Keyword  Search over  Relational  Tables and  Streams

OPERATOR MESH EXAMPLE

Algorithm:
  The first node in a cluster corresponds to the root node , from which CNGen starts
  Whenever the algorithm generates a new tree from (by adding a new child to a parent ), a join .op is added to the mesh
  The left child of .op is .op (the operator that was inserted when was created)
  The right child is the source of
  For each tree t in CNGen, a pointer is maintained to the corresponding operator t.op, to decide where to place subsequent joins when t is expanded
  The algorithm is initialized with t first .op pointing to the source of
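The effect of sharing common subexpressions can be sketched by treating every operator tree as a left-deep sequence of sources and creating one join per distinct prefix. This is a simplified illustration rather than the CNGen-driven construction described above; the source names in the usage note are hypothetical.

```python
def build_mesh(join_orders):
    """Map each distinct join prefix to its (left input, right source) pair."""
    mesh = {}
    for order in join_orders:
        for size in range(2, len(order) + 1):
            prefix = tuple(order[:size])
            # The left input is the shorter prefix (a join, or the root source when
            # it has length 1); the right input is always a source.
            mesh.setdefault(prefix, (prefix[:-1], prefix[-1]))
    return mesh
```

For the four hypothetical trees `("S{k1}", "T{}", "U{k2}")`, `("S{k1}", "T{}", "V{k2}")`, `("S{k1}", "T{}", "U{}", "V{k2}")`, and `("S{k1}", "U{k2}")`, separate evaluation would need 2 + 2 + 3 + 1 = 8 joins, whereas the mesh contains 6, because the join of S{k1} with T{} is built once and reused by three trees.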

Page 27: Keyword  Search over  Relational  Tables and  Streams

PROBLEMS WITH OPERATOR MESH APPROACH

Example:
  Assume tuples from S{} and T{} and V{},U{, },V {, } are empty
  none of the joins , , or requires the output of because they do not receive right input

Worst case:
  ’s results expire before the arrival of any tuples from V{},U{, } or V {, }
  The join has wasted CPU and memory, without any contribution to the query

Page 28: Keyword  Search over  Relational  Tables and  Streams

DEMAND-DRIVEN OPERATOR EXECUTION (2/3)

This mesh is maintained in main memory throughout the lifespan of the query.
A join is considered to be either
  running: the operator processes input
  sleeping: the operator ignores input
A join operator is sent to sleep if
  it has no input from the right child (a source), or
  all its parents are sleeping
Sending operators to sleep does not affect the result’s correctness or completeness because either
  the operator cannot produce output, or
  its output would not be consumed

Page 29: Keyword  Search over  Relational  Tables and  Streams

DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE

The figure shows the state diagram for a join operator

Page 30: Keyword  Search over  Relational  Tables and  Streams

DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE

States are characterized by two binary flags:
  d, indicating that at least one parent operator is running, and
  r, specifying that the operator’s right input is not empty
An operator only runs in the topmost state (d/r)
When it leaves this state (transition 2 or 3) it goes to sleep (or halts), to wake up (or restart) later (transitions 9 and 10)
Operators exchange messages regarding their state, in order to ensure that all d and r flags are up-to-date
A join operator communicates changes (running/sleeping) to its left child, which adjusts its d flag
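The interplay of the d and r flags can be sketched with a small class. This is a simplified model, not the full state diagram with its numbered transitions: it assumes that the d flag is derived from a counter of running parents, and that joins delivering results directly to the user (the hypothetical `delivers_results` parameter) always count as in demand.

```python
class Join:
    """A join operator that runs only while it is in demand (d) and has right input (r)."""

    def __init__(self, name, left=None, delivers_results=False):
        self.name = name
        self.left = left                          # left child join; None if the left input is a source
        self.delivers_results = delivers_results  # assumption: such joins are always in demand
        self.running_parents = 0                  # the d flag is set while this counter is > 0
        self.right_nonempty = False               # the r flag
        self.running = False

    def _update(self):
        in_demand = self.delivers_results or self.running_parents > 0
        should_run = in_demand and self.right_nonempty
        if should_run == self.running:
            return
        self.running = should_run
        if self.left is not None:
            # Tell the left child that one of its parents started or stopped running.
            self.left.running_parents += 1 if should_run else -1
            self.left._update()

    def set_right_input(self, nonempty):
        self.right_nonempty = nonempty
        self._update()


# Usage: two result-producing joins share the same left child.
j1 = Join("j1")
j2 = Join("j2", left=j1, delivers_results=True)
j3 = Join("j3", left=j1, delivers_results=True)
for j in (j1, j2, j3):
    j.set_right_input(True)
```

When the right input of `j2` dries up, only `j2` goes to sleep, because `j1` still has a running parent; once `j3` loses its right input as well, `j1` sleeps too, and it wakes up again as soon as either parent resumes.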

Page 31: Keyword  Search over  Relational  Tables and  Streams

DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE

Assume U{, } stops producing output

Result:
  turns off its r flag, goes to sleep (transition 2)
  calls its left child
  decreases its counter of running parents
  no further actions for as there are other running parents ,

Page 32: Keyword  Search over  Relational  Tables and  Streams

DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE

If T{},V{, } dries up, too, then goes to sleep
When operator decreases its counter (rParents=0)
Transition 3

Page 33: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE - CONSIDERING THAT THE ONLY RUNNING JOIN OPERATORS ARE AND

Join does not generate results, due to lack of left input
When T{} begins producing output, it causes to adjust its r flag, wake up (transition 9), and call .Pstart
operator restarts and informs

Page 34: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE - ALL JOINS RUN AGAIN EXCEPT AND

Note: this method is not restricted to keyword search; it can equally benefit other data stream applications.

Page 35: Keyword  Search over  Relational  Tables and  Streams

PARTIAL-MESH (3/3)

Basic Idea:
  A Partial-Mesh (PM) is built at runtime and breaks the distinction between
    operator initialization
    tuple processing
  The method maintains relatively few active operators in memory
  It is each operator’s responsibility to create its parents before it can produce output
  It destroys its parents (and other operators up the tree) if it cannot supply them with input
  In large meshes operators are idle
    Their absence does not affect the result’s completeness, but dramatically reduces memory consumption
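The create-on-output and destroy-on-dry behaviour can be sketched with a small class that tracks which operators currently exist. This is only an illustration: the `parents_of` blueprint is a hypothetical stand-in for the parent lookup that the deck attributes to TreeGen, and the operator names are invented.

```python
class PartialMesh:
    """Keeps only those join operators alive whose children can supply input."""

    def __init__(self, parents_of):
        self.parents_of = parents_of  # hypothetical blueprint: operator -> parent joins
        self.active = set()           # joins that currently exist in memory

    def on_output(self, op):
        """`op` produced output: make sure its direct parents exist."""
        created = [p for p in self.parents_of.get(op, ()) if p not in self.active]
        self.active.update(created)
        return created

    def on_dry(self, op):
        """`op` can no longer supply input: destroy its parents and everything above."""
        for parent in self.parents_of.get(op, ()):
            if parent in self.active:
                self.active.discard(parent)
                self.on_dry(parent)
```

With the blueprint `{"S{k1}": ["j1", "j2"], "j1": ["j3"]}`, the first output of the source creates `j1` and `j2`, the first output of `j1` creates `j3`, and all three are destroyed again once the source dries up.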

Page 36: Keyword  Search over  Relational  Tables and  Streams

PARTIAL-MESH EXAMPLE

When the leftmost source S{} first produces output, it creates its direct parents and
when generates results, it creates its own parents

Page 37: Keyword  Search over  Relational  Tables and  Streams

PARTIAL-MESH EXAMPLE

when outputs a first tuple t and instantiates , this operator immediately probes t against T {}

Page 38: Keyword  Search over  Relational  Tables and  Streams

PARTIAL-MESH ALGORITHM

Basic Idea: TreeGen is an algorithm for reconstructing a tree
  It decides which parents to create
  The algorithm checks the join condition of .op
  If is the source joined with then is generated by adding as the rightmost child of in

Page 39: Keyword  Search over  Relational  Tables and  Streams

PARTIAL-MESH EXAMPLES OF TREEGEN

TreeGen(S{}) returns a tree that contains a single node S{}
parent is inserted in the mesh and connected to its left and right inputs
The call TreeGen() returns the tree
The expansion of reveals the parents of (e.g., , , )

Page 40: Keyword  Search over  Relational  Tables and  Streams

OUTLINE

Introduction
Relational Keyword Search on Tables
  Graph-Based Processing
  Operator-Based Processing
Optimizations for Continuous GB
  Predecessor-KL
  Time-KL
Optimizations for Continuous OB
  Operator Mesh
  Demand-Driven Operator Execution
  Partial-Mesh
Experimental Evaluation
  Snapshot R-KWS Queries over Tables
  Continuous R-KWS Queries over Streams
Conclusion

Page 41: Keyword  Search over  Relational  Tables and  Streams

SNAPSHOT R-KWS QUERIES OVER TABLES (1/3)

Comparing GB and OB implementation:
  Experiments are focused on tables Part (0.2M entries), Supplier (10K), PartSupp (0.8M), Customer (150K), Orders (1.5M), and LineItem (6M)
  Two tables can join if and only if there is a foreign-key to primary-key relationship between them
  The length of join sequences is restricted to Tmax, which ranges between 4 and 6.

Page 42: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE

Page 43: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE - SEVEN SETS OF R-KWS QUERIES QS 1 - QS 7

QS 1, QS 2: people’s or companies’ names (denoted as PeopleName), which appear in the columns Customer.Name, Supplier.Name, and Orders.Clerk (retrieve connections between multiple people)
QS 3, QS 4: terms from the name of a part, for example, “ivory”, from the Part.Name attribute

Page 44: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE - SEVEN SETS OF R-KWS QUERIES QS 1 - QS 7

QS 5, QS 6: years, which are present in LineItem.ShipDate, LineItem.CommitDate, LineItem.ReceiptDate, Orders.OrderDate
QS 7: terms from Part.Brand, Part.Mfgr, Part.Size, and Part.Container

Page 45: Keyword  Search over  Relational  Tables and  Streams

EXAMPLE - PROCESSING TIME FOR QUERIES QS 1 - QS 7

The figure depicts the total runtime (y-axis) of GB and OB, and the result set cardinality |R| (below the x-axis), for the seven query sets
Median values are reported after setting Tmax to 4, 5, and 6.

Page 46: Keyword  Search over  Relational  Tables and  Streams

SNAPSHOT R-KWS QUERIES OVER TABLES – CONCLUSION

GB
  (+) For conventional tables, GB is more efficient than OB.
  (+) In GB methods, GSearch avoids duplicate results, which reduces the total cost.
  (+) GB is preferable for datasets with frequent updates.
  (-) Not efficient for queries involving numerous keywords and/or a large value of Tmax.
  (-) Consumes a large amount of main memory to store the data graph.
  Conclusion: On servers dedicated to R-KWS queries, GB is the best choice due to its high performance.

OB
  (+) OB utilizes the functionality provided by a DBMS and thus can answer R-KWS queries using much less memory than GB.
  Conclusion: On servers running multiple applications and only answering R-KWS queries infrequently, OB might be preferable due to its low memory footprint.

Page 47: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS (2/2)

Page 48: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS

Page 49: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS

Page 50: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS

Page 51: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS

Page 52: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS

Page 53: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS

Page 54: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS

Page 55: Keyword  Search over  Relational  Tables and  Streams

CONTINUOUS R-KWS QUERIES OVER STREAMS - CONCLUSION

FM is usually the most CPU-efficient method for a single query
GB and PM are more economical in terms of memory consumption
(FM: Full Mesh; PM: Partial Mesh)

Page 56: Keyword  Search over  Relational  Tables and  Streams

OUTLINE

Introduction
Relational Keyword Search on Tables
  Graph-Based Processing
  Operator-Based Processing
Optimizations for Continuous GB
  Predecessor-KL
  Time-KL
Optimizations for Continuous OB
  Operator Mesh
  Demand-Driven Operator Execution
  Partial-Mesh
Experimental Evaluation
  Snapshot R-KWS Queries over Tables
  Continuous R-KWS Queries over Streams
Conclusion

Page 57: Keyword  Search over  Relational  Tables and  Streams

CONCLUSION – ADVANTAGES OF R-KWS

R-KWS handles broad query tasks whose complexity does not permit hand-coded structured queries
It presents considerable algorithmic challenges because query processing has to explore a vast search space
The authors address these challenges through a series of contributions:
  they provide R-KWS semantics that are well defined and easily extensible to streaming environments
  they develop GB and OB processing techniques that match these semantics and remedy problems encountered in previous systems
  they adapt their framework to relational streams, and propose a wide range of optimizations
  they support their claims through an extensive set of experiments

Page 58: Keyword  Search over  Relational  Tables and  Streams

CONCLUSION – FUTURE WORK

They plan to further improve R-KWS performance by means of indexing
They intend to integrate ranking into continuous R-KWS query processing
  Example: if there is a sudden burst of results, it may be desirable to report only the top-k answers for the affected period.