ACM SIGMOD International Conference on Management of Data, Beijing, June 14th, 2007. Keyword Search on Relational Data Streams. Alexander Markowetz, Yin Yang, Dimitris Papadias. Hong Kong University of Science and Technology.


Keyword Search on Relational Data Streams

Alexander Markowetz, Yin Yang, Dimitris Papadias

Hong Kong University of Science and Technology


Replacing SQL With Keywords

Query: “Tarantino, Travolta”

Schema graph:

T1 Directors - T2 Movies (T1.did = T2.did)
T2 Movies - T3 Plays (T2.mid = T3.mid)
T3 Plays - T4 Actors (T3.aid = T4.aid)

T4 Actors
t-ID   aid   name        other att.
t4,1   120   Travolta
t4,2   145   Tarantino
t4,3   ...   ...

T3 Plays
t-ID   mid   aid   other att.
t3,1   2     120
t3,2   5     120
t3,3   5     145
t3,4   ...   ...

T2 Movies
t-ID   mid   title          did   other att.
t2,1   2     Pulp Fiction   100
t2,2   5     Some Trash     139
t2,3   ...   ...            ...

T1 Directors
t-ID   did   name        other att.
t1,1   100   Tarantino
t1,2   ...   ...


SQL queries & Operator Trees

Select *
From actors A1, plays P1, movies M, plays P2, actors A2
Where A1.name = 'Tarantino' and A2.name = 'Travolta'
  and A1.aid = P1.aid and P1.mid = M.mid
  and M.mid = P2.mid and P2.aid = A2.aid

Operator tree: σ name = Tarantino (Actors) ⋈ Plays ⋈ Movies ⋈ Plays ⋈ σ name = Travolta (Actors)

Select *
From directors D, movies M, plays P, actors A
Where D.name = 'Tarantino' and A.name = 'Travolta'
  and D.did = M.did and M.mid = P.mid and P.aid = A.aid

Operator tree: σ name = Tarantino (Directors) ⋈ Movies ⋈ Plays ⋈ σ name = Travolta (Actors)
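To make the two SQL queries concrete, here is a minimal sketch that loads the example rows into an in-memory SQLite database and runs both. The rows follow our reading of the example tables on the previous slide; the "other att." columns are left out, and selecting only `M.title` instead of `*` is a simplification for readability.

```python
import sqlite3

# In-memory database holding the example rows (column assignment is our
# reading of the slide; "other att." columns are omitted).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE directors (did INTEGER, name TEXT);
CREATE TABLE movies    (mid INTEGER, title TEXT, did INTEGER);
CREATE TABLE plays     (mid INTEGER, aid INTEGER);
CREATE TABLE actors    (aid INTEGER, name TEXT);
INSERT INTO directors VALUES (100, 'Tarantino');
INSERT INTO movies    VALUES (2, 'Pulp Fiction', 100), (5, 'Some Trash', 139);
INSERT INTO plays     VALUES (2, 120), (5, 120), (5, 145);
INSERT INTO actors    VALUES (120, 'Travolta'), (145, 'Tarantino');
""")

# Both keywords match actors: they are connected through a common movie.
ACTOR_ACTOR = """
SELECT M.title
FROM actors A1, plays P1, movies M, plays P2, actors A2
WHERE A1.name = 'Tarantino' AND A2.name = 'Travolta'
  AND A1.aid = P1.aid AND P1.mid = M.mid
  AND M.mid = P2.mid AND P2.aid = A2.aid
"""

# One keyword matches a director, the other an actor.
DIRECTOR_ACTOR = """
SELECT M.title
FROM directors D, movies M, plays P, actors A
WHERE D.name = 'Tarantino' AND A.name = 'Travolta'
  AND D.did = M.did AND M.mid = P.mid AND P.aid = A.aid
"""

def titles(sql):
    """Run a query and return the movie titles it produces."""
    return [row[0] for row in conn.execute(sql)]

actor_actor_titles = titles(ACTOR_ACTOR)
director_actor_titles = titles(DIRECTOR_ACTOR)
```

Each query corresponds to one way of connecting the two keywords through the schema; a keyword query stands for all such join queries at once.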


Data Graph G

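[Figure: data graph G over relations S, T, U and V, with tuples s1, s2, t1, t2, u1, u2, u3, v1, v2, v3, v4; some tuples are annotated with the keywords they contain (k1, k2, k3)]

• Nodes = Tuples
• Edges = “can be joined”
• Supports query processing
  – [Bhalotia et al., ICDE, 2002]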
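The two bullets "Nodes = Tuples" and "Edges = can be joined" can be sketched as follows, using the Tarantino/Travolta example rather than the abstract graph in the figure. The `JOINS` table and the helper names are hypothetical; the nested-loop construction is for illustration only and is not taken from the cited work.

```python
from itertools import combinations

# One node per tuple; values follow our reading of the example tables.
tuples = {
    "t1,1": {"rel": "Directors", "did": 100, "name": "Tarantino"},
    "t2,1": {"rel": "Movies", "mid": 2, "did": 100, "title": "Pulp Fiction"},
    "t2,2": {"rel": "Movies", "mid": 5, "did": 139, "title": "Some Trash"},
    "t3,1": {"rel": "Plays", "mid": 2, "aid": 120},
    "t3,2": {"rel": "Plays", "mid": 5, "aid": 120},
    "t3,3": {"rel": "Plays", "mid": 5, "aid": 145},
    "t4,1": {"rel": "Actors", "aid": 120, "name": "Travolta"},
    "t4,2": {"rel": "Actors", "aid": 145, "name": "Tarantino"},
}

# Schema graph edges: pair of relations -> shared join attribute.
JOINS = {
    ("Directors", "Movies"): "did",
    ("Movies", "Plays"): "mid",
    ("Plays", "Actors"): "aid",
}

def joinable(a, b):
    """Two tuples can be joined if their relations are adjacent in the
    schema graph and they agree on the join attribute."""
    for (r1, r2), attr in JOINS.items():
        if {a["rel"], b["rel"]} == {r1, r2}:
            return a[attr] == b[attr]
    return False

def build_data_graph(tuples):
    """Return the edge set of the data graph (pairs of tuple ids)."""
    edges = set()
    for (ida, a), (idb, b) in combinations(sorted(tuples.items()), 2):
        if joinable(a, b):
            edges.add((ida, idb))
    return edges

edges = build_data_graph(tuples)
```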


MTJNT

• Sub-graph of G
  – Contain all keywords
  – Minimal
• Answer R-KWS query
• Limited to Tmax nodes
  – Longer joins = irrelevant results
  – [Hristidis et al. VLDB 2002]

[Figure: two copies of data graph G (tuples s1, s2, t1, t2, u1, u2, u3, v1, v2, v3, v4, annotated with keywords k1, k2, k3), illustrating MTJNTs]
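The three conditions on this slide (contains all keywords, minimal, at most Tmax nodes) can be checked with a small sketch. It assumes the usual reading of "minimal": no tuple can be dropped while keeping a connected network that still contains every keyword. Since dropping an inner node disconnects a tree, only leaves need to be tested. All function names are hypothetical, and the sample keyword assignment (v1 holds k1 and k2, u2 holds k3) is taken from the candidate network examples later in the deck.

```python
def is_tree(nodes, edges):
    """True if (nodes, edges) is connected and acyclic."""
    nodes = set(nodes)
    if len(edges) != len(nodes) - 1:
        return False
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen == nodes

def covers(nodes, keywords_of, query):
    """True if the tuples in `nodes` together contain every query keyword."""
    found = set()
    for n in nodes:
        found |= keywords_of.get(n, set())
    return set(query) <= found

def is_mtjnt(nodes, edges, keywords_of, query, tmax):
    nodes = set(nodes)
    if not nodes or len(nodes) > tmax or not is_tree(nodes, edges):
        return False
    if not covers(nodes, keywords_of, query):
        return False
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    leaves = [n for n in nodes if degree[n] <= 1]
    # Minimal: dropping any leaf must lose at least one keyword.
    return all(not covers(nodes - {leaf}, keywords_of, query) for leaf in leaves)

keywords_of = {"v1": {"k1", "k2"}, "u2": {"k3"}}
query = {"k1", "k2", "k3"}
chain_ok = is_mtjnt(["v1", "t1", "u2"], [("v1", "t1"), ("t1", "u2")],
                    keywords_of, query, tmax=5)
# Hypothetical extra leaf s2 without keywords: the network is no longer minimal.
with_free_leaf = is_mtjnt(["v1", "t1", "u2", "s2"],
                          [("v1", "t1"), ("t1", "u2"), ("t1", "s2")],
                          keywords_of, query, tmax=5)
too_long = is_mtjnt(["v1", "t1", "u2"], [("v1", "t1"), ("t1", "u2")],
                    keywords_of, query, tmax=2)
```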


Candidate Networks (CN)

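• Abstractions of MTJNT
• Examples (MTJNT → CN):
  – v1 - t1 - u2 and v4 - t2 - u3 (keywords k1, k2 and k3) → V{k1, k2} - T{} - U{k3}
  – v1 - t1 - s2 - t2 - u3 (keywords k1, k2 and k3) → V{k1, k2} - T{} - S{} - T{} - U{k3}
  – u1 - t1 - s2 - t2 - u3 with branch v2 (keywords k1, k3 and k2) → U{k1} - T{} - S{} - T{} - U{k3} with branch V{k2}
• Note: duplicate nodes in CN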
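As a rough illustration of "abstraction", the sketch below maps each tuple of an MTJNT to its relation name plus the set of query keywords that tuple contains. The function and variable names are hypothetical; the sample data is the v1 - t1 - s2 - t2 - u3 chain from the examples above. Positions rather than labels identify CN nodes, because the same label (here T{}) may occur more than once.

```python
def node_label(relation, keywords):
    """Render a CN node such as V{k1,k2} or T{}."""
    return relation + "{" + ",".join(sorted(keywords)) + "}"

def to_candidate_network(nodes, edges, relation_of, keywords_of, query):
    """Abstract an MTJNT into a CN.

    Returns (labels, cn_edges): labels[i] is the CN node for nodes[i],
    cn_edges holds index pairs so duplicate labels stay distinguishable.
    """
    index = {n: i for i, n in enumerate(nodes)}
    labels = [
        node_label(relation_of[n], keywords_of.get(n, set()) & set(query))
        for n in nodes
    ]
    cn_edges = [(index[a], index[b]) for a, b in edges]
    return labels, cn_edges

nodes = ["v1", "t1", "s2", "t2", "u3"]
edges = [("v1", "t1"), ("t1", "s2"), ("s2", "t2"), ("t2", "u3")]
relation_of = {n: n[0].upper() for n in nodes}   # v1 -> V, t1 -> T, ...
keywords_of = {"v1": {"k1", "k2"}, "u3": {"k3"}}

labels, cn_edges = to_candidate_network(
    nodes, edges, relation_of, keywords_of, {"k1", "k2", "k3"}
)
```

Many MTJNTs map to the same CN, which is why a CN can later be turned into a single operator tree that produces all of them.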

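Operator Trees for Candidate Networks

[Figure: the MTJNT u1 - t1 - s2 - t2 - u3 with branch v2 (keywords k1, k3 and k2), its CN U{k1} - T{} - S{} - T{} - U{k3} with branch V{k2}, and the corresponding operator tree over the leaves U{k1}, T{}, S{}, T{}, V{k2}, U{k3}; output: MTJNT]

• Leaves = Selections
• Inner nodes = Joins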
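A minimal sketch of "leaves = selections, inner nodes = joins", evaluated over static lists rather than streams. The classes `Select` and `Join` and the helper `key_join` are hypothetical; the data is the Directors/Movies/Plays/Actors example from the earlier slides, and the nested-loop join is chosen only for brevity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Select:
    """Leaf: filters one relation; emits 1-tuples of rows."""
    rows: list
    pred: Callable
    def run(self):
        return [(r,) for r in self.rows if self.pred(r)]

@dataclass
class Join:
    """Inner node: combines the outputs of its two children."""
    left: object
    right: object
    on: Callable
    def run(self):
        right_rows = self.right.run()
        return [l + r for l in self.left.run() for r in right_rows if self.on(l, r)]

def key_join(attr):
    # Join the last row of the left combination with the first row on the right.
    return lambda left, right: left[-1][attr] == right[0][attr]

directors = [{"did": 100, "name": "Tarantino"}]
movies = [{"mid": 2, "title": "Pulp Fiction", "did": 100},
          {"mid": 5, "title": "Some Trash", "did": 139}]
plays = [{"mid": 2, "aid": 120}, {"mid": 5, "aid": 120}, {"mid": 5, "aid": 145}]
actors = [{"aid": 120, "name": "Travolta"}, {"aid": 145, "name": "Tarantino"}]

# Left-deep tree for Directors - Movies - Plays - Actors.
tree = Join(
    Join(
        Join(Select(directors, lambda r: r["name"] == "Tarantino"),
             Select(movies, lambda r: True), key_join("did")),
        Select(plays, lambda r: True), key_join("mid")),
    Select(actors, lambda r: r["name"] == "Travolta"), key_join("aid"))

results = tree.run()
result_titles = [combo[1]["title"] for combo in results]
```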


Challenges …

… of S-KWS, compared with R-KWS:

• Stream Semantics

• Efficient CN-Generation

• Optimized Operator Execution

• Stream Specific Issues


Instantaneous Data Graph

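[Figure: data graph G (tuples s1, s2, t1, t2, u1, u2, u3, v1, v2, v3, v4, annotated with keywords k1, k2, k3) in which tuples carry intervals: [1, 10), [2, 11), [6, 15), [4, 13), [3, 12), [9, 18), [6, 15), [3, 12), [5, 14), [1, 10), [7, 16), [5, 10)]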
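One way to read the intervals in the figure is as half-open validity periods [start, end): a tuple belongs to the graph at time t when start <= t < end, and a join result is valid only while all of its tuples are. The sketch below works under that assumption; the node names a, b, c are hypothetical, since the assignment of intervals to nodes cannot be read off the slide.

```python
def alive(interval, t):
    """Half-open validity: the tuple exists for start <= t < end."""
    start, end = interval
    return start <= t < end

def instantaneous_nodes(validity, t):
    """Nodes of the data graph as it looks at time t."""
    return {n for n, iv in validity.items() if alive(iv, t)}

def joint_validity(intervals):
    """Period during which all given tuples coexist, or None if there is none."""
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None

validity = {"a": (1, 10), "b": (5, 14), "c": (9, 18)}   # hypothetical nodes

at_9 = instantaneous_nodes(validity, 9)
at_10 = instantaneous_nodes(validity, 10)
ab = joint_validity([validity["a"], validity["b"]])
disjoint = joint_validity([(1, 10), (10, 19)])
```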


Architecture

Query q → (Generate) → Candidate Networks → (Translate) → Operator Trees → (Execute) → Results


CN-Generation

• Target: Avoid duplicate CN
• Idea: Model CN as unique tree
• Pre-order generation

[Figure: the CN U{k4} - T{} - S{k1} - T{k1} - U{k3} with branch V{k2}, redrawn as a tree with nodes U{k4}, T{}, S{k1}, T{k1}, U{k3}, V{k2}]
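To illustrate why a unique tree form avoids duplicates, the sketch below computes a canonical string for a labeled, unrooted tree, so that two CNs differing only in node order receive the same string. This is a generic canonical-form technique, not the pre-order generation scheme named on the slide, and the attachment of V{k2} to S{k1} is an assumption made for the example.

```python
def adjacency(n, edges):
    """Adjacency lists for nodes 0..n-1."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def canon_rooted(adj, labels, node, parent=None):
    """Canonical string of the subtree below `node`: child encodings are
    sorted, so sibling order does not matter."""
    children = sorted(
        canon_rooted(adj, labels, c, node) for c in adj[node] if c != parent
    )
    return labels[node] + "(" + "".join(children) + ")"

def canon(adj, labels):
    """Canonical string of the unrooted tree: smallest encoding over all roots.
    Fine for small CNs (at most Tmax nodes)."""
    return min(canon_rooted(adj, labels, root) for root in adj)

# Assumed shape: chain U{k4} - T{} - S{k1} - T{k1} - U{k3}, V{k2} on S{k1}.
labels_a = ["U{k4}", "T{}", "S{k1}", "T{k1}", "U{k3}", "V{k2}"]
edges_a = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5)]

# Same CN with the nodes numbered in reverse order.
labels_b = list(reversed(labels_a))
edges_b = [(5 - a, 5 - b) for a, b in edges_a]

# A different CN: V{k2} attached to T{} instead.
edges_c = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5)]

canon_a = canon(adjacency(6, edges_a), labels_a)
canon_b = canon(adjacency(6, edges_b), labels_b)
canon_c = canon(adjacency(6, edges_c), labels_a)
```

A generator could keep a set of canonical strings and discard any CN whose string was already seen; generating each tree only once in the first place, as the slide proposes, avoids even that check.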


Operator Mesh

[Figure: operator mesh. Buffered leaf inputs (each with a Buffer.out) S{k1}, T{}, T{k1}, T{k2, k3}, U{k3}, U{k2, k3}, V{k2}, V{k2, k3}; join operators j1 to j7 with intermediate outputs j1.out, j2.out, j3.out; one output operator. CNs shown: S{k1}, T{}, U{k2, k3}; S{k1}, T{}, V{k2, k3}; S{k1}, T{k2, k3}; S{k1}, T{}, V{k2}, T{k1}, U{k3}]


Demand Driven Operator Execution (I)

[Figure: state diagram of an operator with four states, "!d / r sleeping", "d / !r sleeping", "d / r running" and "!d / !r sleeping", and transitions numbered 1 to 10. Transition labels: PStart() with rParents++, PStop() with rParents--, rParents == 0, RightSide = Ø, and "Tuple arrives at right input"]
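The state diagram can be approximated in code under some assumptions that the slide does not spell out: d is read as "demanded" (rParents > 0, where rParents counts running parents), r as "right side non-empty", and an operator runs only when both hold. A running operator is assumed to call PStart() on its children and a sleeping one PStop(). The class and method names below are hypothetical.

```python
class JoinOperator:
    """Sketch of demand-driven sleeping and waking for one mesh operator."""

    def __init__(self, children=()):
        self.children = list(children)
        self.r_parents = 0      # number of running parents (assumed meaning)
        self.right_side = []    # tuples buffered at the right input
        self.running = False

    @property
    def demanded(self):
        return self.r_parents > 0

    def _update(self):
        # Run only in state "d / r"; all other states are sleeping.
        should_run = self.demanded and bool(self.right_side)
        if should_run and not self.running:
            self.running = True
            for child in self.children:
                child.p_start()
        elif not should_run and self.running:
            self.running = False
            for child in self.children:
                child.p_stop()

    def p_start(self):
        self.r_parents += 1
        self._update()

    def p_stop(self):
        self.r_parents -= 1
        self._update()

    def right_arrival(self, tup):
        self.right_side.append(tup)
        self._update()

    def right_expire(self, tup):
        self.right_side.remove(tup)
        self._update()

child = JoinOperator()
parent = JoinOperator([child])

parent.p_start()                 # demand from above, e.g. the output operator
state_1 = (parent.running, child.r_parents)
parent.right_arrival("x")        # parent wakes up and demands its child
state_2 = (parent.running, child.r_parents, child.running)
child.right_arrival("y")         # child now has demand and right input
state_3 = child.running
parent.right_expire("x")         # RightSide = Ø: parent sleeps, child follows
state_4 = (parent.running, child.r_parents, child.running)
```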


Demand Driven Operator Execution (II)

[Figure: operator mesh. Buffered leaf inputs (each with a Buffer.out) S{k1}, T{}, T{k1}, T{k2, k3}, U{k3}, U{k2, k3}, V{k2}, V{k2, k3}; join operators j1 to j7 with intermediate outputs j1.out, j2.out, j3.out; one output operator. CNs shown: S{k1}, T{}, U{k2, k3}; S{k1}, T{}, V{k2, k3}; S{k1}, T{k2, k3}; S{k1}, T{}, V{k2}, T{k1}, U{k3}]


Dynamic Mesh Expansion

• Create operators when input is available
• Destroy, if no input

[Figure: two snapshots of a mesh over the inputs S{k1}, T{}, T{k2, k3}, U{k2, k3}, V{k2}, V{k2, k3} and an Output node]


Purging Expired Tuples from Buffers

• Positive / Negative
  – Negative tuples remove positives
• Sliding windows
  1. Bottom-Up
     • Similar to Pos./Neg.
  2. Lazy
     • Expired tuples remain in buffers
     • Purged during join execution

[Figure: U{k1} T{} S{}]
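The lazy strategy can be sketched as a buffer that is cleaned only when a join scans it: expired entries stay in place until the next probe, which drops them on the way. The `Buffer` class, the `lazy_join` function and the sample rows are hypothetical and only illustrate the idea.

```python
class Buffer:
    """Holds (expiry_time, tuple) pairs; nothing is removed on expiry."""
    def __init__(self):
        self.items = []

    def insert(self, tup, expiry):
        self.items.append((expiry, tup))

def lazy_join(probe, buffer, now, match):
    """Join `probe` against `buffer` at time `now`.

    Entries whose expiry time has passed are skipped and purged as a side
    effect of the scan; no separate clean-up pass is needed.
    """
    results, survivors = [], []
    for expiry, tup in buffer.items:
        if expiry <= now:
            continue                      # expired: drop lazily
        survivors.append((expiry, tup))
        if match(probe, tup):
            results.append((probe, tup))
    buffer.items = survivors
    return results

buf = Buffer()
buf.insert({"id": "a", "mid": 5}, expiry=10)
buf.insert({"id": "b", "mid": 5}, expiry=14)
buf.insert({"id": "c", "mid": 2}, expiry=18)

size_before = len(buf.items)              # the entry expiring at 10 is still stored
probe = {"id": "p", "mid": 5}
matches = lazy_join(probe, buf, now=12, match=lambda p, t: p["mid"] == t["mid"])
matched_ids = [t["id"] for _, t in matches]
size_after = len(buf.items)
```

The trade-off is that buffers may temporarily hold expired tuples, in exchange for not having to propagate expirations through the operators.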


Mesh Migration

• Schema changes at runtime

• Stream removal
  – Trivial
• Stream arrival
  – Generate new operator mesh, on top of old
  – Preserve intermediate results
  – Observe node order


Window Size

[Figure, two panels over window size w in min. (5, 10, 20, 40, 80), series PM, FM, Forest. CPU consumption: CPU in sec., log scale from 1 to 10^4. Peak memory: bytes, log scale from 10^6 to 10^9. |R|: 8, 33, 236, 1582, 11213]

• PM excels in memory, FM in speed
• Forest always ranks last
  – Omitted from remaining slides


Keyword Frequency

[Figure, two panels over keyword frequency KWF (0.003, 0.007, 0.01, 0.013, 0.016), series PM, FM. CPU consumption: CPU in sec., from 10 to 60. Peak memory: bytes, from 4*10^6 to 8*10^6. |R|: 4, 95, 236, 596, 1089]

• No impact on number of tuples or joinability
• More MTJNT
  – Require CPU and memory for production and storage
• FM loses memory advantage
  – Tuples travel higher in mesh


Number of Streaming Relations

[Figure, two panels over the number of streaming relations |SR| (5, 10, 15, 20, 25), series PM, FM. CPU consumption: CPU in sec., from 0 to 70. Peak memory: bytes, log scale from 10^6 to 10^8. |R|: 24, 137, 236, 331, 374. I: 0.421, 1.406, 2.422, 3.406, 4.516]

• Linear growth in mesh
  – No more than 4 neighbors in schema
• Relative performance as expected


Join Selectivity

[Figure, two panels over join selectivity sel (1/1500, 1/1250, 1/1000, 1/750, 1/500), series PM, FM-D. CPU consumption: CPU in sec., from 20 to 80. Peak memory: bytes, from 4*10^6 to 10^7. |R|: 72, 116, 236, 585, 1758]

• Quadratic increase in number of graph edges
  – Increasing intermediate results and MTJNT
  – Increasing CPU and memory consumption
• More tuples travel up the mesh:
  – PM (re-)generates more operators; hence requires more CPU


Number of Keywords

[Figure, two panels over the number of keywords |K| (2, 3, 4, 5), series PM, FM. CPU consumption: CPU in sec., log scale from 1 to 10^4. Peak memory: bytes, log scale from 10^6 to 10^10. |R|: 2492, 236, 6, 1. I: 0.078, 2.422, 61.968, 1566]

• Exponential growth in the complexity of the operator mesh
  – Reflected in initialization time
• Most operators in large meshes commonly idle
  – Increased savings through PM


Maximal size of results Tmax

[Figure, two panels over Tmax (2, 3, 4, 5, 6), series PM, FM. CPU consumption: CPU in sec., log scale from 1 to 10^4. Peak memory: bytes, log scale from 10^6 to 10^8. |R|: 5, 41, 236, 1412, 7161. I: 0.062, 0.359, 2.422, 18.078, 140.515]

• Similar impact to |K|
  – Exponential growth of mesh
  – Reflected by initialization phase
• Exponential growth of results


Demand Driven Operator Execution

[Figure, two panels over window size w in min. (5, 10, 20, 40, 80), series FM-!D-L, FM-D-L. CPU consumption: CPU in sec., log scale from 10 to 10^3. Peak memory: bytes, log scale from 10^6 to 10^8]

• Reduced CPU
  – Avoided unnecessary joins
• Reduced memory consumption
  – Avoided storage of many intermediate results


Conclusion

• Benefits larger than R-KWS
• More intricate than R-KWS
• Contributions:
  – S-KWS semantics
  – CN generation
  – S-KWS query processing
    • Full Mesh
    • Partial Mesh
    • Optimizations


Future Work

• Advanced Demand Driven Operator Execution

• Graph Based S-KWS

• Parallel queries

• Top-k results

• Combination with R-KWS


Thank You for Listening

Questions?