ACM SIGMOD International Conference on Management of Data, Beijing, June 14th, 2007.
Keyword Search on Relational Data Streams
Alexander Markowetz, Yin Yang, Dimitris Papadias
Hong Kong University of Science and Technology
Replacing SQL With Keywords
Query: “Tarantino, Travolta”

Schema graph:
T1 Directors -- T2 Movies   (T1.did = T2.did)
T2 Movies    -- T3 Plays    (T2.mid = T3.mid)
T3 Plays     -- T4 Actors   (T3.aid = T4.aid)

T1 Directors
t-ID   did   name        other att.
t1,1   100   Tarantino   ...
t1,2   ...   ...         ...

T2 Movies
t-ID   mid   title          did   other att.
t2,1   2     Pulp Fiction   100   ...
t2,2   5     Some Trash     139   ...
t2,3   ...   ...            ...   ...

T3 Plays
t-ID   mid   aid   other att.
t3,1   2     120   ...
t3,2   5     120   ...
t3,3   5     145   ...
t3,4   ...   ...   ...

T4 Actors
t-ID   aid   name        other att.
t4,1   120   Travolta    ...
t4,2   145   Tarantino   ...
t4,3   ...   ...         ...
SQL queries & Operator Trees
Select *
From actors A1, plays P1, movies M, plays P2, actors A2
Where A1.name = 'Tarantino' and A2.name = 'Travolta' and A1.aid = P1.aid and P1.mid = M.mid and M.mid = P2.mid and P2.aid = A2.aid
Operator tree: σ name = 'Tarantino' (Actors) ⋈ Plays ⋈ Movies ⋈ Plays ⋈ σ name = 'Travolta' (Actors)

Select *
From directors D, movies M, plays P, actors A
Where D.name = 'Tarantino' and A.name = 'Travolta' and D.did = M.did and M.mid = P.mid and P.aid = A.aid
Operator tree: σ name = 'Tarantino' (Directors) ⋈ Movies ⋈ Plays ⋈ σ name = 'Travolta' (Actors)
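The two queries can be tried on the example rows with a minimal sketch like the following. It uses Python's built-in sqlite3 module. The column types and the variable names are assumptions, and the rows are copied from the example tables as far as they can be read there.

```python
import sqlite3

# In-memory copy of the example tables (types are an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE directors (did INTEGER, name TEXT);
    CREATE TABLE movies    (mid INTEGER, title TEXT, did INTEGER);
    CREATE TABLE plays     (mid INTEGER, aid INTEGER);
    CREATE TABLE actors    (aid INTEGER, name TEXT);
    INSERT INTO directors VALUES (100, 'Tarantino');
    INSERT INTO movies    VALUES (2, 'Pulp Fiction', 100), (5, 'Some Trash', 139);
    INSERT INTO plays     VALUES (2, 120), (5, 120), (5, 145);
    INSERT INTO actors    VALUES (120, 'Travolta'), (145, 'Tarantino');
""")

# First query: both keywords match actors who play in the same movie.
ACTOR_ACTOR = """
    SELECT M.title
    FROM actors A1, plays P1, movies M, plays P2, actors A2
    WHERE A1.name = 'Tarantino' AND A2.name = 'Travolta'
      AND A1.aid = P1.aid AND P1.mid = M.mid
      AND M.mid = P2.mid AND P2.aid = A2.aid
"""

# Second query: one keyword matches a director, the other an actor.
DIRECTOR_ACTOR = """
    SELECT M.title
    FROM directors D, movies M, plays P, actors A
    WHERE D.name = 'Tarantino' AND A.name = 'Travolta'
      AND D.did = M.did AND M.mid = P.mid AND P.aid = A.aid
"""

acted_together = [title for (title,) in conn.execute(ACTOR_ACTOR)]
directed = [title for (title,) in conn.execute(DIRECTOR_ACTOR)]
print(acted_together, directed)
```

On these rows, each query yields a different movie for the same keyword pair, which is why a keyword query has to consider several join paths.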
Data Graph G
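[Figure: data graph G over relations S, T, U and V, with tuple nodes s1, s2, t1, t2, u1 to u3 and v1 to v4; some nodes are annotated with the keywords k1, k2, k3 they contain]
• Nodes = Tuples
• Edges = “can be joined”
• Supports query processing
  – [Bhalotia et al., ICDE, 2002]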
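As a rough illustration of "nodes = tuples, edges = can be joined", the sketch below builds a data graph from the movie example tuples. The dictionary layout and the function name build_data_graph are assumptions made for this example.

```python
from itertools import combinations

# Example tuples: tuple id -> (relation, attribute values).
TUPLES = {
    "t1,1": ("Directors", {"did": 100, "name": "Tarantino"}),
    "t2,1": ("Movies", {"mid": 2, "did": 100, "title": "Pulp Fiction"}),
    "t2,2": ("Movies", {"mid": 5, "did": 139, "title": "Some Trash"}),
    "t3,1": ("Plays", {"mid": 2, "aid": 120}),
    "t3,2": ("Plays", {"mid": 5, "aid": 120}),
    "t3,3": ("Plays", {"mid": 5, "aid": 145}),
    "t4,1": ("Actors", {"aid": 120, "name": "Travolta"}),
    "t4,2": ("Actors", {"aid": 145, "name": "Tarantino"}),
}

# Schema graph: pair of relations -> attribute they join on.
SCHEMA_EDGES = {
    ("Directors", "Movies"): "did",
    ("Movies", "Plays"): "mid",
    ("Plays", "Actors"): "aid",
}

def build_data_graph(tuples, schema_edges):
    """Return the set of undirected edges between tuples that can be joined."""
    edges = set()
    for (id_a, (rel_a, a)), (id_b, (rel_b, b)) in combinations(tuples.items(), 2):
        attr = schema_edges.get((rel_a, rel_b)) or schema_edges.get((rel_b, rel_a))
        if attr is not None and a[attr] == b[attr]:
            edges.add(frozenset((id_a, id_b)))
    return edges

edges = build_data_graph(TUPLES, SCHEMA_EDGES)
```

The pairwise comparison is quadratic and only meant to make the definition concrete; a real system would use indexes on the join attributes.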
MTJNT
• Sub-graph of G
  – Contains all keywords
  – Minimal
• Answers an R-KWS query
• Limited to Tmax nodes
  – Longer joins = irrelevant results
• [Hristidis et al., VLDB 2002]
[Figure: data graph G with keyword annotations, shown twice]
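The three conditions (contains all keywords, minimal, at most Tmax nodes) can be checked with a sketch like the one below. The small graph and its keyword assignment are illustrative assumptions, not the graph from the figure, and "sub-graph" is read here as a connected set of tuples.

```python
def connected(nodes, edges):
    """True if the nodes form one connected component using only the given edges."""
    nodes = set(nodes)
    if not nodes:
        return False
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(m for m in nodes if frozenset((n, m)) in edges)
    return seen == nodes

def total(nodes, keywords_of, query):
    """True if the nodes together contain every query keyword."""
    covered = set()
    for n in nodes:
        covered |= keywords_of.get(n, set())
    return set(query) <= covered

def is_mtjnt(nodes, edges, keywords_of, query, t_max):
    nodes = set(nodes)
    if len(nodes) > t_max:
        return False
    if not connected(nodes, edges) or not total(nodes, keywords_of, query):
        return False
    # Minimal: dropping any single tuple must break connectivity or coverage.
    return not any(
        connected(nodes - {n}, edges) and total(nodes - {n}, keywords_of, query)
        for n in nodes
    )

# Hypothetical fragment of a data graph.
EDGES = {frozenset(p) for p in [("v1", "t1"), ("t1", "u2"), ("t1", "s2")]}
KEYWORDS = {"v1": {"k1", "k2"}, "u2": {"k3"}}
QUERY = {"k1", "k2", "k3"}
```

With this fragment, {v1, t1, u2} passes the check for Tmax = 3, while adding s2 fails minimality because s2 contributes no keyword.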
Candidate Networks (CN)
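• Abstractions of MTJNT
[Figure: example MTJNT and the candidate networks they instantiate]
  v1 - t1 - u2 and v4 - t2 - u3 (keywords k1, k2 and k3)  ->  V{k1, k2} T{} U{k3}
  v1 - t1 - s2 - t2 - u3 (keywords k1, k2 and k3)  ->  V{k1, k2} T{} S{} T{} U{k3}
  u1 - t1 - s2 - t2 - u3 with v2 attached (keywords k1, k3 and k2)  ->  U{k1} T{} S{} T{} U{k3} with V{k2} attached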
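A minimal sketch of the abstraction step: each tuple of an MTJNT is replaced by its relation name and the set of query keywords it contains. The helper name to_candidate_network and the convention that tuple "v1" belongs to relation V are assumptions taken from the naming in the figures.

```python
def to_candidate_network(tuples, keywords_of):
    """Map a sequence of tuple ids to relation{keywords} labels."""
    cn = []
    for t in tuples:
        relation = t[0].upper()  # assumption: "v1" is a tuple of relation V
        kws = ", ".join(sorted(keywords_of.get(t, ())))
        cn.append(f"{relation}{{{kws}}}")
    return cn

# Keyword assignment as read from the examples (assumed where the figure is unclear).
KEYWORDS = {"v1": {"k1", "k2"}, "u2": {"k3"}, "v4": {"k1", "k2"}, "u3": {"k3"}}

cn_a = to_candidate_network(["v1", "t1", "u2"], KEYWORDS)
cn_b = to_candidate_network(["v4", "t2", "u3"], KEYWORDS)
```

Both tuple paths map to the same labels, which illustrates how one candidate network stands for many MTJNT.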
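Operator Trees for Candidate Networks
[Figure: the MTJNT u1 - t1 - s2 - t2 - u3 with v2 attached (keywords k1, k3, k2); its CN U{k1} T{} S{} T{} U{k3} with V{k2}; and the OP-Tree over U{k1}, T{}, S{}, T{}, V{k2}, U{k3}, whose output is the MTJNT]
• Leaves = Selections
• Inner nodes = Joins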
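One way to picture "leaves = selections, inner nodes = joins" is the sketch below, which turns a CN into a left-deep join tree. The left-deep shape, the class names, and the requirement that the CN is listed in an order where each prefix is connected are assumptions of this example, not something the slide specifies.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Selection:
    relation: str
    keywords: frozenset   # keywords the selected tuples must contain

@dataclass(frozen=True)
class Join:
    left: "Node"
    right: Selection

Node = Union[Selection, Join]

def left_deep_tree(cn):
    """cn: list of (relation, keyword set); returns the root of a left-deep tree."""
    nodes = [Selection(rel, frozenset(kws)) for rel, kws in cn]
    tree = nodes[0]
    for leaf in nodes[1:]:
        tree = Join(tree, leaf)
    return tree

def leaves(node):
    if isinstance(node, Selection):
        return [node]
    return leaves(node.left) + leaves(node.right)

def count_joins(node):
    if isinstance(node, Selection):
        return 0
    return 1 + count_joins(node.left) + count_joins(node.right)

CN = [("U", {"k1"}), ("T", set()), ("S", set()), ("T", set()), ("V", {"k2"}), ("U", {"k3"})]
tree = left_deep_tree(CN)
```

A CN with n relation occurrences yields n selection leaves and n - 1 joins.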
Challenges…
…of S-KWS, with regard to R-KWS:
• Stream Semantics
• Efficient CN-Generation
• Optimized Operator Execution
• Stream Specific Issues
Instantaneous Data Graph
[Figure: the data graph G with a lifespan interval attached to each element, e.g. [1, 10), [2, 11), [3, 12), [4, 13), [5, 14), [6, 15), [7, 16), [9, 18) and [5, 10)]
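The half-open lifespans can be read as follows: a tuple belongs to the instantaneous data graph at time t if start <= t < end. The sketch below also intersects lifespans for a joined result, which is a common stream-join convention and an assumption here. The assignment of intervals to tuples is hypothetical, since the figure's mapping is not recoverable.

```python
def alive(lifespan, t):
    """Half-open interval [start, end): valid from start up to, but excluding, end."""
    start, end = lifespan
    return start <= t < end

def instantaneous_nodes(lifespans, t):
    """Tuples that are part of the data graph at time t."""
    return {n for n, span in lifespans.items() if alive(span, t)}

def joint_lifespan(spans):
    """Interval during which all given tuples are valid together, or None."""
    start = max(s for s, _ in spans)
    end = min(e for _, e in spans)
    return (start, end) if start < end else None

# Hypothetical assignment of intervals to tuples.
LIFESPANS = {"v1": (1, 10), "t1": (2, 11), "u2": (5, 14)}
```

Under these assumptions, a result built from all three tuples would be valid during [5, 10).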
Architecture
Query q -> [Generate] -> Candidate Networks -> [Translate] -> Operator Trees -> [Execute] -> Results
CN-Generation
• Target: Avoid duplicate CN
• Idea: Model CN as unique tree
• Pre-order generation
[Figure: the CN U{k4} T{} S{k1} T{k1} U{k3} with V{k2}, redrawn as a tree with nodes U{k4}, T{}, S{k1}, T{k1}, U{k3} and V{k2}]
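To show why a unique tree representation removes duplicates, the sketch below computes a canonical string for a labelled tree, so that two structurally identical CN compare equal. This is a generic canonical encoding that detects duplicates after the fact; it is not the pre-order generation scheme named on the slide. The place where V{k2} attaches is also an assumption.

```python
def encode(tree, labels, root, parent=None):
    """Pre-order encoding of the tree rooted at `root`, children in sorted order."""
    children = sorted(encode(tree, labels, c, root) for c in tree[root] if c != parent)
    return labels[root] + "(" + "".join(children) + ")"

def canonical(tree, labels):
    """Smallest encoding over all possible roots; fine for trees of at most Tmax nodes."""
    return min(encode(tree, labels, r) for r in tree)

# The same CN written down twice, with different node ids and neighbor order.
TREE_A = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3]}
LABELS_A = {0: "U{k4}", 1: "T{}", 2: "S{k1}", 3: "T{k1}", 4: "U{k3}", 5: "V{k2}"}

TREE_B = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a", "e"], "e": ["d", "f"], "f": ["e"]}
LABELS_B = {"a": "T{k1}", "b": "V{k2}", "c": "U{k3}", "d": "S{k1}", "e": "T{}", "f": "U{k4}"}

# A different CN: same shape, but V carries another keyword.
LABELS_C = dict(LABELS_A)
LABELS_C[5] = "V{k3}"
```

Generating each CN directly in one canonical order avoids producing the duplicates in the first place, which is cheaper than filtering them afterwards.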
Operator Mesh
[Figure: operator mesh. Selection buffers S{k1}, T{}, T{k1}, T{k2, k3}, U{k3}, U{k2, k3}, V{k2} and V{k2, k3} feed the join operators j1 to j7 through their Buffer.out outputs. Labelled intermediate and final results are S{k1}, T{}, U{k2, k3}; S{k1}, T{}, V{k2, k3}; S{k1}, T{k2, k3}; and S{k1}, T{}, V{k2}, T{k1}, U{k3}. All results flow into an output operator.]
Demand Driven Operator Execution (I)
[Figure: state diagram of an operator with numbered transitions 1 to 10. States: !d / !r (sleeping), !d / r (sleeping), d / !r (sleeping), d / r (running). Transition labels: PStart() rParents++; PStop() rParents--; rParents == 0; RightSide = Ø; Tuple arrives at right input]
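The state diagram can be approximated with the sketch below. It assumes that d means "at least one parent demands output" (rParents > 0) and that r means "the right input is non-empty"; the slide does not spell out either definition. The class and method names are made up for illustration.

```python
class DemandDrivenOperator:
    def __init__(self):
        self.r_parents = 0     # parents that currently demand output (assumed meaning)
        self.right_side = []   # tuples buffered at the right input

    def p_start(self):
        """A parent starts demanding output."""
        self.r_parents += 1

    def p_stop(self):
        """A parent stops demanding output."""
        if self.r_parents == 0:
            raise RuntimeError("p_stop() without a matching p_start()")
        self.r_parents -= 1

    def right_arrives(self, tup):
        self.right_side.append(tup)

    def right_cleared(self):
        self.right_side.clear()

    @property
    def state(self):
        d = self.r_parents > 0
        r = bool(self.right_side)
        label = ("d" if d else "!d") + " / " + ("r" if r else "!r")
        # The operator only runs when there is both demand and right-side input.
        return label, ("running" if d and r else "sleeping")

op = DemandDrivenOperator()
trace = [op.state]
op.right_arrives("t1")
trace.append(op.state)
op.p_start()
trace.append(op.state)
op.p_stop()
trace.append(op.state)
```

In this reading, three of the four states are sleeping, so an operator does no join work unless it has both demand and input.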
Demand Driven Operator Execution (II)
[Figure: the operator mesh from the Operator Mesh slide, shown again to illustrate demand driven execution]
Dynamic Mesh Expansion
• Create operators when input is available
• Destroy, if no input
[Figure: a mesh over S{k1}, T{}, V{k2}, U{k2, k3}, V{k2, k3}, T{k2, k3} and Output, shown in two states]
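A toy sketch of the two rules: an operator is created when its first input tuple arrives and destroyed when its last one is removed. The class PartialMesh and its methods are hypothetical names, and real operators would hold join state rather than a plain list.

```python
class PartialMesh:
    def __init__(self):
        self.buffers = {}   # operator name -> buffered input tuples

    def insert(self, op, tup):
        # Create the operator lazily on its first input.
        self.buffers.setdefault(op, []).append(tup)

    def expire(self, op, tup):
        buf = self.buffers.get(op)
        if buf and tup in buf:
            buf.remove(tup)
            if not buf:
                # No input left: destroy the operator.
                del self.buffers[op]

mesh = PartialMesh()
mesh.insert("T{k2, k3}", "t7")
created = "T{k2, k3}" in mesh.buffers
mesh.expire("T{k2, k3}", "t7")
destroyed = "T{k2, k3}" not in mesh.buffers
```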
Purging Expired Tuples from Buffers
• Positive / Negative
  – Negative tuples remove positives
• Sliding windows
  1. Bottom-Up
     • Similar to Pos./Neg.
  2. Lazy
     • Expired tuples remain in buffers
     • Purged during join execution
[Figure: example join chain U{k1} T{} S{}]
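The lazy variant can be sketched as a buffer that keeps expired tuples until the next join probe and drops them while scanning. The class LazyBuffer, its method names, and the exclusive expiry time are assumptions for illustration.

```python
class LazyBuffer:
    def __init__(self):
        self.items = []   # (payload, expiry time); a tuple is expired once now >= expiry

    def insert(self, payload, expiry):
        self.items.append((payload, expiry))

    def probe(self, now, predicate):
        """Return live payloads matching the join predicate, purging expired ones."""
        live = [(p, e) for p, e in self.items if e > now]
        self.items = live   # the purge happens as a side effect of the probe
        return [p for p, _ in live if predicate(p)]

buf = LazyBuffer()
buf.insert("a", 10)
buf.insert("b", 5)
size_before = len(buf.items)    # the expired-to-be tuple "b" still occupies the buffer
matches = buf.probe(now=7, predicate=lambda p: True)
size_after = len(buf.items)
```

The trade-off is extra memory between probes in exchange for not propagating expirations through the mesh.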
Mesh Migration
• Schema changes at runtime
• Stream removal
  – Trivial
• Stream arrival
  – Generate new operator mesh, on top of old
  – Preserve intermediate results
  – Observe node order
Window Size
[Figure: CPU consumption (CPU in sec., log scale from 1 to 10^4) and peak memory (bytes, log scale from 10^6 to 10^9) versus window size w in min., for PM, FM and Forest]
w in min.:  5   10   20    40     80
|R|:        8   33   236   1582   11213
• PM excels in memory, FM in speed
• Forest always ranks last
  – Omitted from remaining slides
Keyword Frequency
[Figure: CPU consumption (CPU in sec., 10 to 60) and peak memory (bytes, 4*10^6 to 8*10^6) versus keyword frequency KWF, for PM and FM]
KWF:  0.003   0.007   0.01   0.013   0.016
|R|:  4       95      236    596     1089
• No impact on number of tuples or joinability
• More MTJNT
  – Require CPU and memory for production and storage
• FM loses memory advantage
  – Tuples travel higher in mesh
Number of Streaming Relations
[Figure: CPU consumption (CPU in sec., 0 to 70) and peak memory (bytes, log scale from 10^6 to 10^8) versus number of streaming relations |SR|, for PM and FM]
|SR|:  5       10      15      20      25
|R|:   24      137     236     331     374
I:     0.421   1.406   2.422   3.406   4.516
• Linear growth in mesh
  – No more than 4 neighbors in schema
• Relative performance as expected
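The linear-growth claim can be checked against the I row with a few lines of arithmetic. This assumes that I reports the initialization time of the mesh; the slide does not state its unit.

```python
# Values copied from the |SR| and I rows of this slide.
sr = [5, 10, 15, 20, 25]
init = [0.421, 1.406, 2.422, 3.406, 4.516]

# Increase of I per step of five additional streaming relations.
steps = [round(b - a, 3) for a, b in zip(init, init[1:])]
print(steps)
```

Each step adds roughly 1.0 to I, which is consistent with a mesh that grows linearly in |SR|.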
Join Selectivity
[Figure: CPU consumption (CPU in sec., 20 to 80) and peak memory (bytes, 4*10^6 to 10^7) versus join selectivity sel, for PM and FM-D]
sel:  1/1500   1/1250   1/1000   1/750   1/500
|R|:  72       116      236      585     1758
• Quadratic increase in number of graph edges
  – Increasing intermediate results and MTJNT
  – Increasing CPU and memory consumption
• More tuples travel up the mesh:
  – PM (re-)generates more operators; hence requires more CPU
Number of Keywords
[Figure: CPU consumption (CPU in sec., log scale from 1 to 10^4) and peak memory (bytes, log scale from 10^6 to 10^10) versus number of keywords |K|, for PM and FM]
|K|:  2       3       4        5
|R|:  2492    236     6        1
I:    0.078   2.422   61.968   1566
• Exponential growth in the complexity of the operator mesh
  – Reflected in initialization time
• Most operators in large meshes commonly idle
  – Increased savings through PM
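The exponential trend in initialization time can be made concrete by dividing consecutive values of the I row. This assumes that I is the initialization time mentioned in the bullet; the unit is not given on the slide.

```python
# Values copied from the |K| and I rows of this slide.
k = [2, 3, 4, 5]
init = [0.078, 2.422, 61.968, 1566]

# Growth factor of I for each additional keyword.
ratios = [b / a for a, b in zip(init, init[1:])]
print([round(r, 1) for r in ratios])
```

Each additional keyword multiplies I by a factor of roughly 25 to 31, which is the signature of exponential rather than polynomial growth.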
Maximal size of results Tmax
[Figure: CPU consumption (CPU in sec., log scale from 1 to 10^4) and peak memory (bytes, log scale from 10^6 to 10^8) versus maximal result size Tmax, for PM and FM]
Tmax:  2       3       4       5        6
|R|:   5       41      236     1412     7161
I:     0.062   0.359   2.422   18.078   140.515
• Similar impact to |K|
  – Exponential growth of mesh
  – Reflected by initialization phase
• Exponential growth of results
Demand Driven Operator Execution
[Figure: CPU consumption (CPU in sec., log scale from 10 to 10^3) and peak memory (bytes, log scale from 10^6 to 10^8) versus window size w in min. (5, 10, 20, 40, 80), for FM-!D-L and FM-D-L]
• Reduced CPU
  – Avoided unnecessary joins
• Reduced memory consumption
  – Avoided storage of many intermediate results
Conclusion
• Benefits larger than R-KWS
• More intricate than R-KWS
• Contributions:
  – S-KWS semantics
  – CN generation
  – S-KWS query processing
    • Full Mesh
    • Partial Mesh
    • Optimizations
Future Work
• Advanced Demand Driven Operator Execution
• Graph Based S-KWS
• Parallel queries
• Top-k results
• Combination with R-KWS
Thank You for Listening
Questions?