2006/3/21 1
Multiple Aggregations over Data Stream
Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava
SIGMOD 2005
Outline
• Introduction to Giga-Scope DSMS
• Multiple Aggregations Problem
• The proposed approach
- choice of phantoms
- space allocation problem
• Conclusion
Giga-Scope
• A DSMS designed to monitor high-speed IP traffic data.
• LFTA (on the network interface card): simple low-level queries over the high-speed data stream, which serve to reduce data volumes.
• HFTA (in main memory): processes the low-speed data stream fed by the LFTA.
Single Aggregation in Giga-Scope
Select A, count(*) From R Group by A;
[Figure: records of R probe an LFTA hash table with buckets 0 to 9 that holds (group, count) entries; entries leaving the LFTA are passed to the HFTA.]
Cost of Processing a Single Aggregation
• probe (c1) : The cost of looking up the hash table in LFTAs and possible update in case of a collision
• eviction (c2) : The cost of transferring an entry from LFTAs to HFTAs
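The two cost components can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: one (group, count) entry per bucket, a toy modulo hash, and a made-up sample stream; none of these details are taken from the Giga-Scope implementation.

```python
def run_epoch(groups, num_buckets):
    """Count probes (c1) and evictions (c2) for one group-by query in one epoch."""
    table = [None] * num_buckets      # LFTA hash table: one (group, count) per bucket
    hfta = {}                         # HFTA side: group -> merged count
    probes = evictions = 0

    def evict(entry):
        nonlocal evictions
        group, count = entry
        hfta[group] = hfta.get(group, 0) + count
        evictions += 1                # one c2 per entry sent to the HFTA

    for group in groups:
        probes += 1                   # one c1 per incoming record
        slot = group % num_buckets    # toy hash function (assumption)
        entry = table[slot]
        if entry is None:
            table[slot] = (group, 1)
        elif entry[0] == group:
            table[slot] = (group, entry[1] + 1)
        else:                         # collision: the old entry is evicted
            evict(entry)
            table[slot] = (group, 1)

    for entry in table:               # end of epoch: flush what is left
        if entry is not None:
            evict(entry)
    return probes, evictions, hfta


# Hypothetical sample stream of A values.
probes, evictions, totals = run_epoch([2, 24, 2, 3, 17, 4], num_buckets=10)
print(f"cost = {probes}*c1 + {evictions}*c2, HFTA totals = {totals}")
```

With this sample, 24 and 4 share bucket 4, so one eviction happens during the epoch and the remaining four entries are evicted at its end.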
Processing Multiple Aggregations Naively
Select A, count(*) From R Group by A;
Select B, count(*) From R Group by B;
Select C, count(*) From R Group by C;
R(A, B, C): (2, 3, 4) (24, 4, 3) (2, 3, 4) (2, 3, 4) (4, 2, 3)
[Figure: every record of R probes three LFTA hash tables (Hash Table A, Hash Table B, Hash Table C); their (group, count) entries are evicted to the HFTA.]
Cost at the end of the epoch: 15c1 + 1c2 + 7c2
Processing Multiple Aggregations by Maintaining Phantoms
R(A, B, C): (2, 3, 4) (24, 4, 3) (2, 3, 4) (2, 3, 4) (4, 2, 3)
[Figure: records of R probe only the phantom Hash Table ABC in the LFTA; entries evicted from it update Hash Table A, Hash Table B and Hash Table C, whose (group, count) entries are evicted to the HFTA.]
Cost at the end of the epoch: 14c1 + 8c2
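The probe and eviction counts of the naive and the phantom setting can be reproduced with a small simulation over the five records of R. This is a sketch under assumptions that are not in the slides: 16 buckets per table, one entry per bucket, and a toy polynomial hash chosen so that no two groups collide. The class and function names are hypothetical.

```python
class HashTable:
    """LFTA hash table with one (key, count) entry per bucket."""

    def __init__(self, attrs, num_buckets, children=()):
        self.attrs = attrs                # attribute names forming the group key
        self.children = list(children)    # tables fed by this one (empty for a query)
        self.slots = [None] * num_buckets

    def _evict(self, slot, cost):
        key, count = self.slots[slot]
        self.slots[slot] = None
        if self.children:                 # phantom: evicted entry probes each child
            partial = dict(zip(self.attrs, key))
            for child in self.children:
                child.probe(partial, count, cost)
        else:                             # query table: entry goes to the HFTA
            cost["c2"] += 1

    def probe(self, record, count, cost):
        cost["c1"] += 1
        key = tuple(record[a] for a in self.attrs)
        # Toy hash (assumption): polynomial in the key values, modulo table size.
        slot = sum(v * 31 ** i for i, v in enumerate(key)) % len(self.slots)
        entry = self.slots[slot]
        if entry is not None and entry[0] != key:
            self._evict(slot, cost)       # collision
            entry = None
        self.slots[slot] = (key, (entry[1] if entry else 0) + count)

    def flush(self, cost):
        for slot, entry in enumerate(self.slots):
            if entry is not None:
                self._evict(slot, cost)


def run(stream, roots, tables_parent_first):
    cost = {"c1": 0, "c2": 0}
    for record in stream:
        for table in roots:
            table.probe(record, 1, cost)
    for table in tables_parent_first:     # end of epoch: flush phantoms before queries
        table.flush(cost)
    return cost


rows = [(2, 3, 4), (24, 4, 3), (2, 3, 4), (2, 3, 4), (4, 2, 3)]
stream = [dict(zip("ABC", r)) for r in rows]

naive = [HashTable((a,), 16) for a in "ABC"]
naive_cost = run(stream, naive, naive)

shared = [HashTable((a,), 16) for a in "ABC"]
phantom = HashTable(("A", "B", "C"), 16, children=shared)
phantom_cost = run(stream, [phantom], [phantom] + shared)

print(naive_cost, phantom_cost)
```

In the phantom run each record costs one probe (5 in total), and the three distinct ABC entries each cost three probes when they are flushed into A, B and C (9 in total), which gives the 14 probes; the 8 evictions to the HFTA are the same in both runs.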
The problem
• Consider a set of aggregation queries over a data stream that differ only in their grouping attributes. Determine an optimal sharing configuration for the queries under limited memory:
- choice of phantoms
- space allocation
[Figure: relation lattice with the given queries Q1 to Q4 (AB, BC, BD, CD) at the bottom and the phantoms ABC, ABD, BCD and ABCD above them.]
Idea: maintaining phantoms
• x1: the collision rate without phantoms
• x1': the collision rate with phantoms
• x2: the collision rate of phantom ABC
• The total cost:
– Without the phantom: E1 = 3nc1 + 3x1nc2
– With the phantom: E2 = nc1 + 3x2nc1 + 3x1'x2nc2
Example
[Figure: without the phantom, hash tables A, B and C each get M/3 of the memory and have collision rate x1; with the phantom, hash tables A, B, C and ABC each get M/4, with collision rate x1' for A, B, C and x2 for ABC. Probes cost C1, evictions to the HFTA cost C2.]
• To be fair, the total space used for the hash tables should be the same with or without the phantoms
E1 = 3c1 + 3x1c2
E2 = c1 + 3x2c1 + 3x1'x2c2
E1 - E2 = (2 - 3x2)c1 + 3(x1 - x1'x2)c2
• When x2 → 0, the phantom reduces the cost.
E1 - E2 = F(x1, x2, x1')
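The cost expressions can be evaluated directly to see when the phantom pays off. The sketch below transcribes E1 and E2 for the three-query example; the numeric values of c1, c2 and the collision rates are hypothetical and chosen only for illustration.

```python
def cost_without_phantom(n, c1, c2, x1):
    """E1: every record probes 3 tables; a fraction x1 of probes causes an eviction."""
    return 3 * n * c1 + 3 * x1 * n * c2


def cost_with_phantom(n, c1, c2, x1_prime, x2):
    """E2: every record probes ABC; a fraction x2 is evicted and probes A, B and C."""
    return n * c1 + 3 * x2 * n * c1 + 3 * x1_prime * x2 * n * c2


def benefit(n, c1, c2, x1, x1_prime, x2):
    """E1 - E2 = n * ((2 - 3*x2)*c1 + 3*(x1 - x1'*x2)*c2)."""
    return cost_without_phantom(n, c1, c2, x1) - cost_with_phantom(n, c1, c2, x1_prime, x2)


# Hypothetical values: evictions are far more expensive than probes.
e1 = cost_without_phantom(n=1, c1=1.0, c2=50.0, x1=0.2)
e2 = cost_with_phantom(n=1, c1=1.0, c2=50.0, x1_prime=0.3, x2=0.25)
gain = benefit(n=1, c1=1.0, c2=50.0, x1=0.2, x1_prime=0.3, x2=0.25)
print(e1, e2, gain)
```

Here the phantom helps even though x1' is larger than x1, because only a fraction x2 of the records ever reaches the query tables.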
The collision rate estimation
Key point
• g: number of groups of a relation
• b: number of buckets in the hash table
• The probability of k groups out of g being hashed to a bucket
• Bk: the number of buckets having k groups
• nrg: the expected number of records for each group
• (1-1/k): the collision rate in a bucket in which collisions happen
[Figure: example with g=3000, b=1000]
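One way to turn these quantities into a number is sketched below. It assumes a uniformly random hash function, so that the number of groups in a bucket is binomial with g trials and probability 1/b, and it assumes every group has the same nrg records, so that nrg cancels out. The closed form is obtained by summing the series and serves as a cross-check; both function names are hypothetical.

```python
import math


def expected_buckets_with_k_groups(g, b, k):
    """B_k = b * C(g, k) * (1/b)^k * (1 - 1/b)^(g - k), computed in log space (b > 1)."""
    log_pmf = (math.lgamma(g + 1) - math.lgamma(k + 1) - math.lgamma(g - k + 1)
               - k * math.log(b) + (g - k) * math.log1p(-1.0 / b))
    return b * math.exp(log_pmf)


def collision_rate(g, b):
    """A bucket with k groups sees k*nrg records, a fraction (1 - 1/k) of which collide.

    Dividing the expected collisions by the g*nrg records gives sum_k B_k * (k - 1) / g.
    """
    collisions = sum(expected_buckets_with_k_groups(g, b, k) * (k - 1)
                     for k in range(2, g + 1))
    return collisions / g


def collision_rate_closed_form(g, b):
    """Same quantity after summing the series: 1 - (b/g) * (1 - (1 - 1/b)^g)."""
    return 1.0 - (b / g) * (1.0 - (1.0 - 1.0 / b) ** g)


rate = collision_rate(3000, 1000)
check = collision_rate_closed_form(3000, 1000)
print(round(rate, 3), round(check, 3))
```

For g=3000 and b=1000 both functions give a collision rate of roughly 0.68 under these assumptions.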
Algorithmic strategies for choosing the phantoms
• Benefit = the difference between the maintenance costs without and with the phantom.
Greedy by Increasing Collision Rate
• The configuration I initially includes only the queries
• We calculate the maintenance cost if a phantom R is added to I
• By comparing it with the maintenance cost when R is not in I, we get the benefit of R
• After we add this phantom to I, we iterate over the other phantoms
• As more phantoms are added to I, the overall collision rate goes up and the benefit decreases
• Stop when the benefit becomes negative.
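The loop described by these bullets can be sketched as follows. The sketch assumes that each round adds the candidate phantom with the largest benefit, and it takes the maintenance cost as a caller-supplied function; the toy cost used in the example (fixed savings per phantom plus a penalty that grows with the number of phantoms, standing in for the rising collision rate) is entirely hypothetical.

```python
def greedy_phantoms(queries, candidates, maintenance_cost):
    """Add phantoms one at a time while doing so still lowers the maintenance cost."""
    config = list(queries)                # configuration I starts with the queries only
    remaining = list(candidates)
    current = maintenance_cost(config)
    while remaining:
        # Candidate whose addition gives the lowest cost, i.e. the largest benefit.
        best = min(remaining, key=lambda r: maintenance_cost(config + [r]))
        new_cost = maintenance_cost(config + [best])
        if current - new_cost <= 0:       # no positive benefit left: stop
            break
        config.append(best)
        remaining.remove(best)
        current = new_cost
    return config


# Hypothetical toy cost model, for illustration only.
SAVINGS = {"ABC": 30, "ABD": 12, "BCD": 5, "ABCD": 20}


def toy_cost(config):
    phantoms = [r for r in config if r in SAVINGS]
    return 100 - sum(SAVINGS[r] for r in phantoms) + 4 * len(phantoms) ** 2


chosen = greedy_phantoms(["AB", "BC", "BD", "CD"], list(SAVINGS), toy_cost)
print(chosen)
```

With the toy numbers, ABC and then ABCD are added; a third phantom would raise the cost, so the loop stops.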
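Algorithmic strategies for choosing the phantoms
Greedy by Increasing Collision Rate
[Figure: relation lattice with the given queries Q1 to Q4 (AB, BC, BD, CD) at the bottom and the phantoms ABC, ABD, BCD and ABCD above them. Nodes are labelled with their number of groups: g=2837 (ABCD), g=1846 (AB), and g=2117, g=2387, g=2249, g=1946, g=1899, g=1999 on the remaining nodes.]
Available memory = 12000
• Queries only (Linear Proportional Allocation): Allocate AB = (1846/7690)*12000, Allocate BC…, Allocate BD…, Allocate CD…
• Try ABCD (Linear Proportional Allocation): Allocate ABCD = (2837/10527)*12000, Allocate AB = (1846/10527)*12000, Allocate BC…, Allocate BD…, Allocate CD…
• bABCD → xABCD → Benefit, with E1-E2 = F(x1, x2, x1')
• The process ends when the benefit becomes negative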
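The allocation arithmetic on this slide can be checked with a few lines. In the sketch below only AB=1846, ABCD=2837 and the totals 7690 and 10527 come from the slide; which of the values 1946, 1899 and 1999 belongs to BC, BD or CD is an assumption made for illustration.

```python
def linear_proportional_allocation(groups, memory):
    """Give each relation a share of memory proportional to its number of groups."""
    total = sum(groups.values())
    return {rel: memory * g / total for rel, g in groups.items()}


# AB is from the slide; the assignment of the other three values is assumed.
queries = {"AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999}

queries_only = linear_proportional_allocation(queries, 12000)
with_abcd = linear_proportional_allocation({**queries, "ABCD": 2837}, 12000)

print(round(queries_only["AB"], 1))   # (1846/7690)*12000
print(round(with_abcd["ABCD"], 1))    # (2837/10527)*12000
print(round(with_abcd["AB"], 1))      # (1846/10527)*12000
```

Adding ABCD shrinks every query table, which raises the query collision rates; this is the effect that the benefit computation has to weigh against the savings from the phantom.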
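Space Allocation
Optimal solution for the two-level graph
[Figure: phantom AB (collision rate x0) feeds queries A (collision rate x1) and B (collision rate x2).]
$$e = c_1 + 2x_0c_1 + x_0x_1c_2 + x_0x_2c_2$$
With $x_i = \mu g_i / b_i$ and $b_0 = M - b_1 - b_2$:
$$e = c_1 + \frac{\mu g_0}{M - b_1 - b_2}\left(2c_1 + \mu c_2\left(\frac{g_1}{b_1} + \frac{g_2}{b_2}\right)\right)$$
• Set the partial derivatives of e with respect to b1 and b2 to 0.
• When $\frac{g_1}{b_1^2} = \frac{g_2}{b_2^2}$, e has minimum cost.
• Therefore, the space allocated to each query is proportional to the square root of its number of groups.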
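The square-root rule can be checked numerically. The sketch below assumes a linear collision-rate model x = mu*g/b with a hypothetical constant mu, a per-record cost e = c1 + 2*x0*c1 + x0*x1*c2 + x0*x2*c2 for a phantom AB feeding A and B, and made-up values for M, the group counts and the costs. It searches all integer splits of the memory by brute force.

```python
import math


def cost(b1, b2, M, g0, g1, g2, c1=1.0, c2=50.0, mu=0.4):
    """Per-record cost when A gets b1 buckets, B gets b2 and phantom AB gets the rest."""
    b0 = M - b1 - b2
    x0, x1, x2 = mu * g0 / b0, mu * g1 / b1, mu * g2 / b2   # assumed linear model
    return c1 + 2 * x0 * c1 + x0 * x1 * c2 + x0 * x2 * c2


def best_split(M, g0, g1, g2):
    """Exhaustively search integer (b1, b2) with at least one bucket per table."""
    best = None
    for b1 in range(1, M - 1):
        for b2 in range(1, M - b1):
            e = cost(b1, b2, M, g0, g1, g2)
            if best is None or e < best[0]:
                best = (e, b1, b2)
    return best[1], best[2]


# Hypothetical sizes: A has four times as many groups as B.
b1, b2 = best_split(M=1000, g0=1000, g1=400, g2=100)
print(b1, b2, b1 / b2, math.sqrt(400 / 100))
```

With g1/g2 = 4, the best split gives A about twice as many buckets as B, not four times as many, which is what the condition g1/b1^2 = g2/b2^2 predicts.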
Algorithmic strategies for choosing the phantoms
• One way of allocating hash table space to a relation is in proportion to the number of groups in the table
• For a relation with g groups we can allocate space proportional to g, where the proportionality factor is a constant that we set large
Algorithmic strategies for choosing the phantoms
Greedy by Increasing Space
• We calculate the benefit of each phantom according to the cost model
• We calculate the benefit per unit space for each phantom R: its benefit divided by the space allocated to R, which is proportional to gR
• We choose the phantom with the largest benefit per unit space as the first phantom to instantiate
• The process ends when the benefit per unit space becomes negative
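The selection loop can be sketched as below. The benefit function is supplied by the caller, since it depends on the cost model; the toy benefit (a base value that shrinks as more phantoms are chosen), the constant PHI and the assignment of group counts to ABC, ABD and BCD are hypothetical and only make the example runnable.

```python
def greedy_by_space(candidates, benefit, space):
    """Repeatedly instantiate the phantom with the largest benefit per unit space."""
    chosen = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda r: benefit(r, chosen) / space(r))
        if benefit(best, chosen) / space(best) < 0:   # benefit per unit space negative
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical inputs: space is PHI * g, and benefits drop as phantoms are added.
PHI = 2
GROUPS = {"ABCD": 2837, "ABC": 2117, "ABD": 2387, "BCD": 2249}
BASE_BENEFIT = {"ABCD": 2.0, "ABC": 1.0, "ABD": 0.2, "BCD": -1.0}


def toy_benefit(phantom, chosen):
    return BASE_BENEFIT[phantom] - 0.5 * len(chosen)


picked = greedy_by_space(GROUPS, toy_benefit, lambda r: PHI * GROUPS[r])
print(picked)
```

Because each phantom's space is fixed in advance, an instantiated phantom never has to be resized when the next one is added; only the benefits of the remaining candidates change.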
Algorithmic strategies for choosing the phantoms
• Greedy by Increasing Space
[Figure: relation lattice with the given queries Q1 to Q4 (AB, BC, BD, CD) at the bottom and the phantoms ABC, ABD, BCD and ABCD above them. Nodes are labelled with their number of groups (g=2837, g=2117, g=1846, g=2387, g=2249, g=1946, g=1899, g=1999); phantoms are annotated with Benefit=2, Benefit=1 and Benefit=-1.]
E1-E2 = (2-3x2)c1 + 3(x1-x1'x2)c2
Benefit/Space as a metric
Available memory = 12000
12000 - 7690 = 4310
Try ABCD: 4310 - 2837 = 1473
The process ends when
1. The benefit becomes negative
2. The space is exhausted
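The memory bookkeeping in this example can be traced with a short sketch. It assumes one unit of space per group, as the subtraction on the slide suggests, and that the four values summing to 7690 are the query sizes; the second phantom in the list, and its size, are picked only to show the stopping rule.

```python
def fill_budget(memory, query_groups, phantoms):
    """Reserve space for the queries, then instantiate phantoms while they still fit."""
    left = memory - sum(query_groups)
    instantiated = []
    for name, g in phantoms:          # phantoms in decreasing benefit/space order
        if g > left:                  # the space is exhausted
            break
        left -= g
        instantiated.append(name)
    return instantiated, left


instantiated, left = fill_budget(
    memory=12000,
    query_groups=[1846, 1946, 1899, 1999],      # sums to 7690
    phantoms=[("ABCD", 2837), ("ABC", 2117)],   # second entry is an assumed example
)
print(instantiated, left)
```

After ABCD is instantiated, 1473 units remain, which is less than the size of any other phantom in the lattice, so the process stops on the second criterion.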
Space Allocation
• According to Abel's impossibility theorem, general polynomial equations of degree higher than 4 cannot be solved algebraically; we call such cases unsolvable
• More general multi-level configurations generate equations of even higher order, which are also unsolvable
• For these unsolvable cases we use heuristics to decide the space allocation, based on the analysis of the solvable cases
Space Allocation
• Super-node with Linear Combination
• Super-node with Square Root Combination
• Linear Proportional Allocation
• Square Root Proportional Allocation
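The two proportional heuristics differ only in the weight given to each relation. The sketch below contrasts them on two hypothetical relations R1 and R2; it does not cover the super-node variants, whose combination rules are not spelled out here.

```python
import math


def proportional_allocation(groups, memory, weight):
    """Split memory across relations in proportion to weight(number of groups)."""
    weights = {rel: weight(g) for rel, g in groups.items()}
    total = sum(weights.values())
    return {rel: memory * w / total for rel, w in weights.items()}


# Hypothetical relations: R2 has four times as many groups as R1.
groups = {"R1": 100, "R2": 400}

linear = proportional_allocation(groups, 900, weight=lambda g: g)
square_root = proportional_allocation(groups, 900, weight=math.sqrt)

print(linear)        # shares in ratio 1 : 4
print(square_root)   # shares in ratio 1 : 2
```

Square-root weighting moves space from the larger relation to the smaller one compared with linear weighting, while the total allocated stays the same.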