Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations
Thanh Tran, Andrew McGregor, Yanlei Diao,
Liping Peng, Anna Liu
University of Massachusetts, Amherst
Presented by Xin Miao
Thanh Tran
Yanlei Diao
Previous works:
PODS: A New Model and Processing Algorithms for Uncertain Data Streams. SIGMOD 2010
Capturing Data Uncertainty in High-Volume Stream Processing. CIDR 2009
Probabilistic Inference over RFID Streams in Mobile Environments. ICDE 2009
Efficient Data Interpretation and Compression over RFID Streams. ICDE 2008
Computational Astrophysics
• Astrophysical surveys generate observations of $10^8$ stars and galaxies, at nightly data rates of 0.5 TB to 20 TB
• Observations are noisy
• Schema: (o_id, time, (x,y)^p, luminosity^p, color^p)
  ◦ (x,y)^p and luminosity^p are continuous; color^p is discrete
Computational Astrophysics
• Queries are issued to detect dynamic features, transient events, and anomalous behaviors
Q1:
SELECT group_id, max(O.luminosity)
FROM Observations O [RANGE 1 hour]
GROUP BY area_id(O.(x,y), AREA_DEF) AS group_id
HAVING max(O.luminosity) > 20
Query answer:
group_id   max_luminosity^p   existence prob.
10         (distribution)     0.60
Quality of the returned answer?
RFID Tracking and Monitoring
• RFID technology is used for object tracking and monitoring
  ◦ E.g., supply chain, health care management
• Raw RFID readings are noisy and incomplete
• Inference yields a stream with object locations
• Schema: (time, tag_id, weight, (x,y)^p, size^p), where (x,y)^p is continuous and size^p is discrete
time    tag_id   weight   (x,y)^p        size^p
08:15   1        2.5kg    (10.1,12.0)    L
10:24   2        6.3kg    (2.5,3.4)      M
15:30   3        1.2kg    (25.6,32.1)    S
17:42   4        3kg      (13.4,26.5)    L
RFID Tracking and Monitoring
• Query detecting violations of a fire code
• Location: (time, tag_id, weight, (x,y)^p, size^p)
Q2:
SELECT group_id, sum(S.weight)
FROM Locations S
GROUP BY area_id(S.(x,y)) AS group_id
HAVING sum(S.weight) > 200
Query answer:
group_id   sum_weight^p     existence prob.
10         (distribution)   0.75
Quality of the returned alert?
Problem Statement
• Uncertain attributes: discrete and continuous
• Complex relational operations: WHERE, GROUP BY-HAVING, AGGREGATION
• Objectives:
  ◦ Computing result distributions with bounded errors
  ◦ Query processing on high-volume streams
Challenges
• Characterizing the uncertainty of query results requires the probability distributions of uncertain attributes in intermediate and final query results
• Computing distributions of aggregates is hard
  ◦ Computing an aggregate of n discrete random variables may require enumerating an exponential number, $2^n$, of possible worlds
E.g., Q1:
SELECT group_id, max(O.luminosity)
FROM Observations O [RANGE 1 hour]
GROUP BY area_id(O.(x,y), AREA_DEF) AS group_id
HAVING max(O.luminosity) > 20
Query answer:
group_id   max_luminosity^p   TEP
10         (distribution)     0.6
Challenges
• Offering query answers with bounded errors is crucial
  ◦ State of the art: Monte Carlo simulation with unbounded errors
    MCDB, SIGMOD’08
    Handling uncertain data in array database, ICDE’08
    Database support for probabilistic attributes and tuples, ICDE’08
• In data streams, query processing needs to employ incremental computation as tuples arrive
State-of-The-Art
• Continuous random variables modeled by Gaussian mixture models [SIGMOD’10]
  ◦ Closed-form solutions for aggregates
  ◦ Cannot be applied for selection, group by, aggregates
• Monte Carlo simulation gives relative approximations for evaluating HAVING predicates [Re & Suciu, VLDBJ’09]
• For discrete distributions, the best technique to compute sum needs $O(nD^3)$, where D is the domain size [Kanagal & Deshpande, ICDE’09]
Solution
• Query evaluation framework:
  ◦ Mixed-type data model
  ◦ Approximate representations
  ◦ Approximation metrics
• Approximation algorithms for aggregates
  ◦ Randomized algorithms: all aggregates
  ◦ Deterministic algorithms: max, min, sum, count
• Query planning
Mixed-type Data Model for Query Evaluation
• Uncertain attributes:
  ◦ $A_x$: continuous
  ◦ $A_y$: discrete
• Certain attributes
  ◦ $A_z$: continuous/discrete
• Mixed-type distribution: g = <p, f>
  ◦ p: tuple existence probability (TEP)
  ◦ f: joint density function for all uncertain attributes
$f(A_x = x, A_y = y) = f_{A_x \mid A_y}(x \mid y) \cdot P(A_y = y)$
Data Model Example
Location: (o_id, time, x^p, luminosity^p, color^p)
o_id   time    x^p    luminosity^p   color^p   p (TEP)
1      08:15   10.1   201.25         R         0.9
2      10:24   9.5    98.63          G         0.8
3      15:30   21.3   312.6          B         0.75
4      17:42   32.7   135.8          Y         0.86
For tuple 1:
$f(A_x = (10.1, 201.25), A_y = R) = f_{A_x \mid A_y}(10.1, 201.25 \mid R) \cdot P(A_y = R) = P(A_x = (10.1, 201.25)) \cdot P(A_y = R)$
The last equality holds when $A_x$ and $A_y$ are independent.
Data Model Example
$P[(A_x, A_y, A_z) = \bot] = 1 - p$, where $\bot$ denotes that the tuple does not exist
$P[A_x = x, A_y = y, A_z = z] = p \cdot f(A_x = x, A_y = y)$
Location: (o_id, time, x^p, luminosity^p, color^p), with the TEP p stored per tuple
Conditioning
• What are the TEP and the pdf after conditioning?
  ◦ Input tuple 1: (luminosity^p, x^p) with TEP = 0.9
  ◦ Conditioned on 1 < x^p < 2: TEP = ?, pdf = ?
  ◦ Conditioned on 2 < x^p < 3: TEP = ?, pdf = ?
Conditioning
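• Input: mixed-type distribution <p, f> with support S
• Condition on range I: the new support is $S' = S \cap I$
• Normalize to obtain the mixed-type distribution <p', f'>, where f' is a truncated distribution:
$q = \int_{S \cap I} f(x)\,dx$
$f'(x) = f(x) / q$
$p' = p \cdot q$
$p' \cdot f'(x) = p \cdot f(x)$ for $x \in S'$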
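The conditioning step can be sketched in Python. This is only a minimal illustration, assuming the density f has been discretized into point masses; the names `MixedType` and `condition` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MixedType:
    tep: float               # tuple existence probability p
    pmf: Dict[float, float]  # assumed discretization of f over support S (sums to 1)


def condition(g: MixedType, lo: float, hi: float) -> MixedType:
    """Condition g on the range I = [lo, hi]: p' = p * q and f' = f / q on S ∩ I."""
    kept = {x: w for x, w in g.pmf.items() if lo <= x <= hi}
    q = sum(kept.values())   # probability mass of f inside I
    if q == 0.0:
        # No mass inside I: the conditioned tuple cannot exist.
        return MixedType(0.0, {})
    return MixedType(g.tep * q, {x: w / q for x, w in kept.items()})


g = MixedType(tep=0.9, pmf={1.5: 0.5, 2.5: 0.3, 3.5: 0.2})
c = condition(g, 1.0, 2.0)   # keeps only the point mass at 1.5
```

In this sketch the product p'·f'(x) stays equal to p·f(x) for every x kept in the support, which is the identity stated on the slide.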
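Relational Processing under Mixed-type Model
Execution of Q1: ObjStream → GROUPBY/AGGR MAX → σ MAX>20 → {areaNo, max_lumin^p}
• Input tuples (luminosity^p, x^p), each with TEP 1
• Group condition $C_{G_i}$: $iL \le x \le (i+1)L$
• Conditioning each tuple on $C_{G_i}$ gives the mass q inside the group, which becomes the tuple's TEP in group $G_i$:
tuple   TEP before   q     TEP in G_i
1       1            0.7   0.7
2       1            0.8   0.8
3       1            0.6   0.6
…       …            …     …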
Relational Processing under Mixed-type Model
Execution of Q1: ObjStream → GROUPBY/AGGR MAX → σ MAX>20 → {areaNo, max_lumin^p}
• Tuples in group $G_i$ (luminosity^p, x^p):
tuple   TEP
1       0.7
2       0.8
3       0.6
…       …
• $\max_{G_i}$(luminosity): max_lumin^p with TEP 0.99
• σ(max>20): max_lumin^p with TEP 0.58 (normalized again!)
Approximation Framework: Representation
• Employ cumulative distribution functions (CDFs)
  ◦ To approximate distributions of aggregates
• Two forms of CDFs: StepCDFs and LinCDFs
Approximation Framework: Metric
• Kolmogorov-Smirnov (KS) distance between two CDFs F and F':
$KS(F, F') = \sup_x |F(x) - F'(x)|$
• (ε, δ) approximation:
  ◦ The KS distance between the approximate distribution and the exact distribution is at most ε, with probability (1 − δ)
  ◦ δ = 0: deterministic; δ > 0: randomized
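As a small illustration of the KS metric, the following Python sketch computes the distance between two step CDFs. It assumes each CDF is given as a sorted list of (x, F(x)) breakpoints and is right-continuous; the representation and the names `eval_step` and `ks_distance` are assumptions made for this example, not the StepCDF structure from the talk.

```python
from bisect import bisect_right
from typing import List, Tuple

Step = List[Tuple[float, float]]  # sorted (x, F(x)) breakpoints, right-continuous


def eval_step(cdf: Step, x: float) -> float:
    """Value of the step CDF at x (0 before the first breakpoint)."""
    xs = [pt[0] for pt in cdf]
    i = bisect_right(xs, x)
    return cdf[i - 1][1] if i > 0 else 0.0


def ks_distance(f: Step, g: Step) -> float:
    """sup_x |F(x) - G(x)|; for step CDFs it is attained at a breakpoint."""
    points = sorted({x for x, _ in f} | {x for x, _ in g})
    return max(abs(eval_step(f, x) - eval_step(g, x)) for x in points)


f = [(1.0, 0.2), (2.0, 0.7), (3.0, 1.0)]
g = [(1.0, 0.25), (2.0, 0.6), (3.0, 1.0)]
d = ks_distance(f, g)  # largest gap is at x = 2
```

Because both functions are piecewise constant, checking the union of breakpoints is enough; a piecewise linear CDF would also need its kink points checked.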
Bounded-Error Monte-Carlo Simulation
• (ε, δ) approximation of aggregates $A = f(Y_1, Y_2, \ldots)$, e.g., sum, count, avg, min, max
1. For the t-th tuple, generate $m = \ln(2\delta^{-1}) / (2\epsilon^2)$ samples $y_{t1}, y_{t2}, \ldots, y_{tm}$ from the distribution $Y_t$
2. Compute m aggregated values $a_i = A(y_{1i}, y_{2i}, \ldots)$ (based on an existing deterministic algorithm Φ)
3. Return the CDF from these values
• Cost is a factor of $O(\epsilon^{-2} \log \delta^{-1})$ greater than the time and space of Φ
• Number of samples is $O(1/\epsilon^2)$; high cost for small ε
[Figure: m samples $y_{t1}, y_{t2}, \ldots, y_{tm}$ at each time instance t]
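The three steps can be sketched in Python as follows. This is a batch illustration under stated assumptions rather than the streaming implementation from the talk: each input distribution is represented by a hypothetical sampler function, and the helper names (`num_samples`, `mc_aggregate`, `empirical_cdf`) are invented for the example.

```python
import math
import random
from bisect import bisect_right
from typing import Callable, List, Sequence

Sampler = Callable[[random.Random], float]  # draws one value from some Y_t


def num_samples(eps: float, delta: float) -> int:
    """m = ln(2 / delta) / (2 * eps^2), rounded up to an integer."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


def mc_aggregate(samplers: Sequence[Sampler],
                 aggregate: Callable[[List[float]], float],
                 eps: float, delta: float,
                 rng: random.Random) -> List[float]:
    """Return m sorted aggregate values; they define the empirical CDF."""
    m = num_samples(eps, delta)
    # The i-th aggregate value uses one fresh sample from every tuple.
    values = [aggregate([draw(rng) for draw in samplers]) for _ in range(m)]
    values.sort()
    return values


def empirical_cdf(values: List[float], x: float) -> float:
    """Fraction of aggregate values that are <= x."""
    return bisect_right(values, x) / len(values)


rng = random.Random(0)
dice = [lambda r: float(r.randint(1, 6)) for _ in range(3)]  # three toy tuples
vals = mc_aggregate(dice, max, eps=0.1, delta=0.05, rng=rng)
```

With ε = 0.1 and δ = 0.05 the sample count is 185, and it grows quadratically as ε shrinks, which is the cost the slide points out.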
Distributions of MAX
• $M_t = \max(Y_1, Y_2, \ldots, Y_t)$ (simple case)
• $Y_i$: modeled by a distribution that can take values from a universe U = {1, 2, 3, …, n}
• Objective: compute the CDF of $M_t$, namely $F_t^M$:
$F_t^M(x) = \prod_{i=1}^{t} P(Y_i \le x)$
• Basic algorithm:
$F_t^M(x) = F_{t-1}^M(x) \cdot P(Y_t \le x)$
  ◦ Complexity O(tn)
  ◦ Inefficient for stream processing for large n
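The basic O(tn) algorithm can be written down directly. The Python sketch below assumes independent tuples and represents each $Y_t$ as a list of probabilities over the universe {1, …, n}; the function names are hypothetical.

```python
from typing import List, Sequence


def update_max_cdf(prev_cdf: Sequence[float], pmf: Sequence[float]) -> List[float]:
    """One stream update: F_t(x) = F_{t-1}(x) * P(Y_t <= x) for x = 1..n.

    pmf[k] is P(Y_t = k + 1); prev_cdf[k] is F_{t-1}(k + 1).
    """
    out = []
    running = 0.0  # running value of P(Y_t <= x)
    for f_prev, p in zip(prev_cdf, pmf):
        running += p
        out.append(f_prev * running)
    return out


def max_cdf(pmfs: Sequence[Sequence[float]], n: int) -> List[float]:
    """Exact CDF of max(Y_1, ..., Y_t), using O(n) work per tuple."""
    cdf = [1.0] * n  # the empty product
    for pmf in pmfs:
        cdf = update_max_cdf(cdf, pmf)
    return cdf


# Two independent fair coins over the universe {1, 2}.
cdf = max_cdf([[0.5, 0.5], [0.5, 0.5]], n=2)
```

Each update touches all n points of the universe, which is the per-tuple cost that becomes prohibitive for large n.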
MAX: Intuition
• Dynamically partition the universe into consecutive intervals
• Use the estimates at the two ends of an interval to estimate any intermediate point, since the CDF is non-decreasing
MAX: Approximate Representation with Invariants
• Approximate using StepCDF
• Partition the universe into consecutive intervals: $[1, n] = \bigcup_i [a_i, b_i]$, with $a_{i+1} = b_i + 1$
• Maintain the estimates of cumulative probabilities, $c_{a_i}$ and $c_{b_i}$, for $[a_i, b_i]$
• Invariants:
(1) Estimates of the two ends of an interval are close (accuracy): $c_{b_i} \le (1 + \epsilon')\, c_{a_i}$
(2) Estimates of two adjacent intervals are separated (performance): $c_{a_{i+1}} \ge (1 + \epsilon')\, c_{a_i}$
• ε' is derived from ε and log |U|
[Figure: adjacent intervals I1 = [a1, b1] and I2 = [a2, b2] with their estimates $c_{a_1}$, $c_{b_1}$, $c_{a_2}$, $c_{b_2}$]
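MAX: Algorithm
• Employ a splitting scheme
• On seeing tuple t, for an interval I = [a, b] with estimates $c_a$ and $c_b$:
1. Update: $c'_a = c_a \cdot P[Y_t \le a]$, $c'_b = c_b \cdot P[Y_t \le b]$
2. Subpartition to ensure Invariant (1)
3. Adjust by splitting and shifting while ensuring Invariant (2)
[Figure: an interval I = [a, b] containing the values v1, v2, v3 of tuple t. Step 1 updates the estimates to $c'_a$ and $c'_b$; Step 2 subpartitions I into I1 and I2; Step 3 splits I2 into I21 and I22. Conditions annotated in the figure: $c'_b > c'_a \cdot (1 + \epsilon)$ and $b - v_2 > \frac{1}{2}(b - a)$]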
MAX: Algorithm
• At any step in the algorithm, the number of intervals is bounded as follows:
• The maximum generation of an interval is
MAX: Analysis
1. Estimates of the two ends of an interval are bounded, so estimates of any point in an interval are bounded
2. Number of intervals is bounded
3. Number of times an interval is split is bounded, i.e., log U
• (ε, 0) algorithm for max; update time is $O(\epsilon^{-1} \log U \ln \epsilon^{-1})$
• Extend to continuous distributions: a general approach is to consider a large universe of size $2^{64}$. The complexity is then proportional to $\log 2^{64} = 64$.
MAX: Experimental Results
[Figure: throughput (log scale, 100 to 10,000,000) vs. KS requirement ε (0.01 to 0.1) for Det, Rand(90%), Rand(95%), Rand(99%)]
Distributions of SUM
• Approximate representation using quantiles
  ◦ Uniform quantiles: $k = 1/\epsilon$
[Figure: a CDF over [a, b] with quantile levels 1/3, 2/3, 1]
SUM: Intuition
• Assume each $Y_t$ takes values from a finite set $V_t$ of size at most λ
• Exact recurrence: $F_t^S(x) = \sum_{v \in V_t} F_{t-1}^S(x - v)\, P[Y_t = v]$
• On receiving each new tuple, we can produce an intermediate approximation:
$F(x) = \sum_{v \in V_t} LinCDF_{t-1}(x - v)\, P[Y_t = v]$
  ◦ $LinCDF_{t-1}$: approximation of $F_{t-1}^S$ using LinCDF
  ◦ $F$: intermediate approximation of $F_t^S$
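The exact recurrence for $F_t^S$ can be illustrated with a few lines of Python. This sketch evaluates the recurrence literally, with each CDF represented as a function; it is not the LinCDF-based algorithm, and the names `initial_cdf` and `add_tuple` are hypothetical.

```python
from typing import Callable, Dict

Cdf = Callable[[float], float]


def initial_cdf(x: float) -> float:
    """CDF of the empty sum, which is 0 with probability 1."""
    return 1.0 if x >= 0 else 0.0


def add_tuple(prev: Cdf, pmf: Dict[float, float]) -> Cdf:
    """F_t(x) = sum over v in V_t of F_{t-1}(x - v) * P[Y_t = v]."""
    def cdf(x: float) -> float:
        return sum(prev(x - v) * p for v, p in pmf.items())
    return cdf


coin = {1.0: 0.5, 2.0: 0.5}  # Y_t is 1 or 2 with equal probability
f2 = add_tuple(add_tuple(initial_cdf, coin), coin)  # CDF of Y_1 + Y_2
```

Evaluating the nested closures costs time exponential in the number of tuples, which is one way to see why the intermediate result has to be simplified back to a bounded-size representation after every update.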
SUM: Algorithm
$F(x) = \sum_{v \in V_t} LinCDF_{t-1}(x - v)\, P[Y_t = v]$
a) LinCDF before updating, which has $1/\epsilon$ quantiles
b) Shifting and scaling the LinCDF
c) Composing with linear interpolation: $\lambda/\epsilon$ points
d) Simplify: get a new LinCDF, $LinCDF_t(x)$, which again has $1/\epsilon$ quantiles
• Can be improved to log time using binary search
SUM: Analysis
• (ε, 0) algorithm for sum
  ◦ Space
  ◦ Update time
• Supporting continuous distributions
  ◦ Discretize the input distribution by LinCDF or StepCDF
  ◦ Total error is the sum of the discretization error and the approximation error
SUM: Experimental Results
[Figure: throughput (log scale, 100 to 100,000) vs. KS requirement ε (0.01 to 0.1) for Det, Rand(90%), Rand(95%), Rand(99%)]
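Approximate Query Answers
• Extended KS distance, named KSM, where G has TEP p and CDF F, and G' has TEP p' and CDF F':
$KSM(G, G') = \max(|p - p'|,\ \sup_x |p\,F(x) - p'\,F'(x)|,\ \sup_x |p\,(1 - F(x)) - p'\,(1 - F'(x))|)$
• Bounds quantities such as p, p·F(x), and p·(1 − F(x))
• Query approximation objective, for the approximate query answer set and the exact answer set:
  ◦ Both sets should have the same number of tuples
  ◦ The KSM distance between each attribute in an approximate tuple and the corresponding attribute in the corresponding exact tuple is at most ε with prob. 1 − δ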
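The KSM formula can be evaluated with a short Python function. This sketch assumes both CDFs have been sampled at the same grid points, so the supremum is approximated by a maximum over that grid (exact only for step CDFs whose breakpoints all lie on the grid); the function name and signature are hypothetical.

```python
from typing import Sequence


def ksm_distance(p: float, cdf: Sequence[float],
                 p2: float, cdf2: Sequence[float]) -> float:
    """KSM between (p, F) and (p2, F2), with F and F2 given on a shared grid."""
    if len(cdf) != len(cdf2):
        raise ValueError("both CDFs must be sampled on the same grid")
    d_tep = abs(p - p2)                                  # |p - p'|
    d_lower = max(abs(p * f - p2 * g)                    # |p F(x) - p' F'(x)|
                  for f, g in zip(cdf, cdf2))
    d_upper = max(abs(p * (1 - f) - p2 * (1 - g))        # |p (1 - F) - p' (1 - F')|
                  for f, g in zip(cdf, cdf2))
    return max(d_tep, d_lower, d_upper)


d = ksm_distance(0.6, [0.2, 0.7, 1.0], 0.58, [0.25, 0.6, 1.0])
```

In this toy input the TEPs differ by only 0.02, but the weighted lower-tail term dominates, so the KSM distance is larger than the TEP gap alone.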
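Approximate Query Answers
• Consider a Select-From-Where-Groupby-Having block
  ◦ One aggregate predicate in the Having clause
• Plan stages: Selection/Group By → Aggregation → Selection/Projection
[Figure: the plan annotated with ε = 0 and ε > 0, the label "Error occurs here!", and an example answer (gid 10, max_luminosity^p, TEP 0.6 with error ε)]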
Query Planning
• Planning: find a query plan that meets the objective
• Proposition on Selection:
  ◦ Selection on an attribute with an (ε, δ)-approximation, using a range condition, yields a (2ε, δ)-approximation
  ◦ If the selection uses a union of ranges, the approximation error is twice the sum, i.e., $2\sum_i \epsilon_i$
• Top-down approach to provision error bounds
  ◦ If the error is ε, we should provision ε/2 for the approximation of sum
Query Planning: Experimental Results
[Figure: throughput (0 to 7000) vs. KS requirement ε (0.01 to 0.09) for Det, Rand(90%), Rand(95%), Rand(99%)]
Conclusion
• Evaluation framework and approximation techniques for complex operations
  ◦ Randomized algorithms: general
  ◦ Deterministic algorithms: often better
  ◦ For complex queries, the errors are bounded while having throughput of thousands of tuples per sec
• Future work
  ◦ Wider range of aggregates
  ◦ Correlation among derived attributes
  ◦ Query optimization