34
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)

Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work:

Embed Size (px)

Citation preview

Probabilistic Threshold Range Aggregate Query

Processing over Uncertain Data

Wenjie Zhang

University of New South Wales & NICTA, AustraliaJoint work:

Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)

Outline

DB@UNSW

2

Background and Preliminaries Probabilistic Threshold Range Aggregate

Query Exact query processing Approximate query processing: Simple

Sampling & Double Sampling Experiments

Conclusion

Applications

DB@UNSW

3

Many applications involve data that is imperfect due to data randomness and incompleteness limitation of equipment delay or lose in data transfer … …

Applications Sensor networks Environmental surveillance Moving objects Data cleaning and integration … …

Applications

DB@UNSW

4

Sensor Networks: Sensor readings are often imprecise due to equipment

limitation and periodical reporting mechanism. (figures are borrowed from Jian et al, SIGMOD08)

Applications

DB@UNSW

5

Mobile Equipments / Moving Objects A mobile object reports its location periodically, the

exact location is often uncertain.

Applications

DB@UNSW

6

Satellite data

Applications

DBG @ UNSW

Data Quality Social Data Collection: Errors and estimation

inherent in customer surveys and sampling

7

Outline

DB@UNSW

8

Background and Preliminaries Modeling Uncertainty & Related Work

Probabilistic Threshold Range Query Conclusion

Modeling Uncertainty ( cont. )

DB@UNSW

9

Uncertain Objects Model1. Continuous case: described using a probability

density function (PDF) fU such that . E.g., uniform distribution, normal distribution.

Uu U duuf 1)(

Modeling Uncertainty ( cont. )

DB@UNSW

10

Uncertain Objects Model2. Discrete case : described using a set of

instances each instance u has an occurrence probability pu

1 Uu up

Possible World Semantics

DB@UNSW

11

Given a set of uncertain objects U1,U2, ..., Un, a possible world W = u1,u2, .., un is a set of n instances --- one instance per uncertain object

The probability of a possible worlds is

P(W) =

Let Ω be the set of all possible world, clearly,

n

i iuP1 )(

1)( WWP

Probabilistic Queries:

DB@UNSW

12

Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]

Aggregate Queries [BDJR05, MJ07, CG07]

Join Queries [CSP06, AW07]

Top-k queries [SIC07, YLSK08, RDS07, HJZL08]

Nearest Neighbor Queries [KKR07, CCMC08]

Skyline Queries [PJLY07]

… …

Range query

DBG @ UNSW

13

Uncertain objects, exact query Probability threshold is often assigned

Related Work

DB@UNSW

14

Range Queries [TCXNKP05, BPS06, AY08]

Given a rectangle r and a probabilistic threshold t , find all objects that appear in r with probability at least t.

Appearance probability

r

o .reg ion

rregionoxdxxpdfo

.)(.

U-tree

DB@UNSW

15

Probabilistically Constrained Region ( PCR ) [TCXNKP05]

PCR (0.2) Multi PCRs

Outline

DB@UNSW

16

Introduction Modeling Uncertainty & Related Work Probabilistic Threshold Range Aggregate

Query (PTRA) Conclusion

Contribution

DB@UNSW

17

Formally define PTRA query aU-Tree structure for exact PTRA query singleSample and doubleSample

techniques for approximate answer.

Problem Statement

DB@UNSW

18

Given a set of uncertain objects and query q , return the number of uncertain objects with appearance probability no less than threshold pq

Problem Definition

DB@UNSW

19

Assume threshold = 0.5, if the appearance probability computed for b is > 0.5 and for c is < 0.5, then the aggregate returned is 2 (a & b)

Exact Query Processing ( aU-Tree)

DB@UNSW

20

Main idea: add aggregate information on U-tree Advantage: stop at intermediate level if

pruned or fully covered by the query Disadvantage: otherwise, still need to drill

down to the leaf nodes. For a large portion of uncertain objects,

appearance probability needs to be computed Expensive for a massive number of instances

per object!

Exact Query Processing ( aU-Tree)

DB@UNSW

21

singleSample

DB@UNSW

22

Sampling the instances of the uncertain objects. If m’ out of m sampled instances are inside query

region, then the approximate appearance probability is m’/m

singleSample ( cont. )

DB@UNSW

23

An immediate application of Chernoff-Hoeffding bound

doubleSample

DB@UNSW

24

Single Sampling is expensive when there is a massive number of objects!

Sampling the uncertain objects as well. Naive : uniform sampling objects from all

uncertain objects.

doubleSample: Accuracy

DB@UNSW

25

•Note: “ appearance probability” of each object follows uniform distribution means spatial location is uniformly distributed.•Using Chernoff-Hoeffding bound.

doubleSample: Our Approach

DB@UNSW

26

Skew! Aim: select K disjoint groups covering all objects

with the minimum “skew”; i.e. objects in each group with “uniform” distribution. (Then do uniform sampling of objects in each group.)

The optimization problem is NP-hard. Observation:

Min-skew is a good heuristic to conduct such a group.

aU-tree groups objects with a similar principle to the min-skew.

doubleSample: Our Approach

DB@UNSW

27

Step 1: choose K subtrees to cover all objects with the total minimum skew. NP-hard! Find a level L such that the number of nodes at level

L is smaller than K but the number of nodes at level L-1 is larger than K.

Feed the min-skew algorithm with the subtrees at level L.

(note: if at a level L, the number of nodes = K, then these K subtrees are chosen.)

Step 2: sample objects in each subtree. Step 3. sample instances in each sampled object.

Experiments

DB@UNSW

28

Algorithms:

exact, singleSample, doubleSample

Data set:

LB : 53k objects at long beach country

CA : 62k objects at California

Synthetic aircraft dataset in 3D

10k instances for each points follow Uniform or constrained-Gaussian

Setting : C++, P4 2.8GHz , 2G memory, Debian linux, Page size 8K

Efficiency

DB@UNSW

29

Accuracy

DB@UNSW

30

Accuracy ( cont. )

DB@UNSW

31

Conclusion

DB@UNSW

32

Definition of PTRA aU-Tree technique Sampling technique Future work. Any approach with

theoretic guarantee?

DB@UNSW

33

Thanks

Min-Skew technique

DB@UNSW

34