Cleaning Uncertain Data with Quality Guarantees

Reynold Cheng, Jinchuan Chen, Xike Xie

VLDB 2008

Presented by SHAO Yufeng

Outline

Background

Related work

Data and query model

PWS-quality model

Cleaning procedure

Experimental results

Uncertain Database (old model)

Uncertainty is inherent in many applications

Examples: RFID data, sensor network data, data protected for privacy reasons

In many models it is infeasible to eliminate all uncertainty

Uncertain Database (new model)

Previous models focus on querying the uncertain database

But what if we are able to reduce SOME of the uncertainty in this kind of database?

A new model is required to produce an optimal solution

Example 1: Sensor probing

Some sensors in the sensor network might have transmission problems and cannot update their data

Commands can be sent to refresh some sensors

New, certain data are obtained

Limited by bandwidth and battery power, we cannot probe too often

Example 2: Movie rating

Movie ratings (IMDB, Netflix) collected from customers might contain some uncertainty

Managers can contact customers to verify the rating data

New, certain movie rating data are obtained

Limited by manpower or other resources

Cleaning Data

[Diagram: Uncertain DB -> Query -> Ambiguous result; the cleaning procedure turns the uncertain DB into a LESS uncertain DB -> Query -> LESS ambiguous result]

Real model example

A database of some products and their (uncertain) prices:

Key  Product ID  Price ($)  Prob.
a1   a           120        0.7
a2   a            80        0.3
b1   b           110        0.6
b2   b            90        0.4
c1   c           140        0.5
c2   c           110        0.3
c3   c           100        0.2
d1   d            10        1

The price of product a has two possible values: 120 (prob. 0.7) or 80 (prob. 0.3)

Query Example 1

Query 1 (range query): select products with price in the range [$100, $110], over the product table above

Possible world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
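The possible world results for the range query can be reproduced by brute force. The sketch below is only an illustration: the dictionary layout, the function name `range_query_results`, and the assumption that x-tuples are mutually independent are choices made here, not something the slides prescribe.

```python
from collections import defaultdict
from itertools import product

# The product table as x-tuples: each alternative is (key, price, probability).
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def range_query_results(db, low, high):
    """Map each distinct PW-result (a frozenset of keys) to its probability."""
    results = defaultdict(float)
    # A possible world picks exactly one alternative from every x-tuple.
    for world in product(*db.values()):
        prob = 1.0
        for _, _, p in world:
            prob *= p
        answer = frozenset(key for key, price, _ in world if low <= price <= high)
        results[answer] += prob
    return dict(results)

pw_results = range_query_results(DB, 100, 110)
for answer, prob in sorted(pw_results.items(), key=lambda kv: -kv[1]):
    print(sorted(answer), round(prob, 2))
```

The enumeration visits every combination of alternatives, so it is exponential in the number of x-tuples and only practical for toy tables like this one.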

Query Example 2

Query 2 (max query): select the product with the highest price, over the same product table

Possible world answers: ({c1}, 0.5), ({a1}, 0.35), ({b1,c2}, 0.054), ({b1}, 0.036), ({c2}, 0.036), ({c3}, 0.024)
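A quick way to check the max-query answers is to enumerate the possible worlds and confirm that the probabilities form a distribution. This is a sketch under two assumptions that are made here for illustration: x-tuples are independent, and tuples tied at the maximum price (b1 and c2 at 110) are returned together.

```python
from collections import defaultdict
from itertools import product

# The product table as x-tuples: each alternative is (key, price, probability).
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def max_query_results(db):
    """Map each distinct PW-result of the max query to its probability."""
    results = defaultdict(float)
    for world in product(*db.values()):
        prob = 1.0
        for _, _, p in world:
            prob *= p
        top = max(price for _, price, _ in world)
        # Assumption: tuples tied at the maximum price are returned together.
        answer = frozenset(key for key, price, _ in world if price == top)
        results[answer] += prob
    return dict(results)

pw_answers = max_query_results(DB)
total = sum(pw_answers.values())  # a valid distribution sums to 1
for answer, prob in pw_answers.items():
    print(sorted(answer), round(prob, 3))
```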

Clean up example

Suppose we have some amount of resources to clean up some of the data

Assume we clean up the information related to products a and c

New database with less uncertainty:

Key  Product ID  Price ($)  Prob.
a2   a            80        1
b1   b           110        0.6
b2   b            90        0.4
c3   c           100        1
d1   d            10        1

Clean up example (Cont.)

Run query 1 again on the cleaned database: select products with price in the range [$100, $110]

New possible world results: ({b1,c3}, 0.6), ({c3}, 0.4)

Old possible world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)

The cleaned database gives a clearly less uncertain result, but the clean-up procedure is limited by the budget
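The effect of cleaning on the range query can be sketched in a few lines. The helper `clean` is hypothetical (it is not part of the presented system): it collapses a cleaned x-tuple to the one alternative that turned out to be true, here a2 and c3 as in the example.

```python
from collections import defaultdict
from itertools import product

# The product table as x-tuples: each alternative is (key, price, probability).
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def clean(db, outcomes):
    """Collapse each x-tuple named in `outcomes` to the alternative whose key
    is given, with probability 1 (hypothetical helper for illustration)."""
    cleaned = {}
    for name, alternatives in db.items():
        if name in outcomes:
            cleaned[name] = [(k, v, 1.0) for k, v, _ in alternatives if k == outcomes[name]]
        else:
            cleaned[name] = list(alternatives)
    return cleaned

def range_query_results(db, low, high):
    """Enumerate possible worlds and accumulate the probability of each result set."""
    results = defaultdict(float)
    for world in product(*db.values()):
        prob = 1.0
        for _, _, p in world:
            prob *= p
        results[frozenset(k for k, v, _ in world if low <= v <= high)] += prob
    return dict(results)

before = range_query_results(DB, 100, 110)
after = range_query_results(clean(DB, {"a": "a2", "c": "c3"}), 100, 110)
print(len(before), "distinct results before cleaning,", len(after), "after")
```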


Important related works

Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar: Evaluating Probabilistic Queries over Imprecise Data. SIGMOD Conference 2003: 551-562

Mentions the idea of cleaning for max/min and range queries, but gives no actual implementation

P. Andritsos, A. Fuxman, and R. Miller: Clean answers over dirty databases: A probabilistic approach. ICDE 2006

Introduces a query rewriting technique

Important related works (Cont.)

Jinchuan Chen, Reynold Cheng: Quality-Aware Probing of Uncertain Data with Resource Constraints. SSDBM 2008

Similar cleaning method

Uncertainty is represented by continuous pdfs

Supports fewer query types (range queries only)

Chris Mayfield, Jennifer Neville, Sunil Prabhakar: ERACER: A Database Approach for Statistical Inference and Data Cleaning. SIGMOD 2010

Uses attribute-level correlations to provide optimized clean-up


System Structure

[Diagram. Components: User, Query Engine, Probabilistic Database, Quality Manager, Quality Evaluator, Data Cleaning Algorithm, Cleaning Manager, External Data Sources. Other labels: query request, query answer, PWS-quality score, cleaning budget, cleaning set, cleaning request, data update]

Important Notations

In the product table (Key, Product ID, Price ($), Prob.):

tuple t_i: one row (n tuples in total)

x-tuple τ_i: the rows that share one Product ID, e.g. {a1, a2} (m x-tuples in total)

uncertain attribute: Price ($)

existential probability e_i: Prob.


Query in possible world model

[Diagram: Probabilistic DB -> possible worlds (probabilities 0.18, 0.1, 0.1, ...) -> PW-results ({b1,c2}, 0.18), ({b1,c3}, 0.1), ... -> final query answer (b1, 0.28), (c2, 0.18), (c3, 0.1); PWS-quality = -1.44]

Qualification probability (p_i) of c2: 0.18

Qualification probability (P_k) of x-tuple c: 0.28

Probabilistic Range Query (PRQ)

Given a closed interval \([a, b]\), where \(a, b \in \mathbb{R}\) and \(a \le b\), a PRQ returns a set of tuples \((t_i, p_i)\), where \(p_i\) is the non-zero probability that \(v_i \in [a, b]\).

Range query: select products with price in the range [$100, $110]

Possible world result set, each result with its probability q_j of occurrence: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)

Probabilistic Maximum Query (PMaxQ)

A PMaxQ returns a set of tuples \((t_i, p_i)\), where \(p_i\), the qualification probability of \(t_i\), is the non-zero probability that \(v_i \ge v_j\) for all \(j = 1, \ldots, n\) with \(j \ne i\).

Query: select the product with the highest price

Possible world answers: ({c1}, 0.5), ({a1}, 0.35), ({b1,c2}, 0.054), ({b1}, 0.036), ({c2}, 0.036), ({c3}, 0.024)


PWS-quality

Suppose we have two sets of possible world results:

[Figure: two bar charts of PW-result probabilities; labels include the result sets {a2,b1}, {a1,b2,c1}, {b3,c2}, {b1}, {a1,c1} and the probabilities 0.1, 0.2, 0.3, 0.9]

We need a measure that tells which result is more uncertain, and by how much

Solution:

Use an entropy-like measure to calculate the PWS-quality (degree of uncertainty)

PWS-Quality: Calculation

Let q_j be the probability of getting the distinct PW-result r_j

Let d be the number of distinct PW-results

\[ S(D, Q) = \sum_{j=1}^{d} q_j \log q_j \qquad \text{(logarithm base 2)} \]

The score S(D, Q) is never positive; the larger (closer to 0) the score, the better the quality

0 means no uncertainty (only one possible world result exists)

PWS-quality example

Suppose we have a set of possible world results:

[Figure: bar chart over the three PW-results {b1}, {a1,c1}, {b2}; bar heights 0.5, 0.4, 0.1]

PWS score (base-2 logarithm):

S(D,Q) = 0.5*log 0.5 + 0.4*log 0.4 + 0.1*log 0.1 ≈ -1.361
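The score for the three result probabilities 0.5, 0.4 and 0.1 can be evaluated directly. The sketch assumes base-2 logarithms, which is the base that reproduces the value -1.17 used in the later cleaning example; the function name `pws_quality` is chosen here for illustration.

```python
import math

def pws_quality(probs):
    """Entropy-like score: sum of q * log2(q) over the distinct PW-result probabilities."""
    return sum(q * math.log2(q) for q in probs if q > 0)

score = pws_quality([0.5, 0.4, 0.1])   # three possible results: uncertain
certain = pws_quality([1.0])           # a single possible result: no uncertainty
print(round(score, 3), certain)
```

A distribution concentrated on one result scores 0, and the score becomes more negative as the probability mass spreads over more results.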

PWS-quality problem

However, calculating the PWS-quality over all possible worlds is too expensive

The number of possible world results may be exponential

We need to speed up the computation

x-Form of PWS-Quality

\[ S(D, Q) = \sum_{k=1}^{m} g(k, D, Q) \]

g(k, D, Q) is a function of the existential and qualification probabilities of the tuples in the k-th x-tuple

S(D, Q) is the summation of the quality information of all the result x-tuples

Only x-tuples that have tuples in the query answer need to be considered

x-Form of PRQ (Range Query)

\[ g(k, D, Q) = \sum_{t_i \in \tau_k} p_i \log e_i + (1 - P_k) \log(1 - P_k), \quad \text{where } P_k = \sum_{t_i \in \tau_k} p_i \]

Each g(k, D, Q) requires only O(|τ_k|) time

p_i and P_k are the qualification probabilities of the current tuple t_i and the current x-tuple τ_k, both of which are easy to compute
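As a check on the x-form for range queries, the sketch below evaluates g(k, D, Q) = Σ p_i log e_i + (1 - P_k) log(1 - P_k) per x-tuple for the query [100, 110] on the product table and compares the sum with the entropy-like score computed from the six possible world probabilities. Base-2 logarithms and the function name `g_prq` are assumptions of this sketch.

```python
import math

# x-tuples of the product table: each alternative is (price, existential probability).
DB = {
    "a": [(120, 0.7), (80, 0.3)],
    "b": [(110, 0.6), (90, 0.4)],
    "c": [(140, 0.5), (110, 0.3), (100, 0.2)],
    "d": [(10, 1.0)],
}

def g_prq(alternatives, low, high):
    """x-form term of one x-tuple for a range query, in O(|x-tuple|) time."""
    term = 0.0
    p_k = 0.0  # qualification probability of the whole x-tuple
    for price, e in alternatives:
        if low <= price <= high:   # for a range query, p_i = e_i when the value qualifies
            term += e * math.log2(e)
            p_k += e
    miss = 1.0 - p_k               # probability that no tuple of the x-tuple qualifies
    if miss > 1e-12:
        term += miss * math.log2(miss)
    return term

x_form = sum(g_prq(alts, 100, 110) for alts in DB.values())

# Direct score over the possible world result probabilities of the same query.
pw_probs = [0.18, 0.12, 0.30, 0.12, 0.08, 0.20]
direct = sum(q * math.log2(q) for q in pw_probs)
print(round(x_form, 4), round(direct, 4))
```

Both numbers agree (about -2.456), while the x-form touches each tuple once instead of enumerating possible worlds.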

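x-Form of PMaxQ (Max Query)

Suppose the i-th tuple of τ_k is t_{k,i}, with the tuples sorted in descending order of v_{k,i}

\[ g(k, D, Q) = \sum_{i=1}^{|\tau_k|} \Big( p_{k,i} \log e_{k,i} + \omega_{k,i} \log\big(1 - \sum_{j=1}^{i} e_{k,j}\big) \Big) \]

where ω_{k,i} is a weight defined piecewise from the p_{k,i} and e_{k,i} of τ_k (its defining expression is not legible in this copy of the slide)

Computing g(k, D, Q) for PMaxQ requires O(|τ_k|^2) time

The derivation is outlined in the appendix at the end of the presentation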

x-form PWS-quality summary

Transforming the original PWS-quality calculation into the x-form avoids the exponential computation time

Total computation time: O(m log(n/m))

Compared to the query time, the x-form PWS-quality calculation time is small (shown in the experiments)


Cleaning with limited budget

With a limited budget of, say, 10 units, which x-tuples should we clean?

[Figure: the product table with cleaning costs marked on three x-tuples: 5 units, 7 units and 10 units]

Example of cleaning

After cleaning, the existential probability of the remaining tuple becomes 1

The x-tuple contracts to a single tuple with a certain attribute value

Key  Product ID  Price ($)  Prob.
a1   a           120        0.7
a2   a            80        0.3
b1   b           110        0.6
b2   b            90        0.4
c3   c           100        1
d1   d            10        1

Quality improvement

The set of x-tuples that we are going to clean is X = {τ_1, ..., τ_|X|}

Expected quality after cleaning: E[S(D', Q)], where D' is the cleaned database

Quality improvement: I(X, D, Q) = E[S(D', Q)] - S(D, Q)

But computing the quality improvement directly takes exponential time

Computation example:

QP is the qualification probability of each tuple under the max query

Key  Product ID  Price ($)  Prob.  QP
a1   a           120        0.7    0.35
a2   a            80        0.3    0
b1   b           110        0.6    0.09
b2   b            90        0.4    0
c1   c           140        0.5    0.5
c2   c           110        0.3    0.05
c3   c           100        0.2    0.024
d1   d            10        1      0

Suppose we decide to clean up x-tuple c

Computation example (Cont.):

We decide to clean up x-tuple c; one possible outcome is that c3 is the real-world value (c1 and c2 are ruled out)

Key  Product ID  Price ($)  Prob.  QP
a1   a           120        0.7    0.7
a2   a            80        0.3    0
b1   b           110        0.6    0.18
b2   b            90        0.4    0
c1   c           140        0.5    -
c2   c           110        0.3    -
c3   c           100        1      0.12
d1   d            10        1      0

New PWS-quality S(D', Q) = -1.17


We decide to clean up x-tuple c; another possible outcome is that c2 is the real-world value (c1 and c3 are ruled out)

Key  Product ID  Price ($)  Prob.  QP
a1   a           120        0.7    0.7
a2   a            80        0.3    0
b1   b           110        0.6    0.18
b2   b            90        0.4    0
c1   c           140        0.5    -
c2   c           110        1      0.12
c3   c           100        0.2    -
d1   d            10        1      0

New PWS-quality S(D', Q) = -1.17


Cleaning x-tuple c has three possible real-world outcomes: c1 (prob. 0.5), c2 (prob. 0.3) or c3 (prob. 0.2)

If c1 (price 140) is the real value, the max query answer is certain and S(D', Q) = 0

Expected quality of cleaning up x-tuple c = 0 * 0.5 + (-1.17) * 0.3 + (-1.17) * 0.2 = -0.585
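The expected quality can also be computed by brute force: clean x-tuple c to each of its alternatives in turn, score the max query on the resulting database, and weight by the alternative's probability. The sketch assumes base-2 logarithms, independent x-tuples, and that tuples tied at the maximum are returned together; all function names are chosen here for illustration.

```python
import math
from collections import defaultdict
from itertools import product

# The product table as x-tuples: each alternative is (key, price, probability).
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def max_query_quality(db):
    """PWS-quality of the max query, by enumerating all possible worlds."""
    results = defaultdict(float)
    for world in product(*db.values()):
        prob = math.prod(p for _, _, p in world)
        top = max(price for _, price, _ in world)
        results[frozenset(k for k, price, _ in world if price == top)] += prob
    return sum(q * math.log2(q) for q in results.values() if q > 0)

def expected_quality_after_cleaning(db, name):
    """Average the quality over every outcome that cleaning x-tuple `name` can reveal."""
    expected = 0.0
    for key, price, prob in db[name]:
        cleaned = dict(db)
        cleaned[name] = [(key, price, 1.0)]   # the x-tuple collapses to one certain tuple
        expected += prob * max_query_quality(cleaned)
    return expected

before = max_query_quality(DB)
after = expected_quality_after_cleaning(DB, "c")
improvement = after - before
print(round(before, 3), round(after, 3), round(improvement, 3))
```

Without intermediate rounding the expected quality comes out at about -0.586; the slide's -0.585 uses S rounded to -1.17.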

x-form quality improvement

In x-form, the quality improvement becomes:

\[ I(X, D, Q) = -\sum_{\tau_k \in X} g(k, D, Q) \]

X is the set of x-tuples that we are going to clean

Proof sketch: rewrite the original E(S(D'(t), Q)) as a sum of g over the x-tuples in X plus a sum of g over the x-tuples not in X

The first sum is equal to 0 after cleaning; the second sum is unchanged by the cleaning
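A worked instance may help here. It assumes the x-form reads I(X, D, Q) = -Σ g(k, D, Q) over the cleaned x-tuples, uses base-2 logarithms, and takes the range query [100, 110] on the product table with X = {τ_c}, where c2 and c3 qualify with probabilities 0.3 and 0.2:

```latex
\begin{aligned}
g(c, D, Q) &= 0.3\log_2 0.3 + 0.2\log_2 0.2 + (1 - 0.5)\log_2(1 - 0.5) \approx -1.485, \\
I(\{\tau_c\}, D, Q) &= -g(c, D, Q) \approx 1.485 .
\end{aligned}
```

This agrees with the brute-force view: whichever value c turns out to have, only x-tuple b remains uncertain for this query, so the quality after cleaning is 0.6 log2 0.6 + 0.4 log2 0.4 ≈ -0.971, against S(D, Q) ≈ -2.456 before cleaning, a difference of about 1.485.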

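Optimal Data Cleaning Algorithm

From the x-form quality improvement we get the following optimization problem:

\[ \text{Maximize } -\sum_{k=1}^{Z} b_k \, g(k, D, Q) \quad \text{subject to } \sum_{k=1}^{Z} b_k c_k \le C, \quad b_k \in \{0, 1\}, \; k = 1, \ldots, Z \]

c_k: the cleaning cost of the k-th x-tuple

C: total cleaning budget

Z: total number of x-tuples with p_i in (0,1)

b_k: 1 if the k-th x-tuple is selected for cleaning, 0 otherwise

This can be transformed into the 0/1 knapsack problem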

DP algorithm

Time complexity: O(CZ)

Space complexity: O(CZ^2)

C: total budget; Z: number of x-tuples
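A minimal sketch of the knapsack-style selection follows. It is a textbook 0/1 knapsack over integer budgets, not necessarily the exact procedure of the paper. The gains are the per-x-tuple improvements -g(k, D, Q) for the range query [100, 110] (rounded), and the assignment of the costs 5, 7 and 10 to x-tuples a, b and c is an assumption made for illustration.

```python
# Hypothetical inputs: improvement -g(k, D, Q) and integer cleaning cost per x-tuple.
GAINS = {"a": 0.0, "b": 0.971, "c": 1.485}
COSTS = {"a": 5, "b": 7, "c": 10}

def select_cleaning_set(gains, costs, budget):
    """0/1 knapsack by dynamic programming over integer budgets.

    Returns (total gain, tuple of chosen x-tuples) for the given budget.
    """
    # best[c] = best (gain, chosen) achievable with at most c budget units
    best = [(0.0, ())] * (budget + 1)
    for name in gains:
        cost, gain = costs[name], gains[name]
        # Scan budgets downward so that each x-tuple is selected at most once.
        for c in range(budget, cost - 1, -1):
            candidate = (best[c - cost][0] + gain, best[c - cost][1] + (name,))
            if candidate[0] > best[c][0]:
                best[c] = candidate
    return best[budget]

gain_10, chosen_10 = select_cleaning_set(GAINS, COSTS, 10)
gain_17, chosen_17 = select_cleaning_set(GAINS, COSTS, 17)
print(chosen_10, chosen_17)
```

With a budget of 10 units only x-tuple c is chosen, since b alone gains less and a adds nothing for this query; a budget of 17 is enough for both b and c.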

Other heuristic methods:

Random: select x-tuples at random

MaxQP: select the x-tuples with the highest qualification probability

Greedy: rank x-tuples by expected quality improvement per unit of cleaning cost
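The greedy heuristic can be sketched as follows. This is one common variant (skip an x-tuple that no longer fits and keep scanning), which may differ in detail from the version evaluated in the experiments. The gains and the assignment of costs to x-tuples are illustrative assumptions.

```python
# Hypothetical inputs: expected quality improvement and cleaning cost per x-tuple.
GAINS = {"a": 0.0, "b": 0.971, "c": 1.485}
COSTS = {"a": 5, "b": 7, "c": 10}

def greedy_cleaning_set(gains, costs, budget):
    """Pick x-tuples in descending order of improvement per unit of cost."""
    ranked = sorted(gains, key=lambda name: gains[name] / costs[name], reverse=True)
    chosen, spent = [], 0
    for name in ranked:
        # Skip x-tuples that bring no improvement or no longer fit in the budget.
        if gains[name] > 0 and spent + costs[name] <= budget:
            chosen.append(name)
            spent += costs[name]
    return chosen, spent

print(greedy_cleaning_set(GAINS, COSTS, 10))
print(greedy_cleaning_set(GAINS, COSTS, 8))
```

Greedy needs only a sort, so it is cheaper than the DP, but it does not guarantee the optimal cleaning set in general.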


Experiment setup

Size of DB: 10 K x-tuples, 100 K tuples (synthetic); 4,999 x-tuples, 10,037 tuples (Netflix movie ratings)

Prob. distributions: Gaussian (variance = 100)

Cleaning cost: uniform in [1, 10]

Resource budget: [20, 500], default = 30

PWS-quality (S) vs database size (Z) (PRQ)

[Figure: PWS-quality S (from -6000 to 0) vs. database size z (from 200 to 2000); series: Gaussian, Uniform]

Quality evaluation performance (PRQ)

[Figure: time (ms) vs. database size z; series: Query Evaluation, Quality Calculation]

Running time for clean-up selection (PMaxQ)

[Figure: time (ms) vs. total budget C, both on logarithmic axes; series: Basic, Random, MaxQP, DP, Greedy]

Quality improvement vs budget (PRQ)

[Figure: quality improvement I vs. total budget C (from 10 to 100); series: Random, MaxQP, DP, Greedy]

Quality improvement vs budget (PMaxQ)

[Figure: quality improvement I vs. total budget C (from 10 to 50); series: Random, MaxQP, DP, Greedy]

Quality improvement vs budget (PRQ, real data)

[Figure: quality improvement I vs. total budget C (from 0 to 100); series: Random, MaxQP, DP, Greedy]

Thank you

Q & A

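Appendix: Deriving x-form of PRQ

\[ p_i = \sum_{j=1,\; t_i \in r_j}^{d} q_j \]

\[ S(D, Q) = \sum_{j=1}^{d} q_j \log q_j \]

\[ q_j = \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} (1 - P_k) \]

\[ S(D, Q) = \sum_{j=1}^{d} q_j \log\Big( \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} (1 - P_k) \Big) \]

\[
\begin{aligned}
S(D, Q) ={}& q_1 \big(\log e_1 + \ldots + \log e_i + \ldots + \log(1 - P_1) + \ldots + \log(1 - P_k) + \ldots\big) \\
&+ q_2 \big(\log e_1 + \ldots + \log e_i + \ldots + \log(1 - P_1) + \ldots + \log(1 - P_k) + \ldots\big) \\
&+ \ldots \\
&+ q_d \big(\log e_1 + \ldots + \log e_i + \ldots + \log(1 - P_1) + \ldots + \log(1 - P_k) + \ldots\big)
\end{aligned}
\]

Collecting the coefficient of each \(\log e_i\) gives \(p_i \log e_i\)

Collecting the coefficient of each \(\log(1 - P_k)\) gives \((1 - P_k) \log(1 - P_k)\), where \(P_k = \sum_{t_i \in \tau_k} p_i\)

\[ S(D, Q) = \sum_{k=1}^{m} \Big( \sum_{t_i \in \tau_k} p_i \log e_i + (1 - P_k) \log(1 - P_k) \Big) \]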

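Appendix: Deriving x-form of PMaxQ

\[ S(D, Q) = \sum_{j=1}^{d} q_j \log q_j \]

\[ q_j = \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} \Pr(\tau_k \le r_j.v) \]

\[ \Pr(\tau_k \le r_j.v) = 1 - \sum_{l=1}^{s(k,j)} e_{k,l}, \qquad s(k,j) \text{ is a number in } [0, |\tau_k|] \]

\[ S(D, Q) = \sum_{j=1}^{d} q_j \log\Big( \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} \Pr(\tau_k \le r_j.v) \Big) \]

The \(\log e_i\) terms collect to

\[ \sum_{i=1}^{n} p_i \log e_i \]

and the remaining terms collect to

\[ \sum_{k=1}^{m} \sum_{i=1}^{|\tau_k|} \omega_{k,i} \log\Big(1 - \sum_{j=1}^{i} e_{k,j}\Big) \]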
