Cleaning Uncertain Data with Quality Guarantees
Reynold Cheng, Jinchuan Chen, Xike Xie
2008 VLDB
Presented by SHAO Yufeng
Outline
Background
Related works
Data and Query model
PWS-quality model
Cleaning procedure
Experiments result
Uncertain Database (old model)
Uncertainty is inherent in various applications
Examples: RFID data, sensor network data, data obscured for privacy reasons
Eliminating all uncertainty is infeasible in many models
Uncertain Database (new model)
Previous models focus on querying the uncertain database
But what if we are able to reduce SOME of the uncertainty in this kind of database?
A new model is required to produce an optimal solution
Example 1: Sensor probing
Some sensors in the sensor network might have transmission problems and cannot update their data
Commands can be sent to refresh some sensors
New, certain data are obtained
Probing is limited by bandwidth and battery power, so sensors cannot be probed too often
Example 2: Movie rating
Movie ratings (IMDB, Netflix) collected from customers might contain some uncertainty
Managers can communicate with customers to verify the rating data
New, certain movie rating data are obtained
Verification is limited by manpower and other resources
Cleaning Data
[Figure: Uncertain DB → Query → Ambiguous result; after the cleaning procedure: Less uncertain DB → Query → Less ambiguous result]
Real model example
A database of some products and their prices (uncertain)
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 0.2
d1 d 10 1
Price of product a has two different possible values: 120 (prob 0.7 ) or 80 (prob 0.3)
Query Example 1:
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 0.2
d1 d 10 1
Query 1 (range query): Select products with price in the range [$100, $110]
Possible world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
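The possible world results of the range query can be reproduced by brute-force enumeration. The following is a minimal sketch, assuming the alternatives of different x-tuples are independent and each x-tuple takes exactly one of its alternatives; the function name `range_query_pw_results` and the dictionary layout are illustrative choices, not taken from the paper.

```python
from collections import defaultdict
from itertools import product

# Example database from the slides: x-tuple -> [(key, price, probability)].
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}


def range_query_pw_results(db, low, high):
    """Return {PW-result (frozenset of keys): probability} for a range query."""
    results = defaultdict(float)
    # A possible world picks exactly one alternative from every x-tuple.
    for world in product(*db.values()):
        world_prob = 1.0
        for _, _, prob in world:
            world_prob *= prob
        answer = frozenset(key for key, price, _ in world if low <= price <= high)
        results[answer] += world_prob  # worlds with the same answer are merged
    return dict(results)


pw_results = range_query_pw_results(DB, 100, 110)
for answer, prob in sorted(pw_results.items(), key=lambda item: -item[1]):
    print(sorted(answer), round(prob, 2))
```

The enumeration visits every combination of alternatives, so it is only practical for toy databases like this one.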
Query Example 2:
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 0.2
d1 d 10 1
Query 2 (max query): Select the product with the highest price
Possible world answers: ({c1}, 0.5), ({a1}, 0.35), ({b1, c2}, 0.054), ({b1}, 0.036), ({c2}, 0.036), ({c3}, 0.024)
Clean up example
Suppose we have some resources available to clean up some of the data
Assume we clean up the information related to products a and c
New database with less uncertainty:
Key  Product ID  Price ($)  Prob.
a2 a 80 1
b1 b 110 0.6
b2 b 90 0.4
c3 c 100 1
d1 d 10 1
Clean up example (Cont.)
Key  Product ID  Price ($)  Prob.
a2 a 80 1
b1 b 110 0.6
b2 b 90 0.4
c3 c 100 1
d1 d 10 1
Run query 1 again: Select products with price in the range [$100, $110]
New possible world results: ({b1,c3}, 0.6), ({c3}, 0.4)
Old possible world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
The cleaned database is clearly less uncertain, but the clean-up procedure is limited by the budget
Important related works
Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar: Evaluating Probabilistic Queries over Imprecise Data. SIGMOD Conference 2003: 551-562
Mentions the idea of cleaning for max/min and range queries, but gives no real implementation
P. Andritsos, A. Fuxman, and R. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, 2006.
Introduces a query rewriting technique
Important related works (Cont.)
Jinchuan Chen, Reynold Cheng: Quality-Aware Probing of Uncertain Data with Resource Constraints. SSDBM 2008
Similar cleaning method
Uncertainty is represented by a continuous pdf
Supports fewer query types (only range queries)
Chris Mayfield, Jennifer Neville, Sunil Prabhakar: ERACER: A Database Approach for Statistical Inference and Data Cleaning. SIGMOD 2010
Uses attribute-level correlations to provide optimized cleaning
System Structure
[Figure: system architecture. Components: User, Query Engine, Probabilistic Database, Quality Manager, Quality Evaluator, Data Cleaning Algorithm, Cleaning Manager, External Data Sources. Flow labels: Query request, Query Answer, PWS-quality score, Cleaning Budget, Cleaning Set, Cleaning request, Data update]
Important Notations
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 0.2
d1 d 10 1
Each row is a tuple t_i (n tuples in total)
Rows with the same product ID form one x-tuple τ_k (m x-tuples in total)
Price is the uncertain attribute
Prob. is the existential probability e_i
Query in possible world model
[Figure: the Probabilistic DB expands into possible worlds; the query is evaluated in each possible world to give PW-results (e.g. {b1,c2} with probability 0.18 and {b1,c3} with probability 0.1); the PW-results are combined into the final query answer (b1, 0.28), (c2, 0.18), (c3, 0.1) and the PWS-quality score (-1.44)]
Qualification probability (p_i) of c2: 0.18
Qualification probability (P_k) of x-tuple c: 0.28
Probabilistic Range Query (PRQ)
Given a closed interval $[a, b]$, where $a, b \in \mathbb{R}$ and $a \le b$, a PRQ returns a set of tuples $(t_i, p_i)$, where $p_i$ is the non-zero probability that $v_i \in [a, b]$ ($v_i$ is the uncertain attribute value of $t_i$).
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 0.2
d1 d 10 1
Range query: Select products with price in the range [$100, $110]
Possible world result set: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
The number paired with each PW-result r_j is its probability of occurrence q_j
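The PRQ answer can be read off the possible world result set: the qualification probability of a tuple is the total probability of the PW-results that contain it. A minimal sketch of that bookkeeping follows; the list `PW_RESULTS` copies the result set above, and the function name is an illustrative choice.

```python
# PW-results of the range query [$100, $110] with their probabilities q_j.
PW_RESULTS = [
    ({"b1", "c2"}, 0.18),
    ({"b1", "c3"}, 0.12),
    ({"b1"}, 0.3),
    ({"c2"}, 0.12),
    ({"c3"}, 0.08),
    (set(), 0.2),
]


def qualification_probabilities(pw_results):
    """Sum q_j over every PW-result r_j that contains the tuple."""
    probs = {}
    for result, q in pw_results:
        for key in result:
            probs[key] = probs.get(key, 0.0) + q
    return probs


answer = qualification_probabilities(PW_RESULTS)
for key in sorted(answer):
    print(key, round(answer[key], 2))
```

For this query each qualification probability comes out equal to the tuple's existential probability, because a tuple qualifies whenever it exists and its price lies in the range.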
Probabilistic Maximum Query (PMaxQ)
A PMaxQ returns a set of tuples $(t_i, p_i)$, where $p_i$, the qualification probability of $t_i$, is the non-zero probability that $v_i \ge v_j$, where $i \ne j$ and $j = 1, \ldots, n$.
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 0.2
d1 d 10 1
Query: Select the product with the highest price
Possible world answers: ({c1}, 0.5), ({a1}, 0.35), ({b1, c2}, 0.054), ({b1}, 0.036), ({c2}, 0.036), ({c3}, 0.024)
PWS-quality
Suppose we have two sets of possible world results:
[Figure: two bar charts of PW-result probabilities, one with bars 0.1, 0.1, 0.1, 0.2, 0.2, 0.3 and one with bars 0.9 and 0.1; result labels visible in the charts: {a2,b1}, {a1,b2,c1}, {b3,c2}, {b1}, {a1,c1}]
We need a measure that tells which result is more uncertain, and by how much
Solution:
Use an entropy-like measure to calculate the PWS-quality (degree of uncertainty)
PWS-Quality: Calculation
Let q_j be the probability of getting the distinct PW-result r_j
Let d be the number of distinct PW-results
$$S(D, Q) = \sum_{j=1}^{d} q_j \log q_j$$
S(D, Q) is never positive; the larger the score, the better the quality
A score of 0 means no uncertainty (only one possible world result exists)
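The PWS-quality score is a negated entropy over the PW-result probabilities, so it takes only a few lines to compute once those probabilities are known. The sketch below assumes base-2 logarithms (an assumption, chosen because it reproduces the -1.17 score quoted in the later cleaning example) and treats 0 · log 0 as 0; the function name `pws_quality` is illustrative.

```python
import math


def pws_quality(probs):
    """S(D, Q) = sum_j q_j * log2(q_j) over the distinct PW-result probabilities."""
    return sum(q * math.log2(q) for q in probs if q > 0)


# Range query [$100, $110] on the product database, before and after cleaning a and c.
before = pws_quality([0.18, 0.12, 0.3, 0.12, 0.08, 0.2])
after = pws_quality([0.6, 0.4])
certain = pws_quality([1.0])  # a single PW-result: no uncertainty

print(round(before, 3), round(after, 3), certain)
```

The cleaned database scores closer to 0 than the original one, which matches the intuition that fewer, more concentrated PW-results mean a less ambiguous answer.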
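PWS-quality example
Suppose we have a set of possible world results:
[Figure: bar chart over the PW-results {b1}, {a1,c1}, and {b2}; the bars have heights 0.5, 0.4, and 0.1]
PWS score (base-2 logarithm):
S(D,Q) = 0.5*log 0.5 + 0.4*log 0.4 + 0.1*log 0.1 = -1.361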
PWS-quality problem
However, calculating the PWS-quality over all possible worlds is too expensive
The number of possible world results might be exponential
The computation needs to be sped up
x-Form of PWS-Quality
$$S(D, Q) = \sum_{k=1}^{m} g(k, D, Q)$$
g(k, D, Q) = func(existential and qualification probabilities of the tuples in the k-th x-tuple)
The score is a sum of the quality information of all the result x-tuples
Only x-tuples whose tuples are in the query answer need to be considered
x-Form of PRQ (Range Query)
$$g(k, D, Q) = \sum_{t_i \in \tau_k} p_i \log e_i + (1 - P_k) \log(1 - P_k), \quad \text{where } P_k = \sum_{t_i \in \tau_k} p_i$$
Each g(k, D, Q) requires only O(|τ_k|) time
p_i and P_k are the qualification probabilities of the current tuple t_i and the current x-tuple τ_k, which can be calculated easily
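The x-form for range queries can be checked against the possible-world definition on the small product database. The sketch below assumes base-2 logarithms and that, for a PRQ, a tuple's qualification probability equals its existential probability whenever its price lies in the range; the function names are illustrative and the brute-force routine is only there for comparison.

```python
import math
from collections import defaultdict
from itertools import product

# Example database from the slides: x-tuple -> [(key, price, probability)].
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}


def brute_force_quality(db, low, high):
    """PWS-quality from the definition: enumerate worlds, merge equal answers."""
    results = defaultdict(float)
    for world in product(*db.values()):
        prob = math.prod(p for _, _, p in world)
        answer = frozenset(k for k, v, _ in world if low <= v <= high)
        results[answer] += prob
    return sum(q * math.log2(q) for q in results.values() if q > 0)


def g_prq(tuples, low, high):
    """x-form term of one x-tuple: sum p_i log e_i + (1 - P_k) log(1 - P_k)."""
    qualifying = [e for _, v, e in tuples if low <= v <= high and e > 0]
    rest = 1.0 - sum(qualifying)  # 1 - P_k
    score = sum(e * math.log2(e) for e in qualifying)
    if rest > 1e-12:  # 0 * log 0 is taken as 0
        score += rest * math.log2(rest)
    return score


x_form = sum(g_prq(tuples, 100, 110) for tuples in DB.values())
brute = brute_force_quality(DB, 100, 110)
print(round(x_form, 4), round(brute, 4))
```

The x-form needs one pass over the tuples, whereas the brute-force routine grows with the product of the x-tuple sizes.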
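x-Form of PMaxQ (Max Query)
Suppose the i-th tuple of $\tau_k$ is $t_{k,i}$, with the tuples sorted in descending order of $v_{k,i}$. Then
$$g(k, D, Q) = \sum_{i=1}^{|\tau_k|} \left( p_{k,i} \log e_{k,i} + \omega_{k,i} \log\Bigl(1 - \sum_{j=1}^{i} e_{k,j}\Bigr) \right)$$
where
$$\omega_{k,i} = \left( \frac{p_{k,i}}{e_{k,i}} - \frac{p_{k,i+1}}{e_{k,i+1}} \right) \Bigl(1 - \sum_{j=1}^{i} e_{k,j}\Bigr), \qquad \frac{p_{k,|\tau_k|+1}}{e_{k,|\tau_k|+1}} = 0$$
Computing g(k, D, Q) for PMaxQ requires O(|τ_k|^2) time
Details of the derivation are given in the appendix at the end of the presentation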
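A numerical check helps when reading the PMaxQ x-form. The sketch below assumes the term has the shape g(k) = Σ_i [p_{k,i} log e_{k,i} + ω_{k,i} log(1 − Σ_{j≤i} e_{k,j})] with ω_{k,i} = (p_{k,i}/e_{k,i} − p_{k,i+1}/e_{k,i+1})(1 − Σ_{j≤i} e_{k,j}), base-2 logarithms, and that the quality of a max query without ties equals Σ_i p_i log p_i. The data are hypothetical: the product example with the price of c2 changed to 105 so that all values are distinct.

```python
import math

# Hypothetical data with distinct values: x-tuple -> [(value, existential prob.)].
DB = {
    "a": [(120, 0.7), (80, 0.3)],
    "b": [(110, 0.6), (90, 0.4)],
    "c": [(140, 0.5), (105, 0.3), (100, 0.2)],
    "d": [(10, 1.0)],
}


def prob_below(tuples, v):
    """Probability that an x-tuple takes no value greater than v."""
    return max(0.0, 1.0 - sum(e for value, e in tuples if value > v))


def qualification_probs(db, k):
    """(value, e, p) for the tuples of x-tuple k, in descending order of value."""
    rows = []
    for v, e in sorted(db[k], reverse=True):
        others = math.prod(prob_below(t, v) for name, t in db.items() if name != k)
        rows.append((v, e, e * others))
    return rows


def xlogy(x, y):
    """x * log2(y), with 0 * log2(anything) taken as 0."""
    return 0.0 if x <= 1e-12 else x * math.log2(y)


def g_pmaxq(db, k):
    rows = qualification_probs(db, k)
    ratios = [p / e for _, e, p in rows] + [0.0]  # trailing ratio defined as 0
    score, prefix = 0.0, 0.0
    for i, (_, e, p) in enumerate(rows):
        prefix += e  # sum of e_{k,j} for j <= i
        remaining = max(0.0, 1.0 - prefix)
        omega = (ratios[i] - ratios[i + 1]) * remaining
        score += xlogy(p, e) + xlogy(omega, remaining)
    return score


x_form = sum(g_pmaxq(DB, k) for k in DB)
direct = sum(xlogy(p, p) for k in DB for _, _, p in qualification_probs(DB, k))
print(round(x_form, 4), round(direct, 4))
```

The two numbers agree on this dataset, which supports the reading of ω_{k,i} as the total qualification probability of tuples from other x-tuples whose values fall between v_{k,i+1} and v_{k,i}.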
x-form PWS-quality summary
By transforming the original PWS-quality calculation into the x-form calculation, we avoid the exponential computation time
Total computation time: O(m log(n/m))
Compared to the query time, the x-form PWS-quality calculation time is small (shown in the experiments)
Cleaning with limited budget
With a limited budget of, say, 10 units, which x-tuples should we clean?
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 0.2
d1 d 10 1
Cleaning cost: 5 units
Cleaning cost: 7 units
Cleaning cost: 10 units
Example of cleaning
After cleaning, the existential probability of the remaining tuple becomes 1
The x-tuple contracts to a single tuple with a certain attribute value
Key  Product ID  Price ($)  Prob.
a1 a 120 0.7
a2 a 80 0.3
b1 b 110 0.6
b2 b 90 0.4
c3 c 100 1
d1 d 10 1
Quality improvement
Expected quality after cleaning
The set of x-tuples that we are going to clean is denoted by X = {τ_1, ···, τ_|X|}
Quality improvement
However, computing the quality improvement directly takes exponential time
Computation example:
Key  Product ID  Price ($)  Prob.  QP
a1 a 120 0.7 0.35
a2 a 80 0.3 0
b1 b 110 0.6 0.09
b2 b 90 0.4 0
c1 c 140 0.5 0.5
c2 c 110 0.3 0.05
c3 c 100 0.2 0.024
d1 d 10 1 0
Query 2 (max query): Select the product with the highest price
Suppose we decide to clean x-tuple c
Computation example (Cont.):
Key  Product ID  Price ($)  Prob.  QP
a1 a 120 0.7 0.7
a2 a 80 0.3 0
b1 b 110 0.6 0.18
b2 b 90 0.4 0
c1 c 140 0.5
c2 c 110 0.3
c3 c 100 1 0.12
d1 d 10 1 0
New PWS-quality S(D', Q) = -1.17
Query 2 (max query): Select the product with the highest price
We decide to clean x-tuple c; one possible case is that c3 is the real-world value
Computation example (Cont.):
Key  Product ID  Price ($)  Prob.  QP
a1 a 120 0.7 0.7
a2 a 80 0.3 0
b1 b 110 0.6 0.18
b2 b 90 0.4 0
c1 c 140 0.5
c2 c 110 1 0.12
c3 c 100 0.2
d1 d 10 1 0
Query 2 (max query): Select the product with the highest price
We decide to clean x-tuple c; another possible case is that c2 is the real-world value
New PWS-quality S(D', Q) = -1.17
Computation example (Cont.):
Key  Product ID  Price ($)  Prob.  QP
a1 a 120 0.7 0.35
a2 a 80 0.3 0
b1 b 110 0.6 0.09
b2 b 90 0.4 0
c1 c 140 0.5 0.5
c2 c 110 0.3 0.05
c3 c 100 0.2 0.024
d1 d 10 1 0
Query 2 (max query): Select the product with the highest price
Cleaning x-tuple c has 3 different possible real-world outcomes
Expected quality after cleaning x-tuple c = 0 * 0.5 + (-1.17) * 0.3 + (-1.17) * 0.2 = -0.585
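The expected quality is a probability-weighted average of the quality obtained under each possible outcome of the cleaning. The short sketch below redoes that arithmetic from the qualification probabilities listed in the tables above, assuming base-2 logarithms; `pws_quality` is an illustrative helper name.

```python
import math


def pws_quality(probs):
    """Sum of q * log2(q) over the given probabilities (0 * log 0 taken as 0)."""
    return sum(q * math.log2(q) for q in probs if q > 0)


# Qualification probabilities from the slides once x-tuple c is known.
after_c1 = pws_quality([1.0])              # c1 = 140 is certainly the maximum
after_c2 = pws_quality([0.7, 0.18, 0.12])  # a1, b1, c2
after_c3 = pws_quality([0.7, 0.18, 0.12])  # a1, b1, c3

# Weight each outcome by the existential probability of the revealed tuple.
expected = 0.5 * after_c1 + 0.3 * after_c2 + 0.2 * after_c3
print(round(after_c2, 2), round(expected, 3))
```

The unrounded result differs from -0.585 only in the third decimal place, because the slide multiplies the already rounded score -1.17.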
x-form quality improvement
In x-form, the quality improvement becomes
$$I(X, D, Q) = -\sum_{\tau_k \in X} g(k, D, Q)$$
X is the set of x-tuples that we are going to clean
Proof sketch: rewrite the expected quality E[S(D', Q)] as a sum of two parts, the g terms of the x-tuples in X and the g terms of the remaining x-tuples; after cleaning, the first part equals 0 and the second part is unchanged
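The x-form of the improvement can be compared with a direct computation of the expected quality on the range-query example. This is a sketch under stated assumptions: base-2 logarithms, cleaning always reveals exactly one alternative of each cleaned x-tuple, and the improvement equals the negated sum of the g terms of the cleaned x-tuples. Function names and the choice X = {a, c} are illustrative.

```python
import math
from collections import defaultdict
from itertools import product

DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}
LOW, HIGH = 100, 110  # range query [$100, $110]


def quality(db):
    """PWS-quality of the range query by enumerating possible worlds."""
    results = defaultdict(float)
    for world in product(*db.values()):
        prob = math.prod(p for _, _, p in world)
        answer = frozenset(k for k, v, _ in world if LOW <= v <= HIGH)
        results[answer] += prob
    return sum(q * math.log2(q) for q in results.values() if q > 0)


def g_prq(tuples):
    """x-form term of one x-tuple for the range query."""
    qualifying = [e for _, v, e in tuples if LOW <= v <= HIGH and e > 0]
    rest = 1.0 - sum(qualifying)
    score = sum(e * math.log2(e) for e in qualifying)
    return score + (rest * math.log2(rest) if rest > 1e-12 else 0.0)


def expected_improvement(db, cleaned):
    """E[S(D', Q)] - S(D, Q) when the x-tuples named in `cleaned` are cleaned."""
    expected = 0.0
    for outcome in product(*(db[name] for name in cleaned)):
        prob = math.prod(e for _, _, e in outcome)
        cleaned_db = dict(db)
        for name, (key, value, _) in zip(cleaned, outcome):
            cleaned_db[name] = [(key, value, 1.0)]  # the x-tuple becomes certain
        expected += prob * quality(cleaned_db)
    return expected - quality(db)


X = ["a", "c"]
brute = expected_improvement(DB, X)
x_form = -sum(g_prq(DB[name]) for name in X)
print(round(brute, 4), round(x_form, 4))
```

Cleaning x-tuple a contributes nothing here because none of its prices fall in the query range, so the whole improvement comes from x-tuple c.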
Optimal Data Cleaning Algorithm
With the x-form quality improvement, we get the following optimization problem:
$$\text{Maximize } -\sum_{k=1}^{Z} b_k \, g(k, D, Q) \quad \text{subject to } \sum_{k=1}^{Z} b_k c_k \le C, \quad b_k \in \{0, 1\}, \; k = 1, \ldots, Z$$
b_k: 1 if the k-th x-tuple is cleaned, 0 otherwise
c_k: the cleaning cost of the k-th x-tuple
C: total cleaning budget
Z: total number of x-tuples with p_i in (0, 1)
This can be transformed into a 0/1 knapsack problem
DP algorithm
Time complexity: O(CZ)
Space complexity: O(CZ^2)
C: total budget
Z: number of x-tuples
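A textbook 0/1 knapsack dynamic program illustrates how the cleaning set can be chosen. The sketch below is a generic knapsack routine, not the paper's own code: `gains` stands for the per-x-tuple improvements (assumed to be the negated g values), the costs are assumed to be positive integers, and the sample numbers are hypothetical.

```python
def knapsack_clean(gains, costs, budget):
    """Pick x-tuples maximizing total gain with total cost <= budget."""
    n = len(gains)
    # best[k][c]: best total gain using the first k x-tuples within budget c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        gain, cost = gains[k - 1], costs[k - 1]
        for c in range(budget + 1):
            best[k][c] = best[k - 1][c]  # option 1: do not clean x-tuple k-1
            if cost <= c and best[k - 1][c - cost] + gain > best[k][c]:
                best[k][c] = best[k - 1][c - cost] + gain  # option 2: clean it
    # Walk the table backwards to recover which x-tuples were selected.
    chosen, c = [], budget
    for k in range(n, 0, -1):
        if best[k][c] != best[k - 1][c]:
            chosen.append(k - 1)
            c -= costs[k - 1]
    return best[n][budget], sorted(chosen)


# Hypothetical improvements and costs for three x-tuples.
total_gain, cleaning_set = knapsack_clean([0.97, 1.49, 0.44], [5, 7, 10], 12)
print(round(total_gain, 2), cleaning_set)
```

The table has (Z + 1) × (C + 1) entries and each is filled in constant time, which is where the O(CZ) running time comes from.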
Other heuristic methods:
Random
MaxQP: Select the x-tuples with the highest qualification probability
Greedy: Rank x-tuples by expected quality improvement per unit of cleaning cost
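The Greedy heuristic can be sketched in a few lines. The version below ranks x-tuples by improvement per unit cost and skips any x-tuple that no longer fits the remaining budget; whether the original heuristic skips or stops at that point is an assumption here, and the sample gains and costs are hypothetical.

```python
def greedy_clean(gains, costs, budget):
    """Select x-tuples in descending order of gain per unit of cleaning cost."""
    order = sorted(range(len(gains)), key=lambda k: gains[k] / costs[k], reverse=True)
    chosen, spent = [], 0
    for k in order:
        if spent + costs[k] <= budget:  # skip x-tuples that do not fit
            chosen.append(k)
            spent += costs[k]
    return sorted(chosen)


# Hypothetical instance where the greedy choice is not optimal.
gains, costs, budget = [0.6, 1.0, 1.2], [1, 2, 3], 5
picked = greedy_clean(gains, costs, budget)
print(picked, sum(gains[k] for k in picked))
```

On this instance the heuristic selects the two x-tuples with the best ratios, while cleaning the second and third x-tuples would fit the budget and give a larger total improvement; the heuristic trades that optimality for a running time dominated by a single sort.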
Experiment setup
Size of DB: 10K x-tuples, 100K tuples (synthetic); 4,999 x-tuples, 10,037 tuples (Netflix movie ratings)
Prob. distribution: Gaussian (variance = 100)
Cleaning cost: Uniform in [1, 10]
Resource budget: [20, 500], default = 30
PWS-quality (S) vs. database size (z) (PRQ)
[Figure: S plotted against z; series: Gaussian, Uniform]
Quality evaluation performance (PRQ)
[Figure: time (ms) plotted against z (database size); series: Query Evaluation, Quality Calculation]
Running time for clean-up selection (PMaxQ)
[Figure: time (ms) plotted against C (total budget), logarithmic axes; series: Basic, Random, MaxQP, DP, Greedy]
Quality improvement vs. budget (PRQ)
[Figure: I (quality improvement) plotted against C (total budget); series: Random, MaxQP, DP, Greedy]
Quality improvement vs. budget (PMaxQ)
[Figure: I (quality improvement) plotted against C (total budget); series: Random, MaxQP, DP, Greedy]
Quality improvement vs. budget (PRQ, real data)
[Figure: I (quality improvement) plotted against C (total budget); series: Random, MaxQP, DP, Greedy]
Thank you
Q & A
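Appendix: Deriving x-form of PRQ
The qualification probability of $t_i$ is the total probability of the PW-results that contain it:
$$p_i = \sum_{j=1,\; t_i \in r_j}^{d} q_j$$
The PWS-quality is
$$S(D, Q) = \sum_{j=1}^{d} q_j \log q_j$$
The probability of PW-result $r_j$ is
$$q_j = \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} (1 - P_k)$$
so that
$$S(D, Q) = \sum_{j=1}^{d} q_j \log\Bigl( \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} (1 - P_k) \Bigr)$$
Expanding the logarithm of each product:
$$S(D, Q) = q_1 (\ldots + \log e_i + \ldots + \log(1 - P_k) + \ldots) + q_2 (\ldots + \log e_i + \ldots + \log(1 - P_k) + \ldots) + \ldots + q_d (\ldots + \log e_i + \ldots + \log(1 - P_k) + \ldots)$$
Collecting the terms with $\log e_i$ gives $p_i \log e_i$; collecting the terms with $\log(1 - P_k)$ gives $(1 - P_k) \log(1 - P_k)$, where $P_k = \sum_{t_i \in \tau_k} p_i$. Hence
$$S(D, Q) = \sum_{k=1}^{m} \Bigl( \sum_{t_i \in \tau_k} p_i \log e_i + (1 - P_k) \log(1 - P_k) \Bigr)$$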
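Appendix: Deriving x-form of PMaxQ
$$S(D, Q) = \sum_{j=1}^{d} q_j \log q_j$$
The probability of PW-result $r_j$ is
$$q_j = \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} \Pr(r_j.v_k)$$
where
$$\Pr(r_j.v_k) = 1 - \sum_{l=1}^{s(j,k)} e_{k,l}$$
and $s(j,k)$ is a number in $[0, |\tau_k|]$. Substituting:
$$S(D, Q) = \sum_{j=1}^{d} q_j \log\Bigl( \prod_{t_i \in r_j} e_i \prod_{\tau_k \cap r_j = \emptyset} \Pr(r_j.v_k) \Bigr)$$
Collecting terms as in the PRQ case:
$$S(D, Q) = \sum_{i=1}^{n} p_i \log e_i + \sum_{k=1}^{m} \sum_{i=1}^{|\tau_k|} \omega_{k,i} \log\Bigl(1 - \sum_{j=1}^{i} e_{k,j}\Bigr)$$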