View
218
Download
3
Tags:
Embed Size (px)
Citation preview
2
Conditional functional dependencies (CFDs)
– Motivation for extending FDs with conditions: data cleaning
– Syntax and semantics
– Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs)
– Motivation: data cleaning and schema matching
– Syntax and semantics
– Static analysis: satisfiability, implication, axiomatizability Algorithms and open research issues
– SQL techniques for inconsistency detection– Heuristic for satisfiability and implication checking– Repair
Outline of Part III
3
Conditional functional dependencies (CFDs)
– Motivation for extending FDs with conditions: data cleaning
– Syntax and semantics
– Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs)
– Motivation: data cleaning and schema matching
– Syntax and semantics
– Static analysis: consistency, implication, axiomatizability Algorithms and open research issues
– SQL techniques for inconsistency detection– Heuristic for satisfiability and implication checking– Repair
Conditional functional dependencies (CFDs)
4
Data in real-life is often dirty
Errors, conflicts and inconsistencies
Australia: 500,000 dead people retain active Medicare cards
US: Pentagon asked 275 dead/wounded officers to re-enlist.
UK: there are 81 million National Insurance numbers but only 60
million eligible citizens.
It is estimated that in a 500,000 customer database, 120,000
customer records become invalid within a year, due to deaths,
divorces, marriages, moves.
typical data error rate in industry: 1% - 5%, up to 30%
. . .
5
Dirty data is costly
Poor data costs US companies $600 billion annually
Wrong price data in retail databases costs US customers $2.5
billion each year
AAA improves data quality by 20%, and saves $150,000… in
postage stamps alone
30%-80% of the development time for data cleaning in a data
integration project
and don’t forget
CIA intelligence on WMD in Iraq!
The need for (semi-)automated methods to clean data!
6
Characterizing the consistency of data
One of the central technical problems is how to tell whether the
data is dirty or clean
Specify consistency using integrity constraints
Inconsistencies emerge as violations of constraints
Constraints considered so far: traditional – functional dependencies– inclusion dependencies– denial constraints (a special case of full dependencies)– . . .
Question: are these traditional dependencies sufficient?
7
Example: customer relation
Schema: Cust(country, area-code, phone, street, city, zip)
Instance:
country area-code phone street city zip
44 131 1234567 Mayfield NYC EH4 8LE
44 131 3456789 Crichton NYC EH4 8LE
01 908 3456789 Mountain Ave NYC 07974
functional dependencies (FDs):
cust[country, area-code, phone] cust[street, city, zip]
cust[country, area-code] cust[city]
The database satisfies the FDs. Is the data consistent?
8
Capturing inconsistencies in the data
cust ([country = 44, zip] [street])
In the UK, zip code uniquely determines the street
The constraint may not hold for other countries It expresses a fundamental part of the semantics of the data It can NOT be expressed as a traditional FD
– It does not hold on the entire relation; instead, it holds on
tuples representing UK customers only
country area-code phone street city zip
44 131 1234567 Mayfield NYC EH4 8LE
44 131 3456789 Crichton NYC EH4 8LE
01 908 3456789 Mountain Ave NYC 07974
9
Two more constraints
cust([country = 44, area-code = 131, phone] [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] [street, zip, city = MH])
– In the UK, if the area code is 131, then the city has to be EDI– In the US, if the area code is 908, then the city has to be MH
t1, t2 and t3 violate these constraints
– refining cust([country, area-code, phno] [street, city, zip])
– combining constants and variables
id country Area-code phone street city zip
t1 44 131 1234567 Mayfield NYC EH4 8LE
t2 44 131 3456789 Crichton NYC EH4 8LE
t3 01 908 3456789 Mountain Ave NYC 07974
10
The need for new constraints
cust([country = 44, zip] [street])
cust([country = 44, area-code = 131, phone] [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] [street, zip, city = MH])
They capture inconsistencies that traditional FDs cannot detect
Traditional constraints were developed for schema design, not for
data cleaning! Data integration in real-life: source constraints
– hold on a subset of sources– hold conditionally on the integrated data
They are NOT expressible as traditional FDs– do not hold on the entire relation– contain constant data values, besides logical variables
11
An extension of traditional FDs: (R: X Y, Tp) X Y: embedded traditional FD on R Tp: a pattern tableau
– attributes: X Y
– each tuple in Tp consists of constants and unnamed
variable _
Example: cust([country = 44, zip] [street]) (cust (country, zip street), Tp) pattern tableau Tp
Conditional Functional Dependencies (CFDs)
country zip street
44 _ _
12
Represent
cust([country = 44, area-code = 131, phone] [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] [street, zip, city = MH])
cust([country, area-code, phone] [street, city, zip])
as a SINGLE CFD: (cust(country, area-code, phone street, city, zip), Tp) pattern tableau Tp: one tuple for each constraint
Example CFDs
country area-code phone street city zip
44 131 _ _ Edi _
01 908 _ _ MH _
_ _ _ _ _ _
13
Express
cust[country, area-code] cust[city]
as a CFD: (cust(country, area-code, city), Tp) pattern tableau Tp: a single tuple consisting of _ only
CFDs subsume traditional FDs
Traditional FDs as a special case
country area-code city
_ _ _
14
a b (a matches b) if – either a or b is _ – both a and b are constants and a = b
tuple t1 matches t2: t1 t2
(a, b) (a, _), but (a, b) does not match (a, c)
DB satisfies (R: X Y, Tp) iff for any tuple tp in the pattern tableau Tp and for any tuples t1, t2 in DB, if t1[X] = t2[X] tp[X], then t1[Y] = t2[Y] tp[Y]– tp[X]: identifying the set of tuples on which the constraint tp
applies, ie, { t | t[X] tp[X]} – t1[Y] = t2[Y] tp[Y]: enforcing the embedded FD, and the
pattern of tp
Semantics of CFDs
15
cust([country = 44, zip] [street])
Tuples t1 and t2 violate the CFD t1[country, zip] = t2[country, zip] tp[country, zip] t1[street] t2[street]
The CFD applies to t1 and t2 since they match tp[country, zip]
Example: violation of CFDs
id country area-code phone street city zip
t1 44 131 1234567 Mayfield NYC EH4 8LE
t2 44 131 3456789 Crichton NYC EH8 8LE
t3 01 908 3456789 Mountain Ave NYC 07974
country zip street
44 _ _
16
(cust(country, area-code city), Tp)
Tuple t1 does not satisfy the CFD t1[country, area-code] = t1[country, area-code] tp1[country, area-code] t1[city] = t1[city]; however, t1[city] does not match tp1[city]
In contrast to traditional FDs, a single tuple may violate a CFD
Violation of CFDs by a single tuple
id country area-code city
tp1 44 131 Edi
tp2 01 908 MH
tp3 _ _ _
id country area-code phone street city zip
t1 44 131 1234567 Mayfield NYC EH4 8LE
t2 44 131 3456789 Crichton NYC EH8 8LE
t3 01 908 3456789 Mountain Ave NYC 07974
17
Conditional tables, Codd tables and variable tables have been studied for incomplete information
Conditional tables: representing infinitely many relation instances, one for each instantiation of variables
Pattern tableau in a CFD: each pattern tuple is a constraint, and all constraints applying to the same relation instance
Relational table, traditional dependencies and CFDs One end of the spectrum: relations consisting of data only The other end of the spectrum: traditional dependencies defined
in terms of logic variables CFD: in the between, both data values and logic variables
CFDs: enforcing binding of semantically related data values
CFDs vs. conditional tables
18
“Dirty” constraints?
A set of CFDs may be inconsistent!
Inconsistent: (R(A B), Tp)
In any nonempty database DB and for any tuple t in DB,– tp1: t[B] must be b– tp2: t[B] must be c– Inconsistent if b and c are different
inconsistent = { 1, 2 }, 1 = (R(A B), Tp1), 2 = (R(B A), Tp2)
Why?
id A B
tp1 _ b
tp2 _ cTp
id A B
tp1 true b
tp2 false c
id B A
tp3 b false
tp4 c true
19
The satisfiability problem
The satisfiability problem for CFDs is to determine, given a set
of CFDs, whether or not there exists a nonempty database DB
that satisfies , i.e., for any in , DB satisfies .
Whether or not makes sense
For traditional FDs, it is not an issue: one can specify any FDs
without worrying about their consistency
In contrast, a set of CFDs may be inconsistent!
20
The complexity of the satisfiability analysis
Theorem. The satisfiability problem for CFDs is NP-complete.
Nontrivial: contrast this with the trivial consistency analysis of FDs!
Proof idea: Upper bound: the small model property: if is satisfiable, then
there is DB that satisfies and consists of a single tuple! Lower bound: reduction from the non-tautology problem
Good news: PTIME special cases
Theorem. Given a set of CFDs on a relation schema R, the
satisfiability of can be determined in O(| |2) time if either the schema R is predefined (fixed), or no attributes in have a finite domain
Proof idea: an extension of chase for CFDs
21
The implication problem
The implication problem for CFDs is to determine, given a set of
CFDs and a single CFD , whether implies , denoted by |=
, i.e., for any database DB, if DB satisfies , then DB satisfies .
Example:
= { 1, 2 } , 1 = (R(A B), Tp1), 2 = (R(B C), Tp2)
= (R(A C), Tp)
|= . Why?
id A B
tp1 _ b
Tp1 Tp2 id B C
tp1 _ c
id A C
tp a c
22
The complexity of the implication problem
For traditional FDs, the implication problem is in linear time
In contrast, the implication problem for CFDs is intractable
Theorem. The implication problem for CFDs is coNP-complete.
Tractable special cases
Theorem. Given a set of CFDs and a single CFD on a relation
schema R, whether |= can be determined in O((| | + | |)2)
time if either the schema R is predefined, or no attributes in and have a finite domain
Proof idea: an extension of chase for CFDs
23
Finite axiomatizability: Flashback
Armstrong’s axioms can be found in every database textbook:
Reflexivity: If Y X, then X Y
Augmentation: If X Y , then XZ YZ
Transitivity: If X Y and Y Z, then X Z
Sound and complete for FD implication, i.e, |= iff can be
inferred from using reflexivity, augmentation, transitivity.
Question: is there a sound and complete inference system for the
implication analysis of CFDs?
24
Finite axiomatizability of CFDs
Theorem. There is a sound and complete inference system I for
implication analysis of CFDs
Sound: if |- , i.e., can be proved from using I, then |= Complete: if |= , then |- using I
The inference system is more involved than its counterpart for
traditional FDs, namely, Armstrong’s axioms.
There are 5 axioms.
A normal form of CFDs: (R: X A, tp), tp is a single pattern tuple.
25
Axioms for CFDs: extension of Armstrong’s axioms
Reflexivity: If A X, then (R : X A, tp), where
A1 … Ak A A
_ … _ _ _ or
A1 … Ak A A
_ … _ a a
Augmentation: If (X A, tp) and B attr(R), then (BX A, t’p)
A1 … Ak A
tp[A1] … tp[Ak] tp[A]
A1 … Ak B A
tp[A1] … tp[Ak] _ tp[A]
tp t’p
26
Axioms for CFDs: transitivity
Transitivity: if ([A1,…,Ak] [B1,…,Bm], tp)
and ([B1,…,Bm] [C1,…,Cn], t’p)
A1 … Ak B1 … Bm
tp[A1] … tp[Ak] tp[B1] tp[Bm]
A1 … Ak C1 … Cn
tp[A1] … tp[Ak] t’p[C1] t’p[Cn]
B1 … Bm C1 … Cn
tp’[B1] … t’p[Bm] t’p[C1] t’p[Cm]
([A1,…,Ak] [C1,…,Cn], t’p)
match
27
Axioms for CFDs: reduction
reduction: if ([B, X] A, tp), tp[B] = _, and tp[A] = a
A1 … Ak B A
tp[A1] … tp[Ak] _ a
then (X A, t’p) A1 … Ak A
tp[A1] … tp[Ak] a
28
Axioms for CFDs: finite domain upgrade
upgrade: if only consistent values for B are b1, b2, . . . , bn, dom(B) = { b1, …, bn, …,
bm}, and (R : [A1, . . . ,Ak, B] A, tp)
A1 … Ak B A
tp[A1] … tp[Ak] b1 tp[A]
tp[A1] … tp[Ak] … tp[A]
tp[A1] … tp[Ak] bn tp[A]
then (R : [A1, . . . ,Ak, B] A, tp)
A1 … Ak B A
tp[A1] … tp[Ak] _ tp[A]
29
Static analyses: CFD vs. FD
satisfiability implication finite axiom’tyCFD NP-complete coNP-complete yes
FD O(1) O(n) yes
General setting:
satisfiability implication finite axiom’tyCFD O(n2) O(n2) yes
FD O(1) O(n) yes
in the absence of finite-domain attributes:
Theorem: In the absence of finite-domain attributes, Reflexivity, Augmentation, Transitivity and Reduction are sound and completefor CFD implication complications: finite-domain attributes, interaction between satisfiability and implication analyses
30
Conditional functional dependencies (CFDs)
– Motivation for extending FDs with conditions: data cleaning
– Syntax and semantics
– Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs)
– Motivation: data cleaning and schema matching
– Syntax and semantics
– Static analysis: consistency, implication, axiomatizability Algorithms and open research issues
– SQL techniques for inconsistency detection– Heuristic for satisfiability and implication checking– Repair
Conditional Inclusion dependencies (CINDs)
31
Example: Amazon database
Schema:
order(asin, title, type, price, country, county) -- source
book(asin, isbn, title, price, format) -- target
CD(asin, title, price, genre)
asin: Amazon standard identification number
Instances: asin title type price country county
a23 H. Porter book 17.99 US DL
a12 J. Denver CD 7.94 UK Reyden
asin isbn title price
a23 b32 Harry Porter 17.99
a56 b65 Snow white 7.94
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
order
book CD
32
Schema matching
Traditional inclusion dependencies:
order[asin, title, price] book[asin, title, price]
order[asin, title, price] CD[asin, title, price]
These inclusion dependencies do not make sense!
Inclusion dependencies from source to target (e.g., Clio)
asin title type price country county
asin isbn title price asin title price genre
33
Schema matching: dependencies with conditions
asin title type price country county
asin isbn title price asin title price genre
Conditional inclusion dependencies:
order[asin, title, price; type = book] book[asin, title, price]
order[asin, title, price; type = CD] CD[asin, title, price]
order[asin, title, price] book[asin, title, price] holds only if type = book
order[asin, title, price] CD[asin, title, price] holds only if type = CD
The constraints do not hold on the entire order table
34
Date cleaning with conditional dependencies
CIND1: order[asin, title, price; type = book] book[asin, title, price]
CIND2: order[asin, title, price; type = CD] CD[asin, title, price]
Tuple t1 violates CIND1
Tuple t2 violates CIND2
id asin title type price country county
t1 a23 H. Porter book 17.99 US DL
t2 a12 J. Denver CD 7.94 UK Reyden
asin isbn title price
a23 b32 Harry Porter 17.99
a56 b65 Snow white 7.94
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
order
book CD
35
More on data cleaning
CD[asin, title, price; genre = ‘a-book’] book[asin, title, price; format = ‘audio’]
Inclusion relation CD[asin, title, price] book[asin, title, price] holds only if genre = ‘a-book’, i.e., when the CD is an audio book
In addition, the format of the corresponding book must be audio – a pattern for the referenced tuple
asin isbn title price format
a23 b32 Harry Porter 17.99 Hard cover
a56 b65 Snow White 17.94 audio
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
book
CD
36
(R1[X; Xp] R2[Y; Yp], Tp) R1[X] R2[Y]: embedded traditional IND from R1 to R2 Tp: a pattern tableau
– attributes: X Xp Y Yp
– tuples in Tp consist of constants and unnamed variable _
Example: express
CIND1: order[asin, title, price; type = book] book[asin, title, price]
(order[asin, title, price; type] book[asin, title, price; nil], Tp)
nil: empty list pattern tableau Tp
Conditional Inclusion Dependencies (CINDs)
asin title price type asin title price
_ _ _ book _ _ _
37
CIND2: order[asin, title, price; type = CD] CD[asin, title, price]
CIND3: CD[asin, title, price; genre = ‘a-book’] book[asin, title,
price; format = ‘audio’]
Examples CINDs
asin title price type asin title price
_ _ _ CD _ _ _
asin title price genre asin title price format
_ _ _ a-book _ _ _ audio
(order[asin, title, price; type] CD[asin, title, price; nil], Tp)
(CD[asin, title, price; genre] book[asin, title, price; format], Tp)
38
R1[X] R2[Y] X: [A1, …, An] Y : [B1, …, Bn]
As a CIND: (R1[X; nil] R2[Y; nil], Tp) pattern tableau Tp: a single tuple consisting of _ only
CINDs subsume traditional INDs
Traditional CINDs as a special case
A1 … An B1 … Bn
_ _ _ _ _ _
39
DB = (DB1, DB2), where DBj is an instance of Rj, j = 1, 2.
DB satisfies (R1[X; Xp] R2[Y; Yp], Tp) iff for any tuples t1 in DB1,
and any tuple tp in the pattern tableau Tp, if t1[X, Xp] tp[X,
Xp], then there exists t2 in DB2 such that t1[Y] = t2[Y] (traditional IND semantics) t2[Y, Yp] tp[Y, Yp] (matching the pattern tuple on Y, Yp)
Patterns: t1[X, Xp] tp[X, Xp]: identifying the set of R1 tuples on which tp
applies: { t1 | t1[X, Xp] tp[X, Xp] } t2[Y, Yp] tp[Y, Yp]: enforcing the embedded IND and the
constraint specified by patterns Y, Yp
Semantics of CINDs
40
(CD[asin, title, price; genre] book[asin, title, price; format], Tp)
The following DB satisfies the CIND
Example
asin isbn title price format
a23 b32 Harry Porter 17.99 Hard cover
a56 b65 Snow white 7.94 audio
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
book
CD
asin title price genre asin title price format
_ _ _ a-book _ _ _ audio
41
More examples
CIND1: (order[asin, title, price; type] book[asin, title, price; nil], Tp)
The following DB violates CIND1. Why?
id asin title type price country county
t1 a23 H. Porter book 17.99 US DL
t2 a12 J. Denver CD 7.94 UK Reyden
asin isbn title price
a23 b32 Harry Porter 17.99
a56 b65 Snow white 7.94
asin title price genre
a12 J. Denver 17.99 country
a56 S. White 7.94 a-book
order
book CD
asin title price type asin title price
_ _ _ book _ _ _
42
The satisfiability problem for CINDs
The satisfiability problem for CINDs is to determine, given a set of
CINDs, whether or not there exists a nonempty database DB
that satisfies , i.e., for any in , DB satisfies .
Recall
Any set of traditional INDs is always satisfiable!
For CFDs, the satisfiability problem is intractable.
In contrast.
Theorem. Any set of CINDs is always satisfiable!
Despite the increased expressive power, the complexity of the
satisfiability analysis does not go up.
43
The implication problem for CINDs
The implication problem for CINDs is to decide, given a set of
CINDs and a single CIND , whether implies ( |= ).
For traditional INDs, the implication problem is PS PACE-complete
For CINDs, the complexity does not hike up, to an extent:
Theorem. For CINDs containing no finite-domain attributes, the
implication problem is PSPACE-complete
In the general setting, however, we have to pay a price:
Theorem. The implication problem for CINDs is EXPTIME-complete
Proof idea:
Lower bound: reduction from two-player tiling game
Upper bound: an extension of the chase for CINDs
44
Finite axiomatizability of CINDs
Rules for inferring IND implication:
– Reflexivity: If R[X] R[X]
– Projection and Permutation: If R1[A1, …, Ak] R2[B1, …, Bk],
then R1[Ai1, …, Aik] R2[Bi1, …, Bik],
– Transitivity: If R1[X] R2[Y] and R2[Y] R3[Z], then
R1[X] R3[Z]
Sound and complete for IND implication
CINDs retain the finite axiomatizability
Theorem. There is a sound and complete inference system for
implication analysis of CINDs
There are 8 axioms.
45
Inference rules for CINDs
Normal form of CINDs: (R1[X; Xp] R2[Y; Yp], tp), tp is a single pattern tuple tp[A] is a constant iff A is in Xp or Yp (tp[B] = _ if B is in X or Y)
Inference rules
Reflexivity: (R[X; nil] R[X; nil], tp), where
A1 … Ak A1 … Ak
_ … _ _ … _
Projection and permutation: If (R1[X; Xp] R2[Y; Yp], tp), then (R1[X’; X’p] R2[Y’; Y’p], t’p), for any permutation of X, Xp
X Xp Y Yp
_ tp[Xp] _ tp[Yp]
tp t’p
X’ X’p Y’ Y’p
_ tp[X’p] _ tp[Y’p]
46
Axioms for CINDs: transitivity
Transitivity: if (R1[X; Xp] R2[Y; Yp], tp),
and (R2[Y; Yp] R3[Z; Zp], t’p),
X Xp Y Yp
_ tp[Xp] _ tp[Yp]
X Xp Z Zp
_ tp[Xp] _ t’p[Zp]
Y Yp Z Zp
_ tp[Yp] _ t’p[Zp]
(R1[X; Xp] R3[Z; Zp], t”p)
equal
47
downgrading: if (R1[X, A; Xp] R2[Y, B; Yp], tp),
X A Xp Y B Yp
_ _ tp[Xp] _ _ tp[Yp]
(R1[X; Xp, A] R2[Y; Yp, B], t’p)
X Xp A Y Yp B
_ tp[Xp] a _ tp[Yp] a
Axioms for CINDs: downgrading
48
Axioms for CINDs: augmentation
augmentation: if (R1[X; Xp] R2[Y; Yp], tp), A attr(R1),
X Xp Y Yp
_ tp[Xp] _ tp[Yp]
X Xp A Y Yp
_ tp[Xp] a _ tp[Yp]
(R1[X; Xp, A] R2[Y; Yp], t’p)
49
Axioms for CINDs: reduction
reduction: if (R1[X; Xp] R2[Y; Yp, B], tp),
X Xp Y Yp B
_ tp[Xp] _ tp[Yp] tp[B]
then (R1[X; Xp] R2[Y; Yp], t’p),
X Xp Y Yp
_ tp[Xp] _ tp[Yp]
50
Axioms for CFDs: finite domain reduction
F-reduction: if (R1[X; Xp, A] R2[Y; Yp], tp),
dom(A) = { a1,…, an}
X Xp A Y Yp
_ tp[Xp] a1 _ tp[Yp]
_ tp[Xp] … _ tp[Yp]
_ tp[Xp] an _ tp[Yp]
then (R1[X; Xp] R2[Y; Yp], tp),
X Xp Y Yp
_ tp[Xp] _ tp[Yp]
51
Axioms for CFDs: finite domain upgrade
upgrade: if (R1[X; Xp, A] R2[Y, B; Yp], tp),
dom(A) = { a1,…, an}
X Xp A Y Yp B
_ tp[Xp] a1 _ tp[Yp] a1
_ tp[Xp] … _ tp[Yp] …
_ tp[Xp] an _ tp[Yp] an
then (R1[X,A; Xp] R2[Y, B; Yp], tp),
X A Xp Y B Yp
_ _ tp[Xp] _ _ tp[Yp]
52
Static analyses: CIND vs. IND
satisfiability implication finite axiom’tyCIND O(1) EXPTIME-complete yes
IND O(1) PSPACE-complete yes
General setting:
satisfiability implication finite axiom’tyCIND O(1) PSPACE-complete yes
IND O(1) PSPACE-complete yes
in the absence of finite-domain attributes:
Theorem: In the absence of finite-domain attributes, Reflexivity, Projection and Permutation, Transitivity, Augmentation, Downgrading and Reduction are sound and complete for CIND implication
CINDs retain most complexity bounds of their traditional counterpart
53
CFDs and CINDs taken together
We need both CFDs and CINDs for data cleaning schema matching
Theorem. The implication problem for CFDs and CINDs is
undecidable
Not surprising: The implication problem for traditional FDs and INDs
is already undecidable
Theorem. The consistency problem for CFDs and CINDs is
undecidable
In contrast, any set of traditional FDs and INDs is consistent!
Proof idea: induction from the implication problem for FDs and INDs
54
Static analyses: CFD + CIND vs. FD + IND
satisfiability implication finite axiom’tyCFD + CIND undecidable undecidable No
FD + IND O(1) undecidable No
CINDs and CFDs properly subsume FDs and INDs
Both the satisfiability analysis and implication analysis are
beyond reach in practice
This calls for effective heuristic methods
55
Conditional functional dependencies (CFDs)
– Motivation for extending FDs with conditions: data cleaning
– Syntax and semantics
– Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs)
– Motivation: data cleaning and schema matching
– Syntax and semantics
– Static analysis: consistency, implication, axiomatizability Algorithms and open research issues
– SQL techniques for inconsistency detection– Heuristic for satisfiability and implication checking– Repair
Algorithms and open research issues
56
Detecting CFD Violations
country area-code phone street city zip
44 131 _ _ Edi _
01 908 _ _ MH _
_ _ _ _ _ _
country area-code phone street city zip
44 131 1234567 Mayfield NYC EH4 8LE
44 131 3456789 Crichton NYC EH4 8LE
01 908 3456789 Mountain Ave NYC 07974
CFD: (cust(country, area-code, phone street, city, zip),
Tp)
detection
57
Detecting CFD violations
Input: a set of CFDs and a database DB
Output: the set of tuples in DB that violate at least one CFD in
Approach: automatically generate SQL queries to find violations
Complication 1: consider (R: X Y, Tp), the pattern tableau may be
large (recall that each tuple in Tp is in fact a constraint)
Goal: the size of the SQL queries is independent of Tp
Trick: treat Tp as a data table
CINDs can be checked along the same lines
58
Single CFD: step 1
A pair of SQL queries, treating Tp as a data table
– Single-tuple violation (pattern matching)
– Multi-tuple violations (traditional FDs)
(cust(country, area-code, phone street, city, zip), Tp)
Single-tuple violation: Qc
select *
from R t, Tp tp
where t[country] tp[country] AND t[area-code] tp[area-code]
AND t[phone] tp[phone]
(t[street] <> tp[street] OR t[city] <> tp[city] OR t[zip] <> tp[zip]))
– <>: not matching;
– t[A1] tp[A1]: (t[A1] = tp[A1] OR tp[A1] = _)
59
Single CFD: step 2
Multi-tuple violations (the semantics of traditional FDs): Qvselect distinct t.country, t.area-code, t.phone
from R t, Tp tp
where t[country] tp[country] AND t[area-code] tp[area-code]
AND t[phone] tp[phone]
group by t.country, t.area-code, t.phone
having count(distinct street, city, zip) > 1
Tp is treated as a data table
(cust(country, area-code, phone street, city, zip), Tp)
60
Multiple CFDs
Complication 2: if the set has n CFDs, do we use 2n SQL queries,
and thus 2n passes of the database DB?
Goal: 2 SQL queries no matter how many CFDs are in the size of the SQL queries is independent of Tp
Trick: merge multiple CFDs into one Given (R: X1 Y1, Tp1), (R: X2 Y2, Tp2) Create a single pattern table: Tm = X1 X2 Y1 Y2, Introduce @, a don’t-care variable, to populate attributes of
pattern tuples in X1 – X2, etc (tp[A] = @) Modify the pair of SQL queries by using Tm
61
Handling multiple CFDs
zip state
07974 NJ
90291 CA
01202 _
CFD2: (zipstate, T2)area state
_ _
212 NY
CFD1: (areastate, T1)
area zip state
480 95120 CA
310 90995 CA
CFD3: (area,zipstate, T3)
area zip state
CFD2: @ 07974 NJ
CFD2: @ 90291 CA
CFD2: @ 01202 _
CFD1: _ @ _
CFD1: 212 @ NY
CFD3: 480 95120 CA
CFD3: 310 90995 CA
CFDM:(area,zipstate, TM)
Qc: select * from R t, TM tp where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND t[state] <> tp[state]
Qv: select distinct area, zip from Macro group by area, zip having count(distinct state) > 1
Macro:
select (case tp[area] when “@” then “@” else t[area] end) as area . . .
from R t, TM tp
where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND tp[state] =_
62
Keeping things tidy…
cc area phone addr city zip statet1: +1 908 111-1111 Elm Str. MH 07974 NJt2: +1 908 222-2222 Pine Str. MH 07974 NYt3: +1 212 333-3333 Oak Str. NYC 01202 NYt4: +1 310 444-4444 Main Str. LA 90291 CA
area state
_ _
212 NY
CFD: (areastate, T)
t5: +1 310 555-5555 Rice Str. LA 90291 CT
Tuple deletions: A tuple deletion might remove violations.
Tuples that were dirty, before the deletion, might become clean!
Tuple insertions: A tuple insertion might introduce violations.
Tuples that were clean, before the insertion, might become dirty!
Updating database
63
Incremental inconsistency detection
Incremental approach: compute change V such that
the new violations V’ = the old violations V + V
Why?
Input: a set of CFDs, a database DB, the set V of tuples in DB
that violate , and changes DB to DB
Output: the set V’ of tuples in DB + DB that violate CFDs in
DB: a set of tuples to be inserted into DB or deleted from DB
Approaches: Batch approach:
– Compute DB’ = DB + DB – Apply the SQL detection queries to DB’
64
The need for incremental inconsistency detection
Small DB to DB tends to incur only small changes V to V
– more efficient to compute V then computing V’ starting from
scratch
Minimize unnecessary recomputation and traversal of DB
Goal: generate new SQL queries that perform incremental detection
modify the SQL detection queries by leveraging DB and V
design and maintain auxiliary structures (indexing, mark)
65
Logging violations
area state
_ _
212 NY
CFD: (areastate, T)
cc area phone addr city zip state BC BV
t1: +1 908 111-1111 Elm Str. MH 07974 NJ 0 1t2: +1 908 222-2222 Pine Str. MH 07974 NY 0 1t3: +1 212 333-3333 Oak Str. NYC 01202 NJ 1 0t4: +1 310 444-4444 Main Str. LA 90291 CA 0 0
Use one pair of columns, for each CFD Attribute BC records whether tuple t violates query Qc of the CFD Attribute BV records whether tuple t violates query Qv of the CFD
Initialization:update R t set t[BC] = 1 where t in (QC)update R t set t[BV] = 1 where t in (QV)
66
Handling deletions
Let tdel be the tuple we want to delete from R
Step 1: delete from R t where t = tdel
Step 2: update R t set t[BV] = 0
where t[BV] = 1 AND t[area] = tdel[area] AND 1 = (select count(distinct state) from R t’
where t’[area] = tdel[area])
cc area phone addr city zip state BC BV
t1: +1 908 111-1111 Elm Str. MH 07974 NY 0 1t2: +1 908 222-2222 Pine Str. MH 07974 NJ 0 1t3: +1 212 333-3333 Oak Str. NYC 01202 NJ 1 0t4: +1 310 444-4444 Main Str. LA 90291 CA 0 0
area state
_ _
212 NY
CFD: (areastate, T) How about
batch deletions?How about
batch deletions?
67
Handling insertions
cc area phone addr city zip state BC BV
t3: +1 212 333-3333 Oak Str. NYC 01202 NJ 1 0t4: +1 310 444-4444 Main Str. LA 90291 CA 0 0tins: +1 310 555-555 Rice Str. LA 90291 CT
area state
_ _
212 NY
CFD: (areastate, T)
Let tins be the tuple we want to insert into R
Step 1: insert into R values tins
Step 2: update R t set t[BC] = 1
where t = tins AND exists (
select * from T tp where t[area] t≍ p[area] AND t[state] <>tp[state])
Step 3: update R t set t[BV] = 1, tins[BV] = 1
where t[area] = tins[area] AND t[state] ≠ tins[state] AND exists (
select * from T tp where tins[area] t≍ p[area] AND tp[state] = _)
Step 2 Step 3
0 1
/ 1
Step 1
How aboutbatch insertions?
How aboutbatch insertions?
68
Handling batch insertions
Step 1
Let Rins be the tuples we want to insert into R
Step 1: Find tuples in Rins which violate Qc
Step 2: Find clean tuples in R that become
dirty, due to some tuple(s) in Rins
Step 3: Find tuples in Rins that become dirty
due to some dirty tuple(s) in R
Step 4: Find clean tuples in Rins that violate the CFD
Step 5: Insert Rins in R
Step 2
Step 3
Step 4
Rins:
R:
69
The source code
Let Rins be the tuples we want to insert into R
Step 1: update Rins tins set tins[BC] = 1 where tins in (QC)
Step 2: update R t set t[BV] = 1 where t[BC] = 0 AND t[BV] = 0 AND
exists (select * from T where t[area] t≍ p[area] AND tp[state] = _) AND
exists (select * from Rins tins where tins[area] = t[area] AND
tins[state] ≠ t[state])
Step 3: update Rins tins set tins[BV] = 1 where
exists (select * from R t where tins[area] = t[area] AND t[BV] = 1)
Step 4: update Rins tins set tins[BV] = 1 where tins[BC] = 0 AND tins[BV] = 0 AND
tins[area] IN (select area from Rins t’ins, Tp tp where
t’ins[BC] = 0 AND t’ins[BV] = 0 AND t’ins[area] t≍ p[area] AND
tp[state] = _
group by area having count(distinct state) > 1)
Step 5: insert into R values (select * from Rins)
70
Scalability in Instance Size
0
0.5
1
1.5
2
2.5
3
10K 20K 30K 40K 50K 60K 70K 80K 90K 100K
Instance Size
Tim
e (s
ecs)
, NumAttrs =3, NumConsts =50% Noise =5%
71
Scalability in NumConsts
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
NumConsts
Tim
e (s
ecs) , Sz =100K
, TabSz =1K Noise =5%
72
Scalability in Noise
0.5
0.55
0.6
0.65
0.7
0.75
0.8
0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
Noise
Tim
e (s
ecs)
, Sz =100K , TabSz =30K NumAttrs =2 ,NumConsts100% =
73
Merging CFDs
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
10K 20K 30K 40K 50K 60K 70K 80K 90K 100K
Instance Size
Tim
e (s
ecs)
Merged, TabSz = 700
Non-Merged, TabSz = 200 + 500
74
Incremental Deletions
0
1
2
3
4
5
6
7
8
9
10
10 20 30 40 50 60 70 80 90 100
# of tuples deleted
Tim
e (s
ecs)
Incremental
Non-incremental
75
Incremental Batch Deletions
0
2
4
6
8
10
1K 2K 3K 4K 5K 6K 7K 8K 9K 10K
# of tuples deleted
Tim
e (s
ecs)
Incremental
Non-incremental
76
Incremental Insertions
0
0.5
1
1.5
2
2.5
3
10 20 30 40 50 60 70 80 90 100
# of tuples inserted
Tim
e (s
ecs) Incremental
Non-incremental
77
Incremental Batch Insertions
0
1
2
3
4
5
6
7
1K 2K 3K 4K 5K 6K 7K 8K 9K 10K
# of tuples inserted
Tim
e (s
ecs)
Incremental
Non-incremental
78
Checking the satisfiability of CFDs
Input: a set of CFDs
MAXSC: find a maximum subset of that is consistent
Complexity: the MAXSC problem for CFDs is NP-complete
Theorem: there is an -approximation algorithm for MAXSC there exist constant such that for the subset m found by the
algorithm has a bound: card(m) > card(OPT())
Proof idea: approximation factor preserving reduction to
MAXGSAT
Open questions: effective heuristic algorithms for checking the satisfiability of CFDs + CINDs (undecidable) for determining implication of CFDs, CINDs, and CFDs + CINDs for finding minimum cover of CFDs, CINDs, and CFDs + CINDs
79
Automated methods for finding a repair
Input: a relational database DB, and a set of CFDs
Output: a repair DB’ of DB such that cost(DB’, DB) is minimal repair: DB’ satisfies “good”: cost(DB’, DB)
– DB’ is “close” to the original data in DB– Minimizing changes to “accurate” attributes
Complexity: Finding an optimal repair is NP-complete (data complexity) for traditional FDs, for a fixed
set of FDs (or INDs) and fixed schema PSPACE-complete for CFDs + CINDs (combined complexity)
Open questions: effective heuristic for repairing databases based
on CFDs, CINDs, and CFDs + CINDs
80
Incremental repair
Input: a clean database DB, changes DB to DB, and a set of
CFDs
Output: a repair DB’ of DB + DB
Complexity. The local data cleaning problem is NP-hard, even if
DB consists of a single tuple.
Open questions: find effective heuristic algorithms for incrementally
repairing databases based on CFDs CINDs CFDs + CINDs
81
Discovering CFDs and CINDs
Input: Sample databases of a schema R
Output: CFDs and CINDs that hold on all (or most) database
instances of R
Difficulty. A naïve approach may find non-representative CFDs and
CINDs as large as the sample data
Open questions: find effective method for discovering CFDs CINDs CFDs + CINDs
82
Summary
Conditional functional dependencies– for data cleaning rather than schema design – complexity bounds of satisfiability and implication analyses– a sound and complete inference system
Conditional inclusion dependencies– for data cleaning and schema matching in practice– complexity bounds of satisfiability and implication analyses– a sound and complete inference system
Complexity bounds for CFDs and CINDs taken together SQL techniques for automatic detection of CFD violations
– a pair of SQL queries for validating multiple CFDs– incremental techniques for validating CFDs
A practical method for data cleaning and schema matching
83
References
Conditional Functional Dependencies for Data Cleaning
The 23rd International Conference on Database Engineering
(ICDE), 2007.Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios
Kementsietsidis
Extending Dependencies with Conditions Loreto Bravo, Wenfei Fan, Shuai Ma
Improving Data Quality: Consistency and Accuracy Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia, Shuai Ma
Conditional Functional Dependencies for Capturing Data
Inconsistencies Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis