ERACER: A Database Approach for Statistical Inference and Data Cleaning
Chris Mayfield, Jennifer Neville, Sunil Prabhakar
Department of Computer Science, Purdue University, West Lafayette, Indiana, USA
SIGMOD 2010, Indianapolis
Problem
Real data is dirty
- Inconsistent, incomplete, corrupted, outdated, etc.
- Safety measures (e.g., constraints) are often not used
- Poor decisions based on dirty data cost billions annually

Data cleaning is hard
- Typically ad hoc, interactive, exploratory, etc.
- Uncertain process: what to do with the "errors"?
- Maintenance of results (e.g., lineage/provenance)
- Consumes a large amount of data management time
(see Fan, Geerts, & Jia, VLDB 2008 tutorial)
Mayfield, Neville, & Prabhakar SIGMOD 2010 2 of 21
Example 1: Genealogy Data
5M people from the Pedigree Resource File
- Person (ind_id, birth, death)
- Relative (ind_id, rel_id, role)
[Figure: pedigree tree with individuals annotated by birth (B) and death (D) years, e.g., B: 1744, D: 1804; many entries are unknown (B: ?, D: ?)]
Integrated from many sources, e.g.:
- Census records
- Immigration lists
- Family history societies
- Individual submissions
Example 2: Sensor Networks
2M readings of Intel Lab Data
- Sensor (epoch, mote_id, temp, humid, light, volt)
- Neighbor (id1, id2, distance)
- 54 sensors, every 31 seconds, for 38 days
- ≈ 18% obviously incorrect
- Multiple data types
Source: http://db.csail.mit.edu/labdata/labdata.html
Insight
Correlations within tuples, e.g.:
- Birth and death years
- Temperature and humidity values

Correlations across tuples, e.g.:
- Parents and children
- Neighboring sensors

Apply statistical relational learning
- Don't just clean tuples in isolation (e.g., functional dependencies)
- Propagate inferences multiple times

Input:
- Possible tuple dependencies
- Correlation model skeleton

Output:
- PDFs for missing data
- Flags for dirty data
Baseline Approach:
- Bayesian networks
- Exact inference (junction tree)
- Bayes Net Toolbox for Matlab
Bayesian Network Formulation
Model template specifies conditional dependencies:
[Figure: model template (Individual with Birth and Death attributes, linked to Father and Mother) unrolled into a Bayesian network over nodes F.b, F.d, M.b, M.d, I.b, I.d, spouse S.b, S.d, and children C1.b, C2.b, C3.b with their death years]
Conditional probability distribution (CPD) at each node:
P(I.d | I.b): death year, given the birth year
P(I.b | M.b, F.b): birth year, given parent birth years
Prior distribution at nodes with no parents: P(I.b)
Simplified version of Relational Bayesian Networks (see e.g., Getoor & Taskar 2007)
Maximum Likelihood Estimation
[Figure: the unrolled Bayesian network from the previous slide]
1. Learn CPDs from data, e.g.:
   CREATE TABLE cpt_birth AS
   SELECT birth, death, count(*)
   FROM person
   GROUP BY birth, death;

2. Share CPDs across all nodes:
   -- P(I.d | I.b = 1750)
   SELECT death, count
   FROM cpt_birth
   WHERE birth = 1750;

3. Run inference (e.g., junction tree)
   - Construct Bayesian network
   - Bind evidence (query from DB)
   - Extract results (store in DB)
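The learn-then-lookup pattern in steps 1-2 can also be sketched outside SQL. Below is a minimal Python version, with hypothetical (birth, death) rows standing in for the person table; the counting scheme (no smoothing) is an illustrative assumption, not the paper's implementation:

```python
from collections import Counter

# Hypothetical (birth, death) pairs standing in for the person table.
person = [(1750, 1801), (1750, 1812), (1750, 1801), (1751, 1799)]

# 1. Learn the CPD: joint counts of (birth, death), like cpt_birth above.
cpt_birth = Counter(person)

# 2. Share the CPD: normalize the counts for one birth year to get
#    P(I.d | I.b = birth), the same lookup as the SELECT ... WHERE query.
def p_death_given_birth(birth):
    counts = {d: c for (b, d), c in cpt_birth.items() if b == birth}
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

dist = p_death_given_birth(1750)  # {1801: 2/3, 1812: 1/3}
```

The same counts table serves every node in the unrolled network, which is what "share CPDs across all nodes" means here.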
Challenges and Lessons Learned
Limiting model assumptions
- Fixed CPD structure (e.g., always two parents)
- Acyclicity constraint (can't model sensor data)

Potentially millions of parameters
- Becomes very inefficient
- Floating point underflow

Not scalable to large data sets
- DB may not fit into main memory
- Moving data in/out of R, Matlab, etc.

Not designed for data cleaning
- Propagates outliers/errors in original data
- Need to look beyond the Markov blanket
ERACER Approach:
- Relational dependency networks
- Approximate inference algorithm
- SQL-based framework
- Integrated data cleaning
Relational Dependency Networks
For example, at each sensor and time epoch:
[Figure: sensor model template (Temp and Humid attributes, Neighbor links) unrolled into a cyclic dependency network over nodes S1.t, S1.h, ..., S4.t, S4.h]
In contrast to Bayesian networks, RDNs:
- approximate the full joint distribution
- learn CPDs locally based on component models
- allow cyclic dependencies (i.e., many-to-many)
- use aggregation to deal with heterogeneity
(see Neville & Jensen, JMLR 2007)
Component Models
Convolution (for genealogy)
- parent age: M_PA = P(I.b - P.b)
- death age: M_DA = P(I.d - I.b)
-- death age model
SELECT hist(death - birth)
FROM person;
[Histogram of death ages, 0-100 years: expected value = 54.0, standard deviation = 29.0]
Regression (for sensors)
- mean temperature: S.t ~ β0 + β1·S.h + β2·avg(N.t) + β3·avg(N.h)
- mean humidity: S.h ~ γ0 + γ1·S.t + γ2·avg(N.t) + γ3·avg(N.h)
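The regression models above can be sketched in miniature. The version below is reduced to a single predictor (average neighbor temperature) so the ordinary-least-squares fit stays closed-form; the toy readings are assumptions, and the actual system fits the full four-coefficient models in R via lm:

```python
# Hypothetical (avg neighbor temp, own temp) pairs for one sensor's rows;
# a one-predictor stand-in for S.t ~ b0 + b1*avg(N.t).
rows = [(20.0, 20.5), (21.0, 21.4), (22.0, 22.6), (23.0, 23.5), (24.0, 24.4)]

n = len(rows)
mean_x = sum(x for x, _ in rows) / n
mean_y = sum(y for _, y in rows) / n

# Closed-form OLS: slope from covariance/variance, then intercept.
beta1 = sum((x - mean_x) * (y - mean_y) for x, y in rows) / \
        sum((x - mean_x) ** 2 for x, _ in rows)
beta0 = mean_y - beta1 * mean_x

def predict(avg_neighbor_temp):
    # Mean-temperature prediction for a node, given its neighbor average.
    return beta0 + beta1 * avg_neighbor_temp
```

The full models work the same way, just with more columns in the design matrix (own humidity plus both neighbor aggregates).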
ERACER Framework
Learning (one time, offline):
1. Extract graph structure using domain knowledge
2. RDNs aggregate existing data to learn parameters

Inference (multiple iterations):
3. Apply component models to every value in DB
4. Combine predictions to deal with heterogeneity
5. Evaluate posterior distributions for cleaning
6. Repeat 3-5 until happy (i.e., results converge)
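Steps 3-6 amount to repeatedly re-estimating each value from its current neighborhood until nothing changes. A toy sketch on a chain of sensors; the neighbor-averaging "model" and the initial guess are illustrative assumptions, not the actual component models:

```python
# Toy chain of sensor readings; None marks a missing value.
readings = [20.0, None, None, 26.0]

# Crude initial guesses for the missing entries.
est = [r if r is not None else 23.0 for r in readings]

for _ in range(100):                     # step 6: repeat until results converge
    new = list(est)
    for i, r in enumerate(readings):
        if r is None:                    # step 3: apply model to each missing value
            nbrs = [est[j] for j in (i - 1, i + 1) if 0 <= j < len(est)]
            new[i] = sum(nbrs) / len(nbrs)   # step 4: combine predictions
    if max(abs(a - b) for a, b in zip(new, est)) < 1e-9:
        break                            # converged
    est = new
```

On this chain the estimates settle at the interpolated values (22 and 24); in ERACER the same loop runs as set-oriented SQL over the whole database per round.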
Step 1: Extract Graphical Model
Construct nodes:
  INSERT INTO node
  SELECT make_nid(epoch, mote_id), -- creates simple key
         new_basis(temp), new_basis(humid),
         new_basis(light), new_basis(volt)
  FROM sensor;

basis data type:
  initial  original value, if any (e.g., humid)
  pdf      current prediction (or distribution)
  suspect  data cleaning flag (true = outlier)
  round    when pdf/suspect was last updated

Construct edges:
  INSERT INTO link
  SELECT make_nid(a.epoch, a.mote_id),
         make_nid(b.epoch, b.mote_id)
  FROM neighbor AS c -- e.g., within 6 meters
  INNER JOIN sensor AS a ON c.id1 = a.mote_id
  INNER JOIN sensor AS b ON c.id2 = b.mote_id
  WHERE a.epoch - 30 <= b.epoch
    AND a.epoch + 30 >= b.epoch;
Step 2: Learn RDN Parameters
Aggregate original data values:
  CREATE TABLE learn AS
  SELECT
    -- individual instances
    min(expect(i.t)) AS ti, min(expect(i.h)) AS hi,
    min(expect(i.l)) AS li, min(expect(i.v)) AS vi,
    -- average neighbor values
    avg(expect(n.t)) AS tn, avg(expect(n.h)) AS hn,
    avg(expect(n.l)) AS ln, avg(expect(n.v)) AS vn
  FROM node AS i
  LEFT JOIN link AS l ON i.nid = l.id1
  LEFT JOIN node AS n ON l.id2 = n.nid
  GROUP BY i.nid;
Optional: apply noise filters, sample data, etc.
Estimate applicable component models
- Convolution: use built-in hist aggregate
- Regression: export to R; use lm function
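The convolution estimate itself is just a shifted histogram: learn the distribution of death ages once, then slide it to any known birth year. A small Python sketch of that idea, with made-up (birth, death) rows and no smoothing:

```python
from collections import Counter

# Hypothetical (birth, death) rows standing in for the person table.
people = [(1700, 1754), (1710, 1765), (1720, 1772), (1705, 1760)]

# M_DA = P(I.d - I.b): the histogram of death ages.
death_age = Counter(d - b for b, d in people)
total = sum(death_age.values())

def death_year_dist(birth):
    # Shift the death-age histogram by the birth year to get P(I.d | I.b).
    return {birth + age: c / total for age, c in death_age.items()}

dist = death_year_dist(1750)
```

Unlike a full CPT over (birth, death) pairs, the histogram has one bucket per age, which keeps the parameter count small and the SQL aggregate cheap.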
Steps 3–6: Approximate Inference
For each round of inference:
1. Update predictions via the erace aggregate query
   - Infers/cleans all attributes in a single function call

   SELECT erace(i, n)
   FROM node AS i
   LEFT JOIN link AS l ON i.nid = l.id1
   LEFT JOIN node AS n ON l.id2 = n.nid
   GROUP BY i;

   Key design choice: grouping by tuples, not attributes

2. Store results via CREATE TABLE AS (i.e., propagation)
   - Faster than UPDATE over the entire relation (MVCC)
   - Other optimizations possible (e.g., indexes on nid's)
erace Aggregate Function
SELECT erace(i, n)

[Figure: example input row with instance attributes (ti, hi) and neighbor aggregates (tn, hn); some values unknown (?)]
For each attribute in i:
- Select applicable model
- Apply/combine predictions
- Evaluate (cf. init and prev)
Data cleaning algorithm:
- Run inference for known values, as if missing
- Is original evidence within expected range?
- Replace outliers with inferred distributions
- Do not propagate suspects (rely on other data)
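The range test in the second bullet can be sketched as a simple z-score check; the threshold k and the toy numbers below are assumptions, not the paper's actual posterior test:

```python
def is_suspect(original, inferred_mean, inferred_std, k=3.0):
    """Flag original evidence that falls outside the expected range
    implied by the inference-as-if-missing distribution."""
    return abs(original - inferred_mean) > k * inferred_std

# A stuck 90-degree reading stands out against an inferred 22 +/- 1.5.
flags = [is_suspect(v, 22.0, 1.5) for v in (21.0, 90.0, 23.5)]
```

Flagged values keep their inferred distribution and, per the last bullet, are not used as evidence for their neighbors in later rounds.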
Many more details in the paper!
Experiments:
- Generate synthetic populations
- Randomly set attributes to NULL
- Compare inferred values to original
Genealogy Data Results
Accuracy of birth pdfs:
[Plots: mean absolute error vs. percent missing (20-60%, per blanket) and percent corrupt (0-25%, per blanket), comparing BayesNet and ERACER]
Variance (uncertainty):
[Plots: average standard deviation vs. percent missing and percent corrupt (per blanket), comparing BayesNet and ERACER]
Sensor Data Results
Accuracy of temperature pdfs:
[Plots: mean absolute error of temperature predictions vs. percent missing (30-90%) and percent corrupt (10-70%), comparing Baseline and ERACER]
Accuracy of humidity pdfs:
[Plots: mean absolute error of humidity predictions vs. percent missing and percent corrupt, comparing Baseline and ERACER]
Summary
Database-centric approach for approximate inference
- Statistical framework for correcting errors
- Efficient; no need to move data to/from R/Matlab

Synergy of imputation and data cleaning tasks
- Additional evidence identifies errors more accurately
- Corrected data values improve the quality of inference

Empirical evaluation on two real-world data sets
- Similar accuracy to Bayesian network baseline
- Significant gains in runtime performance
- Added benefit of simultaneous data cleaning