DESCRIPTION
Ahmed K. Elmagarmid (IEEE Fellow and ACM Distinguished Scientist) gave a lecture on Data Quality: Not Your Typical Database Problem in the Distinguished Lecturer Series - Leon The Mathematician.
Data Quality: Not your Typical Database Problem
2011 © Copyright QCRI. Confidential document.
Ahmed Elmagarmid
Executive Director, Qatar Computing Research Institute
Where are we located?
Qatar Foundation
EDUCATION
SCIENCE & RESEARCH
COMMUNITY DEVELOPMENT
2.8 percent of GDP to be spent on research annually by 2015
Qatar Foundation Research Division
• Qatar Biomedical Research Institute (QBRI)
• Qatar Energy & Environment Research Institute (QEERI)
• Qatar Computing Research Institute (QCRI)
QCRI Overview
QCRI Vision
To make Qatar a global center for computing research by becoming the world’s recognized leader in Arabic language technologies and in key areas vital to the global growth of Qatari business and entrepreneurial activity.
QCRI Model
Academia
– Individual projects
– Students move on
– Theoretical & basic research
National Institutions (QCRI)
– Grand practical challenges
– National and global impact
– Localized skills & knowledge
– Large teams and long term
– Example peers: INRIA, MPI
Research Parks
– Commercialization
– Entrepreneurship
– Incubation
Axes: Basic Research to Applied Research; Project-based to Grand Challenges
QCRI Ecosystem
QCRI, QU, HKU, QEERI, QBRI, Sidra, MIT, Boeing, Aljazeera, Yahoo, Google, Microsoft, ALTIS, QSA, MEEZA, QSTP, QP, Energy Co., WikiMedia, IBM
QCRI Research Centers
• Arabic Language Technologies
• Social Computing
• Scientific Computing
• Data Analytics
• Cloud Computing
QCRI Scientific Advisory Council
• Prof. Rich DeMillo, Georgia Tech (Chair)
• Prof. Joichi Ito, MIT Media Lab Director
• Prof. Ruzena Bajcsy, University of California, Berkeley
• Lord Rupert Redesdale, UK House of Lords
• Lew Tucker, Vice President, Cisco
• Prof. Alfred V. Aho, Columbia University
• Yousef Khalidi, Vice President, Microsoft
• Prof. Dick Lipton, Georgia Tech
The 60 Doers!
[Team chart: staff first names grouped under Management and Support Team, Scientific Computing, Data Analytics, Cloud Computing, Arabic Language Technologies, and Social Computing]
Strategic Partnerships
5-YEAR QCRI MANPOWER PLAN
Year     Headcount   Change
10-11    21
11-12    34          +13
12-13    82          +48
13-14    102         +20
14-15    110         +8
This Talk
Data Quality
Data Quality
Enhancing the usability of the acquired data and increasing the confidence of query results
"Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." - Gartner Group
Dirty Data is Expensive
• Real-life data is often dirty: data error rates in industry range from 1% to 30% (Redman, 1998)
• The Obama administration offered $19 billion in grants for health IT in 2009, i.e. to improve EMRs
• Erroneously priced data in retail databases costs US customers $2.5 billion each year
• The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year (2002)
Where to start? Data Quality everywhere!
• Data Entry
• Information Extraction
• Integration from multiple sources
• Standardization and transformation
• Business rules compliance
“Academic” Data Cleaning
● Pick a well-understood data problem under some scoping assumptions and solve it independently
– Duplicates
– Functional dependency (FD) violations
– Matching dependency violations
– Missing value imputation
● Piecemeal approach to tackle the complexity, and sometimes the intractability, of the problem
– Repairing violations of FD constraints in special cases (no deletions, left-hand-side changes only, allowing variables, etc.)
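To make one of these narrowly scoped problems concrete, here is a minimal Python sketch that detects violations of a functional dependency. The table, the column names, and the `fd_violations` helper are hypothetical and not from the talk; the sketch only detects violations and does not attempt a repair.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        # More than one distinct right-hand side for the same left-hand side
        # means the dependency lhs -> rhs is violated within this group.
        rhs_values = {tuple(m[attr] for attr in rhs) for m in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations


# Hypothetical data: the dependency zip -> city is assumed to hold.
rows = [
    {"id": 1, "zip": "47906", "city": "West Lafayette"},
    {"id": 2, "zip": "47906", "city": "W. Lafayette"},
    {"id": 3, "zip": "10001", "city": "New York"},
]
bad = fd_violations(rows, lhs=["zip"], rhs=["city"])
```

Here `bad` holds one group, keyed by the shared ZIP code, containing rows 1 and 2. Deciding which of the two city spellings to keep is the repair problem, which this sketch deliberately leaves open.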
“Academic” Data Cleaning
• Despite their theoretical and algorithmic beauty, these techniques are rarely used
– Problems never exist in isolation
– Fixes to one problem often introduce “other” problems
– Data is usually not accessible to mess with
– Integrity constraints!... What integrity constraints?!!
“Practitioner” Data Cleaning
• Will share some scary stories
– “Post-it notes” as an expert messaging system
– “Written permission” to change the value of a record
– Default values and best practices
– “Call John.. He will know what to do”
This Talk
● A few data quality challenges and (hopefully) research directions
● Summary of recent efforts at QCRI
10 Data Quality Issues
Issue 1: The data trio
[Figure: the data trio; surviving labels: Quality, DATA]
Extraction remains a key source of data errors
Acquiring the semantics/schema of the underlying unstructured data sources (documents, emails, related Web info, click traces, profiles, interests, etc.)
Integration aggravates the problem
Linked data as an attempt to live with errors .. link as you go
Issue 2: Data level or application level
• Cleaning data tables by trusting the table schema is rarely useful
• Will share a story
– Bellcore with 1800 inter-linked databases
– Rule-based logic for sanity checking
– Post-it messages to communicate between data quality officers .. who work in shifts!
– A data cleaning action is meaningless if not tied to business logic or to a process. Should never be against FDs
Issue 3: Protect your gain: DQ Dashboard
● How to protect against going backwards
● How to protect your gains during the cleansing process
● Metrics:
– Minimality principle: the most widely used in academic cleaning
– Value of information: to spot the most important problem to fix
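One common reading of the minimality principle is: among candidate repairs that remove the violations, prefer the one that changes the fewest cells. The sketch below illustrates that reading only. The tables and the `repair_cost` helper are hypothetical, and the cost is a plain count of changed cells, not a metric taken from the talk.

```python
def repair_cost(original, repaired):
    """Count the cells that differ between two same-shaped tables.

    Both tables are lists of dicts with identical row order and columns.
    """
    return sum(
        1
        for before, after in zip(original, repaired)
        for column in before
        if before[column] != after[column]
    )


# Hypothetical dirty table: zip -> city is violated by the third row.
original = [
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "Lafayette"},
]

# Candidate A fixes the one outlier; candidate B rewrites the majority.
candidate_a = [
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "West Lafayette"},
]
candidate_b = [
    {"zip": "47906", "city": "Lafayette"},
    {"zip": "47906", "city": "Lafayette"},
    {"zip": "47906", "city": "Lafayette"},
]

best = min([candidate_a, candidate_b], key=lambda c: repair_cost(original, c))
```

Candidate A changes one cell and candidate B changes two, so `best` is candidate A. Minimality says nothing about which repair is actually correct, which is why the slide lists value of information as a separate metric.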
Issue 3: Protect your gain - Ideas
• Root-cause analysis for data cleaning
• Chase problems to the source to reason about “progress”
• Leveraging “Provenance” to design progress meters
Issue 4: Data is not an orphan!
● Data stewards are not imaginary characters! Important data has stewards and custodians
● Need to go through these guardians first
– Some health care organizations require a signed form per changed cell stating the reasons for the change
● Possible approaches:
– How to avoid stewards?
– How to integrate them in the process or minimize their involvement?
Issue 5: How clean is clean?
• Quality awareness eats up 10% of the budget [Telecom Experience]
• How to avoid over-cleaning
• Example: “Bill Forgiveness”, a real-life experience: roaming charges and cross-carrier calls have a very complicated business model
• Possible approaches
– Measure cleaning progress
– Clean only to satisfy some application needs
Issue 6: Online cleaning is a necessity, not a feature
● We live in a complex world → complex applications with 100s and 1000s of components and parameters
● Clean as you go .. Clean on demand .. Clean opportunistically .. This can be the only hope
● New concepts:
– Iterative cleaning
– Cleaning dynamic and evolving data
● Off-line cleaning can still benefit historical data but is becoming less and less important
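The "clean on demand" idea can be sketched as a wrapper that cleans a record only when it is first read, and forgets the cleaned copy when the underlying record changes. Everything here (`OnDemandCleaner`, `normalize_name`, the sample records) is a hypothetical illustration, not a system described in the talk.

```python
class OnDemandCleaner:
    """Cleans a record the first time it is read and caches the result."""

    def __init__(self, records, clean_fn):
        self._records = dict(records)  # raw, possibly dirty records by key
        self._clean_fn = clean_fn      # function: raw record -> cleaned record
        self._cache = {}               # cleaned records produced so far

    def get(self, key):
        # Clean lazily: only records that are actually queried pay the cost.
        if key not in self._cache:
            self._cache[key] = self._clean_fn(self._records[key])
        return self._cache[key]

    def update(self, key, record):
        # The data evolves, so a stale cleaned copy must be dropped.
        self._records[key] = record
        self._cache.pop(key, None)

    @property
    def cleaned_count(self):
        return len(self._cache)


def normalize_name(record):
    """Toy cleaning step: trim whitespace and normalize capitalization."""
    return {**record, "name": record["name"].strip().title()}


cleaner = OnDemandCleaner(
    {"P1": {"name": "  green "}, "P2": {"name": "PETER"}},
    normalize_name,
)
first = cleaner.get("P1")  # only P1 has been cleaned at this point
```

After the single read, `cleaner.cleaned_count` is 1: record P2 stays dirty until something asks for it. A real system would also need to handle cleaning steps that span several records, which this per-record sketch ignores.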
Issue 7: Application quality
• Data Quality → Information Quality → Application quality
• Realize the levels of complexity in current BI apps
• Data usage should influence data cleaning
– “Usage-based” data cleaning
Issue 8: SW engineering DQ
• Current focus is on discrete values with simple integrity constraints (FD, uniqueness…)
• We are good at checking if data complies with rules
• Real business rules are often “assertions” expressed in “Turing-complete” languages
• Checking “did we write the assertions right?” becomes a lot harder
• But also.. need to think about whether we wrote the right assertions!
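A small sketch may help show what business rules as assertions look like once they are ordinary code rather than declarative constraints. The rules, field names, and the `failed_rules` helper below are invented for illustration. Note that the checker can only report whether data satisfies the rules as written, not whether the rules themselves are the right ones.

```python
def roaming_call_names_partner(record):
    # Hypothetical rule: a roaming call must name the partner carrier.
    return not record["roaming"] or bool(record.get("partner_carrier"))


def charge_is_non_negative(record):
    # Hypothetical rule: a billed charge can never be negative.
    return record["charge"] >= 0


RULES = [roaming_call_names_partner, charge_is_non_negative]


def failed_rules(record, rules=RULES):
    """Return the names of all rules the record does not satisfy."""
    return [rule.__name__ for rule in rules if not rule(record)]


dirty_call = {"roaming": True, "partner_carrier": "", "charge": -1.0}
clean_call = {"roaming": False, "charge": 2.5}

dirty_report = failed_rules(dirty_call)
clean_report = failed_rules(clean_call)
```

Because each rule is an arbitrary function, there is no general way to reason about the rule set the way one can about FDs, for example to detect that two rules contradict each other. That is the software engineering problem the slide points at.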
Issue 9: DQ Theory?
• The ACID properties in transaction management were not only sensible requirements but also came with algorithms and methods to enforce them during transaction processing
• Does it make sense to do the same for quality? Plausible properties along with actions for maintaining acceptable quality during data manipulation
• Some of these already exist: timeliness, currency, consistency, etc., but they lack methods of enforcement
Issue 10: Scale .. Scale
• Terabytes and petabytes of data require new ways to enforce data quality
• Which ball to drop
• Leveraging application semantics and data usage
• Sampling to learn from the few and apply to the masses
• Active learning to replace human feedback (GDR as a solution)
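As a rough illustration of sampling "to learn from the few", the sketch below estimates a dataset's error rate from a random sample instead of a full scan. The `estimate_error_rate` helper, the dirtiness predicate, and the synthetic records are assumptions made for the example. A real deployment would also want a confidence interval around the estimate.

```python
import random


def estimate_error_rate(records, is_dirty, sample_size, seed=0):
    """Estimate the fraction of dirty records from a uniform random sample."""
    if not records:
        raise ValueError("cannot sample from an empty dataset")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    dirty = sum(1 for record in sample if is_dirty(record))
    return dirty / len(sample)


# Synthetic data: every tenth record is missing its ZIP code.
records = [{"id": i, "zip": None if i % 10 == 0 else "51519"} for i in range(1000)]


def missing_zip(record):
    return record["zip"] is None


rate = estimate_error_rate(records, missing_zip, sample_size=100)
```

With a sample of 100 the estimate will be near, but usually not exactly, the true rate of 10%. Asking for a sample at least as large as the dataset degenerates to a full scan and returns the exact rate.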
Sample QCRI Projects
GDR – Guided Data Repair
• Scalable ways to involve experts
• Repurposing destructive automatic techniques to guide repairs
• Value of Information measures to generate the most important questions
• Judicious use of active learning from user feedback
GDR Architecture
[Figure: pipeline with components Input Database Instance, Detect Errors and Violations, Learn and Repair Database, Clean Database Instance, User Query, Results]
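The idea of using a value-of-information measure to pick the most important questions for an expert can be illustrated with a toy ranking. This is not the GDR algorithm. The `CandidateRepair` fields and the score (estimated probability of being correct times violations fixed) are stand-ins chosen for the example, as are all the values.

```python
from dataclasses import dataclass


@dataclass
class CandidateRepair:
    cell: tuple            # (row id, column) the repair would change
    new_value: str         # proposed replacement value
    p_correct: float       # assumed estimate that the repair is right
    violations_fixed: int  # assumed number of violations it would resolve


def rank_for_review(candidates, budget):
    """Return the `budget` repairs with the highest expected benefit."""
    ranked = sorted(
        candidates,
        key=lambda c: c.p_correct * c.violations_fixed,
        reverse=True,
    )
    return ranked[:budget]


candidates = [
    CandidateRepair(("P2", "zip"), "51519", p_correct=0.9, violations_fixed=1),
    CandidateRepair(("P5", "name"), "Green", p_correct=0.6, violations_fixed=4),
    CandidateRepair(("P6", "zip"), "51518", p_correct=0.5, violations_fixed=2),
]

# With a budget of two questions, ask about the repairs scoring 2.4 and 1.0.
to_ask = rank_for_review(candidates, budget=2)
```

In a guided-repair loop, the expert's answers to these questions would become training data for a model that decides the remaining repairs. That learning step is omitted here.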
Probabilistic Data Cleaning
[Figure: pipeline with components Input Database Instance, Uncertain Error Detection, Possible Repair Generation, Possible Clean Instances, User Query, Probabilistic Results]
Possible Repairs
A possible repair is a clustering of the input tuples
Person
ID   Name    ZIP     Income
P1   Green   51519   30k
P2   Green   51518   32k
P3   Peter   30528   40k
P4   Peter   30528   40k
P5   Gree    51519   55k
P6   Chuck   51519   30k
Uncertain Clustering
X1: {P1} {P2} {P3,P4} {P5} {P6}
X2: {P1,P2} {P3,P4} {P5} {P6}
X3: {P1,P2,P5} {P3,P4} {P6}
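The three clusterings on this slide can be written down directly as data, which makes it easy to ask how often two tuples are treated as the same person. In the sketch below, the labels X1 to X3 for the full clusterings and the equal weighting of the repairs are assumptions made for illustration. The slide does not give probabilities for the repairs.

```python
# The three possible repairs from the slide, each a clustering of P1..P6.
REPAIRS = {
    "X1": [{"P1"}, {"P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X2": [{"P1", "P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X3": [{"P1", "P2", "P5"}, {"P3", "P4"}, {"P6"}],
}


def cluster_of(repair, tuple_id):
    """Return the cluster that contains `tuple_id` in one repair."""
    for cluster in repair:
        if tuple_id in cluster:
            return cluster
    raise KeyError(tuple_id)


def same_entity_fraction(repairs, a, b):
    """Fraction of repairs in which tuples `a` and `b` share a cluster.

    Assumes every repair is equally likely, which is an illustrative
    simplification rather than something stated on the slide.
    """
    hits = sum(1 for repair in repairs.values() if b in cluster_of(repair, a))
    return hits / len(repairs)


p1_p2 = same_entity_fraction(REPAIRS, "P1", "P2")
p1_p5 = same_entity_fraction(REPAIRS, "P1", "P5")
p3_p4 = same_entity_fraction(REPAIRS, "P3", "P4")
```

P3 and P4 are merged in every repair, P1 and P2 in two of the three, and P1 and P5 in only one. A query over such a model can return answers annotated with this kind of support instead of committing to a single clean instance.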
Thank You
www.qcri.qa