
Data Quality: Not Your Typical Database Problem


DESCRIPTION

Ahmed K. Elmagarmid (IEEE Fellow and ACM Distinguished Scientist) gave a lecture on Data Quality: Not Your Typical Database Problem in the Distinguished Lecturer Series - Leon The Mathematician.


Page 1: Data Quality: Not Your Typical Database Problem

Data Quality: Not Your Typical Database Problem

2011 © Copyright QCRI. Confidential document.

Ahmed Elmagarmid

Executive Director, Qatar Computing Research Institute

Page 2: Data Quality: Not Your Typical Database Problem

Where are we located?


Page 3: Data Quality: Not Your Typical Database Problem


Page 4: Data Quality: Not Your Typical Database Problem


Page 5: Data Quality: Not Your Typical Database Problem

Qatar Foundation


Page 6: Data Quality: Not Your Typical Database Problem

EDUCATION

SCIENCE & RESEARCH

COMMUNITY DEVELOPMENT

2.8 percent of GDP to be spent on research annually by 2015

Page 7: Data Quality: Not Your Typical Database Problem

Qatar Foundation Research Division

• Qatar Biomedical Research Institute (QBRI)
• Qatar Energy & Environment Research Institute (QEERI)
• Qatar Computing Research Institute (QCRI)

Page 8: Data Quality: Not Your Typical Database Problem

QCRI Overview


Page 9: Data Quality: Not Your Typical Database Problem

QCRI Vision

To make Qatar a global center for computing research by becoming the world’s recognized leader in Arabic language technologies and in key areas vital to the global growth of Qatari business and entrepreneurial activity.

Page 10: Data Quality: Not Your Typical Database Problem

QCRI Model

Academia
• Individual projects
• Students move on
• Theoretical & basic research

National Institutions (QCRI)
• Grand practical challenges
• National and global impact
• Localized skills & knowledge
• Large teams and long term
• Example peers: INRIA, MPI

Research Parks
• Commercialization
• Entrepreneurship
• Incubation

Diagram axes: Basic Research / Applied Research; Project-based / Grand Challenges

Page 11: Data Quality: Not Your Typical Database Problem

QCRI Ecosystem

Diagram labels: QCRI, QU, HKU, QEERI, QBRI, Sidra, MIT, Boeing, Aljazeera, Yahoo, Google, Microsoft, ALTIS, QSA, MEEZA, QSTP, QP, Energy Co., WikiMedia, IBM

Page 12: Data Quality: Not Your Typical Database Problem

QCRI Research Centers

• Arabic Language Technologies
• Social Computing
• Scientific Computing
• Data Analytics
• Cloud Computing

Page 13: Data Quality: Not Your Typical Database Problem

QCRI Scientific Advisory Council

• Prof. Rich DeMillo, Georgia Tech, Chair
• Prof. Joichi Ito, MIT Media Lab Director
• Prof. Ruzena Bajcsy, University of California – Berkeley
• Lord Rupert Redesdale, UK House of Lords
• Lew Tucker, Vice President, Cisco
• Prof. Alfred V. Aho, Columbia University
• Yousef Khalidi, Vice President, Microsoft
• Prof. Dick Lipton, Georgia Tech

Page 14: Data Quality: Not Your Typical Database Problem

The 60 Doers!

Groups: Scientific Computing, Data Analytics, Cloud Computing, Arabic Language Technologies, Social Computing, Management and Support Team

Names shown: Rashid, Kamal, Halima, Kulood, Ihab, Mourad, Michele, John, Chu, Amr, ElKindi, Nan, Paolo, Ahmed, Abdellatif, Agathe, Jill, Melissa, Amal, Nada, Samreen, Richard P., Richard, Hend, Simon G., Mohamed, Simon P., Maged, William, Khulood, Amira, Ahmed A., Gokop, Mustafa, Kareem, Stephan, Wei, Preslav, Lolwa, Ahmed T., Francisco, ThuyLinh, Safdar, Sihem, Aysha, Sofiane, Mikalai, Ruth, Gautam, Samreen, Othmane, Ahmed A., Aybuke, Shameem, Walid, Peng, Ahmed M., Ahmed T., Khaled, Tarek

Page 15: Data Quality: Not Your Typical Database Problem

Strategic Partnerships


Page 16: Data Quality: Not Your Typical Database Problem

Agenda: Strategic Partnerships


Page 17: Data Quality: Not Your Typical Database Problem

5-YEAR QCRI MANPOWER PLAN

Year     Manpower   Change
10-11    21
11-12    34         +13
12-13    82         +48
13-14    102        +20
14-15    110        +8

Page 18: Data Quality: Not Your Typical Database Problem

This Talk


Data Quality


Page 19: Data Quality: Not Your Typical Database Problem

Data Quality

Enhancing the usability of the acquired data and increasing the confidence of query results

"Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." - Gartner Group

Page 20: Data Quality: Not Your Typical Database Problem

Dirty Data is Expensive

• Real-life data is often dirty: data error rates in industry range from 1% to 30% (Redman, 1998)
• In 2009, the Obama administration offered $19 billion in grants for health IT, i.e. to improve EMRs
• Erroneously priced data in retail databases costs US customers $2.5 billion each year
• The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year (2002)

Page 21: Data Quality: Not Your Typical Database Problem

Where to start? Data Quality everywhere!

• Data Entry
• Information Extraction
• Integration from multiple sources
• Standardization and transformation
• Business rules compliance

Page 22: Data Quality: Not Your Typical Database Problem

“Academic” Data Cleaning

● Pick a well-understood data problem under some scoping assumptions and solve it independently
 – Duplicates
 – Functional dependency violations
 – Matching dependency violations
 – Missing value imputation
● Piecemeal approach to tackle the complexity and sometimes the intractability of the problem
 – Repairing violations of FD constraints in special cases (no deletion, left-hand-side changes only, allowing variables, etc.)
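To make one of the scoped problems above concrete, here is a minimal sketch of detecting functional dependency (FD) violations. The FD Name → ZIP, the function name `fd_violations`, and the choice of rows are illustrative assumptions, not something stated in the talk; the sample values are borrowed from the Person table shown later in the deck.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)
    violations = {}
    for key, members in groups.items():
        # More than one distinct right-hand side for the same key breaks the FD.
        rhs_values = {tuple(m[attr] for attr in rhs) for m in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations

# Hypothetical FD for illustration: Name -> ZIP.
rows = [
    {"ID": "P1", "Name": "Green", "ZIP": "51519"},
    {"ID": "P2", "Name": "Green", "ZIP": "51518"},
    {"ID": "P3", "Name": "Peter", "ZIP": "30528"},
    {"ID": "P4", "Name": "Peter", "ZIP": "30528"},
]
violations = fd_violations(rows, lhs=["Name"], rhs=["ZIP"])
for key, members in violations.items():
    print(key, [m["ID"] for m in members])
```

Detection is the easy half; choosing which cell to change is where the scoping assumptions listed above come in.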

Page 23: Data Quality: Not Your Typical Database Problem

“Academic” Data Cleaning

• Despite their theoretical and algorithmic beauty, rarely used
 – Problems never exist in isolation
 – Fixes to one problem often introduce “other” problems
 – Data usually not accessible to mess with
 – Integrity constraints!... What integrity constraints?!!

Page 24: Data Quality: Not Your Typical Database Problem

“Practitioner” Data Cleaning

• Will share some scary stories
 – “post-it notes” as an expert messaging system
 – “written permission” to change value of a record
 – Default values and best practices
 – “Call John.. He will know what to do”

Page 25: Data Quality: Not Your Typical Database Problem

This Talk

● A few data quality challenges and (hopefully) research directions
● Summary of recent efforts at QCRI

Page 26: Data Quality: Not Your Typical Database Problem

10 Data Quality Issues


Page 27: Data Quality: Not Your Typical Database Problem

Issue 1: The data trio


Quality

DATA

Page 28: Data Quality: Not Your Typical Database Problem

Extraction remains a key source of data errors

Acquiring the semantics/schema of the underlying unstructured data sources (documents, emails, related Web info, click traces, profiles, interests, etc.)

Page 29: Data Quality: Not Your Typical Database Problem

Integration aggravates the problem

Linked data as an attempt to live with errors .. link as you go


Page 30: Data Quality: Not Your Typical Database Problem


Page 31: Data Quality: Not Your Typical Database Problem

Issue 2: Data level or application level

• Cleaning data tables by trusting the table schema is rarely useful
• Will share a story
 – Bellcore with 1800 inter-linked databases
 – Rule-based logic for sanity checking
 – Post-it messages to communicate between data quality officers .. who work in shifts!
 – Data cleaning action is meaningless if not tied to a business logic or to a process. Should never be against FDs

Page 32: Data Quality: Not Your Typical Database Problem

Issue 3: Protect your gain: DQ Dashboard

● How to protect against going backwards
● How to protect your gains during the cleansing process
● Metrics:
 – Minimality principle: widely used in academic cleaning
 – Value of information: to spot the most important problem to fix
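The minimality principle named above can be sketched as follows. This is a minimal illustration that assumes the repair cost is simply the number of changed cells; the function name `repair_cost`, the candidate repairs, and that cost model are assumptions made for the example, not a definition given in the talk.

```python
def repair_cost(original, repaired):
    """Count cells whose value differs between two tables with the same rows and columns."""
    return sum(
        1
        for before, after in zip(original, repaired)
        for column in before
        if before[column] != after[column]
    )

original = [
    {"Name": "Green", "ZIP": "51519"},
    {"Name": "Green", "ZIP": "51518"},
]

# Two hypothetical repairs that both make every "Green" row share one ZIP.
candidates = {
    "fix_one_zip": [
        {"Name": "Green", "ZIP": "51519"},
        {"Name": "Green", "ZIP": "51519"},
    ],
    "rewrite_both": [
        {"Name": "Green", "ZIP": "51520"},
        {"Name": "Green", "ZIP": "51520"},
    ],
}

costs = {name: repair_cost(original, repair) for name, repair in candidates.items()}
best = min(costs, key=costs.get)  # minimality: prefer the repair that changes the fewest cells
print(costs, best)
```

A cell-count cost treats every change as equally important, which is the gap a value-of-information metric is meant to address.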

Page 33: Data Quality: Not Your Typical Database Problem

Issue 3: Protect your gain - Ideas

• Root-cause analysis for data cleaning

• Chase problems to the source to reason about “progress”


• Leveraging “Provenance” to design progress meters


Page 34: Data Quality: Not Your Typical Database Problem

Issue 4: Data is not an orphan!

● Data stewards are not imaginary characters! Important data has stewards and custodians
● Need to go through these guardians first
 – Some health care requires a signed form per changed cell stating reasons for change
● Possible approaches:
 – How to avoid stewards?
 – How to integrate them in the process or minimize their involvement?

Page 35: Data Quality: Not Your Typical Database Problem

Issue 5: How clean is clean?

• Quality awareness eats up 10% of the budget [Telecom Experience]
• How to avoid over-cleaning
• Example: “Bill Forgiveness”, a real-life experience: roaming charges and cross-carrier calls have a very complicated business model
• Possible approaches
 – Measure cleaning progress
 – Clean only to satisfy some application needs

Page 36: Data Quality: Not Your Typical Database Problem

Issue 6: Online cleaning a necessity not a feature

● We live in a complex world → complex applications with 100s and 1000s of components and parameters
● Clean as you go .. Clean on demand .. Clean opportunistically .. Can be the only hope
● New concepts:
 – Iterative cleaning
 – Cleaning dynamic and evolving data
● Off-line cleaning can still benefit historical data but is becoming less and less important

Page 37: Data Quality: Not Your Typical Database Problem

Issue 7: Application quality

• Data Quality → Information Quality → Application quality
• Recognizes the levels of complexity in current BI apps
• Data usage should influence data cleaning
 – “Usage-based” data cleaning

Page 38: Data Quality: Not Your Typical Database Problem

Issue 8: SW engineering DQ

• Current focus on discrete values with simple integrity constraints (FD, uniqueness…)
• We are good at checking if data complies with rules
• Real business rules are often “assertions” expressed in “Turing-complete” languages
• Checking “did we write the assertions right?” becomes a lot harder
• But also.. need to think if we wrote the right assertions!

Page 39: Data Quality: Not Your Typical Database Problem

Issue 9: DQ Theory?

• The ACID properties in transaction management were not only sensible requirements; they also came with algorithms and methods to enforce them during transaction processing
• Does it make sense to do the same for Quality? Plausible properties along with actions for maintaining acceptable quality during data manipulation
• Some of these already exist: Timeliness, Currency, Consistency, etc. but lack methods of enforcement

Page 40: Data Quality: Not Your Typical Database Problem

Issue 10: Scale .. Scale

• Terabytes and petabytes of data require new ways to enforce data quality
• Which ball to drop
• Leveraging application semantics and data usage
• Sampling to learn from the few and apply on the masses
• Active learning to replace human feedback (GDR as a solution)
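As one reading of “sampling to learn from the few”, the sketch below estimates a table's error rate from a random sample instead of checking every record. The function `estimate_error_rate`, the `is_dirty` check (an empty ZIP counts as dirty), and the synthetic records are all assumptions made for illustration.

```python
import random

def estimate_error_rate(records, is_dirty, sample_size, seed=0):
    """Estimate the fraction of dirty records from a uniform random sample."""
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    dirty = sum(1 for record in sample if is_dirty(record))
    return dirty / len(sample)

# Synthetic data: 900 clean records and 100 with a missing ZIP (true rate 10%).
records = [{"zip": "51519"} for _ in range(900)] + [{"zip": ""} for _ in range(100)]

def is_dirty(record):
    return record["zip"] == ""

estimate = estimate_error_rate(records, is_dirty, sample_size=200)
print(f"estimated error rate from 200 of {len(records)} records: {estimate:.3f}")
```

The estimate only needs the expensive check on the sampled records, which is the point when the full table is terabytes in size.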

Page 41: Data Quality: Not Your Typical Database Problem

Sample QCRI Projects


Page 42: Data Quality: Not Your Typical Database Problem

GDR – Guided Data Repair

• Scalable ways to involve experts
• Repurposing destructive automatic techniques to guide repairs
• Value of Information measures to generate the most important questions
• Judicious use of active learning from user feedback

Diagram: Input Database Instance → Detect Errors and Violations → Learn and Repair Database → Clean Database Instance; User Query → Results

Page 43: Data Quality: Not Your Typical Database Problem

GDR Architecture


Page 44: Data Quality: Not Your Typical Database Problem

Probabilistic Data Cleaning

Diagram labels: Input Database Instance; Uncertain Error Detection; Possible Repair Generation; Possible Clean Instance; Clean Database Instance; User Query; Probabilistic Results

Page 45: Data Quality: Not Your Typical Database Problem

Possible Repairs

A possible repair is a clustering of the input tuples

Person

ID   Name    ZIP     Income
P1   Green   51519   30k
P2   Green   51518   32k
P3   Peter   30528   40k
P4   Peter   30528   40k
P5   Gree    51519   55k
P6   Chuck   51519   30k

Uncertain Clustering: Possible Repairs

X1: {P1}, {P2}, {P3,P4}, {P5}, {P6}
X2: {P1,P2}, {P3,P4}, {P5}, {P6}
X3: {P1,P2,P5}, {P3,P4}, {P6}
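The idea that each possible repair is a clustering of the input tuples can be sketched in a few lines. The three clusterings below are the ones shown on this slide; the helper names `is_partition` and `count_entities`, and the choice of “how many distinct persons are there?” as the example query, are assumptions made for illustration.

```python
# Tuple IDs from the Person table on this slide.
universe = {"P1", "P2", "P3", "P4", "P5", "P6"}

# Each possible repair groups tuples believed to describe the same person.
possible_repairs = {
    "X1": [{"P1"}, {"P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X2": [{"P1", "P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X3": [{"P1", "P2", "P5"}, {"P3", "P4"}, {"P6"}],
}

def is_partition(clusters, all_ids):
    """True if the clusters are pairwise disjoint and together cover every tuple."""
    seen = set()
    for cluster in clusters:
        if seen & cluster:
            return False
        seen |= cluster
    return seen == all_ids

def count_entities(clusters):
    """Answer 'how many distinct persons?' under one possible repair."""
    return len(clusters)

for name, clusters in possible_repairs.items():
    assert is_partition(clusters, universe)
    print(name, count_entities(clusters))
```

The same query gets a different answer under each repair, which is why query results over uncertain repairs are naturally probabilistic.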

Page 46: Data Quality: Not Your Typical Database Problem

Thank You


www.qcri.qa