28
Data Quality: Common Problems & Checks David M. Walker [email protected] +44 (0) 7050 028 911 - http://www.datamgmt.com Data Management & Warehousing Date: 24 April 2009 Location: Zagreb, Croatia

ETIS09 - Data Quality: Common Problems & Checks - Presentation

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Quality:

Common Problems & Checks

David M. Walker

[email protected]

+44 (0) 7050 028 911 - http://www.datamgmt.com

Data Management & Warehousing

Date: 24 April 2009

Location: Zagreb, Croatia

Page 2: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

24 April 2009 © 2009 Data Management & Warehousing Page 2

Agenda

•! Introduction

•! Common Problems

•! Automated Checking

•! Profiling Checks

•! Conclusions

Page 3: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Introduction

•! Data Quality problems are a SOURCE SYSTEM

issue and an ETL issue

–! They just manifest themselves in the data warehouse

•! Prevention is better than cure

–! Fixing the source system or the ETL is ALWAYS

cheaper and more effective than cleaning the data in the ETL or in the Data Warehouse itself

•! Data Quality is a continuous process

–! It is never finished and always needs to be monitored

24 April 2009 © 2009 Data Management & Warehousing Page 3

Page 4: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

The Impact of Poor Data Quality

•! Devalues the data warehouse

–!Discourages people from trusting or using the

system and therefore curtailing the life of the

data warehouse

•! Highlights failings in the source system

and/or the business process

–!Businesses would rather fix at any cost in the

data warehouse and pretend that there isn’t a

source system problem

24 April 2009 © 2009 Data Management & Warehousing Page 4

Page 5: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Common Problems

•! 11 types of problem that account for the

most common problems

•! They usually reflect poor design and/or

implementation of systems

•! Most can be fixed or monitored and

managed to limit the impact

24 April 2009 © 2009 Data Management & Warehousing Page 5

Page 6: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Referential Issues

•! Keys that are not unique

–!Systems that do not enforce the unique

primary key or flat files or spreadsheets

–!Also generated by ETL that creates a

surrogate key incorrectly

•! Referential Integrity Failures

–!Where referentially integrity is not enforced

values in the child table are created that are

not in the parent table

24 April 2009 © 2009 Data Management & Warehousing Page 6

Page 7: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Type Issues

•! Format Errors

–!Typically in Date/Time type fields

–!02/04/2009 2nd April (UK) or 4th Feb (US)

•! Inappropriate Data Types

–!Storing Dates in Character Strings

20090624 as YYYYMMDD format string

–!But what about 20090230?

24 April 2009 © 2009 Data Management & Warehousing Page 7

Page 8: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Model Issues

•! De-normalised tables

–! Commonly created for performance reasons

–! Inherently duplicates data

–! Often gets out of sync

•! Data/Column Retirement

–! Upgrade to system retires a column

–! ETL continues to use the old column

•! Poor Table/Column Naming

–! Don’t assume that a column does what it says

–! Don’t assume that a column is still being used for it’s original

purpose

24 April 2009 © 2009 Data Management & Warehousing Page 8

Page 9: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Content Issues

•! Null Values

–!Systems that have many optional fields will

often have missing values

–!Null values allow rows to be silently omitted

from queries

•! Inappropriate Values

–!Databases allow special characters and/or

leading/trailing white space

–! “DataspaceQuality” != “DatatabQuality”

24 April 2009 © 2009 Data Management & Warehousing Page 9

Page 10: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Feed Issues

•! Missing Data

–!Where a stream of files are loaded by ETL if

one is dropped it can go un-noticed

–!Common with CDR type loads in Telcos

•! Late Data

–!A short term data quality issue

–!Leaves users believing there is a problem

–!Produces inconsistent reporting over time

24 April 2009 © 2009 Data Management & Warehousing Page 10

Page 11: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Automated Checking

•! Regularly run checks

•! Broad Coverage across systems

–! 100s and 1000s not 10s of queries

–! Run in a low priority loop in the background

•! Used against:

–! Sources

–! Data Staging

–! Data Warehouse

•! No Product Required

–! We often implement this as a controlling shell script and lots of

small scripts, one for each check

24 April 2009 © 2009 Data Management & Warehousing Page 11

Page 12: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Trending

•! Absolute Trending

–! Track an expected value over time

•! e.g. Returned Mail is usually less than 500 items per day

•! If the value is <500 status is green, 501 to 1000 amber and >1000 red

•! Statistical Process Control (SPC) Trending

–! Track an expected value where the value changes over time

•! e.g. Telco CDRs – expect more as the company grows

•! Don’t want to be continuously changing the threshold

•! Compare current load to historical means

•! If current load within 2 Standard Deviations – Green,

3 Standard Deviations – Amber,

4 or more Standard Deviations - Red

24 April 2009 © 2009 Data Management & Warehousing Page 12

Page 13: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Flow Control

•! Flow Control

–! ETL manipulates data

•! Joins, De-duplicates, Filters, Aggregates, etc

–! Use the formula:

Source Count - Filtered Count – DeDup Count – Target Count = 0

•! Trusted Source

–! Compare the result with a third system

–! e.g. Does the Count of Switch CDRs =

Count of those processed by the billing system Count of those processed in the DWH

24 April 2009 © 2009 Data Management & Warehousing Page 13

Page 14: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Business Rules Based

•! Specific rules to match known business

rules

–!Account holders > 18 (Sys Date – DoB)

–!Account holders < 115

–!Credit Card numbers are 16 digits long

–!Number of accounts without a status

•! Result should yield Zero

24 April 2009 © 2009 Data Management & Warehousing Page 14

Page 15: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Automated Checking - Ops

•! Managed by exception

–!Red given priority

–!Amber are always followed up

•! Massive number of checks

–!100’s are good

–!1000’s are better

•! Presentation

–!Alerts, RAG, Graphical, Numerical, etc.

24 April 2009 © 2009 Data Management & Warehousing Page 15

Page 16: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Profiling Checks

•! Run manually because they need to be

interpreted by a human

•! Leads to new business rules being added

to the automated checks

•! Can be done with simple reporting tools or

commercial data profiling tools

24 April 2009 © 2009 Data Management & Warehousing Page 16

Page 17: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Frequency Outliers

•! Count discreet values in a table and check

items with many more or less than normal

–!e.g. DoB 01-01-01 many times more common

than any other value indicates source default

and something that needs work

–!e.g. Count of SMS messages significantly

lower on a given day may equate to a genuine

system failure and therefore not a DQ

problem

24 April 2009 © 2009 Data Management & Warehousing Page 17

Page 18: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Maximum & Minimum

•! Determine what a valid range for any

value should be

–!e.g. age between 18 and 115

–! Immediately finds individual data quality

issues that can be resolved

–!Allows an analyst to create new business

rules to prevent future problems

24 April 2009 © 2009 Data Management & Warehousing Page 18

Page 19: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Sequential Keys

•! If a system has a sequential key:

Max Value – Min Value – Count = 0

•! If this is true – is it too perfect for an

operational system and therefore test data

•! If this is false – what has caused the gaps,

are the deletions intentional?

24 April 2009 © 2009 Data Management & Warehousing Page 19

Page 20: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Types

•! Validation of mis-used data types before

loading

–!e.g. Dates in Character fields

–!Format: YYYYMMDD

–!Check MM between 01 and 12

–!Check DD between 01 and 31

–!Check MMDD does not include 0230, 0231

–!etc.

24 April 2009 © 2009 Data Management & Warehousing Page 20

Page 21: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Skewed Pattern Profiling

•! Looking for specific patterns in data

–!e.g. UK National Insurance Numbers (?) have

the format AA 99 99 99 A

–!Pattern match all values looking for

exceptions

•! Number Lengths are a special case

–!e.g Credit Card Numbers are 16 digits long

24 April 2009 © 2009 Data Management & Warehousing Page 21

Page 22: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Content Checking

•! Content Checking is the manual review of

character strings

•! Needs a good understanding of the nature

of the data

•! Often determines the need to do analysis

of other types

24 April 2009 © 2009 Data Management & Warehousing Page 22

Page 23: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Nulls & White Space

•! Nulls

–! Fields that have large proportion of nulls are usually

not useful

–! Also common is default status of null

(e.g. Account us either closed or null)

•! White Space

–! Not Null fields with a single space

–! Tab instead of space

–! Leading/Trailing white space

–! Double White Space: “DavidSpaceSpaceWalker”

24 April 2009 © 2009 Data Management & Warehousing Page 23

Page 24: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Punctuation & Control Chars

•! Punctuation

–!CSV files that are not properly quoted perform

field shifts

–!Address lines with extra commas

•! Control Characters

–!Data fields that contain ASCII character codes

0 to 31 and 127 to 159 are often ‘invisible’

when viewed in queries but cause failures

–!Also be aware of ‘code-page’ specifics

24 April 2009 © 2009 Data Management & Warehousing Page 24

Page 25: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Problem Management Matrix

24 April 2009 © 2009 Data Management & Warehousing Page 25

Page 26: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Continuous DQ Process

24 April 2009 © 2009 Data Management & Warehousing Page 26

Page 27: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Quality is FREE …

… as long as you are prepared to

INVEST HEAVILY in it

Philip Crosby 1980

Especially true of

Data Quality

24 April 2009 © 2009 Data Management & Warehousing Page 27

Page 28: ETIS09 -  Data Quality: Common Problems & Checks - Presentation

Data Quality:

Common Problems & Checks

David M. Walker

[email protected]

+44 (0) 7050 028 911 - http://www.datamgmt.com

Data Management & Warehousing

Thank You