2002 11 12DeMaio DataQualityIssues

  • 8/10/2019 2002 11 12DeMaio DataQualityIssues

    1/28

    Understanding Data Quality Issues:

    Finding Data Inaccuracies

    Art DeMaio

    Evoke Software

    VP Technical Sales Support

    Agenda

    Why is Understanding Data Important?

    Methodology for Assessing Data
      Defining
      Weighting
      Profiling
      Revisiting
      Finding
      Addressing
      Maintaining

    What is Profiling?

    Benefits of the Assessment

    What the Experts Say

    "Information quality is not an esoteric notion; it directly affects the effectiveness and efficiency of business processes. Information quality also plays a major role in customer satisfaction."

    - Larry P. English

    What's in Your DATA

    "three-quarters (of participating companies) reported significant problems as a result of defective data, with a third failing to bill or collect receivables as a result."

    - In a PricewaterhouseCoopers survey of 600 CIOs, IT directors or similar executives

    What is Data Quality?

    Accuracy of Content

    Structure

    Completeness

    Timeliness

    Presentation

    Assessing Your Data

    [Diagram: assessment cycle applied to the Source Data]

    1-Define Issues
    2-Weight/Impact
    3-Profile Data
    4-Revisit Definitions, Weights
    5-Findings
    6-Address
    7-Maintain

    Defining Issues

    Standard list

    Key requirements

    Content

    Structure

    Completeness

    Update list by project or source

    Defining Issues - Sample

    Constants

    Definition Mismatches

    Filler Containing Data

    Inconsistent Cases

    Inconsistent Data Types

    Inconsistent Null Rules

    Invalid Keys

    Invalid Values

    Miscellaneous

    Missing Values

    Orphans

    Out of Range

    Pattern Exceptions

    Potential Constants

    Potential Defaults

    Potential Duplicates

    Potential Invalids

    Potential Redundant Values

    Potential Unused Fields

    Rule Exceptions

    Unused Fields
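
Several of the issue types listed above can be checked mechanically. The following is a minimal sketch, assuming a column is available as a list of strings. The function names, the sample column, and the two-letter pattern are illustrative assumptions and do not come from the slides.

```python
import re

def is_constant(values):
    """Constants: every non-blank entry holds the same value."""
    non_blank = {v for v in values if v.strip()}
    return len(non_blank) == 1

def missing_rate(values):
    """Missing Values: fraction of entries that are blank."""
    if not values:
        return 0.0
    return sum(1 for v in values if not v.strip()) / len(values)

def has_inconsistent_case(values):
    """Inconsistent Cases: the same value appears with different casing."""
    seen = {}
    for v in values:
        if not v.strip():
            continue
        key = v.lower()
        if key in seen and seen[key] != v:
            return True
        seen.setdefault(key, v)
    return False

def pattern_exceptions(values, pattern):
    """Pattern Exceptions: non-blank entries that do not match the expected pattern."""
    regex = re.compile(pattern)
    return [v for v in values if v.strip() and not regex.fullmatch(v)]

# Hypothetical sample column, for illustration only.
state_codes = ["TX", "tx", "CA", "", "Texas", "NY"]
exceptions = pattern_exceptions(state_codes, r"[A-Z]{2}")
```

Each check maps to one issue type, so the count of columns failing a check can feed the findings for that type.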

    Weight Impact

    After the issues are initially identified:

    Some issues are more critical than others

    Weights are not priorities

    Assign a weighting factor (1-5)

    Weighting factors SHOULD change by project
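
One way to let weighting factors change by project is to keep a default weight per issue type and apply project-specific overrides. This is a minimal sketch under that assumption. The helper `weights_for_project`, the selection of issue types, and the example project are hypothetical.

```python
# Default weighting factors (1-5) for a few issue types; illustrative values.
DEFAULT_WEIGHTS = {
    "Invalid Keys": 5,
    "Orphans": 4,
    "Missing Values": 3,
    "Inconsistent Cases": 1,
}

def weights_for_project(defaults, overrides=None):
    """Return the weights for one project: defaults plus project-specific overrides."""
    weights = dict(defaults)
    weights.update(overrides or {})
    for issue, weight in weights.items():
        if weight not in (1, 2, 3, 4, 5):
            raise ValueError(f"weight for {issue!r} must be 1-5, got {weight}")
    return weights

# Hypothetical project in which casing matters more, e.g. matching on names.
crm_weights = weights_for_project(DEFAULT_WEIGHTS, {"Inconsistent Cases": 4})
```

The defaults stay untouched, so each project records only the weights it deliberately changed.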

    Profile Data

    What does Data Profiling mean?

    What is Data Profiling?

    The use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure and quality.

    A process of developing information about data instead of information from data.

    What is Data Profiling?

    Information About Data: (Data Profiling)
      30% of entries in SUPPLIER_ID are blank
      The range of values in UNIT_PRICE is 5.99 to 4599.99
      There are 14 ORDER_HEADER rows with no ORDER_DETAIL rows

    Information FROM Data: (not Data Profiling)
      Texas auto buyers buy more Cadillacs per capita than any other state
      The average mortgage amount increased last year by 6%
      10% of last year's customers did not buy anything this year
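
The three "information about data" statements can each be produced by a small query over the data. The sketch below is one possible illustration in plain Python. The sample rows, the `products` list, and the use of ORDER_ID as the join key are assumptions. Only the names SUPPLIER_ID, UNIT_PRICE, ORDER_HEADER and ORDER_DETAIL come from the slide.

```python
# Hypothetical sample rows, for illustration only.
products = [
    {"SUPPLIER_ID": "S1", "UNIT_PRICE": 5.99},
    {"SUPPLIER_ID": "", "UNIT_PRICE": 4599.99},
    {"SUPPLIER_ID": "S2", "UNIT_PRICE": 120.00},
    {"SUPPLIER_ID": None, "UNIT_PRICE": 42.50},
]
order_header = [{"ORDER_ID": 1}, {"ORDER_ID": 2}, {"ORDER_ID": 3}]
order_detail = [
    {"ORDER_ID": 1, "LINE": 1},
    {"ORDER_ID": 1, "LINE": 2},
    {"ORDER_ID": 3, "LINE": 1},
]

def blank_rate(rows, column):
    """Fraction of rows whose value in `column` is None or blank."""
    blanks = sum(1 for r in rows if r[column] is None or str(r[column]).strip() == "")
    return blanks / len(rows)

def value_range(rows, column):
    """Smallest and largest non-null value in `column`."""
    values = [r[column] for r in rows if r[column] is not None]
    return min(values), max(values)

def headers_without_details(headers, details, key):
    """Header rows that have no matching detail row on `key`."""
    detail_keys = {d[key] for d in details}
    return [h for h in headers if h[key] not in detail_keys]

print(f"{blank_rate(products, 'SUPPLIER_ID'):.0%} of entries in SUPPLIER_ID are blank")
low, high = value_range(products, "UNIT_PRICE")
print(f"the range of values in UNIT_PRICE is {low} to {high}")
childless = headers_without_details(order_header, order_detail, "ORDER_ID")
print(f"there are {len(childless)} ORDER_HEADER rows with no ORDER_DETAIL rows")
```

None of these results says anything about the business. They describe the data itself, which is the distinction the slide draws.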

    Profile Data

    This is a multi-step process

    Collect documentation

    Review the DATA itself

    Compare data to documentation

    Identify and detail specific issues
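
The "compare data to documentation" step can be sketched as a check of each row against the documented rules for its columns. The column names, the documented rules, and the mapping of each mismatch to an issue type below are all illustrative assumptions.

```python
# Hypothetical documentation for two columns: expected type and null rule.
documented = {
    "CUSTOMER_ID": {"type": int, "nullable": False},
    "ZIP_CODE": {"type": str, "nullable": True},
}

# Hypothetical rows as they actually appear in the source data.
rows = [
    {"CUSTOMER_ID": 101, "ZIP_CODE": "75001"},
    {"CUSTOMER_ID": None, "ZIP_CODE": 75002},
    {"CUSTOMER_ID": 103, "ZIP_CODE": None},
]

def compare_to_documentation(rows, documented):
    """Return (row number, column, issue type) for each mismatch found."""
    issues = []
    for number, row in enumerate(rows, start=1):
        for column, spec in documented.items():
            value = row.get(column)
            if value is None:
                # A null is only an issue when the documentation forbids it.
                if not spec["nullable"]:
                    issues.append((number, column, "Inconsistent Null Rules"))
            elif not isinstance(value, spec["type"]):
                issues.append((number, column, "Inconsistent Data Types"))
    return issues

issues = compare_to_documentation(rows, documented)
```

Recording the row and column with each issue gives the detail needed to identify and describe specific issues.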

    Revisit

    Review the issues and weights

    Should there be more or fewer issues?

    What are they?

    Is the relative importance of each issue different?

    Findings

    Your findings tell others about the data

    Documented reports and/or charts

    Results database

    Quality Assessment Score

    Findings-Chart

    [Bar chart: Sample Company Issue Findings. Axes: Issue Category, Count of Issue (scale 0 to 25). Legend: Constant, Definition Mismatch, Filler Containing Data, Inconsistent Case, Inconsistent Data Type, Inconsistent Null Rule, Invalid Keys, Invalid Values, Miscellaneous, Missing Values, Orphans, Out of Range, Pattern Exception, Potential Constant, Potential Default, Potential Duplicates, Potential Invalid, Potential Redundant, Potential Unused, Rule Exceptions, Unused]

    Findings-Chart

                                    Issues  Possible
    Issue Type                  Discovered    Issues
    Constants                            1        59
    Definition Mismatches                4        59
    Filler Containing Data               1        59
    Inconsistent Cases                   3        59
    Inconsistent Data Types             15        59
    Inconsistent Null Rules              6        59
    Invalid Keys                         1         3
    Invalid Values                       1        59
    Miscellaneous                       10        59
    Missing Values                      18        59
    Orphans                              2         2
    Out of Range                         3        59
    Pattern Exceptions                  10        59
    Potential Constants                  1        59
    Potential Defaults                   1        59
    Potential Duplicates                 3        59
    Potential Invalids                   4        59
    Potential Redundant Values          21        59
    Potential Unused Fields              1        59
    Rule Exceptions                      3         3
    Unused Fields                        1        59
    Total                              110      1070

    Raw Score 89.7%
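
The raw score is consistent with one minus the ratio of discovered issues to possible issues (110 of 1070). The slides do not state the formula, so the sketch below is an inference that reproduces the reported figure. The names `FINDINGS` and `raw_score` are illustrative.

```python
# (issue type, issues discovered, possible issues), copied from the findings table.
FINDINGS = [
    ("Constants", 1, 59),
    ("Definition Mismatches", 4, 59),
    ("Filler Containing Data", 1, 59),
    ("Inconsistent Cases", 3, 59),
    ("Inconsistent Data Types", 15, 59),
    ("Inconsistent Null Rules", 6, 59),
    ("Invalid Keys", 1, 3),
    ("Invalid Values", 1, 59),
    ("Miscellaneous", 10, 59),
    ("Missing Values", 18, 59),
    ("Orphans", 2, 2),
    ("Out of Range", 3, 59),
    ("Pattern Exceptions", 10, 59),
    ("Potential Constants", 1, 59),
    ("Potential Defaults", 1, 59),
    ("Potential Duplicates", 3, 59),
    ("Potential Invalids", 4, 59),
    ("Potential Redundant Values", 21, 59),
    ("Potential Unused Fields", 1, 59),
    ("Rule Exceptions", 3, 3),
    ("Unused Fields", 1, 59),
]

def raw_score(findings):
    """Share of possible issues that were NOT discovered."""
    discovered = sum(found for _, found, _ in findings)
    possible = sum(total for _, _, total in findings)
    return 1 - discovered / possible

print(f"Raw Score {raw_score(FINDINGS):.1%}")  # prints "Raw Score 89.7%"
```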

    Findings-Chart

    Weight                                  Issues  Possible
    Factor  Issue Type                  Discovered    Issues
    4       Constants                            1        59
    2       Definition Mismatches                4        59
    3       Filler Containing Data               1        59
    1       Inconsistent Cases                   3        59
    2       Inconsistent Data Types             15        59
    3       Inconsistent Null Rules              6        59
    5       Invalid Keys                         1         3
    5       Invalid Values                       1        59
    1       Miscellaneous                       10        59
    3       Missing Values                      18        59
    4       Orphans                              2         2
    5       Out of Range                         3        59
    4       Pattern Exceptions                  10        59
    2       Potential Constants                  1        59
    2       Potential Defaults                   1        59
    1       Potential Duplicates                 3        59
    3       Potential Invalids                   4        59
    4       Potential Redundant Values          21        59
    3       Potential Unused Fields              1        59
    5       Rule Exceptions                      3         3
    4       Unused Fields                        1        59
            Total                              110      1070

    Weighted Score 76.2%

    Findings-Chart

    Weight Factor                             5        4        3        2        1
    Issues identified in weight factor        8       35       30       21       16
    Average rate per factor              35.03%   31.19%   10.17%    8.90%    9.04%
    Total Average by weight              175.1%   124.7%    30.5%    17.8%     9.0%

    Weighted Issue Rate - 23.8%

    Weighted Assessment Score - 76.2%
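
The slides do not spell out how the weighted score is derived. The calculation below is inferred from the figures, which it reproduces. For each weight factor it averages the issue rates (discovered divided by possible) of the issue types carrying that weight, multiplies the average by the weight, then divides the sum by the total of the weight factors (5+4+3+2+1 = 15). The names `WEIGHTED_FINDINGS` and `weighted_issue_rate` are illustrative.

```python
from collections import defaultdict

# (weight factor, issue type, issues discovered, possible issues),
# copied from the weighted findings table.
WEIGHTED_FINDINGS = [
    (4, "Constants", 1, 59),
    (2, "Definition Mismatches", 4, 59),
    (3, "Filler Containing Data", 1, 59),
    (1, "Inconsistent Cases", 3, 59),
    (2, "Inconsistent Data Types", 15, 59),
    (3, "Inconsistent Null Rules", 6, 59),
    (5, "Invalid Keys", 1, 3),
    (5, "Invalid Values", 1, 59),
    (1, "Miscellaneous", 10, 59),
    (3, "Missing Values", 18, 59),
    (4, "Orphans", 2, 2),
    (5, "Out of Range", 3, 59),
    (4, "Pattern Exceptions", 10, 59),
    (2, "Potential Constants", 1, 59),
    (2, "Potential Defaults", 1, 59),
    (1, "Potential Duplicates", 3, 59),
    (3, "Potential Invalids", 4, 59),
    (4, "Potential Redundant Values", 21, 59),
    (3, "Potential Unused Fields", 1, 59),
    (5, "Rule Exceptions", 3, 3),
    (4, "Unused Fields", 1, 59),
]

def weighted_issue_rate(findings):
    """Weight-averaged issue rate, inferred from the slide's figures."""
    rates_by_weight = defaultdict(list)
    for weight, _, found, possible in findings:
        rates_by_weight[weight].append(found / possible)
    # Average rate per factor, multiplied by the factor, summed over factors.
    weighted_total = sum(
        weight * sum(rates) / len(rates) for weight, rates in rates_by_weight.items()
    )
    # Summing the dict sums its keys, i.e. the distinct weight factors.
    return weighted_total / sum(rates_by_weight)

rate = weighted_issue_rate(WEIGHTED_FINDINGS)
print(f"Weighted Issue Rate {rate:.1%}")            # prints "Weighted Issue Rate 23.8%"
print(f"Weighted Assessment Score {1 - rate:.1%}")  # prints "Weighted Assessment Score 76.2%"
```

Under this calculation the two issue types found at a 100% rate (Orphans and Rule Exceptions) carry weights 4 and 5, which is why the weighted score falls well below the raw score.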

    Address the Issues

    Addressing your findings

    Actual vs. Potential

    Subject Matter Expertise

    Cleansing Requirements

    Maintain Vigilance

    Maintain

    Complete the cycle

    Periodic review

    Document score changes

    Why Do The Assessment?

    Quantify the quality issues

    Isolate true problems

    Proactive review
      reduces the cost of resolving issues
      reduces the risk of customer dissatisfaction

    Define the scope of issues

    Determine the resources required to address issues

    Why Do The Assessment?

    [Chart: Project Costs. Labels: Project Timeline, When you find an Issue, Cost to Address an Issue]

    Why Should It Be Done?

    TIME

    Pay me now or Pay me later

    When Should It Be Done?

    Every IT data project
      Warehousing
      CRM
      ERP
      EAI
      M&A

    Ongoing based on
      Criticality of the system
      Current status (score)
      Need to re-purpose data

    Bibliography

    Larry P. English, Improving Data Warehouse and Business Information Quality, John Wiley & Sons Inc., 1999

    Jack Olson, Data Profiling: The Accuracy Dimension, Morgan Kaufmann, 2002

    Thomas C. Redman, Data Quality for the Information Age, Artech House, 1996

    PricewaterhouseCoopers, Global Data Management Survey, 2001