Do you have valid, consistent and accurate data?
Consider a data quality solution.

Bill Fehlner, Education, SAS
416 307-4513 [email protected]

[Diagram labels: functions, PROCs, Blue Fusion SDK, SAS Job, dfPower Studio, ODBC, SAS library, Data base, Vault, SAS/ACCESS, Data base]

Agenda

Why worry about data quality?

Data samples before and after cleaning.

Interactive Data Cleaning

Data Cleaning with programs

Where to learn more

Why Worry about Data Quality?

META Group – Ten to twenty percent of the data used to build data warehouses is corrupt or incomplete.

Data Warehousing Institute – Data quality problems are estimated to cost US corporations more than $600 billion per year (2002).

Primary Sources of Data Quality Problems

• Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002

Data Quality Characteristics

• Data quality affects several attributes associated with data:

– Accuracy – Is it realistic or believable?

– Consistency – Is it consistently defined and maintained?

– Validity – Is the data valid, based on business or industry rules and standards?

Agenda

Why worry about data quality?

Data samples before and after cleaning.

Interactive Data Cleaning

Data Cleaning with programs

Where to learn more

Definitely Inconsistent, and not that Accurate

The issues here

Inconsistent use of name prefixes

Inconsistent capitalization

Use of nicknames for given names

Misspelling of last names

Occasional use of middle name
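
The name issues listed above can be illustrated with a small Python sketch. It is not the SAS or DataFlux implementation: the function name standardize_name and the two lookup tables are hypothetical, and a real data quality product relies on much larger, locale-specific knowledge bases. Misspelled last names are not handled here.

```python
# Hypothetical lookup tables, kept tiny for illustration only.
PREFIXES = {"mr", "mrs", "ms", "miss", "dr"}
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def standardize_name(raw: str) -> str:
    """Return 'Given Family': prefix and middle names dropped,
    nickname expanded, simple proper casing applied."""
    # Lower-case every token and strip surrounding periods and commas.
    tokens = [t.strip(".,").lower() for t in raw.split()]
    tokens = [t for t in tokens if t]
    # Drop a leading name prefix such as "Mr." or "Dr".
    if tokens and tokens[0] in PREFIXES:
        tokens = tokens[1:]
    if not tokens:
        return ""
    # Expand a known nickname to its given name.
    given = NICKNAMES.get(tokens[0], tokens[0])
    if len(tokens) == 1:
        return given.capitalize()
    # Keep only the first and last token, which drops any middle name.
    family = tokens[-1]
    # Note: capitalize() is naive ("o'brien" becomes "O'brien").
    return f"{given.capitalize()} {family.capitalize()}"


print(standardize_name("MR. bill  SMITH"))   # William Smith
print(standardize_name("William J. Smith"))  # William Smith
```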

Accurate and Consistent

When name processing is not enough

The issues here

Inconsistent use of name prefixes

Inconsistent capitalization

Use of nicknames for given names

Misspelling of last names

Occasional use of middle name

Some apparently identical names have different addresses

More issues with addresses

Inconsistent use of punctuation

Inconsistent reference to direction

Inconsistent use of street extensions

Misspelling of street names
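
A minimal Python sketch of how the first three address issues above might be standardized. The function standardize_address and both lookup tables are assumptions made for illustration, not part of SAS or dfPower Studio, and misspelled street names are not corrected.

```python
import re

# Hypothetical standardization tables; real schemes are far more complete.
DIRECTIONS = {"north": "N", "n": "N", "south": "S", "s": "S",
              "east": "E", "e": "E", "west": "W", "w": "W"}
STREET_TYPES = {"street": "St", "st": "St", "avenue": "Ave", "ave": "Ave",
                "road": "Rd", "rd": "Rd", "boulevard": "Blvd", "blvd": "Blvd"}


def standardize_address(raw: str) -> str:
    """Remove punctuation and map directions and street types
    to one consistent abbreviation."""
    # Replace periods and commas with spaces so "St." and "St" look alike.
    cleaned = re.sub(r"[.,]", " ", raw)
    out = []
    for token in cleaned.split():
        key = token.lower()
        if key in DIRECTIONS:
            out.append(DIRECTIONS[key])
        elif key in STREET_TYPES:
            # Caveat: "St" can also mean "Saint"; this sketch ignores that.
            out.append(STREET_TYPES[key])
        else:
            out.append(token.capitalize())
    return " ".join(out)


print(standardize_address("123 king st. west"))   # 123 King St W
print(standardize_address("123 King Street W."))  # 123 King St W
```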

Joint processing of name and address

Common Processes in a Data Quality Initiative

Consistency Analysis

Standardization Schemes

Gender Analysis

Entity Analysis

Data Parsing

Data Casing

Tasks simplified by a Data Quality Initiative

Matching rows in multiple tables.

De-duplication of rows in one table.

Address Verification.
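
Once every row carries a comparable key, matching and de-duplication both reduce to grouping rows by that key. The Python sketch below shows the idea under stated assumptions: simple_key is a deliberately crude, hypothetical key (a real solution would use standardized values or match codes), and the record layout with "name" and "address" fields is invented for the example. Address verification needs reference data and is not covered.

```python
from collections import defaultdict


def simple_key(record: dict) -> str:
    """Hypothetical matching key: name and address lower-cased,
    with everything except letters and digits removed."""
    text = record["name"] + "|" + record["address"]
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch == "|")


def find_duplicates(records, key=simple_key):
    """De-duplication: return groups of rows in one table sharing a key."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return [group for group in groups.values() if len(group) > 1]


def match_tables(left, right, key=simple_key):
    """Matching: return (left_row, right_row) pairs whose keys agree."""
    index = defaultdict(list)
    for rec in right:
        index[key(rec)].append(rec)
    return [(l, r) for l in left for r in index.get(key(l), [])]


customers = [
    {"name": "William Smith", "address": "123 King St W"},
    {"name": "william smith", "address": "123 King St. W"},
    {"name": "Ann Lee", "address": "5 Elm Rd"},
]
# The first two rows differ only in case and punctuation,
# so they fall into the same duplicate group.
print(len(find_duplicates(customers)))  # 1
```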

Agenda

Why worry about data quality?

Data samples before and after cleaning.

Interactive Data Cleaning

Data Cleaning with programs

Where to learn more

DataFlux Functionality

[Diagram labels: functions, PROCs, Blue Fusion SDK, SAS Job, dfPower Studio, ODBC, SAS library, Data base, Vault, SAS/ACCESS, Data base]

dfPower Studio’s Vault

dfPower Studio’s Vault

dfPower Base - Analysis

dfPower Base - Analysis

dfPower Base - Analysis

dfPower Base – Analysis Report

dfPower Base – Analysis Report

dfPower Studio’s Vault

Agenda

Why worry about data quality?

Data samples before and after cleaning.

Interactive Data Cleaning

Data Cleaning with programs

Where to learn more

SAS – Proc Scheme

SAS – DQ functions

SAS – Proc Match
process name only

Page 30: Do you have valid, consistent and accurate data? Consider ...€¦ · Do you have valid, consistent and accurate data? Consider a data quality solution. PROCs functions Bill Fehlner,

30

Match Codes – name only
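
A match code is a key built so that near-identical values produce the same code. The Python sketch below is a loose illustration of that idea only. It does not reproduce the algorithm or the code format that SAS or DataFlux generate, and the function name_match_code and its max_len parameter are hypothetical.

```python
import re


def name_match_code(name: str, max_len: int = 6) -> str:
    """Build a crude fuzzy key for a person name.

    Each word keeps its first letter, loses later vowels (and Y) and
    repeated letters, and is cut to max_len characters. Words are
    sorted so that 'Smith, John' and 'John Smith' agree.
    """
    # Keep only letters and spaces, in upper case.
    letters = re.sub(r"[^A-Z ]", "", name.upper())
    codes = []
    for token in letters.split():
        # Collapse runs of the same letter: "LL" becomes "L".
        collapsed = re.sub(r"(.)\1+", r"\1", token)
        # Keep the first letter, drop vowels and Y from the rest.
        code = collapsed[0] + re.sub(r"[AEIOUY]", "", collapsed[1:])
        codes.append(code[:max_len])
    return "-".join(sorted(codes))


print(name_match_code("Smith, John"))  # JHN-SMTH
print(name_match_code("john smyth"))   # JHN-SMTH
```

A smaller max_len makes the code shorter, so more names share a code and matching becomes looser. A key this simple still misses variants such as "Jon" versus "John".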

SAS – Proc Match
process name and address

Match Codes – name and address

DataFlux Functionality

[Diagram labels: functions, PROCs, Blue Fusion SDK, SAS Job, dfPower Studio, ODBC, SAS library, Data base, Vault, SAS/ACCESS, Data base]

How to Learn More

• Instructor-based training:

– http://support.sas.com/training/Canada

– “SAS Data Quality-Cleanse”, a two-day course, starting on April 22 in Toronto

How to Learn More

• Technical Support

– http://support.sas.com/rnd/warehousing/quality