26
> >> Data Cleansing Data Cleansing

Data Cleansing

  • Upload
    yitta

  • View
    22

  • Download
    2

Embed Size (px)

DESCRIPTION

Data Cleansing. - PowerPoint PPT Presentation

Citation preview

Page 1: Data Cleansing

> >>

Data CleansingData Cleansing

Page 2: Data Cleansing

> >>

““A company’s most important asset is A company’s most important asset is information. A corporation’s ability to information. A corporation’s ability to compete, adapt, and grow in a business compete, adapt, and grow in a business climate of rapid change is dependent in climate of rapid change is dependent in large measure on how well the company large measure on how well the company uses information to make decisions … uses information to make decisions … Sharing information that isn’t clean and Sharing information that isn’t clean and consolidated to the fullest extent can consolidated to the fullest extent can substantially reduce the effectiveness of a substantially reduce the effectiveness of a system of significant investment and system of significant investment and considerable pay-off potential.considerable pay-off potential.” ”

Stoker, 1999Stoker, 1999

Page 3: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Data Cleansing and Data Quality Steps in Data Cleansing

• Why is “Dirty” Data a Problem? Why is Legacy Data “Dirty”? To Cleanse or Not To Cleanse

Parsing • Matching Correcting • Consolidating Standardizing

• Conclusion Demonstration Questions

INTRODUCTION

WHY “DIRTY” DATA

CLEANSING STEPS

CONCLUSION

Today’s CoverageToday’s Coverage

Page 4: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Data Cleansing and Data QualityData Cleansing and Data Quality Data is a product that can be characterized

as either “quality” or “non-quality.” The ability to make quality decisions depends in part on the decision-maker’s ability to access quality data.

Data cleansing is the process that insures that the same piece of information is referred to in only ONE way. When data is clean, its users can focus on its use and not its credibility.

Page 5: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Steps in Data CleansingSteps in Data Cleansing Parsing Correcting Standardizing Matching Consolidating

Page 6: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Why is Data “Dirty” Why is Data “Dirty” and Why is This a Problem?and Why is This a Problem? Simply put, dirty data for data warehouses is

the product of relying on data from legacy systems.

But if company’s have relied on this data for decades, why is it a problem today?

Because a data warehouse “promises” to deliver “a single version of the truth.” Unfortunately integrating data from different sources magnifies its problems.

Page 7: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Why is Legacy Data “Dirty” ?Why is Legacy Data “Dirty” ? Dummy Values, Absence of Data, Multipurpose Fields, Cryptic Data, Contradicting Data, Inappropriate Use of Address Lines, Violation of Business Rules, Reused Primary Keys, Non-Unique Identifiers, and Data Integration Problems

Page 8: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

To Cleanse or Not to CleanseTo Cleanse or Not to Cleanse CAN the legacy data be cleansed? Sometimes the answer is “NO” Then, SHOULD it be cleansed? Again, sometimes “NO” Next, WHERE should it be cleansed? Finally, HOW should it be cleansed?

Page 9: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Steps in Cleansing DataSteps in Cleansing Data Parsing

Correcting

Standardizing

Matching

Consolidating

Page 10: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

ParsingParsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.

Page 11: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

ParsingParsing

Input Data from Source FileBeth Christine Parker, SLS MGRRegional Port AuthorityFederal Building12800 Lake CalumetHedgewisch, IL

Parsed Data in Target FileFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: Lake CalumetCity: HedgewischState: IL

Page 12: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

CorrectingCorrecting

Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.

Page 13: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

CorrectingCorrecting

Corrected DataFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: South Butler DriveCity: ChicagoState: ILZip: 60633Zip+Four: 2398

Parsed DataFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: Lake CalumetCity: HedgewischState: IL

Page 14: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

StandardizingStandardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Page 15: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

StandardizingStandardizing

Corrected DataFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: South Butler DriveCity: ChicagoState: ILZip: 60633Zip+Four: 2398

Corrected DataPre-name: Ms.First Name: Beth1st Name Match Standards: Elizabeth, Bethany, BethelMiddle Name: ChristineLast Name: ParkerTitle: Sales Mgr.Firm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: S. Butler Dr.City: ChicagoState: ILZip: 60633Zip+Four: 2398

Page 16: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Parsing, Correcting, StandardizingParsing, Correcting, Standardizing

Mr. Bill St. John III

101 S. Main Strete

Sant. Louis, MO 63181

TITLE FIRST CONC. LAST GENER.

HSNO ST-NM ST-TYPE

CITY STATE POSTSt.

St. 63118

NAME LINE

STREET LINE

GEOG. LINE

ST-DIR

William

Page 17: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

MatchingMatching

Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.

Page 18: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Match PatternsMatch Patterns

Business Name

Street BranchType

Customer#/Tax ID

City VendorCode

Pattern PatternI.D.

Exact

Exact Exact

ExactExactExactExactExact

Exact

Exact Exact

ExactVClose

Exact

Exact

Exact

ExactExact

ExactExact

VClose

VClose

VClose

VCloseVClose

Close

Close

Close

Blanks

Blanks

AAAAAA

ABAAA-

ABA-AA

ABCCAA

BBACAA

P110

P115

P120

S300

S310

Page 19: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

MatchingMatchingCorrected Data (Data Source #1)Pre-name: Ms.First Name: Beth1st Name Match Standards: Elizabeth, Bethany, BethelMiddle Name: ChristineLast Name: ParkerTitle: Sales Mgr.Firm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: S. Butler Dr.City: ChicagoState: ILZip: 60633Zip+Four: 2398

Corrected Data (Data Source #2)Pre-name: Ms.First Name: Elizabeth1st Name Match Standards: Beth, Bethany, BethelMiddle Name: ChristineLast Name: Parker-LewisTitle: Firm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: S. Butler Dr., Suite 2City: ChicagoState: ILZip: 60633Zip+Four: 2398Phone: 708-555-1234Fax: 708-555-5678

Page 20: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

ConsolidatingConsolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

Page 21: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

ConsolidatingConsolidating

Corrected Data (Data Source #1)

Corrected Data (Data Source #2)

Consolidated DataName: Ms. Beth (Elizabeth) Christine Parker-LewisTitle: Sales Mgr.Firm: Regional Port AuthorityLocation: Federal BuildingAddress: 12800 S. Butler Dr., Suite 2 Chicago, IL 60633-2398Phone: 708-555-1234Fax: 708-555-5678

Page 22: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

ConsolidatingConsolidating

William William JonesJones

Janet Janet JonesJones

Karen Karen JonesJones

William William Jones JrJones Jr..

Page 23: Data Cleansing

> >><<<

CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS

Recommended Best PracticesRecommended Best Practices

1. Use metadata to document rules

2. Determine data cleansing schedule

3. Build quality into new and existing systems

Page 24: Data Cleansing

> >><<<

CLEANSING STEPSINTRODUCTION WHY “DIRTY” DATA CONCLUSION

Legacy Systems View (3 Clients)Legacy Systems View (3 Clients)

Account No.83451234 Policy No.

ME309451-2

TransactionB498/97

Page 25: Data Cleansing

> >><<<

CLEANSING STEPSINTRODUCTION WHY “DIRTY” DATA CONCLUSION

The Reality – ONE ClientThe Reality – ONE Client

Account No.83451234 Policy No.

ME309451-2

TransactionB498/97

Page 26: Data Cleansing

> >><<<

CLEANSING STEPSINTRODUCTION WHY “DIRTY” DATA CONCLUSION

DemonstrationDemonstration Vality

http://www.vality.com Trillium Software

http://www.trilliumsoft.com First Logic

http://www.firstlogic.com