Upload
yitta
View
22
Download
2
Embed Size (px)
DESCRIPTION
Data Cleansing. - PowerPoint PPT Presentation
Citation preview
> >>
Data CleansingData Cleansing
> >>
““A company’s most important asset is A company’s most important asset is information. A corporation’s ability to information. A corporation’s ability to compete, adapt, and grow in a business compete, adapt, and grow in a business climate of rapid change is dependent in climate of rapid change is dependent in large measure on how well the company large measure on how well the company uses information to make decisions … uses information to make decisions … Sharing information that isn’t clean and Sharing information that isn’t clean and consolidated to the fullest extent can consolidated to the fullest extent can substantially reduce the effectiveness of a substantially reduce the effectiveness of a system of significant investment and system of significant investment and considerable pay-off potential.considerable pay-off potential.” ”
Stoker, 1999Stoker, 1999
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Data Cleansing and Data Quality Steps in Data Cleansing
• Why is “Dirty” Data a Problem? Why is Legacy Data “Dirty”? To Cleanse or Not To Cleanse
Parsing • Matching Correcting • Consolidating Standardizing
• Conclusion Demonstration Questions
INTRODUCTION
WHY “DIRTY” DATA
CLEANSING STEPS
CONCLUSION
Today’s CoverageToday’s Coverage
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Data Cleansing and Data QualityData Cleansing and Data Quality Data is a product that can be characterized
as either “quality” or “non-quality.” The ability to make quality decisions depends in part on the decision-maker’s ability to access quality data.
Data cleansing is the process that insures that the same piece of information is referred to in only ONE way. When data is clean, its users can focus on its use and not its credibility.
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Steps in Data CleansingSteps in Data Cleansing Parsing Correcting Standardizing Matching Consolidating
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Why is Data “Dirty” Why is Data “Dirty” and Why is This a Problem?and Why is This a Problem? Simply put, dirty data for data warehouses is
the product of relying on data from legacy systems.
But if company’s have relied on this data for decades, why is it a problem today?
Because a data warehouse “promises” to deliver “a single version of the truth.” Unfortunately integrating data from different sources magnifies its problems.
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Why is Legacy Data “Dirty” ?Why is Legacy Data “Dirty” ? Dummy Values, Absence of Data, Multipurpose Fields, Cryptic Data, Contradicting Data, Inappropriate Use of Address Lines, Violation of Business Rules, Reused Primary Keys, Non-Unique Identifiers, and Data Integration Problems
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
To Cleanse or Not to CleanseTo Cleanse or Not to Cleanse CAN the legacy data be cleansed? Sometimes the answer is “NO” Then, SHOULD it be cleansed? Again, sometimes “NO” Next, WHERE should it be cleansed? Finally, HOW should it be cleansed?
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Steps in Cleansing DataSteps in Cleansing Data Parsing
Correcting
Standardizing
Matching
Consolidating
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
ParsingParsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
ParsingParsing
Input Data from Source FileBeth Christine Parker, SLS MGRRegional Port AuthorityFederal Building12800 Lake CalumetHedgewisch, IL
Parsed Data in Target FileFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: Lake CalumetCity: HedgewischState: IL
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
CorrectingCorrecting
Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
CorrectingCorrecting
Corrected DataFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: South Butler DriveCity: ChicagoState: ILZip: 60633Zip+Four: 2398
Parsed DataFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: Lake CalumetCity: HedgewischState: IL
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
StandardizingStandardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
StandardizingStandardizing
Corrected DataFirst Name: BethMiddle Name: ChristineLast Name: ParkerTitle: SLS MGRFirm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: South Butler DriveCity: ChicagoState: ILZip: 60633Zip+Four: 2398
Corrected DataPre-name: Ms.First Name: Beth1st Name Match Standards: Elizabeth, Bethany, BethelMiddle Name: ChristineLast Name: ParkerTitle: Sales Mgr.Firm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: S. Butler Dr.City: ChicagoState: ILZip: 60633Zip+Four: 2398
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Parsing, Correcting, StandardizingParsing, Correcting, Standardizing
Mr. Bill St. John III
101 S. Main Strete
Sant. Louis, MO 63181
TITLE FIRST CONC. LAST GENER.
HSNO ST-NM ST-TYPE
CITY STATE POSTSt.
St. 63118
NAME LINE
STREET LINE
GEOG. LINE
ST-DIR
William
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
MatchingMatching
Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Match PatternsMatch Patterns
Business Name
Street BranchType
Customer#/Tax ID
City VendorCode
Pattern PatternI.D.
Exact
Exact Exact
ExactExactExactExactExact
Exact
Exact Exact
ExactVClose
Exact
Exact
Exact
ExactExact
ExactExact
VClose
VClose
VClose
VCloseVClose
Close
Close
Close
Blanks
Blanks
AAAAAA
ABAAA-
ABA-AA
ABCCAA
BBACAA
P110
P115
P120
S300
S310
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
MatchingMatchingCorrected Data (Data Source #1)Pre-name: Ms.First Name: Beth1st Name Match Standards: Elizabeth, Bethany, BethelMiddle Name: ChristineLast Name: ParkerTitle: Sales Mgr.Firm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: S. Butler Dr.City: ChicagoState: ILZip: 60633Zip+Four: 2398
Corrected Data (Data Source #2)Pre-name: Ms.First Name: Elizabeth1st Name Match Standards: Beth, Bethany, BethelMiddle Name: ChristineLast Name: Parker-LewisTitle: Firm: Regional Port AuthorityLocation: Federal BuildingNumber: 12800Street: S. Butler Dr., Suite 2City: ChicagoState: ILZip: 60633Zip+Four: 2398Phone: 708-555-1234Fax: 708-555-5678
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
ConsolidatingConsolidating
Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
ConsolidatingConsolidating
Corrected Data (Data Source #1)
Corrected Data (Data Source #2)
Consolidated DataName: Ms. Beth (Elizabeth) Christine Parker-LewisTitle: Sales Mgr.Firm: Regional Port AuthorityLocation: Federal BuildingAddress: 12800 S. Butler Dr., Suite 2 Chicago, IL 60633-2398Phone: 708-555-1234Fax: 708-555-5678
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
ConsolidatingConsolidating
William William JonesJones
Janet Janet JonesJones
Karen Karen JonesJones
William William Jones JrJones Jr..
> >><<<
CONCLUSIONINTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS
Recommended Best PracticesRecommended Best Practices
1. Use metadata to document rules
2. Determine data cleansing schedule
3. Build quality into new and existing systems
> >><<<
CLEANSING STEPSINTRODUCTION WHY “DIRTY” DATA CONCLUSION
Legacy Systems View (3 Clients)Legacy Systems View (3 Clients)
Account No.83451234 Policy No.
ME309451-2
TransactionB498/97
> >><<<
CLEANSING STEPSINTRODUCTION WHY “DIRTY” DATA CONCLUSION
The Reality – ONE ClientThe Reality – ONE Client
Account No.83451234 Policy No.
ME309451-2
TransactionB498/97
> >><<<
CLEANSING STEPSINTRODUCTION WHY “DIRTY” DATA CONCLUSION
DemonstrationDemonstration Vality
http://www.vality.com Trillium Software
http://www.trilliumsoft.com First Logic
http://www.firstlogic.com