© 2006 IBM Corporation
IBM Information Integration Capabilities
2
The IBM Solution: IBM Information Server
Delivering information you can trust
Understand: Discover, model, and govern information structure and content
Cleanse: Standardize, merge, and correct information
Transform: Combine and restructure information for new uses
Deliver: Synchronize, virtualize, and move information for in-line delivery
Parallel Processing
Rich Connectivity to Applications, Data, and Content
Unified Deployment
Unified Metadata Management
3
Where is my information?
How do I get it when I need it?
What does it mean?
Can I trust it?
How do I get it in the form I need?
How do I get it where it needs to go?
How do I control it?
Why Is it Important to Start with Understanding?
4
Physical Metadata: WebSphere Information Analyzer
Data-centric analysis of application, database and file-based sources
Secure, detailed profiling of fields, across fields, and across sources
Creation of metadata from profiling results
Results instantly promotable across IBM Information Server
Understand
Analyze source data structures, and monitor adherence to integration and quality rules
WebSphere Information Analyzer
Data Analysts
Subject Matter Experts
Physical View
5
WebSphere Information Analyzer
What is it?
Next-generation data profiling and analysis tool for heterogeneous enterprise data sources
• Integrates profiling capabilities from three distinct products
What does it do?
Analyzes data sources to discover structure, contents and quality of information
• Infers the “reality” of the data, not just the data definition
• Finds and reports missing, inaccurate and inconsistent data
• Allows review of the quality of data throughout the life cycle
Who uses it?
Business and Data Analysts, Data Quality Specialists, Data Architects and Data Stewards
6
WebSphere Information Analyzer
End-to-End Data Profiling and Content Analysis
– Combines data profiling, data audit, and data format investigation technologies
– Provides column, primary key, foreign key, and cross-domain analysis
– Incorporates comparative analysis against established baselines over time
– Leverages central repository for analysis results with project- and role-level data security
Driven by Business
– Intuitive and Collaborative Environment
– Visualization of data analysis
– Extensive Reporting of analytical results
Exploiting Unique Information Integration Platform Advantages
– Shared metadata and connectivity services
– Shared analytical results with WebSphere DataStage/QualityStage
– Parallel Engine technology for highly scalable performance
7
A single unified and integrated framework
A new and exciting visual design
Pillar menu focused on methodology and user-based tasks, not products
Environment that promotes collaboration
Personalization and customization
Information Analyzer Home Screen
8
Full graphical enablement and display of key analytical data
Potential problems flagged for easy identification
Multiple open workspaces and tabs for easy navigation to facilitate review
Ability to filter results to quickly focus on business issues
Information Analyzer Drill Down
9
Quality Controls for Completeness and Validity of data values
Incomplete or Invalid values set by value, range, or reference sources
Consistency checks for data formats
Information Analyzer Validation
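The completeness and validity controls above can be illustrated with a minimal sketch in plain Python. This is not Information Analyzer's API; the function name `classify_value` and its parameters are hypothetical, chosen only to mirror the three ways the slide says invalid values can be set (by value, by range, or by reference source).

```python
def classify_value(value, *, invalid_values=(), valid_range=None, reference=None):
    """Return 'incomplete', 'invalid', or 'valid' for a single data value."""
    # Completeness: missing or blank values are incomplete.
    if value is None or str(value).strip() == "":
        return "incomplete"
    # Validity by value: an explicit list of values flagged as invalid.
    if value in invalid_values:
        return "invalid"
    # Validity by range: inclusive lower and upper bounds.
    if valid_range is not None:
        low, high = valid_range
        if not (low <= value <= high):
            return "invalid"
    # Validity by reference source: the value must appear in a reference set.
    if reference is not None and value not in reference:
        return "invalid"
    return "valid"


# Illustrative data only.
states = ["MA", "NH", "", "Mass", None]
state_results = [classify_value(s, reference={"MA", "NH"}) for s in states]
age_result = classify_value(130, valid_range=(0, 120))
code_result = classify_value(-1, invalid_values=(-1, 999))
```

Here `state_results` flags the blank and missing entries as incomplete and "Mass" as invalid against the reference set.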
10
Information Analyzer Spotlight: Column Analysis
•Domain Values & Validation
•Data Classification
•Data Properties
•Formats
11
Frequencies of data values and format patterns
Classification of data by system and user
Inferences of data properties (e.g. data type, length, uniqueness)
Information Analyzer Spotlight
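As a rough illustration of what column analysis computes (value frequencies, format patterns, and inferred properties), here is a minimal sketch in plain Python. It is not the product's implementation; the pattern alphabet (`A` for letters, `9` for digits) and the names `format_pattern` and `profile_column` are assumptions made for the example.

```python
from collections import Counter


def format_pattern(value):
    """Map letters to 'A' and digits to '9'; keep every other character."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value
    )


def profile_column(values):
    """Collect simple frequency and property inferences for one column."""
    non_empty = [v for v in values if v.strip()]
    all_digits = bool(non_empty) and all(v.isdigit() for v in non_empty)
    return {
        "value_frequencies": Counter(values),
        "pattern_frequencies": Counter(format_pattern(v) for v in values),
        "inferred_type": "integer" if all_digits else "string",
        "max_length": max((len(v) for v in values), default=0),
        "is_unique": len(set(values)) == len(values),
    }


# Illustrative postal-code column with two nonconforming entries.
zip_codes = ["02116", "02116", "01456", "0415", "MA"]
profile = profile_column(zip_codes)
```

The pattern frequencies make the two outliers stand out: three values share the pattern `99999`, while `9999` and `AA` each occur once.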
12
Quality Controls for Completeness and Validity of data values
Incomplete or Invalid values set by value, range, or reference sources
Conformity checks for data formats
Information Analyzer Spotlight
13
Easily generate reference tables of default, valid, or invalid data
Incorporate transformation mapping values
Preview table output
Export reference tables to desired location for ongoing use
Leverage in WebSphere DataStage or QualityStage jobs
Information Analyzer Spotlight
14
Drilldown to underlying data
Review exception conditions from profiling or data rules
View in workspace with associated information
Filter drilldown results to enhance understanding
Information Analyzer Spotlight
15
Information Analyzer Spotlight: Table Analysis
•Primary Keys (single or multi-column)
•Key Duplicates
16
Evaluate single or multi-column primary keys
Summary and detail of column uniqueness
Details of primary key duplicates
Review of frequency distribution
Information Analyzer Spotlight
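The primary key evaluation described above can be sketched as a uniqueness ratio plus a list of duplicated key values. This is a minimal sketch in plain Python under assumed names (`key_analysis`, `cust_id`, `site`), not the tool's own logic.

```python
from collections import Counter


def key_analysis(rows, key_columns):
    """Measure how unique a single- or multi-column key candidate is."""
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    counts = Counter(keys)
    # Key values that occur more than once, with their occurrence counts.
    duplicates = {k: n for k, n in counts.items() if n > 1}
    # Share of distinct key values among all rows (1.0 means fully unique).
    uniqueness = len(counts) / len(keys) if keys else 0.0
    return {"uniqueness": uniqueness, "duplicates": duplicates}


# Illustrative rows only.
rows = [
    {"cust_id": "90328574", "site": "A"},
    {"cust_id": "90328575", "site": "A"},
    {"cust_id": "90328574", "site": "B"},
    {"cust_id": "90328574", "site": "A"},
]
single = key_analysis(rows, ["cust_id"])
multi = key_analysis(rows, ["cust_id", "site"])
```

Adding `site` to the candidate key raises uniqueness from 0.5 to 0.75 but still leaves one duplicated key, which is the kind of detail a key-duplicates view would surface.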
17
Information Analyzer Spotlight: Cross Table Analysis
•Foreign Key Relationships
•Referential Integrity
•Cross-Domain Relationships
•Data Redundancy
18
Evaluate single or multi-column foreign keys across any number of tables and sources
Summary of referential integrity
Details of key violations including orphaned values
Test any set of common domains for compatibility or redundancy
Information Analyzer Spotlight
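A referential integrity summary with orphaned values can be sketched as a set comparison between child foreign keys and parent primary keys. The function and column names below are hypothetical and the data is illustrative; this is a sketch of the general technique, not the product's algorithm.

```python
def referential_integrity(child_rows, fk_columns, parent_rows, pk_columns):
    """Summarize how many child keys resolve to a parent key."""
    parent_keys = {tuple(r[c] for c in pk_columns) for r in parent_rows}
    child_keys = [tuple(r[c] for c in fk_columns) for r in child_rows]
    # Orphans: distinct child key values with no matching parent key.
    orphans = sorted({k for k in child_keys if k not in parent_keys})
    matched = sum(1 for k in child_keys if k in parent_keys)
    integrity = matched / len(child_keys) if child_keys else 1.0
    return {"integrity": integrity, "orphans": orphans}


customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order": 10, "customer_id": 1},
    {"order": 11, "customer_id": 3},
    {"order": 12, "customer_id": 2},
    {"order": 13, "customer_id": 3},
]
fk_report = referential_integrity(orders, ["customer_id"], customers, ["id"])
```

Because the key columns are passed as lists, the same function handles multi-column foreign keys.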
19
Information Analyzer Spotlight: Baseline Analysis
•Current-to-Prior Comparison
•Content & Structural Variation
20
Compare a checkpoint or current analysis to a baseline
Table-level summary & column-level details
Identify changes in structure or content
Includes changes in quality measures
Turns data profiling into an ongoing event throughout project lifecycle
Information Analyzer Spotlight
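Comparing a current analysis to a baseline amounts to diffing two sets of profile results. The sketch below assumes each analysis is a plain dictionary mapping column names to measures; that layout and the name `compare_to_baseline` are assumptions for illustration, not the repository format the product uses.

```python
def compare_to_baseline(baseline, current):
    """Report structural changes (columns) and content changes (measures)."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = {}
    for column in sorted(set(baseline) & set(current)):
        measures = sorted(set(baseline[column]) | set(current[column]))
        # Keep only measures whose value differs: (baseline value, current value).
        diffs = {
            m: (baseline[column].get(m), current[column].get(m))
            for m in measures
            if baseline[column].get(m) != current[column].get(m)
        }
        if diffs:
            changed[column] = diffs
    return {"added": added, "removed": removed, "changed": changed}


# Illustrative profile measures only.
baseline = {
    "zip": {"type": "string", "null_pct": 0.0},
    "phone": {"type": "string", "null_pct": 2.5},
}
current = {
    "zip": {"type": "string", "null_pct": 1.5},
    "phone": {"type": "string", "null_pct": 2.5},
    "email": {"type": "string", "null_pct": 0.0},
}
report = compare_to_baseline(baseline, current)
```

The result separates structural variation (the new `email` column) from content variation (the change in `null_pct` for `zip`).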
21
All analytical processes can be scheduled
Scheduling supports: start date or delay, repeating definitions, end date or delay, and repeat count to stop schedule
Information Analyzer Spotlight
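The scheduling options listed above (a start, a repeat interval, and either an end or a repeat count to stop) can be sketched as a small run-time generator. This is only an illustration of how those options interact; `run_times` and its parameters are hypothetical names, not the scheduler's interface.

```python
from datetime import datetime, timedelta


def run_times(start, every, end=None, repeat_count=None):
    """List run times from start, spaced by 'every', stopped by end or count."""
    if end is None and repeat_count is None:
        raise ValueError("need an end or a repeat_count to stop the schedule")
    times = []
    current = start
    while (end is None or current <= end) and (
        repeat_count is None or len(times) < repeat_count
    ):
        times.append(current)
        current += every
    return times


start = datetime(2006, 1, 2, 1, 0)
by_count = run_times(start, timedelta(days=1), repeat_count=3)
by_end = run_times(start, timedelta(days=1), end=datetime(2006, 1, 3, 12, 0))
```

A start delay can be expressed the same way by passing `datetime.now() + timedelta(...)` as `start`.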
22
Information Analyzer Spotlight
Notes on any Repository Object
– Metadata Information
– Any Analytical Result
Supports user-defined Status and Type for subsequent reporting
23
Multi-level security and administration framework:
Suite
Product
Project
Data source
Standard Authentication controls
User, role, and privilege assignment
Environment that supports critical compliance regulations
Information Analyzer Highlights
24
Metadata discovery shared across Suite
Projects register interest only in Data Sources of concern
Metadata Import focused on user interest
Analytical results published in secured framework
Information Analyzer Spotlight
25
The IBM Solution: IBM Information Server
Delivering information you can trust
Understand: Discover, model, and govern information structure and content
Cleanse: Standardize, merge, and correct information
Transform: Combine and restructure information for new uses
Deliver: Synchronize, virtualize, and move information for in-line delivery
Parallel Processing
Rich Connectivity to Applications, Data, and Content
Unified Deployment
Unified Metadata Management
26
Why Should I Care About Cleansing Information?
Lack of information standards
– Different formats & structures across different systems
Data surprises in individual fields
– Data misplaced in the database
Information buried in free-form fields
Data myopia
– Lack of consistent identifiers inhibits a single view
The redundancy nightmare
– Duplicate records with a lack of standards
Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116
Name                      Tax ID        Telephone
J Smith DBA Lime Cons.    228-02-1975   6173380300
Williams & Co. C/O Bill   025-37-1888   415-392-2000
1st Natl Provident        34-2671434    3380321
HP 15 State St.           508-466-1200  Orlando
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 6 Foot Cable
90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106
27
Data Cleansing: WebSphere QualityStage
Specialized data quality functions seamlessly integrated with DataStage
Visual tools for defining complex matching and survivorship logic
Ensures clean, standardized, de-duplicated information
Enables a single version of the truth
Cleanse
Subject Matter Experts
Standardize and correct source data fields, and match records together
across sources to create a single view
WebSphere QualityStage™
Visual Match Rule Design
Data Analysts
28
Integrated Approach - QualityStage & Information Analyzer
Sharing metadata
Both Information Analyzer and QualityStage store Table metadata in the common repository
• Allows sharing of metadata definitions
• Provides single metadata import from data source for use in both tools
– Analytical information available in QS Designer
• Enables QualityStage user to see analysis data for shared tables
• “Analytical Information” tab on the Edit Row dialog when looking at the details of an individual column from…
– …a Table Definition
– …a stage editor
• “Analytical Information” tab on the Table Definition dialog
29
Standardization Benefits
Direct from DB or flat file
Optimize disk
Rules are now ‘first class’ objects
30
Introduction to New Match Design Environment - Features
The Major Components
Holding Area
Histogram
Data Viewer
Decision Rules
Pass Composer
Cutoff Tuning
31
Statistics
Introduction to New Match Design Environment - Features
The Major Components (cont.)
Baseline Analysis
Customizable Graphics
32
QualityStage Process
Data Quality Assessment (DQA): Investigation
Data Re-Engineering (DRE): Standardization, Matching, Survivorship

Blk 1, 1 St, 05-00
05-00 Frist St, Block 1
1 First Str, #05-00
1, St, #05-00

Blk 1|First St|05-00
Blk 1|First St|05-00
1|First St|#05-00
1|St|#05-00

Blk 1|First St|05-00
Blk 1|First St|05-00
1|First St|#05-00
1|St|#05-00

#05-00, Blk 1, First St
#05-00, 1, St

0001 25.0% L^^T^-^
0001 25.0% ^-^+TL^
0001 25.0% ^OT#^-^
0001 25.0% ^T#^-^
33
Investigation - Character
1. Double Click
34
Investigation - Character
2. Select Column 3. Add
35
Investigation - Character
9. Define output as desired
36
Standardization
1. Double Click
Job: Tech Symposium\QualityStage\2.Standardarize\StanAndGenMatchFreqODBC
37
Standardization
1. Double Click
38
Standardization
6. Stage Properties
39
Standardization
7. Output tab to map columns
40
Standardization
8. OK
41
Match Design - Unduplicate
42
Match Design – Unduplicate - Overview
The Major Components
Holding Area
Histogram
Data Viewer
Decision Rules
Pass Composer
Cutoff Tuning
43
Match Design - Unduplicate
1. Create Specification
44
Match Design - Unduplicate
Blank Specification
45
Match Design - Unduplicate
2. Select Match Type
46
Match Design - Unduplicate
3. Double click on link to load metadata
4. Load
5. Navigate and OK
47
Match Design - Unduplicate
OK
48
Match Design - Unduplicate
6. Click on ‘MyPass’
‘Blocking’
‘Match Commands’
49
Match Design - Unduplicate
8. Save Match Specification
50
Match Design - Unduplicate
9. Give Name and ‘Save’
51
Match Design - Unduplicate
10. Configuration
52
Match Design - Unduplicate
11. Data Sample
12. Data Frequency
13. Data Source Name
14. User Name (qsmatch)
15. Password (qsmatch)
53
Match Design - Unduplicate
16. Add Blocking Columns
54
Match Design - Unduplicate
17. Select Column
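Blocking columns limit which record pairs are compared at all: only records that share the same blocking key become candidate pairs. The sketch below shows that general idea in plain Python. It is not QualityStage's implementation, and `candidate_pairs` is a hypothetical name; the sample records reuse company names and ZIP codes from the earlier data-quality examples.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_columns):
    """Return index pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        key = tuple(record[c] for c in blocking_columns)
        blocks[key].append(index)
    pairs = []
    # Only records inside the same block are paired for comparison.
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"name": "IBM", "zip": "01456"},
    {"name": "I.B.M. Inc.", "zip": "01456"},
    {"name": "Int. Bus. Machines", "zip": "04156"},
    {"name": "International Bus. M.", "zip": "04156"},
    {"name": "Inter-Nation Consults", "zip": "02341"},
]
pairs = candidate_pairs(records, ["zip"])
```

Blocking on `zip` reduces the ten possible pairs among five records to two. It also shows the trade-off: the transposed ZIP code (01456 versus 04156) keeps the first two records from ever being compared with the next two, which is one reason a match design can use more than one pass.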
55
Match Design - Unduplicate
18. Add MATCH Column
56
Match Design - Unduplicate
19. Business Name
57
Match Design - Unduplicate
20. Compare Type
58
Match Design - Unduplicate
21. Data Column
Right-Click
59
Match Design - Unduplicate
Frequencies
60
Match Design - Unduplicate
22. Select
23. Parameter
61
Match Design – Unduplicate (Fully Configured)
62
Match Design – Unduplicate
Grouping option:
Match Sets: See all matches and duplicates together
Match Pairs+Sort: See the master record repeated
63
Match Design – Unduplicate
Default Display (Grouped by Match Sets)
Grouped by Match Pairs and then sorted Ascending by Weight
64
Match Design – Unduplicate
Compare Weights: See how any two records score
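To make "how any two records score" concrete, here is a minimal sketch of a common probabilistic record-linkage formulation (Fellegi-Sunter style), in which each compared column contributes an agreement or disagreement weight and the total is checked against cutoffs. The slides do not give the formula, so treat this as an assumption: the m and u probabilities, the exact-equality comparison, and the names `match_cutoff` and `clerical_cutoff` are all illustrative.

```python
import math


def field_weight(agrees, m, u):
    """Agreement weight log2(m/u), or disagreement weight log2((1-m)/(1-u))."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def composite_weight(a, b, spec):
    """Sum the per-column weights for two records (exact comparison only)."""
    return sum(field_weight(a[col] == b[col], m, u) for col, (m, u) in spec.items())


def classify(weight, match_cutoff, clerical_cutoff):
    """Place a composite weight relative to the two cutoffs."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


# m: chance the column agrees for a true match; u: chance it agrees by accident.
spec = {"name": (0.9, 0.01), "zip": (0.9, 0.1)}
rec_a = {"name": "IBM", "zip": "01456"}
rec_b = {"name": "IBM", "zip": "01456"}
rec_c = {"name": "IBM", "zip": "04156"}
rec_d = {"name": "Lime Cons.", "zip": "02116"}

w_ab = composite_weight(rec_a, rec_b, spec)
w_ac = composite_weight(rec_a, rec_c, spec)
w_ad = composite_weight(rec_a, rec_d, spec)
labels = [classify(w, match_cutoff=8.0, clerical_cutoff=2.0) for w in (w_ab, w_ac, w_ad)]
```

Raising or lowering the two cutoffs moves pairs between the three outcomes, which is the trade-off that cutoff tuning against a histogram of weights is meant to expose.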
65
Match Design – Unduplicate
Statistics Tab
Change What Shows
66
Match Design – Unduplicate
Change How Shows
67
Match Design – Unduplicate
TOTAL Statistics Tab
Change What Shows
Change How Shows
68
Match Implementation - Unduplicate
69
Unduplication Implementation
Job: Tech Symposium\QualityStage\3.Unduplicate\Unduplicate
1. Double Click
70
Unduplication Implementation
2. Click ‘…’
71
Unduplication Implementation
8. Output Tab to map columns
72
Survive
73
Survive
Job: Tech Symposium\QualityStage\4.Survive\Survive
1. Double Click
74
Survive
2. Select Group Identification Column
3. Highlight and ‘Modify Rule’
75
Survive
4. Output Column
5. Technique
76
Survive
Out-of-the-box Techniques
77
Survive
‘Complex’ available
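Survivorship can be pictured as building one output record per match group, with a technique chosen per output column. The sketch below is a plain Python illustration, not the Survive stage itself; the two techniques shown (`longest`, `most_frequent`) and the column name `set_id` are assumptions, since the slides do not list the available techniques.

```python
from collections import Counter, defaultdict


def longest(values):
    """Keep the longest value in the group."""
    return max(values, key=len)


def most_frequent(values):
    """Keep the value that occurs most often in the group."""
    return Counter(values).most_common(1)[0][0]


def survive(records, group_column, rules):
    """Build one surviving record per group, one technique per output column."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_column]].append(record)
    survivors = []
    for group_id, members in groups.items():
        survivor = {group_column: group_id}
        for column, technique in rules.items():
            # Blank values never survive when a non-blank one exists.
            values = [m[column] for m in members if m[column]]
            survivor[column] = technique(values) if values else ""
        survivors.append(survivor)
    return survivors


# Illustrative match group based on the earlier name and address examples.
matched = [
    {"set_id": 1, "name": "Kate A. Roberts", "city": "Boston"},
    {"set_id": 1, "name": "Catherine Roberts", "city": "Boston"},
    {"set_id": 1, "name": "Mrs. K. Roberts", "city": "Suffolk County"},
]
survivors = survive(matched, "set_id", {"name": longest, "city": most_frequent})
```

Passing the technique as a function per column mirrors the idea of selecting an output column and then a technique for it; a more complex rule would simply be another function.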
78
Single Design Environment
All phases of data quality:
– Investigate
– Standardize
– Match
• Unduplicate
• Reference
– Survive
79