IBM Cloud and Cognitive Software Fast Start 2020 #FastStart2020
IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak™ for Data – A Data Quality Deep Dive
Dan Schallenkamp
Data and AI, Offering Manager for Data Quality
Thurs. 30-April-2020 CHI UG Meeting
Legal Disclaimer
© IBM Corporation 2020. All Rights Reserved.
The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Session Agenda
• Where is Data Quality Positioned in our offerings?
• Business Value / Purpose
• Data Quality – Key Capabilities
• What’s New in the current GA release?
• Demo
• What’s Planned in the Next release?
• Demo
© 2020 IBM Corporation
You Are Here: How this session fits in the DataOps story
The AI Ladder
A prescriptive approach to accelerating the journey to AI
Collect: Make data simple and accessible
Organize: Create a business-ready analytics foundation
Analyze: Build and scale AI with trust and transparency
Infuse: Operationalize AI throughout the business
Modernize: Make your data ready for an AI and hybrid cloud world
DataOps is the concept to deliver Business Ready Data
[Diagram labels: COLLECT, ORGANIZE, ANALYZE, INFUSE your data with AI; Analytics and AI at scale and speed; to drive Operational Efficiency; Data Quality; Data privacy & compliance]
DataOps (DevOps for Data + Data Operations)
• A concept, like DevOps for Data, enabling collaboration between data consumer & data provider at speed & scale
• Automated data operations providing curated data pipeline
• Drives agility and innovation everywhere
People Process Technology
Data Quality – Key Capabilities
End-to-End Platform for Business-Ready Data
Integration of data quality (from Information Analyzer), data governance (from Information Governance Catalog), and data consumption (from Watson Knowledge Catalog), now under one experience and brand.
Cloud Pak for Data
Enterprise Data Integration (DataStage)
• Build ETL jobs
• Run ETL jobs
• Monitor
• Extract data
• Collect metadata
• Move data
• Ingest data
Enterprise Data Quality
• Deep data profiling
• Data quality scoring
• Apply and monitor validation rules against source data
Enterprise Data Governance
• Data lineage
• Data ownership
• Data stewardship
• Data governance workflow
• Discover metadata assets
• Classify data assets
• Build data glossary
• Manage metadata repository
• Manage Reference Data
Enterprise Data Consumption
• Search and find relevant data
• Connect & prepare data for consumption & analysis
• Consume and analyze the data
• Comment, rate and share
Personas: Data Engineers, Data Governance Teams, Data Citizens
IBM Watson Knowledge Catalog on Cloud Pak for Data
AI Lifecycle (Watson Studio, Watson Machine Learning, and Open Scale)
• Ground Truth gathering
• Data Cleansing
• Feature Engineering
• Model Selection
• Parameter Optimization
• Ensemble
• Model Validation
• Model Deployment
• Runtime Monitoring
• Model Improvement
Data Profiling and Quality – Core Capabilities
[Diagram: Column Analysis, Primary Key Analysis, Relationship & Overlap Analysis, Rules Analysis; labels: Source 1, Source 2]
Analyze – Deep Data Profiling & Analysis
Provides the key understanding of the source data
• Column analysis
• Business Term Assignments
• Data Classification
• Data Quality scores
• Primary Key analysis
• Relationship and Overlap analysis
Monitor Data Quality – using Business Rules
Evaluates user-defined rules against the source data
• Data Rules – targeted evaluation
• Rule Sets – combined assessment
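To make the difference between a Data Rule (targeted evaluation) and a Rule Set (combined assessment) concrete, here is a minimal Python sketch. It is not the product's rule engine or rule syntax; the `DataRule` type, the function names, and the sample rows are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DataRule:
    """Hypothetical type: a named predicate every row is expected to satisfy."""
    name: str
    check: Callable[[Row], bool]


def evaluate_rule(rule: DataRule, rows: List[Row]) -> List[Row]:
    """Targeted evaluation: return the rows that violate one rule."""
    return [row for row in rows if not rule.check(row)]


def evaluate_rule_set(rules: List[DataRule], rows: List[Row]) -> Dict[str, object]:
    """Combined assessment: evaluate several rules together and summarize."""
    failures = {rule.name: evaluate_rule(rule, rows) for rule in rules}
    clean = [row for row in rows if all(rule.check(row) for rule in rules)]
    return {
        "failures_per_rule": {name: len(bad) for name, bad in failures.items()},
        "rows_passing_all_rules": len(clean),
        "rows_total": len(rows),
    }


# Made-up sample data.
rows = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "", "age": 51},
    {"customer_id": 3, "email": "c@example.com", "age": -4},
]
rules = [
    DataRule("email_present", lambda r: bool(r["email"])),
    DataRule("age_in_range", lambda r: 0 <= r["age"] <= 120),
]
summary = evaluate_rule_set(rules, rows)
print(summary)
```

A single rule answers "which rows break this check?", while the rule set also reports how many rows pass every check at once.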
How to get the best results from Quick scan and Auto Discovery ... Example: for your critical data elements
Homework: Spend time customizing the tool
Step 1: Business Terms
Define Terms, Policies and Rules for your top 50 or 150 CDEs
Step 2: Data Classes
Examine the 200+ built-in data classes, disable those you don’t need, create and test custom data classes.
You must link every data class to a business term.
Step 3: Automation Rules
Create Automation Rules for your top 50 or 150 CDEs
- ARs trigger based on Business term assignments
- Can automatically bind/create Quality Rules
Step 4: DQ Dimensions
Examine the 11 built-in data quality dimensions, enable/disable as needed, create and install custom dimensions
Used to calculate the DQ Score for given columns
Step 5: Auto Discover
• Automatic metadata import
• Analysis
• Auto classification
• Auto term assignment
• Data quality scores
Innovation
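The slide says the enabled data quality dimensions are used to calculate the DQ Score for a column, but it does not give the formula. The sketch below shows one plausible reading, stated as an assumption: the score is the percentage of values that trip none of the enabled dimensions. The two dimension checks and all names are hypothetical, not the built-in dimensions.

```python
from typing import Callable, Dict, List, Optional

Value = Optional[str]

# Hypothetical dimension checks: each returns True when a value has a problem.
DIMENSIONS: Dict[str, Callable[[Value], bool]] = {
    "missing_value": lambda v: v is None or v.strip() == "",
    "suspect_length": lambda v: v is not None and len(v) > 10,
}


def column_score(values: List[Value], enabled: List[str]) -> float:
    """Assumed formula: percentage of values with no issue in any enabled dimension."""
    if not values:
        return 100.0
    clean = sum(
        1 for v in values
        if not any(DIMENSIONS[name](v) for name in enabled)
    )
    return 100.0 * clean / len(values)


# Made-up column: two missing values and one overly long value.
postal_codes = ["60601", "", None, "60614", "606010000000", "60657", "60622", "60611"]
score_all = column_score(postal_codes, ["missing_value", "suspect_length"])
score_missing_only = column_score(postal_codes, ["missing_value"])
print(score_all, score_missing_only)
```

Disabling a dimension that does not apply to a column raises its score, which is why it pays to review the dimensions before running a bulk discovery.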
Quick scan – Blazing Fast Bulk Discovery
An easy way to start the import, analysis, quality scores, data classification (to find PII data) and automatic business term assignments all with one easy operation.
(see screen shots in demo section below)
Classification – Automatic Business Term Assignment
[Diagram labels: Data Sources (Systems of Record, Systems of Engagement, Systems of Insights, Cloud, Social Media, News, Documents, Hadoop, Others); Data Discovery (Quick scan); Cognitive & Deep Learning, ML Classification, Rule Based Classifiers; Recommendations & Auto Term Assignment; Curator Dashboard with Decisions (Approve, Reject / Modify); Publish to the Enterprise Data Catalog; Feedback and Training]
Automated Data Classification
Regex/Valid Value/Java Classifiers
JavaScript Classifiers
Column Similarity classifiers
Public Domain Classifiers
Table Classifiers
Auto Grouping and Suggestion
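Of the classifier types listed above, the regex kind is the easiest to illustrate. The sketch below assumes a column is assigned a data class when a large enough share of its non-empty values fully match a pattern. The ZIP code pattern, the 0.75 threshold, and the function names are hypothetical and do not reflect how the product defines or evaluates data classes.

```python
import re
from typing import List, Optional

# Hypothetical custom data class: US ZIP code (5 digits, optional +4 suffix).
US_ZIP = re.compile(r"\d{5}(-\d{4})?")


def match_ratio(values: List[Optional[str]], pattern: "re.Pattern[str]") -> float:
    """Fraction of non-empty values that fully match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty)


def classify(values: List[Optional[str]], pattern: "re.Pattern[str]",
             threshold: float = 0.75) -> bool:
    """Assign the data class when enough values match (threshold is an assumption)."""
    return match_ratio(values, pattern) >= threshold


# Made-up columns; "6O622" contains the letter O and should not match.
zip_column = ["60601", "60614-1234", "60657", None, "6O622", "60611"]
name_column = ["Ann", "Bob", "60601"]
print(classify(zip_column, US_ZIP), classify(name_column, US_ZIP))
```

Testing a custom pattern against known-good and known-bad columns like this is a cheap way to catch an overly loose regex before enabling it.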
Automated Data Quality
Quality Analysis
Quality Rules
Quality Dimensions
Automation Rules
M/L Suggested rules
Business Term Assignment
Data Quality
- The Importance of Quality Addresses
- A word on Workflow
The Importance of Quality Addresses
Good quality addresses are foundational to so many initiatives including:
• Know Your Customer (Prospect, Employee, Vendor, Patient)
• Data Quality in general and Matching and Deduplication specifically
• Shipping, mailing, logistics
IBM’s QualityStage Address Verification Interface (AVI) is tightly integrated with QualityStage
Questions:
• What do you use today to parse, correct, enhance & verify addresses?
• How often do you cleanse all your addresses and at what cost?
• Do you need to add lat/long coordinates to addresses?
Address Parse/Validate/Enhance
Capabilities
– Supports over 248 countries and territories
– Improved verification, suggestion and correction results in batch or real time
– Bi-directional Transliteration support for 8 languages
– Tightly integrated into InfoSphere QualityStage
– Process multiple countries in a single run
– Latitude and longitude assignment
– US Census* and UK PAF data
Benefits
– Reduced errors in shipping/mailing & other activity, lowers cost
– Better customer service and increased revenue
– Increase business confidence when using enterprise data for critical decision making
– Enhanced and standardized address data supports record matching & de-duplication
Data Quality – What’s New in Watson Knowledge Catalog?
EVERYTHING is New! All DQ is New!
Data Quality – Retire the two older IA clients in 11.7.1 SP2
11.7.1
– Information Analyzer OneUI: zero-footprint, microservices-based client (requires the ‘UG Stack’)
– Information Analyzer Workbench: Windows-based thick client
– Information Analyzer Thin Client (old/first thin client)
A Unified User eXperience (UX) across IIS and WKC
[Diagram: Information Analyzer + Information Governance Catalog + Watson Knowledge Catalog on IBM Cloud Pak for Data; Unified User Experience & Single Catalog; labels: Product Strategy, New]
Data Quality within ICP/WKC
[Diagram: Watson Knowledge Catalog on IBM Cloud Pak for Data; label: New]
Data Rule Definition Management – For the business user
Accelerating Data Quality through ML based automation
Machine Learning assisted Data Quality
• Auto Business Term Assignment – ML assisted
• Auto Business Rule Suggestion – via Automation Rules based on term assignment and data class
• Auto Discovery – a quick way to kick off bulk analysis operations including:
- Metadata import
- Data profiling
- Data quality scores
- Term assignment
Innovation
Accelerating the Quality & Governance Process
Automating the Governance Process
• Utilizing Machine Learning for an accelerated Metadata Classification Process (Auto Business Term assignment)
• Automatically classify data -- including understanding your PII risk
Innovation
Automation through Machine Learning
Automation Rules – Designed for the business user
Innovation
• Automatic Actions/Rules and DQ threshold based on Term assignments
• Enable/Disable all or individual built-in data quality dimensions
• Auto-bind one or more Data Rule Definitions
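The three bullets above can be read as "when a column carries a given business term, apply these settings". The following is a minimal sketch of that idea in Python. The `AutomationRule` shape, the choice to keep the strictest threshold when several rules fire, and every term and rule name are assumptions for illustration, not the product's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AutomationRule:
    """Hypothetical shape: when a column carries `term`, apply these settings."""
    term: str
    dq_threshold: float
    disabled_dimensions: List[str] = field(default_factory=list)
    rule_definitions: List[str] = field(default_factory=list)


def plan_for_column(column_terms: List[str],
                    rules: List[AutomationRule]) -> Dict[str, object]:
    """Collect the actions triggered by the terms assigned to one column."""
    plan: Dict[str, object] = {
        "dq_threshold": None,
        "disabled_dimensions": [],
        "bound_rules": [],
    }
    for rule in rules:
        if rule.term in column_terms:
            # Assumption: if several rules fire, keep the strictest threshold.
            current = plan["dq_threshold"]
            if current is None or rule.dq_threshold > current:
                plan["dq_threshold"] = rule.dq_threshold
            plan["disabled_dimensions"].extend(rule.disabled_dimensions)
            plan["bound_rules"].extend(rule.rule_definitions)
    return plan


# Made-up rules for two critical data elements.
automation_rules = [
    AutomationRule("Email Address", 95.0, ["suspect_case"], ["email_format_valid"]),
    AutomationRule("Customer ID", 99.0, [], ["id_not_null", "id_unique"]),
]
email_plan = plan_for_column(["Email Address"], automation_rules)
other_plan = plan_for_column(["Unrelated Term"], automation_rules)
print(email_plan)
```

Because the trigger is the term assignment, a business user only has to maintain the term-to-settings mapping; columns pick up thresholds and rule bindings as terms are assigned.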
SQL Virtual Tables
Can greatly simplify the creation and maintenance of data rule logic by ‘pushing’ the complexities (table JOINs, filters, etc.) to the source database.
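As a rough illustration of pushing the JOIN into the database so that the rule itself stays simple, the sketch below uses an in-memory SQLite view as a stand-in for a SQL virtual table. The schema, view name, and rule condition are made up, and the product's own SQL virtual table feature is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, -5.0), (12, 3, 40.0);

    -- The JOIN lives in the database, not in the rule logic.
    CREATE VIEW order_check AS
    SELECT o.id AS order_id, o.amount, c.country
    FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id;
""")

# The rule stays simple: the amount must be positive and the customer must exist.
exceptions = conn.execute(
    "SELECT order_id FROM order_check "
    "WHERE amount <= 0 OR country IS NULL ORDER BY order_id"
).fetchall()
print(exceptions)
conn.close()
```

Order 11 fails on the amount and order 12 fails because its customer is missing; the rule never needs to know how the two tables are related.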
Data Quality – What’s New?
In IIS 11.7.1 SP2 and Also in WKC?
What’s New with the Nov 2019 Release?
IIS 11.7.1 SP2 and CPD WKC 2.5
1. 90% of IA (including Quick scan and Auto discovery) is included in WKC with a common UX - Demo
2. Create/edit/delete virtual columns (both)
3. Limit the number of Data Rule output exceptions (both)
4. Validity Benchmark is back in Data Rules (both)
5. ‘Manage’ Flag in Data Rules (IIS only today)
6. Remember many user choices/preferences (both)
Create/Edit/Delete Virtual Columns (both) 1 of 2
• Choose ‘Create virtual column’ from the Columns tab
• If you ‘Select’ an existing virtual column you can choose ‘Edit’ or ‘Delete’
Create/Edit/Delete Virtual Columns (both) 2 of 2
• Add two or more columns
• Move up or down
• Choose field separator and other settings
• Provide a name and description
• Treated like any other column. You can analyze, run Rules against it, etc.
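Conceptually, a virtual column is the ordered concatenation of two or more source columns with a chosen separator. A minimal sketch of that idea, with a hypothetical helper name and made-up data (this is not how the product stores or computes virtual columns):

```python
from typing import Dict, List


def add_virtual_column(rows: List[Dict[str, object]], name: str,
                       source_columns: List[str],
                       separator: str = " ") -> List[Dict[str, object]]:
    """Return new rows with `name` set to the joined source column values."""
    if len(source_columns) < 2:
        raise ValueError("a virtual column needs two or more source columns")
    result = []
    for row in rows:
        # The order of source_columns controls the order in the joined value.
        joined = separator.join(str(row[col]) for col in source_columns)
        result.append({**row, name: joined})
    return result


people = [
    {"first": "Ada", "last": "Lovelace"},
    {"first": "Alan", "last": "Turing"},
]
with_full_name = add_virtual_column(people, "full_name", ["last", "first"], separator=", ")
print([row["full_name"] for row in with_full_name])
```

Once derived, `full_name` can be profiled or checked like any other column, for example for uniqueness across the pair of source columns.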
Limit # of Data Rule output exceptions (both)
• Sometimes the first 100 or 1000 exceptions are more than enough to share in order to describe and diagnose the quality issue
• Can be a big time savings and disk savings vs the output of all exceptions
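The savings come from stopping as soon as enough exceptions have been collected. The sketch below shows the general technique with a lazy generator and `itertools.islice`; the function names and sample rule are hypothetical, and it makes no claim about how the product implements the limit.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]


def exceptions(rows: Iterable[Row], check: Callable[[Row], bool]) -> Iterator[Row]:
    """Lazily yield rows that fail the rule."""
    return (row for row in rows if not check(row))


def first_exceptions(rows: Iterable[Row], check: Callable[[Row], bool],
                     limit: int) -> List[Row]:
    """Collect at most `limit` exceptions, then stop reading rows."""
    return list(islice(exceptions(rows, check), limit))


# Made-up source of one million rows; every third row has a negative quantity.
source = ({"id": i, "qty": (i % 3) - 1} for i in range(1_000_000))
sample = first_exceptions(source, lambda r: r["qty"] >= 0, limit=5)
sample_ids = [r["id"] for r in sample]
next_unread_id = next(source)["id"]
print(sample_ids, next_unread_id)
```

Only the first 13 rows are read to find five exceptions; the rest of the source is never scanned or written out.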
Validity Benchmark is back (both)
• A longtime IA feature that some customers are using
• Added to help those customers make the move to the new UI and to WKC
‘Manage’ Flag in Data Rules (IIS only today)
• Previously only available in DQEC
• And only showed up in DQEC if the Data Rule had been executed
Planned Live Demo
IBM Cloud Pak for Data WKC
Select Roadmap Items
What Can We Expect in the Next Release?
Planned for mid-June, 2020 release (subject to change): WKC 3.0 and 11.7.1 FP1
1. New, much more intuitive Data Quality menu structure (both)
2. Negative term classification (both)
3. WKC experience for Data Rule exceptions (DQEC replacement) (WKC)
4. Data Rule binding drag and drop (both)
5. Visualization of Data Quality scores over time (both)
6. On-going DQ architecture modernization (WKC)
7. New ‘Column Similarity’ (aka Fingerprint) data class (WKC)
8. Many minor UX improvements (retain user preferences, etc.) (both)
9. Relationship Analysis more intuitive (both)
10. Globalization (Translation of our UIs into several languages) (WKC)
11. ML Based Data Rule Definition Generation (WKC)
12. Suggested Automation Rule (available today in 11.7.1 SP2, planned for WKC)
Negative Term Classification
• Improving DQ & Governance for business term assignment
• Remember what the user has manually rejected
• Compare to what is already published
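One way to picture "remembering" rejections is a filter applied to term suggestions before they reach the user. The sketch below is an assumption-laden illustration: the `TermSuggester` class is hypothetical, and reading "compare to what is already published" as "skip terms already published for the column" is an interpretation, not something the slide confirms.

```python
from typing import FrozenSet, List, Set, Tuple

Suggestion = Tuple[str, float]  # (business term, confidence score)


class TermSuggester:
    """Hypothetical helper that remembers manual rejections per column."""

    def __init__(self) -> None:
        self.rejected: Set[Tuple[str, str]] = set()

    def reject(self, column: str, term: str) -> None:
        """Record that a user rejected `term` for `column`."""
        self.rejected.add((column, term))

    def suggest(self, column: str, candidates: List[Suggestion],
                published: FrozenSet[str] = frozenset()) -> List[Suggestion]:
        """Drop rejected terms and (by assumption) terms already published."""
        return [
            (term, score) for term, score in candidates
            if (column, term) not in self.rejected and term not in published
        ]


# Made-up candidates for a column named CONTACT_NO.
candidates = [("Phone Number", 0.91), ("Fax Number", 0.74), ("Account Number", 0.40)]
suggester = TermSuggester()
suggester.reject("CONTACT_NO", "Fax Number")
remaining = suggester.suggest("CONTACT_NO", candidates, published=frozenset({"Phone Number"}))
untouched = suggester.suggest("OTHER_COLUMN", candidates)
print(remaining)
```

The rejection is scoped to the column, so the same term can still be suggested elsewhere.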
Innovation – Column Similarity
• ‘No Class Detected’ columns are grouped based on similarity
• User can inspect each group, determine the cutoff score
• Create a new codeless Data Class
• The next time analysis is run, the new Data Class is working
• This is a quick way to create codeless custom Data Classes that are unique to a given customer’s data
Easy Data Class Creation – ‘Column Similarity’
• Mimic how a human brain thinks
• Find patterns that are similar across the multiple datasets under evaluation,
• Present them to the user as clusters of “similar patterns”
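The slides do not describe the algorithm behind Column Similarity, so the following is only a stand-in to show the shape of the workflow: fingerprint each unclassified column, group columns whose fingerprints are similar, and let a cutoff score decide group membership. The format-pattern fingerprint, Jaccard similarity, greedy grouping, and all names are assumptions.

```python
from typing import Dict, List, Set


def value_pattern(value: str) -> str:
    """Map digits to 9 and letters to A, keeping punctuation ('AB-12' -> 'AA-99')."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def fingerprint(values: List[str]) -> Set[str]:
    """Assumed fingerprint: the set of format patterns seen in a column."""
    return {value_pattern(v) for v in values}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared patterns divided by all patterns."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_columns(columns: Dict[str, List[str]], cutoff: float) -> List[List[str]]:
    """Greedy grouping: a column joins the first group whose first member it resembles."""
    prints = {name: fingerprint(vals) for name, vals in columns.items()}
    groups: List[List[str]] = []
    for name in columns:
        for group in groups:
            if jaccard(prints[name], prints[group[0]]) >= cutoff:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Made-up columns with no detected data class.
columns = {
    "part_no": ["AB-1234", "CD-5678", "EF-0001"],
    "sku": ["ZX-9999", "QQ-1111"],
    "notes": ["call back", "n/a"],
    "item_code": ["GH-4321", "IJ-8765", "KL12"],
}
loose = group_columns(columns, cutoff=0.5)
strict = group_columns(columns, cutoff=0.8)
print(loose)
print(strict)
```

Raising the cutoff from 0.5 to 0.8 pushes `item_code` out of the part-number group, which mirrors the step where the user inspects a group and picks the cutoff before creating the new data class.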
New Visualizations and Navigation
New Visualizations – Data Quality score over time
New Navigation Structure
Relationship Analysis
Thank you
Dan Schallenkamp
Data and AI, Offering Manager for Data Quality
[email protected]
+1-704-458-0467