View
249
Download
10
Category
Preview:
Citation preview
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
1
Data Profiling,Data Governance, and
Data Quality forMaster Data Management
David LoshinKnowledge Integrity, Inc.
www.knowledge-integrity.com
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
2
AgendaWe will explore how data profiling, data quality, and data governance will support the emergence of the core master data program:
Master data management basicsMaster object identification and modelingData consolidation and integrationOperational master service needsData governance brings it all under controlReview of Data Quality technology for MDM
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
3
What is “Master Data”?Core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections, and taxonomies, e.g.:
CustomersSuppliersPartsProductsLocationsContact mechanisms
“David Loshin purchased seat 15B on US Airways flight 238 from Baltimore (BWI) to San Francisco (SFO) on July 20, 2006.”
Master Data Object Value
Customer David Loshin
Product Seat 15B
Flight 238
Location BWI
Location SFO
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
4
What is Master Data Management?
Master Data Management (MDM) incorporates the business applications, information management methods, and data management tools to implement the policies, procedures, infrastructure that support the capture, integration, and subsequent shared use of accurate, timely, consistent and complete master data.
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
5
Challenges to MDM Success
•Uncoordinated Enterprise•Data Ownership•Performance Management
Organizational
•Data Quality•Data Standards•Data Governance
Operational
•MDM Architecture•Transition Plan•System Performance
Technical
Understanding and addressing these issues providesgrounding for the core MDM operational components
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
6
The MDM Component Layer Model
Architecture
Identification
Management
Governance
Integration
BusinessProcess
Management
Master Data Model MDM System Architecture
MDM Service Layer Architecture
DataStandards
MetadataManagement
DataQuality
DataStewardship
Administration/Configuration
Hierarchy Management Identity Management
Migration Plan
Record Linkage Merging and Consolidation
Identity Search and Resolution
MDM Component Service Layer
Application Integration and Synchronization Service Layer
MDM Business Component Layer
Business Rules
Business Process Integration
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
7
ArchitectureMaster data modelMDM system frameworkService layer architecture
ArchitectureMaster Data Model MDM System Architecture
MDM Service Layer Architecture
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
8
Central Master/Coexistence
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
9
Registry
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
10
Transaction Hub
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
11
MDM Architectural StylesDifferent underlying architectural approaches influence ways of integrating data management services:
Data assessment and master object identificationUnderlying master data modelData profilingParsing, Standardization, NormalizationCleansingEnhancement
Two high-level approaches:RepositoryRegistry
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
12
Repository – Initial Data Consolidation1. Data sources are staged and cleansed2. Cleansed data is brought into a master
staging area for consolidation3. Records are checked for previous registration
with the unique identification services4. Survivorship rules are applied5. For new records, a unique identifier is created6. The resulting record is enhanced7. The enhanced record is stored in the master
repository8. The unique identification service is notified
MasterRepository
UID1
IntakeSourcedata Staging Cleanse EnhanceSurvivorship
UniqueIdentification
Service
3
2
1
4
5
6
8
MasterStaging
IntakeSourcedata Staging Cleanse
IntakeSourcedata Staging Cleanse 7
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
13
Registry – Initial Data Consolidation1. Source data is cleansed and enhanced2. The unique identification service is consulted with the
identifying information to determine if the entity has already been registered, and if not, to generate a unique identifier for registration
3. The records are managed within their respective data locations, and
4. The records are tagged with the returned unique identifier5. The mapping from the registry to the data locations is
registered with the associated identifying information
IntakeSource 2data Staging Cleanse
1
IntakeSource 3data Staging Cleanse
IntakeSource 1data Staging Cleanse
Enhance
Enhance
Enhance
UniqueIdentification
Service
UID1
UID1
UID1
UID12
3
4
5
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
14
Master Data ModelLimited universe of common master objects
Party, customer, product, part, supplier, claim, instrument
Universal models may be suitable as starting pointsChallenges:
Resolution of metadata in a consistent mannerCreating a model that accommodates all applications properly
CUST
First VARCHAR(15)Middle VARCHAR(15)Last VARCHAR(21)Address1 VARCHAR(45)Address2 VARCHAR(45)City VARCHAR(30)State CHAR(2)ZIP CHAR(9)
Nightingale-Patterson
CUSTOMER
FirstName VARCHAR(14)MiddleName VARCHAR(14)LastName VARCHAR(20)TelNum NUMERIC(10)
Nightingale-Patterso
Last VARCHAR(21)
LastName VARCHAR(20)
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
15
Master Object Identification and ModelingDocument business processesSegregate services from dependent dataIdentify target data areasInventory and identify candidate master entitiesSource data analysisAssembling the Target Model
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
16
Business Process ModelingA “business process” is a coordinated set of activities intended to achieve a desired goal or produce a desired output productModels are designed to capture both the high level and detail of the business process
activityinputs
controls
actors
output
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
17
Sharing Enterprise Data ObjectsInteractions between activities depend on shared data:
activity
activity
activity
Instance values of common data types representing business facts communicate input and control during the business process
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
18
Segregating Master Data ServicesEvaluating the catalog of data elements through which business processes are coordinated yields a candidate list of master data objectsAssembling that catalog enables a review of the business application services
activity
Identify application services at the business level as they access master data objects
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
19
Finding Master DataWhat data elements constitute our “master data”?How do we locate and isolate master data objects that exist within the enterprise?How do we assess the variances between the different representations in order to consolidate instances into a single view?
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
20
Centralizing Semantics – Basic QuestionsMaster data will be distributed and replicated across the application environmentsThe initial goal is to consolidate the replicated copies into a single repositoryBefore materializing a single master record for any entity, one must be able to:
Discover which data resources may contain entity informationUnderstand which attributes carry identifying informationExtract identifying information from the data resourceTransform the identifying information into a standardized or canonical formEstablish similarity to other standardized records
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
21
CUSTOMER
FirstName VARCHAR(14)MiddleName VARCHAR(14)LastName VARCHAR(30)TelNum NUMERIC(10)
CUST
First VARCHAR(15)Middle VARCHAR(15)Last VARCHAR(40)Address1 VARCHAR(45)Address2 VARCHAR(45)City VARCHAR(30)State CHAR(2)ZIP CHAR(9)
Structures and Semantics Vary Across the Enterprise
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
22
CUSTOMER
FirstName VARCHAR(14)MiddleName VARCHAR(14)LastName VARCHAR(30)TelNum NUMERIC(10)
CUST
First VARCHAR(15)Middle VARCHAR(15)Last VARCHAR(40)Address1 VARCHAR(45)Address2 VARCHAR(45)City VARCHAR(30)State CHAR(2)ZIP CHAR(9)
CUSTOMER
CUST
Identifying Information
Contact Information
“A customer is an individual who has
purchased one of our products”
“A customer is an individual to whom we have delivered one of
our products”
Syntax Structure Semantics
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
23
Master Object ResolutionResolution of candidate master data types requires a compete view of what composes the information architectureThis entails cataloging data sets, their attributes, data domains, definitions, contexts, and semanticsThis view must facilitate the resolution of:
Format at the element level,Structure at the instance level, andSemantics across all levels
This introduces three challenges:Collecting and analyzing master metadataResolving similarity in structureUnderstanding and unifying master data semantics
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
24
Collecting and Analyzing Master MetadataMetadata sources:
Data dictionariesE/R modelsCOBOL copybooksSubject matter experts
Document all data element characteristics within a metadata repository using a standard representationThe consolidated metadata repository will enumerate data characteristics in a standardized wayThe standard representation enables statistical analyses such as:
Assessing frequency of occurrence of namesComparing the lengths of name fieldsDiscovering dependencies between attribute names and assigned types
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
25
Resolving Similarity in StructureDifferent underlying master meta-models are likely to share many commonalitiesStructures will probably reflect similar collections of attributes and relations
Example: Many customer data sets will contain customer names along with variant contact mechanisms such as addresses or telephone numbers
The data analyst can review existing models to identify similar object structuresTwo “hints” in structural similarity:
Overlapping structures may reflect data sets carrying identifying information with variant sets of attributesDerived structures potentially reflect an embedding of core master data attributes within a specialized version of the same object
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
26
Unifying Master Data SemanticsWhat is the qualitative difference between pure syntactic/structural metadata and its underlying meaning?Review semantic consistency in data element naming and data types, sizes, and structuresNext, document business meanings:
What are the data element definitions?Are there authoritative sources for these definitions?Do similar objects have different business meanings?
Resolution of variance will help in standardizing both the representations and meanings for identified master data object types
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
27
Proposing the Target ModelEvaluate the catalog of identified data elements
Seek out the frequently created, referenced, modified, retired
Assess object organizational structureEvaluate conceptual structures as they map to business process useExample: locations are composed of street, city, state, ZIP code
Identify and resolve anomalies across data element sizes, types, formatsPropose an object modelValidate the object model within the information frameworkValidate the object model within the application framework
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
28
Data Profiling for Master Data AnalysisCharacteristics of data element metadata is critical for analysisCombination of artifact review and empirical analysisData profiling can provide “ground-truth” evidence of consistency with metadata
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
29
The Real Data Quality ProblemThere are no objective measures of data qualityThe only industry metrics are based on name deduplication or address standardizationData cleansing does not address long term quality improvementThe answer: Use data profiling to analyze the current state of data and correlate anomalies to data consumer expectations
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
30
The Assessment ProcessData set selection, profiling and profile reviewImpact AnalysisRule Definition, Review, and RefinementObtain Objective Measurements
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
31
What Can Go Wrong with Data?Data entry errorsData conversion errorsUnexpected valuesMismatched syntax, formats and structuresInconsistent valuesMissing values
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
32
Reasons for Data DegradationLimited tracking of documentation with systemExploitation of “shortcuts” for implementationMultiple front end interfaces to same back endLimited system functionality driving “creativity”Changes in use and perception of data
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
33
Forensics and ArcheologyUsing the proper analysis, profiling reveals errors in the data, as well as insight into their originNaïve analysts use profiling to find and correct erroneous dataSavvy analysts use profiling to find and correct erroneous processes
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
34
Hidden SecretsDefault null valuesProgramming errorsData element domainsFormat ConstraintsAbstract Data TypesAugmented DomainsOverloaded attribute useOrphansUnused attributes
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
35
Default Null Values
Social Security Numbers ‘000000000’, ‘999999999’
Names‘?’, ‘??’, ‘???’, ‘????’, ‘?????’,‘N/A’, ‘NA’, ‘NONE’, ‘None’, ‘UNKNOWN’, ‘Unknown’, ‘n/a’, ‘na’, ‘none’, ‘unknown’
Phone Numbers‘No phone number provided’000-000-0000999-999-9999
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
36
Programming ErrorsIn one application, we had been assured that all records were evaluated prior to insertion into the database to ensure uniquenessProfiling results showed that numerous duplicates existedAfter bringing this to the attention of the programmers, they acknowledged that a programming error existed and was to be fixed in the next release
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
37
Data Element DomainsAn evaluation of a “Sex” column revealed three distinct values: “M”, “F”, and “U”
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
38
Format ConstraintsPattern analysis of a telephone number field showed that there were telephone numbers that contained both numerals and alphabetic letters:
1-800-MATTRES
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
39
Abstract Data TypesIn legacy systems, pattern analysis can reveal embedded use of abstract data types such as dates, SSNs, telephone numbers, etc.
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
40
Augmented Domains A set of manufactured values are added to a known domain to account for alternate use of the attribute
Additional “local codes” added to FIPS county codes to account for more than one location per county
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
41
Overloaded Attribute UseIn one table a column named FOREIGN_COUNTRY contained telephone numbers and email addresses
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
42
OrphansIn a review of a legacy foreign key relationship for case management, it was determined that a relatively large percentage of records in one table did not have a corresponding record in the other table
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
43
Unused AttributesIn one data quality assessment, we found one set of columns that were null a significant part of the time, and another that always had the same valueThese results indicate that the attribute is not actually being used
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
44
More Hidden SecretsMetadata
Data typesAttribute sizesRangesValue set cardinality
StructureEmbedded tablesRelational structureRelationship cardinalityKey relationshipsMaster reference data
Business RulesDomain constraintsDependency constraintsDerivation rulesCompleteness rulesConsistency rules
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
45
Summary of Data ProfilingProfiling used to:
Provide empirical analysis of data value setsAssess existing metadata against published descriptionsAssess quality of each data setIdentify outliers for further reviewCharacterize feasibility of consolidation
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
46
IdentificationEvery object subject to “mastering” is managed using a unique representation within the master repositoryAny time data intended to refer to that object is seen by an application, its unique representation must be found, verified, and presented back to the application by the MDM platform
IdentificationRecord Linkage Merging and Consolidation
Identity Search and Resolution
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
47
IdentificationUnique IdentificationParsing and StandardizationNormalizationRecord LinkageSurvival Strategies
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
48
Example – Unique Identification
David H LoshinH David LoshinDavid Howard LoshinDavid LoshinHoward Loshin
David LosrinDavid Lotion
David Lasrin
David LoskinDavid Lashin
David Laskin
David Loshing
Mr. LoshinJill and David LoshinD LoshinLoshin DavidLoshin, Howard
The Loshin FamilyHD Loshin
Mr. David Loshin
Enterprise Knowledge Management : The Data Quality Approach (The Morgan Kaufmann Series in Data Management Systems) (Paperback) by David Loshin
Business Intelligence: The Savvy Manager's Guide (The Savvy Manager's Guides) (Paperback -June 2003) by David Loshin
The Geometrical Optics Workbook(Paperback - June 1991) by David S Loshin
?
Howard David Loshin
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
49
Identity Search and Resolution
David Howard Loshin
Loshin, Howard
Howard David LoshinHoward David LoshinHoward David Loshin
Objective: Provide the services that will seek the matching record in the master index that represents the “queried” object
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
50
Record Linkage – Parsing and StandardizationParsing
Identifying and tagging pieces of each data value within a semantic context
StandardizationCorrecting terms based on defined rulesAssembling components into recognized patterns
NormalizationAdjust data elements into a common representation
TransformationRule-based modifications into target canonical representationsTransformation into target format
Loshin, HowardD LoshinHowardDFirst HowardMiddle DLast Loshin
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
51
Record Linkage – Similarity ScoringRecords are deemed to be the same when their identifying information matches within a reasonable degree of similaritySpecific bits and pieces of identifying information are necessary to enable distinction between unique objectsIdentifying values are weighted as to their contribution to the similarity scoreThe analysts must perform a continual assessment of similarity criteria to inoculate against false positives
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
52
Merging and ConsolidationStatic merging and consolidation takes existing data sets, seeks to segment records into equivalence classes representing the same entity, and merges data from matched records into a single master record
Depends on automationThresholds for matching set ahead of timeRecords between thresholds pulled for administrative review
Inline merging and consolidation matches new instances against existing master records and updates the master records with validated current changes
Integrates rules with probabilistic algorithms
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
53
Merging and Consolidation
Knowledge Integrity, Inc. 301-754-6350
Loshin 301-754-6350DavidHoward
Knowledge Integrity Incorporated 301 754-6350
LotionDavid 1163 Kersey Rd
David Loshin
Knowledge Integrity
301-754-6350
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
54
SurvivorshipRules govern the determination of surviving data element values
Quality of data sourceCurrency of data sourcePriority
Loshin 301-754-6350DavidHoward
LotionDavid 1163 Kersey Rd
Loshin 301-754-6350DavidHoward 1163 Kersey Rd
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
55
Identification – SummaryValues are parsed into recognizable componentsComponents are standardizedIdentifying information is collected for searchMatching is facilitated via similarity scoringMatched records are consolidated during integration
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
56
Operational Master Data ServicesMDM operations require ongoing support for a number of data services:
Inline classification and matchingData cleansingData enhancementUnique identification
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
57
Inline ClassificationMaster objects are grouped by their characteristics into hierarchiesEach presented object is subjected to a classification service to assist in narrowing scope for matching
¼” metal l thr philClassification
Service
“screw”
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
58
Data Standardization/CleansingKnowledge base is used to transform into a standardized format
¼” metal l thr phil
” in
thr threaded
phil Philips
l left
hx Hexr right
flt Flat
wd wood
”
thr
phil
l
in
threaded
Philips
left
1/4metal
CleansingService
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
59
Inline MatchingObjects are submitted to the matching service to determine if an instance already exists in the master repositoryRelies on traditional parsing/standardization/linkage
¼” metal l thr philMatchingService
“screw”
MasterRepositoryscrews
parts
Operationalprocess
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
60
Unique IdentificationFor unmatched records:
Identifying information is extractedNew unique identifier is created by UID serviceRecord is stored to master repositoryMaster index is updatedUpdates are communicated
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
61
Data EnhancementReference data sets are consulted inline for value-added data appendExample:
Customer data is enhanced with geo-demographicsVendor data is enhanced with reliability scoreProduct data is enhanced with tolerance metrics
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
62
GovernanceData StandardsData Quality ManagementData Stewardship
GovernanceDataStandards
MetadataManagement
DataQuality
DataStewardship
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
63
Aligning Data Objectives and Business Strategy
Clarify and understand the existing “Information Architecture”Create an inventory of data assets
Applications, data assets, documentation, metadata, usageInventory of data elements and “owning” application
Finance
Sales
Marketing
HumanResources
Legal
Compliance
CustomerService
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
64
Standardizing Data Sharing
Customers
Partners
LoBs
In any enterprise, we can be confident that information exchanges are understood the same way through the definition and use of data standards
Standard definitions, processes, exchanges
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
65
Standards: Challenges for Critical Data ElementsAbsence of clarity
…makes it difficult to determine semantics
Ambiguity in definition…introduces conflict into the process
Lack of Precision…leads to inconsistency in representation and reporting
Variant source systems and frameworks…encourage “turf-oriented” biases
Flexibility of data motion mechanisms…leads to multitude of approaches for data movement
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
66
Data Quality Management GoalsEvaluate business impact of poor data quality and develop ROI models for Data Quality activitiesDocument the information architecture showing data models, metadata, information usage, and information flow throughout enterpriseIdentify, document, and validate Data Quality expectationsEducate your staff in ways to integrate Data Quality as an integral component of system development lifecycleGovernance framework for Data Quality event tracking and ongoing Data Quality measurement, monitoring, and reporting of compliance with customer expectationsConsolidate current and planned Data Quality guidelines, policies, and activities
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
67
Engineering Data Quality into the System
Analyze/profiledata
Assessdata qualitydimensions
Application
IMS
Flat File
RDBMS
VSAM
Createmonitoring
system
Recommenddata
transformations
Generate data quality reports
Send data quality reportsto data owners
Improved enterprisedata quality
Data quality,Validity, &
Transformationrules
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
68
Operationalizing Data GovernanceActualization of data governance activities enables:
The identification of explicit and hidden risks associated with data expectationsThe actualization of the implementation of business policyOversight of the definition of critical data elementsMonitoring and auditing information quality rule complianceManaging enterprise data ownership and stewardshipCoordination and oversight of enterprise data quality
In general, data governance provides management oversight for organizational observance of different kinds of information policies
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
69
Roles and Responsibilities
Executive Sponsorship
Data Governance Oversight
Data Steering Committee
Provide senior management support at the C-level, warrants the enterprise adoption of measurably high quality data, and negotiates quality SLAs with external data suppliers.
LOB Data GovernanceLOB Data GovernanceLOB Data GovernanceLOB Data Governance
Strategic committee composed of business clients to oversee the governance program, ensure that governance priorities are set and abided by, delineates data accountability.
Tactical team tasked with ensuring that data activities have defined metrics and acceptance thresholds for quality meeting business client expectations, manages governance across lines of business, sets priorities for LOBs and communicates opportunities to the Governance Oversight committee.
Data governance structure at the line of business level, defines data quality criteria for LOB applications, delineates stewardship roles, reports activities and issues to Data Coordination Council
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
70
Stewardship: Manual InterventionIssues with addressing data quality events:
Immediate remediation of flawed data – does this imply data correction?Not all data flaws can be captured via automated processes –this implies manual reviews
Accuracy may only be measured by comparing values directlyCarefully integrate manual intervention when necessary in a controlled manner
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
71
Summary - GovernanceOne of critical success factors for MDM deploymentBenefits all stakeholdersEstablishes link between business objectives and information quality
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
72
MDM Functional Services
Browsing &Editing
PropagationSynchronization
RulesManager
WorkflowManager
Create
AccessControl
Read Modify Retire
AuthenticationSecurityManagement
Data Quality MetadataManagementIntegration
Audit
Modeling IdentityManagementMatch/Merge Hierarchy
Management
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
73
Example: Generic Master Data Services
“Object Locate”
Master Index
“Object Factory”
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
74
Data Quality Technologies for MDMData parsing and standardizationRecord Linkage/MatchingData scrubbing/cleansingData enhancementData profilingData auditing/monitoring
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
75
Data Parsing and StandardizationDefined patterns fed into “rules engine” used for:
Distinguishing between valid and invalid stringsTriggering actions when invalid strings are recognized
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
76
Parsing and Standardization Example
(999)999-9999
999-999-9999999.999.99991-(999)999-9999
1-999-999-99991 999.999.9999
1 (999) 999-9999
Area CodeExchangeLine
3017546350
(301) 754-6350(999) 999-9999
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
77
Parsing and Standardization - ApproachSimplest approach: standard regular expression or context-free parsing with actions attached to success statesMore complex:
Standard parsing with variant lookaheadContext-sensitive parsingTable-driven string/pattern matching
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
78
Data Scrubbing/CleansingIdentify and correct flawed data
Data imputationAddress correctionElimination of extraneous dataDuplicate eliminationPattern-based transformations
Complements (and relies on) parsing and standardization
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
79
Cleansing Example
IBMInternational Bus Mach
Int’l Bus MachIntl Business MachinesIBM Inc.
International Business Machines
Lookup Table
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
80
Matching/Record LinkageIdentity recognition and harmonizationApproaches used to evaluate “similarity” of recordsUse in:
Duplicate analysis and eliminationMerge/PurgeHouseholdingData EnhancementData CleansingCustomer Data Integration
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
81
Matching/Record Linkage - Example
Knowledge Integrity, Inc. 301-754-6350
Loshin 301-754-6350DavidHoward
Knowledge Integrity Incorporated 301 754-6350
LotionDavid 1163 Kersey Rd
David Loshin
Knowledge Integrity
301-754-6350
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
82
Matching/Record LinkagePairwise comparisons reliant on similarity scoring (inefficient)More efficient algorithms block records into candidate sets, then do pairwise comparisons
Statistical (Fellegi & Sunter, Jaro, Winkler)Artificial Intelligence (e.g., Borthwick’s MEDD)
Approximate string matchingPhonetic compression (Soundex, NYSIIS)N-gramsWinkler (probability-based)
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
83
Data EnhancementData improvement process that relies on record linkageValue-added improvement from third-party data sets:
Address correctionGeo-Demographic/Psychographic importsList append
Typically partnered with data providers
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
84
Subject Number Percent
OCCUPANCY STATUS
Total housing units 1,704 100.0
Occupied housing units 1,681 98.7
Vacant housing units 23 1.3
Tenure
Occupied housing units 1,681 100.0
Owner-occupied housing units 1,566 93.2
Renter-occupied housing units 115 6.8
Subject
SCHOOL ENROLLMENTPopulation 3 years and over enrolled in school 1,571 100
Nursery school, preschool 142 9Kindergarten 102 6.5Elementary school (grades 1-8) 662 42.1High school (grades 9-12) 335 21.3College or graduate school 330 21
EDUCATIONAL ATTAINMENTPopulation 25 years and over 3,438 100
Less than 9th grade 122 3.59th to 12th grade, no diploma 249 7.2High school graduate (includes equivalency) 996 29Some college, no degree 674 19.6Associate degree 180 5.2Bachelor's degree 676 19.7Graduate or professional degree 541 15.7
Percent high school graduate or higher 89.2 (X)Percent bachelor's degree or higher 35.4 (X)
Number Percent
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
85
Data ProfilingEmpirical analysis of “ground truth”Couples:
Statistical analysisFunctional dependency analysisAssociation rule analysis
The statistical analysis part is a no-brainerThe other stuff is hard
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
86
Data Profiling – ApproachColumn Profiling/Frequency Analysis (Easy)
Done in place using database capability (e.g., SQL), or using hash tables
Cross-Table/Redundancy (Harder)May be done using queries, or using set intersection algorithms
Cross-Column/Dependency (Hardest)Uses fast iterative set partitioning algorithmsA priori, TANE, FD_MINE
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
87
Data Auditing/MonitoringProactive assessment of compliance with defined rulesProvide reporting or scorecardingUseful for
“Quality” processes“Data debugging”Business impact assessment
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
88
IssuesData quality is seen as a technical problem requiring a technical solutionDisconnect between information value and achieving business objectives leads to ignoring DQ until it is too lateData is often corrected, instead of flawed processes
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
89
Integration with Master DataMaster reference data managementSemantic metadata integrated with information qualityData quality activity trackingService-oriented processingCorrelating business impacts to information valueComplex rule frameworks and systemsBusiness rule definition and management
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
90
Tools Summary1st generation tools focus on “correction”2nd generation tools look at analysis and discoveryGrowing MDM needs:
Standardized rulesBusiness impact correlationPerformance metricsSemantic metadataGeneric metadata-based descriptive capabilityHistorical auditing and tracking
© 2007 Knowledge Integrity, inc.www.knowledge-integrity.com
(301)754-6350
91
Questions?If you have questions, comments, or suggestions, please contact meDavid Loshin301-754-6350loshin@knowledge-integrity.com
Recommended