Session 1: QualityStage Essentials




  • Session 1: QualityStage Essentials

  • Objectives

    Data Quality
    Introduction to QualityStage
    Developing with QualityStage
    Investigate and Data Quality Assessment
    Data Preparation
    Standardize
    Rule Set Overrides
    Match
    Survive

  • Data Migration Challenges

  • Data Quality Increases ROI

    Better decision making
    Improved marketing accuracy and scope
    Increased knowledge of customers
    Improved inventory and asset management
    Improved risk analysis, auditing and reporting

  • Data Quality
    There are two significant definitions of data quality:
    Inherent data quality: correctness or accuracy of data - the degree to which data accurately reflects the real-world object that it represents
    Pragmatic data quality: the value that accurate data has in supporting the work of the enterprise. Data that does not help the enterprise accomplish its mission has no quality, no matter how accurate it is.

  • Data Quality Challenges
    Different or inconsistent standards in structure, format or values
    Missing data, default values
    Spelling errors, data in wrong fields
    Buried information
    Data myopia
    Data anomalies

  • Different or Inconsistent Standards
    The same customers across three sources:
    MARC DILORENZO ESQ BOSTON / MRS DENNIS MARIO HARTFORD / MR & MRS T. ROBERTS CHICAGO
    DILORENZO, MARK 6793 / MARIO, DENISE 0215 / ROBERTS, TOM & MARY 8721
    MARK DI LORENZO MA93 / DENIS E. MARIO CT15 / TOM & MARY ROBERTS IL21

  • Missing Data & Default Values
    Fields: NAME, SOC. SEC. #, TELEPHONE
    Do the field values match the meta data labels?

  • Buried Information
    Legacy meta data: NAME 1, ADDRESS 1, ADDRESS 2, ADDRESS 3, ADDRESS 4, ADDRESS 5
    Legacy record values: Robert A. Jones TTE / Robert Jones Jr. First Natl Provident / FBO Elaine & Michael Lincoln UTA DTD 3-30-89 / 59 Via Hermosa / c/o Colleen Mailer Esq / Seattle, WA 98101-2345

  • The Anomalies Nightmare
    Account numbers: 90328574, 90328575, 90238495, 90233479, 90233489, 90234889, 90345672
    Names: IBM, I.B.M. Inc., International Bus. M., Int. Bus. Machines, Inter-Nation Consults, Int. Bus. Consultants, I.B. Manufacturing
    Amounts: 8,494.00, 3,432.00, 2,243.00, 5,900.00, 6,800.00, 10,243.00, 15,999.00
    Addresses: 187 N.Pk. Str. Salem NH 01456 / 187 N.Pk. St. Sarem NH 01456 / 187 No. Park St Salem NH 04156 / 187 Park Ave Salem NH 04156 / 15 Main St. Andover MA 02341 / PO Box 9 Boston MA 02210 / Park Blvd. Boston MA 04106
    Issues: spelling errors, anomalies, no common key, lack of standards

  • What data challenges do you face?
    Acct # | Name | Address | City | State | Zip | Note
    5154155 | Peter J. Lalonde | 40 Beacon St. | Melrose | Mass | 02176 | ODP
    5152335 | LaLonde, Peter | 76 George 617-210-0824 | Boston | MA | 02111 | YES
    5146261 | Lalonde, Sofie | 40 Bacon Street | Melrose | MA | | CHK ID
    87121 | Pete & Soph Lalond | 76 George Road | Boston | MASS | | FR Alert
    87458 | P. Lalonde FBO S. Lalonde | 40 Becon Rd. | Melrose | MA | 02176 |

  • Common Data Quality Approaches
    Analysis and Assessment: enterprise-level Data Quality Assessment (DQA); project-level data investigation
    Data Re-engineering Methods: standardization, record linkage/matching, consolidation
    Information Engineering Methods: initial load, net change, real-time
    Ongoing Metrics: project-level Post-Data Quality Assessment (PDQA); enterprise-level repeated DQAs to establish trends

  • Data Re-engineering Methodology
    Understanding the quality of your data and its impact on achieving success
    Standardizing the content, structure and meaning of data in preparation for matching and downstream processing
    Identifying and linking duplicate or like entities
    Selecting the best-of-breed data for downstream processing
    Questions to ask: Do your data sources contain what you think they do? Does your data mean what you think it does? Can you correct and improve the quality of your data? Can you make the data meaningful to users? Can you deliver and update the data in a timely manner? How do you match records with the same meaning? Which source should you use for this project? Is your data sent to users based on events or content? Are you able to keep data synchronized across systems?

  • Why Investigate
    Discover trends and potential anomalies in the data
    100% visibility of single-domain and free-form fields
    Identify invalid and default values
    Reveal undocumented business rules and common terminology
    Verify the reliability of the data in the fields to be used as matching criteria
    Gain a complete understanding of data within context

  • How to Investigate
    Single domain (character and type)
    Free-form text (Word)

  • What is Standardize
    The patterns revealed by investigation drive the conditioning rules
    Pattern manipulation: applying business logic to data chaos
    Standards definition: enforcing business standards on data elements
    Field structuring: transforming the input into an output that meets the business requirement

  • How to Standardize
    Parsing specific data fields into smaller, lower-level (atomic) data elements
    Categorization of identified elements: separation of Name, Address and Area from free-form Name & Address lines; identification of distinct material categories (e.g. sutures vs. orthopedic equipment)
    Refinement of a data element:
    Name = MS GRACY E MATHEWS becomes Title = MS, First Name = GRACY, Middle Name = E, Last Name = MATHEWS
    Part Description = BLK ACER MONITOR becomes Color = BLACK, Type = ACER, Part = MONITOR

  • Why Standardize
    Normalize values in data fields to standard values:
    Transform First Name = MIKE to MICHAEL
    Transform Title = Doctor to Dr
    Transform Address = ST. Michael Street to Saint Michael St
    Transform Color = BLK to BLACK
    Apply phonetic coding (NYSIIS, Soundex) to key words, typically Name fields (first, last, street, city)
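    As a rough companion to the phonetic coding bullet above, here is a minimal Python sketch of the classic Soundex algorithm (not QualityStage's own implementation; NYSIIS follows a different, longer rule set):

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digits coding consonant groups.
    A minimal sketch, not the QualityStage implementation."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:      # skip repeats of the same sound
            result += code
        if ch not in "HW":             # H and W do not break a repeated sound
            prev = code
    return (result + "000")[:4]

print(soundex("ROBERT"), soundex("RUPERT"))   # R163 R163
```

    Because ROBERT and RUPERT share the code R163, comparing or blocking on the phonetic value tolerates spelling variation in a way that exact comparison cannot.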

  • QualityStage Standardize
    Highly flexible pattern recognition language
    Field- or domain-specific standardization (i.e. unique rules for names vs. addresses vs. dates, etc.)
    Customizable classification and standardization tables
    Utilizes results from data investigation

  • QualityStage Standardize Example

  • QualityStage Standardize Example

  • Match
    Conditioned data and QualityStage's matching engine link the previously unlinkable
    Match construction: the reliability of the input data defines the match result
    Statistical analysis and match scoring: linkage probability is determined on a sliding scale by field-level comparison
    Report generation: all business rules applied have an easy-to-understand report structure

  • What is Match
    Identifying all records on one file that correspond to similar records on another file
    Identifying duplicate records in one file
    Building relationships between records in multiple files
    Performing statistical and probabilistic matching
    Calculating a score based on the probability of a match

  • How to Match
    Single file (Unduplication) or two file (Geomatch)
    Different match comparisons for different types of data (e.g. exact character, uncertainty/fuzzy match, keystroke errors, multiple word comparison)
    Generation of composite weights from multiple fields
    Use of probabilistic or statistical algorithms
    Application of match cutoffs or thresholds to identify automatic and clerical match levels
    Incorporation of override weights to assess particular data conditions (e.g. default values, discriminatory elements)

  • QualityStage Match
    Over 25 match comparison algorithms providing a full spectrum of fuzzy matching functions
    Statistically based method for determining matches (probabilistic record linkage theory)
    Field-by-field comparisons for agreement or disagreement
    Assignment of weights or penalties
    Overrides for unique data conditions
    Scores results to determine the probability of matched records
    Thresholds for final match determination
    Ability to measure the informational content of data
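    To make the composite-weight idea concrete, the sketch below scores a record pair field by field and applies match/clerical cutoffs. The field names, weights and thresholds are invented for illustration; QualityStage derives its weights from the data rather than hard-coding them.

```python
# Hypothetical per-field (agreement, disagreement) weights and cutoffs.
WEIGHTS = {"last_name": (8.0, -4.0), "zip": (5.0, -3.0), "house_no": (4.0, -2.0)}
MATCH_CUTOFF, CLERICAL_CUTOFF = 10.0, 4.0

def score(rec_a: dict, rec_b: dict) -> float:
    """Sum agreement weight when field values agree, disagreement weight otherwise."""
    total = 0.0
    for field, (agree, disagree) in WEIGHTS.items():
        total += agree if rec_a.get(field) == rec_b.get(field) else disagree
    return total

def classify(total: float) -> str:
    if total >= MATCH_CUTOFF:
        return "match"
    return "clerical review" if total >= CLERICAL_CUTOFF else "non-match"

a = {"last_name": "GEROSA", "zip": "02148", "house_no": "92"}
b = {"last_name": "GEROSA", "zip": "02148", "house_no": "29"}
print(classify(score(a, b)))   # agreement on name and zip outweighs house number -> match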

  • QualityStage Match Examples

  • What is Survive
    Creation of best-of-breed surviving data based on record- or field-level information
    Development of a cross-reference file of related keys
    Production of load exception reports
    Creating output formats: relational tables with primary and foreign keys, transactions to update databases, cross-reference files, synonym tables

  • Why Survive
    Provide a consolidated view of the data containing the best-of-breed data
    Resolve conflicting values and fill missing values
    Cross-populate the best available data
    Implement business and mapping rules
    Create cross-reference keys

  • How to Survive
    Highly flexible rules
    Record- or field-level survivorship decisions
    Rules can be based upon data frequency, data recency (i.e. date), data source, value presence or length
    Rules can incorporate multiple tests
    QualityStage features point-and-click (GUI-based) creation of business rules to determine the best-of-breed surviving data, performed at the record or field level
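    A minimal sketch of field-level survivorship. The rule used here ("longest non-blank value wins, most recent record breaks ties") is just one example of the frequency/recency/length style rules listed above, not a QualityStage default; the field names and dates are invented.

```python
from datetime import date

# Duplicate records for one matched customer; values are illustrative.
records = [
    {"name": "P. LALONDE",       "address": "40 BEACON ST",     "updated": date(2001, 3, 1)},
    {"name": "PETER J. LALONDE", "address": "",                 "updated": date(2003, 7, 9)},
    {"name": "PETER LALONDE",    "address": "40 BEACON STREET", "updated": date(2002, 1, 15)},
]

def survive_field(recs, field):
    """Pick the value with the greatest non-blank length; tie-break on recency."""
    candidates = [r for r in recs if r[field].strip()]
    best = max(candidates, key=lambda r: (len(r[field].strip()), r["updated"]))
    return best[field]

best_record = {f: survive_field(records, f) for f in ("name", "address")}
print(best_record)   # {'name': 'PETER J. LALONDE', 'address': '40 BEACON STREET'}
```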

  • QualityStage Survive Examples

  • Data Re-engineering Methods

  • Exercise 1-1: Course Project
    Course business case: WINN Insurance CRM project
    See QualityStage Essentials Exercises, page 4

  • Course Project Design
    Investigate: Assess Data Quality
    Standardize Country
    Add Unique Key
    Append Data to a common format
    Select US Data for further processing
    Condition Name, Address and Area
    Investigate Conditioned Results
    Apply User Overrides
    Identify Duplicate Customer Records
    Survive the Best Customer Record

  • Module Summary
    Five common data quality contaminants: different standards; missing and default values; spillover and buried information; anomalies; no consolidated view
    Approaches to data quality
    Data re-engineering methods

  • Introduction to QualityStage

  • Why QualityStage
    Probabilistic record linkage results in the highest level of accurate, complete and justifiable match rates
    Most flexible parsing/standardization capabilities; handles complex free-form data
    Ability to verify 200+ country addresses allows for global support
    Transparent parallelism exploits multiple CPUs, providing unmatched performance and scalability
    Bi-directional meta data exchange ensures users understand the data
    Productivity, connectivity and interoperability via tight integration with DataStage and RTI Services

  • QualityStage Architecture
    Designer client (Windows NT) communicates with the QualityStage Server platforms over TCP/IP (FTP)
    Build once, run anywhere

  • QualityStage Designer
    Client GUI for designing projects (Windows NT, 2000, XP)
    Enter meta data, define stages, build jobs, maintain standardization rules
    Uses the Designer repository

  • Designer - Toolbar
    NEW: Project, Data File definition, Data Field definition, Stage, or Job
    CUT, COPY, PASTE: items listed in the right pane of the work area
    RUN: the job selected in the right pane
    DISPLAY: change the right pane display to large icons, small icons, or details

  • Designer - Rule Sets
    Pre-defined rules for parsing and standardizing: Name, Address, Area (City, State and Zip), multinational address processing
    Validate structure: Tax ID, US Phone, Date, Email
    Append ISO country codes
    Pre-process or filter name, address and area
    Rule sets are stored locally with the Designer (separate from the repository)

  • Designer Rule Set Options
    The rule set name and location are defined in the Designer under File, Designer Options, Standardize Process Definition Dictionary

  • QualityStage Server
    Deployment modes: batch, real-time, real-time via API
    Master Projects Directory: project information is deployed to the server; project work files are stored on the server in project libraries

  • Directory Structure (Designer and Server)

    QualityStage Designer: C:\Ascential\QualityStageDesigner70
    Designer Repository: C:\Ascential\QualityStageDesigner70\QualityStageDesigner.mdb
    Rule Sets: C:\Ascential\QualityStageDesigner70

    QualityStage Server: C:\Ascential\QualityStageServer70
    Master Projects Directory: C:\Projects
    Sample Project Directory: C:\Projects\Quality
    Sample Project Results: C:\Projects\Quality\Data

  • Master Projects Directory

    The Master Projects directory resides on the server
    Multiple users can share the same Master Projects and project directory
    All project libraries are stored under the Master Projects directory

  • Project Libraries
    Project libraries are stored under the Master Projects directory

    Project Library | Description
    Ipe_env.sh | QualityStage environment shell
    Controls | Stage and job control members
    Data | Location of input and output files
    DIC | Stage and job dictionary
    IPICFG | Environment configuration
    Logs | Location of job run logs
    Scripts | Job scripts dependent on the server type
    Temp | Temporary work space

  • QualityStage Licensed Stages
    QualityStage
    WAVES
    Postal Certification Solutions: CASS, SERP, GeoLocator

  • Exercise 2-1: Configure QualityStage
    Configure the Designer for the development server: Run profile, Designer Options, Server Master Projects directory
    Starting the QualityStage Server during the course (development environment)

  • Run Profile
    One or multiple profiles define for the Designer the server component location and access
    Required: Host Type, Host, Server Path, Master Project Directory
    Optional: Alternate Locale, Local Report Data Location

  • Run Profile: Advanced Project Settings
    Location of the input and output data files
    Location of the control members for each stage and job
    Server temporary work location
    Logs for each stage and job
    Scripts to execute jobs

  • Run Profile: FTP Settings
    If you are connecting to a remote server, you need the login ID and password for that server.

  • QualityStage Designer Options
    Local working temp directory on your PC
    Location of the rule sets
    Default location for importing projects
    Preferred editor for reviewing rule sets and result files

  • Module Summary
    QualityStage components and architecture
    Communication: the Designer and Server use TCP/IP (FTP)
    Configuration: user profile, Designer Options, starting the Server
    Projects: projects are defined in the Designer; to deploy and execute jobs, the QualityStage Server must be running; project libraries are stored on the server

  • Developing with QualityStage

  • Module Objectives
    Introduce the concepts, components and methods for developing projects in QualityStage
    After this module you will be able to: define data files and field definitions; build stages and design jobs; deploy and run jobs; locate and review results

  • Application Components
    QualityStage application and project components: Stages, Jobs, Data File Definitions, Meta data, File Name Requirements

  • Stages
    Abbreviate, Build, CASS, Collapse, Format Convert, Investigate, Sort, Standardize, Survive, Transfer, Unijoin, WAVES, Z4Change, Match, Multinational Standardize, Parse, Program, Select, SERP
    ** Licensed stages: additional licensing required

  • What is a Job?
    A job is an executable QualityStage program
    Jobs can be run interactively or in batch mode
    In this course, jobs will be run interactively under the control of the QualityStage Designer

  • Job Development Overview
    Designer: import or enter file definitions and meta data defining your sources and targets; add stages defining the process or task; deploy the job
    Server: run the job; review results

  • Job Development Process
    1. Define data files (enter or import meta data)
    2. Define and build stages
    3. Define the job
    4. Deploy the job (move project definitions to the project libraries on the server)
    5. Run the job
    6. Review results

  • Executing a Job: Deploy and Run
    Deploy: transfer project information (the job script) to the QualityStage Server
    Run: execute the job script on the server

  • QualityStage Job Run Modes
    File Mode: process all records through a job before passing them to the next job
    Data Stream: process each record and pass it immediately on to the next job

  • Exercise 3-1: Deploy and Run
    Open the demo project Quality, select a job, and select the Run button on the toolbar
    Uncheck the Deploy box, choose Execute File Mode, choose Run from Start to End
    Review the project libraries on the server

  • Data File Formats and Definitions
    Data file names: one to eight characters, no spaces or extensions; file names are uppercase and case-sensitive
    Data file location: the Data folder in the project library
    Formats: QualityStage processes fixed record length sequential files of alphanumeric characters

  • Exercise 3-2: Define a Project
    Choose the New icon from the toolbar, then choose Project
    Project Name: WinnCRM; Project Description: Winn Insurance CRM Project
    Choose OK

  • Defining Meta Data
    Data field definitions can be entered or imported into the Designer
    Importing options include: COBOL copybooks, ODBC-enabled sources, MetaStage Metabroker, Visual Warehouse

  • Exercise 3-3: Define a Data File
    Left pane: select Data File Definitions; right pane: right-click and select New File
    Filename: AUTOHOME; File: Auto and Home Policies
    Choose OK

  • Exercise 3-4: Data Field Definitions
    Left pane: select Data File Definitions, then select AUTOHOME
    Right pane: right-click and select New Field
    Complete the field information

  • Lab 3-5: Copy Data File and Field Definitions
    Left pane: select Data File Definitions; right pane: select AUTOHOME, right-click and select COPY
    Left pane: select Data File Definitions; right pane: right-click and select PASTE
    Name File: LIFE, then choose OK

  • Module Summary
    Data file definitions and data file format
    Meta data
    Jobs and Stages
    Run and Deploy
    Project libraries

  • Investigate and Data Quality Assessment

  • Module Objectives
    Describe how the Investigate stage is used to assess data quality in the project life cycle
    Identify the three types of Investigate stage: Character Discrete, Character Concatenate, Word
    Design Investigate stages and run Investigate jobs
    Review and analyze Investigate results

  • Project Planning & Requirements
    Data Assessment, Define Development Plan, Define Business Requirements, Define Data Requirements, Application Design Plan (Requirements and Planning phases)

  • Data Re-engineering Methods

  • High-Level DFD
    Investigate: Assess Data Quality
    Standardize Country
    Add Unique Key
    Append Data to a common format
    Reject non-US Data / Select US Data for further processing
    Pre-Process US Data
    Condition Name, Address and Area
    Investigate Conditioned Results
    Apply User Overrides
    Identify Duplicate Customer Records
    Survive the Best Customer Record

  • Data Assessment
    Verify the domain: review each field and verify the data matches the meta data
    Identify data formats, missing and default values
    Identify data anomalies in format, structure and content
    Discover unwritten business rules
    Identify data preparation requirements

  • Investigate Stage
    Features: analyzes free-form and single-domain fields; provides frequency distributions of distinct values and patterns
    Investigate methods: Character Discrete, Character Concatenate, Word

  • Investigate Methods

    Method | Why
    Character Discrete | Analyzing field values, formats, and domains
    Character Concatenate | Cross-field correlation, checking logical relationships between fields
    Word Investigation | Identifying free-form fields that may require parsing; discovery of key words for classification

  • Investigate Terminology
    Tokens: individual units of data
    Field masks: options that represent the data - Character (C), Type (T), Skipped (X)

    Character Mask | Usage
    C | For viewing the actual character values of the data
    T | For viewing the pattern of the data
    X | For ignoring characters

  • Field Mask Examples
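    The sketch below approximates what the C, T and X mask options do to a value before frequency counts are taken (an illustration of the idea, not QualityStage's masking code):

```python
def apply_mask(value: str, mask: str) -> str:
    """C keeps the character, T keeps only its type (n for digit, a for alpha),
    X drops it. The mask string supplies one option per character position."""
    out = []
    for ch, opt in zip(value, mask):
        if opt == "C":
            out.append(ch)
        elif opt == "T":
            out.append("n" if ch.isdigit() else "a" if ch.isalpha() else ch)
        # "X": character is ignored
    return "".join(out)

dob = "19440225"
print(apply_mask(dob, "C" * 8))     # 19440225 - actual values (domain check)
print(apply_mask(dob, "T" * 8))     # nnnnnnnn - format/pattern check
print(apply_mask(dob, "TTTTXXXX"))  # nnnn     - year pattern only; the rest is ignored
```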

  • Character Discrete: Field Mask (C)haracter
    Usage: domain quality - view the contents of each field to verify that the data values match the field labels
    The Investigate stage generates reports for frequency and pattern references
    Report naming conventions: jobp.FRQ (results sorted by frequency, descending), jobp.SRT (results sorted by field mask, ascending), job.PAT (pattern reference file)

  • Character Discrete - Character Results
    Columns: Field Name, FRQ Count, FRQ %, Field Mask, Sample/Example ([X] indicates a new set of example records)
    DOB     00000908  45.309%  [X] | (blank)
    DOB     00000005   0.250%  [X] | 00000000
    DOB     00000004   0.200%  [X] | 19440225
    DOB     00000004   0.200%  [X] | 19440609
    DOB     00000004   0.200%  [X] | 19460212
    POLNUMB 00000001   0.050%  [X] | 014669402
    POLNUMB 00000208  11.000%  [X] | 617-338-0300
    POLNUMB 00000001   0.050%  [X] | AM07B002470
    POLNUMB 00000001   0.050%  [X] | AM07B002736

  • Character Discrete: Field Mask (T)ype
    Usage: data formats (patterns) - view the format of fields that you suspect may follow or conform to a specific format, e.g. dates, PIN, Tax ID, account numbers
    Generates reports for frequency and pattern references
    Report naming conventions: jobp.FRQ (results sorted by frequency, descending), jobp.SRT (results sorted by field mask, ascending), job.PAT (pattern reference file)

  • Exercise 4-1: Character Discrete Investigate
    Create an Investigate job, identify the type of investigation, select the input file, choose field(s) and mask options, stage and run the job, review the report results

  • Lab 4-1: Character Discrete Investigate, Type T
    Add an Investigate job, identify the type of investigation, select the input file, choose field(s) and mask options, stage and run the job, review the report results

  • Character Concatenate
    Usage: identify field relationships - investigate one or more fields to uncover any relationship between the field values
    Uses combinations of character masks
    Generates reports for frequency and pattern references
    Report naming conventions: jobp.FRQ (results sorted by frequency, descending), jobp.SRT (results sorted by field mask, ascending), job.PAT (pattern reference file)

  • Character Concatenate Results (DOB and DOD fields)
    Columns: FRQ Count, FRQ %, Field Mask, Sample/Example ([X] indicates a new set of example records)
    00000908  45.309%  bbbbbbbbbbbbbbbb  [X] | (blank)
    00000020   2.009%  bbbbnnnnbbbbbbbb  [X] | 1904
    00001096  54.691%  nnnnnnnnbbbbbbbb  [X] | 06011944

  • Exercise 4-2: Character Concatenate
    Add an Investigate job, identify the type of investigation, select the input file, choose field(s) and mask options, stage and run the job, review the report results

  • Word Investigate
    Usage: pattern free-form fields and lexical analysis - view the pattern of the data within a free-form text field and parse it into individual tokens
    QualityStage applies rule sets to free-form fields, discovers parsing requirements, and patterns the data
    Generates reports for word frequency, pattern frequency distributions, and word classification

  • Word Investigation Results
    Pattern Reports: ^D?T 639 N MILLS AVE / ^D?S 306 W MAIN ST / ^D?T 3142 W CENTRAL AVE / ^?T 843 HEARD AVE
    Word Frequency Reports: 0000000869 ST / 0000000791 RD / 0000000622 STE / 0000000566 AVE
    Word Classification Reports: ABBOTT ABBOTT ? ;0000000001 / ABERCON ABERCON ? ;0000000001 / ABERCORN ABERCORN ? ;0000000007 / ABERDEEN ABERDEEN ? ;0000000001

  • Rule Sets
    Rules for parsing, classifying, and organizing data
    Rule set domains: country processing; pre-processing; domain processing (Name - business and personal, Street Address, Area - locality, city, state and zip/postal codes); multinational address processing

  • Parsing
    Parse free-form data with the SEPLIST and STRIPLIST:
    SEPLIST - any character in the SEPLIST separates tokens and becomes a token itself
    STRIPLIST - any character in the STRIPLIST is ignored in the resulting pattern
    The SEPLIST is always applied first

  • Parsing Example: 120 Main St. N.W.
    SEPLIST = " ." and STRIPLIST = " ": 8 tokens (120, Main, St, ., N, ., W, .)
    SEPLIST = " " and STRIPLIST = " .": 4 tokens (120, Main, St, NW)
    SEPLIST = " ." and STRIPLIST = " .": 5 tokens (120, Main, St, N, W)
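    The three parses above can be reproduced with a small sketch of the SEPLIST/STRIPLIST behaviour (an approximation, not QualityStage's parser):

```python
def parse(text, seplist, striplist):
    """SEPLIST characters split the text into tokens (and are kept as tokens
    themselves); STRIPLIST characters are then removed from the result."""
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)          # a separator is itself a token
        else:
            current += ch
    if current:
        tokens.append(current)
    # STRIPLIST is applied after separation: strip listed chars, drop empty tokens
    cleaned = ["".join(c for c in tok if c not in striplist) for tok in tokens]
    return [tok for tok in cleaned if tok]

line = "120 Main St. N.W."
print(parse(line, seplist=" .", striplist=" "))   # ['120','Main','St','.','N','.','W','.']  8 tokens
print(parse(line, seplist=" ",  striplist=" .")) # ['120','Main','St','NW']                 4 tokens
print(parse(line, seplist=" .", striplist=" .")) # ['120','Main','St','N','W']              5 tokens
```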

  • Data Typing: Classifying Tokens
    Identify and type each token in terms of its business meaning and value
    Mask key: N = numeric token, A = alpha token, M = mixed token
    Pattern key: ^ = numeric token, ? = unclassified alpha token, @ = mixed token, T = street type, U = unit type, > = leading numeric

  • Example: Word Investigate
    Input: 21 WINGATE STREET APARTMENT 601 - parse, classify known words and assign default tags, giving pattern ^ ? T U ^
    Sample pattern report lines: ^D?T 639 N MILLS AVE / ^D?T 306 W MAIN ST / ^D?T 3142 W CENTRAL AVE / ^?T 843 HEARD AVE
    Sample word frequency lines: 0000000869 ST / 0000000791 RD / 0000000622 STE / 0000000566 AVE
    Sample classification lines: ABBOTT ABBOTT ? ;0000000001 / ABERCON ABERCON ? ;0000000001 / ABERCORN ABERCORN ? ;0000000007 / ABERDEEN ABERDEEN ? ;0000000001
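    A minimal sketch of the classify-and-tag step shown above. The tiny classification table is invented for the example; real rule sets ship much larger .CLS tables:

```python
# Toy classification table: word -> (standard value, class). Invented for illustration.
CLASSIFICATION = {
    "STREET": ("ST", "T"), "ST": ("ST", "T"), "AVE": ("AVE", "T"),
    "APARTMENT": ("APT", "U"), "APT": ("APT", "U"),
}

def classify_tokens(tokens):
    """Return (standardized token, class) pairs and the resulting pattern string."""
    tagged = []
    for tok in tokens:
        if tok.upper() in CLASSIFICATION:
            std, cls = CLASSIFICATION[tok.upper()]
        elif tok.isdigit():
            std, cls = tok, "^"            # numeric token
        elif tok.isalpha():
            std, cls = tok, "?"            # unclassified alpha
        else:
            std, cls = tok, "@"            # mixed token
        tagged.append((std, cls))
    return tagged, "".join(cls for _, cls in tagged)

tagged, pattern = classify_tokens("21 WINGATE STREET APARTMENT 601".split())
print(pattern)   # ^?TU^
print(tagged)    # [('21','^'), ('WINGATE','?'), ('ST','T'), ('APT','U'), ('601','^')]
```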

  • Lab 4-3: Word Investigation, Address and Area
    Add an Investigate job, identify the type of investigation, select the input file, choose the rule set and field(s), choose Advanced Options, stage and run the job, review the report results

  • Data Quality Assessment
    Review and analyze each field: How often is the field populated? What are the anomalies and out-of-range values, and how often does each one occur? How many unique values were found? What is the distribution of the data or patterns?
    Use Investigate results to update business requirements and define the development plan and application design

  • Quiz
    What is domain integrity?
    What is the difference between a Type C and a Type T field mask?
    When might you use a Type X field mask?
    Where can you find the Investigate reports?

  • Module Summary
    DRE Methodology: Data Quality Assessment
    Character Discrete, Character Concatenate and Word investigation
    Field masks: Character (C), Type (T), Ignore (X)
    Parsing: SEPLIST, STRIPLIST
    Data classification
    Patterns

  • Data Preparation

  • Data Preparation
    Format of the data file
    Unique record identifier
    Common record layout

  • High-Level DFD
    Investigate: Assess Data Quality
    Standardize Country
    Add Unique Key
    Append Data to a common format
    Reject non-US Data / Select US Data for further processing
    Pre-Process US Data
    Condition Name, Address and Area
    Investigate Conditioned Results
    Apply User Overrides
    Identify Duplicate Customer Records
    Survive the Best Customer Record

  • Data File Format
    The preferred data file format for QualityStage is: fixed record length, fixed fielded data, a sequential file with terminated records, alphanumeric data
    QualityStage provides the following features for working with other file formats: ODBC enabled for pulling/pushing data from/to a table; unterminated and variable length; fixed-length unterminated
    The Transfer (GTF) stage is used to read in the various formats and output a fixed-record-length terminated file
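    Because the data is fixed record length, fields live at known byte positions; here is a small sketch of reading such a record (the layout below is hypothetical):

```python
# Hypothetical fixed-length layout: name in bytes 1-20, city in 21-35, zip in 36-40.
LAYOUT = {"name": (0, 20), "city": (20, 35), "zip": (35, 40)}

def read_fixed(line: str) -> dict:
    """Slice one fixed-length record into named fields by byte position."""
    return {field: line[start:end].rstrip() for field, (start, end) in LAYOUT.items()}

record = "JIM HARRIS".ljust(20) + "MALDEN".ljust(15) + "02148"
print(read_fixed(record))   # {'name': 'JIM HARRIS', 'city': 'MALDEN', 'zip': '02148'}
```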

  • Unique Record Key
    Every record should start the QualityStage process with a unique record key
    This key can be created in QualityStage or by other tools such as DataStage
    The QualityStage Investigate stage will help validate whether a unique key exists
    The unique key provides developers with a way to audit each record as it passes through the QualityStage application
    The Transfer stage can be used to create a new key field and populate it with a unique value

  • Common Data Format
    Fields identified for processing should be moved forward from each source and appended into a single new source file
    This allows all data to be processed efficiently in one stream using one set of rules
    In QualityStage, appending data files is accomplished with the Transfer (GTF) stage

  • Transfer Stage (GTF)
    Transforms data file formats to fixed-length flat files
    Adds new fields: assign literal values such as a source indicator, or generate and assign a sequential value
    Reformats record layouts: dropping fields
    Formats field data: case formatting, right/left justification, right/left fill, concatenating fields
    Appends data files

  • Add a Record Key
    Input Data File → Transfer Stage (GTF) → Output Data File

  • Record Key Best Practices
    Add a unique record identifier in the QualityStage process or prior to entering QualityStage processing
    Create a 12-byte field: the first 2 bytes indicate the source, positions 3 through 11 store a sequential number, and position 12 is intentionally left blank, providing a space between the record key and the data
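    Following the 12-byte convention above, a sketch of building and prepending the key (the 'AH' source code and the sample rows are hypothetical):

```python
def record_key(source_id: str, seq: int) -> str:
    """12-byte key: 2-byte source indicator, 9-digit sequence, trailing blank."""
    return f"{source_id[:2]:<2}{seq:09d} "

# 'AH' stands in for the AUTOHOME source; rows are illustrative fixed-format data
rows = ["PETER J LALONDE 40 BEACON ST MELROSE MA 02176",
        "JIM HARRIS 92 DEVIR STREET MALDEN MA 02148"]
for i, row in enumerate(rows, start=1):
    print(record_key("AH", i) + row)   # AH000000001 PETER J LALONDE ...
```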

  • Append Data Files
    The Transfer stage reads one input file and produces one output file

    To append data, you will need to define a Transfer stage for each file you want to append

    Be careful of the order: the first Transfer stage generally does not append (it creates the output file); only subsequent Transfer stages referencing the same output file should append data

    AUTOHOME → Transfer Stage 1 → COMBINED
    LIFE → Transfer Stage 2 (append option selected) → COMBINED

  • Exercise 5-1: Add a Record Key and Append Data Files
    Read in each source of data and define a new output file with a common format
    Create Transfer Stage 1, create a new Record Key field and populate it, and add AUTOHOME data to the new COMBINED output file
    Input: AUTOHOME; Output: COMBINED; Stage name: AHKEY; Stage type: Transfer; Job name: Append

  • Lab 5-1: Append LIFE to COMBINED Output
    Create a Transfer stage, define a new Record Key field and populate it, and append LIFE to AUTOHOME in the COMBINED output file
    Input: LIFE; Output: COMBINED; Stage name: LFKEY; Stage type: Transfer; Job name: Append

  • Module Summary
    QualityStage requires files to be fixed record length with terminated records
    The Transfer stage can be used to: convert file formats to fixed record length; add new fields and populate them with literal values or sequential numbers; append data files; format fields; reformat record layouts

  • Standardize

  • Module Objectives
    Describe the Standardize stage in the Data Re-engineering Methodology
    Identify rule sets
    Apply the Standardize stage
    Interpret Standardize results
    Investigate unhandled data and patterns

  • Project Lifecycle: Development
    Construct Application → Review & Refine (unit test)
    Standardize Data
    Find Duplicate Candidates (Match)
    Survive Best of Breed (Survive)

  • Data Re-engineering Methods

  • High-Level DFD
    Investigate: Assess Data Quality
    Standardize Country
    Add Unique Key
    Append Data to a common format
    Reject non-US Data / Select US Data for further processing
    Pre-Process US Data
    Condition Name, Address and Area
    Investigate Conditioned Results
    Apply User Overrides
    Identify Duplicate Customer Records
    Survive the Best Customer Record

  • Standardize
    Transformation: parsing free-form fields, comparison threshold for classifying like words, bucketing data tokens
    Standardization: applying standard values and standard formats
    Phonetic coding for use in matching: NYSIIS, Soundex

  • Standardize Example

  • Standardize Process
    Input: 21 WINGATE STREET APARTMENT 601
    Parse, classify and assign default tags, then write the output file
    Key: ^ = single numeric, ? = one or more unknown alphas, T = street type, U = unit type

  • Standardize Stage
    Uses rule sets for:
    Country processing
    Pre-domain processing: USPREP
    Domain processing: USADDR, USAREA, USNAME
    Multinational address processing: WAVES

  • Types of Rule Sets
    Country Identifier: COUNTRY
    Domain Pre-processor: USPREP
    Domain Specific: USNAME, USADDR, USAREA

  • Example: Country Identifier
    Input Record → Output Record
    100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111 → USY 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
    SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0 → CAY SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
    28 GROSVENOR STREET LONDON W1X 9FE → GBY 28 GROSVENOR STREET LONDON W1X 9FE
    123 MAIN STREET → USN 123 MAIN STREET

  • Example: Domain Pre-Processor
    Input Record:
    Field 1: JIM HARRIS (781) 322-2426
    Field 2: 92 DEVIR STREET MALDEN MA 02148
    Output Record:
    Name Domain: JIM HARRIS
    Address Domain: 92 DEVIR STREET
    Area Domain: MALDEN MA 02148
    Other Domain: (781) 322-2426

  • Example: Domain-Specific
    Input Record: 100 SUMMER STREET 15TH FLOOR
    Output Record:
    House Number: 100
    Street Name: SUMMER
    Street Suffix Type: ST
    Floor Type: FL
    Floor Value: 15
    Address Type: S
    NYSIIS of Street Name: SANAR
    Reverse Soundex of Street Name: R520
    Input Pattern: ^+T>U

  • Rule Sets
    Rule sets contain logic for parsing, classifying, and processing data by pattern, and for bucketing data
    Three required files: Classification Table, Dictionary File, Pattern Action File
    Optional lookup tables

  • Rule Set Files
    Classification Table (.CLS): contains standard abbreviations that identify and classify key words
    Pattern Action File (.PAT): contains a series of patterns and programming commands to condition the data
    Dictionary File (.DCT): defines the output file fields that store the parsed and conditioned data
    Rule Set Description (.PRC): description file for the rule set
    Lookup Tables: optional conversion and lookup tables for converting and returning standardized values
    Override Tables: tables for storing overrides entered through the Designer GUI

  • Classification Table
    Contains the words for classification, standardized versions of the words, and data classes
    A data class (data tag) is assigned to each data token
    Default classes are the same across all rule sets
    User-defined classes are assigned in the classification table; users may modify, add or delete these classes
    User-defined classes are a single letter

  • Default Classes

    Class | Description
    ^ | A single numeric
    + | A single unclassified alpha (word)
    ? | One or more consecutive unclassified alphas
    @ | Complex mixed token, e.g., O'Connell
    > | Leading numeric, e.g., 6A

  • Survive Rules
    Rules are expressed as Target / Condition pairs; a condition tests field contents, for example:
    ... = 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;

  • Exercise 9-1: Survive the Best Customer Record
    Define the output file, define the Survive stage, choose target fields, define Survive rules, deploy and run, review the results

  • Module Summary
    Consolidate or survive the best record by choosing the best record or best field from multiple records
    Use pre-defined techniques or build your own
    May use multiple rules

  • Instructor Notes

    Character Discrete Investigate - four common options: it investigates multiple single-domain fields independently; Type C views the character values; Type T views the field format or template; Type X ignores characters.
    These reports provide a quantitative understanding of data value prevalence that permits correlation of the various spellings, misspellings, abbreviations or other representations of data values. Also note any anomalies (anything suspect: out-of-range or default values) and how often each anomaly occurs. Percent populated per field: note how often the field is populated. How many format templates exist for the data? The cardinality of the field: the number of distinct values. The frequency distribution: how often does each format occur? How often does data in the wrong domain occur?

    Character Discrete investigates one or more single-domain fields; each field is treated independently for frequency count and pattern reporting. Report names: jobp.FRQ (sorted by frequency, descending), jobp.SRT (sorted alphabetically, ascending), job.PAT (reference file).



    Word Investigation parses free-form data into individual tokens; the tokens are classified to create patterns, using a set of rules for parsing and classification. Discover tokens (key words) to be added to the classification table, such as name prefixes, business terminology, street types, and new abbreviations for cities. Create patterns of data tokens within the field context. Identify spellings, misspellings and representations of data. Identify parsing requirements for the conditioning process.
    Pattern Reports (distinct patterns within the field): list of all patterns sorted by frequency (p.frq), list of all patterns sorted alphabetically (p.srt), list of each token and its associated pattern (.pat).
    Word Frequency Reports (the frequency distribution of distinct values): list of all alphas sorted by frequency (c.frq), list of all alphas sorted alphabetically (a.frq); may include numerics and mixed tokens.
    Word Classification Reports (the frequency distribution of classified and unclassified words): list of classified alphas (u.dlt), list of unclassified alphas (n.dlt); all alphas listed in the classification table are considered classified.

    There is a default SEPLIST and STRIPLIST on the Word Investigation Advanced Options screen. Investigation is about discovery: feel free to change the SEPLIST and STRIPLIST to experiment and identify the best parsing parameters. The SEPLIST and STRIPLIST only allow simple parsing (splitting on the presence of a single character); more complex parsing can be done in the next phase, conditioning. If you really aren't sure how to parse the data, be very conservative: separate by a space and strip out only a space, then add more characters after analyzing the results.
    The rule is: whenever in doubt, don't strip out. If a character sometimes adds context and sometimes does not, do not strip it out, because stripping loses that context in all cases; often we choose to separate by the character but not strip it. Examples: if we strip out the /, we won't know whether the value started as 1/2, 12, or two independent digits 1 and 2; for C/O, stripping the / may hide that this was an abbreviation of "care of" and cause the token to be interpreted as "company".
    We talked about tokens in the early parts of investigation; this slide makes it clearer how tokens are created. Use the example of APT.: that period can be stripped, but the period in $10.00 cannot be stripped without changing the meaning of the data.
    Start with an example of simple classification (is the data alpha, numeric or mixed?), then introduce more sophisticated classification via the classification table. That first example types each token without information about its business meaning; business meaning is added on the next slide.
    Illustrates the process of parsing and classifying data tokens to create patterns. Example: parse 120 Apple Road Apart 4B into tokens; a simple classification by type gives N A A A M; classification against the table gives the pattern ^?TM> with data types House Number, Street Name, Street Type, Unit Type, Unit Value; the record is rebuilt as 120 Apple RD APT 4B.

    Start with the bottom line and build. Refer to the appendix for the GEOCODE rules for this information.

    Notes: decompose free-form fields into single component fields; assign data to its appropriate metadata fields.

    Notice: "Unit 20" and "# 20" are both bucketed as unit type and value; "C/O Joseph C Reiff" is moved to another field (not visible on screen); "12 Western Ave" is recognized as an address.

    Notice the standardization of ST and APT. First we parse the data; classify known words (classification table); apply general default tags to unclassified tokens; create new output fields (dictionary file); process the patterns (pattern-action file) to move data into the correct fields and apply standard values and formats.

    Identify new fields based on the underlying data. Examples: set a gender flag; a Name Type Flag distinguishes an individual name from an organization name (I=Individual, O=Organization); an Address Type Flag distinguishes S=street address, B=box address, R=rural route address, O=other type of address. Transformation rules are created both for matching and for creating the load file (sometimes these rules differ).

    Three levels of rule sets: Country Identifier (identify the country and append the ISO country code), Domain Pre-processing, and Domain Specific.

    Note: US is assumed here, using the ZQUSZQ country code delimiter. Position 1-3: ISO Country Code; Position 4-5: Indicator Flag (Y or N, where Y = country code verified).
    The format of the country code delimiter is ZQ<country code>ZQ; for example, the country code delimiter for the United States is ZQUSZQ. A full listing of ISO country codes is available in Appendix A of the Rules User Guide.
    The Country Identifier rule set allows easy processing of multinational data and assigns an ISO country code to each record. It requires a country code delimiter, which indicates what country the user expects to find in the majority of the input file; when the rule set cannot determine the country code, the default value is taken from the country code delimiter.
    NOTE: input data can then be separated by country code for country-specific processing.

    Categorize input data into one of the following domain-specific column sets:
    Name - individual and organization names, attention instructions, and secondary names
    Address - low-level geography including street, rural, box, unit, and building addresses
    Area - high-level geography including city name, postal code, and country code
    Other - non-name and non-geography data
    Up to six metadata-delimited fields may be passed to a domain pre-processor. The metadata delimiters indicate what kind of information you expect to find within the fields of your input data; if the pre-processor cannot determine the domain of a token, it is defaulted based on its metadata delimiter. The format of the metadata delimiter is ZQ<domain>ZQ. There are four accepted metadata delimiters: ZQNAMEZQ (Name), ZQADDRZQ (Address), ZQAREAZQ (Area), ZQOTHRZQ (Other).

    Evaluate domain-specific input. Generate business intelligence fields: create all subordinate domain elements needed for data storage and presentation, apply consistent representations to data, and incorporate applicable standards such as postal standards for addresses. Generate matching fields: blocking keys and primary match keys.

    The first three files are required for each rule set; if the rule set name is USNAME, they are named USNAME.CLS, USNAME.PAT and USNAME.DCT. Rule sets can be copied, modified and deleted.
    Dictionary File: defines the two-character field abbreviations used in the output file for a particular rule set. Each field is referenced in the .PAT file to determine where individual tokens will be bucketed.
    First Name lookup table: applies an enhanced first name, e.g. Barbara for Barb or Barbie, Kenneth for Ken, Kathleen for Kathy.
    Example: in the classification table, ST. is classified as a street type (T) with a standard value of ST.


    The pattern-action file parses the data; the classification table classifies known words; the dictionary file defines the new output fields; the pattern-action file then processes the patterns to move data into the correct fields and applies standard values and formats.

    Note: the output file for a Standardize stage must be defined, but no field definitions are required (the field definitions come from the dictionary file).


    See the USNAME.DCT file for all parsed name fields. A Name Type Flag is applied to help identify individual names vs. organization names, which may require different match strategies. Data is parsed into individual name fields.

    Address Type Flag: S=street address (street name is populated), B=box address (no street name but a box type is populated), R=rural route (no street name or box type, but the RR type is populated). Different types of addresses may require different match strategies. Data is parsed into individual street address fields.


    Example checks after standardization: ensure that the House Number field (HNUSADD) contains only numeric data; is the House Number field always blank?; direction fields contain N, S, E, W, NW, NE, SW or SE. You may complete some quick visual inspections, but the Investigation reports allow you to quantify the changes and improvements.

    Rule set overrides let users specify their own Standardize rules. User overrides are GUI-driven; the user does not need to know pattern-action language syntax or edit the classification table or pattern-action file. Overrides require the following information: the dictionary field name to move the token to, the original value or standard value of the token, and whether a leading space separates multiple tokens moved to the same dictionary field.

    Example: Carolynne is not recognized as a first name. Since the name Carolynne occurs 5 times in the data (review the Word Investigation word frequency report on names), we might want to add it to the classification table so that it is recognized.

    Use the Word Investigation word frequency reports to check how frequently a word, abbreviation or misspelling occurs. Classification overrides take precedence over the classification table, and are available in both domain pre-processor rule sets and domain-specific rule sets.

    The word (alpha) Carolynne is not recognized as a first name. Review the Name Word Investigation frequency report and note that Carolynne appears 5 times in the data; this frequency influences the decision to add Carolynne as a classification override so that it is recognized as a first name.

    The input pattern before the override is +,+ (unknown alpha, comma, unknown alpha). After the override the pattern is +,F (unknown word, comma, first name); the second pattern is recognized and processed by the pattern-action file.
    Text overrides: no partial string matching, only complete string matching.
    The example REIFF FUNERAL is a special case, as it needs to be handled differently from the rest of the data with the unhandled pattern + +. The remaining unhandled patterns of + + may be handled the same way. The best type of override to ...
    Pattern overrides: again no partial matching, only complete pattern matching. Pattern overrides are the most general; whenever possible use a pattern override, since one override improves the data quality on many records, whereas a text override is very specific to a single string of text.
    Why an unhandled pattern override and not an input pattern override? A record may have a different input pattern from other records in its category while sharing the same unhandled pattern. All records with the same unhandled pattern are to be processed the same way, so it is more efficient to apply one unhandled pattern override than multiple input pattern overrides. Example: SANCHEZ-CIFUENTES , RYLMA has input pattern +-+,+ but unhandled pattern +,+.
    Text overrides take precedence over pattern overrides because they are more specific. Input overrides take precedence over all other patterns in the pattern-action file.

    Order of what to look for: words to classify, input pattern overrides, unhandled pattern overrides, input text overrides, unhandled text overrides. Text overrides take precedence over pattern overrides because they are more specific; input overrides take precedence over all other patterns in the pattern-action file.

    Before we discuss the technology of matching, think about the human process you would apply in deciding about these record pairs. Do you feel comfortable about the first pair? Is there really enough information to suggest that these records should be linked? The two locations could be anywhere in the world, and the first-name initial doesn't offer much supporting information either. What about the second pair? Now we know that we're dealing with the same geographical area and the same birthdate. Has this additional information given you greater confidence? Do you find yourself assigning more or less importance to some of the fields or values? For instance, does the abbreviation of PLACE (PL) carry a little more weight in your mind than the abbreviation of STREET (ST), even though both are just two characters? Do the 3-digit building number and the matching PLACE words give you sufficient confidence that the two versions of MAIN are likely the same, even though there is a one-digit conflict in the Zip code? Is the date of birth sufficient to say these are the same person, or is there still some risk of twins? By the third pair your confidence should be very high: we now have a phone number to further support the location data, and enough first-name information to eliminate the risk of twins. These are the issues automated matching must consider as well. Being accurate, consistent and justifiable is essential; being able to navigate the gray areas of missing and conflicting values is what separates simplistic methods from industrial-strength ones.
    So what does information content mean, and why should you care? It describes the scientific process of measuring the amount of emphasis, meaning, significance, usefulness or decidability that a piece of data contributes to a process - in this case the process of determining a match. It is a rigorous, mathematically defined concept based on information theory, and QualityStage is the premier commercial implementation of that theory. QualityStage investigates your actual data, as a step in the matching process, and dynamically adjusts field- and value-level scoring based on the characteristics of the data. You care because it automates, with far greater precision, the human intuitions that cause you to give more or less emphasis to certain values even within the same field; because it results in greater accuracy; and because it gives your matching process a legitimacy and justification not possible through other techniques, which is often essential to enterprise and mission-critical projects whose success is measured by the confidence and trustworthiness of the resulting information.
    The scores (composite weights) are relative to all the other scores; plot them to see the distribution. This is the distribution of weights for matched and unmatched records: the more variables added to the match, the further apart the two humps will be, and it is the point where the two groups intersect that can cause problems. In our previous example the score of 31.64 is, based on the distribution, a fairly high score indicating high confidence in the match. Including more fields is better as long as each field supports your matching goals; consider how you would use or omit fields depending on what your match goals are.

    We measure the contribution with a weight: the more a field contributes, the higher its weight; the less it contributes, the lower the weight. Weight can also be defined as the discriminating power of a field. It is important to combine the theory with business knowledge to obtain the desired goals: statistics programs have become so easy to use that anyone can build a regression and run it to achieve a result, but it takes an expert, a knowledgeable business person, to understand the results and the relationship of the input to the result. Now let's look at that match again, this time applying weights.

    Reliability: how correct the data values are - how often they are filled in (non-missing) and, when filled in, how often they are correct.
    Chance of random agreement: measures the rareness or uniqueness of a value. The more frequently a value occurs in the data, the less weight and confidence it contributes to the match. Example: if another Barbara walked into the room, would you think they must be the same person? Maybe, but you would not be convinced. If there were a Vladimir in the room and another Vladimir entered, instinctively you would have more confidence that the two Vladimirs are a match than you would for the Barbaras.
    The disagreement weight is proportional to the reliability score. The m-prob value is entered by the user; it does not need to be an exact measurement, since QualityStage will take the user-entered m-prob and improve the measurement based on a sample of the data. The more reliable the data in a field, the more records are penalized for not agreeing, since errors are relatively rare.

    Estimating the m-prob: if you really don't know, assume a 10% error rate (m-prob = 90%, or .9). The m-prob lies between .001 and .999; it can never be 1 (there is always a chance of error) and it can never be 0.

    Data is reliable = errors are rare; data is not reliable = errors are common.

    Example of m-prob: if the variable street type has a 12% error rate, then the m-probability is 0.88.

    Rare values have less chance of accidental agreement and contribute more to a match. Frequency analysis determines the probability of chance agreement for any value (INTEGRITY calculates this). Example: if two records in a matched pair both have the name John Smith, you would be less sure that the pair represents a true match than if Vladimir Horowitz were matched on both records. Rare events have more discriminating power than common events. Frequencies should not be calculated for fields such as individual identification numbers, since all values are rare.
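    Putting the two probabilities together: in standard probabilistic record linkage the agreement weight for a field is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)). Below is a small sketch using the figures from these notes (m = 0.88, u = 0.01); it illustrates the theory, not QualityStage's exact computation.

```python
from math import log2

def field_weights(m_prob, u_prob):
    """Agreement weight log2(m/u); disagreement weight log2((1-m)/(1-u))."""
    return log2(m_prob / u_prob), log2((1 - m_prob) / (1 - u_prob))

# m = 0.88 (street type with a 12% error rate), u = 0.01 (value seen 100 times in 10,000)
agree, disagree = field_weights(0.88, 0.01)
print(round(agree, 2), round(disagree, 2))   # about 6.46 on agreement, -3.04 on disagreement
```

    Rarer values (smaller u) push the agreement weight up, and more reliable fields (larger m) make the disagreement penalty harsher, which matches the intuition described in these notes.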

    Note: use a sample size of 10,000 with a value frequency of 100 for a u-prob of .01 (1 in a hundred).
    Within each group of blocked records, each record is compared to every other record according to the matching variables. Example: say you washed all your socks, dried them, and dumped them on the kitchen table; you must match up the socks before you can go out and play. Would you pick up one sock at a time and compare it to all the other socks? To be efficient you would sort (block) the socks by something like color, then compare only white socks to white socks rather than wasting time comparing a white sock to a blue sock.
    We could compare every record to every other record, or we could block them by Last Name. Notice that the Jerosa record did not make the same block group as the Gerosa records because it did not match exactly on Last Name. This is one reason we create phonetic codes for some fields in the conditioning phase; phonetic fields are very useful for blocking because they introduce fuzziness into an otherwise rigid set of criteria.
    Blocks with one record are considered residuals: there are no other records in the group to compare against. Due to a (potential) error in the last name, Jerosa did not make the same block group as the Gerosa records.

    If the accuracy is loose, the scope is very large; if the accuracy is too tight, the scope is too small. Example: if you are matching bank records to customers, it is never OK to match the wrong record to a customer - the tolerance for error is very low, so accuracy must be high, which narrows the scope of records. For a marketing campaign, in order to reach a reasonable number of customers you might be willing to tolerate more error (less accuracy) to get a sufficiently wide scope, or marketing list.

    The goal of blocking is to group together like records that have a high probability of producing matches. The Character Discrete Investigate reports help with these decisions: they tell you how often a field is populated. If you choose fields with reliable data, you are truly grouping like records, since the data values are usually correct. Choose fields that make business sense for your objective; if you are trying to identify unique customers, blocking by house number isn't the best choice. Gender usually doesn't have enough values to break records into groups of 100-200 (a guideline), and if all the data comes from a few states, state may not be the best field. Again, the Investigate reports tell us how often a field is populated and the distribution of the data.

    Example: blocks of 100 records (100*100 = 10,000 comparisons) are much faster to process than blocks of 200 (200*200 = 40,000 comparisons).
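    A sketch of why blocking cuts the comparison count; the blocking key here is a simple last-name prefix standing in for a phonetic code such as NYSIIS, and the records are invented:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; a real blocking key would usually be a phonetic code.
records = [("GEROSA", "MALDEN"), ("GEROSA", "BOSTON"), ("JEROSA", "MALDEN"),
           ("HARRIS", "MALDEN"), ("HARRIS", "SALEM")]

def block_key(rec):
    """Stand-in blocking key: first three letters of the last name."""
    return rec[0][:3]

blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec)

pairs = [pair for group in blocks.values() for pair in combinations(group, 2)]
print(len(list(combinations(records, 2))), "pairs without blocking")  # 10
print(len(pairs), "pairs within blocks")                              # 2
# JEROSA lands in a different block than GEROSA - exactly the problem that
# phonetic blocking keys and multiple match passes are meant to address.
```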

    Examples: one-to-one - customer records from the billing system should have a one-to-one correspondence with customers from the marketing database. One-to-many - many Visa transactions will match to the same credit card number; many addresses match to one postal code.

    There are over 24 ways to compare data values. CHAR = exact comparison = full agreement weight. UNCERT = fuzzy comparison = agreement weight prorated based on how close the values are to an exact match; if they are not close enough, the disagreement weight is assigned. These two are the most popular ways to compare data in fields.


    Review the Match extract for match results; review the Match report with clients and business analysts; check the Match debug file for block overflow and review the histogram.

    Look at Appendix C for the file layout of FI030000; copy FI020000 and add all the fields. Point out that the extract creates a custom extract (or report) of the information that will be used, going forward, in survivorship.
    MOVE @SET, MOVE @TYPE, MOVE @PASS, MOVE @WGT, MOVEALL OF A

    Master: the reference record; all other records in the group are compared to it.

    Duplicate: A Match to the Master record.

    Residual: A single unmatched record

    Let's go back to the sock example: what if I found one pink sock? It might have started as a white sock but was washed with a red shirt, hence the pink sock. Due to an error in its color it did not make the block with the white socks. Multiple passes help overcome the problem of records not making the correct block group.

    Note: You may create up to seven match passes. Usually 2-3 are sufficient.
