Warehouse 92 Data Quality Whitepaper

Embed Size (px)

Citation preview

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    1/25

    Oracle9 i Warehouse Builder,Integrated Data Quality

    A n Oracle W hite Paper July 2003

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    2/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 2

    Oracle9 i Warehouse Builder,Integrated Data Quality

    Introduction........................................................................................................3 What Is Data Quality?.......................................................................................4 Data Quality is Critical......................................................................................5 Integrating Data Quality Into ETL.................................................................5 Data Quality in Oracle9 i Warehouse Builder ................................................6

    Name and Address Cleansing .....................................................................6 Sophisticated matching and merging..........................................................7 Overview of the Name and Address Cleansing Process.........................8

    Architecture ...............................................................................................9 Improving Name and Address Cleansing...........................................10

    Overview of the Match/ Merge Process ..................................................11 Match Rules.............................................................................................13 XREF Group ..........................................................................................21 Merge Rules.............................................................................................22

    Conclusion ........................................................................................................23

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    3/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 3

    Oracle9 i Warehouse Builder,Integrated Data Quality

    INTRODUCTION

    Today, more companies realize the need for Business Intelligence and establishingcustomer relationships to gain competitive advantage. Regardless of industry, thereis a universal foundation for all customer-focused initiatives customer data.Using customer data, businesses can gain a clear picture of customers, analyze theirbuying patterns, and predict future sales. For an accurate view of customers,companies are turning to solutions such as data warehouses or data marts to helpidentify and manage their customer relationships.

    Data quality is one of the biggest obstacles blocking the success of many datawarehousing projects. The Data Warehousing Institute estimates that data qualityproblems cost U.S. businesses more than $600 billion a year.

    Customer data usually reside in multiple disparate sources and needs to beconsolidated. Data must be filtered for unreasonable values and transformed tomatch other related data. During this process, issues of data quality become evenmore pronounced. Data of unknown quality is fundamentally untrustworthy andrepresents a significant risk to the successful completion of any data integration ormigration effort. Therefore, a data warehouse or data mart initiative cannot delivera tangible Return O n Investment (ROI) unless the data within it is reliable andaccurate.

    This paper will answer the questions: What is data quality? Why put an effort intodata quality? Why is this effort most efficient inside ETL (Extraction,Transformation and Loading) process? How will Oracle9 i Warehouse Buildermake this effort successful?

    You will discover how Oracle9 i Warehouse Builder combines ETL functionalitywith data cleansing capabilities such as parsing, correction, standardization andMatch/ Merge of data elements. With Oracle9 i Warehouse Builder, datawarehousing managers are ensured of the loading accurate data into a datawarehouse as well as effective management and maintenance of the customer datafoundation.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    4/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 4

    WHAT IS DATA QUALITY?

    Data quality is an all-encompassing term describing both the state of data that iscomplete, accurate and relevant as well as the set of processes to achieve such astate. The goal is to have data free of duplicates, misspellings, omissions andunnecessary variations and to have the data conform to the defined structure.Simply put, the data quality addresses the problem cynically but precisely summed

    in garbage in garbage out.The data quality processes are not utilized just during the population of datawarehouse, they are also employed before the warehouse is even designed andafter the warehouse is loaded. A significant part of data quality deals with customerdata names and addresses, due to both their key role in business processes andtheir highly dynamic nature. But data quality is certainly not limited to names andaddresses. Any data will benefit from being standardized and complete.

    Names and addresses usually contain dirty data, since they often includenicknames, abbreviations, misspellings, bad fielding etc. Often companies relyupon name and address information to match and merge data. A common unique

    identifier is typically not available across systems and because of the anomalies indata sources, it is crucial to cleanse and normalize it. Consequently, whatever datais available must be used to determine if different individuals, businesses or othertypes of records are actually the same. Names and addresses are ubiquitoustheytend to exist in almost every source and are often the only identifying data.

    Therefore, most matching applications rely heavily on names and addresses.Sophisticated name and address processing can solve these problems and have ahuge effect on matching.

    The following figure shows that in order to get a correct view of customer data inthe data warehouse, the source data must be cleansed, normalized, matched and

    merged.

    Figure1 - Complete and accurate view of customers

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    5/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 5

    DATA QUALITY IS CRITICAL

    There are many challenges in building the data warehouse and consequently manyareas of importance for the tools that assist with building the data warehouse. Is iteasy to build and maintain the warehouse (productivity and maintainability)? Canall required sources of data be brought into the warehouse (source integration)?Can data warehouse continue to add data within the allowed time periods and can

    it accommodate the growth (performance and scalability)? However, to the endusers of the data warehouse making the business decisions based on the data indata warehouse the most important question is Can I trust this data? It does notmatter if the warehouse was build in record short time, it practically maintainsitself, every imaginable source system is in, the loads are blazingly fast and are onlygetting faster. If the business users find the data in the data warehouse not to betrust-worthy, they will not use it to make their business decisions. All theimpressive achievements in those other areas will then be meaningless.

    INTEGRATING DATA QUALITY INTO ETL

    Having complete, accurate and relevant data free of all defects sounds great on

    paper but is never entirely achieved in reality. Instead, each organization defines,formally or informally, the acceptable level of data quality, a certain threshold thatcan be measured. Just as impressive as the figures that show the losses tobusinesses due to poor data quality, are the figures estimating how much is spenton trying to address the data quality issues. Therefore, it is not surprising that thetrue goal for businesses is not the absolute data quality in its academic sense butthe acceptable level of data quality achieved efficiently in terms of time, effort andmoney.

    Name and address cleansing is a specialized function traditionally handled by datacleansing vendors. The advent of data warehousing and especially the importanceof customer data have created a need for the name and address cleansingfunctionality to be incorporated into ETL tools. Today many ETL tools treat thename and address cleansing as a separate option that requires users knowledge of multiple tools.

    The key difference of data quality in Oracle9 i Warehouse Builder is the level of integration into the ETL process that makes data quality transformations aninseparable function of data warehouse development. Built-in, same interface fordesign, same interface for management. To understand just how important this is,consider the example of spell checker in a word processor. It would beunacceptable to anyone to close the document, start a spell checker application,process and correct the document, then re-open it in word processor. Even if the

    spell checker has a great user interface of its own, is fast and is otherwise pleasantto work with. Why not just invoke the spell checker from within a word processor?Better yet, why not correct the mistakes as the document is created? Of coursethere is no such problem in modern word processors. Spell-checkers have longbeen integrated and work seamlessly. Another interesting parallel to note is that its

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    6/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 6

    fairly unimportant to the user of a word processor that the actual spell checkingtechnology is provided by a third-party vendor.

    Oracle9 i Warehouse Builder fully integrates ETL, data warehousing, name/ addresscleansing and householding capabilities in a single tool. The following features arekey to rapidly building an effective, enterprise-wide customer data foundationsuitable for CRM and Business Intelligence:

    Complete Data Consolidation. Customer data is fragmented and can be found in manyinternal and external sources. This data needs to be consolidated to minimizedivergent data entry practices. Oracle9 i Warehouse Builder provides an extensibleframework and productive GUI for designing and deploying data warehouses anddata marts, facilitating:

    Design: Graphical support for dimensional and relational modeling;

    Extraction: Move data from relational, legacy, ERP and other datasources into the target data warehouse;

    Deployment: Generation and population of warehouse and marts

    (taking advantage of the Oracle database features) Management: Metadata-driven, integration with other Business

    Intelligence tools, Oracle Enterprise Manager and Oracle Workflow.

    A dvanced N ame and A ddress processing. Name and Address ensures accurate data forindividual names, firm names and locations. Cleansing includes address parsing,validation, correction, standardization and augmentation. Cleansed names andaddresses greatly improve the process of matching and merging customer data.

    Sophisticated matching and merging.Powerful fuzzy matching capabilities identifyunique customers, businesses, households or other entities for which commonidentifiers may not exist. Match rules can be created through an intuitive userinterface. Consolidation of match results is automated to create one view of thecustomer.

    L ifecycle Management . Complete data warehouse management by efficiently managingall aspects of the data warehouse lifecycle. Pre-built reports, impact analysis anddata lineage reports provide a graphical way to identify the impact of changes inyour source or target system. Lifecycle support and management ensuresinformation accuracy via synchronization of the data warehouse with the metadatasource in Oracle9 i Warehouse Builder.

    DATA QUALITY IN ORACLE9 I WAREHOUSE BUILDER

    Name and Address Cleansing

    In Oracle9 i Warehouse Builder, name and address cleansing is a built-intransformation. During cleansing, individual names or organization names and/ oraddresses are:

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    7/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 7

    Parsed into individual elements for improved correction and matching

    Standardized which involves modification of components to a standardversion acceptable to a postal service or suitable for record matching.Address components will be converted into a consistentrepresentation; for example, East can be converted to E., and Str orStreet can be converted to St. Nicknames and organizational names

    can be transiently standardized as well to improve matching.Standardization of components include:

    - Street and city misspellings/ improper abbreviations

    - Directional identifiers: N, S, W, NW

    - Street types: ST, AVE, RUA, PSO, and JLN

    - State/ province/ neighborhood

    - Business keywords (INC, Corp)

    - Standardized version of first and middle names as separate

    components Validated and Corrected by using a database of postal data for

    address components (such as street name, city and zip code). Onlyaddresses will be validated or corrected, names are standardized

    Augmented, by adding new data elements for mailing, better targetmarketing or other purposes. Examples are ZIP+4, probable gender,consumer versus business record identification, county codes, andgeocodes (including latitude and longitude)

    For example, for the original Source data of: Sue Smith| 500 OraclParkway| Redwood Shores CA

    The end result after parsing, standardization, correction/ validation andaugmentation would be include the following components:Sue| Susan| Smith| 500| Oracle| Pkwy| Redwood City| CA| 94065| 1675

    Sophisticated matching and merging

    In Oracle9 i Warehouse Builder, Match/ Merge is also a built-in transformation andintegrated as part of the ETL functionality.

    Powerful deterministic and probabilistic matching is provided to identify uniqueconsumers, businesses, households, or other entities for which common identifiersmay not exist. Exceptional matching effectiveness can be achieved because anycombination of attributes and relationships - name, address, email address, phonenumber, etc.- may be used to qualify a match. Examples of match criteria include:Similarity scoring to account for typos and other anomalies in the data, Soundex toidentify data that sounds alike, Abbreviation and Acronym matching, etc.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    8/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 8

    Match rules are specified and managed through intuitive wizards. Like any otheroperator in Oracle9 i Warehouse Builder, Match/ Merge can also be accessedthrough an extensive scripting language.

    Merging consolidates the matched data, including all transactions, service historyand other data is associated with a customer, to provide a complete unified view of the customer for the entire organization. Unlike mailing tools, which are designed

    to drop all but one record from a match set, match results are intelligentlyconsolidated into a unified view based on rules you specify. The user is able tospecify the rules for consolidating (merging) the match results to create the bestpossible final result. Oracle9 i Warehouse Builder supports a variety of merge rulestrategies, including custom rules, a variety of approaches for resolvinginconsistencies that often occur in matched data. Examples include: Rankingsources by order of preference, any non-null value, etc. The resolution of relationships in the merge process is automated to ensure the integrity of theintegrated result. The merge rule framework assembles the best possibleintegrated result. This provides significant timesavings over traditional hand-coding approaches and dramatically reduces maintenance complexity.

    Figure2 illustrates that effective Customer-Centric initiatives rely on Name andAddress cleansing as well as householding and de-duplication.

    Figure 2 - D ata quality for all customer-centric applications

    Oracle9 i Warehouse Builders unique data integration and consolidationarchitecture provides unmatched performance and scalability for creation andmaintenance of customer-focused databases.

    Overview of the Name and Address Cleansing ProcessOracle9 i Warehouse Builder features a Name and Address Mapping Operator thatworks with included Warehouse Builder Name & Address Server component tosupport parsing, standardization, postal matching, and geocoding of name andaddress data.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    9/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 9

    Architecture

    The Name & Address engine resides on the database server, as does the ETLengine for Oracle9 i Warehouse Builder. Warehouse Builder Name & AddressServer provides an open architecture allowing multiple third-party name/ addressproviders to plug-in to Oracle9 i Warehouse Builder.

    Name/ Address Server communication is socket based. Sockets were chosen for

    higher performance. Handling of parse requests and parse results over the socket isdone via messaging.

    Figure 3 A rchitecture

    At the design time Name and Address operator is used in a mapping to modelName and Address rules inside Oracle9 i Warehouse Builder (see screenshot inFigure 4). At runtime, the deployed Name and Address transformationstransparently access parsers and data provided by the third-party vendors. Theprocess of designing and running Name and Address transformation is always thesame, regardless of which third-party provider (or even multiple providers) areused.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    10/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 10

    Figure 4 N ame and A ddress Operator in Oracle9i W arehouse Builder

    Many countries have postal data libraries available from their postal serviceorganizations. The third-party Name and Address providers incorporate such datalibraries into their offerings. When postal data is available address verification andcorrection is most reliable, often to the street level. The list of supported countriesdiffers depending on the provider. The most up to date list of providers and thecountries they support is available on Oracle9 i Warehouse Builder product page atotn.oracle.com.

    Other third-party vendor specific capabilities are Geocoding (census and locationaldata such as latitude and longitude) and postal certification reports: USPS CASSfor United States, SERP for Canada, AMAS for Australia.

    Improving Name and Address Cleansing

    Discrete or semi-discrete input roles yield better parsing results by removingambiguity problems the parser may encounter. An input of Electric Avenue witha role of line1 is ambiguous it could be an electrical appliance business or a streetname. If the same input has a role of firm or primary address, the ambiguity isresolved. Input roles can be selected from a list of predefined role types that varydepending on the selected parsing option. An input role specifies the type of datathat name/ address input item represents.

    Based on the specified Input Roles, the name/ address parser will analyze and

    standardize the input data and return the requested Output Components. Thesecomponents are defined by adding output parameters, then assigning an OutputComponent type to each parameter using the associated predefined list. TheOutput Name, Output Address and Output Status are what the user maps totargets.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    11/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 11

    Several status flags are provided to help determine the quality of parsing and postalmatching. Generally, these flags should be utilized to determine the final quality of a record. High quality records may be mapped to a destination data target, whileother records should be mapped for manual quality checks and correction. Thesplitter operator in the Mapping Editor can be used to achieve this. The Booleanflags specify the degree to which an input record could be matched to postal

    directories. Is Good Group, Is Parsed, Is Found, etc. are examples of such flags.

    Figure 5 Input Roles and Output Components inOracle9i W arehouse Builder

    Figure 5 shows how an input name and address would be parsed into theabbreviated list of name and address components.

    Based on the user defined input roles and output components, Oracle9 i Warehouse Builder generates the necessary PL/ SQL code. This code is thenpassed to the name and address server for processing.

    Overview of the Match/Merge Process

    Oracle9 i Warehouse Builder features a Match/ Merge Mapping Operator thatsupports matching and consolidation of the name and address data includinghouseholding. Mapping wizards assist users through the Match/ Merge operator

    setup.The Match/ Merge Operator will accept records from an input source, determinethe records that are logically the same, and construct a new merged record fromthe matched records. The operator has one ingroup and two outgroupparameters.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    12/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 12

    INPUTINGROUP: Map the input records to this group.

    OUTPUTMERGE: Attributes in this group will be merged. The operator will matchand merge the input record set to provide the merged records. Anyrecords that have been matched will be merged into a single merge record.

    XREF: Cross Reference (XREF) provides a record of the merge process.Every record from the ingroup will construct a record in the XREF outgroup (1-1 cardinality with the ingroup).

    Figure 6 Match/ Merge Operator

    The Match/ Merge operator has an ordered record stream as input. From thisstream, it will construct the match bins based on the defined match bin keys. Fromeach match bin, it will construct matched sets based on defined match rules. Fromeach matched set, it will construct a merged record based on defined merge rules.The initial query will contain an order clause consisting of the match bin keys.

    Constructing Match Bins

    Match bins are subsets of data on which matching is performed. The match binsshould be constructed carefully to fulfill two conflicting needs. One is to make sureany records that should match reside in the bin, and second to keep the size of thebins as small as possible. A small possible match record set is desirable forefficiency.

    Constructing Match Record Sets

    Each match bin will generate one or more match record sets. The match rulessimply determine if two records match. The Match rules will be applied to all

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    13/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 13

    records in the possible match set to determine the match set. The algorithm istransitive, i.e. given records A, B, C with A = B and B = C, A will be considered tobe equal to C without making a comparison.

    Constructing Merge Records

    A single merge record is constructed from the match record set. Users may createspecific rules for defining merge attributes through the merge rules.

    Figure 7 Match/ Merge Rule D iagram

    Match BIN Attributes

    Bin keys limit the possible matches to manageable sets. The fuzzy matchalgorithm will compare each record in the bin to every other record in the set. Thistotal number of comparisons is therefore (N*(N-1))/ 2. It is necessary to select abin key or keys to limit the number of records in the bin.

    Match Rules

    Match rules determine if two records are logically the same and are used to createthe matched record set. A match rule may be active or passive. A passive matchrule is generated but not automatically invoked. All Active match rules are ORdtogether to determine if two records match. At least one active match rule must bedefined Match Rule types are as follows: All Records Match, No Records Match,Conditional, Weight, Person, Firm, Address, and Custom.

    Conditional

    Conditional Match rules specify the conditions under which records match.Conditional match rules allow you to combine multiple attribute comparisons into

    one composite rule. If more than one attribute is involved in the rule then all thecomparisons must be true for the record to be considered matched. That is, theattribute comparisons are AND'ed together.

    Comparison Algorithms: Each attribute in the rule is assigned a comparisonalgorithm, which specifies how the attribute values are compared. Multiple

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    14/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 14

    attributes may be compared in one rule with a separate comparison algorithmselected for each. The types of comparisons are:

    E x act - The attributes match if their values are exactly the same. Forexample, "Dog" and "dog!" would not match, because the second string is notcapitalized and contains an extra character. For data types other than String, this isthe only type of comparison allowed.

    Standardized E x act - The values of the attribute are standardized beforebeing compared for an exact match. With standardization, the comparison ignorescase, spaces, and non-alphanumeric characters. Using this algorithm, "Dog" and"dog!" would match.

    Similarity- For this type of comparison, a "similarity score" in the range 0-100 is entered. If the similarity of the two attributes is equal or greater to this value,then the attribute values are considered matched. The similarity algorithmcomputes the edit distance between two strings. A value of 100 indicates that thetwo values are identical; a value of zero indicates no similarity whatsoever. Forexample, if the string "tootle" is compared with the string "tootles", then the edit

    distance is 1. The length of the string "tootles" is 7. The similarity value is therefore(6/ 7)*100 or 85.

    Standardized Similarity- The values of the attribute are standardized beforeusing the Similarity algorithm to determine a match. With standardization, thecomparison ignores case, spaces, and non-alphanumeric characters.

    Soundex - This converts the data to a Soundex representation and thencompares the text strings. If the Soundex representations match, then the twoattribute values are considered matched.

    Partial N ame - The values of a string attribute are considered a match if thevalue of one entire attribute is contained within the other, starting with the firstword. For example, "Midtown Power" would match "Midtown Power and Light,"but would not match "Northern Midtown Power". The comparison ignores caseand non-alphanumeric characters.

    A bbreviation- The values of a string attribute are considered a match if onestring contains words that are abbreviations of corresponding words in the other.Before attempting to find an abbreviation, this algorithm performs a Std Exactcomparison on the entire string. The comparison ignores case and non-alphanumeric character.

    For each word, the match rule will look for abbreviations, as follows. If the largerof the words being compared contains all of the letters from the shorter word, andthe letters appear in the same order as the shorter word, then the words areconsidered a match. For example, "Intl. Business Products" would match"International Bus Prd".

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    15/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 15

    A cronym- The values of a string attribute are considered a match if onestring is an acronym for the other. Before attempting to identify an acronym, thisalgorithm performs a Std Exact comparison on the entire string. If no match isfound, then each word of one string is compared to the corresponding word in theother string. If the entire word does not match, each character of the word in onestring is compared to the first character of each remaining word in the other string.

    If the characters are the same, the names are considered a match. For example,"Chase Manhattan Bank NA" matches "CMB North America". The comparisonignores case and non-alphanumeric characters.

    Figure 8 Match Rules

    Weight

    Probabilistic matching is implemented in Oracle9 i Warehouse Builder with thematch rule of type Weight. A weighted match rule allows you to assign an integerweight to each attribute included in the rule. A threshold must also be specifiedequaled or exceeded for the data to be considered matched. For each attribute, the

    Match/ Merge operator multiplies the weight by the similarity score, and sums thescores. If the sum equals or exceeds the threshold, the two records beingcompared are considered a match. A weighted match rule is most useful whencomparing a large number of attributes, without having a single attribute that isdifferent causing a non-match, as can happen with conditional rules.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    16/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 16

    Weight rules implicitly invoke the similarity algorithm to compare two attributevalues. This algorithm returns an integer, percentage value in the range 0-100,which represents the degree to which two values are alike. A value of 100 indicatesthat the two values are identical; a value of zero indicates no similarity whatsoever.

    Example of a Weight Match Rule: Assume the following attribute values arecontained in three separate records and were read in the following order:

    Attribute 1 Attribute 2

    Record 1 CA QQ

    Record 2 CA QQ

    Record 3 CA QR

    The match rule is defined as follows: Attribute_1: Weight 50, Attribute_2: Weight50, required score for match: 80

    Record 1 is the initial record read.

    Comparing Record 1 to Record 2: Record 2 has a value for Attribute_1 of CA.That value has a similarity of 100 with the value in Record 1. The weight value forAttribute_1 is 50, so its weighted score is 50 (100% of 50). The total weightedscore so far is 50. The value for Attribute_2 is QQ. That value has a similarity of 100. The weight value for Attribute_2 is 50 so its weighted score is 50 (100% of 50). The total weighted score is 100 (50 + 50). This equals or exceeds the value of the required score for match and so Record 1 and Record 2 are matched.

    Comparing Record 1 to Record 3: Record 3 has a value for Attribute_1of CA,which has a similarity of 100 with the value in the Record 1. The weight value forAttribute_1 is 50, so its weighted score is 50 (100% of 50). The total weightedscore so far is 50. The value for Attribute_2 is QR and that has a similarity of 50with the value in the driver record. The weight value for Attribute_2 is 50, so itsweighted score is 25 (50% of 50). The total weighted score is 75 (50+25). This isless than the value of the Required Score for Match and so Record 1 and Record 3are not matched.

    Person

    Built-in Person rules provide an easy and convenient way for matching names of individuals. Person match rules are most effective when the data has first beencorrected using Warehouse Builder Name and Address. This rule requires users toset what data within the record make up a persons name. The data can come

    from multiple columns and each column specified must be assigned an input roleas to what it means. The Person roles for matching are: Prename, First_name_std,Middle_name_std, Middle_name_2_std, Middle_name_3_std, and Last_name.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    17/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 17

    To create a Person match rule, roles must be assigned to the various attributes. Forexample, you could assign roles as follows for SSN, FirstName, MI, andLastName:

    Attribute Person Role

    SSN

    FirstName First_name_std

    MI Middle_name_std

    LastName Last_name

    The attributes assigned to person roles are used in the match rule to compare therecords. How the attributes are compared depends on which role they have beenassigned to, and what other comparison options have been set. Refer to PersonRoles for a complete list of person roles and how each role is treated in a personmatch rule.

    Person Roles

    Prename: If this role is assigned, the prenames are compared only if thefollowing are true:

    The Last_name and, if present, the middle name(Middle_name_std, Middle_name_2_std, andMiddle_name_3_std roles) in both records match

    The "Mrs. Match" option is selected Either record has a missing First_name_std

    For example, the following would match:

    Prename First_name_std Last_name

    Mrs. William WebsterMrs. Webster

    First_name_std: The First names are compared if this role is assigned. Bydefault, the first names must match exactly, but other comparison optionscan be specified as well. First names match if both are blank. A blank firstname will not match a non-blank first name unless the Prename role hasbeen assigned and the "Mrs. Match" option is set. If a Last_name role hasnot been assigned, a role of First_name_std must be assigned.

    Middle_name_std, Middle_name_2_std, Middle_name_3_std: Middle names arecompared if any of the mentioned roles are assigned. By default, the

    middle names must match exactly, but other comparison options can bespecified. If more than one middle name role is assigned, attributesassigned to the different roles are cross-compared. For example, valuesfor Middle_name_std will be compared not only against otherMiddle_name_std values, but also against Middle_name_2_std if that roleis also assigned. Middle names match if either or both are blank. If any of

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    18/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 18

    the middle name roles are assigned, the First_name_std role must also beassigned.

    L ast_name: Last names are compared if this role is assigned. By default,the last names must match exactly, but you can specify other comparisonoptions. The last names match if both are blank, but not if only one isblank.

    Maturity_postname: If this role is assigned, the post name, such as "Jr.","III," etc., are compared. The post names match if the values are exactlythe same, or if either value is blank.

    Firm

    Built-in Firm rules provide an easy and convenient way for matching names of businesses. Firm match rules are most effective when the data has first beencorrected using Warehouse Builder Name and Address. Similar to the Person rule,this rule requires users to set what data within the record make up a firms name.The data can come from multiple columns and each column specified must beassigned an input role as to what it means.

    For example, if attributes TaxID, Company, and Contact are present, assign rolesare as follows:

    Attribute Firm Role

    TaxID

    Company Firm1

    Contact

    Note that not every attribute needs to be assigned to a firm role, and not every roleneeds to be assigned to an attribute. The attributes assigned to firm roles are usedin the match rule to compare the records. How the attributes are compareddepends on which role they have been assigned, and what other comparisonoptions have been set. Refer to Firm Roles for a complete list of firm roles andhow each role is treated in a firm match rule.

    Firm Roles

    Firm1: If this role is assigned, the business names represented by Firm1are compared. Firm1 names will not be compared against Firm2 namesunless if the Cross-match firm1 and firm2 box is checked. By default, thefirm names must match exactly; but other comparison options can also bespecified. Firm1 names do not match if either or both names are blank.

    Firm2: If this role is assigned, the values of the attribute assigned to Firm2will be compared. Firm2 names will not be compared against Firm1names unless if the Cross-match firm1 and firm2 box is checked. Bydefault, the firm names must match exactly; but other comparison options

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    19/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 19

    can also be specified. Firm2 names do not match if either or both namesare blank. If a Firm1 role is not assigned, a Firm2 roles must be assigned.

    Address

    Address Match rules provide an easy way to match records based on postal

    addresses. Address match rules are most effective when the data has first beencorrected using Warehouse Builder Name and Address. Address Rules work differently depending on whether the address being processed has been correctedor not.

    Generally, corrected addresses have already been identified in a postal matchingdatabase, and are therefore not only syntactically correct, but are legal and existingaddresses according to the Postal Service of the country containing the address.Corrected addresses can be processed more quickly, since the match rule can makecertain assumptions about their format.Uncorrected addresses may be syntactically correct, but have not been found in a

    postal matching database. Addresses may have not been found because they arenot in the database, or because there is no postal matching database installed forthe country containing the address. Address match rules determine whether anaddress has been corrected based on the Is_found role. If Is_found role is notassigned, then the match rule performs the comparisons for both the correctedand uncorrected addresses.A list of address roles is as follows: Primary_address, Unit_number, PO_Box,Dual_primary_address, Dual_unit_number, Dual_PO_Box, City, State,Postal_code, and Is_found. Note that not every attribute needs to be assigned toan address role, and not every role needs to be assigned to an attribute.

    To create an Address match rule, address roles to the various attributes should beassigned. For example, if the attributes Business, AddressLine1, City, StateAbbr,and ZIP exist, the assigned roles are as follows:

    Attribute Address Role

    Business AddressLine1 Primary_address

    City City

    StateAbbr State

    ZIP Postal_code

    The attributes assigned to address roles are used in the match rule to compare therecords. How the attributes are compared depends on which role they have been

    assigned, and what other comparison options have been set. Refer to AddressRoles for a complete list of address roles and how each role is treated in an addressmatch rule.

    Address Roles

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    20/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 20

    Primary_address: If this role is assigned, the primary addresses arecompared. Primary addresses can be, for example, street addresses ("100Main Street") or PO boxes ("PO Box 100"). By default, the primaryaddresses must match exactly, but a similarity option can also be specified.The Primary_address role must be assigned.

    Unit_number: If this role is assigned, unit numbers (such as suite numbers,floor numbers, or apartment numbers) are compared if the primary

    addresses match. The unit numbers match if both are blank, but not if one is blank, unless the Match on blank secondary address option is set. If the Allow differing secondary address is set, the unit numbers are ignored.

    PO_Box: If this role is assigned, the Post Office Boxes are compared. ThePO Box is just the number portion of the PO Box ("100"), and is a subsetof the primary address, when the primary address represents a PO Box("PO Box 100"). If the primary address represents a street address, thePO Box will be blank.

    Dual_primary_address: If this role is assigned, the Dual_primary_address iscompared against the other record's Dual_primary_address andPrimary_address to determine a match.

    Dual_unit_number: If this role is assigned, the Dual_unit_number addressis compared against the other record's Dual_unit_number andUnit_number. The unit numbers will also match if one or both are blank.To assign the Dual_unit_number role, the Dual_primary_address rolemust also be assigned.

    Dual_PO_Box: If this role is assigned, the Dual_PO_Box address iscompared against the other record's Dual_PO_Box and the PO_Box. Toassign the Dual_PO_Box role, the Dual_primary_address role must alsobe assigned.

    City: If this role is assigned, the cities are compared for uncorrectedaddresses. For corrected addresses, the cities are only compared if thepostal codes do not match. If both City and State roles match, then the

    address line roles, such as Primary_address, can be compared. By default,the cities must match exactly, but a last line similarity option may bespecified.. The cities match if both are blank, but not if only one is blank.If the City role is assigned, then the State role must also be assigned.

    State: If this role is assigned, the states are compared for uncorrectedaddresses. For corrected addresses, the states are only compared if thepostal codes do not match. If both State and City roles match, then theaddress line roles, such as Primary_address, can be compared. By default,the states must match exactly, but a last line similarity option may bespecified. The states match if both are blank, but not if only one is blank.If the State role is assigned, then the City role must also be assigned.

    Postal_code: If this role is assigned, the postal codes are compared for

    corrected addresses. For uncorrected addresses, the Postal_code role isnot used. To match, the postal codes must be exactly the same. The postalcodes are not considered a match if one or both are blank. If the postalcodes match, then the address line roles, such as Primary_address, can becompared. If the postal codes do not match, City and State roles are

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    21/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 21

    compared to determine whether the address line roles should becompared.

    Is_found: If this role is assigned, the Is_found_flag attributes are notcompared, but instead are used to determine whether an address has beenfound in a postal matching database, and therefore represents a legaladdress according to the postal service of the country containing theaddress. This determination is important because the type of comparison

    done during matching depends on whether the address has been found inthe postal database or not.

    Custom

    The users can write their own comparison algorithms desired for matching. Theycan also reference one of the other match rule types to logically and or orthem together to produce a unique rule type. The editor will allow editing aPL/ SQL function that returns a true or false. The users can validate the PL/ SQLfrom this editor. Custom rules can invoke other match rules. This allows multiplerules to be chained together in more complex ways than the default, which OR'sindividual rules together. By default, if there is more than one match rule defined

    and "active", then any rule that gives a true condition (matches) results in therecord being matched. Custom rules reference other match rules using the nameof that rule. This allows individual weight rules, conditional rules, etc. to beinvoked from custom rules. If a custom rule invokes another match rule, theinvoked match rule should be marked as "passive." Otherwise the invoked rule willbe executed twice, and will unnecessarily increase processing time or produce anundesirable match.

    Each match rule is represented as a PL/ SQL function with two input parameters.Parameters THIS_ and THAT_ refer to the two records being compared. Alldefined match rules are available in the Custom Match Rule Editor for simpledrag-and-drop. For example, if the are defined passive rules LocationMatch,HomePhoneMatch and WorkPhoneMatch, then this custom rule simplyaccomplishes the logical AND of those rules.

    BEGIN

    RETURN LocationMatch(THIS_, THAT_) and

    (HomePhoneMatch(THIS_, THAT_) and WorkPhoneMatch(THIS_,THAT_));

    END;

    XREF GroupXREF provides a record of the merge process. Every record from the ingroup willconstruct a record in the Xref outgroup (1-1 cardinality with the ingroup). TheXref record may be populated with both the original attribute values and themerged attributes.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    22/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 22

    Merge Rules

    Matching will produce a set of records that are logically the same. Merging is theprocess of creating one record from the set of matched records. A Merge rule isprovided for each attribute of a record and uses the matched records to generate amerged value for that attribute.

    Figure 9 M erge Rules

    Any

    This rule will pick any (first non-null) value of the match set for this attribute.

    Rank

    This rule will rank the records from the match set. The associated attribute fromthe highest ranked record will be used to populate the merge attribute value.

    Sequence

    User can specify a database sequence for this rule. The nextval of the sequence willbe used for the value.

    Min/ Max

    User can specify an attribute and a relation to choose the record to be used as asource for the merge attribute.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    23/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 23

    Copy

    User can choose a value from a different previously merged value.

    Any RECORD, Rank RECORD, Min/ Max RECORD, custom RE CORD The RECOD merge rules select a matched record to populate values for a groupof attributes.

    Custom User can create a PL/ SQL package function to select the merge value. Theoperator will provide the signature of this function. The user is responsible forthe implementation of the rule from BEGIN to END; The matchedrecords and merge record are parameters for this function.

    CONCLUSION

    A unified, enterprise-wide view of customer data is the foundation for successfulData Warehousing and Business Intelligence. As companies implement datawarehouses, data quality should not be overlooked.

    Oracle9 i Warehouse Builder provides the full range of data consolidation andcustomer data quality capabilities required to build and to maintain customer-focused databases in a single, integrated solution. It produces significantproductivity gains over traditional multi-vendor solutions involving general-purpose transformation tools, mailing utilities and extensive custom coding.Specifically in data quality area, Oracle9 i Warehouse Builder allowsaccomplishment of tasks such as:

    Name and Address parsing, standardization, verification and augmentation

    Matching and Merging for the purpose of de-duplication

    Matching and Merging for the purpose of record-linking (includinghouseholding)

    This leads to providing a higher ROI through delivery of:

    Improved relationship and reputation with customers by better targetingcommunications and campaigns

    Reduced costs of marketing campaigns due to postal discounts for cleanseddata

    Reliable geography specific marketing

    Customer-oriented services as necessary (address validation, correction,lookups, householding, etc.).

    Other initiatives requiring a cleansed and unified view of customer.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    24/25

    Oracle 9 i Warehouse Builder, Integrated Data Quality Page 24

    Oracle9 i Warehouse Builder addresses the challenges of data quality by offeringdata cleansing and integration in one tool, allowing businesses to more accuratelylocate, understand and target customers. Best of all, Oracle9 i Warehouse Builderoffers the flexibility and accuracy that data consolidation projects demand, yet at alow cost. By implementing Oracle9 i Warehouse Builder, you will build a solidfoundation for data quality in your data warehouse.

  • 8/14/2019 Warehouse 92 Data Quality Whitepaper

    25/25

    Oracle9 i Warehouse Builder, Integrated Data Quality

    July 2003

    Authors: Behnaz Westman, Nikolai Rochnik

    Oracle Corporation

    World Headquarters

    500 Oracle Parkway

    Redwood Shores, CA 94065U.S.A.

    Worldwide Inquiries:

    Phone: +1.650.506.7000

    Fax: +1.650.506.7200

    www.oracle.com

    Oracle is a registered trademark of Oracle Corp oration. Variousproduct and service names referenced herein may be trademarks

    of Oracle Corporation. All o ther product and service namesmentioned may be trademarks of their respective owners.

    Copyright 2003 Oracle Corporation

    All rights reserved.