Implementing Data Quality as a Corporate Service

Introduction written by: Colin White, President, BI Research



    Executive Summary

This paper has been designed for a more technical audience, such as information technology (IT) professionals or systems integrators, who have a general understanding of the benefits of data quality in a corporate setting. Companies that have already implemented data quality solutions and want to improve them, or organizations that are considering implementing a data quality solution, will benefit from the real-world practical knowledge shared in this paper.

Noted author, speaker, and Business Intelligence (BI) and Customer Relationship Management (CRM) expert Colin White, president of BI Research, provides an introduction to this subject with unique insights into the need for data quality in corporate computing systems, and how it fits into today's world of the "real time" enterprise. Corporate decision-making depends on the information behind those decisions, according to White, and it is critical that businesses consider how the architectural design of corporate systems impacts data quality efforts.

This paper provides practical application of data quality in a service-oriented architecture (SOA), with examples of how organizations are taking advantage of solutions designed for the next generation of corporate computing. Specifically, it will help IT professionals and systems integrators to:

• Understand the history of data quality solutions and corresponding architectures.
• Realize how a data quality solution built with SOA can be beneficial to an enterprise.
• Recognize the features that should be considered when looking for a data quality solution, especially those that are possible with SOA.

While it is the intention of this paper to answer the most common questions about data quality in SOA environments, requirements obviously vary greatly from organization to organization. Specific questions about computing environments or particular needs are welcome. Simply contact Firstlogic at 888.215.6442 or email [email protected].

Copyright © 2004 by Firstlogic, Inc. All rights reserved. No part of this publication may be stored in a retrieval system, transmitted or reproduced in any way, including but not limited to photocopy, photograph, magnetic or other record, without prior written agreement and permission of Firstlogic, except for such limited purposes as may be authorized by the Copyright Act of 1976. Printed in the USA.


Meeting Evolving Business Needs with a Data Quality Service

an introduction by Colin White, president of BI Research

Information is power, and companies today cannot operate effectively or compete successfully unless they give their users timely access to accurate and consistent information. The three key words here are timely, accurate, and consistent.

Timely Information. The concept of time is changing in organizations. It used to be that companies would run their planning cycles annually, and executives and line-of-business (LOB) managers would optimize business processes to satisfy those plans at monthly intervals. In today's highly competitive business world, these long decision-making cycles are no longer acceptable. Successful organizations now run their budgeting and forecasting cycle several times a year, and continuously manage and optimize their critical business processes to ensure that operational, tactical, and strategic business goals are being met.

Accurate Information. Information is useless unless it is accurate. A popular computer expression is garbage in, garbage out, and this applies equally to business decisions and actions. Sound information makes for informed decisions, but bad information results in poor decisions. Accuracy is affected by time. In the old data processing and business worlds of batch processing and monthly business decision-making cycles, organizations had time to analyze and fix data quality problems. In today's fast-paced world of the Internet, companies no longer have the luxury of time. Internet customers applying for new credit cards or loans want fast answers, and will go to a competitor if they do not get them. Consumers ordering from Web storefronts do not want to be told the following day that the product is out of stock. Being able to react rapidly is a competitive advantage, but fast decisions based on inaccurate information can lead to bad loans and high-risk clients, which ultimately hurts the bottom line.

Consistent Information. Even though information consistency and accuracy are related, they are not the same. Companies are becoming more and more automated, and this has led to information being dispersed across a multitude of applications and systems. Take customer data, for example. In front-office systems, like CRM, customer data may be spread across customer sales, marketing, and support systems. In the back office, this customer data may exist in order entry, billing, and shipping systems. External information providers may supplement customer data with information about credit history, demographics, and so forth. There are also multiple customer touch points to deal with, from Internet storefronts, to physical retail stores, and customer support centers. Each of these systems and touch points may contain accurate customer data, but is it consistent? This data may reflect different moments in time, may be formatted differently, and may often reflect different business definitions. Customer name and address data is an example where major consistency problems exist across systems and between applications. These inconsistencies are a serious obstacle to obtaining the single view of the customer that many companies are looking for.


Managing Data Quality

Ever since the advent of the computer and data processing software, data quality has been an issue, and a major stumbling block to effective and accurate decision making. Life was much simpler in the past. Early business transaction processing systems consisted primarily of batch applications that processed and exchanged information using batch files. In this environment, data quality was reactive, in that accuracy and consistency were checked after the fact by validating batch input and interchange files. The only time pressure here was the elapsed time to process those files. Decision support applications during this time consisted mainly of regular weekly, monthly, and quarterly batch reporting jobs, and so again there was time to fix data quality problems.

As the use of business transaction applications evolved, companies changed from a batch mode of operation to an interactive and online one. This evolution saw the introduction of terminal-driven applications, client/server computing, and today's Web-based systems. At the same time, companies began to use application integration middleware and distributed computing to transfer data between systems. The move toward the online enterprise forced organizations to be less reactive and more proactive in their approach to data quality. Online applications and application integration middleware now included data validation, lookup routines, and business rules to dynamically verify the accuracy of the data as it was created and moved between systems.

Decision-making technologies also changed as organizations became online enterprises. The use of business intelligence (BI) and data warehousing (DW) saw dramatic growth as these technologies offered the ideal solution for supplying integrated, summarized, and historical transaction data for strategic planning and tactical business analysis.

Although data quality management in business transaction systems has become more dynamic and proactive, this is not the case with most BI/DW systems. Data quality management in DW is still reactive in nature. Data warehouses are maintained primarily by running regular batch jobs. As batch files of extracted business transaction data are processed by extract, transform, and load (ETL) applications, rules-driven data quality routines are run to check data accuracy, and to ensure the consistency of the integrated data warehouse information.

The use of BI/DW applications in organizations, however, is going through a major paradigm shift. Companies are starting to use BI/DW applications not just for strategic and tactical reporting and analysis, but also for managing day-to-day and intra-day business operations. As a result, ETL processes are moving from a batch update cycle to one of capturing a continuous stream of business transaction data for updating data warehouses with near-real-time information.

BI performance management applications in turn are starting to use near-real-time data for creating operational business performance dashboards for executives and LOB managers to monitor and manage intra-day business performance. These applications enable executives and LOB managers to compare actual business performance to business goals, and to take rapid action when performance metrics indicate that goals are not being met.

This BI paradigm shift means that BI/DW applications are becoming essential to business success because they are responsible for driving and optimizing daily business operations. This trend will put even more pressure on BI/DW groups to guarantee the accuracy and consistency of data. It will also mean, as with business transaction processing, that data quality management must become more proactive and less reactive. This requires BI/DW applications to check the quality of information dynamically, in-flight, as it flows between systems and applications.

The Need for Service-Oriented Architecture

Business transaction, data warehousing, and business intelligence processes are becoming interconnected and closer to real-time in nature. The benefit to business users of this real-time architecture is that they have access to the information they need to monitor business operations and react rapidly to changing business needs and circumstances.

Although a real-time IT system may be able to deliver timely information to business executives and managers, this is of limited value unless it can also handle real-time requests to modify business rules and processes that come as a result of faster decision making. The problem is that most rules and process interconnections are hard-wired into existing monolithic applications, and are difficult to change dynamically. The solution to this problem is service-oriented architecture (SOA).

SOA is based on a network of loosely-coupled components that can be interconnected using common and open standards. Components may be applications, shared services, and so forth. The SOA concept is not new. In the past, software vendors have used technologies such as CORBA and the Java Message Service to interconnect disparate application components in support of SOA. The issue with early attempts at supporting SOA was that many of the solutions were proprietary and complex to implement.

The advent of Web services and XML-based protocols has made SOA more viable because these services and protocols are easier to implement, are more flexible, and are based on open standards. Another key benefit is that an existing application can be wrapped and presented as a Web service, which supports an orderly migration to a modern SOA environment. The use of Web services, however, is not a prerequisite for SOA.

SOA is ideally suited for a real-time and dynamic enterprise because processes can be interconnected easily in a flexible architecture that can adapt to changing business needs. It allows service functions, such as data validation and data transformation, to exist as separate components that can be called by business process components as required. SOA also means that business rules used by service components no longer have to be embedded in business processes and applications, but can be maintained and shared independently from the business components that use them.


A Promising Future

In summary, modern business transaction and business intelligence technologies can work cohesively together to enable organizations to work smarter and make more timely decisions. As always, accurate and consistent information is crucial to success. The advent of SOA enables data quality management software to act as a service that can be shared by multiple business processes. Separating data quality management rules from the processes that use them improves flexibility and makes it possible for the rules to be dynamically maintained to meet constantly changing business needs.

    Further Defining Service-Oriented Architecture

While Colin White's introduction to this paper discussed some of the basics of SOA, it is prudent to define SOA as it relates to data quality solutions. An article from O'Reilly's webservices.xml.com defines SOA as:

An architectural style whose goal is to achieve loose coupling among interacting software agents. A service is a unit of work done by a service provider to achieve desired end results for a service consumer. Both provider and consumer are roles played by software agents on behalf of their owners. (He, 2003, "SOA Defined and Explained")

For those new to the idea of SOA, the above definition may sound a bit abstract. As this paper discusses SOA in terms of data quality, readers should keep the following points in mind:

• SOA consists of a service, the service provider, and the service consumer.
• The service is some bit of functionality or "work" to be performed.
• The service provider is the mechanism by which the service is made available (usually via a server).
• The service consumer is the piece of software that takes advantage of the functionality provided by the service (in other words, it consumes the service).
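To make these three roles concrete, the following is a minimal, hypothetical sketch in TypeScript. The interface, class, and function names are illustrative only and are not drawn from any particular product; they simply show how a consumer can use a service without knowing how the provider implements it.

    // A hypothetical data quality service contract: the "work" to be performed.
    interface AddressCleanseService {
      cleanse(rawAddress: string): string;
    }

    // The service provider: the component that actually performs the work
    // (in a real SOA it would typically run on a server behind a Web service endpoint).
    class SimpleAddressCleanseProvider implements AddressCleanseService {
      cleanse(rawAddress: string): string {
        // Placeholder logic: trim whitespace and normalize case.
        return rawAddress.trim().toUpperCase();
      }
    }

    // The service consumer: any application that uses the service without
    // knowing how the cleansing is implemented.
    function registerCustomer(address: string, service: AddressCleanseService): void {
      const cleansed = service.cleanse(address);
      console.log(`Storing customer address: ${cleansed}`);
    }

    registerCustomer("  123 main st  ", new SimpleAddressCleanseProvider());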

Furthermore, when describing SOA, some immediately jump to the conclusion that Web services imply an Internet-based Application Service Provider (ASP) model. In fact, organizations are instead frequently building services within their intranet environment, behind their corporate firewall, to provide such information services.

Resources and References

More information about the general concept of service-oriented architecture can be found in the "More Information about SOA" section near the end of this paper.


    Data Quality Solutions before SOA

For many years, the philosophy and practice of the "online enterprise," mentioned in the paper's introduction, did not receive much attention, nor did the practice of data quality. As a Forrester Research, Inc. article points out, "Key business applications were not designed to interact with one another, and sharing of information across application boundaries frequently requires point-to-point coding" (Gilpin & Vollmer, 2004). Data quality solutions, in turn, were built with this architecture in mind — a "silo" (or stovepipe) architecture.

    Figure 1: Data quality deployed in a “silo” architecture

As shown in Figure 1, organizations have historically implemented different solutions from one or more vendors to solve data quality needs across various business-critical applications (ERP, CRM, Data Warehouse, etc.). A few examples include Web-friendly application programming interfaces (APIs) for integrating into browser-based applications and e-commerce, "tight-integration" APIs for desktop applications, and stand-alone back office applications for working with an enterprise's data warehouse. With the data environment being highly dynamic, this siloed approach hinders organizations from meeting the accuracy or consistency requirements alluded to in the introduction.

A Service without the Architecture

Data quality has always been a service for other business applications, never the application itself, though its architecture has not always matched its role. As enterprise-level business applications became more diverse, the need for consistent data quality across all of these applications grew. However, the architecture of many solutions did not readily support company-wide IT initiatives. Therefore, the costs also increased for an IT staff to purchase, learn, implement, and maintain a consistent data quality solution across the enterprise. The drawbacks of the silo architecture in data quality solutions became evident. Here are just a few examples:

• Slightly different implementations for each silo often meant inconsistencies in how data was handled. This obviously caused problems for IT departments trying to maintain consistency across an enterprise.
• IT programmers were required to be "data quality experts," because business rules were coupled with the API, and more staff was needed to manage multiple implementations.
• APIs were often very proprietary. This could mean that a programmer could not integrate in his or her preferred language. Or, the programmer for a Web-friendly API might have little or no knowledge carry-over to the tight-integration API. Therefore, each silo's integration basically started from scratch.
• As enterprise applications began sharing data, databases grew increasingly larger. Many data quality applications were not able to scale to meet this new demand, meaning longer processing times and more demand on hardware resources.

    Enter Service-Oriented Architecture

Many IT professionals can relate to the problems noted above. However, a new generation of data quality solutions has begun to use the principles of SOA to alleviate some of the drawbacks of silo-type implementations. Where the silo architecture has individual and potentially disparate data quality solutions for each business application, the service-oriented architecture treats data quality as the ubiquitous service it truly should be in an enterprise (see Figure 2).

Figure 2: Data quality in a service-oriented architecture


The technical details of how an organization can realize the advantages of SOA with data quality solutions are discussed in the coming sections. Below are some inherent advantages:

• Less time to create and maintain data quality solutions.
• More flexibility in terms of how and where data quality solutions are deployed.
• Reduced learning curves for integrating data quality solutions.
• Little or no need for IT staff to become data quality experts.
• Faster implementations, improved data quality results, and reduced costs.

    What to Look for in a Data Quality Solution

At a high level, this paper has discussed some of the disadvantages of the silo approach to data quality implementation, and some of the advantages that SOA claims to offer. But how does an organization realize these advantages? Picking any data quality solution that is built on Web services does not necessarily guarantee all of the potential advantages that a true SOA design can offer.

There are very specific features to look for in a data quality solution, especially those focused on SOA. The following sections include an in-depth technical discussion (where appropriate) of what to look for in a data quality solution, and explain how a services approach directly impacts an IT professional or a systems integrator implementing data quality across the enterprise. This section will discuss:

• Evaluating a data quality API
• Defining business rules
• Selecting a service provider (data quality server)
• Other features important for a data quality solution

    Evaluating a Data Quality API

Critical to the success of any integration project is evaluating and selecting a data quality API. The right API will speed implementation, meeting or exceeding integration timelines and reducing maintenance efforts. A poor API has the potential to lock up the most skilled IT engineering resources in a spiral of confusion and missed deadlines, with long-term and intensive management requirements. Total cost of ownership (TCO) must be evaluated with as much scrutiny as the cost of the technology itself. When selecting an API that will reduce TCO and enhance productivity, one should look for the following characteristics:

• Business rules decoupled from the API
• An API that follows industry standards


Business Rules Decoupled from the API

Business rules define exactly how the data quality processing should occur for a specific data set. For example, business rules define which fields to cleanse, which new fields to add to the data, how to standardize data, and a plethora of other options.

For any programmer, having to learn a new product's API in order to integrate it is demanding enough. In the past, IT staff members tasked with integrating a data quality application typically had to become data quality experts as well. Not only were they required to work with internal customers to establish business rules, they also had to translate the rules into the new API.

A data quality solution built on SOA can, however, eliminate this problem by allowing the business rules to be completely decoupled from the API. Instead of learning all the nuances and minutiae of data quality, IT staff members can leave the business rules to a business user (or appointed enterprise-wide or department-level data quality expert). This allows IT resources to concentrate on programming communication between the service consumer and provider.

Consider the example of a call-center application where customer information is collected. For simplicity's sake, assume the organization simply wants to cleanse address data within its proprietary call center application.

Without targeting any specific products, Figure 3 offers a pseudo-code example of the programming necessary to standardize a domestic address.

    /* Set up the standardization parameters */
    set_option(OPT_ASSIGN_CITY_BY_INPUT_LLIDX, TRUE);
    set_option(OPT_PLACENAME, CONVERT_PLACENAME);
    set_option(OPT_STND_ADDR_LINE, TRUE);
    set_option(OPT_STND_LAST_LINE, TRUE);
    set_option(OPT_UNIT_DESIG, UNIT_DIRECTORY);
    set_option(OPT_CAPITALIZATION, UPPERCASE);
    set_option(OPT_DUAL_TYPE, DUAL_MAILING);
    set_option(OPT_APPEND_PMB, TRUE);

    /* EWS is required for CASS Certification */
    set_mode(MODE_ENABLE_EWS, TRUE);

    /* Set location of look-up directories and dictionaries */
    set_file(DIR_ZIP4_1, "C:\data_quality\data\zipfile.dir");
    set_file(DIR_REVZIP4, "C:\data_quality\data\revzipfile.dir");
    set_file(DIR_CITY, "C:\data_quality\data\cityfile.dir");
    set_file(DIR_ZCF, "C:\data_quality\data\zcffile.dir");
    set_file(DIR_EWS, "C:\data_quality\data\ewsfile.dir");
    set_file(DCT_CAP, "C:\data_quality\data\capitalization.dct");
    set_file(DCT_FIRMLN, "C:\data_quality\data\firms.dct");
    set_file(DCT_ADDRLN, "C:\data_quality\data\addressline.dct");
    set_file(DCT_LASTLN, "C:\data_quality\data\lastline.dct");

    /* Set input fields */
    set_line(IADDRESS_LINE, tmpbuf1);
    set_line(LASTLINE, tmpbuf2);
    set_line(ZIP4, (char *)" ");
    set_line(URB, tmpbuf3);
    ...

    Figure 3: Pseudo-code excerpt: API coupled with business rules

The code in Figure 3 is just a very small excerpt of what could be necessary for an API that is coupled with business rules. It merely sets a few options, the locations of some necessary files, and the input fields. Completing such an example would require defining input field formats, determining locations for input and output of data, specifying processing options for reports, and numerous other pieces of functionality that would be necessary. Obviously, there is still much more code to be written.

    var hostname = "server1";
    var portnumber = "20003";
    var busrulelocation = "\\server2\dataquality\busrules";
    runBatchProject(hostname, portnumber, busrulelocation, myproject3);

    Figure 4: Pseudo-code excerpt: API decoupled from business rules

The example shown in Figure 4 depicts what a programmer might have to specify to run a project in an application built on SOA where the business rules are decoupled from the API. The programmer would simply specify information about the service provider and the set of business rules to use. Other optional methods could be used, but the example above may be all the code necessary to call a batch project when a data quality solution is built with SOA.

These examples have been simplified to show only calls to the API (for example, no user-interface code is shown). Though simplified, these examples show the true advantage of an API that is decoupled from business rules. In the coupled example, if a change to even a simple business rule preference were made, it would require a corresponding change to an API call in the code, resulting in recompilation, testing, and redeployment to production systems. However, a similar change to the decoupled example could require only a change within the business rules, not the code that calls the data quality solution.

It is easy to see the time and energy this would save for initial creation and maintenance of data quality code in any enterprise application.

API Follows Industry Standards

Earlier sections touched on the difficulties caused by proprietary APIs. There is always a learning curve involved with a new API, and the less it adheres to industry standards, the higher the learning curve will be. Additionally, many existing data quality APIs are at least somewhat limited in terms of integration language support or platform support.

What happens when a data quality solution has an API available only in C++, but Java is the preferred language of the IT department? Or, a Solaris solution is required, but the API is available only on Windows? The company is then forced to either pass on what could be an otherwise-good solution, or adjust standard business practices to fit in the solution.


This is where Web services can provide significant value. As mentioned before, Web services alone are not synonymous with SOA, but can be a very important part of an enterprise-wide SOA. If a data quality tool uses a Web service interface, what does it mean?

• Platform independence ensures that the solution will fit any environment; the environment would not have to be fit to the solution.
• Implementation independence enables use of whichever programming language the IT department is comfortable with. This can help keep the learning curve low.
• Industry standards mean a head start if the IT professionals have integrated other Web services. Additionally, companies have the option of using third-party development tools available for the industry standards.
• Web services are an ideal model for working in a heterogeneous environment (such as a mixture of Windows and UNIX systems).
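To illustrate the language and platform independence described above, here is a minimal, hypothetical sketch in TypeScript of a client posting a SOAP request to a data quality Web service. The endpoint URL, envelope structure, and element names are assumptions for illustration only; a real product would publish its own WSDL and schema.

    // Hypothetical SOAP request to a data quality Web service (names are illustrative).
    const endpoint = "http://dqserver.example.com:20003/services/AddressCleanse";

    const soapEnvelope = `<?xml version="1.0" encoding="utf-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <CleanseAddress xmlns="http://example.com/dataquality">
          <addressLine>123 main st</addressLine>
          <lastLine>la crosse wi 54601</lastLine>
        </CleanseAddress>
      </soap:Body>
    </soap:Envelope>`;

    async function cleanseAddress(): Promise<string> {
      // Any platform or language that can issue an HTTP POST can consume the service.
      const response = await fetch(endpoint, {
        method: "POST",
        headers: { "Content-Type": "text/xml; charset=utf-8" },
        body: soapEnvelope,
      });
      return response.text(); // The cleansed address comes back in the SOAP response body.
    }

    cleanseAddress().then(console.log);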

    Defining Business Rules

Probably as important as the API is the way that business rules are defined. Without an API, there is no way for an application to tie in data quality. Without business rules, there is no way to tell the application what to do once the data quality processes have been launched. One should look for the following in business rule definition:

• Centralized business rules
• Business rules with inheritance
• Predefined business rules
• How the rules are defined

Centralized Business Rules

In a silo architecture, each data quality solution generally had its own way to define business rules. Some business rules may be defined directly in an API, whereas others may be defined in a proprietary configuration file. Spreading business rules across the enterprise leads to a number of problems such as:

• Inconsistency: Siloed implementations, each with unique business rules, result in inconsistent data formats and content.
• High maintenance costs: Even if an organization uses a single vendor's solution in multiple implementations, what happens if a business rule is updated in one spot? There will likely be the need for an internal process or mechanism to pass that change throughout the enterprise.


A data quality solution with a centralized set of business rules, accessed by the service provider, is a key component of a data quality SOA. With a centralized set of business rules, rules are defined in the same way, providing consistency across implementations.

If multiple applications use the same business rules configuration, a centralized set of rules instantly eliminates much of the maintenance cost. A user can update a rule in one spot and it is updated throughout all of the enterprise's applications.

Business Rules with Inheritance

Consistent data quality across enterprise applications requires consistency of business rules. Establishing corporate-wide data quality standards through business rules supports consistency. However, there are often project-specific nuances that must be considered, and therefore subtle changes to business rules become a necessity. One would assume that development and maintenance time for the business rules has been immediately increased. This is not the case if the data quality solution supports the inheritance principle for business rule definition.

What does inheritance mean for data quality business rules? One can think of it in terms of programming. Imagine that a programmer has a block of code that he wants to use in multiple places within an application. If following good programming practices, the programmer is not copying and pasting that code in multiple places. Instead, the programmer would define a reusable function and simply call that function where necessary. If updates to the code are needed, the programmer would update the function directly, which would automatically propagate the change wherever the function is used.

The same functionality should be available in a data quality solution. When defining business rules, components should be reusable. For example, data quality projects should be able to inherit settings from lower-level components. That way, a component could be shared across many projects. Just like the function example, if the low-level data quality component were updated, that change would be inherited by all projects (see Figure 5).

Figure 5: Projects A and B both inherit the same business rules for address cleansing


This is a fairly typical (albeit simple) flow of data in a data quality process. In each project, the application is configured to cleanse address data. However, the data source and target are different in each project. If the data quality tool supports the inheritance concept for business rules, the address cleanse process is defined independently of the other pieces, for example. Then, each higher-level project inherits that object. Any change made to the address cleanse process is then picked up by each project, drastically reducing the cost of maintaining multiple projects.

Similarly, the solution should also have the option to "override" business rules, if necessary. That way, the advantage of inheritance still exists, but there is also the flexibility to override a rule if it makes sense for a given project.

The inheritance idea is not necessarily tied to SOA. However, combining inheritance with SOA truly enhances the power of this functionality. There can be a huge amount of maintenance time saved if all data quality projects across the enterprise share common business rules.
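As a rough illustration of the inheritance and override ideas described above (the type, field, and rule names here are hypothetical, not drawn from any particular product), a shared address-cleanse rule set might be defined once and then reused or selectively overridden per project:

    // Hypothetical shared business rule component, defined once and reused.
    interface AddressCleanseRules {
      capitalization: "UPPERCASE" | "MIXED";
      appendZip4: boolean;
      standardizeAddressLine: boolean;
    }

    const corporateAddressRules: AddressCleanseRules = {
      capitalization: "UPPERCASE",
      appendZip4: true,
      standardizeAddressLine: true,
    };

    // Project A inherits the corporate rules unchanged.
    const projectARules: AddressCleanseRules = { ...corporateAddressRules };

    // Project B inherits the corporate rules but overrides one setting.
    const projectBRules: AddressCleanseRules = {
      ...corporateAddressRules,
      capitalization: "MIXED", // project-specific override
    };

    // A change to corporateAddressRules is picked up by any project that builds
    // its rules from the shared definition, mirroring Figure 5.
    console.log(projectARules, projectBRules);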

Predefined Business Rules

The cost of learning any new software platform can be a bit burdensome on an IT professional or systems integrator. This can also be true of data quality tools. However, data quality solutions can include capabilities to help reduce this learning curve.

For example, a data quality solution should include a wide array of predefined business rules. The company that creates the data quality solution should be an expert on that subject, and with predefined rules, vendors can pass on some of that expertise. It is certainly easier to modify a set of rules to fit a given set of data than it is to start completely from scratch. A data quality solution should provide a wide variety of predefined rules for projects similar to those common for most enterprises.

How the Rules are Defined

It goes without saying that a data quality tool should include an intuitive interface for defining business rules. A good interface can lessen the learning curve and the time to create projects for a data quality expert or business user.

A user interface (UI) should be easy enough for a business user to feel comfortable working with. It should enable a business user to set up the basic framework of a data quality project. Ideally, the UI should provide a graphical view of the data quality process, allowing the user a visual representation of how the data will be cleansed.

Selecting a Data Quality Service Provider (Data Quality Server)

As discussed in the definition of SOA, one of the necessary components for any solution built on SOA is a service provider. Chances are that if a data quality tool is built with SOA, its service provider will be some sort of data quality server. This is where the real work of data quality processing will take place.

This piece may be one of the least visible components of a data quality solution — it is usually running in the background of a system, accepting and processing data quality requests. However, this component is certainly just as important as any other piece of a data quality solution. The data quality service should include these types of features:

• Flexible server configuration
• Support for standard data formats
• Server scalability


Case Study: Avid Technology

by Colin White, president of BI Research

    An excellent example of how IT systems and data quality management have evolved from a batch architecture to a real-time one is Avid Technology, a provider of digital media creation, management, and distribution solutions.

    Avid uses an Onyx Software system to handle its customer center operations, and SAP Business Information Warehouse (SAP BW) to manage its business intelligence environment. Customer data for Internet and e-mail marketing is extracted from the Onyx CRM system and loaded in batch mode to SAP BW once per quarter. During the ETL processing, data quality management software from Firstlogic is used to perform a number of data quality routines ensuring the best customer information is entered into the CRM system. Data cleanup improves data accuracy and reduces marketing costs. On average, about 15 percent of the data contains duplicate information.

    At the beginning of 2003, Avid decided to use SAP CRM to expand its front-office initiatives and to include customer information coming from the Web sites of its three independent business units. Unlike the Onyx environment, no data quality validation routines were put into place to manage customer data. The company quickly found that poor data was finding its way into SAP BW and its associated BI applications.

    To solve this problem, Avid implemented Firstlogic IQ8 Integration Studio™ to check data coming from all customer touch points. This real-time and sharable service dynamically checks the data collected from Avid’s three Web environments before it is loaded into SAP CRM. The benefits of this approach are that all Web activity is subject to the same business rules, and the shared business rules can be maintained interactively and independently from application processing.

    Avid intends to extend the use of its service-oriented approach to data quality management to include dynamic processes that ensure that customer orders and shipments satisfy regulatory compliance such as the USA PATRIOT Act.

Flexible Server Configuration

In the software world, the term "flexible" is often an overused buzzword. But consider how important flexibility is for any software solution. It can mean the difference between a relatively easy or difficult setup and integration into an enterprise. Flexibility is just as important for a data quality server.

A data quality solution should have a server that is flexible across many platforms. For example, a data quality server should be able to reside on either a UNIX or Windows server, yet still be able to communicate with other UNIX and Windows computers, regardless of platform. Again, the solution should fit into any environment, and not force the existing environment to adapt to the solution.

The data quality solution should also be implementation independent. For example, if the data quality solution is integrated into both a Web-based application and a "thick client" desktop application, both of these applications should be able to use the same data quality server and business rules. Likewise, it should be possible to use the same server for both batch and transactional processing. This can simplify an installation environment, easing the burden on initial setup and maintenance. Also, if the data quality servers have any configurations of their own, this can help ensure consistency between servers.

Conversely, the solution should also allow use of multiple data quality servers. For example, the user should have the ability to distribute the load by spreading work among multiple servers. There could also be one server for transactional processing, and one for batch processing, to ensure transactional requests are getting an appropriate response. In this scenario, transactional requests would be protected from any lag that could be caused while a batch process was running.

As a side note, in an environment with multiple servers, the data quality solution should also allow for shared business rules across servers. Again, all of the advantages of centralized business rules mentioned before apply here.
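A deployment along these lines might be described with a configuration like the following sketch. The server names, ports, and fields are hypothetical and not taken from any product; the point is simply that both servers reference one shared business rules location.

    // Hypothetical deployment configuration: two data quality servers, one for
    // transactional requests and one for batch jobs, sharing centralized rules.
    interface DataQualityServerConfig {
      host: string;
      port: number;
      mode: "transactional" | "batch";
      businessRulesLocation: string; // shared, centralized repository
    }

    const sharedRules = "\\\\server2\\dataquality\\busrules";

    const servers: DataQualityServerConfig[] = [
      { host: "dqserver-tx", port: 20003, mode: "transactional", businessRulesLocation: sharedRules },
      { host: "dqserver-batch", port: 20004, mode: "batch", businessRulesLocation: sharedRules },
    ];

    // A consumer picks the appropriate server by workload type, so batch traffic
    // never competes with low-latency transactional requests.
    function selectServer(mode: "transactional" | "batch"): DataQualityServerConfig {
      return servers.find((s) => s.mode === mode)!;
    }

    console.log(selectServer("transactional").host);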

Support for Standard Data Formats

One problem with many of the silo-generation solutions is that they offer support for a very limited number of data formats. Typically, these tools support ASCII flat files, and occasionally one of the dBase formats. In addition, some solutions from this generation require an even more proprietary format, forcing conversion of data into a vendor-specified layout.

If a company's data were in a relational database format, like SQL Server or Oracle, there would likely be a need for additional business processes to accommodate the data quality solution. For example, data would need to be converted to one of the accepted formats, processed, then reconverted and reloaded into the preferred data format. Certainly, this is inconvenient, time consuming, error-prone, and often causes more development work.


Support for most data formats is one of the trademarks of a solution that is truly a service, because the solution can tie in seamlessly with data, just as it can tie in seamlessly with applications.

Server Scalability

It goes without saying that faster is better. As databases continue to grow, scalability becomes increasingly important. Take the example of a data quality process running overnight so that its hardware resources are free during the workday for other tasks. Now, due to a growing dataset, the process takes too long to run overnight. The process could be moved to a weekly – instead of nightly – process and be run on the weekend, but the advantages of regular data quality processing are lost. The process could be moved to a computer with more processing power, but if the solution does not scale, that really solves nothing. These are just a couple of examples that demonstrate the importance of scalability.

Advertising that a solution is "scalable" is not necessarily enough, though. A data quality solution should scale in the following ways:

• The solution should scale to support multiple projects and increasing numbers of concurrent transactional users.
• Most data quality processes are made up of a number of sub-processes (e.g., address cleansing, data cleansing, data appending, matching/consolidation, and so on). The solution should allow these individual sub-processes to be tuned. For example, the ability to adjust the number of threads supported by each sub-process means fine-tuning processing to truly get the best performance out of each data quality project.
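Per-sub-process tuning might look something like the following hypothetical configuration; the sub-process names and the threads field are assumptions used for illustration, not settings from any specific product.

    // Hypothetical tuning of threads per sub-process within one data quality project.
    interface SubProcessTuning {
      name: "addressCleanse" | "dataCleanse" | "dataAppend" | "matchConsolidate";
      threads: number;
    }

    const projectTuning: SubProcessTuning[] = [
      { name: "addressCleanse", threads: 8 },   // CPU-bound: give it the most threads
      { name: "dataCleanse", threads: 4 },
      { name: "dataAppend", threads: 2 },       // waits on external lookups
      { name: "matchConsolidate", threads: 6 },
    ];

    // Total worker threads requested across the pipeline.
    const totalThreads = projectTuning.reduce((sum, p) => sum + p.threads, 0);
    console.log(`Configured ${totalThreads} worker threads across sub-processes`);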

    Other Features Important to a Data Quality Solution

A few other features that are important for a data quality solution include:

• Versatile options for metadata
• Data processing in one step
• Transactional and batch processing as a service

Versatile Options for Metadata

IT managers know the importance of hard facts to back up a report to the CFO, or to justify a request for expenditure. IT professionals depend on metadata to understand the data itself and make better decisions about processing it. Metadata is also a key tool for troubleshooting problems when unexpected results occur.

For these reasons, a flexible metadata solution is very important. A data quality solution should allow the user to retrieve needed metadata, from any point in the process, as in Figure 6.


Figure 6: The data quality solution should allow for metadata retrieval at any point in the process

Most data quality solutions provide metadata only at the end of a process. But business drivers may dictate that metadata be captured for a specific step of the cleansing process. Flexible metadata capture allows companies to compare intermediate results with the final metadata at the end of the process. To ensure ultimate flexibility, the solution should allow metadata to be created in any chosen data destination and format.
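As a rough sketch of intermediate metadata capture (the step names and counters here are hypothetical), a cleansing run might emit statistics after each step rather than only at the end, so intermediate and final results can be compared:

    // Hypothetical per-step metadata captured as records flow through a cleansing process.
    interface StepMetadata {
      step: string;
      recordsIn: number;
      recordsOut: number;
      recordsCorrected: number;
    }

    const runMetadata: StepMetadata[] = [];

    function recordStep(step: string, recordsIn: number, recordsOut: number, corrected: number): void {
      // Capture intermediate results so they can be compared with the final totals.
      runMetadata.push({ step, recordsIn, recordsOut, recordsCorrected: corrected });
    }

    recordStep("addressCleanse", 10000, 10000, 1250);
    recordStep("matchConsolidate", 10000, 8500, 0);

    // The captured metadata could be written to any destination and format the user chooses.
    console.log(JSON.stringify(runMetadata, null, 2));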

Data Processing in One Step

As discussed in earlier sections, a data quality process is really made up of many sub-processes such as address cleansing, data cleansing, matching, and so on. Many data quality solutions, however, do not treat these as sub-processes at all. In this scenario, multiple main processes are required, often through different products, to get the end result.

A data quality solution should truly be a data quality platform. It should treat each piece of data quality as part of the bigger process. This allows users to configure their solution to be a simple process, such as address cleansing alone, or a complex, multi-function process like consumer householding. Regardless of the desired result, the solution must allow the user to configure the cleansing to be done in a single process.

For example, a company needs to cleanse address data; cleanse name, firm, and e-mail data; and then locate matching records. In many data quality solutions, these would be three distinct projects, using a separate product for each step. This means that output must be generated for each process and input into the next. In addition to more work, this generally means more files to manage and more disk space used on the system.

Figure 7: Multi-step data quality process


If the solution truly treats data quality as one process (or project), there are fewer individual steps and no need for managing extra files, as shown in Figure 8.

Figure 8: The same project in a single-step data quality process
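To illustrate the single-process idea (all names, connection strings, and fields here are hypothetical), the three steps from the example above might be declared as one project definition rather than three separate jobs with intermediate files:

    // Hypothetical single-project definition chaining sub-processes in one pass.
    type SubProcess = "addressCleanse" | "nameFirmEmailCleanse" | "matchConsolidate";

    interface DataQualityProject {
      name: string;
      source: string;      // read once from the original data source
      target: string;      // write once to the final target
      steps: SubProcess[]; // no intermediate files between steps
    }

    const customerProject: DataQualityProject = {
      name: "customer_singlestep",
      source: "jdbc:oracle://crmdb/customers",
      target: "jdbc:oracle://dwhdb/customers_clean",
      steps: ["addressCleanse", "nameFirmEmailCleanse", "matchConsolidate"],
    };

    console.log(`Project ${customerProject.name} runs ${customerProject.steps.length} steps in one process`);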

Transactional and Batch Processing as a Service

Treating transactions as a service seems pretty obvious. In a world of thin-client applications, nobody wants to house a thick-client data quality application on each client computer. However, batch processing should also be treated as a service, though in a slightly different way.

It is likely that newer data quality solutions will be built using a Web service or other similar mechanism as the communication method. This makes perfect sense in the transaction world. A proprietary application would send a set of data in a SOAP envelope to the Web server (and subsequently the data quality server). Then, the envelope would be returned to the application in reverse order with the cleansed data.

This approach is not well suited for a batch process. An application should not send hundreds, thousands, or millions of batch records through the service, nor should it send one huge transaction with this sort of data. The traffic of either of these methods would likely gridlock a service in no time.

However, batch processing should still be treated as a service in the following way. The application should be able to send a similar SOAP envelope that simply says, "start processing," thereby launching the batch job at the server. The business rules for this project would already identify the data sources and targets, allowing the job to process the data directly. The service should allow for querying the progress of that batch job and sending back a message when the process has completed. This type of architecture makes it possible to kick off a batch process and monitor progress from a remote location, for example.
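A consumer of such a batch service might look roughly like the sketch below. The endpoint, operation names, job identifier, and response handling are hypothetical; a real product would define its own interface, and the status would normally be parsed from the service's actual responses.

    // Hypothetical batch-as-a-service client: start the job, then poll for progress.
    const batchEndpoint = "http://dqserver.example.com:20004/services/Batch";

    async function callBatchService(action: string, jobId?: string): Promise<string> {
      const body = `<?xml version="1.0" encoding="utf-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <${action} xmlns="http://example.com/dataquality">
          ${jobId ? `<jobId>${jobId}</jobId>` : `<project>customer_singlestep</project>`}
        </${action}>
      </soap:Body>
    </soap:Envelope>`;
      const response = await fetch(batchEndpoint, {
        method: "POST",
        headers: { "Content-Type": "text/xml; charset=utf-8" },
        body,
      });
      return response.text();
    }

    async function runBatchRemotely(): Promise<void> {
      // "Start processing": only the instruction travels over the wire, never the records.
      await callBatchService("StartBatchJob");
      const jobId = "job-123"; // hypothetical: in practice this would be parsed from the start response

      // Poll for progress until the server reports completion.
      let status = "";
      while (!status.includes("COMPLETED")) {
        await new Promise((resolve) => setTimeout(resolve, 5000));
        status = await callBatchService("GetBatchJobStatus", jobId);
      }
      console.log("Batch job finished");
    }

    runBatchRemotely();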

    Data Quality Solutions Built with SOA

This paper has discussed data quality solutions before SOA, built with SOA, and what IT professionals and systems integrators should look for in the new generation of data quality solutions. Until recently, data quality solutions were often ill suited for the modern online enterprise and BI/DW paradigm shift that BI expert Colin White discussed in the introduction to this paper. Now data quality solutions, designed with a service-oriented architecture, are an ideal fit for providing the timely, accurate, and consistent information that companies need to operate effectively and compete successfully.


More Information about SOA

The following online articles include more information about SOA in general (not necessarily relating to data quality).

• "What is Service-Oriented Architecture?" by Hao He
  http://webservices.xml.com/pub/a/ws/2003/09/30/soa.html
• "Understanding Service-Oriented Architecture" by David Sprott and Lawrence Wilkes
  http://msdn.microsoft.com/architecture/soa/default.aspx?pull=/library/en-us/dnmaj/html/aj1soa.asp
• "The Benefits of a Service-Oriented Architecture" by Michael Stevens
  http://www.developer.com/tech/article.php/1041191
• "Web Services and Service-Oriented Architecture"
  http://www.service-architecture.com/index.html

References

Gilpin, Mike, and Vollmer, Ken. (2004, July 6). Integration in a Service-Oriented World. Forrester Research, Inc. 4.

He, Hao. (2003, September 20). What is Service-Oriented Architecture? Retrieved July 19, 2004, from http://webservices.xml.com/pub/a/ws/2003/09/30/soa.html


About Firstlogic

Firstlogic develops data quality software that helps businesses create a single view within their database. Its data profiling solution, IQ Insight®, measures, analyzes, and reports on data quality problems and business rule violations. Firstlogic's industry-leading Information Quality Suite® cleanses and standardizes worldwide data, appends third-party information, and builds relationships through matching and consolidating records. Firstlogic's new data quality integration environment offers centralized data quality services, tuned to the specific needs of systems integrators and corporate IT engineers. IQ8 Integration Studio™ is a revolutionary environment for designing, building, deploying, and managing data quality solutions. Firstlogic's data quality software seamlessly integrates into CRM, ERP, BI, and data warehousing applications. In addition to developing commercial solutions, Firstlogic partners with many systems integrators, consultants, and original equipment manufacturers to provide its unique technology to their end-user customers. Founded in 1984, Firstlogic today serves thousands of customers worldwide, including Fortune 1000 companies in the e-business, financial, insurance, healthcare, direct marketing, higher education, and telecommunications markets. For more information, call 608.782.5000, send an email to [email protected], or visit the company's Web site at www.firstlogic.com.

    Firstlogic, IQ Insight, and Information Quality Suite are registered trademarks of Firstlogic, Inc. All other trademarks are held by their respective owner or manufacturer.

    © 2004 Firstlogic, Inc.
