Implementing Data Quality as a Corporate Service

Introduction written by: Colin White, President, BI Research



    Executive Summary

This paper has been designed for a more technical audience, such as information technology (IT) professionals or systems integrators, who have a general understanding of the benefits of data quality in a corporate setting. Companies that have already implemented data quality solutions and want to improve them, or organizations that are considering implementing a data quality solution, will benefit from the real-world practical knowledge shared in this paper.

Noted author, speaker, and Business Intelligence (BI) and Customer Relationship Management (CRM) expert Colin White, president of BI Research, provides an introduction to this subject with unique insights into the need for data quality in corporate computing systems, and how it fits into today's world of the "real time" enterprise. Corporate decision-making depends on the information behind those decisions, according to White, and it is critical that businesses consider how the architectural design of corporate systems impacts data quality efforts.

This paper provides practical application of data quality in a service-oriented architecture (SOA), with examples of how organizations are taking advantage of solutions designed for the next generation of corporate computing. Specifically, it will help IT professionals and systems integrators to:

• Understand the history of data quality solutions and corresponding architectures.
• Realize how a data quality solution built with SOA can be beneficial to an enterprise.
• Recognize the features that should be considered when looking for a data quality solution, especially those that are possible with SOA.

While it is the intention of this paper to answer the most common questions about data quality in SOA environments, requirements obviously vary greatly from organization to organization. Specific questions about computing environments or particular needs are welcome. Simply contact Firstlogic at 888.215.6442 or email [email protected].

Copyright © 2004 by Firstlogic, Inc. All rights reserved. No part of this publication may be stored in a retrieval system, transmitted or reproduced in any way, including but not limited to photocopy, photograph, magnetic or other record, without prior written agreement and permission of Firstlogic, except for such limited purposes as may be authorized by the Copyright Act of 1976. Printed in the USA.


Meeting Evolving Business Needs with a Data Quality Service

an introduction by Colin White, president of BI Research

Information is power, and companies today cannot operate effectively or compete successfully unless they give their users timely access to accurate and consistent information. The three key words here are timely, accurate, and consistent.

Timely Information. The concept of time is changing in organizations. It used to be that companies would run their planning cycles annually, and executives and line-of-business (LOB) managers would optimize business processes to satisfy those plans at monthly intervals. In today's highly competitive business world, these long decision-making cycles are no longer acceptable. Successful organizations now run their budgeting and forecasting cycle several times a year, and continuously manage and optimize their critical business processes to ensure that operational, tactical, and strategic business goals are being met.

Accurate Information. Information is useless unless it is accurate. A popular computer expression is garbage in, garbage out, and this applies equally to business decisions and actions. Sound information makes for informed decisions, but bad information results in poor decisions. Accuracy is affected by time. In the old data processing and business worlds of batch processing and monthly business decision-making cycles, organizations had time to analyze and fix data quality problems. In today's fast-paced world of the Internet, companies no longer have the luxury of time. Internet customers applying for new credit cards or loans want fast answers, and will go to a competitor if they do not get them. Consumers ordering from Web storefronts do not want to be told the following day that the product is out of stock. Being able to react rapidly is a competitive advantage, but fast decisions based on inaccurate information can lead to bad loans and high-risk clients, which ultimately hurts the bottom line.

Consistent Information. Even though information consistency and accuracy are related, they are not the same. Companies are becoming more and more automated, and this has led to information being dispersed across a multitude of applications and systems. Take customer data, for example. In front-office systems, like CRM, customer data may be spread across customer sales, marketing, and support systems. In the back office, this customer data may exist in order entry, billing, and shipping systems. External information providers may supplement customer data with information about credit history, demographics, and so forth. There are also multiple customer touch points to deal with, from Internet storefronts, to physical retail stores, and customer support centers. Each of these systems and touch points may contain accurate customer data, but is it consistent? This data may reflect different moments in time, may be formatted differently, and may often reflect different business definitions. Customer name and address data is an example where major consistency problems exist across systems and between applications. These inconsistencies are a serious obstacle to obtaining the single view of the customer that many companies are looking for.


Managing Data Quality

Ever since the advent of the computer and data processing software, data quality has been an issue, and a major stumbling block to effective and accurate decision making. Life was much simpler in the past. Early business transaction processing systems consisted primarily of batch applications that processed and exchanged information using batch files. In this environment, data quality was reactive, in that accuracy and consistency were checked after the fact by validating batch input and interchange files. The only time pressure here was the elapsed time to process those files. Decision support applications during this time consisted mainly of regular weekly, monthly, and quarterly batch reporting jobs, and so again there was time to fix data quality problems.

As the use of business transaction applications evolved, companies changed from a batch mode of operation to an interactive and online one. This evolution saw the introduction of terminal-driven applications, client/server computing, and today's Web-based systems. At the same time, companies began to use application integration middleware and distributed computing to transfer data between systems. The move toward the online enterprise forced organizations to be less reactive and more proactive in their approach to data quality. Online applications and application integration middleware now included data validation, lookup routines, and business rules to dynamically verify the accuracy of the data as it was created and moved between systems.

Decision-making technologies also changed as organizations became online enterprises. The use of business intelligence (BI) and data warehousing (DW) saw dramatic growth as these technologies offered the ideal solution for supplying integrated, summarized, and historical transaction data for strategic planning and tactical business analysis.

Although data quality management in business transaction systems has become more dynamic and proactive, this is not the case with most BI/DW systems. Data quality management in DW is still reactive in nature. Data warehouses are maintained primarily by running regular batch jobs. As batch files of extracted business transaction data are processed by extract, transform, and load (ETL) applications, rules-driven data quality routines are run to check data accuracy, and to ensure the consistency of the integrated data warehouse information.

The use of BI/DW applications in organizations, however, is going through a major paradigm shift. Companies are starting to use BI/DW applications not just for strategic and tactical reporting and analysis, but also for managing day-to-day and intra-day business operations. As a result, ETL processes are moving from a batch update cycle to one of capturing a continuous stream of business transaction data for updating data warehouses with near-real-time information.

BI performance management applications in turn are starting to use near-real-time data for creating operational business performance dashboards for executives and LOB managers to monitor and manage intra-day business performance. These applications enable executives and LOB managers to compare actual business performance to business goals, and to take rapid action when performance metrics indicate that goals are not being met.

This BI paradigm shift means that BI/DW applications are becoming essential to business success because they are responsible for driving and optimizing daily business operations. This trend will put even more pressure on BI/DW groups to guarantee the accuracy and consistency of data. It will also mean, as with business transaction processing, that data quality management must become more proactive and less reactive. This requires BI/DW applications to check the quality of information dynamically, in-flight, as it flows between systems and applications.

The Need for Service-Oriented Architecture

Business transaction, data warehousing, and business intelligence processes are becoming interconnected and closer to real-time in nature. The benefit to business users of this real-time architecture is that they have access to the information they need to monitor business operations and react rapidly to changing business needs and circumstances.

Although a real-time IT system may be able to deliver timely information to business executives and managers, this is of limited value unless it can also handle real-time requests to modify business rules and processes that come as a result of faster decision making. The problem is that most rules and process interconnections are hard-wired into existing monolithic applications, and are difficult to change dynamically. The solution to this problem is service-oriented architecture (SOA).

SOA is based on a network of loosely-coupled components that can be interconnected using common and open standards. Components may be applications, shared services, and so forth. The SOA concept is not new. In the past, software vendors have used technologies such as CORBA and the Java Message Service to interconnect disparate application components in support of SOA. The issue with early attempts at supporting SOA was that many of the solutions were proprietary and complex to implement.

The advent of Web services and XML-based protocols has made SOA more viable because these services and protocols are easier to implement, are more flexible, and are based on open standards. Another key benefit is that an existing application can be wrapped and presented as a Web service, which supports an orderly migration to a modern SOA environment. The use of Web services, however, is not a prerequisite for SOA.

SOA is ideally suited for a real-time and dynamic enterprise because processes can be interconnected easily in a flexible architecture that can adapt to changing business needs. It allows service functions, such as data validation and data transformation, to exist as separate components that can be called by business process components as required. SOA also means that business rules used by service components no longer have to be embedded in business processes and applications, but can be maintained and shared independently from the business components that use them.


A Promising Future

In summary, modern business transaction and business intelligence technologies can work cohesively together to enable organizations to work smarter and make more timely decisions. As always, accurate and consistent information is crucial to success. The advent of SOA enables data quality management software to act as a service that can be shared by multiple business processes. Separating data quality management rules from the processes that use them improves flexibility and makes it possible for the rules to be dynamically maintained to meet constantly changing business needs.

    Further Defining Service-Oriented Architecture

While Colin White's introduction to this paper discussed some of the basics of SOA, it is prudent to define SOA as it relates to data quality solutions. An article from O'Reilly's webservices.xml.com defines SOA as:

An architectural style whose goal is to achieve loose coupling among interacting software agents. A service is a unit of work done by a service provider to achieve desired end results for a service consumer. Both provider and consumer are roles played by software agents on behalf of their owners. (He, 2003, "SOA Defined and Explained")

For those new to the idea of SOA, the above definition may sound a bit abstract. As this paper discusses SOA in terms of data quality, readers should keep the following points in mind:

• SOA consists of a service, the service provider, and the service consumer.
• The service is some bit of functionality or "work" to be performed.
• The service provider is the mechanism by which the service is made available (usually via a server).
• The service consumer is the piece of software that takes advantage of the functionality provided by the service (in other words, it consumes the service).
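To make these three roles concrete, the following is a minimal, hypothetical sketch in TypeScript. The interface, class, and function names are illustrative only and are not drawn from any particular product; they simply show how a consumer can use a service without knowing how the provider implements it.

    // A hypothetical data quality service contract: the "work" to be performed.
    interface AddressCleanseService {
      cleanse(rawAddress: string): string;
    }

    // The service provider: the component that actually performs the work
    // (in a real SOA it would typically run on a server behind a Web service endpoint).
    class SimpleAddressCleanseProvider implements AddressCleanseService {
      cleanse(rawAddress: string): string {
        // Placeholder logic: trim whitespace and normalize case.
        return rawAddress.trim().toUpperCase();
      }
    }

    // The service consumer: any application that uses the service without
    // knowing how the cleansing is implemented.
    function registerCustomer(address: string, service: AddressCleanseService): void {
      const cleansed = service.cleanse(address);
      console.log(`Storing customer address: ${cleansed}`);
    }

    registerCustomer("  123 main st  ", new SimpleAddressCleanseProvider());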

Furthermore, when describing SOA, some immediately jump to the conclusion that Web services imply an Internet-based Application Service Provider (ASP) model. In fact, organizations are instead frequently building services within their intranet environment, behind their corporate firewall, to provide such information services.

Resources and References

More information about the general concept of service-oriented architecture can be found in the "More Information about SOA" section near the end of this paper.


    Data Quality Solutions before SOA

For many years, the philosophy and practice of the "online enterprise," mentioned in the paper's introduction, did not receive much attention, nor did the practice of data quality. As a Forrester Research, Inc. article points out, "Key business applications were not designed to interact with one another, and sharing of information across application boundaries frequently requires point-to-point coding" (Gilpin & Vollmer, 2004). Data quality solutions, in turn, were built with this architecture in mind — a "silo" (or stovepipe) architecture.

    Figure 1: Data quality deployed in a “silo” architecture

As shown in Figure 1, organizations have historically implemented different solutions from one or more vendors to solve data quality needs across various business-critical applications (ERP, CRM, Data Warehouse, etc.). A few examples include Web-friendly application programming interfaces (APIs) for integrating into browser-based applications and e-commerce, "tight-integration" APIs for desktop applications, and stand-alone back office applications for working with an enterprise's data warehouse. With the data environment being highly dynamic, this siloed approach hinders organizations from meeting the accuracy or consistency requirements alluded to in the introduction.

A Service without the Architecture

Data quality has always been a service for other business applications, never the application itself, though its architecture has not always matched its role. As enterprise-level business applications became more diverse, the need for consistent data quality across all of these applications grew. However, the architecture of many solutions did not readily support company-wide IT initiatives. Therefore, the costs also increased for an IT staff to purchase, learn, implement, and maintain a consistent data quality solution across the enterprise. The drawbacks of the silo architecture in data quality solutions became evident. Here are just a few examples:

• Slightly different implementations for each silo often meant inconsistencies in how data was handled. This obviously caused problems for IT departments trying to maintain consistency across an enterprise.
• IT programmers were required to be "data quality experts," because business rules were coupled with the API, and more staff was needed to manage multiple implementations.
• APIs were often very proprietary. This could mean that a programmer could not integrate in his or her preferred language. Or, the programmer for a Web-friendly API might have little or no knowledge carry-over to the tight-integration API. Therefore, each silo's integration basically started from scratch.
• As enterprise applications began sharing data, databases grew increasingly larger. Many data quality applications were not able to scale to meet this new demand, meaning longer processing times and more demand on hardware resources.

    Enter Service-Oriented Architecture

Many IT professionals can relate to the problems noted above. However, a new generation of data quality solutions has begun to use the principles of SOA to alleviate some of the drawbacks of silo-type implementations. Where the silo architecture has individual and potentially disparate data quality solutions for each business application, the service-oriented architecture treats data quality as the ubiquitous service it truly should be in an enterprise (see Figure 2).

Figure 2: Data quality in a service-oriented architecture


The technical details of how an organization can realize the advantages of SOA with data quality solutions are discussed in the coming sections. Below are some inherent advantages:

• Less time to create and maintain data quality solutions.
• More flexibility in terms of how and where data quality solutions are deployed.
• Reduced learning curves for integrating data quality solutions.
• Little or no need for IT staff to become data quality experts.
• Faster implementations, improved data quality results, and reduced costs.

    What to Look for in a Data Quality Solution

At a high level, this paper has discussed some of the disadvantages of the silo approach to data quality implementation, and some of the advantages that SOA claims to offer. But how does an organization realize these advantages? Picking any data quality solution that is built on Web services does not necessarily guarantee all of the potential advantages that a true SOA design can offer.

There are very specific features to look for in a data quality solution, especially those focused on SOA. The following sections include an in-depth technical discussion (where appropriate) of what to look for in a data quality solution, and explain how a services approach directly impacts an IT professional or a systems integrator implementing data quality across the enterprise. This section will discuss:

• Evaluating a data quality API
• Defining business rules
• Selecting a service provider (data quality server)
• Other features important for a data quality solution

    Evaluating a Data Quality API

Critical to the success of any integration project is evaluating and selecting a data quality API. The right API will speed implementation, meeting or exceeding integration timelines and reducing maintenance efforts. A poor API has the potential to lock up the most skilled IT engineering resources in a spiral of confusion and missed deadlines, with long-term and intensive management requirements. Total cost of ownership (TCO) must be evaluated with as much scrutiny as the cost of the technology itself. When selecting an API that will reduce TCO and enhance productivity, one should look for the following characteristics:

• Business rules decoupled from the API
• An API that follows industry standards


Business Rules Decoupled from the API

Business rules define exactly how the data quality processing should occur for a specific data set. For example, business rules define which fields to cleanse, which new fields to add to the data, how to standardize data, and a plethora of other options.

For any programmer, having to learn a new product's API in order to integrate it is demanding enough. In the past, IT staff members tasked with integrating a data quality application typically had to become data quality experts as well. Not only were they required to work with internal customers to establish business rules, they also had to translate the rules into the new API.

A data quality solution built on SOA can, however, eliminate this problem by allowing the business rules to be completely decoupled from the API. Instead of learning all the nuances and minutiae of data quality, IT staff members can leave the business rules to a business user (or appointed enterprise-wide or department-level data quality expert). This allows IT resources to concentrate on programming communication between the service consumer and provider.

Consider the example of a call-center application where customer information is collected. For simplicity's sake, assume the organization simply wants to cleanse address data within its proprietary call center application.

Without targeting any specific products, Figure 3 offers a pseudo-code example of the programming necessary to standardize a domestic address.

    /* Set up the standardization parameters */
    set_option(OPT_ASSIGN_CITY_BY_INPUT_LLIDX, TRUE);
    set_option(OPT_PLACENAME, CONVERT_PLACENAME);
    set_option(OPT_STND_ADDR_LINE, TRUE);
    set_option(OPT_STND_LAST_LINE, TRUE);
    set_option(OPT_UNIT_DESIG, UNIT_DIRECTORY);
    set_option(OPT_CAPITALIZATION, UPPERCASE);
    set_option(OPT_DUAL_TYPE, DUAL_MAILING);
    set_option(OPT_APPEND_PMB, TRUE);

    /* EWS is required for CASS Certification */
    set_mode(MODE_ENABLE_EWS, TRUE);

    /* Set location of look-up directories and dictionaries */
    set_file(DIR_ZIP4_1, "C:\data_quality\data\zipfile.dir");
    set_file(DIR_REVZIP4, "C:\data_quality\data\revzipfile.dir");
    set_file(DIR_CITY, "C:\data_quality\data\cityfile.dir");
    set_file(DIR_ZCF, "C:\data_quality\data\zcffile.dir");
    set_file(DIR_EWS, "C:\data_quality\data\ewsfile.dir");
    set_file(DCT_CAP, "C:\data_quality\data\capitalization.dct");
    set_file(DCT_FIRMLN, "C:\data_quality\data\firms.dct");
    set_file(DCT_ADDRLN, "C:\data_quality\data\addressline.dct");
    set_file(DCT_LASTLN, "C:\data_quality\data\lastline.dct");

    /* Set input fields */
    set_line(IADDRESS_LINE, tmpbuf1);
    set_line(LASTLINE, tmpbuf2);
    set_line(ZIP4, (char *)" ");
    set_line(URB, tmpbuf3);
    ...

    Figure 3: Pseudo-code excerpt: API coupled with business rules

The code in Figure 3 is just a very small excerpt of what could be necessary for an API that is coupled with business rules. It merely sets a few options, the locations of some necessary files, and the input fields. Completing such an example would require defining input field formats, determining locations for input and output of data, specifying processing options for reports, and numerous other pieces of functionality that would be necessary. Obviously, there is still much more code to be written.

    var hostname = "server1";
    var portnumber = "20003";
    var busrulelocation = "\\server2\dataquality\busrules";
    runBatchProject(hostname, portnumber, busrulelocation, myproject3);

    Figure 4: Pseudo-code excerpt: API decoupled from business rules

The example shown in Figure 4 depicts what a programmer might have to specify to run a project in an application built on SOA where the business rules are decoupled from the API. The programmer would simply specify information about the service provider and the set of business rules to use. Other optional methods could be used, but the example above may be all the code necessary to call a batch project when a data quality solution is built with SOA.

These examples have been simplified to show only calls to the API (for example, no user-interface code is shown). Though simplified, these examples show the true advantage of an API that is decoupled from business rules. In the coupled example, if a change to even a simple business rule preference were made, it would require a corresponding change to an API call in the code, resulting in recompilation, testing, and redeployment to production systems. However, a similar change to the decoupled example could require only a change within the business rules, not the code that calls the data quality solution.

It is easy to see the time and energy this would save for initial creation and maintenance of data quality code in any enterprise application.

API Follows Industry Standards

Earlier sections touched on the difficulties caused by proprietary APIs. There is always a learning curve involved with a new API, and the less it adheres to industry standards, the higher the learning curve will be. Additionally, many existing data quality APIs are at least somewhat limited in terms of integration language support or platform support.

What happens when a data quality solution has an API available only in C++, but Java is the preferred language of the IT department? Or, a Solaris solution is required, but the API is available only on Windows? The company is then forced to either pass on what could be an otherwise-good solution, or adjust standard business practices to fit in the solution.


This is where Web services can provide significant value. As mentioned before, Web services alone are not synonymous with SOA, but can be a very important part of an enterprise-wide SOA. If a data quality tool uses a Web service interface, what does it mean?

• Platform independence ensures that the solution will fit any environment; the environment would not have to be fit to the solution.
• Implementation independence enables use of whichever programming language the IT department is comfortable with. This can help keep the learning curve low.
• Industry standards mean a head start if the IT professionals have integrated other Web services. Additionally, companies have the option of using third-party development tools available for the industry standards.
• Web services are an ideal model for working in a heterogeneous environment (such as a mixture of Windows and UNIX systems).
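To illustrate the language and platform independence described above, here is a minimal, hypothetical sketch in TypeScript of a client posting a SOAP request to a data quality Web service. The endpoint URL, envelope structure, and element names are assumptions for illustration only; a real product would publish its own WSDL and schema.

    // Hypothetical SOAP request to a data quality Web service (names are illustrative).
    const endpoint = "http://dqserver.example.com:20003/services/AddressCleanse";

    const soapEnvelope = `<?xml version="1.0" encoding="utf-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <CleanseAddress xmlns="http://example.com/dataquality">
          <addressLine>123 main st</addressLine>
          <lastLine>la crosse wi 54601</lastLine>
        </CleanseAddress>
      </soap:Body>
    </soap:Envelope>`;

    async function cleanseAddress(): Promise<string> {
      // Any platform or language that can issue an HTTP POST can consume the service.
      const response = await fetch(endpoint, {
        method: "POST",
        headers: { "Content-Type": "text/xml; charset=utf-8" },
        body: soapEnvelope,
      });
      return response.text(); // The cleansed address comes back in the SOAP response body.
    }

    cleanseAddress().then(console.log);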

    Defining Business Rules

Probably as important as the API is the way that business rules are defined. Without an API, there is no way for an application to tie in data quality. Without business rules, there is no way to tell the application what to do once the data quality processes have been launched. One should look for the following in business rule definition:

• Centralized business rules
• Business rules with inheritance
• Predefined business rules
• How the rules are defined

Centralized Business Rules

In a silo architecture, each data quality solution generally had its own way to define business rules. Some business rules may be defined directly in an API, whereas others may be defined in a proprietary configuration file. Spreading business rules across the enterprise leads to a number of problems such as:

• Inconsistency: Siloed implementations, each with unique business rules, result in inconsistent data formats and content.
• High maintenance costs: Even if an organization uses a single vendor's solution in multiple implementations, what happens if a business rule is updated in one spot? There will likely be the need for an internal process or mechanism to pass that change throughout the enterprise.


A data quality solution with a centralized set of business rules, accessed by the service provider, is a key component of a data quality SOA. With a centralized set of business rules, rules are defined in the same way, providing consistency across implementations.

If multiple applications use the same business rules configuration, a centralized set of rules instantly eliminates much of the maintenance cost. A user can update a rule in one spot and it is updated throughout all of the enterprise's applications.

Business Rules with Inheritance

Consistent data quality across enterprise applications requires consistency of business rules. Establishing corporate-wide data quality standards through business rules supports consistency. However, there are often project-specific nuances that must be considered, and therefore subtle changes to business rules become a necessity. One would assume that development and maintenance time for the business rules has been immediately increased. This is not the case if the data quality solution supports the inheritance principle for business rule definition.

What does inheritance mean for data quality business rules? One can think of it in terms of programming. Imagine that a programmer has a block of code that he wants to use in multiple places within an application. If following good programming practices, the programmer is not copying and pasting that code in multiple places. Instead, the programmer would define a reusable function and simply call that function where necessary. If updates to the code are needed, the programmer would update the function directly, which would automatically propagate the change wherever the function is used.

The same functionality should be available in a data quality solution. When defining business rules, components should be reusable. For example, data quality projects should be able to inherit settings from lower-level components. That way, a component could be shared across many projects. Just like the function example, if the low-level data quality component were updated, that change would be inherited by all projects (see Figure 5).

Figure 5: Projects A and B both inherit the same business rules for address cleansing


This is a fairly typical (albeit simple) flow of data in a data quality process. In each project, the application is configured to cleanse address data. However, the data source and target are different in each project. If the data quality tool supports the inheritance concept for business rules, the address cleanse process is defined independently of the other pieces, for example. Then, each higher-level project inherits that object. Any change made to the address cleanse process is then picked up by each project, drastically reducing the cost of maintaining multiple projects.

Similarly, the solution should also have the option to "override" business rules, if necessary. That way, the advantage of inheritance still exists, but there is also the flexibility to override a rule if it makes sense for a given project.

The inheritance idea is not necessarily tied to SOA. However, combining inheritance with SOA truly enhances the power of this functionality. There can be a huge amount of maintenance time saved if all data quality projects across the enterprise share common business rules.
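As a rough illustration of the inheritance and override ideas described above (the type, field, and rule names here are hypothetical, not drawn from any particular product), a shared address-cleanse rule set might be defined once and then reused or selectively overridden per project:

    // Hypothetical shared business rule component, defined once and reused.
    interface AddressCleanseRules {
      capitalization: "UPPERCASE" | "MIXED";
      appendZip4: boolean;
      standardizeAddressLine: boolean;
    }

    const corporateAddressRules: AddressCleanseRules = {
      capitalization: "UPPERCASE",
      appendZip4: true,
      standardizeAddressLine: true,
    };

    // Project A inherits the corporate rules unchanged.
    const projectARules: AddressCleanseRules = { ...corporateAddressRules };

    // Project B inherits the corporate rules but overrides one setting.
    const projectBRules: AddressCleanseRules = {
      ...corporateAddressRules,
      capitalization: "MIXED", // project-specific override
    };

    // A change to corporateAddressRules is picked up by any project that builds
    // its rules from the shared definition, mirroring Figure 5.
    console.log(projectARules, projectBRules);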

Predefined Business Rules

The cost of learning any new software platform can be a bit burdensome on an IT professional or systems integrator. This can also be true of data quality tools. However, data quality solutions can include capabilities to help reduce this learning curve.

For example, a data quality solution should include a wide array of predefined business rules. The company that creates the data quality solution should be an expert on that subject, and with predefined rules, vendors can pass on some of that expertise. It is certainly easier to modify a set of rules to fit a given set of data than it is to start completely from scratch. A data quality solution should provide a wide variety of predefined rules for projects similar to those common for most enterprises.

How the Rules are Defined

It goes without saying that a data quality tool should include an intuitive interface for defining business rules. A good interface can lessen the learning curve and the time to create projects for a data quality expert or business user.

A user interface (UI) should be easy enough for a business user to feel comfortable working with. It should enable a business user to set up the basic framework of a data quality project. Ideally, the UI should provide a graphical view of the data quality process, allowing the user a visual representation of how the data will be cleansed.

Selecting a Data Quality Service Provider (Data Quality Server)

As discussed in the definition of SOA, one of the necessary components for any solution built on SOA is a service provider. Chances are that if a data quality tool is built with SOA, its service provider will be some sort of data quality server. This is where the real work of data quality processing will take place.

This piece may be one of the least visible components of a data quality solution — it is usually running in the background of a system, accepting and processing data quality requests. However, this component is certainly just as important as any other piece of a data quality solution. The data quality service should include these types of features:

• Flexible server configuration
• Support for standard data formats
• Server scalability


Case Study: Avid Technology

by Colin White, president of BI Research

    An excellent example of how IT systems and data quality management have evolved from a batch architecture to a real-time one is Avid Technology, a provider of digital media creation, management, and distribution solutions.

    Avid uses an Onyx Software system to handle its customer center operations, and SAP Business Information Warehouse (SAP BW) to manage its business intelligence environment. Customer data for Internet and e-mail marketing is extracted from the Onyx CRM system and loaded in batch mode to SAP BW once per quarter. During the ETL processing, data quality management software from Firstlogic is used to perform a number of data quality routines ensuring the best customer information is entered into the CRM system. Data cleanup improves data accuracy and reduces marketing costs. On average, about 15 percent of the data contains duplicate information.

    At the beginning of 2003, Avid decided to use SAP CRM to expand its front-office initiatives and to include customer information coming from the Web sites of its three independent business units. Unlike the Onyx environment, no data quality validation routines were put into place to manage customer data. The company quickly found that poor data was finding its way into SAP BW and its associated BI applications.

    To solve this problem, Avid implemented Firstlogic IQ8 Integration Studio™ to check data coming from all customer touch points. This real-time and sharable service dynamically checks the data collected from Avid’s three Web environments before it is loaded into SAP CRM. The benefits of this approach are that all Web activity is subject to the same business rules, and the shared business rules can be maintained interactively and independently from application processing.

    Avid intends to extend the use of its service-oriented approach to data quality management to include dynamic processes that ensure that customer orders and shipments satisfy regulatory compliance such as the USA PATRIOT Act.

Flexible Server Configuration

In the software world, the term "flexible" is often an overused buzzword. But consider how important flexibility is for any software solution. It can mean the difference between a relatively easy or difficult setup and integration into an enterprise. Flexibility is just as important for a data quality server.

A data quality solution should have a server that is flexible across many platforms. For example, a data quality server should be able to reside on either a UNIX or Windows server, yet still be able to communicate with other UNIX and Windows computers, regardless of platform. Again, the solution should fit into any environment, and not force the existing environment to adapt to the solution.

The data quality solution should also be implementation independent. For example, if the data quality solution is integrated into both a Web-based application and a "thick client" desktop application, both of these applications should be able to use the same data quality server and business rules. Likewise, it should be possible to use the same server for both batch and transactional processing. This can simplify an installation environment, easing the burden on initial setup and maintenance. Also, if the data quality servers have any configurations of their own, this can help ensure consistency between servers.

Conversely, the solution should also allow use of multiple data quality servers. For example, the user should have the ability to distribute the load by spreading work among multiple servers. There could also be one server for transactional processing, and one for batch processing, to ensure transactional requests are getting an appropriate response. In this scenario, transactional requests would be protected from any lag that could be caused while a batch process was running.

As a side note, in an environment with multiple servers, the data quality solution should also allow for shared business rules across servers. Again, all of the advantages of centralized business rules mentioned before apply here.
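A deployment along these lines might be described with a configuration like the following sketch. The server names, ports, and fields are hypothetical and not taken from any product; the point is simply that both servers reference one shared business rules location.

    // Hypothetical deployment configuration: two data quality servers, one for
    // transactional requests and one for batch jobs, sharing centralized rules.
    interface DataQualityServerConfig {
      host: string;
      port: number;
      mode: "transactional" | "batch";
      businessRulesLocation: string; // shared, centralized repository
    }

    const sharedRules = "\\\\server2\\dataquality\\busrules";

    const servers: DataQualityServerConfig[] = [
      { host: "dqserver-tx", port: 20003, mode: "transactional", businessRulesLocation: sharedRules },
      { host: "dqserver-batch", port: 20004, mode: "batch", businessRulesLocation: sharedRules },
    ];

    // A consumer picks the appropriate server by workload type, so batch traffic
    // never competes with low-latency transactional requests.
    function selectServer(mode: "transactional" | "batch"): DataQualityServerConfig {
      return servers.find((s) => s.mode === mode)!;
    }

    console.log(selectServer("transactional").host);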

Support for Standard Data Formats

One problem with many of the silo-generation solutions is that they offer support for a very limited number of data formats. Typically, these tools support ASCII flat files, and occasionally one of the dBase formats. In addition, some solutions from this generation require an even more proprietary format, forcing conversion of data into a vendor-specified layout.

If a company's data were in a relational database format, like SQL Server or Oracle, there would likely be a need for additional business processes to accommodate the data quality solution. For example, data would need to be converted to one of the accepted formats, processed, then reconverted and reloaded into the preferred data format. Certainly, this is inconvenient, time consuming, error-prone, and often causes more development work.


Support for most data formats is one of the trademarks of a solution that is truly a service, because the solution can tie in seamlessly with data, just as it can tie in seamlessly with applications.

Server Scalability

It goes without saying that faster is better. As databases continue to grow, scalability becomes increasingly important. Take the example of a data quality process running overnight so that its hardware resources are free during the workday for other tasks. Now, due to a growing dataset, the process takes too long to run overnight. The process could be moved to a weekly – instead of nightly – process and be run on the weekend, but the advantages of regular data quality processing are lost. The process could be moved to a computer with more processing power, but if the solution does not scale, that really solves nothing. These are just a couple of examples that demonstrate the importance of scalability.

Advertising that a solution is "scalable" is not necessarily enough, though. A data quality solution should scale in the following ways:

• The solution should scale to support multiple projects and increasing numbers of concurrent transactional users.
• Most data quality processes are made up of a number of sub-processes (e.g., address cleansing, data cleansing, data appending, matching/consolidation, and so on). The solution should allow these individual sub-processes to be tuned. For example, the ability to adjust the number of threads supported by each sub-process means fine-tuning processing to truly get the best performance out of each data quality project.
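Per-sub-process tuning might look something like the following hypothetical configuration; the sub-process names and the threads field are assumptions used for illustration, not settings from any specific product.

    // Hypothetical tuning of threads per sub-process within one data quality project.
    interface SubProcessTuning {
      name: "addressCleanse" | "dataCleanse" | "dataAppend" | "matchConsolidate";
      threads: number;
    }

    const projectTuning: SubProcessTuning[] = [
      { name: "addressCleanse", threads: 8 },   // CPU-bound: give it the most threads
      { name: "dataCleanse", threads: 4 },
      { name: "dataAppend", threads: 2 },       // waits on external lookups
      { name: "matchConsolidate", threads: 6 },
    ];

    // Total worker threads requested across the pipeline.
    const totalThreads = projectTuning.reduce((sum, p) => sum + p.threads, 0);
    console.log(`Configured ${totalThreads} worker threads across sub-processes`);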

    Other Features Important to a Data Quality Solution

A few other features that are important for a data quality solution include:

• Versatile options for metadata
• Data processing in one step
• Transactional and batch processing as a service

Versatile Options for Metadata

IT managers know the importance of hard facts to back up a report to the CFO, or to justify a request for expenditure. IT professionals depend on metadata to understand the data itself and make better decisions about processing it. Metadata is also a key tool for troubleshooting problems when unexpected results occur.

For these reasons, a flexible metadata solution is very important. A data quality solution should allow the user to retrieve needed metadata, from any point in the process, as in Figure 6.


Figure 6: The data quality solution should allow for metadata retrieval at any point in the process

Most data quality solutions provide metadata only at the end of a process. But business drivers may dictate that metadata be captured for a specific step of the cleansing process. Flexible metadata capture allows companies to compare intermediate results with the final metadata at the end of the process. To ensure ultimate flexibility, the solution should allow metadata to be created in any chosen data destination and format.
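As a rough sketch of intermediate metadata capture (the step names and counters here are hypothetical), a cleansing run might emit statistics after each step rather than only at the end, so intermediate and final results can be compared:

    // Hypothetical per-step metadata captured as records flow through a cleansing process.
    interface StepMetadata {
      step: string;
      recordsIn: number;
      recordsOut: number;
      recordsCorrected: number;
    }

    const runMetadata: StepMetadata[] = [];

    function recordStep(step: string, recordsIn: number, recordsOut: number, corrected: number): void {
      // Capture intermediate results so they can be compared with the final totals.
      runMetadata.push({ step, recordsIn, recordsOut, recordsCorrected: corrected });
    }

    recordStep("addressCleanse", 10000, 10000, 1250);
    recordStep("matchConsolidate", 10000, 8500, 0);

    // The captured metadata could be written to any destination and format the user chooses.
    console.log(JSON.stringify(runMetadata, null, 2));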

Data Processing in One Step

As discussed in earlier sections, a data quality process is really made up of many sub-processes such as address cleansing, data cleansing, matching, and so on. Many data quality solutions, however, do not treat these as sub-processes at all. In this scenario, multiple main processes are required, often through different products, to get the end result.

A data quality solution should truly be a data quality platform. It should treat each piece of data quality as part of the bigger process. This allows users to configure their solution to be a simple process, such as address cleansing alone, or a complex, multi-function process like consumer householding. Regardless of the desired result, the solution must allow the user to configure the cleansing to be done in a single process.

For example, a company needs to cleanse address data; cleanse name, firm, and e-mail data; and then locate matching records. In many data quality solutions, these would be three distinct projects, using a separate product for each step. This means that output must be generated for each process and input into the next. In addition to more work, this generally means more files to manage and more disk space used on the system.

Figure 7: Multi-step data quality process


If the solution truly treats data quality as one process (or project), there are fewer individual steps and no need for managing extra files, as shown in Figure 8.

Figure 8: The same project in a single-step data quality process
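To illustrate the single-process idea (all names, connection strings, and fields here are hypothetical), the three steps from the example above might be declared as one project definition rather than three separate jobs with intermediate files:

    // Hypothetical single-project definition chaining sub-processes in one pass.
    type SubProcess = "addressCleanse" | "nameFirmEmailCleanse" | "matchConsolidate";

    interface DataQualityProject {
      name: string;
      source: string;      // read once from the original data source
      target: string;      // write once to the final target
      steps: SubProcess[]; // no intermediate files between steps
    }

    const customerProject: DataQualityProject = {
      name: "customer_singlestep",
      source: "jdbc:oracle://crmdb/customers",
      target: "jdbc:oracle://dwhdb/customers_clean",
      steps: ["addressCleanse", "nameFirmEmailCleanse", "matchConsolidate"],
    };

    console.log(`Project ${customerProject.name} runs ${customerProject.steps.length} steps in one process`);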

Transactional and Batch Processing as a Service

Treating transactions as a service seems pretty obvious. In a world of thin-client applications, nobody wants to house a thick-client data quality application on each client computer. However, batch processing should also be treated as a service, though in a slightly different way.

It is likely that newer data quality solutions will be built using a Web service or other similar mechanism as the communication method. This makes perfect sense in the transaction world. A proprietary application would send a set of data in a SOAP envelope to the Web server (and subsequently the data quality server). Then, the envelope would be returned to the application in reverse order with the cleansed data.

This approach is not well suited for a batch process. An application should not send hundreds, thousands, or millions of batch records through the service, nor should it send one huge transaction with this sort of data. The traffic of either of these methods would likely gridlock a service in no time.

However, batch processing should still be treated as a service in the following way. The application should be able to send a similar SOAP envelope that simply says, "start processing," thereby launching the batch job at the server. The business rules for this project would already identify the data sources and targets, allowing the job to process the data directly. The service should allow for querying the progress of that batch job and sending back a message when the process has completed. This type of architecture makes it possible to kick off a batch process and monitor progress from a remote location, for example.
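A consumer of such a batch service might look roughly like the sketch below. The endpoint, operation names, job identifier, and response handling are hypothetical; a real product would define its own interface, and the status would normally be parsed from the service's actual responses.

    // Hypothetical batch-as-a-service client: start the job, then poll for progress.
    const batchEndpoint = "http://dqserver.example.com:20004/services/Batch";

    async function callBatchService(action: string, jobId?: string): Promise<string> {
      const body = `<?xml version="1.0" encoding="utf-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <${action} xmlns="http://example.com/dataquality">
          ${jobId ? `<jobId>${jobId}</jobId>` : `<project>customer_singlestep</project>`}
        </${action}>
      </soap:Body>
    </soap:Envelope>`;
      const response = await fetch(batchEndpoint, {
        method: "POST",
        headers: { "Content-Type": "text/xml; charset=utf-8" },
        body,
      });
      return response.text();
    }

    async function runBatchRemotely(): Promise<void> {
      // "Start processing": only the instruction travels over the wire, never the records.
      await callBatchService("StartBatchJob");
      const jobId = "job-123"; // hypothetical: in practice this would be parsed from the start response

      // Poll for progress until the server reports completion.
      let status = "";
      while (!status.includes("COMPLETED")) {
        await new Promise((resolve) => setTimeout(resolve, 5000));
        status = await callBatchService("GetBatchJobStatus", jobId);
      }
      console.log("Batch job finished");
    }

    runBatchRemotely();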

    Data Quality Solutions Built with SOA

This paper has discussed data quality solutions before SOA, built with SOA, and what IT professionals and systems integrators should look for in the new generation of data quality solutions. Until recently, data quality solutions were often ill suited for the modern online enterprise and BI/DW paradigm shift that BI expert Colin White discussed in the introduction to this paper. Now data quality solutions, designed with a service-oriented architecture, are an ideal fit for providing the timely, accurate, and consistent information that companies need to operate effectively and compete successfully.


More Information about SOA

The following online articles include more information about SOA in general (not necessarily relating to data quality).

• "What is Service-Oriented Architecture?" by Hao He
  http://webservices.xml.com/pub/a/ws/2003/09/30/soa.html
• "Understanding Service-Oriented Architecture" by David Sprott and Lawrence Wilkes
  http://msdn.microsoft.com/architecture/soa/default.aspx?pull=/library/en-us/dnmaj/html/aj1soa.asp
• "The Benefits of a Service-Oriented Architecture" by Michael Stevens
  http://www.developer.com/tech/article.php/1041191
• "Web Services and Service-Oriented Architecture"
  http://www.service-architecture.com/index.html

References

Gilpin, Mike, and Vollmer, Ken. (2004, July 6). Integration in a Service-Oriented World. Forrester Research, Inc. 4.

He, Hao. (2003, September 20). What is Service-Oriented Architecture? Retrieved July 19, 2004, from http://webservices.xml.com/pub/a/ws/2003/09/30/soa.html


About Firstlogic

Firstlogic develops data quality software that helps businesses create a single view within their database. Its data profiling solution, IQ Insight®, measures, analyzes, and reports on data quality problems and business rule violations. Firstlogic's industry-leading Information Quality Suite® cleanses and standardizes worldwide data, appends third-party information, and builds relationships through matching and consolidating records. Firstlogic's new data quality integration environment offers centralized data quality services, tuned to the specific needs of systems integrators and corporate IT engineers. IQ8 Integration Studio™ is a revolutionary environment for designing, building, deploying, and managing data quality solutions. Firstlogic's data quality software seamlessly integrates into CRM, ERP, BI, and data warehousing applications. In addition to developing commercial solutions, Firstlogic partners with many systems integrators, consultants, and original equipment manufacturers to provide its unique technology to their end-user customers. Founded in 1984, Firstlogic today serves thousands of customers worldwide, including Fortune 1000 companies in the e-business, financial, insurance, healthcare, direct marketing, higher education, and telecommunications markets. For more information, call 608.782.5000, send an email to [email protected], or visit the company's Web site at www.firstlogic.com.

    Firstlogic, IQ Insight, and Information Quality Suite are registered trademarks of Firstlogic, Inc. All other trademarks are held by their respective owner or manufacturer.

    © 2004 Firstlogic, Inc.
