Implementing Data Quality as a Corporate Service
Introduction written by: Colin White, President, BI Research
Executive Summary
This paper has been designed for a more technical audience such as information
technology (IT) professionals or systems integrators who have a general understanding
of the benefits of data quality in a corporate setting. Companies that have already
implemented data quality solutions and want to improve them, or organizations that are
considering implementing a data quality solution, will benefit from the real-world practical
knowledge shared in this paper.
Noted author, speaker, and Business Intelligence (BI) and Customer Relationship
Management (CRM) expert Colin White, president of BI Research, provides an introduction
to this subject with unique insights into the need for data quality in corporate computing
systems, and how they fit into today’s world of the “real time” enterprise. Corporate
decision-making depends on the information behind those decisions, according to White,
and it’s critical that businesses consider how the architectural design of corporate systems
affects data quality efforts.
This paper provides practical application of data quality in a service-oriented architecture
(SOA), with examples of how organizations are taking advantage of solutions designed for
the next generation of corporate computing. Specifically, it will help IT professionals and
systems integrators to:
• Understand the history of data quality solutions and corresponding architectures.
• Realize how a data quality solution built with SOA can be beneficial to an enterprise.
• Recognize the features that should be considered when looking for a data quality solution, especially those that are possible with SOA.
While it is the intention of this paper to answer the most common questions about data
quality in SOA environments, requirements obviously vary greatly from organization to
organization. Specific questions about computing environments or particular needs are
welcome. Simply contact Firstlogic at 888.215.6442 or email [email protected].
Copyright © 2004 by Firstlogic, Inc. All rights reserved. No part of this publication may be stored
in a retrieval system, transmitted or reproduced in any way, including but not limited to photocopy,
photograph, magnetic or other record, without prior written agreement and permission of Firstlogic,
except for such limited purposes as may be authorized by the Copyright Act of 1976. Printed in the USA.
Meeting Evolving Business Needs with a Data Quality Service
An introduction by Colin White, president of BI Research
Information is power, and companies today cannot operate effectively or compete
successfully unless they give their users timely access to accurate and consistent
information. The three key words here are timely, accurate, and consistent.
Timely Information. The concept of time is changing in organizations. It used to be that
companies would run their planning cycles annually, and executives and line-of-business
(LOB) managers would optimize business processes to satisfy those plans at monthly
intervals. In today’s highly competitive business world, these long decision-making cycles
are no longer acceptable. Successful organizations now run their budgeting and forecasting
cycle several times a year, and continuously manage and optimize their critical business
processes to ensure that operational, tactical, and strategic business goals are being met.
Accurate Information. Information is useless unless it is accurate. A popular computer
expression is “garbage in, garbage out,” and this applies equally to business decisions and
actions. Sound information makes for informed decisions, but bad information results in
poor decisions. Accuracy is affected by time. In the old data processing and business worlds
of batch processing and monthly business decision-making cycles, organizations had time to
analyze and fix data quality problems. In today’s fast-paced world of the Internet, companies
no longer have the luxury of time. Internet customers applying for new credit cards or loans
want fast answers, and will go to a competitor if they do not get them. Consumers ordering
from Web storefronts do not want to be told the following day that the product is out of
stock. Being able to react rapidly is a competitive advantage, but fast decisions based on
inaccurate information can lead to bad loans and high-risk clients, which ultimately hurts
the bottom line.
Consistent Information. Even though information consistency and accuracy are related, they
are not the same. Companies are becoming more and more automated, and this has led to
information being dispersed across a multitude of applications and systems. Take customer
data, for example. In front-office systems, like CRM, customer data may be spread across
customer sales, marketing, and support systems. In the back-office, this customer data
may exist in order entry, billing, and shipping systems. External information providers may
supplement customer data with information about credit history, demographics, and so
forth. There are also multiple customer touch points to deal with, from Internet storefronts, to
physical retail stores, and customer support centers. Each of these systems and touch points
may contain accurate customer data, but is it consistent? This data may reflect different
moments in time, may be formatted differently, and may often reflect different business
definitions. Customer name and address data is an example where major consistency
problems across systems and between applications exist. These inconsistencies are a serious
obstacle to obtaining the single view of the customer that many companies are looking for.
Managing Data Quality
Ever since the advent of the computer and data processing software, data quality has been
an issue, and a major stumbling block to effective and accurate decision making. Life was
much simpler in the past. Early business transaction processing systems consisted primarily
of batch applications that processed and exchanged information using batch files. In this
environment, data quality was reactive, in that accuracy and consistency were checked after
the fact by validating batch input and interchange files. The only time pressure here was the
elapsed time to process those files. Decision support applications during this time consisted
mainly of regular weekly, monthly, and quarterly batch reporting jobs, and so again there
was time to fix data quality problems.
As the use of business transaction applications evolved, companies changed from a batch
mode of operation, to an interactive and online one. This evolution saw the introduction of
terminal-driven applications, client/server computing, and today’s Web-based systems. At
the same time, companies began to use application integration middleware and distributed
computing to transfer data between systems. The move toward the online enterprise forced
organizations to be less reactive and more proactive in their approach to data quality. Online
applications and application integration middleware now included data validation, lookup
routines, and business rules to verify dynamically the accuracy of the data as it was created
and moved between systems.
Decision-making technologies also changed as organizations became online enterprises. The
use of business intelligence (BI) and data warehousing (DW) saw dramatic growth as these
technologies offered the ideal solution for supplying integrated, summarized, and historical
transaction data for strategic planning and tactical business analysis.
Although data quality management in business transaction systems has become more
dynamic and proactive, this is not the case with most BI/DW systems. Data quality
management in DW is still reactive in nature. Data warehouses are maintained primarily
by running regular batch jobs. As batch files of extracted business transaction data are
processed by extract, transform, and load (ETL) applications, rules-driven data quality
routines are run to check data accuracy, and to ensure the consistency of the integrated data
warehouse information.
The use of BI/DW applications in organizations, however, is going through a major paradigm
shift. Companies are starting to use BI/DW applications, not just for strategic and tactical
reporting and analysis, but also for managing day-to-day and intra-day business operations.
As a result, ETL processes are moving from a batch update cycle, to one of capturing a
continuous stream of business transaction data for updating data warehouses with near-real-
time information.
BI performance management applications in turn are starting to use near-real-time data for
creating operational business performance dashboards for executives and LOB managers to
monitor and manage intra-day business performance. These applications enable executives
and LOB managers to compare actual business performance to business goals, and to take
rapid action when performance metrics indicate that goals are not being met.
This BI paradigm shift means that BI/DW applications are becoming essential to business
success because they are responsible for driving and optimizing daily business operations.
This trend will put even more pressure on BI/DW groups to guarantee the accuracy and
consistency of data. It will also mean, as with business transaction processing, that data
quality management must become more proactive and less reactive. This requires BI/DW
applications to check the quality of information dynamically, in-flight, as it flows between
systems and applications.
The Need for Service-Oriented Architecture
Business transaction, data warehousing, and business intelligence processes are becoming
interconnected and closer to real-time in nature. The benefit to business users of this real-
time architecture is that they have access to the information they need to monitor business
operations and react rapidly to changing business needs and circumstances.
Although a real-time IT system may be able to deliver timely information to business
executives and managers, this is of limited value unless it can also handle real-time
requests to modify business rules and processes that come as a result of faster decision
making. The problem is that most rules and process interconnections are hard-wired into
existing monolithic applications, and are difficult to change dynamically. The solution to this
problem is service-oriented architecture (SOA).
SOA is based on a network of loosely-coupled components that can be interconnected using
common and open standards. Components may be applications, shared services, and so
forth. The SOA concept is not new. In the past, software vendors have used technologies
such as CORBA and the Java Message Service (JMS) to interconnect disparate application
components in support of SOA. The issue with early attempts at supporting SOA was that
many of the solutions were proprietary and complex to implement.
The advent of Web services and XML-based protocols has made SOA more viable because
these services and protocols are easier to implement, are more flexible, and are based
on open standards. Another key benefit is that an existing application can be wrapped
and presented as a Web service, which supports an orderly migration to a modern SOA
environment. The use of Web services, however, is not a prerequisite for SOA.
SOA is ideally suited for a real-time and dynamic enterprise because processes can be
interconnected easily in a flexible architecture that can adapt to changing business needs.
It allows service functions such as data validation and data transformation to exist as
separate components that can be called by business process components as required. SOA
also means that business rules used by service components no longer have to be embedded
in business processes and applications, but can be maintained and shared independently
from the business components that use them.
A Promising Future
In summary, modern business transaction and business intelligence technologies can
work cohesively together to enable organizations to work smarter and make more timely
decisions. As always, accurate and consistent information is crucial to success. The advent of
SOA enables data quality management software to act as a service that can be shared by
multiple business processes. Separating data quality management rules from the processes
that use them improves flexibility and makes it possible for the rules to be dynamically
maintained to meet constantly changing business needs.
Further Defining Service-Oriented Architecture
While Colin White's introduction to this paper discussed some of the basics of SOA, it is prudent
to define SOA as it relates to data quality solutions. An article from O’Reilly’s webservices.xml.com
defines SOA as:
An architectural style whose goal is to achieve loose coupling among interacting
software agents. A service is a unit of work done by a service provider to achieve
desired end results for a service consumer. Both provider and consumer are
roles played by software agents on behalf of their owners. (He, 2003, “SOA Defined
and Explained”)
For those new to the idea of SOA, the above definition may sound a bit abstract. As this paper
discusses SOA in terms of data quality, readers should keep the following points in mind:
• SOA consists of a service, the service provider, and the service consumer.
• The service is some bit of functionality or “work” to be performed.
• The service provider is the mechanism by which the service is made available
(usually via a server).
• The service consumer is the piece of software that takes advantage of the
functionality provided by the service (in other words, it consumes the service).
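The three roles listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `cleanse_address`, `ServiceProvider`, and `call_center_consumer` are hypothetical and do not come from any product, and the "service" is a trivial stand-in for real address cleansing.

```python
from typing import Callable, Dict


def cleanse_address(address: str) -> str:
    """The service: a unit of work (here, a toy address normalization)."""
    return " ".join(address.split()).upper()


class ServiceProvider:
    """The provider: makes named services available to consumers."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, service: Callable[[str], str]) -> None:
        self._services[name] = service

    def invoke(self, name: str, payload: str) -> str:
        return self._services[name](payload)


def call_center_consumer(provider: ServiceProvider, raw_address: str) -> str:
    """The consumer: knows only the service name, not how the work is done."""
    return provider.invoke("cleanse_address", raw_address)


provider = ServiceProvider()
provider.register("cleanse_address", cleanse_address)
result = call_center_consumer(provider, "  100 main   st ")
print(result)
```

The consumer depends only on the name of the service, so the provider could swap in a different implementation without any change on the consumer side, which is the loose coupling the definition describes.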
Furthermore, when describing SOA, some immediately jump to the conclusion that Web
services imply an Internet-based Application Services Provider (ASP) model. In fact,
organizations are instead frequently building services within their intranet environment,
behind their corporate firewall, to provide such information services.
Resources and References
More information about the general concept of service-oriented
architecture can be found on page 19.
Data Quality Solutions before SOA
For many years, the philosophy and practice of the “online enterprise,” mentioned in the
paper’s introduction, did not receive much attention, nor did the practice of data quality. As
a Forrester Research, Inc. article points out, “Key business applications were not designed
to interact with one another, and sharing of information across application boundaries
frequently requires point-to-point coding,” (Gilpin & Vollmer, 2004). Data quality solutions, in
turn, were built with this architecture in mind — a “silo” (or stovepipe) architecture.
Figure 1: Data quality deployed in a “silo” architecture
As shown in Figure 1, organizations have historically implemented different solutions from
one or more vendors to solve data quality needs across various business-critical applications
(ERP, CRM, Data Warehouse, etc.). A few examples include Web-friendly application
programming interfaces (APIs) for integrating into browser-based applications and
e-commerce, “tight-integration” APIs for desktop applications, and stand-alone back office
applications for working with an enterprise’s data warehouse. With the data environment
being highly dynamic, this siloed approach hinders organizations from meeting the accuracy
or consistency requirements alluded to in the introduction.
A Service without the Architecture
Data quality has always been a service for other business applications, never the application
itself, though its architecture has not always matched its role. As enterprise-level business
applications became more diverse, the need for consistent data quality across all of these
applications grew. However, the architecture of many solutions did not readily support
company-wide IT initiatives. Therefore, the costs also increased for an IT staff to purchase,
learn, implement, and maintain a consistent data quality solution across the enterprise. The
drawbacks of the silo architecture in data quality solutions became evident. Here are just a
few examples:
• Slightly different implementations for each silo often meant inconsistencies in how data was handled. This obviously caused problems for IT departments trying to maintain consistency across an enterprise.
• IT programmers were required to be “data quality experts,” because business rules were coupled with the API and more staff was needed to manage multiple implementations.
• APIs were often very proprietary. This could mean that a programmer could not integrate in his or her preferred language. Or, the programmer for a Web-friendly API might have little or no knowledge carry-over to the tight-integration API. Therefore, each silo’s integration basically started from scratch.
• As enterprise applications began sharing data, databases grew increasingly larger. Many data quality applications were not able to scale to meet this new demand,
meaning longer processing times and more demand on hardware resources.
Enter Service-Oriented Architecture
Many IT professionals can relate to the problems noted above. However, a new generation of
data quality solutions has begun to use the principles of SOA to alleviate some of the shortcomings
of silo types of implementations. Where the silo architecture has individual and potentially
disparate data quality solutions for each business application, the service-oriented architecture
treats data quality as the ubiquitous service it truly should be in an enterprise (see Figure 2).
Figure 2: Data quality in a service-oriented architecture
The technical details of how an organization can realize the advantages of SOA with data
quality solutions are discussed in the coming sections. Below are some inherent advantages:
• Less time to create and maintain data quality solutions.
• More flexibility in terms of how and where data quality solutions are deployed.
• Reduced learning curves for integrating data quality solutions.
• Little or no need for IT staff to become data quality experts.
• Faster implementations, improved data quality results, and reduced costs.
What to Look for in a Data Quality Solution
At a high level, this paper has discussed some of the disadvantages of the silo approach to
data quality implementation, and some of the advantages that SOA claims to offer. But how
does an organization realize these advantages? Picking any data quality solution that is built
on Web services does not necessarily guarantee all of the potential advantages that a true
SOA design can offer.
There are very specific features to look for in a data quality solution, especially those focused
on SOA. The following sections include an in-depth technical discussion (where appropriate)
of what to look for in a data quality solution, and explain how a services approach directly
impacts an IT professional or a systems integrator implementing data quality across the
enterprise. This section will discuss:
• Evaluating a data quality API
• Defining business rules
• Selecting a service provider (data quality server)
• Other features important for a data quality solution
Evaluating a Data Quality API
Critical to the success of any integration project is evaluating and selecting a data quality
API. The right API will speed implementation, meeting or exceeding integration timelines
and reducing maintenance efforts. A poor API has the potential to lock up the most skilled
IT engineering resources in a spiral of confusion and missed deadlines, with long-term and
intensive management requirements. Total cost of ownership (TCO) must be evaluated
with as much scrutiny as the cost of the technology itself. When selecting an API that will
enhance TCO and productivity, one should look for the following characteristics:
• Business rules decoupled from the API
• API that follows industry standards
Business Rules Decoupled from the API
Business rules define exactly how the data quality processing should occur for a specific data
set. For example, business rules define which fields to cleanse, which new fields to add to the
data, how to standardize data, and a plethora of other options.
For any programmer, it is enough to have to learn a new product’s API in order to integrate
it. In the past, IT staff members tasked with integrating a data quality application typically
had to become data quality experts as well. Not only were they required to work with internal
customers to establish business rules, they also had to translate the rules into the new API.
A data quality solution built on SOA can, however, eliminate this problem by allowing the
business rules to be completely decoupled from the API. Instead of learning all the nuances and
minutiae of data quality, IT staff members can leave the business rules to a business user (or
appointed enterprise-wide or department-level data quality expert). This allows IT resources to
concentrate on programming communication between the service consumer and provider.
Consider the example of a call-center application where customer information is collected. For
simplicity’s sake, assume the organization simply wants to cleanse address data within their
proprietary call center application.
Without targeting any specific products, Figure 3 offers a pseudo-code example of the
programming necessary to standardize a domestic address.
/* Set up the standardization parameters */
set_option(OPT_ASSIGN_CITY_BY_INPUT_LLIDX, TRUE);
set_option(OPT_PLACENAME, CONVERT_PLACENAME);
set_option(OPT_STND_ADDR_LINE, TRUE);
set_option(OPT_STND_LAST_LINE, TRUE);
set_option(OPT_UNIT_DESIG, UNIT_DIRECTORY);
set_option(OPT_CAPITALIZATION, UPPERCASE);
set_option(OPT_DUAL_TYPE, DUAL_MAILING);
set_option(OPT_APPEND_PMB, TRUE);

/* EWS is required for CASS Certification */
set_mode(MODE_ENABLE_EWS, TRUE);

/* Set location of look-up directories and dictionaries */
set_file(DIR_ZIP4_1, "C:\\data_quality\\data\\zipfile.dir");
set_file(DIR_REVZIP4, "C:\\data_quality\\data\\revzipfile.dir");
set_file(DIR_CITY, "C:\\data_quality\\data\\cityfile.dir");
set_file(DIR_ZCF, "C:\\data_quality\\data\\zcffile.dir");
set_file(DIR_EWS, "C:\\data_quality\\data\\ewsfile.dir");
set_file(DCT_CAP, "C:\\data_quality\\data\\capitalization.dct");
set_file(DCT_FIRMLN, "C:\\data_quality\\data\\firms.dct");
set_file(DCT_ADDRLN, "C:\\data_quality\\data\\addressline.dct");
set_file(DCT_LASTLN, "C:\\data_quality\\data\\lastline.dct");

/* Set input fields */
set_line(IADDRESS_LINE, tmpbuf1);
set_line(LASTLINE, tmpbuf2);
set_line(ZIP4, (char *)" ");
set_line(URB, tmpbuf3);
...
Figure 3: Pseudo-code excerpt: API coupled with business rules
The code in Figure 3 is just a very small excerpt of what could be necessary for an API that is
coupled with business rules. It merely sets a few options, the locations of some necessary
files, and the input fields. Completing such an example would require defining input field
formats, determining locations for input and output of data, specifying processing options
for reports, and numerous other pieces of functionality that would be necessary. Obviously,
there is still much more code to be written.
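/* Identify the service provider (the data quality server) */
var hostname = "server1";
var portnumber = "20003";
/* Point to the centrally stored business rules */
var busrulelocation = "\\\\server2\\dataquality\\busrules";
runBatchProject(hostname, portnumber, busrulelocation, myproject3);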
Figure 4: Pseudo-code excerpt: API decoupled from business rules
The example shown in Figure 4 depicts what a programmer might have to specify to run a
project in an application built on SOA where the business rules are decoupled from the API.
The programmer would simply specify information about the service provider and the set of
business rules to use. Other optional methods could be used, but the example above may be
all the code necessary to call a batch project when a data quality solution is built with SOA.
These examples have been simplified to show only calls to the API (for example, no user-
interface code is shown). Though simplified, these examples show the true advantage of
an API that is decoupled from business rules. In the coupled example, if a change to even
a simple business rule preference were made, it would require a corresponding change to
an API call in the code resulting in recompilation, testing, and redeployment to production
systems. However, a similar change to the decoupled example could require only a change
within the business rules, not the code that calls the data quality solution.
It is easy to see the time and energy this would save for initial creation and maintenance of
data quality code in any enterprise application.
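To make the contrast concrete, the following Python sketch keeps the business rules in data (a JSON string standing in for a rules file kept on a server) rather than in API calls. All names here (`RULES_JSON`, `apply_rules`, and the `capitalization` and `trim_whitespace` keys) are hypothetical and chosen for illustration; they are not options from any real data quality product.

```python
import json

# Assumption: rules live outside the program, e.g. in a file maintained by a
# business user. A JSON string stands in for that file here.
RULES_JSON = '{"capitalization": "upper", "trim_whitespace": true}'


def apply_rules(value: str, rules: dict) -> str:
    """Apply whatever the rule set says; no rule values are baked into the code."""
    if rules.get("trim_whitespace"):
        value = " ".join(value.split())
    style = rules.get("capitalization")
    if style == "upper":
        value = value.upper()
    elif style == "title":
        value = value.title()
    return value


rules = json.loads(RULES_JSON)
cleansed = apply_rules("  332 front   st ", rules)

# Changing a preference is a data change only; apply_rules is untouched.
revised_rules = dict(rules, capitalization="title")
recleansed = apply_rules("  332 front   st ", revised_rules)

print(cleansed)
print(recleansed)
```

In this sketch, switching the capitalization preference requires editing the rule data only, so nothing in the calling code has to be recompiled, retested, or redeployed.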
API Follows Industry Standards
Earlier sections touched on the difficulties caused by proprietary APIs. There is always a
learning curve involved with a new API, and the less it adheres to industry standards, the
steeper the learning curve will be. Additionally, many existing data quality APIs are at least
somewhat limited in terms of integration language support, or platform support.
What happens when a data quality solution has an API available only in C++, but Java is the
preferred language of the IT department? Or, a Solaris solution is required, but the API is
available only on Windows? The company is then forced to either pass on what could be an
otherwise-good solution, or adjust standard business practices to fit the solution.
This is where Web services can provide significant value. As mentioned before, Web services
alone are not synonymous with SOA, but can be a very important part of an enterprise-wide
SOA. If a data quality tool uses a Web service interface, what does it mean?
• Platform independence ensures that the solution will fit any environment; the
environment would not have to be fit to the solution.
• Implementation independence enables use of whichever programming language the
IT department is comfortable with. This can help keep the learning curve low.
• Industry standards mean a head start if the IT professionals have integrated
other Web services. Additionally, companies have the option of using third-party
development tools available for the industry standards.
• Web services are an ideal model for working in a heterogeneous environment (such as
a mixture of Windows and UNIX systems).
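One reason a Web service interface is language-neutral is that the consumer only has to produce and read a text message, typically XML. The sketch below builds a request and parses a response with Python's standard library. The element names (`CleanseRequest`, `Address`, `LastLine`, `CleanseResponse`, `Standardized`, `Status`) are invented for illustration and do not represent any vendor's schema, and no network call is made.

```python
import xml.etree.ElementTree as ET


def build_request(address: str, last_line: str) -> str:
    """Serialize a cleanse request as XML (hypothetical element names)."""
    root = ET.Element("CleanseRequest")
    ET.SubElement(root, "Address").text = address
    ET.SubElement(root, "LastLine").text = last_line
    return ET.tostring(root, encoding="unicode")


def parse_response(xml_text: str) -> dict:
    """Read the fields a consumer cares about from a response message."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


request_xml = build_request("100 main st", "la crosse wi")

# Stand-in for what a data quality server might send back.
sample_response = (
    "<CleanseResponse>"
    "<Standardized>100 MAIN ST</Standardized>"
    "<Status>OK</Status>"
    "</CleanseResponse>"
)
fields = parse_response(sample_response)

print(request_xml)
print(fields)
```

Any language with an XML library could produce the same request and read the same response, which is what makes the interface independent of platform and implementation.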
Defining Business Rules
Probably as important as the API is the way that business rules are defined. Without an API,
there is no way for an application to tie in data quality. Without business rules, there is no
way to tell the application what to do once the data quality processes have been launched.
One should look for the following in business rule definition:
• Centralized business rules
• Business rules with inheritance
• Predefined business rules
• How the rules are defined
Centralized Business Rules
In a silo architecture, each data quality solution generally had its own way to define business
rules. Some business rules may be defined directly in an API, whereas others may be defined
in a proprietary configuration file. Spreading business rules across the enterprise leads to a
number of problems such as:
• Inconsistency: Siloed implementations, each with unique business rules, result in
inconsistent data formats and content.
• High maintenance costs: Even if an organization uses a single vendor’s solution in
multiple implementations, what happens if a business rule is updated in one spot?
There will likely be the need for an internal process or mechanism to pass that
change throughout the enterprise.
A data quality solution with a centralized set of business rules, accessed by the service
provider, is a key component of a data quality SOA. With a centralized set of business rules,
rules are defined in the same way, providing consistency across implementations.
If multiple applications use the same business rules configuration, a centralized set of rules
instantly eliminates much of the maintenance cost. A user can update a rule in one spot and it
is updated throughout all of the enterprise’s applications.
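The effect of a centralized rule set can be illustrated with a short Python sketch. The class names `RuleRepository` and `Application` and the single `capitalization` rule are assumptions made for this example, not part of any product; the point is only that applications look the rule up at call time instead of keeping their own copy.

```python
class RuleRepository:
    """Hypothetical central store of business rules, held by the service provider."""

    def __init__(self) -> None:
        self._rules = {"capitalization": "upper"}

    def get(self, name: str) -> str:
        return self._rules[name]

    def update(self, name: str, value: str) -> None:
        self._rules[name] = value


class Application:
    """Any enterprise application (CRM, ERP, ...) that consults the shared rules."""

    def __init__(self, name: str, repository: RuleRepository) -> None:
        self.name = name
        self._repository = repository

    def format_city(self, city: str) -> str:
        # The rule is looked up at call time, never copied into the application.
        if self._repository.get("capitalization") == "upper":
            return city.upper()
        return city.title()


repository = RuleRepository()
crm = Application("CRM", repository)
erp = Application("ERP", repository)

before = [crm.format_city("la crosse"), erp.format_city("la crosse")]
repository.update("capitalization", "title")  # one change, in one place
after = [crm.format_city("la crosse"), erp.format_city("la crosse")]

print(before)
print(after)
```

Both applications pick up the change immediately because neither holds its own copy of the rule.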
Business Rules with Inheritance
Consistent data quality across enterprise applications requires consistency of business
rules. Establishing corporate-wide data quality standards through business rules supports
consistency. However, there are often project-specific nuances that must be considered,
and therefore subtle changes to business rules become a necessity. One would assume that
development and maintenance time for the business rules has been immediately increased.
This is not the case if the data quality solution supports the inheritance principle for business
rule definition.
What does inheritance mean for data quality business rules? One can think of it in terms of
programming. Imagine that a programmer has a block of code that he wants to use in multiple
places within an application. If following good programming practices, the programmer is
not copying and pasting that code in multiple places. Instead, the programmer would define
a reusable function and simply call that function where necessary. If updates to the code
are needed, the programmer would update the function directly, which would automatically
propagate the change wherever the function is used.
The same functionality should be available in a data quality solution. When defining business
rules, components should be reusable. For example, data quality projects should be able to
inherit settings from lower-level components. That way, a component could be shared across
many projects. Just like the function example, if the low-level data quality component were
updated, that change would be inherited by all projects (see Figure 5).
Figure 5: Projects A and B both inherit the same business rules for address cleansing
Figure 5 shows a fairly typical (albeit simple) flow of data in a data quality process. In each project,
the application is configured to cleanse address data. However, the data source and target
are different in each project. If the data quality tool supports the inheritance concept for
business rules, the address cleanse process can be defined independently of the other pieces.
Then, each higher-level project inherits that object. Any change made to the address
cleanse process is then picked up by each project, drastically reducing the cost of maintaining
multiple projects.
Similarly, the solution should also have the option to “override” business rules, if necessary.
That way, the advantage of inheritance still exists, but there is also the flexibility to override a
rule if it makes sense for a given project.
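Inheritance with optional override can be sketched with Python's standard `collections.ChainMap`, which looks a key up in the first mapping and falls back to the next. The rule names, project settings, and values below are hypothetical; they only mirror the idea of two projects sharing one address cleanse component.

```python
from collections import ChainMap

# Hypothetical shared, low-level component: address cleansing rules.
address_cleanse = {"capitalization": "upper", "standardize_last_line": True}

# Each project layers its own settings over the shared component.
project_a = ChainMap({"source": "crm_db", "target": "warehouse"}, address_cleanse)
project_b = ChainMap(
    {"source": "web_orders", "target": "billing", "capitalization": "mixed"},
    address_cleanse,
)

inherited = project_a["capitalization"]   # comes from address_cleanse
overridden = project_b["capitalization"]  # project B's own value wins

# A change to the shared component is seen by every project that inherits it.
address_cleanse["standardize_last_line"] = False
a_sees = project_a["standardize_last_line"]
b_sees = project_b["standardize_last_line"]

print(inherited, overridden, a_sees, b_sees)
```

Project B keeps its own capitalization setting while still picking up every other change made to the shared component, which is the combination of inheritance and override described above.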
The inheritance idea is not necessarily tied to SOA. However, combining inheritance with SOA
truly enhances the power of this functionality. There can be a huge amount of maintenance
time saved if all data quality projects across the enterprise share common business rules.
Predefined Business Rules
The cost of learning any new software platform can be a bit burdensome on an IT professional
or systems integrator. This can also be true of data quality tools. However, data quality
solutions can include capabilities to help reduce this learning curve.
For example, a data quality solution should include a wide array of predefined business rules.
The company that creates the data quality solution should be an expert on that subject, and
with predefined rules, vendors can pass on some of that expertise. It is certainly easier to
modify a set of rules to fit with a given set of data than it is to start completely from scratch. A
data quality solution should provide a wide variety of predefined rules for projects similar to
those common for most enterprises.
How the Rules are Defined
It goes without saying that a data quality tool should include an intuitive interface for defining
business rules. A good interface can lessen the learning curve and the time to create projects
for a data quality expert or business user.
A user interface (UI) should be easy enough for a business user to feel comfortable working
with. It should enable a business user to set up the basic framework of a data quality project.
Ideally, the UI should provide a graphical view of the data quality process, allowing the user a
visual representation of how the data will be cleansed.
Selecting a Data Quality Service (Data Quality Server)
As discussed in the definition of SOA, one of the necessary components for any solution built
on SOA is a service provider. Chances are that if a data quality tool is built with an SOA, its
service provider will be some sort of data quality server. This is where the real work of data
quality processing will take place.
This piece may be one of the least visible components of a data quality solution — it is
usually running in the background of a system, accepting and processing data quality
requests. However, this component is certainly just as important as any other piece of a
data quality solution. The data quality service should include these types of features:
• Flexible server configuration
• Support for standard data formats
• Server scalability
Case Study: Avid Technology
by Colin White, president of BI Research
An excellent example of how IT systems and data quality management have evolved from a batch architecture to a real-time one is Avid Technology, a provider of digital media creation, management, and distribution solutions.
Avid uses an Onyx Software system to handle its customer center operations, and SAP Business Information Warehouse (SAP BW) to manage its business intelligence environment. Customer data for Internet and e-mail marketing is extracted from the Onyx CRM system and loaded in batch mode to SAP BW once per quarter. During the ETL processing, data quality management software from Firstlogic is used to perform a number of data quality routines ensuring the best customer information is entered into the CRM system. Data cleanup improves data accuracy and reduces marketing costs. On average, about 15 percent of the data contains duplicate information.
At the beginning of 2003, Avid decided to use SAP CRM to expand its front-office initiatives and to include customer information coming from the Web sites of its three independent business units. Unlike the Onyx environment, no data quality validation routines were put into place to manage customer data. The company quickly found that poor data was finding its way into SAP BW and its associated BI applications.
To solve this problem, Avid implemented Firstlogic IQ8 Integration Studio™ to check data coming from all customer touch points. This real-time and sharable service dynamically checks the data collected from Avid’s three Web environments before it is loaded into SAP CRM. The benefits of this approach are that all Web activity is subject to the same business rules, and the shared business rules can be maintained interactively and independently from application processing.
Avid intends to extend the use of its service-oriented approach to data quality management to include dynamic processes that ensure that customer orders and shipments satisfy regulatory compliance such as the USA PATRIOT Act.
Flexible Server Configuration
In the software world, the term “flexible” is often an overused buzzword. But consider
how important flexibility is for any software solution. It can mean the difference between
a relatively easy or difficult setup and integration into an enterprise. Flexibility is just as
important for a data quality server.
A data quality solution should have a server that is flexible across many platforms. For
example, a data quality server should be able to reside on either a UNIX or Windows server,
yet still be able to communicate with other UNIX and Windows computers, regardless of
platform. Again, the solution should fit into any environment, and not force the existing
environment to adapt to the solution.
The data quality solution should also be implementation independent. For example, if the
data quality solution is integrated into both a Web-based application and a “thick client”
desktop application, both of these applications should be able to use the same data quality
server and business rules. Likewise, it should be possible to use the same server for both
batch processing and transactional processing. This can simplify an installation environment,
easing the burden of initial setup and maintenance. Also, if the data quality servers have
any configurations of their own, this can help ensure consistency between servers.
Conversely, the solution should also allow use of multiple data quality servers. For example,
the user should have the ability to distribute the load by spreading work among multiple
servers. There could also be one server for transactional processing, and one for batch
processing, to ensure transactional requests are getting an appropriate response. In this
scenario, transactional requests would be protected from any lag that could be caused while
a batch process was running.
As a side note, in an environment with multiple servers, the data quality solution should also
allow for shared business rules across servers. Again, all of the advantages of centralized
business rules mentioned before apply here.
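As a rough illustration of dedicating servers to each workload, the following sketch routes requests to separate pools so that batch work never competes with transactional requests. The host names, the `ServerPool` class, and the `route` function are hypothetical and stand in for whatever dispatch mechanism a given solution provides.

```python
import itertools
from typing import Dict, Iterator, List


class ServerPool:
    """Round-robin over data quality servers dedicated to one workload type."""

    def __init__(self, hosts: List[str]) -> None:
        if not hosts:
            raise ValueError("a pool needs at least one server")
        self._cycle: Iterator[str] = itertools.cycle(hosts)

    def next_host(self) -> str:
        """Return the next server in the rotation."""
        return next(self._cycle)


# Hypothetical host names: two servers for transactions, one for batch jobs.
POOLS: Dict[str, ServerPool] = {
    "transactional": ServerPool(["dq-tx-1", "dq-tx-2"]),
    "batch": ServerPool(["dq-batch-1"]),
}


def route(workload: str) -> str:
    """Pick the server that should handle a request of the given workload type."""
    return POOLS[workload].next_host()
```

Transactional load is spread across two servers, and a long-running batch job on `dq-batch-1` cannot delay a transactional response.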
Support for Standard Data Formats
One problem with many of the silo-generation solutions is that they offer support for a very
limited number of data formats. Typically, these tools support ASCII flat-files, and occasionally
one of the dBase formats. In addition, some solutions from this generation require a
proprietary format, so data must first be converted into a vendor-specified layout.
If a company’s data were in a relational database format, like SQL Server or Oracle, there
would likely be a need for additional business processes to accommodate the data quality
solution. For example, data would need to be converted to one of the accepted formats,
processed, then reconverted and reloaded into the preferred data format. Certainly, this is
inconvenient, time consuming, error-prone, and often causes more development work.
Support for most data formats is one of the hallmarks of a solution that is truly a
service, because the solution can tie in seamlessly with data, just as it can tie in
seamlessly with applications.
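To make the contrast with the convert, process, and reload cycle concrete, here is a minimal sketch of cleansing records directly in a relational database. It uses Python's built-in `sqlite3` module as a stand-in for SQL Server or Oracle; the table, column, and cleansing rule are assumptions chosen only for illustration.

```python
import sqlite3


def standardize_state(value: str) -> str:
    """Toy cleansing rule: trim whitespace and upper-case a state code."""
    return value.strip().upper()


def cleanse_in_place(conn: sqlite3.Connection) -> int:
    """Cleanse the state column directly in the database; return rows changed."""
    rows = conn.execute("SELECT id, state FROM customers").fetchall()
    changed = 0
    for row_id, state in rows:
        cleaned = standardize_state(state)
        if cleaned != state:
            conn.execute(
                "UPDATE customers SET state = ? WHERE id = ?", (cleaned, row_id)
            )
            changed += 1
    conn.commit()
    return changed


# Hypothetical customer table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO customers (state) VALUES (?)", [(" wi",), ("MN",), ("ia ",)]
)
changed_rows = cleanse_in_place(conn)
```

No flat file is exported or reloaded; the data is read, cleansed, and written back in its native format.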
Server Scalability
It goes without saying that faster is better. As databases continue to grow, scalability
becomes increasingly important. Take the example of a data quality process running
overnight so that its hardware resources are free during the workday for other tasks. Now,
due to a growing dataset, the process takes too long to run overnight. The process could
be moved to a weekly – instead of nightly – process and be run on the weekend, but the
advantages of regular data quality processing are lost. The process could be moved to a
computer with more processing power, but if the solution does not scale, that really solves
nothing. These are just a couple of examples that demonstrate the importance of scalability.
Advertising that a solution is “scalable” is not necessarily enough, though. A data quality
solution should scale in the following ways:
• The solution should scale to support multiple projects and increasing numbers of
concurrent transactional users.
• Most data quality processes are made up of a number of sub-processes (e.g.,
address cleansing, data cleansing, data appending, matching/consolidation, and
so on). The solution should allow these individual sub-processes to be tuned. For
example, the ability to adjust the number of threads used by each sub-process
makes it possible to fine-tune processing and get the best performance out of each
data quality project.
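The idea of tuning each sub-process separately can be sketched as follows. This is an illustration under stated assumptions, not a description of any vendor's engine: the stage functions are trivial stand-ins for real cleansing logic, and the thread counts in `THREADS_PER_STAGE` are hypothetical tuning values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def cleanse_address(record: str) -> str:
    """Stand-in for address cleansing: trim and upper-case."""
    return record.strip().upper()


def cleanse_name(record: str) -> str:
    """Stand-in for data cleansing: collapse repeated whitespace."""
    return " ".join(record.split())


# Hypothetical tuning table: worker threads allotted to each sub-process.
THREADS_PER_STAGE: Dict[str, int] = {"address": 4, "name": 2}
STAGES: Dict[str, Callable[[str], str]] = {
    "address": cleanse_address,
    "name": cleanse_name,
}


def run_pipeline(records: List[str]) -> List[str]:
    """Run each stage over all records using that stage's own thread count."""
    for stage, func in STAGES.items():
        with ThreadPoolExecutor(max_workers=THREADS_PER_STAGE[stage]) as pool:
            records = list(pool.map(func, records))  # map preserves input order
    return records
```

Changing a single entry in the tuning table adjusts one sub-process without touching the others, which is the kind of control the second bullet describes.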
Other Features Important to a Data Quality Solution
A few other features that are important for a data quality solution include:
• Versatile options for metadata
• Data processing in one step
• Transactional and batch processing as a service
Versatile Options for Metadata
IT managers know the importance of hard facts to back up a report to the CFO, or to
justify a request for expenditure. IT professionals depend on metadata to understand the
data itself and make better decisions about processing it. Metadata is also a key tool for
troubleshooting problems when unexpected results occur.
For these reasons, a flexible metadata solution is very important. A data quality solution should
allow the user to retrieve needed metadata, from any point in the process, as in Figure 6.
Figure 6: The data quality solution should allow for metadata retrieval at any point in the process
Most data quality solutions provide metadata only at the end of a process. But business
drivers may dictate that metadata be captured for a specific step of the cleansing process.
Flexible metadata capture allows companies to compare intermediate results with the final
metadata at the end of the process. To ensure ultimate flexibility, the solution should allow
metadata to be created in any chosen data destination and format.
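A minimal sketch of capturing metadata at chosen points in a process might look like the following. The step functions and the `run_with_metadata` helper are hypothetical, and the metadata here is limited to record counts; a real solution would likely expose far richer statistics.

```python
from typing import Callable, Dict, List, Set, Tuple

Step = Tuple[str, Callable[[List[str]], List[str]]]


def trim(records: List[str]) -> List[str]:
    """Remove leading and trailing whitespace from every record."""
    return [r.strip() for r in records]


def drop_empty(records: List[str]) -> List[str]:
    """Discard records that are empty after trimming."""
    return [r for r in records if r]


def dedupe(records: List[str]) -> List[str]:
    """Keep the first occurrence of each record, preserving order."""
    return list(dict.fromkeys(records))


def run_with_metadata(
    records: List[str], steps: List[Step], capture: Set[str]
) -> Tuple[List[str], Dict[str, Dict[str, int]]]:
    """Run steps in order; record counts in and out for each step in capture."""
    metadata: Dict[str, Dict[str, int]] = {}
    for name, func in steps:
        before = len(records)
        records = func(records)
        if name in capture:
            metadata[name] = {"records_in": before, "records_out": len(records)}
    return records, metadata
```

For example, calling `run_with_metadata(data, steps, {"drop_empty", "dedupe"})` returns the final records together with counts for just those two steps, so intermediate results can be compared with the end-of-process totals.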
Data Processing in One Step
As discussed in earlier sections, a data quality process is really made up of many
sub-processes such as address cleansing, data cleansing, matching, and so on. Many data quality
solutions, however, do not treat these as sub-processes at all. In this scenario, multiple main
processes are required, often through different products, to get the end result.
A data quality solution should truly be a data quality platform. It should treat each piece of
data quality as part of the bigger process. This allows users to configure their solution to be
a simple process, such as address cleansing alone, or a complex, multi-function process like
consumer householding. Regardless of the desired result, the solution must allow the user to
configure the cleansing to be done in a single process.
For example, a company needs to cleanse address data; cleanse name, firm, and e-mail data;
and then locate matching records. In many data quality solutions, these would be three
distinct projects, using a separate product for each step. This means that output must be
generated for each process and input into the next. In addition to more work, this generally
means more files to manage and more disk space used on the system.
Figure 7: Multi-step data quality process
If the solution truly treats data quality as one process (or project) there are fewer individual
steps and no need for managing extra files, as shown in Figure 8.
Figure 8: The same project in a single-step data quality process
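The single-step approach can be sketched in a few lines. In this illustration, which assumes toy cleansing rules and exact-match comparison, the three sub-processes are chained so that records stream from one to the next in memory and no intermediate files are written. All function and field names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]


def cleanse_address(records: Iterable[Record]) -> Iterator[Record]:
    """Upper-case each address and collapse repeated whitespace."""
    for r in records:
        yield {**r, "address": " ".join(r["address"].upper().split())}


def cleanse_name(records: Iterable[Record]) -> Iterator[Record]:
    """Trim each name and apply title case."""
    for r in records:
        yield {**r, "name": r["name"].strip().title()}


def find_matches(records: Iterable[Record]) -> List[List[Record]]:
    """Group records that share a cleansed name and address (exact match only)."""
    groups: Dict[Tuple[str, str], List[Record]] = {}
    for r in records:
        groups.setdefault((r["name"], r["address"]), []).append(r)
    return [g for g in groups.values() if len(g) > 1]


def single_step(records: Iterable[Record]) -> List[List[Record]]:
    """Chain all three sub-processes; records pass between them in memory."""
    return find_matches(cleanse_name(cleanse_address(records)))
```

In the multi-step arrangement, each of these functions would instead read an input file and write an output file for the next product to pick up.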
Transactional and Batch Processing as a Service
Treating transactions as a service seems pretty obvious. In a world of thin-client
applications, nobody wants to house a thick-client data quality application on each client
computer. However, batch processing should also be treated as a service, though in a
slightly different way.
It is likely that newer data quality solutions will be built using a Web service or other similar
mechanism as the communication method. This makes perfect sense in the transaction
world. A proprietary application would send a set of data in a SOAP envelope to the Web
server (and subsequently the data quality server). Then, the envelope would be returned to
the application along the reverse path, carrying the cleansed data.
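As a rough sketch of the transactional exchange, the code below wraps one record in a SOAP 1.1 envelope and reads field values back out of a response shaped the same way. The service namespace `urn:example:data-quality` and the `CleanseRecord` operation are assumptions made for illustration; an actual data quality service would define its own operations and message schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
DQ_NS = "urn:example:data-quality"                     # hypothetical service namespace


def build_cleanse_request(fields: Dict[str, str]) -> bytes:
    """Wrap one record in a SOAP envelope for a hypothetical CleanseRecord call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{DQ_NS}}}CleanseRecord")
    for name, value in fields.items():
        ET.SubElement(request, f"{{{DQ_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="utf-8")


def read_cleansed_fields(response_xml: bytes) -> Dict[str, str]:
    """Pull the field values back out of a response shaped like the request."""
    body = ET.fromstring(response_xml).find(f"{{{SOAP_NS}}}Body")
    payload = body[0]  # first child of Body carries the operation payload
    return {child.tag.split("}", 1)[1]: child.text for child in payload}
```

Each envelope carries a single record, which suits interactive use but, as the next paragraph explains, does not suit batch volumes.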
This approach is not well suited for a batch process. An application should not send
hundreds, thousands, or millions of batch records through the service, nor should it send
one huge transaction with this sort of data. The traffic of either of these methods would likely
gridlock a service in no time.
However, batch processing should still be treated as a service in the following way. The
application should be able to send a similar SOAP envelope that simply says, “start
processing”— thereby launching the batch job at the server. The business rules for this
project would already identify the data sources and targets, allowing the job to process the
data directly. The service should allow for querying the progress of that batch job and sending
back a message when the process has completed. This type of architecture makes it possible
to kick off a batch process and monitor progress from a remote location, for example.
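The start-and-monitor pattern for batch jobs can be sketched with an in-process stand-in for the server. The `BatchJobService` class and its method names are hypothetical; in a real deployment, `start` and `status` would be operations exposed through the service interface, and the job would read and write the data sources named in the project's business rules.

```python
import threading
import uuid
from typing import Callable, Dict


class BatchJobService:
    """In-process stand-in for a server that launches and tracks batch jobs."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._threads: Dict[str, threading.Thread] = {}
        self._lock = threading.Lock()

    def start(self, job: Callable[[], None]) -> str:
        """Handle a 'start processing' request: launch the job, return its id."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._status[job_id] = "running"
        thread = threading.Thread(target=self._run, args=(job_id, job))
        self._threads[job_id] = thread
        thread.start()
        return job_id

    def _run(self, job_id: str, job: Callable[[], None]) -> None:
        """Execute the job on the server side and record how it ended."""
        try:
            job()
            result = "completed"
        except Exception:
            result = "failed"
        with self._lock:
            self._status[job_id] = result

    def status(self, job_id: str) -> str:
        """Handle a status query from the remote application."""
        with self._lock:
            return self._status[job_id]

    def wait(self, job_id: str) -> str:
        """Block until the job finishes, then report its final status."""
        self._threads[job_id].join()
        return self.status(job_id)
```

Only the small control messages cross the service boundary; the batch records themselves never travel through it.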
Data Quality Solutions Built with SOA
This paper has discussed data quality solutions before SOA, built with SOA, and what IT
professionals and systems integrators should look for in the new generation of data quality
solutions. Until recently, data quality solutions were often ill suited for the modern online
enterprise and BI/DW paradigm shift that BI expert Colin White discussed in the introduction
to this paper. Now data quality solutions, designed with a service-oriented architecture, are
an ideal fit for providing the timely, accurate, and consistent information that companies
need to operate effectively and compete successfully.
More Information about SOA
The following online articles include more information about SOA in general (not necessarily
relating to data quality).
• “What is Service-Oriented Architecture” by Hao He
http://webservices.xml.com/pub/a/ws/2003/09/30/soa.html
• “Understanding Service-Oriented Architecture” by David Sprott and Lawrence Wilkes
http://msdn.microsoft.com/architecture/soa/default.aspx?pull=/library/en-us/dnmaj/html/aj1soa.asp
• “The Benefits of a Service-Oriented Architecture” by Michael Stevens
http://www.developer.com/tech/article.php/1041191
• “Web Services and Service-Oriented Architecture”
http://www.service-architecture.com/index.html
References
Gilpin, Mike and Vollmer, Ken. (2004, July 6). Integration in a Service-Oriented World.
Forrester Research, Inc. 4.
He, Hao. (2003, September 20). What is Service-Oriented Architecture?
Retrieved July 19, 2004, from http://webservices.xml.com/pub/a/ws/2003/09/30/soa.html
About Firstlogic
Firstlogic develops data quality software that helps businesses create a single view within
their database. Its data profiling solution, IQ Insight®, measures, analyzes, and reports on
data quality problems and business rule violations. Firstlogic’s industry-leading Information
Quality Suite® cleanses and standardizes worldwide data, appends third-party information,
and builds relationships through matching and consolidating records. Firstlogic's new
data quality integration environment offers centralized data quality services, tuned to the
specific needs of systems integrators and corporate IT engineers. IQ8 Integration Studio™ is
a revolutionary environment for designing, building, deploying, and managing data quality
solutions. Firstlogic’s data quality software seamlessly integrates into CRM, ERP, BI, and
data warehousing applications. In addition to developing commercial solutions, Firstlogic
partners with many systems integrators, consultants, and original equipment manufacturers
to provide its unique technology to their end-user customers. Founded in 1984, Firstlogic
today serves thousands of customers worldwide, including Fortune 1000 companies in
the e-business, financial, insurance, healthcare, direct marketing, higher education, and
telecommunications markets. For more information, call 608.782.5000, send an email to
[email protected], or visit the company’s Web site at www.firstlogic.com.
Firstlogic, IQ Insight, and Information Quality Suite are registered trademarks of Firstlogic, Inc. All other trademarks are held by their respective owner or manufacturer.
© 2004 Firstlogic, Inc.