
Case Studies of Open Source Data Quality Management


© Dezhi Liu
June 25th 2010

Diploma Thesis

Supervisor: Prof. Dr.-Ing. Holger Hinrichs

Department of Electrical Engineering and Computer Science
Luebeck University of Applied Sciences


Department: Electrical Engineering and Computer Science

University course: Information Technology

Subject: Case Studies of Open Source Data Quality Management

Abstract

As organizations increasingly depend on their information systems, the field of data quality management (DQM) is getting more and more attention. Data of high quality will not only reduce unnecessary cost, but also provide credible information for decision making. This thesis focuses on the area of Open Source data quality management. The goal is to give an overview of this area with case studies based on major technologies.

Both theoretical background regarding DQM and practical evaluation of the sample tools are included. The fundamentals in terms of data quality, DQM and Open Source software are introduced. Three major technologies which are widely adopted by Open Source tools, i.e. data profiling, data cleansing and record linkage, are discussed in detail. The case studies are based on evaluation criteria focusing on two aspects: Open Source qualification and data quality functionality. According to the criteria, the qualifications of each tool are evaluated by the following approaches: analysis based on available documents, general-purpose tests and sample task evaluation. Finally, the thesis concludes with summarized results of the case studies as well as suggestions and a brief outlook for further research in this area.

Author: Dezhi Liu

Attending Professor: Prof. Dr.-Ing. Holger Hinrichs

WS / SS: SS 2010


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification.

Only the sources that were cited in the document have been used in this thesis. Parts that are direct quotes or paraphrases are identified as such.

I agree that my work may be published, in particular that the work may be presented to third parties for inspection or that copies of the work may be made and passed on to third parties.

Lübeck,

Signature



Contents

1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Structure of the Thesis

2 Foundation
  2.1 Data Quality
      2.1.1 Introduction to the Concept
      2.1.2 Data Quality Dimensions
  2.2 Data Quality Management
      2.2.1 Recognize the Problems
      2.2.2 General Approaches
  2.3 Major Technologies
      2.3.1 Data Profiling
      2.3.2 Data Cleansing
      2.3.3 Record Linkage
  2.4 Open Source Software
      2.4.1 What is Open Source?
      2.4.2 Open Source Models
      2.4.3 Why Open Source?

3 Methodology
  3.1 Evaluation Criteria
      3.1.1 Open Source Qualification
      3.1.2 Data Quality Functionality
  3.2 Tool Selection
  3.3 Technical Environment and Data Sources
      3.3.1 Technical Environment and Database Systems
      3.3.2 Data Source for Sample Task Evaluation
  3.4 Sample Task Evaluation
      3.4.1 Data Profiling Tasks
      3.4.2 Data Cleansing Task
      3.4.3 Record Linkage Task

4 Evaluation and Results
  4.1 Talend Open Profiler
      4.1.1 Open Source Qualification
      4.1.2 Working Principles and General Functionality
      4.1.3 Sample Task Evaluation
  4.2 DataCleaner
      4.2.1 Open Source Qualification
      4.2.2 Working Principles and General Functionality
      4.2.3 Sample Task Evaluation
  4.3 Talend Open Studio
      4.3.1 Open Source Qualification
      4.3.2 Working Principles and General Functionality
      4.3.3 Sample Task Evaluation
  4.4 SQL Power DQguru
      4.4.1 Open Source Qualification
      4.4.2 Working Principles and General Functionality
      4.4.3 Sample Task Evaluation
  4.5 Fril
      4.5.1 Open Source Qualification
      4.5.2 Working Principles and General Functionality
      4.5.3 Sample Task Evaluation
  4.6 Febrl
      4.6.1 Open Source Qualification
      4.6.2 Working Principles and General Functionality
      4.6.3 Sample Task Evaluation

5 Summary
  5.1 Comparison and Result
      5.1.1 Case Study Result on Data Profiling Tools
      5.1.2 Case Study Result on Data Cleansing Tools
      5.1.3 Case Study Result on Record Linkage Tools
  5.2 Summary and Outlook

Bibliography

A Candidate List and Selection Results


Chapter 1

Introduction

1.1 Motivation

Nowadays electronic data can be acquired, processed and stored much faster and in a more flexible way. They are used more and more widely in the information and communication technology (ICT) society. Apart from all the benefits of this development, organizations start to suffer from the disadvantages brought by data of poor quality. As shown in the report from the Data Warehousing Institute in 2002, data quality problems cost businesses in the US more than 600 billion dollars a year [Ins02]. The cost could be even higher because people are now trying to squeeze more information from their data for decision making. If the data itself is of poor quality, so is the information extracted from it.

For this reason, data quality (DQ) and data quality management (DQM) are getting more and more attention. As organizations start to recognize the importance of the quality of their data, technologies that help to assure and improve data quality have developed considerably. New technologies such as data warehousing and business intelligence (BI) have been established as strategic initiatives. These fields are dominated by large players, e.g. Oracle Corp. and SAP. Such vendors provide a comprehensive range of products as well as services that help organizations improve their data quality. In the meantime, several Open Source projects have also emerged in this area. Although they are not yet quite comparable with the commercial ones, they have great potential to become more powerful and competitive because of their own characteristics. The Open Source tools enable organizations, especially small and medium enterprises (SME), to tailor every aspect of the data quality experience from the ground up.

1.2 Objective

The goal of this thesis is to give a brief overview of the Open Source DQM area, covering not only the related theories but also case studies of current practice in this area. As the starting point, an introduction to data quality is needed. Data quality management should be discussed in detail, concentrating on the available technologies. Due to the time limitation of this thesis, as well as the lack of common agreement on the major technologies, it is not possible to cover the whole range. Therefore, only those which are widely adopted by the current Open Source (OS) tools are picked out and introduced intensively. Evaluation criteria should be clearly defined. Case studies of Open Source tools on these technologies should be performed according to the predefined criteria. Sample tasks are to be designed in order to provide real practical experience with each tool.

In this thesis, three major technologies are in focus: data profiling, data cleansing and record linkage. The selected tools and a short description of each one are shown in Table 1.2.1. The sample tasks should correspond to this categorization.

OS Software            Description
DataCleaner            Open Source data profiling tool
Talend Open Profiler   Open Source data profiling tool
Talend Open Studio     Open Source data integration tool
SQL Power DQguru       Open Source data cleansing & MDM tool
Febrl                  Open Source record linkage tool
Fril                   Open Source record linkage tool

Table 1.2.1: Overview of selected tools

The evaluation of sample tasks is implemented on the data source foodmart in two databases, MySQL and Apache Derby. Both the important steps of the implementation and the results of each evaluated tool should be presented in detail.

Ultimately, the results of the case studies should be presented. The conclusion should contain a description of the current situation in this area based on the results. Found problems should be discussed, and suggestions and an outlook for further research in this area should be given.


1.3 Structure of the Thesis

Figure 1.3.1 gives a graphical representation of the structure of this thesis.

First, in chapter 2, fundamentals in the area of data quality and DQM as well as Open Source software are introduced, which serve as the basis of this thesis.

Then, chapter 3 contains the criteria for evaluation, the tool selection procedure for the case studies and related information about the evaluation procedure, e.g. the data sources used and the technical environment.

Case studies are performed in chapter 4, which includes a detailed description of the evaluation process for each tool and the results obtained by executing the sample tasks.

Finally, chapter 5 concludes this thesis with a comparison of the tools in each category according to the specified criteria and speculates on open problems and future work in this area.

Figure 1.3.1: Thesis structure


Chapter 2

Foundation

2.1 Data Quality

2.1.1 Introduction to the Concept

Before data quality can be measured and improved, it should be defined in a meaningful and measurable way. Data quality can be addressed from different research perspectives, ranging from statistics and management to computer science. The concept of data quality was first introduced by statisticians in the late 1960s. Since the beginning of the 1990s, computer scientists have started to consider the importance of the quality of electronic data stored in databases, data warehouses and legacy systems [BS06].

A broad definition of data quality has been given as "fitness for use by information consumers" [Jur88]. It is often concerned with excellence, value, conformance to specifications or meeting consumer expectations [KSW02]. Information quality, as well as data quality, has been defined by the IAIDQ¹ as the fitness for use of information [EE10][PS05]. The term data quality is also used synonymously with information quality, whose meaning has a broader range. This thesis restricts the term data quality to a more product-based perspective and focuses on the data stored and processed in information systems. Before we look at the different dimensions of this multifaceted concept, the importance of data quality is discussed first.

Why is data quality important?

Today, data quality is attracting more and more attention in corporations because it plays an increasingly important role in the business world. If not treated well, poor data quality may lead to serious consequences.

On the one hand, the lack of data quality itself causes overhead. Companies are recognizing that the data in their information systems are generally of poor quality. In fact, the problem is already leading to substantial costs in some companies. The overhead mainly includes unnecessary printing, postage and staffing costs. Those costs accumulate slowly, but they amount to a steady erosion of an organization's credibility among customers and suppliers. As already mentioned in chapter 1, the cost to U.S. businesses in 2002 was more than 600 billion dollars.

¹The International Association for Information and Data Quality, see http://iaidq.org/

On the other hand, data quality plays an important role in the knowledge discovery process. In the past three decades, computer hardware technology has made great progress, leading to a large supply of more powerful devices at lower prices, e.g. computers, data collection equipment and storage media. The popularity of those devices in both the PC and the business market provides a great boost to the information industry, leading to the huge development of heterogeneous database systems and Internet-based information systems such as the World Wide Web (WWW) [HK01]. Effectively and efficiently analyzing the fast-growing, tremendous amount of data in those information systems becomes a challenging task. At the same time, companies are eager to extract valuable information from those data. The awkward situation has been described as data rich but information poor [HK01]. If such a tremendous amount of data is of poor quality, all the cost and effort spent on analyzing it will turn out to be a waste. Therefore, it is even more important to keep the data at a high quality level in larger information systems.

English (2009) has estimated that poor information quality can cost organizations from 20 to 35 percent of operating revenue, wasted in recovery from process failure and information scrap and rework. 122 organizations lost a total of $1.2 trillion in process failure and waste caused by poor-quality information [Eng09].

2.1.2 Data Quality Dimensions

Until recently, most people still equated the terms "data quality" and "data accuracy". Indeed, it is obvious that inaccurate data is of low quality. But data quality is more than just accuracy. In reality, there are many other significant dimensions such as completeness, consistency and currency. Although several research efforts have been carried out focusing on analyzing the problems and defining the dimensions, a precise and commonly accepted definition is still not available. There are generally three approaches to defining data quality dimensions: theoretical, empirical and intuitive (see [BS06], chapter 2). As typical examples, the theoretical approach that Wand & Wang proposed in 1996 considers mapping the real world (RW) to an information system (IS) [WW96]; the empirical approach that Wang & Strong proposed in 1996 relied on information consumer feedback to derive quality criteria and then classify them into categories. Altogether 15 different dimensions were selected by the authors, fitting into four general categories [WS96]. There is also no general agreement on the names of the dimensions, and in different proposals the meanings and criteria of dimensions with the same name may differ. But no matter which approach is adopted, the results tend to agree at some level. The common dimensions that most proposals include are: Accuracy, Reliability, Timeliness, Completeness and Consistency.

The data quality dimensions defined in this section are theoretically grounded and can also serve as a basis for further research and for the evaluation of the data quality tools. An integrated set of data quality dimensions is introduced based on three related works: the classification proposed by Larry P. English (1999) (see [Eng99], chapter 2), the data quality dimensions defined by Batini & Scannapieco (2006) [BS06] and the Semiotic Information Quality Framework proposed by Price & Shanks (2005) [PS05].


Inherent Data Quality

Dimensions in this category are based on the stored data's conformance to the real world it represents. It is similar to the concept of intrinsic data quality proposed by Wand & Wang (1996). A simpler word for this category that most people may use is accuracy. "Inaccuracy implies that the information system represents a real world state different from the one that should have been represented." (cited from [WW96]) But mapping the state of the real world is not that simple. Dimensions in this category are the most visible and the most important among all the dimensions; they are the foundation measure of the quality of data. If the data is erroneous, exploring the other dimensions of data quality makes little sense. There are different kinds of errors, and they should be distinguished from each other. For example, consider a database that contains names, addresses, phone numbers, dates of birth, etc. (see Table 2.1.1). Some records can be wrong, some can be missing and some can be obsolete. The typical dimensions in this category are:

Syntactic Accuracy

Syntactic accuracy is defined as the state in which a value conforms to its metadata. In other words, the data obeys the constraints described by the specified integrity rules, also called the database schema. The level of data quality in this dimension largely depends on the degree to which it is governed by schema and integrity constraints controlling permissible data values. There are generally two types of constraints: domain constraints and integrity constraints. A domain constraint requires that a value be included in a certain domain. For example, the first name of the third customer in Table 2.1.1 is Jck; it is very likely an error whose correct value is "Jack", because Jck is not an admissible value in the common first name domain. The birth_date value of the fourth customer is invalid, as 33 is not a valid value for a month. "Age should be between 0 and 130" is another simple example. This kind of error can be recognized by comparison functions. Integrity constraints can be considered as a more complex situation of domain constraints, where the domain is related to other attributes. For example, the marital status can only equal "married" when the age of the entity is at least 18, and the age should also conform to the date of birth. These situations are often referred to as consistency problems in other literature [BS06].

id  first_name  last_name  birth_date  address                 phone_number
1   Jean        Derry      1987-10-04  7640 First Ave.         583-555-4474
2   Michael     Spence     1971-04-06  5423 Camby Rd.          (0)616 556966
3   Jck         Baker      1975-03-27  7230 Berrellesa Street  904-555-8788
4   Babara      Marshall   1923-33-10  6202 Seeno St.

Table 2.1.1: Sample customer table
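To make the constraint checks described above concrete, the following Python sketch validates the records of Table 2.1.1 against a few simple domain and format rules; the small name lexicon and the phone pattern are illustrative assumptions, not taken from any of the evaluated tools.

import re
from datetime import datetime

# Illustrative reference data and rules (assumptions for this example only).
KNOWN_FIRST_NAMES = {"Jean", "Michael", "Jack", "Barbara"}
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def check_record(record):
    """Return a list of syntactic accuracy violations for one customer record."""
    violations = []
    if record["first_name"] not in KNOWN_FIRST_NAMES:
        violations.append("first_name is outside the name domain")
    try:
        datetime.strptime(record["birth_date"], "%Y-%m-%d")  # rejects month 33 etc.
    except ValueError:
        violations.append("birth_date violates the date domain")
    if record["phone_number"] and not PHONE_PATTERN.match(record["phone_number"]):
        violations.append("phone_number does not match the expected pattern")
    return violations

customers = [
    {"id": 3, "first_name": "Jck", "birth_date": "1975-03-27", "phone_number": "904-555-8788"},
    {"id": 4, "first_name": "Babara", "birth_date": "1923-33-10", "phone_number": None},
]
for customer in customers:
    print(customer["id"], check_record(customer))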

Semantic Accuracy

Semantic accuracy represents the correctness of the value as compared to the corresponding value of the entity in the real world. It can be recognized and measured only when the true value is already known, or when the true value can be deduced from the available information. For example, suppose the real birthday of the first customer in Table 2.1.1 is 1988-12-22. The stored value is syntactically correct, but it is not semantically accurate, because it does not represent the state of the real world. The measurement of semantic accuracy is more complex. One technique is to compare the data in different data sources: when data potentially represents the same entity in the real world, it may provide a reference for the correct information.

Non-redundancy

Problems with duplication are included in this dimension. Duplication in a data source means that more than one record or tuple in the data source represents the same real-world entity. In the customer example, each customer should be represented exactly once in the database. When there is another tuple with the same name, date of birth and gender, it is very likely a duplicate. Duplicates may also have minor differences because the real world state has changed or data has been typed in differently. Redundant data are apt to be generated when an integration process is performed. When duplication occurs, further costs such as additional mailing or data storage costs are added directly. Techniques such as record linkage are designed to detect such potential duplicates. The approach used to eliminate duplications is usually referred to as de-duplication.

Completeness

Completeness has been defined as "the ability of an information system to represent every meaningful state of the represented real world system" by Wand & Wang (1996) [WW96]. Data can be incomplete in different ways. In the case of relational data, there are normally three kinds of completeness: attribute completeness, tuple completeness and value completeness [BS06]. Attribute completeness is usually determined by the schema of the data source. If an attribute is missing but is necessary for applications, the data is incomplete regarding its set of attributes. This is usually a design problem. Tuple completeness is reflected by the number of missing records. Value completeness is commonly related to the presence of NULL. A value can be missing either because it exists but is unknown, or because it does not exist at all.
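As a rough illustration of how value completeness can be quantified, the short Python sketch below computes the share of present (non-NULL, non-empty) values for one attribute; the metric is a simplification assumed for this example.

def value_completeness(rows, attribute):
    """Fraction of rows whose value for the given attribute is present."""
    present = sum(1 for row in rows if row.get(attribute) not in (None, ""))
    return present / len(rows) if rows else 0.0

rows = [
    {"first_name": "Jean", "phone_number": "583-555-4474"},
    {"first_name": "Michael", "phone_number": "(0)616 556966"},
    {"first_name": "Babara", "phone_number": None},
]
print(value_completeness(rows, "phone_number"))  # 0.666... for this sample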

Pragmatic Data Quality

Data with a high level of inherent quality may not be valuable if they are not used. Therefore, data quality should not only represent the inherent quality of the data, but also the quality related to its application and usage, i.e. its degree of usefulness. As defined by English (1999) [Eng99], these qualities constitute the pragmatic data quality. Dimensions in this category are all related to data usage.

Generally speaking, concerning its usage, data should be accessible, secure, up to date, flexibly and suitably presented, relevant and valuable. Data accessibility and security often depend on the quality of the information system and the service it provides. Timeliness refers to the delay between the data stored in the information systems and the real world state that the data represents [WW96], which largely depends on the data update process. Format flexibility and suitability describe how easily the data format can be aggregated or converted into other formats aimed at different applications. The relevance and value of data depend on the real-world entity they describe. Precision and units of measurement should also be taken into account. For example, for the purpose of recognizing records in two data sources that represent the same real world entity (record linkage), the fields to be compared should be converted into the same format, so that they are comparable by machines. This may not be a data quality problem unless the data is used for record linkage; it requires data of high format flexibility and suitability.

2.2 Data Quality Management

In the previous section, the concept of data quality was introduced and different data quality issues were categorized into several dimensions. The question becomes how we can measure, assure and improve data quality. The whole process is generally referred to as data quality management (DQM). In this section, a short introduction to DQM and its general approaches is given. In order to deal with problems concerning data quality, the first thing is to recognize what problems exist in the data sources.

2.2.1 Recognize the Problems

Rahm & Do (2000) have analyzed and classified data quality problems [RD00]. The data quality problems can roughly be categorized into two groups: they are either single-source problems or multi-source problems. For each group, there are problems at the schema level or at the instance level (see Table 2.2.1).

                 Single-Source Problems           Multi-Source Problems
Schema Level     Poor schema design, lack of      Heterogeneous data models and
                 integrity constraints            schema designs
Instance Level   Data entry errors                Overlapping, contradicting and
                                                  inconsistent data

Table 2.2.1: The four classes of data quality problems [RD00]

Single-source problems refer to data quality problems that are present in a single data source. As stated by Rahm & Do (2000) [RD00], data quality in a single source largely depends on the quality of its data model, such as integrity constraints or schema design. Invalid data that cannot be avoided by schema or integrity constraints, e.g. misspellings during data entry or duplicates, are problems at the instance level. Multi-source problems arise especially when multiple data sources need to be integrated, such as during data warehousing processes. The reason is that redundant or inconsistent data due to differences in data representation, as well as duplication problems, are apt to increase significantly. Typical multi-source problems at the schema level are naming and structural conflicts.


2.2.2 General Approaches

There are various approaches that can solve the data quality problems stated above, such as data profiling, data cleansing, record linkage, monitoring, etc. However, in order to obtain and maintain a high level of data quality, data quality management should be more than occasionally implementing several separate applications. From the research perspective, the systematic DQM process consists of different phases, which form a cycle and should be implemented constantly. One proposal is the TDQM (Total Data Quality Management) Cycle (see Figure 2.2.1), which has been defined by Wang (1998) [Wan98]. It is based on the PDCA (plan-do-check-act) four-step process known as the Deming cycle, which is typically used in business process improvement. In the TDQM cycle, data is considered as an information product (IP). A short description of each component of the cycle is given as follows.

Figure 2.2.1: Cycle of TDQM methodology

Define The important DQ dimensions and the corresponding DQ requirements should be defined.

Measure This phase involves the measurement of data quality according to the dimensions and requirements defined in the first phase.

Analysis Root causes of the DQ problems should be identified and the impacts of poor quality information should be calculated.

Improve According to the results of the previous phases, DQ improvement techniques should be applied against the problems that influence data quality.

Additionally, two kinds of components should be embodied in a complete DQM program, namely proactive and reactive components.

Proactive components

Proactive components are approaches aiming at preventing low quality data from getting into the data source in the first place, i.e. eliminating or minimizing errors at the entry process. There are several approaches. In fact, this entails all administrative activities, such as the high-quality design of the information system including infrastructure and applications, the establishment and deployment of roles, responsibilities and policies, the processing and maintenance of data, training staff and so on. Furthermore, techniques such as a data entry system (DES) are very effective at reducing input errors. Typical tasks of a DES are data entry checks ensuring valid values for each field and duplicate checks ensuring non-redundancy.


Reactive components

While performing proactive techniques is effective at preventing a lot of erroneous and inaccurate data from getting into information systems, it is still not enough to assure high data quality. To deal with the data quality problems that already exist in data sources, reactive components should be employed. This component generally comprises detecting and removing errors and inconsistencies from data, solving the data quality problems discussed previously. Therefore, technologies have emerged to detect and solve data quality issues, such as data profiling, data auditing, data cleansing, data parsing, standardization and record linkage. The names may differ in different contexts or depend on individual preference. Those techniques are realized by software which is often referred to as data quality tools. There are already several Open Source solutions available on the market. The technologies for DQM that are widely adopted by those tools are one of the main topics of this thesis. A detailed description of those techniques is given in the next section.

2.3 Major Technologies

According to our analysis, the most widely used data quality technologies can be categorized into three groups: data profiling, data cleansing and record linkage. Other technologies are either not yet widely adopted by Open Source data quality tools or lack a common agreement, and some are related to data quality but not relevant enough. For all the stated reasons, the three major technologies are considered by the author as the most popular ones among the Open Source solutions.

Before we look at the approaches and technologies, data types are redefined for the convenience of describing the situations. Data types here should be distinguished from those in database systems; the latter focus on the physical properties of the data. The four types are Nominal, Numerical, Time-related and Others.

Nominal Nominal data refer to textual data that have an integrity domain (personal names, city names) or a certain pattern (postal code, ISBN, email). Many data profiling technologies trying to improve data quality in the syntactic accuracy dimension focus on this type of data. Note that due to the development of names, variations or new names have been introduced, but we still consider these Nominal data.

Numerical Numerical data here has a restricted meaning: only numbers that have mathematical meaning, in other words, numbers whose average or sum is meaningful.

Time-related Any date or time related data types are considered as Time-related data. Databases such as MySQL and Derby support date formats, so only valid values are allowed in the database. This can prevent syntactically inaccurate date data from entering information systems.


Others Although there are more interesting categories concerning data types, we refer to all other data types that do not belong to the three types above as Others. An example would be unstructured text such as commentary data or address data.

Data type sometimes depends on how the data is treated or profiled, e.g. standardizing address data is also applicable when the address data is treated as the Nominal type. When referring to this categorization of data types, the first letter is capitalized throughout the thesis.

2.3.1 Data Profiling

Data profiling has been described as the most important technology and recognized as the starting point for improving data quality, especially in terms of data accuracy [Ols03]. Data profiling is the process of examining the data available in existing data sources by collecting statistics and information in order to gain knowledge about the content, structure and quality of that data. With the information obtained by data profiling, data quality problems can be detected or recognized. Major data profiling approaches include metadata analysis, structure analysis, column analysis and data rule analysis.

Metadata Analysis

Metadata is generally defined as data about data [Cat97]. Metadata is also data; it can be stored in a repository or in other forms such as XML documents, report definitions, database descriptions, log files, connections and configuration files [WH06]. In the data quality context, metadata should define the qualification and standard for accurate data. According to Olson (2003) [Ols03], such metadata should contain schema definitions, business objects, domains, table definitions, rules, etc. This type of analysis is usually application-oriented, as it depends on different business cases. If the metadata is accurate and complete, the data profiling task becomes examining the data against the metadata. One of the data profiling tasks is to refine the metadata.

Structure Analysis

Compared with metadata analysis, this type of analysis focuses on the structure of a database, often referring to information about tables and columns in the sense of how they are related to each other. This information is often useful when an application involves more than one table or column. As an example of a typical business model, the information on customers, products and sales is often stored in three separate tables. In order to provide a complete purchase fact for a customer, the three tables have to work together (see Figure 2.3.1). This is achieved by primary keys and foreign keys in the database systems. Structure analysis intends to find and analyze such dependencies in a database.
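A minimal sketch of one such structure analysis check, assuming in-memory copies of a customer and a sales table: it reports sales rows whose foreign key has no matching primary key in the customer table, which would indicate a broken dependency.

def orphan_foreign_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key value has no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]

customers = [{"customer_id": 1}, {"customer_id": 2}]
sales = [
    {"sale_id": 10, "customer_id": 1},
    {"sale_id": 11, "customer_id": 99},  # refers to a customer that does not exist
]
print(orphan_foreign_keys(sales, "customer_id", customers, "customer_id"))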


Figure 2.3.1: Example of business database

Column Analysis

This type of analysis focuses on the values stored in a column and their properties independent of all other columns. Interesting directions are discussed as follows.

• Storage Property
Storage property of a column includes its physical data type (CHAR, FLOAT, DATE, ...) and its length.

• Pattern Distribution
Sometimes columns, especially of type String, have a specific character pattern. For example, the gender of a person may be represented by "M" indicating male and "F" indicating female, or by whole words, or simply by "1" and "2". The system should be consistent in representing the value. Another example is the Media Access Control (MAC) address²: according to IEEE 802³, the standard format for printing MAC-48 addresses should appear like 00-21-6b-67-ce-54 or 00:21:6b:67:ce:54, i.e. six groups of two hexadecimal digits, separated by hyphens (-) or colons (:). Any other format of presenting the MAC address can be recognized as an inaccurate or non-standard presentation. In this case, the profiling analysis should be able to detect all such inaccurate formats. This is done by validating each value against a specific rule, e.g. a regular expression (Regex), which provides a flexible way of matching Nominal or Time-related data. Another kind of analysis is to find out all patterns used for presenting the same data, in order to design a solution for making the format consistent.

Regex is short for Regular Expression and provides a concise and flexible means for matching strings of text concerning content or structure. It is possible to use regex to define particular characters, words, patterns of characters and numbers, or their order in the text content. Some popular syntax standards are PCRE (Perl Compatible Regular Expressions) and POSIX Basic Regular Expressions. As an example, a regex that matches valid email address formats may appear as follows⁴.

Regex

[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}

²A MAC address is a unique identifier assigned to most network adapters or network interface cards (NICs) by the manufacturer for identification.
³A family of IEEE standards for local area networks (LAN) and metropolitan area networks (MAN).
⁴See http://www.regular-expressions.info/ for more information about Regex.
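The following Python sketch illustrates the second kind of analysis mentioned above: each value is reduced to a character pattern ('A' for letters, '9' for digits, other characters kept) and the patterns are counted. The pattern alphabet is an assumption chosen for this example and differs in detail between tools.

from collections import Counter

def character_pattern(value):
    """Map letters to 'A' and digits to '9', keeping all other characters."""
    return "".join("A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value)

phone_numbers = ["583-555-4474", "904-555-8788", "(0)616 556966"]
print(Counter(character_pattern(value) for value in phone_numbers))
# Counter({'999-999-9999': 2, '(9)999 999999': 1})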

• Range and Value Distribution
These two aspects are often considered together for Numerical or Time-related data. Some values already have a specific range, and the task is to validate all values against that range. The task could also be to determine how values are distributed over certain ranges. More sophisticated analysis in this direction, which intends to derive business information, is often referred to as data analysis techniques such as data mining, as distinguished from data profiling techniques.

• Domain Constraint
While ranges relate to Numerical or Time-related data, a domain constraint here refers to a group of String values having a certain domain, which are considered Nominal data. For example, there are 50 valid values for the 50 state names of the USA; any values that do not match those 50 valid ones are considered irregular or incorrect data.

• Null Conditions
In databases, a NULL value may indicate that the value is not applicable, or applicable but not provided, or that it is just not clear whether it is applicable or not [Ols03]. Sometimes other string values like "none", "\", "?" may indicate the same meaning as NULL. Such NULL indicators are to be detected and unified.

• Duplication and Uniqueness
In the context of column analysis, duplications refer to identical values, i.e. values that are exactly the same. The analysis should also find some simple statistics like the number of duplicates, distinct values and unique values.

• Data Quality Rule Analysis
Data quality rule analysis intends to find violating combinations of values across multiple columns according to a data rule. The columns can come from a single table, multiple tables or even different data sources. When a cross-table validation is applied, a certain join condition should be defined beforehand. A data rule may be simple or complex; it may contain further sub-rules and involve several data sources. A data rule example is shown below.

Rule Clause

IF YEAR_OF_BIRTH = ’1978’ THEN AGE = ’33’
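A data rule like the one above can be checked mechanically. The Python sketch below encodes the rule as a predicate over a row and reports the violating rows; both the encoding and the sample rows are assumptions made for illustration.

def violates_rule(row):
    """IF YEAR_OF_BIRTH = '1978' THEN AGE = '33', encoded as a violation test."""
    return row["YEAR_OF_BIRTH"] == "1978" and row["AGE"] != "33"

rows = [
    {"id": 1, "YEAR_OF_BIRTH": "1978", "AGE": "33"},
    {"id": 2, "YEAR_OF_BIRTH": "1978", "AGE": "31"},  # violates the rule
    {"id": 3, "YEAR_OF_BIRTH": "1985", "AGE": "26"},  # rule does not apply
]
print([row["id"] for row in rows if violates_rule(row)])  # [2]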

2.3.2 Data Cleansing

After data profiling, the target data have been explored and problems or noisy data should have been identified. The question becomes how to deal with the found anomalies. This activity is often referred to as data cleansing. Other terms like data cleaning, data scrubbing or reconciliation may have the same meaning or slight differences in various contexts. There is no clear or settled definition for data cleansing [MF03][MM00]. Here we use this term throughout the thesis with a narrowed meaning, referring to the act or process of correcting or eliminating erroneous or inaccurate data from the data sources. Typical approaches for data cleansing are data transformation and integrity constraint enforcement [MF03].

Data Transformation

Generally speaking, data transformation is the process that transforms data from their given format into another format which is expected by the application. Sometimes the term data standardization is also used for transforming data into a standardized format. In the data warehousing field, data cleansing is applied before the integration of data sets in order to guarantee the consistency of data presentation [HK01]. Therefore some ETL tools provide data quality components which deal with data quality issues specifically.
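As a small sketch of such a transformation, the Python code below standardizes date values arriving in several formats into ISO 8601; the list of accepted input formats is an assumption for this example, while real tools let the user configure such mappings.

from datetime import datetime

INPUT_FORMATS = ("%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d")  # assumed input formats

def standardize_date(value):
    """Convert a date string in any accepted format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unparseable values for manual inspection

print([standardize_date(v) for v in ["04.10.1987", "3/27/1975", "1971-04-06"]])
# ['1987-10-04', '1975-03-27', '1971-04-06']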

Integrity Constraint Enforcement

Integrity constraint enforcement is the process of eliminating integrity constraint violations or the process of providing useful information that can be used for this action [MF03]. A typical approach is to purge outdated or unrecoverable records. Sometimes it is not easy for the machine to tell if a value should be deleted or not. Those values are marked so that further action can be applied.

2.3.3 Record Linkage

Record linkage refers to the task of finding syntactically distinct data entries that refer to the same entity in the real world. There are many other names that may also refer to record linkage, e.g. entity reconciliation, the merge / purge problem and so forth. The name may differ between research and user communities [Chr08a][GBVR03]. The concept was first introduced by Halbert L. Dunn in 1946; the idea was to assemble each person's records of the principal events in life into a volume [Dun46]. While individuals are the initial entities of record linkage, entities of interest now include companies, geographic regions, families and households. Record linkage is a useful tool when performing data mining tasks, especially when the data originate from different sources. This technology is used in an extensive range of applications, e.g. in customer systems for marketing, customer relationship management (CRM) and data warehousing, as well as in the health and clinical area and in government administration [GBVR03]. In many research projects it is also necessary to collect and examine information about an entity from more than one data source. Those sources often do not use a unique identifier or key.

Background

If two data sets of records (A and B) are to be linked, each record from A is to be compared with each one from B, generating |A| × |B| record pairs, with |X| denoting the number of records in the data set X. Based on a probabilistic approach proposed by Fellegi and Sunter (1969) [FS69], there are three possible decisions for each record pair, namely M (match), U (unmatch) and P (possible match), which is determined by a comparison vector. The vector contains the coded matching rule and represents the matching level between the record pair.

Process Model

The process model of record linkage which is widely adopted by record linkage tools contains common components shown in a flow diagram (see Figure 2.3.2), which is based on the TAILOR system [MEE02].

Figure 2.3.2: A sample flow diagram of the record linkage process model

Blocking / Searching

It is often expensive or sometimes even infeasible to compare all pairs when the data sets contain a large number of records. For this reason, indexing or filtering techniques known as blocking / searching methods [Chr08a] [JLX+08] are employed to perform a preselection. These methods are often implemented as the first step of a record linkage process (apart from standardization, which is considered separately in this thesis). Record linkage tools often provide more than one algorithm to accomplish this task.

One typical example is the sorted neighborhood method [HS95]. The idea behind it is to sort the records over the relevant attributes (keys), bringing matching records close together, and then to compare only records within a small neighborhood of the sorted list, i.e. scanning the sorted records with a fixed-size window. Another algorithm is called Blocking as Preselection, which is based on quickly computed rejection rules [NM01].
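A minimal Python sketch of the sorted neighborhood idea, assuming records are plain dictionaries and the sorting key is simply the surname plus the first initial: the records are sorted by the key and only pairs inside a fixed-size sliding window are passed on for detailed comparison.

def sorted_neighborhood_pairs(records, key_func, window=3):
    """Yield candidate pairs restricted to a sliding window over the sorted records."""
    ordered = sorted(records, key=key_func)
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:i + window]:
            yield left, right

records = [
    {"first_name": "Jean", "last_name": "Derry"},
    {"first_name": "Jaen", "last_name": "Derry"},   # likely a duplicate with a typo
    {"first_name": "Michael", "last_name": "Spence"},
    {"first_name": "Jack", "last_name": "Baker"},
]
sort_key = lambda r: (r["last_name"], r["first_name"][:1])
for a, b in sorted_neighborhood_pairs(records, sort_key, window=2):
    print(a["first_name"], a["last_name"], "<->", b["first_name"], b["last_name"])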

Matching and Comparison

After the filtering process, the remaining, much smaller set of record pairs is compared using comparison functions. Those functions determine the way two fields (or attributes) of each record pair are compared. This is the core and major research area of record linkage technology. A large number of comparison functions are now available for different types of data, e.g. strings, numbers, dates etc. Researchers are searching for more targeted methods, such as comparison functions aimed specifically at names or addresses, as these have their own characteristics concerning comparison [Chr06]. Functions for numerical values and dates are often based on mathematical operations. All the comparison functions return a raw similarity value, for example a number between 0 and 1, indicating a total non-match and an exact match respectively, which is also referred to as a matching weight [GBVR03]. A good command of the comparison functions helps to improve the quality and performance of the record linkage process effectively. A brief introduction to some commonly used string comparison methods⁵ is given below (for more information please refer to [CRF03] [GBVR03] [MEE02]); a short code sketch of two of these measures follows the list.

Exact match: This straightforward function returns 1 if the two strings are exactly the same and returns 0 if they are not.

Levenshtein distance: The Levenshtein distance between two strings is defined as the minimum number of edit operations (insertion, deletion, or substitution of a single character) needed to convert one string into the other.

Jaro metric: This string comparator was introduced by Jaro (1989) [Jar89] and is based on the number and order of the common characters between two strings. According to the record-linkage literature, good results have been obtained using this method or its variants [CRF03].

Winkler metric: A variant of the Jaro metric, with modifications made to the original function. It introduces the length of the longest common prefix of the two strings into the function [PWC97]. The Jaro and Jaro-Winkler metrics are normally intended for comparing short strings, e.g. names of persons.

Q-grams distance: Input strings are first divided into q-grams, e.g. the string Smith is divided into { Smi, mit, ith, th, h } with q=3. The comparison result is obtained by comparing the divided substrings. This distance minimizes errors due to switching substrings in a long string. It can help to detect approximate string joins efficiently [GBVR03].
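The Python sketch below gives straightforward, not performance-optimized implementations of two of the measures above: the Levenshtein distance via dynamic programming and a simple q-gram similarity based on the overlap of 3-grams (without padding). Normalizing the q-gram result to a value between 0 and 1 follows the convention mentioned earlier; the exact normalization is an assumption of this example.

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def qgram_similarity(a, b, q=3):
    """Overlap of the two q-gram sets, normalized to the range [0, 1]."""
    grams_a = {a[i:i + q] for i in range(max(len(a) - q + 1, 1))}
    grams_b = {b[i:i + q] for i in range(max(len(b) - q + 1, 1))}
    return len(grams_a & grams_b) / max(len(grams_a | grams_b), 1)

print(levenshtein("Jonathan", "Jonathon"))                 # 1
print(round(qgram_similarity("Jonathan", "Jonathon"), 2))  # 0.5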

Classification and Decision Models

The task of this step is to make a decision for the compared record pairs. As discussed in the probabilistic framework of Fellegi and Sunter (1969) [FS69], for each pair of records there are three possible results, namely a match, a non-match and a possible match. This is done by forming all the calculated matching weights into a comparison vector. A comparison vector is usually defined by assigning a weight to each comparison function. These vectors are then used to determine and classify record pairs into the three groups.
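A minimal sketch of this decision step, assuming each record pair has already been reduced to a comparison vector of field similarities and each field has been given a weight; the two thresholds separating matches, possible matches and non-matches are arbitrary values chosen for illustration.

def classify(comparison_vector, weights, upper=0.85, lower=0.60):
    """Combine field similarities into one score and map it to M, P or U."""
    total = sum(weights.values())
    score = sum(weights[f] * comparison_vector[f] for f in weights) / total
    if score >= upper:
        return "M"  # match
    if score >= lower:
        return "P"  # possible match, typically passed on for clerical review
    return "U"      # non-match

weights = {"last_name": 0.5, "first_name": 0.3, "birth_date": 0.2}
vector = {"last_name": 1.0, "first_name": 0.8, "birth_date": 0.5}
print(classify(vector, weights))  # 'P' for this example (score 0.84)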

Merging Techniques

This technique is not yet widely adopted by Open Source record linkage tools, as it involves a much more sophisticated decision-making method and is often considered application-oriented. If a record pair is identified as a match or a possible match, choices have to be made in order to combine the two records if some of the attributes do not share the same value. The goal is to select the value which is more likely to be correct when possible. It remains a potential research field of record linkage technology.

⁵The terminology of the methods may differ in different contexts.


2.4 Open Source Software

2.4.1 What is Open Source?

Open Source, as the name may indicate, describes the state that the source code is freely and publicly available. Open Source software (OSS) generally refers to computer software whose source code is available and that permits users to study, change and improve the software under certain conditions. The concept of Open Source software is easily confused with freeware, where free means there is no payment involved. However, they are two distinct concepts: freeware, which is free of charge, is usually closed source [Hin07]. According to the Open Source Definition (OSD) proposed by the Open Source Initiative (OSI), there are 10 criteria with which an Open Source software must comply (see [Ini10]). The basic idea behind it is simple. On the one hand, from a programmer's perspective, as stated by the OSI: "When programmers can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, people fix bugs." Therefore, better software may be produced compared with the traditional closed model. On the other hand, from a user's perspective, this model eliminates or largely reduces the cost of using software. Meanwhile, Open Source has also become a controversial topic. Apart from all the benefits, as the term Open Source becomes more and more popular, it is also used by some vendors as a marketing strategy. There are also arguments that suggest Open Source may have a bad influence on the software market.

Open Source software projects are built and maintained by a network of volunteer programmers. Open Source software has already become a very important part of the software industry. The Apache HTTP Server, the GNU / Linux operating system and the internet browser Mozilla Firefox are some of the most popular and successful products in the Open Source world.

Open Source Licenses

An Open Source license is a license that makes the source code and the Open Source product available, especially for purposes concerning redistribution or bringing Open Source products into the commercial world. Some popular Open Source license examples are the GNU General Public License (GPL), the Lesser General Public License (LGPL), the Apache License 2.0, the Mozilla Public License 1.1 (MPL) and the Eclipse Public License [Ros05]. There are several restrictions that an Open Source license may include [Lau04] [Rus10]. One concerns the term "copyleft": a copyleft license requires all modified or extended versions of the software to be Open Source as well. The strictness of copyleft may differ; for instance, "copyleft" in the GNU GPL is stronger than in the LGPL, as the latter permits linking with non-free modules [FSF10]. Another aspect is the compatibility with other Open Source licenses, e.g. whether the Open Source software is allowed to be linked with code using, or to be released under, different licenses. For example, Apache License 2.0 is compatible with GPL 3.0, while Apache License 1.0 is not. The OSI is a popular organization which reviews and approves Open Source licenses based on its OSD. There are also other approvals by different organizations based on their own criteria, such as Free Software Foundation (FSF) approval and Fedora Project approval. The restrictions of the same license may also differ between its versions.


2.4.2 Open Source Models

Commercial Open Source Software

The use of dual licensing has been adopted by many Open Source software vendors. This model enables software providers to offer the software under an Open Source license and under separate proprietary license terms as well. Under this model, the software product is often released in two editions. The community edition, typically coming with fewer features, meets the user's basic needs and is made available to the community, which means it is completely free. The enterprise edition usually includes enhanced features as well as support, training services, documentation, consulting and more. Open Source projects using this model usually also have a dedicated software development team.

This hybrid kind of licensing makes it possible for customers to try out the software under an Open Source license before they decide whether it is worth buying [Let08]. Compared to traditional closed source software, this model provides more transparency of the software and to some extent ensures its quality. Tools intended for business purposes are often offered under this kind of business model, because business use requires higher quality and stability of the software, as well as technical support and services.

Free and Open Source Software (FOSS)

Free and Open Source Software is often regarded as completely free for usage, i.e. no enterprise solution is available and no additional functionality or other kinds of services are reserved. The software may be licensed under one or a combination of Open Source licenses [Ros05]. Multiple Open Source licensing provides different options for developers regarding redistribution.

2.4.3 Why Open Source?

As mentioned before, being Open Source may have the potential to produce better software. In fact, it offers much more for users. Some of the benefits are listed below.

Openness and Transparency

Take record linkage tools as an example: as stated by Christen (2008), most of the commercial data integration and linkage systems are software "black boxes" from the user's perspective [Chr08a]. The details of the software or the technology are "hidden", i.e. users only know how to use the software; they do not know why it works or what technologies are included. Furthermore, many of the commercial systems are specialized in a certain domain, targeting certain tasks such as business data integration or customer data de-duplication. This is the situation for most closed source software. Some new technologies may be widely used in such software, but it is difficult for users to learn about them. Open Source makes it possible for users to look inside the software. This may be very useful for training purposes and for skilled software developers.


Flexibility

Flexibility in the Open Source context is reflected by the ability to redistribute, which means users may modify Open Source software to their own needs or integrate it into their own systems. Of course, the activity of redistribution should obey the terms of the license under which the software is published. Some Open Source software offers well documented code and an API (application programming interface), which makes it easier to understand the software and make changes to it.

Cost

Using Open Source software greatly reduces the cost of the software product itself. FOSS is completely free. Although Open Source software vendors try to sell their services, the cost is much lower compared to the traditional closed model. Commercial software suites for business applications are sometimes very expensive, and this is felt even more strongly by small and medium-sized companies. If the required functions can also be realized by Open Source software, e.g. in the field of Business Intelligence (BI), many more companies will benefit from low-cost BI solutions [Hin07].

At the same time, Open Source software also brings problems, e.g. less long-term support and quality assurance. At this point the commercial Open Source software model may compensate for those shortcomings.

Market Overview: Open Source DQ Tools

Open Source DQ tools are often related to Open Source business intelligence (BI) and data integration, which have already achieved significant progress over the last several years. The development of Open Source data integration seems to have an impact on new entrants in the data quality market. Although DQ tools have been described as the next target in the Open Source data management space, their capabilities are still far behind those of established commercial vendors according to T. Friedman's report in this area [BF09].

The current market of Open Source tools related to data quality is heterogeneous. There are vendors under the commercial Open Source model, like Talend and SQL Power, which offer not only data quality solutions but also integration or reporting solutions. There are also smaller pure Open Source projects such as DataCleaner and Fril. According to the major technologies of DQM discussed in section 2.3, we decided to concentrate on four groups of data quality tools: data profiling tools, ETL (extract, transform, load) tools, data cleansing tools and record linkage tools. There are other available Open Source tools which are also related to the data quality area, such as data mining tools, BRMS (Business Rule Management Systems), auditing tools and so forth. For the reason stated in section 2.3, no details are provided for them.


Chapter 3

Methodology

The case studies of Open Source data quality software are based on the three major technologies introduced in section 2.3 on page 10. The evaluation methodology is discussed in this chapter.

Software evaluation is often considered an extensive and complex task, since there are miscellaneous perspectives and quality factors within this field. A comprehensive evaluation procedure should not only include systematic testing in a scientific way, but also reflect users' opinions of the software, e.g. through surveys. The evaluation criteria of this thesis, however, concentrate on two aspects, namely the Open Source qualification and the data quality functionality. The evaluation of other aspects such as flexibility, reliability and stability requires a professional evaluation procedure and is therefore less of a focus. Nevertheless, the qualification regarding such aspects may still be indicated indirectly in the evaluation result. Due to the inherent variety of the selected tools, this methodology also contains an introduction to their individual characteristics and working principles.

Note that for tools using the commercial Open Source model, which means that a commercial version of the software may also be available (see section 2.4.2 on page 18), all evaluation is restricted to the community version.

3.1 Evaluation Criteria

3.1.1 Open Source Qualification

For the evaluation of the Open Source qualification, we follow the criteria proposed by van den Berg (2005) [vdB05], which were set up by collecting and synthesizing the available Open Source software evaluation models. Modifications have been made to suit the conditions of this thesis. The approaches include visiting the project's website and observing its activities as well as reading related documentation or articles. The evaluation of the Open Source qualification focuses on the following aspects: license, release activity, community, documentation and user support.


License: As mentioned in section 2.4.1 on page 17, the Open Source license plays an important role in the Open Source culture. It provides the flexibility concerning modification and redistribution as well as restrictions on them.

Release Activity: The update frequency of software can reflect its liveliness to some extent. A new release may improve the software by fixing bugs and adding new features. All release activities together form the road of progress of the software. This is assessed by checking the release activities and the change log that are available for the software. The release date of the latest version referred to in this thesis is up to June 15, 2010.

Community: Another thing special to Open Source software is its community, which consists of the people who participate in the project. It is the community that does most of the testing and maintains the Open Source project. The community's activity is often expressed through discussions in forums, so visiting the project's forum is one way to evaluate the community. The community forum is also the place where users get help, report bugs or receive other kinds of support.

Documentation and User Support: There are usually two types of documentation. The first is called user guide or manual, which tells users how to install and use the software. Additional materials like webinars (Web-based seminars) or on-line tutorials also belong to this type. The other type of documentation concerns the development of the software. The most commonly seen form is the API (application programming interface), which, similar to the Java API, may include specifications of routines, data structures, object classes, and protocols.

3.1.2 Data Quality Functionality

The evaluation of the functionality criteria is achieved by two approaches — working principles and general functionality, and sample task evaluation. This criterion is divided into several subcriteria. The result of the evaluation of the functionality criteria contains two parts: the tool's own qualification revealed by the given description, and the comparison of the tools according to the subcriteria.

Approaches

• Working Principles and General Functionality
This part of the evaluation relies on the documentation as well as a general-purpose test of each evaluated tool. The main purpose is to give a description of each tool regarding its core features and major functionality as well as some working principles. It is difficult to compare the characteristics of the tools directly. This part therefore intends to give readers an overview and a general idea of the way of working and the capabilities of the evaluated tool.

• Sample Task Evaluation
This evaluation focuses on the real usage of each tool. Each case study contains one or several sample tasks.


The important steps of the configuration are presented by screenshots and descriptions. The same tasks are executed by the tools of each technology, so the performance of the tools can be compared on the same basis. Detailed information about the sample task evaluation is given in section 3.4 on page 25.

Subcriteria

Ease-of-use: The ease-of-use criterion may be reflected in many ways, for example, how easy it is to build a connection to the target data set or to configure the tool for the tasks. This subcriterion can also reflect the user-friendliness or usability based on the software design. Note that the evaluation of this criterion is mainly based on the author's experience with the evaluated tool.

Flexibility: We evaluate the flexibility by examining three aspects. First is the operational flexibility in terms of the required running environment of the evaluated tools. Second is the input and output flexibility — the restrictions on the input file format and the output capabilities; they may have a large influence on the scope of application and the further usage of the tool. The last one is the user-tunable ability: a tool that provides more configuration options for different tasks is considered more flexible.

Functionality: The functionality qualification of each tool reflects the author's opinion based on the evaluation experience. The range of functionality provided by each tool and its powerfulness are taken into consideration.

Performance: The evaluation of this subcriterion is based on the sample task evaluation result obtained by each sample tool.

3.2 Tool Selection

Selection Criteria

As mentioned previously, there are a number of Open Source tools available that are related to the data quality subject. The selection process is mainly based on the Internet — web searching supplemented by information provided in related articles. On the one hand, there is not much related research in this field, and it is usually not available free of charge. On the other hand, one of the most important qualifications of Open Source software is its openness and transparency to users, and the most popular access channel is the Internet. We collect all potential candidates, examine them one by one, and then eliminate the inappropriate ones. The criteria for selection are specified below.

• Open Source Qualification

1. The software and its source code should be available. This criterion is set because we found that even though some tools claim to be Open Source, their code or their products turn out not to be publicly available. Those tools do not match the basic requirement for Open Source software and are therefore eliminated.


2. The latest release of the software should not be too long ago. Specifically, a candidate whose latest release date is more than two years ago is eliminated, because it is considered to be no longer under development or no longer active.

• Functionality
The software must have functionality that is relevant to at least one of the three technologies discussed in section 2.3 on page 10. Note that this is regardless of whether the software's original purpose or main functionality is targeted at data quality or not. For instance, Talend Open Studio is an ETL tool, but it is also acceptable because it has data cleansing functionality.

Selection Result

The full candidate list and detailed information about the elimination can be found in Appendix A. As a result, 6 out of 10 tools are selected after the selection procedure. DataCleaner and Talend Open Profiler (TOP) are Open Source data quality tools that focus on data profiling. SQL Power DQguru, formerly called SQL Power MatchMaker, is a data cleansing & MDM tool. Talend Open Studio (TOS) is an Open Source ETL (extraction, transformation and loading) product which also provides data quality components; in this thesis, it is treated as a data cleansing tool. TOP and TOS are both from Talend, an Open Source software vendor providing data solutions; they are two tools with different functionality. Fril and Febrl are two Open Source record linkage tools. The selection result and the version of each tool installed for evaluation are shown in Table 3.2.1.

Software              Version  Home Page
DataCleaner           1.5.4    http://datacleaner.eobjects.org/
Talend Open Profiler  4.0.1    http://www.talend.com/
Talend Open Studio    4.0.1    http://www.talend.com/
SQL Power DQguru      0.9.7    http://www.sqlpower.ca/page/dqguru
Fril                  2.1.4    http://fril.sourceforge.net/
Febrl                 0.4.1    http://sourceforge.net/projects/febrl/

Table 3.2.1: Selection result

3.3 Technical Environment and Data Sources

3.3.1 Technical Environment and Database Systems

• Operating system: Windows XP SP3 Professional
• Java(TM) SE Runtime Environment: build 1.6.0_20-b02
• Python version installed: 2.5.4
• Versions of database systems: MySQL 5.1, Apache Derby release 10.5.3.0


Short description of the two database systems

Both MySQL and Apache Derby are Open Source relational database management systems (RDBMS). Although Apache Derby is not as popular as MySQL, it is commonly considered a popular embedded Java database and is noted for its small footprint and its installation-free deployment. The community version of MySQL is licensed under GPL 2.0, while Apache Derby is licensed under Apache License 2.0. We use the client/server mode of the Derby database. JDBC (Java Database Connectivity) is used for all connections to both database systems, although ODBC might also be supported under some circumstances. The two JDBC drivers used are com.mysql.jdbc.Driver and org.apache.derby.jdbc.ClientDriver, respectively. The ability to connect to the two database systems and run tasks against them may reflect the compatibility and flexibility of the software to some extent, although the differences in the results may be minor. Unless specified otherwise, the same performance in the sample task evaluation (see section 3.4) is indicated.
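For illustration, the following minimal JDBC sketch (not part of any evaluated tool) connects to both database systems with the two drivers named above; the URLs, ports, database names and credentials are placeholders that have to be adjusted to the local setup.

import java.sql.Connection;
import java.sql.DriverManager;

public class ConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Load the two JDBC drivers used in the evaluation.
        Class.forName("com.mysql.jdbc.Driver");
        Class.forName("org.apache.derby.jdbc.ClientDriver");

        // Placeholder URLs and credentials for the foodmart database.
        try (Connection mysql = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/foodmart", "user", "password");
             Connection derby = DriverManager.getConnection(
                     "jdbc:derby://localhost:1527/foodmart")) {
            System.out.println("MySQL connected: " + !mysql.isClosed());
            System.out.println("Derby connected: " + !derby.isClosed());
        }
    }
}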

3.3.2 Data Source for Sample Task Evaluation

Two data sources are used for the evaluation. All profiling and cleansing tasks are performed against both the MySQL and the Apache Derby version of the foodmart database. Record linkage tasks are performed against CSV files of sample customer tables. A short description is given below.

Foodmart database

This database is available on the Internet in both a MySQL and an Apache Derby version. The data set is artificially created and hence has almost no real meaning; it contains very little "valuable data" with the necessary data quality problems. Instead, it is used as a platform: modifications have to be made for testing purposes — erroneous data or anomalies are added to the database.

Customer table CSV File

cust_id  fname   lname    str  address1        city        postcode  birthdate  ...
31810    Jamse   Sorella  27   Alberrga Str.   Alphington  5114      19760524   ...
31811    James   Sorella  27   Alberga Street  Alphingten  5114      19260524   ...
33391    Thomas  Miles    5    Mainoru Place   Red Hill    3250      19360107   ...
33390    Tom     Miles    5    Manoru                      8750      19860707   ...
33391    Thomas  Mils          Mainoru Place   Red Hlll    3250      19860107   ...

Table 3.3.1: Sample duplicate data

The customer data used for the evaluation are modified versions of the sample data files provided with the record linkage tool Febrl, which are available along with the software. All data are artificially generated fake data, created from 500 original records which stand for 500 different real-world entities. Each record has 2 to 4 duplicates, which may differ from the original record in various ways — similar values in the same column, missing information, etc.


Table 3.3.1 shows a portion of the data set as an example.

3.4 Sample Task Evaluation

Several functions of data quality tools are picked out to build several sample tasks. As mentioned previously, the purpose is to compare the configuration procedures and the results on the same basis. The approach is to provide screenshots of the important steps and of the result, together with descriptions of them.

As mentioned in section 3.3.2, the testing data set has almost no real meaning. So rather than focusing on the improvement of real data quality, we focus on the realization of the respective data quality functions. Therefore, the tasks are usually separated from each other instead of forming a complete systematic DQM process. The tasks should be simple to implement and functionality-oriented. For example, when we implement record linkage tasks, we do not focus on data transformation and standardization, which are discussed in detail elsewhere, although the tools have such functionality and data standardization is usually performed before record linkage. Instead, the two data sets to be linked are already transformed into a consistent format. The performance, results and problems of the implementation are presented after the sample task evaluation.

3.4.1 Data Profiling Tasks

Task 1: simple column statistics

• Key words: value frequency analysis, simple statistics
• Related DQ dimension: non-redundancy, completeness
• Potential target data types: Nominal, Numerical, Time-related, Others
• Database information about the testing related data:

Table     Column    Type     Size  Nullable
customer  address1  VARCHAR  30    true

Description: This task is a simple profiling operation. The goal is to provide general statistical information about the target data set based on its number of records and values. Data profiling tools should be able to determine statistics about the data set including, e.g., the number of rows, the number of null values, the number of distinct and unique values, the number of duplicates, and the number of blank fields.
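As an illustration of what these statistics amount to on the database side, the following sketch computes the row, null and distinct counts for customer.address1 with plain SQL over JDBC. It is not taken from any of the evaluated tools, and the connection URL and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SimpleStatisticsSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/foodmart"; // placeholder
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             // COUNT(address1) counts only non-null values, so the difference
             // to COUNT(*) is the number of null values in the column.
             ResultSet rs = st.executeQuery(
                     "SELECT COUNT(*) AS row_count, "
                   + "COUNT(*) - COUNT(address1) AS null_count, "
                   + "COUNT(DISTINCT address1) AS distinct_count "
                   + "FROM customer")) {
            if (rs.next()) {
                System.out.println("rows:     " + rs.getLong("row_count"));
                System.out.println("nulls:    " + rs.getLong("null_count"));
                System.out.println("distinct: " + rs.getLong("distinct_count"));
            }
        }
    }
}

Unique and duplicate counts additionally require a grouped subquery, similar to the indicator SQL template shown in section 4.1.2.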

Task 2: domain integrity

• Key words: domain integrity, attribute domain constraint
• Related DQ dimension: syntactic accuracy
• Potential target data types: Nominal


• Database information about the testing related data:

Table     Column          Type     Size  Nullable
customer  state_province  VARCHAR  30    true

Description: The purpose of this task is to evaluate each data profiling tool's capability of validating domain integrity. This aspect is also regarded as an attribute domain constraint. A set of data, e.g. the provinces of a country, colors or even people's names, may have a certain value domain. Any value outside this domain is considered irregular or incorrect. In the foodmart database, customers come from three countries: the USA, Canada and Mexico. Values in the column state_province should therefore be included in three domains with altogether 106 valid values (a minimal lookup sketch is given after the list):

1. The United States Postal Service state abbreviations: 62 distinct values [Ser10].

2. Canadian Provinces and Territories Names and Abbreviations: 13 distinct values [Pos10].

3. Political divisions of Mexico including 31 distinct values.
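The following minimal, tool-independent sketch illustrates such a domain (dictionary) check; the listed abbreviations are only a small, illustrative subset of the 106 valid values, and the sample column values are invented.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DomainIntegritySketch {
    public static void main(String[] args) {
        // Small subset of the valid domain (US states, Canadian provinces, Mexican states).
        Set<String> validValues = new HashSet<>(Arrays.asList(
                "CA", "OR", "WA", "BC", "ON", "QC", "DF", "Yucatan", "Zacatecas"));

        // Illustrative column values to be validated against the domain.
        List<String> stateProvince = Arrays.asList("CA", "BC", "Mexcio", "WA", "");

        for (String value : stateProvince) {
            String verdict = validValues.contains(value) ? "valid" : "outside domain";
            System.out.println("'" + value + "' -> " + verdict);
        }
    }
}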

Task 3: pattern analysis

• Key words: pattern conformance, consistency of presenting data
• Related DQ dimension: Syntactic Accuracy, Pragmatic Data Quality
• Potential target data types: Nominal, Time-related
• Database information about the testing related data:

Table     Column  Type     Size  Nullable
customer  phone1  VARCHAR  30    false

Figure 3.4.1: Sample erroneous data

Description: Some data are presented in different patterns or formats. For the purpose of a consistent presentation of the data, they need to be unified. A portion of sample erroneous data is shown in Figure 3.4.1. The purpose of this task is to identify all patterns that exist in the data set, in order to devise a solution for unifying the presentation of phone numbers. The follow-up data standardization task is designed as Task 5.


Task 4: DQ rule analysis

• Key words: data quality rule analysis
• Related DQ dimension: Semantic Accuracy, Pragmatic Data Quality
• Potential target data types: Nominal, Numerical, Time-related, Others
• Database information about the testing related data:

Table     Column          Type     Size  Nullable
customer  gender          VARCHAR  10    false
customer  salutation      VARCHAR  5     true
customer  state_province  VARCHAR  30    true
store     store_state     VARCHAR  30    true

Description: Data should obey the rules of the real world; DQ rules are used to validate such data. DQ rules can also be used to find interesting records. In this sample task, two DQ rules are designed. DQ tools should be able to find statistics and the corresponding data according to those rules.

1. If the gender suggests that the customer is male, then the salutation should be "Mr.". The violating records are actually more interesting, so the corresponding rule clause is shown as follows.

SQL Rule Clause

GENDER = ’M’ AND SALUTATION !=’MR.’

Figure 3.4.2: Relation of tables

2. The second rule is based on the assumption that for each sales record, if the purchase location (the province/state of the store) of a customer is not consistent with his/her registered location (the province/state of the customer), the record is potentially inaccurate. For example, if the customer has moved to another state/province, this information might be reflected by his/her purchase location. Because the sales table does not contain any location information, this rule analysis requires the comparison of records from three different tables (the relation is illustrated in Figure 3.4.2). The corresponding rule clause is shown as follows.



SQL Rule Clause

CUSTOMER.STATE_PROVINCE != STORE.STORE_STATE

3.4.2 Data Cleansing Task

Task 5: transformation and cleansing

• Key words: pattern conformance, consistency of data presentation, data transformation
• Related DQ dimension: Semantic Accuracy, Pragmatic Data Quality
• Potential target data types: Nominal, Time-related
• Database information about the testing related data:

Table     Column  Type     Size  Nullable
customer  phone1  VARCHAR  30    false

Description: This task aims at solving the pattern inconsistency discussed in Task 3. The goal is to standardize all 10-digit phone numbers, which appear in different formats, into one unified format, namely "999-999-9999". Data that contain characters other than digits, "-", " ", "(", ")", or that cannot be transformed, are considered data with problems. For such data, an asterisk (*) is added at the beginning of the original value for further cleansing activities.

3.4.3 Record Linkage Task

Task 6: record linkage

• Key words: de-duplication, record linkage
• Testing related data set: customer table CSV file (see section 3.3.2 on page 24)

Description: This task aims at linking records that refer to the same real-world entity. The record linkage tools should be able to run the record linkage process with different parameters regarding the indexing method, the comparison function and so on, for the detection of linkages between two data sets. Records that represent the same real-world entity should be linked, and the tools should not link two records that refer to different real-world entities. Additionally, the tools should be able to provide the reference for their decision, and the result should be reusable. Obtaining an optimal performance of record linkage tasks usually requires a large number of trials and analyses.


Chapter 4

Evaluation and Results

4.1 Talend Open Profiler

4.1.1 Open Source Qualification

Talend Open Profiler (TOP) is licensed under GPL 2.0. The project was launched in June 2008. Since the first version, 1.0.0RC1, TOP has had over 30 releases up to the latest version, 4.0.1, released on April 28, 2010. The source code is available on-line1 on Subversion. Subversion is an open-source revision control system which enables developers to maintain the source code as well as other files, e.g. web pages and documentation [Wik10]. TOP shows a comparably high activity level, which is reflected not only by its release frequency, but also by its forum discussions and its way of finding and fixing bugs. The project is held by the software vendor Talend, and it has a dedicated software development team which is responsible for and guarantees the development and maintenance of the project. Documentation including a software introduction and tutorial is available. The tutorial provides step-by-step configuration instructions and leads the user through every feature of the product.

4.1.2 Working Principles and General Functionality

Talend Open Profiler is based on Eclipse RCP2, a Java-based software framework. It runs on any operating system that supports the Java virtual machine. Some components are also provided based on the programming language Perl. TOP provides dedicated connections to various database systems, e.g. MySQL, Oracle, PostgreSQL, etc. It also supports generic ODBC and JDBC if a dedicated connection is not available [Tal09]. As a dedicated connection to the Apache Derby database is not yet supported by the current version of TOP, the generic JDBC connection is used for evaluation purposes. The data profiling functionality is mainly supported by the use of the following elements.

1 See http://www.talend.com/resources/source-code.php
2 Eclipse Rich Client Platform, see http://www.eclipse.org/rcp/


Patterns

Patterns are used for validating strings. They make it possible to define the content or structure of string values by specifying a regular expression (Regex, see section 2.3.1 on page 12). A number of patterns are available by default. Users can also visit the community website to find additional patterns generated and shared by other users or developers.

Indicators

Indicators in the TOP context are the operation options targeting different profiling purposes. Simply put, choosing the indicators means choosing the way the profiling process works. System indicators are predefined and ready to use, but users are also able to define their own indicators and to change the system indicators. The main part of an indicator is its SQL template3; an example is given as follows.

SQL Statement

SELECT COUNT(*) FROM (SELECT <%=__COLUMN_NAMES__%>, COUNT(*) mycount
FROM <%=__TABLE_NAME__%> m <%=__WHERE_CLAUSE__%>
GROUP BY <%=__COLUMN_NAMES__%> HAVING mycount > 1) AS myquery
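To illustrate how such a parameterized template is meant to be filled in (this is only a sketch of the substitution, not TOP's actual mechanism), the parameters can be replaced by concrete identifiers, which yields a runnable duplicate-count query for the address1 column.

public class IndicatorTemplateSketch {
    public static void main(String[] args) {
        String template =
                "SELECT COUNT(*) FROM (SELECT <%=__COLUMN_NAMES__%>, COUNT(*) mycount "
              + "FROM <%=__TABLE_NAME__%> m <%=__WHERE_CLAUSE__%> "
              + "GROUP BY <%=__COLUMN_NAMES__%> HAVING mycount > 1) AS myquery";

        // Fill in the parameters for the address1 column of the customer table.
        String sql = template
                .replace("<%=__COLUMN_NAMES__%>", "`address1`")
                .replace("<%=__TABLE_NAME__%>", "`foodmart`.`customer`")
                .replace("<%=__WHERE_CLAUSE__%>", "");

        System.out.println(sql);
    }
}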

SQL Rules

SQL rules enable the Data Rule Analysis in TOP. Please refer to section 2.3.1 on page 12 for detailed information about Data Rule Analysis.

4.1.3 Sample Task Evaluation

Task 1: simple column statistics

1. Preparation
Indicator Simple Statistics is used directly in this task.

2. Result
The corresponding result is shown in a table (see Figure 4.1.1); a histogram (not shown) is also generated to give users a more intuitive view.
Note: Duplicate here refers to identical values. The numbers for Distinct Count, Unique Count and Duplicate Count indicate counts of values, rather than counts of all qualified records.

3. Remarks

(a) User-tunable options are provided for more complex analyses, which means the user can make changes to an existing indicator including, e.g., case-insensitivity options, threshold definitions, etc.

3An SQL statement with parameters, such as <%=__COLUMN_NAMES__%> in the example.


Figure 4.1.1: Result of simple column statistics

(b) The data values or rows behind any given result can be viewed, and the corresponding SQL statement is generated. This feature provides a more flexible way to reuse the profiling result. The sample SQL statement for the Duplicate Count result is shown below.

SQL Statement

SELECT * FROM `foodmart`.`customer` WHERE `address1` IN
(SELECT `address1` FROM
 (SELECT `address1`, COUNT(*) mycount FROM `foodmart`.`customer` m
  GROUP BY `address1` HAVING mycount > 1) AS myquery)

(c) TOP is able to connect to the Derby database using the generic JDBC connection with the driver org.apache.derby.jdbc.ClientDriver. Profiling tasks run against the database without problems. However, the database explorer component seems to have compatibility problems: the SQL statement it generates is correct, but running the statement produces an error. The problem has not been solved; our assumption is that it is caused by the generic JDBC connection.

Task 2: domain integrity

1. Preparation
We use the Pattern Matching feature for this task. Talend does not offer functionality for importing a value domain comparable to the Dictionary feature in DataCleaner (see section 4.2.2 on page 35). One way to accomplish the task is to define a new pattern using a Regex. Since there are 106 valid values, using a Regex for this task is rather complex. A shortened version of the Regex used for this task is shown below.

Regex (shortened)

’^(AK |AL |AR ... |NL |NT |NS |NU |ON |PE |QC |SK |YT )$’

Description (based on the Basic Regular Expressions (BRE) syntax)4

4See http://en.wikipedia.org/wiki/Regular_expression


Metacharacter  Description
^              Matches the starting position of the string.
$              Matches the ending position of the string.
|              Known as the choice operator; matches either the expression before or the expression after the operator, e.g. all the valid values (i.e. "AK", "AL", "AR", ..., "YT") are separated by a "|".

2. Result
As shown in Figure 4.1.2, values that do not match those in the integrity domain can be detected. It is convenient to retrieve either the valid or the invalid rows or values from the database.

Figure 4.1.2: Result of domain integrity validation

3. Remarks
It would be more convenient if a feature like Dictionary were supported. An alternative way of implementing this task would be a user-defined Java indicator.

Task 3: pattern analysis

1. Preparation
Indicator Pattern Frequency Statistics is used.

2. Result
The result is shown in Figure 4.1.3. For reference, a portion of the sample erroneous data in the original data set is shown in Figure 3.4.1 on page 26.

3. Remarks
As shown in the result, all digits are represented by "9", all uppercase letters are represented by "A", and all lowercase letters are represented by "a"; other characters are represented by themselves.
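A small sketch of this pattern representation (an illustration only, not TOP's implementation) is given below; the example phone numbers are invented.

public class PatternSketch {
    // Digits map to '9', upper-case letters to 'A', lower-case letters to 'a';
    // every other character is kept as it is.
    static String pattern(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (Character.isDigit(c)) {
                sb.append('9');
            } else if (Character.isUpperCase(c)) {
                sb.append('A');
            } else if (Character.isLowerCase(c)) {
                sb.append('a');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pattern("271-555-9392"));   // 999-999-9999
        System.out.println(pattern("(271) 5559392"));  // (999) 9999999
    }
}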


Figure 4.1.3: Result of pattern frequency statistics

Task 4: DQ rule analysis

Please refer to section 3.4.1 on page 27 for detailed information.
1. Preparation

(a) Prepare the first DQ rule. This is done by completing the Where Clause.

(b) Prepare the second DQ rule, named "region consistency", which needs a cross-table validation. The Join Condition has to be set correspondingly, as it indicates how the tables are connected with each other (see Figure 4.1.4).

Figure 4.1.4: DQ rule configuration

2. Result
The result of both DQ rule analyses is shown in Figure 4.1.5.


Figure 4.1.5: Result of DQ rule analysis

In order to show the violating records, the following SQL statement can be used (generated automatically by TOP).

SQL Statement

SELECT sales_fact_dec_1998.* FROM `foodmart`.`sales_fact_dec_1998` sales_fact_dec_1998
JOIN `foodmart`.`customer` customer
  ON (sales_fact_dec_1998.`customer_id` = customer.`customer_id`)
JOIN `foodmart`.`store` store
  ON (sales_fact_dec_1998.`store_id` = store.`store_id`)
WHERE (customer.state_province != store.store_state)

3. Remarks
The analysis could also be achieved simply by using SQL statements, but the workload is much higher, especially for complicated DQ rules. It is more intuitive and time-saving to have such DQ rule features in data quality tools.

4.2 DataCleaner

4.2.1 Open Source Qualification

The development of DataCleaner is handled by the DataCleaner community, supported by the eobjects.org Open Source community. The first version, 1.0, was released on April 12, 2008. Up to June 15, 2010 DataCleaner has had 9 releases in total; the latest one is version 1.5.4, released on May 15, 2010. DataCleaner is distributed under the LGPL license. Documentation and an online wiki containing both a software introduction and a tutorial are available for DataCleaner; some parts of the documentation, e.g. the API, are still under development5. Users can access the source code through the Subversion repository at eobjects.org6.

5 See http://datacleaner.eobjects.org/docs
6 See http://eobjects.org/svn/DataCleaner/trunk/


4.2.2 Working Principles and General Functionality

DataCleaner requires a Java Runtime Environment (JRE) version 5.0 or higher. It runs both in desktop application mode and in command-line interface mode on Microsoft Windows operating systems, Mac OS X and Linux. It supports only two database types by default, but it can work with any database that has a JDBC driver after a simple driver installation. The connection to both testing databases was successfully established.

The software supports a number of different optimization techniques that can be combined in order to make executions quicker and more efficient [Sør09]. For example, the Max connections option enables independent queries to run simultaneously, providing a better utilization of the CPU.

DataCleaner has three operation modes — Profile, Validate and Compare.

Profile

In the Profile mode, Profile is the name of the object that can be added to form a profiling task.

Validate

The Validate mode provides several validation functions called validation rules, specifically JavaScript evaluation, Dictionary lookup, Value range evaluation, Regex validation and Not-null check.

Dictionary: A dictionary contains all the valid data values for profiling. The use of a dictionary allows the user to define a domain constraint for a data set.

Regexes: Regexes, short for Regular Expressions, provide a concise and flexible means for matching strings of text with respect to content or structure (see section 2.3.1 on page 12).

Compare

In the Compare mode, columns, tables and schemas of different data sets can be compared. It is one approach to consistency checking.

4.2.3 Sample Task Evaluation

Task 1: simple column statistics

1. Preparation
Profile Standard measures is used.

2. Result
The corresponding result is shown in Figure 4.2.1. The number of rows, the number of NULL values and the number of empty values are shown in the table. The highest and lowest values are meaningless in this case.


Figure 4.2.1: Result of simple column statistics

3. Remarks
DataCleaner does not support functions concerning the count of duplicate, distinct and unique values. It is also not possible to modify the profile flexibly, e.g. to deselect unnecessary components such as the highest or lowest value component in this case. For comparison, see section 4.1.3 on page 30.

Task 2: domain integrity

1. Preparation

(a) Dictionary creation: a text-file containing all the valid values.

(b) The validation rule Dictionary lookup is used.

2. Result
As shown in Figure 4.2.2, all values that do not match those in the integrity domain are detected. Compared to the result obtained by Talend, DataCleaner provides a clearer view, as all the invalid values are listed.

3. Remarks
Dictionary lookup is quite handy for realizing this task. Instead of using a text file, the integrity domain can also be defined using a column in a database. DataCleaner also provides other approaches for this task which are similar to those of Talend Open Profiler, e.g. JavaScript or Regex.

Task 3: pattern analysis

1. Preparation
Profile Pattern Finder is used in this task.

2. Result


Figure 4.2.2: Result of domain integrity analysis

The result of this task is shown in Figure 4.2.3. For reference, a portion of the original values can be viewed in Figure 3.4.1 on page 26.

Figure 4.2.3: Result of pattern analysis

3. Remarks
As shown in the result of this implementation, all digits are represented by "9" and all letters, regardless of case, are represented by "a". An assumption is that if a field contains both digits and letters, each character is represented by "?". No reference has been found in the documentation to explain this behaviour.

Task 4: DQ rule analysis

No DQ rule analysis functionality is supported by the current version of DataCleaner.


4.3 Talend Open Studio

4.3.1 Open Source Qualification

As mentioned previously, Talend Open Studio (TOS) and Talend Open Profiler (TOP) are both provided by Talend. The TOS project was launched in October 2006, earlier than TOP. TOS has already had over 60 releases, the latest being 4.0.1, released on April 28, 2010. The Open Source qualification of TOS is similar to that of TOP (see section 4.1.1 on page 29).

4.3.2 Working Principles and General Functionality

Like Talend Open Profiler, TOS is based on Eclipse RCP. It runs on any operating system that supports the Java virtual machine and is announced to be compatible with most existing database systems. As mentioned before, TOS is basically an Open Source ETL tool. Skipping all the data integration functionality, which may be considered the heart of the product, we focus here on its data quality components. With the components presented below, TOS provides only very basic record linkage functionality compared with that of the record linkage tools presented later. The introduction is based on the Reference Guide [Tal10].

tMap

tMap is an advanced component integrated as a plugin of Talend Open Studio. tMap transforms and routes data from single or multiple sources to single or multiple destinations. The mapping provides many predefined, ready-to-use functions for the transformation of different data types. Users can also define their own transformation functions using Java code.

tFuzzyMatch

tFuzzyMatch compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data together with the computed distance. It helps to check the quality of source data against a reference data source.

tUniqRow

tUniqRow compares entries and removes duplicates from the input flow, keeping the first entry encountered.

Output ability

Users are able to choose from various output options, such as updating the original data source with the obtained result or saving the result into another data source that TOS supports. To export a project for reuse, the user can choose either to save the project as a Talend object or to export the Java code of the project. The user is also able to generate an HTML document containing the details and a description of the project.


4.3.3 Sample Task Evaluation

Task 5: transformation and cleansing

The task description can be found in section 3.4.2 on page 28.
1. Preparation

(a) Generate Java code to create a user-defined transformation function retainNum. The code of the class Retain_Char containing the function is shown in Java Code 4.3.1.

(b) Configure the workflow and tMap accordingly by adding the transformation functions (see Figure 4.3.1). The transformation result is saved into another data source called phonename.

Figure 4.3.1: Workflow preparation

Java Code 4.3.1

public class Retain_Char {
    /* retainNum: this function is used to standardize phone
     * number formats
     * {talendTypes} String
     * {Category} User Defined
     */
    public static String retainNum(String origin) {
        if (origin.length() < 1) {
            return "";
        }
        String newS = new String();
        // getting rid of all unnecessary symbols or characters
        newS = origin.replace("-", "");
        newS = newS.replace(" ", "");
        newS = newS.replace("(", "");
        newS = newS.replace(")", "");
        // if the string starts with "0", delete the "0"
        if (newS.startsWith("0")) {
            newS = newS.substring(1);
        }
        // if the phone number has 10 digits, output the
        // standardized format "123-456-7890"
        if (newS.length() == 10 && newS.matches("[0-9]{10}")) {
            newS = newS.substring(0, 3).concat("-")
                       .concat(newS.substring(3, 6)).concat("-")
                       .concat(newS.substring(6, 10));
            return newS;
        }
        // if not, output the original phone number and mark
        // the phone number by adding a "*" at the beginning.
        else return "*".concat(origin);
    }
}

2. Result
In order to provide an overview of the result, both the original and the resulting phone columns are profiled again (see Task 3 in section 3.4.1 on page 26). The two pattern analysis results are shown in Figure 4.3.2. The upper table contains the result for the original data set, the lower one the phone column after the transformation. As shown in the result, all valid values are transformed into the standard pattern, leaving the invalid ones with an asterisk (*) added at the beginning.

Figure 4.3.2: Result of cleansing: comparison of the profiling result

A small piece of the original data set and the corresponding output is illustrated in Figure 4.3.3.

3. Remarks
Although TOS already provides several predefined functions for string operations, data cleansing tasks are mostly application oriented and differ from case to case. User-defined transformation functions enable users to utilize any transformation as long as it can be realized in Java.


Figure 4.3.3: Comparison of the original data and the corresponding transformation result

4.4 SQL Power DQguru

4.4.1 Open Source Qualification

DQguru is licensed under GPL 3.0. It is the data cleansing product of the SQL Power Software Group. The initial Open Source release was 0.9.0, released in November 2007. The current version is 0.9.7, released in March 2010. Several short demo videos, free of charge and available on the website, help users learn about some features of the software. However, its user guide is not free; it costs $49.00. Non-members may check out a read-only working copy of its source code anonymously from its Google Code project page7.

4.4.2 Working Principles and General Functionality

DQguru is Java-based, and the software is available in three versions: for Windows, Mac OS, and Unix/Generic. PostgreSQL, MySQL, SQL Server, Oracle and HSQLDB are currently supported. The ability to connect to other databases, such as DB2/UDB, Apache Derby, Infobright, and Eigenbase, has not been certified. Because of the lack of documentation for this tool, a more detailed description of its functionality cannot be provided. According to our experience, the output capability of this tool is rather limited: DQguru only supports updating the original data set with the transformed result. An additional option is to add a CSV file as output.

7See http://code.google.com/p/power-matchmaker/


4.4.3 Sample Task Evaluation

Task 5: transformation and cleansing

1. Preparation
Workflow preparation and configuration (see Figure 4.4.1).

Figure 4.4.1: Workflow preparation

2. Result
The result is similar to that obtained by Talend Open Studio (see section 4.3.3 on page 39). Only the transformation of the valid phone numbers into the unified presentation pattern was successfully managed, using the workflow shown in Figure 4.4.1. Due to the lack of documentation and the limited number of transformation functions, attempts at further operations were not successful.

3. Remarks
The configuration process is not as complex as that of TOS; the interface and configuration of DQguru are very intuitive and user-friendly. However, the lack of user-tunable functionality makes complex data cleansing tasks very difficult.

4.5 Fril

4.5.1 Open Source Qualification

The Fril project started in 2008, with version 1.0 released in October of that year. So far Fril has had 14 releases in total; the latest release is version 2.1.4, released on December 22, 2009. The product is licensed under MPL 1.1/GPL 2.0/LGPL 2.1, which is quite flexible concerning redistribution. Documentation including a software introduction and a tutorial is available. The tutorial provides step-by-step instructions for the linkage configuration and contains a description of the features available in FRIL; short introductions to and the usage of some comparison functions are also included.


The source distribution is available for download8; it contains the Java source code of the FRIL framework along with all the libraries required for byte-code generation.

4.5.2 Working Principles and General Functionality

Fril is Java-based and can run on any operating system that supports the Java virtual machine. By default it supports four kinds of data sources: CSV files, Excel files, text files, and databases with a JDBC driver. The databases supported by default are MS Access, MS SQL, MySQL, PostgreSQL and Oracle. By a simple modification of the misc.properties file, it may be able to support any database with a JDBC driver; we have managed to use Fril to access the Apache Derby database. The output is limited to CSV files.

Join condition

The join condition in Fril determines the way records are compared. Assigning a join condition includes choosing the distance metric, a.k.a. the comparison function, and setting up all the necessary parameters (see section 2.3.3 on page 15).
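To illustrate what a comparison function in this sense computes, the following generic sketch (not Fril's code) returns a Levenshtein-based similarity normalized to the range [0, 1], where 1.0 means identical values; the example values are taken from the sample customer table.

public class EditDistanceSketch {
    // Classic Levenshtein distance computed by dynamic programming.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity in [0, 1]: 1.0 means the two strings are identical.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Mainoru", "Manoru"));  // ~0.86
        System.out.println(similarity("Miles", "Mils"));      // 0.8
        System.out.println(similarity("Sorella", "Sorella")); // 1.0
    }
}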

Join method

The join method in Fril has the same meaning as the blocking/searching method (see section 2.3.3). Four options are available: nested loop join, sorted neighborhood method, blocking and search method, and SVM join [CL01]. For details on the join methods please refer to [JLX+08].
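As a generic illustration of the blocking idea behind the sorted neighborhood method (a sketch, not Fril's implementation): the records are sorted by a blocking key, and only records whose positions lie within a sliding window are compared in detail, which avoids the quadratic number of comparisons of a nested loop join. The record values below are taken from the sample customer table; the window size is arbitrary.

import java.util.Arrays;

public class SortedNeighbourhoodSketch {
    public static void main(String[] args) {
        // Each record: { cust_id, lname, fname }.
        String[][] records = {
                {"31810", "Sorella", "Jamse"},
                {"33391", "Miles", "Thomas"},
                {"31811", "Sorella", "James"},
                {"33390", "Miles", "Tom"}
        };

        // Blocking key: sort by last name (a real key would usually combine several fields).
        Arrays.sort(records, (a, b) -> a[1].compareTo(b[1]));

        int window = 2; // only records at most (window - 1) positions apart are compared
        for (int i = 0; i < records.length; i++) {
            for (int j = i + 1; j < Math.min(i + window, records.length); j++) {
                System.out.println("compare " + records[i][0] + " with " + records[j][0]);
            }
        }
    }
}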

Results saving configuration

This configuration enables an enhancement method for the record linkage result (see Figure 4.5.3).

4.5.3 Sample Task Evaluation

Task 6: record linkage

1. Preparation

(a) Set the join condition for each attribute and assign the match weight and empty value score for each attribute. The acceptance level is the threshold for duplicate identification. The settings for this task are shown in Figure 4.5.1.

8See http://fril.sourceforge.net/download.html


Figure 4.5.1: Join condition configuration

Figure 4.5.2: Joined method configuration

(b) Joined method configuration. (see Figure 4.5.2)

(c) Results saving configuration. (see Figure 4.5.3)


Figure 4.5.3: Results saving configuration

Figure 4.5.4: Sample Linkage results

2. Result and output
As a result, 493 linkages are identified by this implementation. Figure 4.5.4 shows the linkage result viewed in the internal viewer of Fril, which offers features like highlighting differences, filtering, etc. A confidence value for each linked pair is also provided in the linkage result.


Additional: Merging technique for de-duplication

The reason for presenting this additional task is that the de-duplication mode of Fril is not exactly the same as the record linkage mode. The de-duplicated result is generated by employing a simple merging technique. As far as we could explore — and as confirmed by one of the Fril developers — Fril only supports the following simple algorithm for combining two linked records:

1. If one of the two field values is missing, the existing one is adopted.

2. If both fields are non-empty and have different values, the one that appears first is adopted. The sequence of records depends on the indexing and searching method employed.

Although this is not a really intelligent algorithm, it could be the starting point for providing a merging feature in record linkage tools. An example is shown below.

If two records are:

fname  lname  date_of_birth  phone
Aaom   Breed  19945718
Aaron  Breed  19940718       03 01452468

By employing Fril’s algorithm, the result would be:

Aaom Breed 19945718 03 01452468.

As Aaron is a valid name and 19945718 is not a valid birth date, a better answer could be generated if a more advanced algorithm were employed, i.e.

Aaron Breed 19940718 03 01452468.

This could be done by adding integrity constraint functionality to the merging technique (see section 2.3.3).
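A minimal sketch of the merge rule described above — take the non-empty value, otherwise keep the value of the record that appears first — is given below; the field layout follows the example above, and the class and method names are only illustrative.

import java.util.Arrays;

public class MergeSketch {
    // Merge two linked records field by field: if the value of the first record
    // is missing, take the value of the second; otherwise keep the first value.
    static String[] merge(String[] first, String[] second) {
        String[] merged = new String[first.length];
        for (int i = 0; i < first.length; i++) {
            merged[i] = (first[i] == null || first[i].isEmpty()) ? second[i] : first[i];
        }
        return merged;
    }

    public static void main(String[] args) {
        // Fields: fname, lname, date_of_birth, phone (values from the example above).
        String[] a = {"Aaom", "Breed", "19945718", ""};
        String[] b = {"Aaron", "Breed", "19940718", "03 01452468"};
        // Prints [Aaom, Breed, 19945718, 03 01452468], i.e. the merged record of the example.
        System.out.println(Arrays.toString(merge(a, b)));
    }
}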

4.6 Febrl

4.6.1 Open Source Qualification

The Febrl project is developed and maintained by the ANU (Australian National University) Data Mining Group. The first version of Febrl (version 0.1) was released in September 2002. So far Febrl has had 6 releases in total, the latest being 0.4.1, released in December 2008. The product is licensed under the ANU Open Source License, which is a FOSS license. A manual describing the software is available, as are several research publications. It is prototype software with no precompiled binary available; the source code is distributed along with the software itself.

4.6.2 Working Principles and General Functionality

Febrl is based on the programming language Python9 and associated extension libraries. It is cross-platform software, but due to differences among operating systems, small changes are required for running the software correctly.

9See http://www.python.org/


The installation requires Python 2.5, with additional modules and libraries needed for the graphical user interface (GUI) and the support vector machine (SVM) classification method [CL01]. Currently, only a few text-based file formats are supported, including the CSV format, i.e. Febrl provides no database connectivity. The output is also limited to CSV files. All information provided in this section is based on [Chr08b] [Chr08a, Chr06]; for further information refer to these references.

Index

Febrl provides several indexing methods (see section 2.3.3 on page 15): FullIndex, SortingIndex, QGramIndex, CanopyIndex, StringMapIndex and SuffixArrayIndex (more information is provided in [Chr08b], chapter 8). Detailed configuration is supported, e.g. the window size, the encoding function for the selected field, etc.

Comparison methods

Febrl provides a large number of comparison functions (for more information see [Chr08b], chapter 9). Figure 4.6.2 on the following page presents a portion of them.

Classification methods

An introduction to classification methods is provided in section 2.3.3 on page 16. Febrl offers six different classification techniques, presented in Figure 4.6.3 on page 49 (for more information see [Chr08b], chapter 10).

Output

Four output files are available from the execution of each task.

1. The weight vector file is produced in the record pair comparison step. It contains the similarity value of every compared pair of attributes, obtained by the comparison function according to the configuration.

2. The histogram file is a simple text-based histogram, such as the one shown in Figure 4.6.4.

3. The match status file contains the summed matching weight for each linked pair. A match ID is the identifier of each linked record pair.

4. The match file copies the whole input data set and adds the generated match ID(s) at the end of each record that has been linked.


4.6.3 Sample Task Evaluation

Task 6: record linkage

1. Preparation

(a) Index method configuration. (see Figure 4.6.1)

Figure 4.6.1: Index method configuration

(b) Comparison method configuration and parameter assignment. (see Figure 4.6.2)

Figure 4.6.2: Comparison method configuration

(c) Classification configuration. (see Figure 4.6.3)


Figure 4.6.3: Classification configuration

2. Result and output
The result is shown in the histogram automatically generated by Febrl (see Figure 4.6.4).

Figure 4.6.4: The histogram showing the record linkage result


Chapter 5

Summary

5.1 Comparison and Result

We made a brief comparison of the selected tools in each category according to the evaluation criteria. The results are shown briefly in three tables. Comments and conclusions are also presented, concentrating on the two aspects, namely the Open Source qualification and the data quality functionality, according to the evaluation criteria in section 3.1 on page 20.

Important note: All results obtained are the author's own opinion and based on the experience of the evaluation process in this thesis. The release date of the latest version referred to in this thesis is up to June 15, 2010.

Description of Notation

N denotes that the tool is superior to the other one according to the indicated criterion.
H denotes that the tool is inferior to the other one according to the indicated criterion.
≈ denotes that the two evaluated tools do not show much difference according to the indicated criterion.
X denotes that the tool does not have such functionality.

5.1.1 Case Study Result on Data Profiling Tools

                   TOP          DataCleaner
License            GPL 2.0      LGPL
Release Activity   N            H
Version Installed  4.0.1        1.5.4
Release Date       28 Apr 2010  15 May 2010
User support       ≈            ≈
Ease-of-use        H            N
Flexibility        N            H
Functionality      N            H
Task 1             N            H
Task 2             H            N
Task 3             ≈            ≈
Task 4             N            X



Compared with DataCleaner, Talend Open Profiler (TOP) is more popular regarding its forum activities. Its documentation is of better quality, and the webinars are very helpful for new users getting to know the software. TOP has a more complex structure, providing more functions and extension capabilities, including a large number of options for users to define and configure. Although this structure may make the tool harder to use and may require additional learning, it provides more flexibility and efficiency. All sample tasks were accomplished, although the ability to define an integrity domain is limited. The ability not only to show the result from the database but also to provide the corresponding SQL statements is very useful; this feature is not supported by the current version of DataCleaner.

DataCleaner is a simple-to-use data profiling tool with a simpler way of working. The clear and user-friendly GUI makes it easy to understand the tool without putting too much effort into the user guide. DataCleaner provides limited configuration possibilities, and according to our sample task evaluation it does not support data rule analysis. Regarding the output capability, DataCleaner is able to export an HTML report after each profiling task, although it only includes the basic information of the result; TOP, on the other hand, does not support report output in its community version. Another advantage of DataCleaner is its small footprint: the full installation takes only 14.9 MB compared with 233 MB for TOP.

5.1.2 Case Study Result on Data Cleansing Tools

                   TOS          DQguru
License            GPL 2.0      GPL 3.0
Release Activity   N            H
Version Installed  4.0.1        0.9.7
Release Date       28 Apr 2010  Mar 2010
User support       N            H
Ease-of-use        H            N
Flexibility        N            H
Functionality      N            H
Task 5             N            H

Talend Open Studio (TOS) is an ETL tool with comprehensive functions and extension capabilities, providing a large number of options with user-tunable configuration parameters.


It provides much more than data cleansing. Its comparatively complex structure requires some knowledge to operate the software, but it provides more flexibility and efficiency at the same time. User-defined functions written in Java code are a big advantage for users who are familiar with Java programming, as they enable the realization of functions that are not supported directly by the tool itself. Abundant options regarding output and export provide further flexibility.

DQguru, on the other hand, provides a large number of predefined operations for data transformation. It supports only very limited configuration possibilities, which makes it inflexible with regard to both its functionality and its output capabilities. The lack of documentation and support greatly restricted our further exploration of this tool. Data cleansing tasks are often application oriented, i.e. they depend on the particular business case, and predefined transformation functions are sometimes not sufficient for such tasks. This disadvantage is also reflected in the performance on the sample task.

5.1.3 Case Study Result on Record Linkage Tools

                     Fril                          Febrl
License              MPL 1.1/GPL 2.0/LGPL 2.1      ANU Open Source License
Release Activity     ≈                             ≈
Version Installed    2.1.4                         0.4.1
Release Date         22 Dec 2009                   Dec 2008
User support         ≈                             ≈
Ease-of-use          N                             H
Flexibility          H                             N
Functionality        ≈                             ≈
Task 6               ≈                             ≈

Febrl is a prototype tool, as described by its developer, and is therefore more research oriented. It shows more complexity in both its user interface and its structure. More configuration parameters, more comparison functions and the complex structure increase the difficulty of using the software; the configuration requires users to have a certain level of knowledge in the record linkage area, although the high quality of the documentation reduces this difficulty somewhat. Febrl so far does not support any database connectivity, which is a great disadvantage for its range of use, as most real applications are based on databases. Fril, on the other hand, supports database connectivity via JDBC and provides a more user-friendly and intuitive interface. The two tools do not show much difference in record linkage task performance, although the larger number of comparison functions in Febrl has the potential to generate better results.
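
The following minimal sketch illustrates the general idea behind such comparison functions: a candidate record pair is scored with a string similarity, here a simple normalized Levenshtein similarity averaged over two fields and cut off at a fixed threshold. The sample values, the unweighted averaging and the threshold of 0.8 are assumptions made for illustration only and do not reproduce the comparison functions of Fril or Febrl.

// Minimal record-pair comparison sketch (illustrative values and threshold).
public class PairComparisonSketch {

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Similarity in [0,1]: 1.0 means identical strings. */
    static double similarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        return 1.0 - (double) levenshtein(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        // One candidate pair: [surname, given name] from two sources.
        String[] recordA = {"Meier", "Jonathan"};
        String[] recordB = {"Meyer", "Jonathon"};
        double score = 0.0;
        for (int i = 0; i < recordA.length; i++) {
            score += similarity(recordA[i].toLowerCase(), recordB[i].toLowerCase());
        }
        score /= recordA.length;   // unweighted average over the compared fields
        System.out.printf("score=%.2f -> %s%n", score, score >= 0.8 ? "match" : "non-match");
    }
}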


5.2 Summary and Outlook

This research project has attempted to answer some questions regarding Open Source data quality software. Open Source data quality is a relatively new area, and this thesis can serve as a basis for further research in it. For the summary and outlook, we focus on answering the following questions.

What are the major technologies in the Open Source data quality management area and how do they work?

Three major technologies are discussed in this thesis: data profiling, data cleansing and record linkage. Other technologies may also be related to data quality management, but it is difficult to cover the whole scope given the time limitations of this thesis and the lack of agreement on some of these technologies. Data profiling is usually considered the starting point of a data quality management process, i.e. the first step is to identify the existing data quality problems. Data cleansing and record linkage, in turn, focus on solutions for resolving the data quality issues that have been found. The fundamentals of each of the three technologies are introduced, and a case study has been performed on each of them, showing current practice in this area against that theoretical background.

What Open Source solutions are provided in this area?

After a general overview of the Open Source data quality market, six tools were selected for evaluation. The evaluation revolves around the three major technologies, and the criteria cover two aspects: Open Source qualification and data quality functionality. Additional descriptions of each tool's working principles and of the implementation steps for the sample tasks are presented, so that readers can get a quick look at each tool, an impression of how it works, and its qualification regarding each technology.

What do the results of the case studies suggest?

The case studies on data profiling, data cleansing and record linkage tools show that the selected Open Source tools are generally able to realize most of the functionality of each technology. However, differences between the tools have also been revealed regarding both their Open Source qualification and their functionality, and the ease-of-use and the performance on the sample tasks sometimes differ greatly as well. One of the most important aspects that leads to these differences is the Open Source model.

Pure Open Source vs. Commercial Open Source

Those tools that do not have a commercial version or are not supported by any software vendor are considered pure Open Source software. They often have a small footprint and a simple structure and are developed by a small group of people. They may not provide as much user support as commercial Open Source software, and their disadvantages are also reflected in the small scale of their communities and their lack of popularity. However, the functionality they provide is often handy and easy to use. The small footprint also means low disk-space requirements and less operating overhead, and the simpler structure makes it easier for interested developers to get involved in the project, provided good documentation exists, since it reduces the time needed for learning. This also provides advantages with respect to redistribution and integration.

One existing problem is that, due to the limited contributor base behind pure Open Source software, the documentation is sometimes of low quality or incomplete. Some Open Source projects stop at the stage of simply bringing the product free of charge to the market, which lets users treat the software as freeware; the lack of instructions then makes it hard for users to make full use of it. Commercial Open Source software usually offers complete documentation as well as good user support, and therefore usually shows a better qualification during the evaluation.

What can be expected for further research or development in this area?

As this thesis focused only on three major technologies, one task for further research is to identify other technologies and more applications in this area. Adding functionality to the evaluated tools or optimizing their current features could also be an interesting task. Taking Fril as an example, a merging technique based on a general-purpose algorithm could be added to its de-duplication process, for instance by employing an integrity domain or other methods.
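
A minimal sketch of such a merge (survivorship) step for a cluster of records flagged as duplicates is given below. The rule used here, taking the first non-empty value per attribute, and the sample records are assumptions chosen purely for illustration; they are not existing Fril functionality.

// Minimal merge/survivorship sketch (illustrative rule and data).
import java.util.Arrays;
import java.util.List;

public class MergeSketch {

    /** Merges a duplicate cluster attribute by attribute. */
    static String[] merge(List<String[]> cluster, int attributeCount) {
        String[] golden = new String[attributeCount];
        for (int i = 0; i < attributeCount; i++) {
            for (String[] record : cluster) {
                if (record[i] != null && !record[i].trim().isEmpty()) {
                    golden[i] = record[i];   // first non-empty value wins
                    break;
                }
            }
        }
        return golden;
    }

    public static void main(String[] args) {
        List<String[]> duplicates = Arrays.asList(
                new String[]{"Meier", "Jonathan", null},
                new String[]{"Meier", "", "Luebeck"});
        System.out.println(Arrays.toString(merge(duplicates, 3)));
        // -> [Meier, Jonathan, Luebeck]
    }
}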

The current Open Source solutions in this area still aim at a general scope; only a limited number of products are available, and they are not yet as capable as commercial software. Functionality and usability still need to be improved, which requires considerable effort. Some of the evaluated tools are still at a raw stage and must be considered research oriented. In order to be used for solving real business problems, they need to become more precise and to handle more specific details.




Appendix A

Candidate List and Selection Results

Candidate               Website                                        Eliminated
Talend Open Profiler    http://www.talend.com/
DataCleaner             http://datacleaner.eobjects.org/
OpenDQ                  http://www.infosolvetech.com/opendq.html       X
Aggregate Profiler      http://sourceforge.net/projects/dataquality    X
Talend Open Studio      http://www.talend.com/
SQL DQguru              http://www.sqlpower.ca/page/dqguru
Potter's Wheel          http://control.cs.berkeley.edu/abc/            X
Fril                    http://fril.sourceforge.net/
Febrl                   http://sourceforge.net/projects/febrl/
RapidMiner              http://www.rapidminer.com                      X

Table A.0.1: Candidate list

Table A.0.1 lists all the candidates investigated in the selection procedure for the case studies. The reasons for elimination are presented below.

OpenDQ

Although tagged as open source data quality, the OpenDQ product is delivered by its company as a total solution on an open source platform1, which means that, as far as we could investigate, neither the software nor the source code is available to non-paying customers.

Aggregate Profiler

Although the product is available on SourceForge2, it turns out to be an evaluation version. Information about the community and the maintenance of the software is very limited.

1 See http://www.dataqualitysolution.com/
2 See http://sourceforge.net/projects/dataquality/


Potter’s Wheel

Potter's Wheel A-B-C is an open source data analysis and cleansing tool, but the latest available version is 1.3, released on Oct 10, 2000. We consider the software to be no longer under development.

RapidMiner

RapidMiner is an environment for machine learning and data mining experiments. Its focus is data mining, which is also a technology relevant to data quality management, but due to the lack of time it is not covered in this thesis.
