

Record Linkage: Current Practice and Future Directions

Lifang Gu, Rohan Baxter, Deanne Vickers, and Chris Rainsford

CSIRO Mathematical and Information Sciences
GPO Box 664, Canberra, ACT 2601, Australia

CMIS Technical Report No. 03/83
[email protected]

http://datamining.csiro.au

Abstract. Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification, and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.

Keywords: record linkage, data cleaning, entity identification, entity reconciliation, object isomerism, merge/purge, list washing.

1 Introduction

Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Entities of interest include individuals, companies, geographic regions, families, or households.

Record linkage has applications in customer systems for marketing, customer relationship management, fraud detection, data warehousing, law enforcement and government administration. These applications can be classed as ‘administrative’, because the record linkage is used to make decisions and take actions regarding an individual entity. Our own particular interest is record linkage for ‘research purposes’, where it is used to link information regarding entities to answer ethically approved epidemiological, health system or social science research and policy questions.

In many research projects it is necessary to collate information about an entity from more than one data source. If a unique identifier or key of the entity of interest is available in the record fields of all of the data sources to be linked, deterministic record linkage can be used. Deterministic record linkage assumes error-free identifying fields and links records that exactly match on these identifying fields. For large multiple data sources, shared error-free identifying fields are uncommon. Sources of variation in identifying fields include changes over time, minor typographical errors, and lack of completeness in availability over space and time. When no error-free unique identifier is shared by all of the data sources, a probabilistic record linkage technique can be used to potentially join or merge the data sources.

Christen et al. [20] note that the process of linking records has different names in different research and user communities. While epidemiologists and statisticians speak of record linkage, the same process is called entity heterogeneity [32], entity identification [68], object isomerism [19], instance identification [106], merge/purge [47], entity reconciliation [33], list washing and data cleaning [20] by computer scientists and others. The term record linkage is used throughout this paper.

This survey paper presents the current standard practice in record linkage methodology. We also consider possible improvements due to recent innovations in computer science, statistics and operations research. Another purpose of this paper is to identify likely future directions for improving the performance of current practice. Comprehensive recent surveys touching on the long history of record linkage techniques can be found elsewhere [113, 1, 42]. The current standard practice is based on probabilistic approaches [36, 52, 42]. Some recent contributions from database and data mining research that may eventually supersede the current practice include rule-based record comparison methods, greater automation through active learning, and improved scalability through innovative data structures and search algorithms [47, 104, 34]. There is minimal cross-referencing between the statistical and database approaches.1

A key contribution of this paper is an initial attempt at an integrated view of recent developments in the separate approaches. We are unaware of any previous attempts at this integration.

The paper is structured as follows. The definition of the record linkage problem, the formal probabilistic model and an outline of the standard practice algorithm are provided in Section 2. Section 3 presents recent proposals for alternatives to various components of the standard algorithm. Section 4 then puts the core record linkage process in a privacy and legal context. Issues such as record linkage over federated data sources and confidentiality protocols are briefly reviewed. Section 5 concludes with a summary of our views on the current methods and the most worthwhile research directions.

1 Exceptions being the wide referencing of the Fellegi and Sunter paper [36] and, more occasionally, Newcombe's 1959 Science paper [82] and 1962 CACM paper [81].

In Appendix A, government, academic and commercial record linkage systems and their known features are described. This records our general awareness of existing systems and will allow us to check on differentiators for any new proposed directions. Appendix B gives a list of research projects with a record linkage component. These studies are useful for providing greater context for understanding the detailed requirements for record linkage software for health and social science research projects.

2 The Record Linkage Problem, Model and Standard Algorithm

We first define the record linkage problem in Section 2.1. Section 2.2 then provides the formal probabilistic model of record linkage. This probabilistic model has provided an influential framework for the practical systems forming the standard practice today. These systems largely follow the algorithm given in Section 2.3.

2.1 Record Linkage Problem Definition

Records in data sources are assumed to represent observations of entities taken from a particular population. The records are assumed to contain some attributes (fields or variables) identifying an individual entity. Examples of identifying attributes are name, address, age and gender.

Suppose source A has na records and source B has nb records. Each of the nb records in source B is a potential match for each of the na records in source A. So there are na × nb record pairs whose match/non-match status is to be determined [43].

Two disjoint sets M and U can be defined from the cross-product of A with B, the set A × B. A record pair is a member of set M if that pair represents a true match. Otherwise, it is a member of U. The record linkage process attempts to classify each record pair as belonging to either M or U.

Many matching problems are more constrained than this statement of the problem. For instance, if each record in data source B refers to a distinct entity, a record in data source A cannot be matched to two records in data source B at the same time. Cohen calls this the constrained matching problem [26]. It is more generally referred to as 1-1 linkage, in comparison to the alternative 1-many linkage. 1-1 linkage, since it has more constraints, is a harder optimisation problem [26].

2.2 Probabilistic Model of the Record Linkage Problem

Following Gomatam et al. [43], record pairs are labelled as:

– match, A1.
– possible match, A2.
– non-match, A3.

For record a from source A and record b from source B, available information on the records is denoted α(a) and β(b) respectively. A comparison or agreement vector, γ, for a record pair (α(a), β(b)) represents the level of agreement between the records a and b.

When record pairs are compared on k identifying fields, the γ vector has k components. γ = (γ1(α(a), β(b)), ..., γk(α(a), β(b))) is a function on the set of all na × nb record pairs.

For an observed agreement vector γ in Γ, the space of all possible comparison vectors, m(γ) is defined to be the conditional probability of observing γ given that the record pair is a true match. That is:

m(γ) = P (γ|(a, b) ∈ M) (1)

Similarly,

u(γ) = P (γ|(a, b) ∈ U) (2)

denotes the conditional probability of observing γ given that the record pair is a true non-match.

There are two kinds of possible misclassification errors: false matches (Type I errors) and false non-matches (Type II errors). The probability of a false match is:

P (A1|U) = ∑γ∈Γ u(γ)P (A1|γ) (3)

and the probability of a false non-match is:

P (A3|M) = ∑γ∈Γ m(γ)P (A3|γ) (4)

For fixed values of the false match rate (µ) and false non-match rate (λ), Fellegi and Sunter [36] define the optimal linkage rule on Γ at levels µ and λ, denoted by L(µ, λ, Γ), as the rule for which P (A1|U) = µ, P (A3|M) = λ, and P (A2|L) ≤ P (A2|L′) for all other rules L′. Optimal is defined here as the rule that minimises the probability of classifying a pair as belonging to A2, the subset of record pairs requiring manual review. Other criteria for optimality are considered in Section 3.7.

Let the ratios m(γ)/u(γ) be ordered to be monotonically decreasing (with ties broken arbitrarily) and the associated γ be indexed 1, 2, ..., NΓ. If µ = ∑i=1..n u(γi) and λ = ∑i=n′..NΓ m(γi), with n < n′, then the optimal rule is a function of the likelihood ratio m(γ)/u(γ) and is given by the following equations:

(a, b) ∈ A1 if Tµ ≤ m(γ)/u(γ) (5)
(a, b) ∈ A2 if Tλ < m(γ)/u(γ) < Tµ (6)
(a, b) ∈ A3 if m(γ)/u(γ) ≤ Tλ (7)

where Tµ = m(γn)/u(γn) and Tλ = m(γn′)/u(γn′). Figure 1 illustrates these three regions in terms of the degree of agreement of the record pair.
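
As a minimal illustration (not from the paper), the decision rule in equations (5)–(7) can be sketched as a small function; the threshold values used below are invented for the example:

```python
def classify_pair(likelihood_ratio, t_mu, t_lambda):
    """Fellegi-Sunter decision rule: compare the likelihood ratio
    m(gamma)/u(gamma) against the upper and lower thresholds."""
    if likelihood_ratio >= t_mu:
        return "A1"  # match
    if likelihood_ratio <= t_lambda:
        return "A3"  # non-match
    return "A2"      # possible match, sent to clerical review

# Illustrative thresholds T_mu = 10, T_lambda = 0.1
print(classify_pair(25.0, 10.0, 0.1))  # A1
print(classify_pair(2.0, 10.0, 0.1))   # A2
print(classify_pair(0.01, 10.0, 0.1))  # A3
```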

Fig. 1. The three regions of the probability model.

Under the assumption of conditional independence of the components of the γ vector, the decision rule above can be written as a function of log(m(γ)/u(γ)) = ∑j=1..k wj, where the weight wj = log(m(γj)/u(γj)).

Intuitively, there will be many more non-matching record pairs than matching record pairs. Figure 2 is a typical histogram of record pair weights illustrating this. The non-match mode is much larger than the matching mode. The degree of separation between the modes is an indication of the level of difficulty of the linkage task and the amount of Type I and Type II errors that may result, assuming ground truth of matches is available.

Fig. 2. Histogram of comparison weights showing a mode for matching weights and a mode for non-matching weights, and a degree of overlap where matching errors will occur.

Alternative Statistical Models This standard model represents the data probabilities directly. An alternative is to model errors in attributes explicitly [28]. Different Bayesian models of record linkage [38, 64] have also been proposed. The practicality and accuracy of these methods are not clear: systems implementing them have not been widely used and are not widely available.

2.3 Standard Algorithm for Record Linkage

The framework of the previous section is the basis for the standard algorithm for record linkage. The operationalisation of the framework requires a method for estimating the weights, wj, or more generally, the likelihood ratio m(γ)/u(γ).

Jaro [51, 52] uses the expectation-maximisation (EM) algorithm [31] to estimate m(γ), γ ∈ Γ. The complete data vector is given by (γ, g), where g indicates the actual status (match or non-match) of the record pair. Jaro restricts the γj to be 0/1 values and assumes conditional independence of the γj. The term g takes a value corresponding to a match with probability p and a non-match with probability 1 − p. The likelihood is written in terms of g, m(γ) and u(γ). The EM algorithm uses maximum likelihood estimates of m(γ), u(γ) and p to estimate the unobserved g.

The EM algorithm needs initial estimates of m(γ), u(γ) and p and then iterates. Jaro uses frequencies within blocks for estimates of m(γ), u(γ) and p. This results in biased u(γ) estimates, so only the m(γ) values from this solution are used. The u(γj), j = 1, 2, ..., k, probabilities are instead estimated as the probabilities of chance agreement on the jth attribute by carrying out a frequency analysis.
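
A toy sketch of this EM scheme under conditional independence (illustrative only; the initial values, iteration count and clamping constant are arbitrary choices, not Jaro's):

```python
import math

def em_fellegi_sunter(gammas, k, p=0.1, iters=20):
    """EM for the Fellegi-Sunter model with 0/1 agreement vectors and
    conditional independence. gammas: list of k-component 0/1 tuples.
    Returns estimates of m, u (per field) and the match proportion p."""
    eps = 1e-6
    m = [0.9] * k  # initial P(field j agrees | match)
    u = [0.1] * k  # initial P(field j agrees | non-match)
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match
        resp = []
        for g in gammas:
            pm = p * math.prod(m[j] if g[j] else 1 - m[j] for j in range(k))
            pu = (1 - p) * math.prod(u[j] if g[j] else 1 - u[j] for j in range(k))
            resp.append(pm / (pm + pu))
        # M-step: re-estimate p, m and u from the responsibilities
        total = sum(resp)
        p = total / len(gammas)
        m = [sum(r for r, g in zip(resp, gammas) if g[j]) / total
             for j in range(k)]
        u = [sum(1 - r for r, g in zip(resp, gammas) if g[j]) / (len(gammas) - total)
             for j in range(k)]
        m = [min(max(x, eps), 1 - eps) for x in m]  # keep away from 0/1
        u = [min(max(x, eps), 1 - eps) for x in u]
    return m, u, p

# 10 fully agreeing pairs among 100: EM recovers a match proportion near 0.1
gammas = [(1, 1)] * 10 + [(0, 0)] * 90
m, u, p = em_fellegi_sunter(gammas, 2)
```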

The standard record linkage algorithm [51, 52] requires the following inputs:

– Initial values for the m(γ) and u(γ) probabilities.
– Blocking attribute(s) for each blocking pass. Different blocking attributes are used in each pass. Blocking passes are described in more detail below.
– The thresholds for the weights to determine the three decision regions, A1, A2 and A3.
– The type of comparison functions.

The algorithm then goes through the following steps:

– Perform a blocking pass: select possible pairs for linking from the data sources using one or more blocking attributes. Note that there is an implicit assumption that comparisons not made due to blocking are non-match records. For example, if both data sources are sorted by postcode, the comparison pairs would consist of only records where postcodes agree. Methods for reducing errors due to blocking or for minimising blocking are described in Section 3.3.
– Estimate the matching probability, m(γ), and non-matching probability, u(γ), for the attributes used for matching within the stratum defined by the blocking attribute(s).
– Use the probabilities to calculate the matching weight for each attribute.
– Calculate the composite score from the weights of all the matching attributes.
– Determine whether a record pair is a match, non-match or possible match using the threshold levels on the composite score.
– Perform another blocking pass and the above steps on the remaining non-match record pairs until all blocking passes are done.
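
A single blocking pass with fixed weights can be sketched as follows (the field names, probabilities and thresholds are invented for illustration; a real implementation would re-estimate m and u within each stratum as described above):

```python
import math

def one_blocking_pass(source_a, source_b, blocking_key, fields, m, u,
                      t_upper, t_lower):
    """One pass of the standard algorithm: block on one attribute, score
    each candidate pair with log2 agreement/disagreement weights, and
    classify against the two thresholds."""
    blocks = {}
    for rec in source_b:
        blocks.setdefault(rec[blocking_key], []).append(rec)
    matches, possibles = [], []
    for rec_a in source_a:
        for rec_b in blocks.get(rec_a[blocking_key], []):
            score = 0.0
            for j, f in enumerate(fields):
                if rec_a[f] == rec_b[f]:
                    score += math.log2(m[j] / u[j])              # agreement
                else:
                    score += math.log2((1 - m[j]) / (1 - u[j]))  # disagreement
            if score >= t_upper:
                matches.append((rec_a, rec_b))
            elif score > t_lower:
                possibles.append((rec_a, rec_b))
    return matches, possibles

a = [{"postcode": "2601", "surname": "smith", "dob": "1970"}]
b = [{"postcode": "2601", "surname": "smith", "dob": "1970"},
     {"postcode": "2601", "surname": "jones", "dob": "1980"}]
matches, possibles = one_blocking_pass(a, b, "postcode", ["surname", "dob"],
                                       m=[0.95, 0.95], u=[0.05, 0.05],
                                       t_upper=4.0, t_lower=-4.0)
```

Records outside the "2601" block are never compared, which is exactly the implicit non-match assumption noted in the first step.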


3 Record Linkage System and Components

This section describes a record linkage system design largely following the TAILOR system [34]. The basic requirements for each key system component are then described. New approaches to implementing these components are considered and compared to the standard algorithm presented in Section 2.3.

3.1 Record Linkage System Design

Fig. 3. Layered Design for a Record Linkage System.

Figure 3 shows a layered design of a record linkage system. The Information Flow diagram of this record linkage system, following TAILOR [34], is shown in Figure 4. We now briefly discuss the components of this record linkage system.

The Standardisation component is arguably optional, depending on the quality of the data sources to be linked. Some successful record linkage systems do not separate standardisation [97], but rather incorporate it into the Comparison component. Approaches to standardisation are described in Section 3.2.

Blocking is used to reduce the number of comparisons of record pairs, and blocking strategies are described in Section 3.3.

The Comparison component performs the comparison of record pairs. Comparison methods are described in Section 3.5.


Fig. 4. Information Flow for a Record Linkage System.

The Decision component makes the decision of whether a record pair is a match, non-match or possible match. Alternative decision models are described in Section 3.6.

The Measurement component allows evaluation of the record linkage system along different performance criteria, which are described in Section 3.7.

The Graphical User Interface component should provide the following functionalities:

– Entry of the standard algorithm parameters.
– Assistance for identification and specification of common identifying attributes for use in the record linkage, as discussed in Section 3.4.
– Support for the presentation and interpretation of performance measurements.
– Convenient clerical review of possible matches, where the record linkage process is not fully automatic.

Alternative Record Linkage System Designs Elfeky et al. [34] present an alternative record linkage system model called the Induction Record Linkage Model. This model shows how training data, where available, can be incorporated in the record linkage system.

Other alternative record linkage systems that have been proposed include AJAX [40], WHIRL [24], Intelliclean [66], Merge/Purge [48], and SchemaSQL [63]. Some of the designs and architectures for the many available academic, government and commercial systems are found in Appendix A.

3.2 Standardisation Methods

Standardisation is also called data cleaning or attribute-level reconciliation. Without standardisation, many true matches could be wrongly designated as non-matches because the common identifying attributes do not have sufficient similarity [108].

The basic ideas of standardisation are:

– To replace many spelling variations of commonly occurring words with a standard spelling.
– To standardise the representation of various attributes to the same system of units or the same coding system. For example, 0/1 instead of M/F for a ‘gender’ attribute.
– To perform integrity checks on attribute values or combinations of attribute values.

Standardisation methods need to be specific to the population of an application and its data capture processes. For example, the most common name misspellings differ based upon the origin of the name. Standardisation optimised to handle names of Italian and Latin origin will therefore perform better on Italian names than generic standardisation [30].

3.3 Searching/Blocking

Searching or blocking is used to reduce the number of comparisons of record pairs by bringing potentially linkable record pairs together. A good blocking attribute should contain a large number of attribute values that are fairly uniformly distributed, and must have a low probability of reporting error.

Errors in the attributes used for blocking can result in failure to bring linkable record pairs together. For text attributes, various phonetic codes have been derived to avoid the effects of spelling and aural errors in recorded names. Common phonetic codes include Russell-Soundex and NYSIIS. These codes were optimised for specific populations of names and a specific type of English pronunciation. Some commercial systems provide tools to derive phonetic codes for specific populations worldwide [97].
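
A compact sketch of the Russell-Soundex code (the common four-character variant, including the rule that h and w do not separate letters carrying the same code):

```python
def soundex(name):
    """Russell-Soundex: one letter plus three digits, so that many
    spelling variants of the same name receive the same code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]

# Names with spelling variation map to the same code:
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

Blocking on such a code brings "Smith" and "Smyth" into the same block despite the spelling difference.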

Kelley [57] developed an algorithm for choosing the best blocking scheme in light of the trade-off between computation cost and the false non-match (negative) rate.


Sorted Neighbourhood Method (SNM) [48] SNM involves scanning the N sorted records from sources A and B using a window of fixed size w. Every pair of records falling within the window is compared, so SNM requires w × N record comparisons. Note that the error rate induced by SNM is critically dependent on the choice of sorting keys.
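
The scan can be sketched as below (the record layout and the surname-equality comparator are invented; any pair-comparison function could be plugged in):

```python
def sorted_neighbourhood(records, key, window, compare):
    """SNM: sort on a key, then compare each record only with the
    records inside a sliding window of fixed size."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            if compare(rec, other):
                pairs.append((rec, other))
    return pairs

records = [{"id": 1, "surname": "smith"}, {"id": 2, "surname": "jones"},
           {"id": 3, "surname": "smith"}, {"id": 4, "surname": "brown"}]
pairs = sorted_neighbourhood(records, key=lambda r: r["surname"], window=2,
                             compare=lambda a, b: a["surname"] == b["surname"])
```

Each record is compared with at most w − 1 following neighbours, giving the roughly w × N comparisons noted above.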

Multiple passes with independent sorting keys could be used to minimise the number of errors [48]. A transitive closure over matched record pairs can be computed to combine the results of independent passes [75]. An example of the circumstances where this is useful is as follows:

A matches B -> Drop B
C matches D -> Drop C
E does not match A or D, but would have matched B and C.
Result: A, D, E are kept separate when in reality they are all the same entity.

A problem with this multi-pass approach is that the number of false positives is increased, as they are propagated across each pass [79].
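
Combining the matched pairs of several passes by transitive closure can be sketched with a union-find structure (illustrative; record identifiers stand in for whole records):

```python
def transitive_closure(pairs):
    """Merge matched pairs from independent blocking passes into entity
    clusters by computing the transitive closure (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# A-B found in one pass and B-C in another collapse into one cluster
clusters = transitive_closure([("A", "B"), ("B", "C"), ("D", "E")])
```

The same mechanism is what propagates a false positive: one wrong pair welds two otherwise-separate clusters together.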

Priority Queue Method [76] This method is related to SNM, but sets of representative records belonging to recent clusters in the sorted record list are stored in a priority queue. Heuristics are needed to select these representative records of a cluster. The advantage is avoiding the need to sort the data sources for each blocking pass, which can save significant computational time for very large data sources. Winkler and Yancey [112] have implemented a similar approach in their Bigmatch system.

Blocking as Preselection [79] The idea behind preselection is based on quickly computed rejection rules, because almost all record pairs can be classified as non-matches through simple computation. Preselection is the application of adequate rejection rules to reduce the number of comparisons. Rejection rules can be derived from the comparison functions used and a training sample. The type of preselection comparison performed can be adjusted to control the trade-off between the time complexity of blocking and the required rate of misclassified pairs.

Canopy clustering is an efficient high-dimensional clustering method that can be used for preselection [72, 10].


3.4 Selection of Attributes for Matching/Comparison

Common attributes should be selected for use in the comparison function. The issues involved in the attribute selection process include:

– Identifying which attributes are common.
– Assessing whether the common attributes have sufficient information content to support the linkage quality required for the project. Information theory measures for assessing project viability have been proposed in [90, 27].
– Selecting the optimal subset of common attributes.

Attribute characteristics that affect the selection decision include the level of errors in attribute values and the number (and distribution) of attribute values, i.e. the information content of the attribute.

For example, a field such as gender has only two value states and consequently cannot impart enough information to identify a match uniquely. Conversely, a field such as surname imparts much more information, but it may frequently be recorded incorrectly.

One decision rule for attribute selection is to select all the available common attributes. Redundancy provided by related attributes can be useful in reducing matching errors. However, redundancy from attributes is useless if their errors are highly correlated or even functionally dependent.

The power of low quality personal identifiers can be enhanced by considering the semantics of fields such as cause of death and known co-morbidities [73].

3.5 Comparison

The probabilistic framework of Fellegi-Sunter requires the calculation of the comparison vector γ for each record pair. Jaro [51] considers comparison vectors consisting of only match/non-match (0/1) values. However, the comparison function can be extended to include categorical and continuous-valued attribute values as well [112].

Dey et al. [33] proposed a distance-based metric for attribute comparison. They also presented a probability-based metric in an earlier paper [32], but argue that probabilities are difficult to estimate accurately and that a distance-based metric is therefore more robust.

Text or string attributes are very commonly used as matching attributes. Several string comparators are therefore considered in the following.


String comparison in record linkage can be difficult because lexicographically “nearby” records look like “matches” when in fact they are not. For example, consider the following three strings:

"Apt 11,101 North Rd, Acton, 2601""Apt 12,101 North Rd, Acton, 2601""Apt 12,101 North St, Acton, 2601"

When transitive closure is applied to pairs of nearby records, incorrect results can often occur. In this example, it leads to the following conclusion:

Apt 11,101 North Rd, Acton, 2601
-> Apt 12,101 North St, Acton, 2601

The use of the semantics of the strings in domain-dependent comparison functions can help avoid this problem. In this example, Apt 11 and Apt 12 can be parsed and tagged as apartment numbers, and a comparison function can capture that a difference of even one digit is very significant, so the records should not be matched.

A string comparator function returns a value between 0 and 1 depending on the degree of the match of the two strings. Because pairs of strings often exhibit typographical variation (e.g., Smith and Smoth), record linkage needs effective string comparison functions that deal with typographical variations. Hernandez and Stolfo [47] discussed three possible distance functions for measuring typographic errors: edit distance, phonetic distance, and typewriter distance. They developed an Equational Theory involving a declarative rule language to express these comparison models [47].

Some alternative string comparison methods include:

– Manual construction [39].
– Jaro [51] introduced a string comparator that accounts for insertions, deletions, and transpositions. The basic steps of this algorithm include computing the string lengths and finding the number of common characters in the two strings and the number of transpositions. Jaro's definition of “common” is that the agreeing character must be within half of the length of the shorter string. Jaro's definition of transposition is that the character from one string is out of order with the corresponding common character from the other string. The string comparator value is given by the following formula:

C(s1, s2) = 1/3 × (Ncommon/Ls1 + Ncommon/Ls2 + (Ncommon − 0.5 × Ntranspositions)/Ncommon) (8)

where s1 and s2 are the two strings to be compared, with lengths Ls1 and Ls2 respectively, and Ncommon and Ntranspositions are the numbers of common characters and transpositions.
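
A sketch of the Jaro comparator (this follows the widely implemented convention of a matching window of half the longer string's length minus one, and counts a transposition as half a mismatched pair; details vary between implementations):

```python
def jaro(s1, s2):
    """Jaro comparator: common characters must lie within a matching
    window; transpositions are mismatched pairs of common characters."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    common1 = []
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                common1.append(ch)
                break
    if not common1:
        return 0.0
    common2 = [s2[j] for j in range(len(s2)) if used[j]]
    t = sum(a != b for a, b in zip(common1, common2)) / 2  # transpositions
    n = len(common1)
    return (n / len(s1) + n / len(s2) + (n - t) / n) / 3

print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```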

– Winkler [88] modified the original string comparator introduced by Jaro in the following three ways:

• A weight of 0.3 is assigned to a ‘similar’ character when counting common characters. Winkler's model of similar characters includes those that may occur due to scanning errors (“1” versus “l”) or key punch errors (“V” versus “B”).
• More weight is given to agreement at the beginning of a string. This is based on the observation that the fewest typographical errors occur at the beginning of a string and that the error rate then increases monotonically with character position through the string.
• The string comparison value is adjusted if the strings are longer than six characters and more than half the characters beyond the first four agree.

– N-gram distance [49, 44, 104]: The N-gram comparison function forms the set of all substrings of length n for each string. The N-gram method has been extended to Q-grams by Gravano et al. [44] for computing approximate string joins efficiently.

– Edit-distance based [87, 53]: This method uses edit distance, also known as Levenshtein distance [67], to compare two strings. Edit distance, a common measure of textual similarity, is the minimum number of edit operations (insertions, deletions, and substitutions) of single characters required to transform one string into another (i.e., make the two strings equal). A dynamic programming algorithm is used to find the optimal edit distance. The time complexity of this algorithm can be an issue for large databases [53].
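
The dynamic program can be sketched in a few lines (a single rolling row keeps memory proportional to one string's length):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming: prev[j] holds the
    distance between a prefix of s1 and s2[:j]."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("Smith", "Smoth"))     # 1
print(edit_distance("kitten", "sitting"))  # 3
```

The quadratic number of cell updates per pair is the time-complexity concern noted above for large databases.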

– Vector space representation of fields, such as TF-IDF [24]: Cohen uses the Information Retrieval representation of text and heuristics for ranking document similarity as a means to do schema integration without a conceptual global model. Cohen and Richman claim that TF-IDF often outperforms edit-distance metrics and is less computationally expensive [26, 25].

– Adaptive comparator functions [26]: This method learns the parameters of the comparator function using training examples. Zhu and Ungar [118] use a genetic algorithm to learn the edit operator costs for a string-edit comparator function.


3.6 Decision Models

Once matching weights of individual attributes of two records are calculated, the next step is to combine them to form a composite weight or score and then decide whether a record pair should be a match, non-match or possible match.

The simplest way to calculate the composite weight is to use the average of all the matching weights, assuming each attribute contributes equally. If some knowledge of the importance of individual attributes is available, a weighted (fixed) average can be used.

If the attribute value distribution for a field is not uniform, a value-specific (frequency-based), or outcome-specific, weight [110] can be introduced. For example, the surname “Smith” occurs more often than “Zabrinsky” and therefore a match on “Smith” carries less weight than a match on “Zabrinsky”. The basic idea is that an agreement on rarely occurring values of an attribute is better at distinguishing matches than an agreement on commonly occurring values of an attribute.
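A minimal sketch of this idea, assuming the chance of an accidental agreement on value v is approximated by v's relative frequency in the file, and taking a hypothetical m-probability of 0.95 (the function name and parameter are illustrative, not from [110]):

```python
import math
from collections import Counter

def value_specific_weights(values, m_prob=0.95):
    """Agreement weight log2(m/u_v) per value: u_v, the chance that two
    unmatched records agree on value v, shrinks with the value's rarity,
    so rare values earn larger agreement weights."""
    counts = Counter(values)
    n = len(values)
    return {v: math.log2(m_prob / (c / n)) for v, c in counts.items()}
```

With a surname column of fifty “Smith” records and one “Zabrinsky”, the weight for an agreement on “Zabrinsky” comes out far above that for “Smith”.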

Dey et al. [33] propose a method of directly eliciting the attribute weights from the user. The user ranks the attributes in order of their perceived predictive power, with a rank of 1 for the most predictive attribute, a rank of 2 for the second most predictive attribute, and so on. Several reasons are provided in the literature [6] in favour of rank-based surrogate weights over directly-elicited weights. The (surrogate) weights can be computed from these ranks; the rank-centroid method was chosen for converting the ranks to weights. Although satisfactory results were achieved by getting the ranks from the user, it would be difficult to apply this method if the user does not have enough knowledge or information about the attributes.
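The rank-centroid conversion has a closed form: with K attributes, the attribute ranked i receives weight w_i = (1/K) Σ_{j=i..K} 1/j [6]. A sketch:

```python
def rank_centroid_weights(k: int) -> list:
    """Rank-order-centroid weights for k attributes ranked 1 (most
    predictive) through k; the weights decrease and sum to 1."""
    return [sum(1.0 / j for j in range(i, k + 1)) / k
            for i in range(1, k + 1)]
```

For three attributes this yields weights of roughly 0.61, 0.28 and 0.11, so the top-ranked attribute dominates the composite score.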

Earlier we described the optimal rule for minimising the number of possible matches given desired Type I and Type II errors, assuming conditional independence. We also described the EM method with the conditional independence assumption. The EM method has been derived without assuming conditional independence [109, 74].

We now consider several alternative decision models for deciding whether a record pair should be a match, non-match or possible match.

Statistical Models Copas and Hilton [28] propose a matching algorithm that depends on the statistical characteristics of the errors which are likely to arise. The distribution of errors can be studied using a large training file of matched record pairs. Statistical models are fitted to a file of record pairs known to be correctly matched. These models are then used to estimate likelihood ratios. The advantage of this approach is that the model fit can be used to assess the validity of the modelling approach.

Predictive Models Predictive models for learning the parameters (threshold values and attribute weights) have recently been proposed. Adequate training data is needed to train these models [109, 64]. Proposed models have included:

– Logistic regression [87], although this was found not to work for census data in [64].

– Support vector machines [10].

– Decision trees [103].

Active learning techniques have also been proposed to optimise efficiency in the selection of training records [103].

Bayesian Decision Cost Model Verykios et al. [104] propose a Bayesian decision model for cost optimal record matching. Conventional models for record matching rely on decision rules that minimise the probability of error, i.e., the probability that a sample record pair is assigned to the wrong class. Because the misclassification of different samples may have different consequences, their decision model minimises the cost of making a decision rather than the probability of error in a decision.
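The cost-based rule can be sketched as follows, assuming a posterior match probability p for the record pair and a hypothetical cost table (the actions and cost values here are illustrative, not taken from [104]):

```python
def min_cost_decision(p_match, costs):
    """Return the action with the lowest expected cost.
    costs[action] = (cost if the pair is a true match,
                     cost if the pair is a true non-match)."""
    def expected_cost(action):
        c_match, c_nonmatch = costs[action]
        return p_match * c_match + (1 - p_match) * c_nonmatch
    return min(costs, key=expected_cost)
```

With costs such as {"link": (0, 10), "non-link": (10, 0), "review": (1, 1)}, pairs with mid-range match probabilities are routed to clerical review rather than risking an expensive misclassification, which is precisely the asymmetry a pure error-probability rule cannot express.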

3.7 Performance Measurement

Quality of record linkage can be measured in the following dimensions:

– The number of record pairs linked correctly (true positives), nm.

– The number of record pairs linked incorrectly (false positives, Type I error), nfp.

– The number of record pairs unlinked correctly (true negatives), nu.

– The number of record pairs unlinked incorrectly (false negatives, Type II error), nfn.

Along with the two ground truth numbers (the total number of true match record pairs, Nm, and the total number of true non-match record pairs, Nu), various measures from different perspectives can be defined from these dimensions. Several of these measures [43] are listed in the following:


– Sensitivity: nm/Nm, the number of correctly linked record pairs divided by the total number of true match record pairs.

– Specificity: nu/Nu, the number of correctly unlinked record pairs divided by the total number of true non-match record pairs.

– Match rate: (nm + nfp)/Nm, the total number of linked record pairs divided by the total number of true match record pairs.

– Positive predictive value (ppv): nm/(nm + nfp), the number of correctly linked record pairs divided by the total number of linked record pairs.

It can be seen that sensitivity measures the percentage of correctly classified record matches while specificity measures the percentage of correctly classified non-matches.

Two other performance measures widely used in the research field of information retrieval are precision and recall. Precision measures the purity of search results, or how well a search avoids returning results that are not relevant. Recall refers to the completeness of retrieval of relevant items. For record linkage, precision can be defined, in terms of matches, as the number of correctly linked record pairs divided by the total number of linked record pairs. Precision is therefore equivalent to the positive predictive value defined above. Similarly, recall is defined, in terms of matches, as the number of correctly linked record pairs divided by the total number of true match record pairs. As a result, recall is equivalent to sensitivity as defined above. Of course, precision and recall can also be defined in terms of non-matches. Alternatively, combined measures of precision and recall can be defined in terms of overall record pairs correctly classified (matches and non-matches).
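The measures above follow directly from the four outcome counts; a sketch, using Nm = nm + nfn and Nu = nu + nfp for the ground-truth totals:

```python
def linkage_measures(n_m, n_fp, n_u, n_fn):
    """Record linkage quality measures from the four outcome counts
    (true positives, false positives, true negatives, false negatives)."""
    N_m = n_m + n_fn   # total true match pairs
    N_u = n_u + n_fp   # total true non-match pairs
    return {
        "sensitivity (recall)": n_m / N_m,
        "specificity": n_u / N_u,
        "match rate": (n_m + n_fp) / N_m,
        "ppv (precision)": n_m / (n_m + n_fp),
    }
```

Note that the match rate can exceed 1 when more pairs are linked than truly match, which is why it is reported alongside precision rather than instead of it.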

Additional performance criteria for record linkage are in terms of time and the number of records requiring manual review:

– Time taken. The time complexity of a record linkage algorithm is usually dominated by the number of record comparisons performed. The time taken for sorting on a blocking key for very large data sources can also be extremely long.

– Number of records requiring clerical review. Manual review of records is time-consuming, expensive and can be error prone.

Controlling Error Rates Belin and Rubin [7] proposed a mixture model for estimating false match rates for given threshold values. This is currently the only method for automatically estimating record linkage error rates [109]. The method works well when there is good separation between the matching weights associated with matches and non-matches, but it requires the existence of previously collected and accurate training data. For situations where there is no good separation, methods that use more information from the matching process [107, 64] can be used to estimate the error rates.

4 Wider Issues in Record Linkage

We have focussed on methodological issues for record linkage so far. In this section, we briefly mention the legal and ethical issues that are integral to any substantial record linkage project involving individuals. In particular, a number of linkage protocols for minimising risk in the release of confidential information are discussed in Section 4.3.

4.1 Ethical and Legal Issues

Ethical privacy concerns and legislative obligations both need to be met in a record linkage study [80, 8, 60]. A good description of the issues, accepted processes and example documents can be found in [45]. A comprehensive review of the legal issues for data linkage in Australian jurisdictions can be found in [71].

4.2 Analytic Methods for Linked Data

Understanding of the error characteristics of linked data has been flagged as a critical limiting factor in allowing widespread application of analyses to linked data [59]. Wang and Donnan [105] propose methods for adjusting for missing records in outcome studies. The methods model the missingness mechanism and therefore give regression estimates with reduced bias and more accurate confidence intervals.

The effect of errors on registry-based follow-up studies is an importantissue [17].

4.3 Linkage Protocols to Maintain Confidentiality

For research-purpose linkage projects, personal identifying information is often not relevant to the research problem and can be removed from the linked dataset. One exception is where the research outcomes identify a risk-reduction action for individuals in the dataset; technical and ethical issues may then arise in following up with those individuals.


In some projects, anonymity of individuals prior to linkage is also desirable. Methods for achieving this involve encrypting or pseudonymising identifying information in the data sources in a consistent way prior to linkage [4, 89, 12]. Record linkage on encrypted identifying attributes is likely to decrease the linkage accuracy over time [77]. For example, if an individual changes their surname, and their surname is used as part of the linkage key, then the match will not be made with new records for that individual [45].

Two key ideas in the anonymised record linkage area [22] are:

– Separate the identifying information, such as name and address, from other information, such as clinical events or diagnoses, prior to linkage between data sources. This enables a separation between the data for record linkage and the data for the researcher. This idea requires trust of a linkage unit or other third-party body [62, 22, 58, 45].

– Individual identifier keys should have limited scope. The scope could be limited to a single research project [58].

A protocol implementing some of these ideas has been proposed for Australian government departments and agencies [45]. Limitations of this current protocol include:

– It is designed for one-off linkage projects rather than long-term longitudinal and linked data collection with permanent storage and incremental additions over time.

– It does not cover staged multi-program linkage projects, where data can be added from new programs over time.

4.4 Linkage Protocols for Distributed Databases

An Italian research project into data quality improvement for distributed systems has designed and is implementing a prototype for a Record Matcher and Record Improver [9, 5, 93].

It is argued that automation of blocking key selection is needed for multiple distributed source systems [5]. Quality metadata for the sources and the identification power of an attribute are used as input into the key selection algorithm.

5 Conclusions and Research Questions

Current record linkage methods perform well when the matching fields are well standardised and there are sufficient attributes for matching [109].


The key insight is that new techniques for modelling text from Information Retrieval, for performing similarity joins from database research, and for learning optimal decision boundaries from data mining can potentially improve the record linkage process in terms of manual effort and scalability [111].

Winkler identifies a key research question regarding the selection of matching/comparison method [109]. In principle, weighting attribute matching by frequency-based weights should be more informative than simple agree/disagree matching. However, for less frequent attribute values and for noisy data sources, this is not actually the case. A method for deciding between these approaches is needed.

Adaptive learning, where the comparator function is learnt, has also recently been proposed [26]. Predictive models from machine learning such as bagging methods and SVMs have been suggested for learning the match/non-match decision function (Section 3.6). Other learning methods for learning the comparator functions have also been proposed (Section 3.5). A direct comparison with the Fellegi-Sunter approach has not yet been done but would be worthwhile [26].

Another area of interest is avoiding the need to sort large datasets for blocking. This can be done by using recent developments in high-dimensional similarity joins [53]. These techniques use clever data structures to store records so that good candidates for matching are stored together, based on the agreed distance or probabilistic measure.

References

[1] W. Alvey and B. Jamerson, editors. Record Linkage Techniques – 1997. Federal Committee on Statistical Methodology, Washington, D.C., 1997.

[2] M.G. Arellano, G.R. Peterson, D.B. Petitti, and R.E. Smith. The California Mortality Linkage System (CAMLIS). Am. J. Pub. Health, 74:1324–30, 1984.

[3] Arkidata, Inc. Arkistra, 2003.

[4] A.L. Avins, W.J. Woods, B. Lo, and S.B. Hulley. A novel use of the link-file system for longitudinal studies of HIV infections: a practical solution to an ethical dilemma. AIDS, 7:109–113, 1993.

[5] R. Baldoni, C. Cappiello, C. Francalanci, B. Pernici, P. Plebani, M. Scannapieco, S.T. Piergiovanni, and A. Vigirillito. Dl4.a: Design and definition of the cooperative architecture supporting data quality, December 2001.

[6] F.H. Barron and B.E. Barrett. Decision Quality Using Ranked Attribute Weights. Management Science, 42(11):1515–1523, 1996.

[7] T.R. Belin and D.B. Rubin. A Method for calibrating false-match rates in record linkage. Journal of the American Statistical Association, 90(430):694–707, June 1995.

[8] G.B. Bell and A. Sethi. Matching Records in a National Medical Patient Index. Communications of the ACM, 44(9):83–88, 2001.


[9] P. Bertolazzi, L. DeSantis, and M. Scannapieco. Automatic Record Matching in Cooperative Information Systems. In Int. Workshop on Data Quality in Cooperative Information Systems, Jan 2003.

[10] M. Bilenko and R.J. Mooney. Learning to Combine Trained Distance Metrics for Duplicates Detection in Databases. Technical Report AI-02-296, University of Texas at Austin, Feb 2002.

[11] V.R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic segmentation of text into structured records. In SIGMOD Conference, 2001.

[12] F. Borst, F-A. Allaert, and C. Quantin. The Swiss Solution for Anonymous Chaining Patient Files. MEDINFO 2001, pages 1239–1241, 2001.

[13] A. Borthwick. Choicemaker Technologies, Inc., 2002.

[14] K.J. Brameld, C.D.J. Holman, M. Thomas, and A.J. Bass. Use of a state data bank to measure incidence and prevalence of a chronic disease: end-stage renal failure. American Journal of Kidney Disease, 34(6):1033–1039, 1999.

[15] K.J. Brameld, M. Thomas, C.D.J. Holman, A.J. Bass, and I.L. Rouse. Validation of linked administrative data on end-stage renal failure: application of record linkage to a ‘clinical base population’. Aust. NZ J. of Public Health, 23:464–467, 1999.

[16] P.P. Breitfeld, T. Dale, J. Kohne, S. Hui, and W.M. Tierney. Accurate case finding using linked electronic clinical and administrative data at a children’s hospital. Journal of Clinical Epidemiology, 54, 2001.

[17] H. Brenner, I. Schmidtmann, and C. Stemaier. Effect of record linkage errors on registry-based follow-up studies. Statistics in Medicine, 16:2633–43, 1997.

[18] F. Caruso, M. Cochinwala, U. Ganapathy, G. Lalk, and P. Missier. Telcordia’s Database Reconciliation and Data Quality Analysis Tool. In Proc. of the 26th Int. Conf. on Very Large Databases, 2000.

[19] A.L.P. Chen, P.S.M. Tsai, and J-L. Koh. Identifying Object Isomerism in Multidatabase Systems. Distributed and Parallel Databases, 4(2):143–168, 1996.

[20] P. Christen and T. Churches. Febrl: Freely extensible biomedical record linkage, release 0.2 edition, April 2003.

[21] P. Christen, T. Churches, et al. http://datamining.anu.edu.au/projects/linkage.html, 2003.

[22] T. Churches. A proposed architecture and method of operation for improving the protection of privacy and confidentiality in disease registers. BMC Medical Research Methodology, 3(1):1–13, 2003.

[23] T. Churches, P. Christen, K. Lim, and J.X. Zhu. Preparation of name and address data for record linkage using hidden Markov models. BMC Medical Informatics and Decision Making, 2, 2002.

[24] W.W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of ACM SIGMOD-98, pages 201–212, 1998.

[25] W.W. Cohen and J. Richman. Learning to Match and Cluster Entity Names. In Proc. of the ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.

[26] W.W. Cohen and J. Richman. Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. In SIGKDD’02, 2002.

[27] L.J. Cook, L.M. Olson, and J.M. Dean. Probabilistic Record Linkage: Relationships between File Sizes, Identifiers, and Match Weights. Methods of Information in Medicine, 40:196–203, 2001.


[28] J.B. Copas and F.J. Hilton. Record Linkage: Statistical Models for Matching Computer Records. Journal of the Royal Statistical Society Series A, 153:287–320, 1990.

[29] W.S. Dai, S. Xue, K. Yoo, J.K. Jones, and J. Labraico. An Investigation of the Safety of Midazolam use in Hospital. Pharmacoepidemiology and Drug Safety, 9, 1997.

[30] L. Dal Maso, C. Braga, and S. Franceschi. Methodology Used for Software for Automated Linkage in Italy (SALI). Journal of Biomedical Informatics, 34:387–395, 2001.

[31] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:185–197, 1977.

[32] D. Dey, S. Sarkar, and P. De. A Probabilistic Decision Model for Entity Matching in Heterogeneous Databases. Management Science, 44(10):1379–1395, 1998.

[33] D. Dey, S. Sarkar, and P. De. A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases. IEEE Transactions on Knowledge and Data Engineering, 14(3), 2002.

[34] M.G. Elfeky, V.S. Verykios, and A.K. Elmagarmid. TAILOR: A Record Linkage Toolbox. In Proc. of the 18th Int. Conf. on Data Engineering. IEEE, 2002.

[35] M.E. Fair. Recent Developments at Statistics Canada in the linking of complex health files. In Federal Committee on Statistical Methodology, Washington, D.C., 2001.

[36] I. Fellegi and A. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 64:1183–1210, 1969.

[37] D.R. Fletcher, M.S.T. Hobbs, P. Tan, R.L. Hockey, T.J. Pikora, M.W. Knuiman, H.J. Sheiner, A. Edis, and L.J. Valinsky. Complications following cholecystectomy - risks of the laparoscopic approach and protective effects of operative cholangiography: a population based study. Annals of Surgery, 229:449–57, 1999.

[38] M. Fortini, B. Liseo, A. Nuccitelli, and M. Scanu. On Bayesian record linkage. In Sixth International World Meeting on Bayesian Analysis, 2000.

[39] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An Extensible Data Cleaning Tool. In SIGMOD (demonstration paper), 2000.

[40] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative Data Cleaning: Language, Model and Algorithms. In Proc. of the 27th VLDB Conf., 2001.

[41] H. Galhardas. http://cosmos.inesc.pt/ hig/cleaning.html, 2003.

[42] L. Gill. Methods for Automatic Record Matching and Linking and their Use in National Statistics. Technical Report National Statistics Methodological Series No. 25, National Statistics, London, 2001.

[43] S. Gomatam, R. Carter, M. Ariet, and G. Mitchell. An empirical comparison of record linkage procedures. Statistics in Medicine, 21:1485–1496, 2002.

[44] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database. In Proc. 27th Int. Conf. on Very Large Data Bases, pages 491–500, 2001.

[45] Statistical Linkage Key Working Group. Statistical data linkage in Community Services data collections, 2002.

[46] J. Halliday, O. Griffin, A. Bankier, C. Rose, and M. Riley. Use of Record Linkage Between a Statewide Genetics Service and a Birth Defects/Congenital Malformations Register to Determine Use of Genetic Counselling Services. American Journal of Medical Genetics, 72, 1997.


[47] M.A. Hernandez and S.J. Stolfo. The Merge/Purge Problem for Large Databases. In Proc. of 1995 ACM SIGMOD Conf., pages 127–138, 1995.

[48] M.A. Hernandez and S.J. Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 1998.

[49] J.A. Hylton. Identifying and merging related bibliographic records, 1996.

[50] Info Route Inc. ClientMatch, 2003.

[51] M.A. Jaro. Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[52] M.A. Jaro. Probabilistic linkage of large public health datafiles. Statistics in Medicine, 14:491–498, 1995.

[53] L. Jin, C. Li, and S. Mehrotra. Efficient Record Linkage in Large Data Sets. In Int. Conf. on Database Systems for Advanced Applications (DASFAA), Tokyo, Japan, March 2003.

[54] C. Johansen, J.D. Boice, J.K. McLaughlin, and J.H. Olsen. Cellular telephones and cancer - a nationwide cohort study in Denmark. J. Natl. Cancer Inst., 93:203–7, 2001.

[55] J.A. Johnson and S.M. Wallace. Investigating the Relationship Between β-Blocker and Antidepressant Use Through Linkage of the Administrative Databases of Saskatchewan Health. Pharmacoepidemiology and Drug Safety, 6, 1997.

[56] S.B. Jones. Linking Databases in Biotechnology. Neuroimage, 4, 1996.

[57] R.P. Kelley. Blocking considerations for record linkage under conditions of uncertainty. In Proceedings of the Social Statistics Section, American Statistical Association, pages 602–605, 1984.

[58] C.W. Kelman, A.J. Bass, and C.D. Holman. Research use of linked health data - a best practice protocol. Aust N Z J Public Health, 26:251–255, 2002.

[59] B. Kilss and W. Alvey, editors. Record Linkage Techniques - 1985. Proceedings of the Workshop on Exact Matching Methodologies in Arlington, Virginia May 9-10. Internal Revenue Service Publication, Washington, DC, 1985.

[60] N.R. Kingsbury et al. Record Linkage and Privacy: Issues in Creating New Federal Research and Statistical Information, 2001.

[61] S. Kohli, K. Sahlen, O. Lofman, A. Sivertun, M. Foldevi, E. Trell, and O. Wigertz. Individuals living in areas with high background radon: a GIS method to identify populations at risk. Computer Methods and Programs in Biomedicine, 53, 1997.

[62] R.L. Kruse, B.G. Ewigman, and G.C. Tremblay. The Zipper: a method for using personal identifiers to link data while preserving confidentiality. Child Abuse & Neglect, pages 1241–1248, 2001.

[63] L. Lakshmanan, F. Sadri, and I. Subramanian. Schema-SQL - A Language for Interoperability in Relational Database Systems. In Proc. of VLDB, 1999.

[64] M.D. Larsen and D.B. Rubin. Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96(453):32–41, March 2001.

[65] D. Lawrence, C.D.J. Holman, A.V. Jablensky, and S.A. Fuller. Suicide rates in psychiatric inpatients: an application of record linkage to mental health research. Aust. NZ J. of Public Health, 23:468–470, 1999.

[66] M.L. Lee, T.W. Ling, and W.L. Low. Intelliclean: A knowledge-based intelligent data cleanser. In Proc. of the Sixth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 290–294, 2000.


[67] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.

[68] E. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. In IEEE International Conference on Data Engineering, pages 294–301, 1993.

[69] S. Liu and S. Wen. Development of Record Linkage of Hospital Discharge Data for the Study of Neonatal Readmission. Chronic Diseases in Canada, 20(3), 2000.

[70] N. Maconochie, P. Doyle, E. Roman, E. Davies, P.G. Smith, and V. Beral. The nuclear family study: linkage of occupational exposures to reproduction and child death. British Medical Journal, 93:203–7, 2001.

[71] R.S. Magnusson. Data Linkage, Health Research and Privacy: Regulating Data Flows in Australia’s Health Information System. Sydney Law Review, 24(5), 2002.

[72] A. McCallum, K. Nigam, and L. Ungar. Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. In Proc. of the Sixth Int. Conf. on KDD, pages 169–170, 2000.

[73] M.C.M. Mcleod, C.A. Bray, S.W. Kendrick, and S.M. Cobbe. Enhancing the power of record linkage involving low quality personal identifiers: use of the best link principle and cause of death prior likelihoods. Compt. Biomed. Res., 31:257–270, 1998.

[74] X. Meng and D.B. Rubin. Maximum Likelihood via the ECM Algorithm: A General Framework. Biometrika, 80:267–278, 1993.

[75] A.E. Monge. Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4), 2000.

[76] A.E. Monge and C.P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proc. of the ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, 1997.

[77] A.G. Muse, J. Mikl, and P.F. Smith. Evaluating the quality of anonymous record linkage using deterministic procedures with the New York State AIDS registry and a hospital discharge file. Statistics in Medicine, 14:499–509, 1995.

[78] M. Neiling, S. Jurk, H. Lenz, and F. Neumann. Object Identification Quality. In Intl. Workshop on Data Quality in Cooperative Information Systems, 2003.

[79] M. Neiling and R.M. Muller. The good into the Pot, the bad into the Crop. Preselection of Record Pairs for Database Fusion. In Proc. of the First International Workshop on Database, Documents, and Information Fusion, Magdeburg, Germany, 2001.

[80] C.I. Neutel. Privacy Issues in Research Using Record Linkage. Pharmacoepidemiology and Drug Safety, 6:367–369, 1997.

[81] H.B. Newcombe and J.M. Kennedy. Record Linkage. Communications of the Association for Computing Machinery, 5:563–566, 1962.

[82] H.B. Newcombe, J.M. Kennedy, S.J. Axford, and A.P. James. Automatic Linkage of Vital Records. Science, 130:954–959, 1959.

[83] T.B. Newman and A.N. Brown. Use of commercial record linkage software and vital statistics to identify patient deaths. J. Am. Med. Inform. Assoc., 4:233–237, 1997.

[84] P.E. Norman, J.B. Semmens, M.M.D. Lawrence-Brown, and C.D.J. Holman. Long-term survival after surgery for abdominal aortic aneurysm in Western Australia: population based study. British Medical Journal, 317:852–856, 1998.


[85] The West of Scotland Coronary Prevention Study Group. Computerised Record Linkage: Compared With Traditional Patient Follow-up Methods in Clinical Trials and Illustrated in a Prospective Epidemiological Study. Journal of Clinical Epidemiology, 48(12), 1995.

[86] K.M. Patterson, C.D.J. Holman, and D.R. English. First-time hospital admissions with illicit drug problems in indigenous and non-indigenous West Australians: an application of record linkage to public health surveillance. Aust. NZ J. of Public Health, 23:460–463, 1999.

[87] J.C. Pinheiro and D.X. Sun. Methods for linking and mining massive heterogeneous databases. In Fourth Int. Conf. on Knowledge Discovery and Data Mining, 1998.

[88] E.H. Porter and W.E. Winkler. Approximate string comparison and its effect on an advanced record linkage system. In Proc. of an International Workshop and Exposition - Record Linkage Techniques, Arlington, VA, USA, 1997.

[89] C. Quantin, F-A. Allaert, and L. Dussere. Anonymous statistical methods versus cryptographic methods in epidemiology. Int. J. Med. Inf., 60:177–183, 2000.

[90] L.L. Roos and A. Wajda. Record Linkage Strategies. Methods of Information in Medicine, 30:117–123, 1991.

[91] D.L. Rosman. The Western Australian Road Injury Database (1987-1996): ten years of linked police, hospital and death records of road crashes and injuries. Accident Analysis and Prevention, 33:81–88, 2001.

[92] Sagent, Inc. Centrus Merge/Purge, 2003.

[93] M. Scannapieco, M. Mecella, T. Catarci, C. Cappiello, B. Pernici, F. Mazzoleni, and F. Stella. DL3: Comparative Analysis of the Proposed Methodologies for Measuring and Improving Data Quality and Description of an Integrated Proposal, 2003.

[94] SearchSoftwareAmerica. Introduction to SSA-NAME3, V2.0, 2001.

[95] SearchSoftwareAmerica. An Introduction to Identity Systems (IDS), 2002.

[96] SearchSoftwareAmerica. Introduction to the Data Clustering Engine, V2.2, 2002.

[97] SearchSoftwareAmerica. Solving the Company Name Search & Matching Problem, 2003.

[98] J.B. Semmens, P.E. Norman, M.M.D. Lawrence-Brown, and C.D.J. Holman. The influence of gender on the outcomes of ruptured abdominal aortic aneurysm. British Journal of Surgery, 87:191–4, 2000.

[99] J.B. Semmens, C. Platell, T. Threlfall, and C.D.J. Holman. A population-based study of the incidence, mortality and outcomes following surgery for colorectal cancer in Western Australia. Aust. NZ J. of Surgery, 70:11–18, 2000.

[100] J.B. Semmens, Z.S. Wisniewski, C.D.J. Holman, A.J. Bass, and I.L. Rouse. Trends in repeat prostatectomy after surgery for benign prostate disease: application of record linkage to health care outcomes. British Journal of Urology, 84:972–975, 1999.

[101] Group 1 Software. DataSight, 2003.

[102] Data Quality Solutions. LinkageWiz, 2002.

[103] S. Tejada, C.A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems, 26:607–633, 2001.

[104] V. Verykios, G.V. Moustakides, and M.G. Elfeky. A Bayesian decision model for cost optimal record matching. The VLDB Journal, 2002.

[105] J.X. Wang and P.T. Donnan. Adjusting for missing record linkage in outcome studies. Journal of Applied Statistics, 29(6):873–884, August 2002.


[106] Y.R. Wang and S. Madnick. The Interdatabase Instance Identification Problem in Integrating Autonomous Systems. In Proc. Fifth Intl. Conf. Data Eng., pages 46–55, 1989.

[107] W.E. Winkler. Advanced Methods for Record Linkage. In Proceedings of the Section of Survey Research Methods, American Statistical Association, pages 467–472, 1994.

[108] W.E. Winkler. Matching and Record Linkage. In Cox et al., editor, Business Survey Methods. J. Wiley & Sons Inc., 1995.

[109] W.E. Winkler. The State of Record Linkage and Current Research Problems. Technical Report RR/1999/04, Statistical Research Report Series, US Bureau of the Census, Washington DC, 1999.

[110] W.E. Winkler. Frequency-Based Matching in Fellegi-Sunter Model of Record Linkage. Technical Report RR/2000/06, Statistical Research Report Series, US Bureau of the Census, Washington DC, 2000.

[111] W.E. Winkler. Machine Learning, Information Retrieval and Record Linkage. In Proc. of the Section on Survey Research Methods, American Statistical Association, 2000.

[112] W.E. Winkler. Quality of Very Large Databases. Technical Report RR/2001/04, Statistical Research Report Series, US Bureau of the Census, Washington DC, 2001.

[113] W.E. Winkler. Record Linkage Software and Methods for Merging Administrative Lists. Technical Report RR/2001/03, Statistical Research Report Series, US Bureau of the Census, Washington DC, 2001.

[114] www.ascentialsoftware.com. Integrity XE, 2003.

[115] www.innovativesystems.com. I/analytics, 2003.

[116] www.trilliumsoft.com. Trillium Software System, 2003.

[117] W.E. Yancey and W.E. Winkler. Advanced Methods of Record Linkage, October 2002.

[118] J.J. Zhu and L.H. Ungar. String Edit Analysis for Merging Databases. In KDD 2002 Workshop on Text Mining, 2002.

A Record Linkage Software and Systems

The following list of record linkage software and systems is incomplete. The list was influenced by the excellent web resources at ANU [21] and INRIA [41].

A.1 Record Linkage Software: Government and Academic

U.S. Bureau of the Census Software (GDRIVER) This software parses name and address strings into their individual components and presents them in a standard format. Several reference files are used to assist this process. Some of these reference files contain lists of tokens that influence the way parsing is carried out. For example, if a conjunction token is found, the software must then consider producing more than one standardized record.


A pattern reference file is utilized for both name and address standardization. These files contain patterns of tokens that may be present in the name or address strings. The documentation claims that the software can “recognize the various elements in whatever order they occur” [117]; however, that order must also be present in the pattern file.

These reference files list various known characteristics of the name fields, such as mappings from nicknames to a standard name, and prefixes and suffixes used for surnames and occupations. These files can be modified to incorporate characteristics of the particular data. For address standardisation, there is also a collection of files that store key address words, their variant spellings, and the various address patterns to be recognised. For example, “Ave” is an abbreviation of “Avenue” and is a street type, while “RD” could be an abbreviation of “Road” and a street type, or a rural route type in the US.

The reference files capture name and address standardization knowledge from the U.S. Census Bureau. Users of this software may find that their names and addresses have features that differ from those dealt with by the Census Bureau. In this case, the reference files would need to be edited to suit the specific data sources.
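The role of such a token reference file can be illustrated with a minimal sketch. The table and function names below are ours, not the Census Bureau's, and cover only the two examples above:

```python
# Illustrative only: a tiny "reference file" mapping raw address tokens to
# (standard form, token type) candidates. An ambiguous token such as "RD"
# yields several candidates; downstream matching against the address
# patterns file would be needed to pick one reading.
ADDRESS_TOKENS = {
    "AVE": [("AVENUE", "street type")],
    "RD": [("ROAD", "street type"), ("RURAL ROUTE", "rural route type")],
}

def standardise(token):
    """Return every (standard form, token type) reading of a raw token."""
    key = token.strip().upper().rstrip(".")
    return ADDRESS_TOKENS.get(key, [(key, "unknown")])

print(standardise("Ave"))  # [('AVENUE', 'street type')]
print(standardise("RD"))   # two candidate readings
```

Unknown tokens pass through unchanged, which mirrors the need to edit the reference files when a data source contains vocabulary the shipped files do not cover.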

Febrl - Freely Extensible Biomedical Record Linkage The current publicly released version of Febrl (0.2) provides data standardisation and probabilistic record linkage, with a choice of blocking methods and comparison functions. Parallelisation is also supported. Febrl's data standardisation “primarily employs a supervised machine learning approach implemented through a novel application of hidden Markov models (HMMs)” [20].

HMMs require training: suitable training data must be selected and the HMMs learnt. In the example provided in [23], an HMM for addresses took under 20 hours to train. A separate HMM is required for name standardisation. It is argued in [23] that the development of rule sets to do the same task would take “at least several person-weeks of programming time”.

The approach in Febrl was influenced by Datamold, an approach developed by Borkar et al. [11] for address standardisation using nested HMMs.
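To make the HMM idea concrete, the sketch below tags address tokens with hidden states using the Viterbi algorithm. The states, probabilities and token tests here are invented for illustration and are far simpler than Febrl's trained models:

```python
import math

# Toy HMM for address segmentation: three hidden states and hand-set
# probabilities (in Febrl these would be learnt from training data).
STATES = ["number", "street_name", "street_type"]
START = {"number": 0.8, "street_name": 0.2, "street_type": 0.0}
TRANS = {
    "number":      {"number": 0.1, "street_name": 0.9, "street_type": 0.0},
    "street_name": {"number": 0.0, "street_name": 0.3, "street_type": 0.7},
    "street_type": {"number": 0.0, "street_name": 0.0, "street_type": 1.0},
}

def emit(state, token):
    """Hand-set emission probabilities based on simple token tests."""
    if state == "number":
        return 0.9 if token.isdigit() else 0.05
    if state == "street_type":
        return 0.9 if token.lower() in {"st", "rd", "ave"} else 0.05
    return 0.4  # street_name emits any token with moderate probability

def viterbi(tokens):
    """Most likely state sequence for a token sequence (log-space Viterbi)."""
    best = {s: (math.log(START[s] or 1e-12) + math.log(emit(s, tokens[0])), [s])
            for s in STATES}
    for token in tokens[1:]:
        new = {}
        for s in STATES:
            score, path = max(
                (best[p][0] + math.log(TRANS[p][s] or 1e-12), best[p][1])
                for p in STATES)
            new[s] = (score + math.log(emit(s, token)), path + [s])
        best = new
    return max(best.values())[1]

print(viterbi(["42", "wattle", "st"]))  # one state per token
```

The appeal of the approach is visible even at this scale: adding a new address pattern means supplying more training examples, not writing new parsing rules.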

GRLS Statistics Canada has developed the Generalized Record Linkage System (GRLS) and Match360. Some of the features include [35]:

– Runs on UNIX and Oracle.


– Based on Fellegi-Sunter linkage methodology.
– Has a graphical interface.
– Allows multiple concurrent users.
– Allows user-defined rules, which are programmed in C.
– Linked records can be grouped into ‘weak’ and ‘strong’ groups.
– Allows refinement of weights and thresholds.
– On-line help is available.
– Has NYSIIS and Soundex rules built-in.
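Soundex, mentioned here and in several of the products below, is simple enough to sketch in full. This is the standard American Soundex coding (our own implementation, not GRLS's):

```python
# American Soundex: a name is coded as its first letter plus three digits,
# so spelling variants like Smith/Smyth receive the same code.
_SOUNDEX = {}
for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")]:
    for ch in letters:
        _SOUNDEX[ch] = digit

def soundex(name):
    """Return the four-character American Soundex code of a name."""
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = _SOUNDEX.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # h and w do not separate duplicate codes
        code = _SOUNDEX.get(ch, "")  # vowels map to "" and break runs
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Codes this coarse group many unrelated names together, which is one reason the SSA critique quoted later argues that no single such algorithm suits all data.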

A.2 Record Linkage Software: Commercial

Ascential Software, Trillium and Innovative Systems, Inc. have a customer marketing or customer relationship management (CRM) focus. SearchSoftwareAmerica Inc. is more broadly based.

Telcordia has an in-house database reconciliation and data qualityanalysis tool [18].

Ascential Software [114] Ascential Software bought Vality Technology. Its product is Integrity XE, whose features include:

– Probabilistic matching technology with a full spectrum of fuzzy matching capabilities.
– Thorough data investigation and analysis processes.
– Flexibility through customizability to an organization's business rules, with intuitive rules definition and interactive testing.

Trillium [116] The Trillium Software System is a set of tools for inspecting, identifying, standardizing and linking data. Components of the tools include:

– Parser: to parse, standardize and verify data.
– Matcher.
– GeoCoder.
– DataBrowser: a GUI for analyzing data integrity issues.

Innovative Systems, Inc. [115] Innovative Systems, Inc.'s data linking product is called i/Lytics. Its features include:

– Parsing and standardizing capabilities.
– User-defined comparison fields.
– Customizable ranking methodology.


Sagent [92] Sagent’s merge/purge product is called Centrus.

Choicemaker Inc. Choicemaker [13] uses many comparison functions and combines them using maximum entropy to choose weights.

Search Software America Search Software America (SSA)'s ten-person research and development team is based in Canberra. SSA has 500 customers world-wide in customer integration, criminal intelligence, tax collection, criminal investigation and marketing systems [97].

SSA’s three key products are:

– SSA-NAME3 [94].
– Identity Systems (IDS) [95] uses SSA-NAME3 for database data. High-performance indexes are automatically maintained without changes to existing application programs. IDS is used to ‘centrally index identity information from disparate source tables, databases and computers...allowing the central index to search, match, rank, group and maintain all the data simultaneously and in real-time’ [95].
– Data Clustering Engine (DCE) [96]. This is a stand-alone batch data grouping and investigation engine for all forms of identification data. The DCE performs the transitive closure computation, grouping records that are connected through chains of matched pairs even when they did not match directly.
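The transitive closure step can be sketched with a union-find structure. This is an illustrative implementation of the general technique, not SSA's code:

```python
def transitive_closure(pairs):
    """Group records into clusters given matched pairs (union-find sketch)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in pairs:
        union(a, b)

    # Collect records by their cluster representative.
    groups = {}
    for record in parent:
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())

# 'a' and 'c' never matched directly, yet land in the same group
# through the chain a-b, b-c:
print(transitive_closure([("a", "b"), ("b", "c"), ("d", "e")]))
```

Transitive grouping is a double-edged sword: one false match can merge two otherwise distinct clusters, which is why such engines are typically paired with review tools.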

SSA-NAME3 [94] is the core product, providing name matching and search capabilities. This product is differentiated by its custom population databases for almost all countries and alphabets worldwide. SSA-NAME3 supports applications that need to match data using one or more of the following data types:

– Names, addresses and descriptions.
– Identification codes.
– Dates.
– Other attributes such as phone numbers, sex, eye colour, regions, etc.

There are a few names (family or first names) that occur very frequently in a population. For instance, Smith or Williams can account for more than 1% of a population with 50,000 distinct names [94]. SSA-NAME3 treats common words, codes and tokens in a different manner to uncommon values, to ensure good performance at each extreme of the name distribution.


An SSA critique of text searching (wild-card and n-gram indexing, and the NYSIIS and Soundex phonetic algorithms) argues that no single word stabilization algorithm is suitable for all data, even within a single country. A suitable algorithm depends on the true distribution of the errors and variation in both the population of file data and the population of the search data [97].

LinkageWiz LinkageWiz [102] was developed within the South Australian Department of Human Services. It has been applied to a data quality review of the statewide Clinical Data Repository and Enterprise Patient Master Index.

Technical features include:

– Linkage on names or Medicare numbers.
– Probabilistic matching algorithms.
– Phonetic name matching (NYSIIS and Soundex).
– Value-specific weights for attributes, i.e. ‘Smith’ gets less weight than ‘Ittak’.
– Nickname mapping.
– Identification of default values and other potential data quality problems.
– User-definable fields and weights.

LinkageWiz prices range from $1,000 (for a 10,000-record database limit) upwards.

Info Route Inc. Info Route Inc. [50] has the following products:

– AddressAbility.
– NamePro.
– ClientMatch.

The company is based in Oakville, Ontario, Canada, and has a Canadian data focus. Its products are targeted at marketing applications.

Arkidata Arkidata's product is called Arkistra [3]. It is designed to efficiently integrate information from multiple systems while ensuring the quality of that information. Arkistra is an integrated set of components that provides:

– Desktop applications for managing data loading, transformation and cleaning.


– Analysis tools for optimising data cleansing and information integration using a business-rule-based method.

– A processing engine that optimises the data transformation process using decision tree management.

Arkidata is based in Illinois, USA.

Group 1 Software Group 1 Software [101] has a suite of products for matching data, such as DataSight and Merge/Purge Plus. Group 1's headquarters is in Lanham, Maryland, USA. Its products have a CRM focus; for example, they are integrated with Siebel's CRM solution.

B Research Projects Utilising Record Linkage

The case studies here are a sample of record linkage applications in epidemiological, health services and social science research. The coverage is not complete and includes just the studies with which we are familiar. The case studies give an indication of the uses of record linkage in the public health and social science research domains. A comprehensive bibliography on record linkage using administrative and survey records can be found elsewhere [59, 1].

– The California Mortality Linkage System [2].
– The west of Scotland coronary prevention study [85].
– Linkage of biotechnology databases [56].
– Hospitalisations and vital statistics [83].
– Background radon risk [61].
– β-Blocker and antidepressant use [55].
– Genetic counselling and birth defects [46].
– Midazolam use in hospital safety study [29].
– Obesity, hypertension and kidney cancer [73].
– Cellular phones and cancer cohort study [54].
– Nuclear industry family study [70].
– Neonatal readmissions [69].
– Child cancer patient fever and neutropenia admission [16].
– WA Linked Data Applications
  • Incidence and prevalence of end-stage renal failure [14, 15].
  • Trends in suicide rates of psychiatric patients [65].
  • Trends in first-time admissions for illicit drug problems [86].
  • Complications following gallbladder surgery [37].
  • Outcomes following colorectal cancer surgery [99].


  • Outcomes following a ruptured abdominal aortic aneurysm [98, 84].
  • Trends in repeat prostatectomy [100].
  • Rosman describes the linking of Western Australian police, hospital and death records to create a road injury database [91]. This linkage was done without the use of names, instead using fields such as injury type, severity and treatment.

Test Datasets Testing of record linkage methods on publicly available databases is an important means of enabling comparison of results between research and development teams.

– G1 data generator [47, 9].
– The Berlin-Online-Apartment-Advertisements-Database [78].
– DBGen, a public domain database generator [34, 75].
– CORA, Computer Science bibliography [26].
– RESTAURANT, New York restaurants extracted from travel guides [103].
– DBLP, www.informatik.uni-trier.de/ley/db/index.html [75].
– IMDB, Internet Movie Database, www.imdb.com [53].
– Die Vorfahren DB, feefhs.org/dpl/dv/indexdv.html [53].