
Single Error Analysis of String Comparison Methods


Peter Christen

Department of Computer Science, Australian National University, Canberra ACT 0200, Australia
[email protected]

Abstract. Comparing similar strings is at the core of many applications, including Web and text mining, information retrieval, bioinformatics, and deduplication and data linkage systems. Due to variations in the data, like typographical errors, missing or swapped words, exact string comparison can often not be used in such applications. Instead, approximate string comparison methods are employed that either return an approximate comparison value or phonetically encode the strings before comparing them exactly. In this paper we present several approximate string comparison methods, and analyse their performance regarding different possible types of single errors, like inserting, deleting, or substituting a character, transposing two adjacent characters, as well as inserting or deleting a whitespace, and swapping of two words. The results show that commonly used approximate string comparison methods perform differently for different error types, and that they are sensitive to user definable thresholds. We also show that swapped words are the hardest type of error to classify correctly for many string comparison methods, and we propose two new methods that deal with this problem.

Keywords: approximate string comparisons, similarity measures, typographical errors, phonetic encodings, text mining and data linkage.

1 Introduction

Comparing strings is at the core of many applications dealing with text, including Web and text data mining, information retrieval, search engines, spell checkers, name searching, information extraction, and sequence comparisons in bioinformatics. In many cases one is not only interested in exact string comparisons, but rather in an approximate measure of how similar two strings are. In bioinformatics, for example, one is interested in comparing long sequences of protein or genome data in order to find similar sub-sequences. In data linkage and deduplication [2, 15, 17], the application area we are mainly interested in, shorter name strings are being compared in order to find records that belong to the same entity (e.g. a customer, patient or business). As reported in [15], the use of approximate string comparison methods does improve the matching accuracy in these applications.


Variations in strings (and especially names) are due to the fact that most real world data is dirty [6], which means such data can contain noisy, incomplete and incorrectly formatted information. Names and addresses are especially prone to phonetical, typographical and other data entry errors. [9] classifies character level (or non-word) misspellings as (1) typographical errors, where it is assumed that the person doing the data entry does know the correct spelling of a word but makes a typing error (e.g. 'sydeny' instead of 'sydney'); (2) cognitive errors, assumed to come from a lack of knowledge or misconceptions; and (3) phonetic errors, coming from substituting a correct spelling with a similar sounding one (for example 'gail' and 'gayle'). Depending upon the mode of data entry [9], for example manually typed, scanned, or automatic voice recognition, there will be different error characteristics. OCR (optical character recognition) data entry [5, 14] can lead to substitution errors between similar looking characters (e.g. 'q' and 'g'), while keyboard based data entry can result in wrongly typed neighbouring keys. Data entry over the telephone (for example as part of a survey study) will mostly lead to phonetical errors.

While for many regular words there is only one correct spelling, there are often different written forms of proper names, as the example above shows, with none of these forms being wrong. Additionally, personal information (like names and addresses) is often reported differently by the same person depending upon the organisation they are in contact with (for example, somebody might give his name as 'Bill' in many day-to-day transactions, but on official documents he will write down his official name 'William'). This can lead to differently recorded names, omitted parts (missing middle name or only initials given), or sometimes swapped names (given name and surname interchanged).

In a study on spelling errors published in 1964, Damerau [4] found that over 80% of errors were single errors: either (1) a letter was missing, (2) an extra letter had been introduced, (3) a letter had been substituted by another letter, or (4) two adjacent letters had been transposed. Substitutions were the most common errors, followed by deletes, then inserts and finally transpositions, followed by multiple errors. Other studies [5, 9, 14] reported similar results.

In the study presented in this paper we analysed and compared different approximate string comparison methods according to their performance on the four types of single errors presented above. We additionally looked at three word level error types, namely (5) inserting a whitespace (splitting a word), (6) deleting a whitespace (merging two words), and (7) swapping two words. Our main interest is how different string comparison methods perform on these seven common types of errors using name strings.

2 Methods

All string comparison methods analysed in this study, as well as the data generator used to create the test data set, are implemented as part of the Febrl [2] open source data linkage system.


2.1 String Comparators

Many different approximate string comparison methods have been developed [1, 3–5, 10, 11, 14–19]. Some are generic, others are application specific (e.g. optimised for long genome sequences, or proper names), and some even language dependent. Some comparison methods calculate the distance, others the similarity between two strings. The distance between two strings s1 and s2 can be defined as the minimum number of operations that transform s1 into s2 [11]. A similarity measure, on the other hand, usually produces a real number σ(s1, s2), with a value of 1.0 for identical strings, and ranging down to 0.0 for increasingly different strings [5]. In our study we converted string distances into similarity measures as detailed below.

In recent years machine learning based string comparators have been developed which use training data to find a model or weights for improved comparison results. These include methods for learning edit distance weights [1, 18], or TFIDF (the cosine similarity commonly used in information retrieval) based approaches [3]. Because trained string comparison methods are application and data set dependent, and because training data is often not available within data linkage and deduplication systems, we restrict our study to methods that do not need any training data, nor have access to the full data set(s) before performing the comparisons (as is needed, for example, by TFIDF to calculate term and document frequencies).

Other methods which are often used for interactive string searching (for example within a search engine) include wildcards and regular expressions. As these approaches need to be user specified, they cannot be used within a deduplication or data linkage system, where the aim is to automatically find a similarity measure for a large number of pairs of strings.

In our study we considered the following string comparison methods.

– Truncate

This method simply truncates a string at a certain length and only considers the beginning using exact comparison (returning 1.0 if the truncated strings are the same and 0.0 otherwise). In our experiments we truncated strings to a length of 4 characters. This idea is based on a large study [14] which found that errors often occur towards the end of names, while the beginnings are more likely to be correct.
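A minimal sketch of this comparator in Python (the function name and the default truncation length are ours, used for illustration only):

    def truncate_compare(s1, s2, length=4):
        # Exact comparison of the first 'length' characters only.
        return 1.0 if s1[:length] == s2[:length] else 0.0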

– Key-Difference

The key-difference comparator counts the number of different characters at each string position. Its start value is the difference in the string lengths, and for each character s1[i] ≠ s2[i], with 0 ≤ i < min(len(s1), len(s2)), the key difference is increased by one. For example, the difference between 'peter' and 'pete' is 1, and the difference between 'peter' and 'petra' is 2. We set the maximum tolerated key difference value kmax = 2, and with key difference k (0 ≤ k ≤ kmax), the similarity measure is calculated as

KeyDiff(s1, s2) = 1.0 − k / (kmax + 1)
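A minimal sketch of this measure (assuming, as described above, that a key difference larger than kmax yields a similarity of 0.0):

    def key_diff_compare(s1, s2, k_max=2):
        # Start value is the difference in string lengths.
        k = abs(len(s1) - len(s2))
        # Add one for every position where the characters differ.
        for c1, c2 in zip(s1, s2):
            if c1 != c2:
                k += 1
        if k > k_max:
            return 0.0
        return 1.0 - k / (k_max + 1.0)

For example, key_diff_compare('peter', 'petra') gives 1.0 − 2/3 ≈ 0.33.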

Page 4: Single Error Analysis of String Comparison Methods

– Bigram

Bigrams are two-character sub-strings (i.e. n-grams of length 2) [9, 16] contained in a string. For example, 'peter' contains the bigrams 'pe', 'et', 'te', and 'er'. Assuming bigram set(s) returns the set of bigrams in string s, the comparison method counts the number of common bigrams and divides it by the average number of bigrams to calculate the similarity measure as

Bigram(s1, s2) = 2 · len(bigram set(s1) ∩ bigram set(s2)) / (len(bigram set(s1)) + len(bigram set(s2)))
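A short sketch of this similarity measure:

    def bigram_compare(s1, s2):
        # Sets of two-character sub-strings contained in each string.
        bigrams1 = set(s1[i:i + 2] for i in range(len(s1) - 1))
        bigrams2 = set(s2[i:i + 2] for i in range(len(s2) - 1))
        if not bigrams1 or not bigrams2:
            return 0.0
        common = len(bigrams1 & bigrams2)
        # Twice the number of common bigrams over the total number of bigrams.
        return 2.0 * common / (len(bigrams1) + len(bigrams2))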

– Edit-Distance

Also known as Levenshtein distance [11], the edit distance is defined to be the smallest number of insertions, deletions, and substitutions required to change one string into another. Using a dynamic programming algorithm [8], it is possible to calculate this distance in O(len(s1) · len(s2)). Distances are then mapped into a similarity measure between 0.0 and 1.0 using

EditDist(s1, s2) = (max(len(s1), len(s2)) − edit dist(s1, s2)) / max(len(s1), len(s2))

with edit dist() being the actual edit distance function. Many variations of the original edit distance method have been proposed [8, 11], but we only consider the standard algorithm where all edits have the same costs.
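A sketch of the standard unit-cost algorithm and the mapping into a similarity:

    def edit_dist_compare(s1, s2):
        # Standard dynamic programming Levenshtein distance with unit costs.
        n1, n2 = len(s1), len(s2)
        if max(n1, n2) == 0:
            return 1.0
        d = [[0] * (n2 + 1) for _ in range(n1 + 1)]
        for i in range(n1 + 1):
            d[i][0] = i
        for j in range(n2 + 1):
            d[0][j] = j
        for i in range(1, n1 + 1):
            for j in range(1, n2 + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        # Map the distance into a similarity between 0.0 and 1.0.
        return (max(n1, n2) - d[n1][n2]) / float(max(n1, n2))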

– Soundex

Soundex [7, 10] is a phonetic encoding algorithm based on English language pronunciation. It keeps the first letter in a string and converts all other letters into numbers according to the following rules.

aehiouwy → 0    bfpv → 1    cgjkqsxz → 2
dt → 3    l → 4    mn → 5    r → 6

It then removes all zeros and replaces duplicates of the same number with one number only (e.g. '333' is replaced with '3'). If the final code is less than four characters long it is filled up with zeros. As an example, the Soundex code of 'peter' is 'p360', for 'christen' it is 'c623'. The Soundex string comparison method first encodes both strings and then compares the codes using exact comparison, returning 1.0 if they are the same and 0.0 if they differ.
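A simplified sketch of the encoding as described above (the handling of a second letter that maps to the same digit as the first letter varies between Soundex implementations and is glossed over here):

    SOUNDEX_CODES = {c: d for chars, d in [
        ('aehiouwy', '0'), ('bfpv', '1'), ('cgjkqsxz', '2'),
        ('dt', '3'), ('l', '4'), ('mn', '5'), ('r', '6')] for c in chars}

    def soundex(s):
        s = s.lower()
        # Encode every letter after the first one as a digit.
        digits = [SOUNDEX_CODES.get(c, '0') for c in s[1:]]
        # Remove zeros, then collapse runs of the same digit.
        digits = [d for d in digits if d != '0']
        collapsed = []
        for d in digits:
            if not collapsed or d != collapsed[-1]:
                collapsed.append(d)
        # Keep the first letter and pad the code with zeros to length 4.
        return (s[0] + ''.join(collapsed) + '000')[:4]

    def soundex_compare(s1, s2):
        # Exact comparison of the two phonetic codes.
        return 1.0 if soundex(s1) == soundex(s2) else 0.0

With this sketch, soundex('peter') gives 'p360' and soundex('christen') gives 'c623', as in the examples above.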

– Phonex

Phonex [10] is a variation of Soundex which tries to improve the encoding quality by pre-processing names (according to their English pronunciation) before the encoding. For example, leading letter pairs 'kn' are replaced with 'n', 'wr' with 'r', and 'ph' with 'f'. Similar to Soundex, the code consists of a leading letter followed by numbers. Exact comparison is then applied on the Phonex encodings.

– NYSIIS

The New York State Identification Intelligence System (NYSIIS) encoding returns a code that contains only characters, and is based on rules similar to the Soundex encoding. English words sounding similar will be given similar codes. Exact string comparison is performed between the NYSIIS codes.


– Double-Metaphone

The more recently developed Double-Metaphone [13] algorithm attempts to better account for non-English words, like European and Asian names. Similar to NYSIIS, it returns a code only consisting of letters. In general, Double-Metaphone seems to be closer to the correct pronunciation of names than NYSIIS. Again, exact string comparison is performed between the two phonetic codes.

– Jaro

The Jaro [15, 17] string comparator is commonly used in data (or record) linkage systems. It accounts for insertions, deletions and transpositions of letters. It calculates the lengths of both strings, the number of characters in common (the definition of common is that the agreeing character must be within half the length of the shorter string), and the number of transpositions. The approximate similarity measure is then calculated as [17]

Jaro(s1, s2) = 1/3 · (common / len(s1) + common / len(s2) + (common − transpositions) / common)
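A sketch of the Jaro comparator following the description above (implementations differ in details such as the exact size of the matching window and how transpositions are counted; this is one common variant, not necessarily the exact Febrl code):

    def jaro_compare(s1, s2):
        if not s1 or not s2:
            return 0.0
        # Agreeing characters count as common only if they are within
        # half the length of the shorter string of each other.
        window = min(len(s1), len(s2)) // 2
        flags1 = [False] * len(s1)
        flags2 = [False] * len(s2)
        common = 0
        for i, c in enumerate(s1):
            for j in range(max(0, i - window), min(len(s2), i + window + 1)):
                if not flags2[j] and s2[j] == c:
                    flags1[i] = flags2[j] = True
                    common += 1
                    break
        if common == 0:
            return 0.0
        # Transpositions: common characters appearing in a different order.
        c1 = [c for i, c in enumerate(s1) if flags1[i]]
        c2 = [c for j, c in enumerate(s2) if flags2[j]]
        transpositions = sum(a != b for a, b in zip(c1, c2)) / 2.0
        return (common / float(len(s1)) + common / float(len(s2))
                + (common - transpositions) / float(common)) / 3.0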

– Winkler

The Winkler [15, 17] comparison method is an improvement over the Jaro method, again based on ideas from a large empirical study [14] which found that the fewest errors typically occur at the beginning of names. The Winkler comparator therefore increases the comparison value for agreeing initial characters (up to four). Based on the Jaro comparator, the approximate similarity measure is calculated as [17]

Winkler(s1, s2) = Jaro(s1, s2) + same / 10 · (1.0 − Jaro(s1, s2))

with same being the number of agreeing characters at the beginning of the two strings. For example, 'peter' and 'petra' have a same value of 3.
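A sketch of the Winkler adjustment on top of the Jaro value (using the jaro_compare sketch above):

    def winkler_compare(s1, s2):
        jaro = jaro_compare(s1, s2)
        # Count agreeing characters at the beginning, up to four.
        same = 0
        for c1, c2 in zip(s1[:4], s2[:4]):
            if c1 != c2:
                break
            same += 1
        return jaro + same / 10.0 * (1.0 - jaro)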

As we will see in Section 3, swapped words are the hardest type of single error to classify correctly for all of the above string comparison methods. We have therefore combined one of the best performing methods – Winkler – with two methods for dealing with multi-word strings in a hierarchical way, similar to [3].

– Sorted-Winkler

If a string contains more than one word (i.e. it contains at least one whitespace), then the words are first sorted before a standard Winkler comparison is applied. The idea is that – unless there are errors in the first few letters of a word – sorting of swapped words will bring them into the same order, thereby improving the similarity value.

– Permuted-Winkler

This is a more complex approach where Winkler comparisons are performed over all possible permutations of words, and the maximum of all the comparison values is returned.
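Minimal sketches of the two new comparators, built on the winkler_compare sketch above (whether words of one or of both strings are permuted is an implementation detail not specified here; permuting one string already covers all relative word orderings):

    from itertools import permutations

    def sorted_winkler(s1, s2):
        # Sort the words of each string before the Winkler comparison.
        sort_words = lambda s: ' '.join(sorted(s.split()))
        return winkler_compare(sort_words(s1), sort_words(s2))

    def permuted_winkler(s1, s2):
        # Compare against all word permutations and keep the maximum value.
        return max(winkler_compare(s1, ' '.join(p))
                   for p in permutations(s2.split()))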

In order to analyse the performance of all the presented string comparison methods regarding different types of single errors, we generated a data set with well defined error characteristics as discussed in the following section.


Table 1. Example locality name and its single error duplicates.

Modification           Identifier      Value

Original               rec-280-org     south west rocks
Insert character       rec-280-dup-1   south west rocbks
Delete character       rec-280-dup-2   south wst rocks
Substitute character   rec-280-dup-3   south west rvcks
Transpose characters   rec-280-dup-4   south west rcoks
Insert whitespace      rec-280-dup-5   south we st rocks
Delete whitespace      rec-280-dup-6   southwest rocks
Swap words             rec-280-dup-7   west south rocks

2.2 Data Set Generation

We used the data generator implemented as part of the Febrl [2] data linkage system to create a data set containing 1,000 different values based on Australian locality names (i.e. cities, towns and suburbs). The values were randomly selected from a publicly available telephone directory. Their lengths varied from 3 to 20 characters. They consisted of up to three words, and besides letters and whitespaces they also contained hyphens and apostrophes.

For each of the original values we then created corresponding duplicates by introducing the following single errors: (1) deleting a character, (2) inserting an extra character, (3) substituting a character with another character, (4) transposing two adjacent characters, (5) inserting a whitespace (splitting a word), (6) removing a whitespace (merging two words), and (7) swapping two words. Obviously the last two errors were only created when the original value contained at least two words. Each original and duplicate value was given a unique identifier as shown in Table 1.

In order to simulate real typographical error characteristics, the position where an error was introduced was calculated using a Gaussian distribution, with the mean being one position behind half of the string length. Errors are therefore less likely introduced at the beginning or the end of a string, which is based on empirical studies [9, 14]. Letter substitution is based on the idea of randomly choosing a letter which is a keyboard neighbour (same row or column) [6].
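As an illustration, a sketch of how an error position could be drawn under this scheme (the standard deviation is our assumption; the exact parameters used by the Febrl data generator are not reproduced here):

    import random

    def error_position(s, sigma=2.0):
        # Gaussian around one position behind half of the string length,
        # so errors near the start or end of the string are less likely.
        mean = len(s) / 2.0 + 1.0
        pos = int(round(random.gauss(mean, sigma)))
        # Clamp to a valid character position.
        return max(0, min(len(s) - 1, pos))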

The final size of the data set was 6,684 values (1,000 originals and 5,684 duplicates). We then split this data set into ten files, each containing 100 original values plus their duplicates.¹ For all string pairs in a file we then calculated the similarity measures for the above presented twelve string comparison methods. Each of the resulting ten files contained around 220,000 12-dimensional vectors of similarity measures.

¹ This blocking [2] reduced the number of comparisons made by a factor of ten, from around 22 million to 2 million, resulting in files of manageable size.


Table 2. Correlation coefficients for true duplicate comparisons only (top) and all comparisons (bottom).

True duplicate comparisons only:

      KeyD  Bigr  Edit  Sndx  Phox  NYS   DMet  Jaro  Wink  PWink SWink
Trun  .11   .37   .45   .29   .23   .59   .29   .43   .48   .29   .02
KeyD        -.25  .10   -.09  -.08  -.02  -.10  -.10  -.00  -.47  .04
Bigr              .22   .12   .15   .34   .14   .11   .09   .48   .03
Edit                    .36   .34   .49   .32   .73   .79   -.17  -.22
Sndx                          .63   .54   .82   .42   .41   .15   -.16
Phox                                .48   .54   .34   .34   .08   -.14
NYS                                       .51   .45   .48   .22   -.14
DMet                                            .39   .38   .17   -.14
Jaro                                                  .98   .22   -.08
Wink                                                        .13   -.10
PWink                                                             .20

All comparisons:

      KeyD  Bigr  Edit  Sndx  Phox  NYS   DMet  Jaro  Wink  PWink SWink
Trun  .38   .50   .43   .65   .56   .75   .63   .26   .28   .26   .24
KeyD        .28   .25   .33   .30   .32   .32   .15   .15   .14   .14
Bigr              .60   .45   .41   .47   .44   .46   .46   .49   .45
Edit                    .41   .37   .42   .39   .63   .64   .59   .56
Sndx                          .74   .66   .86   .26   .26   .25   .23
Phox                                .59   .67   .23   .24   .23   .21
NYS                                       .63   .26   .28   .26   .23
DMet                                            .24   .25   .23   .22
Jaro                                                  .99   .88   .86
Wink                                                        .88   .86
PWink                                                             .88

3 Discussion

In this section we discuss the experimental results achieved using the comparison methods presented in Section 2.1. Note that the class distribution in the result files was very skewed, with 5,632 true duplicate and 2,190,405 non-duplicate similarity measure vectors, resulting in a ratio of 1 to 389.

3.1 Correlation Analysis

We first performed a correlation analysis between all presented string comparison methods, and the results are shown in Table 2. The highest correlation occurs between the Jaro and Winkler methods, as can be expected because Winkler is based on Jaro and only increases the similarity measure slightly in cases where two strings have agreeing characters at the beginning. Surprising, though, are the low correlation results, some even negative, between many of the comparison methods for true duplicates. Even for similar methods, for example Soundex, Phonex, NYSIIS and Double-Metaphone, fairly low correlation values occur.


Fig. 1. Scatterplots for selected comparison method pairs ('Jaro' versus 'Winkler' and 'Edit-dist' versus 'Bigram', with similarity values from 0 to 1 on both axes, and true duplicates and false comparisons marked separately). The top row shows all error types, the bottom row word swappings only.

For more detailed results we plotted the actual similarity values of selected comparison methods as scatterplots, as shown in Figure 1. As can be seen, due to the large number of non-duplicate comparisons, none of the methods is capable of efficiently separating true and false comparisons. Surprisingly, Edit-Distance and to a lesser extent Bigram have some true duplicates with very low similarity values (some even 0.0), while both Jaro and Winkler seem to be fairly consistent in only assigning higher (i.e. over 0.5) similarity values to true duplicates. For Jaro and Winkler, one can clearly see the increased similarity values of the Winkler method, which is due to 1, 2, 3 or 4 agreeing characters at the beginning. Interesting is the performance of the comparison methods with regard to the word swapping errors (bottom row of Figure 1). One can see that most of the true duplicates with low similarity values are such word swappings. For Edit-Distance, most true duplicate word swapping similarity values are below 0.6. Only the Bigram comparison method returns high values (i.e. larger than 0.6) in this case. We can therefore conclude that swapped words are the hardest type of error to classify correctly for these comparison methods.

The detailed results for the new Sorted- and Permuted-Winkler comparison methods are shown in Figure 2. Both new string comparison methods return a similarity measure of 1.0 for swapped word errors, with the Permuted-Winkler method always returning a larger value than Sorted-Winkler for all other error types.


Fig. 2. Scatterplots for Sorted- and Permuted-Winkler ('Sort-winkler' on the horizontal axis, 'Perm-winkler' on the vertical axis, with similarity values from 0 to 1). The left plot shows all error types, the right plot word swappings only.

3.2 Precision and Recall Results

For comparison methods that return a similarity value between 0.0 and 1.0, a threshold needs to be defined, with similarity values above it being classified as duplicates and values below as non-duplicates. For all other comparison methods a similarity value of 1.0 is classified as duplicate and 0.0 as non-duplicate.

Due to the skewed class distribution in our study the accuracy measure would not show significant differences in the comparison methods' performances. Instead we use precision and recall, which are commonly used metrics in information retrieval [7, 12, 19]. Precision is measured as the number of true positives (true duplicates classified as duplicates) divided by the sum of true and false positives (i.e. all instances classified as duplicates), while recall (the true positive rate) is measured as the number of true positives divided by the total number of positives. In Figure 3 and Table 3 we present the f-score (or f-measure), the harmonic mean of precision and recall, which is calculated as f-score = 2 · (precision · recall) / (precision + recall).
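A small sketch of these measures (counts of true positives, false positives and false negatives are assumed to have been collected for a given threshold):

    def f_score(true_pos, false_pos, false_neg):
        if true_pos == 0:
            return 0.0
        # Precision: correctly classified duplicates over all instances
        # classified as duplicates.
        precision = true_pos / float(true_pos + false_pos)
        # Recall (true positive rate): correctly classified duplicates
        # over all true duplicates.
        recall = true_pos / float(true_pos + false_neg)
        # Harmonic mean of precision and recall.
        return 2.0 * precision * recall / (precision + recall)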

The results in Figure 3 for the different types of single errors introduced show that there are large differences in the performance of the different methods according to the values of the threshold. Key-Difference only performs well for substitutions and transpositions at low threshold values. Both Edit-Distance and Bigram perform well for higher thresholds on most error types, except for transpositions, where both methods become worse for thresholds over 0.8. Winkler seems to consistently perform well for higher thresholds. For word swapping errors, only Bigram and Winkler, as well as the new Sorted- and Permuted-Winkler comparison methods, have significant f-scores. Table 3 shows that phonetic encoding methods do not perform well for any error type, but are especially bad on word swappings.

As the results show, the same threshold leads to different performances for the various comparison methods, and the single error characteristics of a data set influence the performance of the different string comparison methods.


Fig. 3. F-score results against the classification threshold for all single error types (insertions, deletions, substitutions, transpositions, space insertions, space deletions, and word swappings), comparing the Winkler, Bigram, Edit-dist and Key-diff-2 methods. The bottom right plot shows the new Sorted- and Permuted-Winkler comparison methods (together with Winkler and Jaro) for word swappings.


Table 3. F-Score results.

Method        Insert  Delete  Subst.  Transp.  Space ins.  Space del.  Word swaps

Truncate      .33     .31     .30     .28      .29         .26         .02
Soundex       .30     .31     .27     .33      .38         .30         .01
Phonex        .28     .27     .25     .28      .33         .30         .01
NYSIIS        .26     .24     .24     .23      .29         .24         .01
D-Metaphone   .31     .33     .27     .34      .39         .32         .01

4 Related Work

A large number of approximate string comparison methods have been developed [1, 3–5, 10, 11, 14–16, 18, 19], and various studies on their performance, as well as their computational complexity, have been carried out [4, 7–12, 14, 19]. Most of these studies were done with real world data sets taken from a specific application area, and to our knowledge only [10] to a certain degree analyses various comparison methods (mainly based on phonetic encodings using English names) regarding their single error type performance. Based on their findings they propose the Phonex method, which we included in our study.

Recent publications on using approximate string comparators in the area of deduplication, data matching and data linkage include [1, 3, 18]. They mainly focus on different approaches to learn the parameters (e.g. edit distance costs) of different string comparison methods for improved deduplication and matching classification. None of them go into the details of analysing the comparison methods as we did in our study.

5 Conclusion and Future Work

In this paper we have presented a study of single error characteristics on different approximate string comparison methods, which are at the core of many applications dealing with the processing, comparison or extraction of strings. We concentrated on locality name strings, and found that the various comparison methods perform differently and are often sensitive to a chosen threshold. We also found that swapped words are the hardest type of error to classify correctly, and we proposed two new comparison methods that deal with this problem.

While we only used artificial data with well defined introduced single errors, we are planning to use real world data sets to conduct further studies and to verify our results. We also plan to extend our study by including other approximate string comparison methods.

Acknowledgements

This work is supported by an Australian Research Council (ARC) Linkage Grant LP0453463. The author would like to thank Lifang Gu, Rohan Baxter and Paul Thomas for their valuable comments and suggestions.


References

1. Bilenko, M. and Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 39–48, August 2003.

2. Christen, P., Churches, T. and Hegland, M.: A Parallel Open Source Data Linkage System. Proceedings of the 8th PAKDD'04 (Pacific-Asia Conference on Knowledge Discovery and Data Mining), Sydney. Springer LNAI-3056, pp. 638–647, May 2004.

3. Cohen, W.W., Ravikumar, P. and Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 73–78, Acapulco, August 2003.

4. Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM, vol. 7, issue 3, pp. 171–176, March 1964.

5. Hall, P.A.V. and Dowling, G.R.: Approximate String Matching. ACM Computing Surveys, vol. 12, no. 4, pp. 381–402, December 1980.

6. Hernandez, M.A. and Stolfo, S.J.: Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9–37, January 1998.

7. Holmes, D. and McCabe, M.: Improving Precision and Recall for Soundex Retrieval. Proceedings of the 2002 IEEE International Conference on Information Technology – Coding and Computing (ITCC), Las Vegas, April 2002.

8. Jokinen, P., Tarhio, J. and Ukkonen, E.: A comparison of approximate string matching algorithms. Software – Practice and Experience, vol. 26, no. 12, pp. 1439–1458, 1996.

9. Kukich, K.: Techniques for automatically correcting words in text. ACM Computing Surveys, vol. 24, no. 4, pp. 377–439, December 1992.

10. Lait, A.J. and Randell, B.: An Assessment of Name Matching Algorithms. Technical Report, Dept. of Comp. Science, University of Newcastle upon Tyne, 1993.

11. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys, vol. 33, issue 1, pp. 31–88, March 2001.

12. Pfeifer, U., Poersch, T. and Fuhr, N.: Retrieval effectiveness of proper name search methods. Information Processing and Management: an International Journal, vol. 32, no. 6, pp. 667–679, November 1996.

13. Philips, L.: The Double-Metaphone Search Algorithm. C/C++ User's Journal, vol. 18, no. 6, June 2000.

14. Pollock, J.J. and Zamora, A.: Automatic spelling correction in scientific and scholarly text. Communications of the ACM, vol. 27, no. 4, pp. 358–368, 1984.

15. Porter, E. and Winkler, W.E.: Approximate String Comparison and its Effect on an Advanced Record Linkage System. RR 1997-02, US Bureau of the Census, 1997.

16. Van Berkel, B. and De Smedt, K.: Triphone analysis: A combined method for the correction of orthographical and typographical errors. Proceedings of the second conference on Applied natural language processing, pp. 77–83, Austin, 1988.

17. Winkler, W.E. and Thibaudeau, Y.: An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census. RR-1991-09, US Bureau of the Census, 1991.

18. Yancey, W.E.: An Adaptive String Comparator for Record Linkage. RR 2004-02, US Bureau of the Census, February 2004.

19. Zobel, J. and Dart, P.: Phonetic string matching: Lessons from information retrieval. Proceedings of the 19th international ACM SIGIR conference on Research and development in information retrieval, pp. 166–172, Zurich, Switzerland, 1996.