
Techniques for Automatically Correcting Words in Text

KAREN KUKICH

Bellcore, 445 South Street, Morristown, NJ 07962-1910

Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. In response to the first problem, efficient pattern-matching and n-gram analysis techniques have been developed for detecting strings that do not appear in a given word list. In response to the second problem, a variety of general and application-specific spelling correction techniques have been developed. Some of them were based on detailed studies of spelling error patterns. In response to the third problem, a few experiments using natural-language-processing tools or statistical-language models have been carried out. This article surveys documented findings on spelling error patterns, provides descriptions of various nonword detection and isolated-word error correction techniques, reviews the state of the art of context-dependent word correction techniques, and discusses research issues related to all three areas of automatic error correction in text.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning—connectionism and neural nets; I.2.7 [Artificial Intelligence]: Natural Language Processing—language models; language parsing and understanding; text analysis; I.5.1 [Pattern Recognition]: Models—neural nets; statistical; I.5.4 [Pattern Recognition]: Applications—text processing; I.7.1 [Text Processing]: Text Editing—spelling

General Terms: Experimentation, Human Factors, Performance

Additional Key Words and Phrases: Context-dependent spelling correction, grammar

checking, natural-language-processing models, neural net classifiers, n-gram analysis,

Optical Character Recognition (OCR), spell checking, spelling error detection, spelling

error patterns, statistical-language models, word recognition and correction

INTRODUCTION

The problem of devising algorithms and techniques for automatically correcting words in text has become a perennial research challenge. Work began as early as the 1960s on computer techniques for automatic spelling correction and automatic text recognition, and it has continued up to the present. There are good reasons for the continuing efforts in this area. Although some excellent academic and commercial spelling checkers have been around for some time, existing spelling correction techniques are limited in their scope and accuracy. As a consequence, many current computer applications are still vulnerable to costly text, code, and data entry mistakes. For example, the disruptive Bell Atlantic and Pacific Bell telephone network outages that occurred during the summer of 1991 were due in part to a typographical error in a software patch. Similarly, although some good commercial text recognition devices are available today, they perform optimally only under ideal conditions in which input consists of clean text set in a

Permission to copy without fee all or part of this material is granted provided that the copies are not made

or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication

and its date appear, and notice is given that copying is by permission of the Association for Computing

Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1992 ACM 0360-0300/92/1200-0377 $01.50

ACM Computing Surveys, Vol. 24, No. 4, December 1992


CONTENTS

INTRODUCTION
1. NONWORD ERROR DETECTION RESEARCH
   1.1 N-gram Analysis Techniques
   1.2 Dictionary Lookup Techniques
   1.3 Dictionary Construction Issues
   1.4 The Word Boundary Problem
   1.5 Summary of Nonword Error Detection Work
2. ISOLATED-WORD ERROR CORRECTION RESEARCH
   2.1 Spelling Error Patterns
   2.2 Techniques for Isolated-Word Error Correction
3. CONTEXT-DEPENDENT WORD CORRECTION RESEARCH
   3.1 Real-Word Errors: Frequency and Classification
   3.2 NLP Prototypes for Handling Ill-Formed Input
   3.3 Statistically Based Error Detection and Correction Experiments
   3.4 Summary of Context-Dependent Word Correction Work
FUTURE DIRECTIONS
ACKNOWLEDGMENTS
REFERENCES

standard type font. Furthermore, even a character recognition accuracy rate as high as 99% yields only a 95% word recognition accuracy rate, because one error per 100 characters equates to roughly one error per 20 words, assuming five-character words. One study [Cushman 1990] found that in order for optically scanned documents to contain no more residual errors than typed documents, at least 98% character recognition accuracy coupled with computer-assisted proofreading would be required.
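As a rough sanity check on those figures (an illustrative back-of-the-envelope calculation, not one taken from the cited study), assume character errors are independent; then the probability that an n-character word is recognized with no errors is approximately the character accuracy raised to the nth power:

\[ P(\text{word correct}) \approx p_{\mathrm{char}}^{\,n} = 0.99^{5} \approx 0.951, \]

so 99% character accuracy leaves roughly one erroneous word in every twenty five-character words.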

Evolving human-computer and computer-communications technologies have opened the door for a host of new applications that will require better word recognition and error correction capabilities. Pen-based interfaces will enable users to provide handwritten input to computers; text recognition devices will make scanning of printed material feasible under everyday conditions; voice synthesis (text-to-speech) devices will make textual material audible; and voice recognition (speech-to-text) technology will eventually allow voice input to computer systems. But none of these applications will become practical until significant improvements are made in the area of word recognition and correction. Some of the other applications that will benefit from such improvements include more sophisticated software tools for text and code editing, computer-aided authoring, machine translation, language learning, computer-aided tutoring, and database interaction, as well as various voice input and voice output business applications and aids for the disabled such as fax-to-voice devices and phonetic-transcription services.

Early on, researchers working within the paradigms of automatic spelling correction and automatic text recognition proceeded somewhat independently using different techniques. Over time the various techniques began to migrate between the fields so that today numerous hybrid approaches and many reasonably successful systems exist. But those who would consider spelling correction a solved problem may be missing some subtle and some not so subtle points about the nature of the problem and the scope of existing techniques.

A distinction must be made between the tasks of error detection and error correction. Efficient techniques have been devised for detecting strings that do not appear in a given word list, dictionary, or lexicon.1 But correcting a misspelled string is a much harder problem. Not only is the task of locating and ranking candidate words a challenge, but as Bentley [1985] points out: given the morphological productivity of the English language (e.g., almost any noun can be verbified) and the rate at which words enter and leave the lexicon (e.g., catwomanhood, balkanization), some even question the wisdom of attempts at automatic correction.

1 The terms "word list," "dictionary," and "lexicon" are used interchangeably in the literature. We prefer the use of the term lexicon because its connotation of "a list of words relevant to a particular subject, field, or class" seems best suited to spelling correction applications, but we adopt the terms "dictionary" and "word list" to describe research in which other authors have used them exclusively.

Many existing spelling correctors exploit task-specific constraints. For example, interactive command line spelling correctors exploit the small size of a command language lexicon to achieve quick response times. Alternatively, longer response times are tolerated for noninteractive-mode manuscript preparation applications. Both of the foregoing spelling correction applications tolerate lower first-guess accuracy by returning multiple guesses and allowing the user to make the final choice of intended word. In contrast, some future applications, such as text-to-speech synthesis, will require a system to perform fully automatic, real-time word recognition and error correction for vocabularies of many thousands of words and names. The contrast between the first two examples and this last one highlights the distinction between interactive spelling checkers and automatic correction. The latter task is much more demanding, and it is not clear how far existing spelling correction techniques can go toward fully automatic word correction.

Most existing spelling correction techniques focus on isolated words, without taking into account any information that might be gleaned from the linguistic or textual context in which the string appears. Such isolated-word correction techniques are unable to detect the significant portion of errors, including typographic, phonetic, cognitive, and grammatical errors, that result in other valid words. Some examples are the use of form where from was intended, the misuse of the words there, their, and they're, and the occurrence of the word minuets in the phrase see you in five minuets. It is clear that contextual information is necessary for the detection and correction of such errors. A context-based correction technique would not only address the problem of real-word errors, i.e., errors that result in another valid word, but it would also be helpful in correcting those nonword errors that have more than one potential correction. An example might be the string ater. Without context there is little reason, apart from a priori frequencies of words, to prefer after, later, alter, water, or ate, among others, as the intended correction. Developing context-based correction techniques has become the foremost challenge for automatic word recognition and error correction in text.

For descriptive purposes then, automatic word correction research may be viewed as focusing on three increasingly broader problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. Work on the first problem spanned a period from the early 1970s into the early 1980s. During that time, effort was directed mainly toward exploring efficient pattern-matching and string comparison techniques for deciding whether an input string appears in a predefined word list or dictionary. Work on the second problem spanned a broader time frame, from as early as the 1960s into the present. During that time various general and special-purpose correction techniques were devised, some in conjunction with studies of spelling error patterns. Experimentation on the third problem began in the early 1980s with the development of automatic natural-language-processing models, and interest has recently been rekindled with the development of statistical language models.

The Unix™ spell program [McIlroy 1982] is an example of an effective and efficient program for spelling error detection that typifies work on the first problem. Spell takes a whole document as input, looks up each string in a 25,000-word dictionary tailored to include terms found mainly in the domain of technical writing, and returns a list of all strings that were not found in the dictionary. It does not make any attempt to correct the strings that it believes are misspelled. That task is left to the user, who might


make use of an isolated-word spelling correction tool, such as grope [Taylor 1981]. Grope, another Unix tool, is one of many existing isolated-word spelling correctors that were developed to address the second problem. Grope acts as an interactive decision aid by accepting a string as input and providing the user with an ordered list of the most similar strings found in its dictionary as output, allowing the user to choose among them. The Unix Writer's Workbench package [Cherry 1983] represents one of the first efforts to address the third problem, that of detecting and correcting real-word errors in text. Its style and diction tools flag common grammatical and stylistic errors and suggest possible corrections. Although some commercial grammar-checking programs are available, no general-purpose context-dependent word correction tool yet exists.

™ Unix is a registered trademark of UNIX System Laboratories.

This article presents a review of the techniques and issues related to automatic error correction in text. It is divided into three sections corresponding to the three problems identified above. Section 1 reviews nonword error detection work and includes a discussion of issues related to dictionary size and coverage and word boundary infractions. Section 2 reviews findings on spelling error patterns and provides descriptions of many of the techniques that have been used for isolated-word error correction. It notes the application domains for which the techniques were devised and highlights application-specific constraints that were exploited by some techniques. Section 3 reviews preliminary work in the area of context-dependent error correction in text. It includes a brief history of early work based on natural language processing as well as recent experiments based on statistical language modeling. Possible future directions for the field are hypothesized in an epilogue.

1. NONWORD ERROR DETECTION RESEARCH

The two main techniques that have been explored for nonword error detection are n-gram analysis and dictionary lookup. N-grams are n-letter subsequences of words or strings, where n is usually one, two, or three. One-letter n-grams are referred to as unigrams or monograms; two-letter n-grams are referred to as digrams or bigrams; and three-letter n-grams as trigrams. In general, n-gram error detection techniques work by examining each n-gram in an input string and looking it up in a precompiled table of n-gram statistics to ascertain either its existence or its frequency. Strings that are found to contain nonexistent or highly infrequent n-grams (such as the trigram shj or lQN) are identified as probable misspellings. N-gram techniques usually require either a dictionary or a large corpus of text in order to precompile an n-gram table. Dictionary lookup techniques work by simply checking to see if an input string appears in a dictionary, i.e., a list of acceptable words. If not, the string is flagged as a misspelled word. There are subtle problems involved in the compilation of a useful dictionary for a spelling correction application.
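To make the n-gram detection idea concrete, the following minimal Python sketch flags any string containing a letter trigram that never occurs in a tiny, purely illustrative lexicon; the lexicon, and the use of simple trigram existence rather than frequency thresholds, are assumptions for demonstration only.

```python
# Illustrative sketch of nonword detection via a precompiled trigram table.
# The tiny lexicon and the "unseen trigram" criterion are assumptions; real
# systems compile tables from large dictionaries or corpora and may use
# frequency thresholds rather than simple existence.

def trigrams(word):
    """Return the letter trigrams of a word (no padding)."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def build_trigram_table(lexicon):
    """Precompile the set of trigrams occurring in any lexicon word."""
    table = set()
    for word in lexicon:
        table.update(trigrams(word))
    return table

def looks_misspelled(token, table):
    """Flag a token if it contains a trigram never seen in the lexicon."""
    return any(tri not in table for tri in trigrams(token))

lexicon = ["spelling", "correction", "detection", "dictionary"]
table = build_trigram_table(lexicon)
print(looks_misspelled("correction", table))  # False: every trigram is known
print(looks_misspelled("corection", table))   # True: 'ore' never occurs above
```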

Historically, text recognition systems have tended to rely on n-gram techniques for error detection while spelling checkers have tended to rely on dictionary lookup techniques. In both cases, problems arise when errors cross word boundaries resulting in run-on or split words. Issues related to each of these techniques and their accompanying problems are discussed next.

1.1 N-gram Analysis Techniques

Text recognition systems usually focus on one of three modes of text: hand-printed text, handwritten text (sometimes referred to as cursive script), or machine-printed text. All three modes may be processed by optical character recognition (OCR) devices. OCR devices typically rely on feature analysis to recognize individual characters within words. Examples of features might include counts of the number of vertical, horizontal, curved, and crossing lines in


a character. Errors made by OCR devices tend to be those that confuse characters with similar features, such as O and D, S and 5, or t and f. N-gram analysis has proven useful for detecting such errors because they tend to result in improbable n-grams. Following up on early work by Sitar [1961], Harmon [1972] reported that "(1) in large samples of common English publication text, 42% of all possible digram combinations never occur and (2) random substitution of one letter in a word by another letter which is selected with equal likelihood from the other 25 of the alphabet will produce at least one new digram which has zero probability 70% of the time" (p. 1172). These facts have been exploited for detecting errors in the output of optical-character recognizers.2

N-gram tables can take on a variety of forms. The simplest is a binary bigram array, which is a two-dimensional array of size 26 × 26 whose elements represent all possible two-letter combinations of the alphabet. The value of each element in the array is set to either 0 or 1 depending on whether that bigram occurs in at least one word in a predefined lexicon or dictionary. A binary trigram array would have three dimensions. Both of the above arrays are referred to as nonpositional binary n-gram arrays because they do not indicate the position of the n-gram within a word.

More of the structure of the lexicon can be captured by a set of positional binary n-gram arrays. For example, in a positional binary trigram array, the i, j, kth element would have the value 1 if and only if there exists at least one word in the lexicon with the letters l, m, and n in positions i, j, and k. The trade-off for representing more of the structure of the lexicon is the increase in storage space required for the complete set of positional arrays. Any word can be checked for errors by simply looking up its corresponding entries in binary n-gram arrays to make sure they are all 1s.

2 Although text recognition researchers often refer to techniques that use n-gram statistics as context-dependent techniques, this use of the term "context" refers only to within-word context. In contrast, the techniques that are discussed in Section 3 of this article exploit extraword context for correcting real-word errors.
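A minimal sketch of the positional binary trigram idea follows, with sets of (position, trigram) pairs standing in for the binary arrays; the toy lexicon of six-letter words is an assumption for demonstration.

```python
# Sketch of positional binary trigram checking: sets of (position, trigram)
# pairs stand in for the binary arrays described above.

def positional_trigrams(word):
    """Return (position, trigram) pairs for a word."""
    return [(i, word[i:i + 3]) for i in range(len(word) - 2)]

def build_positional_table(lexicon):
    """Mark each (position, trigram) occurring in at least one lexicon word."""
    table = set()
    for word in lexicon:
        table.update(positional_trigrams(word))
    return table

def is_plausible(word, table):
    """A word passes only if every one of its positional trigrams is marked."""
    return all(entry in table for entry in positional_trigrams(word))

lexicon = ["botany", "filter", "garden", "hammer"]
table = build_positional_table(lexicon)
print(is_plausible("garden", table))  # True: all positional trigrams occur
print(is_plausible("gardon", table))  # False: e.g., (2, 'rdo') never occurs
```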

gram arrays to make sure they are all 1s.Two studies give some indication of

Two studies give some indication of how effective nonpositional and positional binary n-gram arrays are at detecting errors in OCR output. Hanson et al. [1976] studied the output of an optical scanner that had been fed a corpus of hand-printed six-letter words. The output contained 7,662 words each having a single substitution error. The best detection performance was achieved using positional binary trigram arrays. These detected 7,561, or 98%, of the errors. Nonpositional n-gram arrays scored substantially lower. Hull and Srihari [1982] found that positional binary trigrams representing a subdictionary of 5,000 seven-letter words detected over 98% of substitution errors. It is not clear how well these results generalize to shorter-length words and other types of errors such as insertions, deletions, and framing errors that alter the lengths of words. An example of a framing error is the substitution of the single letter m for the two-letter sequence ni or vice versa.

Another study was designed to evaluate the effectiveness of trigram frequency statistics for spell-checking applications. Zamora et al. [1981] compiled a trigram table containing frequency counts or probabilities instead of binary values. Such statistics must be compiled from a sufficiently large corpus of text (e.g., at least a million words) covering the domain of discourse. Then they analyzed a collection of 50,000 word/misspelling pairs from seven machine-readable databases to determine "whether there is sufficient difference between the trigram compositions of correct and misspelled words for the latter to be reliably detected" (p. 306). They found that although trigram analysis was able to determine the error site within a misspelled word accurately, it did not distinguish effectively between valid words and misspellings.

To address this problem, Morris and


Cherry [1975] devised an alternative technique for using trigram frequency statistics to detect errors. Rather than generating a trigram frequency table from a general corpus of text, they generated the table directly from the document to be checked. Their rationale for this approach was based on their finding that "for a given author, the number of distinct words [used in a document] increases approximately as the square root of the total number of words" (p. 54). Thus, misspellings might show up as words containing trigrams that were peculiar to the document itself.

To check a document for spelling errors, Morris and Cherry [1975] generate a trigram frequency table based on the document itself. Then, for each unique word in the document they compute an index of peculiarity as a function of the trigram frequencies of the word. Finally, they rank the words in decreasing order of peculiarity. They hypothesize that misspelled words will tend to appear near the top of the list. As an example of the technique's success, they point out (1) that it took only ten minutes for an author of a 108-page document to scan the output list and identify misspelled words and (2) that 23 of the 30 misspelled words in the document occurred in the top 100 words of the list.
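The following sketch is in the spirit of Morris and Cherry's document-derived approach: it compiles a trigram frequency table from the document itself and ranks unique words by how unusual their trigrams are. The scoring function (a negative mean log trigram frequency) is a stand-in assumption, not the exact index of peculiarity defined in their paper.

```python
# Sketch of document-derived trigram peculiarity ranking.  The scoring
# function below is an illustrative stand-in, not Morris and Cherry's
# published formula.

import math
from collections import Counter

def trigrams(word):
    return [word[i:i + 3] for i in range(len(word) - 2)]

def rank_by_peculiarity(document_words):
    """Rank unique words so that those built from rare trigrams come first."""
    freq = Counter(t for w in document_words for t in trigrams(w))

    def peculiarity(word):
        tris = trigrams(word) or [word]
        return -sum(math.log(freq.get(t, 0) + 1) for t in tris) / len(tris)

    return sorted(set(document_words), key=peculiarity, reverse=True)

document = ("the trigram frequencies are compiled from the document itself "
            "and each unique word is then scored against those frequncies").split()
print(rank_by_peculiarity(document)[:5])
```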

1.2 Dictionary Lookup Techniques

Dictionary lookup is a straightforward task. However, response time becomes a problem when dictionary size exceeds a few hundred words. In document processing and information retrieval, the number of dictionary entries can range from 25,000 to more than 250,000 words. This problem has been addressed in three ways, via efficient dictionary lookup and/or pattern-matching algorithms, via dictionary-partitioning schemes, and via morphological-processing techniques.

The most common technique for gaining fast access to a dictionary is the use of a hash table [Knuth 1973]. To look up an input string, one simply computes its hash address and retrieves the word stored at that address in the preconstructed hash table. Sometimes a link or two must be traversed if collisions occurred during construction of the hash table. If the word stored at the hash address is different from the input string or is null, a misspelling is indicated.

Turba [1981] provides a quick review of the pros and cons of using hash tables for dictionary lookup. The main advantage is that the random-access nature of a hash code eliminates the large number of comparisons needed for sequential or even tree-based searches of the dictionary. The main disadvantage is the need to devise a clever hash function that avoids collisions without requiring a huge hash table. Fox et al. [1992] recently described a technique for devising "minimal perfect hash functions whose specification space is very close to the theoretical lower bound" (p. 266). Some earlier systems circumvented the storage problem by not storing the character representation of the word itself but instead using a single bit to indicate that a given hash code maps into a valid word. This trick explains the occasional odd behavior of some spelling checkers in which an invalid string goes undetected because it happened to map into a hash address for a valid word. The Unix spell program is one example of a program that employs a hash table for fast dictionary lookup.
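The space-saving bit-per-hash-address trick just described can be sketched as follows; the hash function (CRC32), the deliberately tiny table, and the word list are illustrative assumptions, and the small table makes the false-acceptance behavior easy to provoke.

```python
# Sketch of storing only a bit per hash address rather than the word itself.

import zlib

TABLE_SIZE = 64  # small on purpose, so collisions are easy to provoke

def bucket(word):
    return zlib.crc32(word.encode()) % TABLE_SIZE

def build_bit_table(lexicon):
    bits = [False] * TABLE_SIZE
    for word in lexicon:
        bits[bucket(word)] = True
    return bits

def accepted(token, bits):
    """True if the token hashes to a marked address -- even when the token is
    not a real word, which is the occasional false acceptance noted above."""
    return bits[bucket(token)]

lexicon = ["spelling", "correction", "detection", "dictionary", "lexicon"]
bits = build_bit_table(lexicon)
print(accepted("spelling", bits))  # True: genuine word
print(accepted("zzqx", bits))      # usually False, but a collision would
                                   # silently make this nonword look valid
```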

Other standard search techniques, such as tries [Knuth 1973], frequency-ordered binary search trees [Knuth 1973], and finite-state automata [Aho and Corasick 1975], have been used to reduce dictionary search time. Knuth [1973] defines a trie as "an M-ary tree, whose nodes are M-place vectors with components corresponding to digits or characters. Each node on level l represents the set of all keys that begin with a certain sequence of l characters; the node specifies an M-way branch, depending on the (l + 1)st character" (p. 481). Knuth notes that the name trie was coined by Fredkin [1960] because of its role in information retrieval applications. Sheil [1978] introduced a fast lookup technique called median split trees for searching lexicons with highly skewed distributions such as English text.


A technique for fast regular-expression pattern matching invented by Aho and Corasick [1975] prescribes a method for creating a finite-state automaton to represent a small set of terms (keywords) that must be matched against a corpus of running text. Other algorithms for fast pattern matching, some of which have been adopted for spelling correction applications, are described in a review article by Aho [1990].
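A minimal trie sketch in the sense of Knuth's definition quoted above uses dictionary-of-dictionaries nodes and an explicit end-of-word marker; the word list is an assumption for demonstration.

```python
# Minimal trie sketch: each node branches on the next character, and complete
# words are marked explicitly with an end-of-word key.

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def lookup(word, trie):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = build_trie(["spell", "spelling", "spelt", "speak"])
print(lookup("spelling", trie))  # True
print(lookup("spel", trie))      # False: prefix only, not a stored word
```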

Peterson [1980] suggests partitioning a dictionary into three levels for spelling error detection in text-processing applications. The first level, to be stored in cache memory, would consist of the few hundred most frequently used words that account for 50% of dictionary accesses; the second level, to be stored in regular memory, would consist of a few thousand domain-specific words which account for another 45% of dictionary accesses; a third level, to be stored in secondary memory, would consist of tens of thousands of less frequently used words which account for the remaining 5% of dictionary accesses.

Memory constraints also motivated many early spelling checkers to avoid storing all possible morphological variants of individual words (e.g., plurals, past tenses, adverbials, nominals, etc.) as separate dictionary entries. Instead, only root forms of words were stored. If a lookup procedure failed to find an input string in the dictionary, it would issue a call to a morphological-processing routine which would iteratively check for known suffixes and prefixes to strip (and possibly replace with alternative strings, as in dictionaries → dictionary) before re-searching the dictionary. However, as Elliott [1988] describes in a memo on an improved morphological-processing component for the Unix spell program, simple affix-stripping methods often lead to false acceptances, such as adviseed or disclam, when affixes are stripped with no regard to grammar. Elliott provides an efficient solution based on creating affix equivalence classes that are motivated by spelling similarities.
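The kind of naive affix stripping criticized by Elliott can be sketched as follows; the suffix rules and the root list are illustrative assumptions, and the second example shows how a nonword such as adviseed is falsely accepted when affixes are stripped with no regard to grammar.

```python
# Sketch of naive suffix stripping against a root-form lexicon.

SUFFIX_RULES = [
    ("ies", "y"),   # dictionaries -> dictionary
    ("ed", ""),     # checked -> check
    ("ed", "e"),    # advised -> advise
    ("ing", ""),
    ("s", ""),
]

ROOTS = {"dictionary", "advise", "spell", "check"}

def accepted(token):
    """Accept a token if it, or any naively de-suffixed form, is a root."""
    if token in ROOTS:
        return True
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix):
            candidate = token[: -len(suffix)] + replacement
            if accepted(candidate):
                return True
    return False

print(accepted("dictionaries"))  # True: "ies" -> "y" yields a real root
print(accepted("adviseed"))      # True, wrongly: stripping "ed" leaves "advise"
```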

Increases in computational speed and memory have made dictionary storage and processing-time constraints less critical, so the current trend is toward storing all inflectional variants of words as separate dictionary entries. However, the need for morphological processing will never be completely obviated because it is virtually impossible to anticipate all of the morphologically creative terms that people tend to generate in both text and conversation (e.g., "large software projects must avoid creeping featurism," "these actors are highly hatchable").

1.3 Dictionary Construction Issues

A lexicon for a spelling correction or text recognition application must be carefully tuned to its intended domain of discourse. Too small a lexicon can burden the user with too many false rejections of valid terms; too large a lexicon can result in an unacceptably high number of false acceptances, i.e., genuine mistakes that went undetected because they happened to form valid low-frequency or extra-domain words (e.g., fen, veery, etc.). But the relationship between misspellings and word frequencies is not straightforward.

Peterson [1986] calculated that approximately half a percent of all single-error transformations of each of the words on a 350,000-item word list result in other valid words on the list. He noted that the number of actual undetected errors may be much higher in practice due to the fact that shorter words, which tend to occur more frequently, also have a higher tendency to result in other valid words when misspelled. So he went on to compute frequency-weighted estimates of the percentage of errors that would go undetected as a function of dictionary size. His estimates ranged from 2% for a small dictionary to 10% for a 50,000-word dictionary to almost 16% for a 350,000-word dictionary. This led Peterson to recommend that word lists for spelling correction be kept relatively small.
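The calculation behind estimates like Peterson's can be sketched as follows: enumerate every single-error transformation of each dictionary word and count how many land on another valid word. The five-word dictionary is an illustrative assumption; Peterson's figures came from a 350,000-item word list and were additionally frequency weighted, which this sketch omits.

```python
# Sketch of counting single-error transformations that collide with other
# valid words (insertion, deletion, substitution, transposition).

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def single_error_variants(word):
    variants = set()
    for i in range(len(word) + 1):
        for c in ALPHABET:
            variants.add(word[:i] + c + word[i:])                 # insertion
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])                     # deletion
        for c in ALPHABET:
            variants.add(word[:i] + c + word[i + 1:])             # substitution
    for i in range(len(word) - 1):
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # transposition
    variants.discard(word)
    return variants

dictionary = {"cat", "cot", "coat", "act", "than"}
hits = total = 0
for w in dictionary:
    vs = single_error_variants(w)
    total += len(vs)
    hits += sum(1 for v in vs if v in dictionary)
print(hits, total, f"{hits / total:.2%} of variants are other valid words")
```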

However, Damerau and Mays [1989] challenge this recommendation. Using a corpus of over 22 million words of text from various genres, they found that by increasing the size of their frequency rank-ordered word list from 50,000 to


60,000 words, they were able to eliminate 1,348 false rejections while incurring only 23 additional false acceptances. Since this 50-to-1 differential error rate represents a significant improvement in correction accuracy, they recommend the use of larger lexicons.

Dictionaries themselves are often insufficient sources for lexicon construction. Walker and Amsler [1986] observed that nearly two-thirds (61%) of the words in the Merriam-Webster Seventh Collegiate Dictionary did not appear in an eight million word corpus of New York Times news wire text, and, conversely, almost two-thirds (64%) of the words in the text were not in the dictionary. In analyzing a sample of the novel terms in the New York Times text, they found that "one fourth were inflected forms; one fourth were proper nouns; one sixth were hyphenated forms; one-twelfth were misspellings; and one fourth were not yet resolved, but some are likely to be new words occurring since the dictionary was published" (p. 79). Many such novel terms could be desirable, frequently used terms in a given application. Some applications require lexicons specialized for command and file names or specific database entries. On the topic of construction of the spelling list for the spell program, McIlroy [1982] provides some helpful insights into appropriate sources that may be drawn upon for general word-list construction. A more recent article by Damerau [1990] provides some insights into and guidelines for the automatic construction and customization of domain-oriented vocabularies for specialized natural-language-processing applications.

Mitton [1986] has made available a computer-usable dictionary derived from the Oxford Advanced Learner's Dictionary of Current English. Its nearly 38,000 main entries yield over 68,000 inflected forms. Sampson [1989] has evaluated the coverage of this dictionary against a 50,000-word cross section of a corpus of written English. Both the dictionary and the corpus "are British-based, though both include some material pertaining to other varieties of English" (p. 29). Sampson reports that he found the dictionary's coverage of the corpus to be surprisingly good—only 1,477, or 3.24%, of the word tokens in the corpus were not in the dictionary.3 In analyzing the 1,176 distinct word types represented by the 1,477 word tokens, Sampson found that over half, about 600, were proper names or adjectives derived from proper names (e.g., Carolean). Of the remainder, he felt that some, including abbreviations and numbers written digitally with alphabetic affixes, could be added with relative ease. But he found foreign words, hyphenated variants, and derived morphological variants (e.g., inevitably, glassily, fairness), especially missing negative forms (e.g., uncongenial, irreversible), to be problems for spelling dictionary construction.

3 A type vs. token convention is used to distinguish actual occurrences of an individual word, i.e., tokens, from the name of the word itself, i.e., its type. For example, a document may contain many tokens of the word the, but the word the itself represents only one type.

1.4 The Word Boundary Problem

For virtually all spelling error detection and correction techniques, word boundaries are defined by white space characters (e.g., blanks, tabs, carriage returns, etc.). This assumption turns out to be problematic since a significant portion of text errors involve running together two or more words, sometimes with intrinsic errors (e.g., ofthe, understandme), or splitting a single word (e.g., sp ent, th ebook). Kukich [1992] found that a full 15% of all nonword spelling errors in a 40,000-word corpus of typed textual conversations involved this type of error (i.e., 13% were run-on words, and 2% were split words). Mitton [1987] found that run-ons and splits frequently result in at least one valid word (e.g., forgot → for got, in form → inform), so one or both such errors may go undetected. In contrast to human-generated errors, Jones et al. [1991] found that OCR devices are more likely to split words than to join them. The difficulty in dealing with run-ons and splits lies in the fact that weakening the word boundary constraint results in a combinatorial explosion of the number of possible word combinations or subdivisions that must be considered.

No spelling error detection systems treat word boundary infractions any differently than other types of errors, but a few spelling error correction systems attempt to handle some types of run-on and split-word errors explicitly. One such corrector was built by Lee et al. [1990] for a natural language interface to an intelligent tutoring system in the domain of cardiovascular physiology. This corrector exploits the limited vocabulary of its domain of discourse to check for run-ons or splits when no candidate corrections exceed a minimum threshold of similarity. Another corrector implemented by Means [1988] for a limited-domain natural language interface handles split-word errors in the same way. A corrector designed by Kernighan [1991] for a text-to-speech application handles run-on words by checking for space bar deletions at the same time it checks for other letter deletions. It sets a "likelihood score for a two-word correction equal to a constant (determined by trial and error) times the frequencies of the two words" (p. 4). CLARE, a front end to a natural language parsing and generation system implemented by Carter [1992], was explicitly designed to handle word boundary infractions. It does so by maintaining "a lattice of overlapping word hypotheses from which one or more complete paths are subsequently selected" (p. 159). This system was able to find a single, accurate correction for 59 of 108 errors in a test set of 102 sentences. Twenty-four of the 59 corrections involved word boundary infractions. An error-correcting postprocessor for OCR devices designed by Jones et al. [1991] includes a processing phase that explicitly checks for split-word errors.
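A minimal sketch of the space-bar-deletion idea attributed to Kernighan above: a flagged string is split at every position, splits whose halves are both known words are scored as a constant times the product of their frequencies, and the best-scoring split is proposed. The frequency table and the constant are illustrative assumptions.

```python
# Sketch of handling run-on errors by reinserting a deleted space.

WORD_FREQ = {"of": 15000, "the": 20000, "in": 9000, "form": 1200, "inform": 800}
CONSTANT = 1e-9  # stand-in for a constant "determined by trial and error"

def two_word_corrections(token):
    """Return candidate (word1, word2, score) splits of a run-on token."""
    candidates = []
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in WORD_FREQ and right in WORD_FREQ:
            score = CONSTANT * WORD_FREQ[left] * WORD_FREQ[right]
            candidates.append((left, right, score))
    return sorted(candidates, key=lambda c: c[2], reverse=True)

print(two_word_corrections("ofthe"))   # best split: ('of', 'the', ~0.3)
print(two_word_corrections("inform"))  # also splits into ('in', 'form'); a real
                                       # corrector must weigh this against the
                                       # valid word "inform" itself
```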

There is some indication that a large portion of run-on and split-word errors involves a relatively small set of high-frequency function words (i.e., prepositions, articles, quantifiers, pronouns, etc.), thus giving rise to the possibility of limiting the search time required to check for this type of error. The SPEEDCOP corrector, implemented by Pollock and Zamora [1984] for a general text-editing application, includes a final subroutine that checks for run-on errors involving function words after other routines have failed. However, the general problem of handling errors due to word boundary infractions remains one of the significant unsolved problems in spelling correction research.

1.5 Summary of Nonword Error Detection Work

In general, although n-gram analysis may be useful for detecting machine-generated errors such as those produced by optical-character recognizers, it has proven to be less accurate for detecting human-generated errors. Hence, most current spelling correction techniques rely on dictionary lookup for error detection. Since access speed may be a factor when dictionary size is moderate to large (e.g., > 20,000 entries), efficient algorithms for exact pattern matching that exploit hash tables, tries, and other techniques have been used. Dictionaries must be carefully tailored to the domain of discourse of the application in order to avoid frustrating the user with too many false acceptances and rejections. More research in the area of generative morphology is warranted to address the problem of creative morphology if not dictionary size reduction. Finally, errors due to word boundary infractions remain a significant unsolved problem for spelling error detection and correction.

2. ISOLATED-WORD ERROR CORRECTION RESEARCH

For some applications, simply detecting errors in text may be sufficient, but for most applications detection alone is not enough. For example, since the goal of text recognition devices is to accurately reproduce input text, output errors must be both detected and corrected.


Similarly, users have come to expect spelling checkers to suggest corrections for the nonwords they detect. Indeed, some spelling correction applications, such as text-to-speech synthesis, require that errors be both detected and corrected without user intervention. To address the problem of correcting words in text, a variety of isolated-word error correction techniques have been developed.

The characteristics of different applications impose different constraints on the design of isolated-word error correctors, and many successful correction techniques have been devised by exploiting application-specific characteristics and constraints. It is worth reviewing some isolated-word error correction applications and their characteristics before delving into the details of individual techniques.

Text recognition and text editing are the most studied applications. A number of papers covering the first two decades of text recognition research is available in an anthology [Srihari 1984]. Some more recent work in this area includes the following papers: [Burr 1987; Goshtasby and Ehrich 1988; Ho et al. 1991; Jones et al. 1991]. A variety of spelling correction techniques used in text editing is represented by the following papers: [Damerau 1964; Alberga 1967; Gorin 1971; Wagner 1974; Hall 1980; Peterson 1980; Yannakoudakis and Fawthrop 1983a; Pollock and Zamora 1984; Kernighan et al. 1990].

Other applications for which spelling correction techniques have been devised include systems programming applications [Morgan 1970; Aho and Peterson 1972; Sidorov 1979; Spenke et al. 1984], command language interfaces [Hawley 1982; Durham et al. 1983], database retrieval and information retrieval interfaces [Blair 1960; Davidson 1962; Boivie 1981; Mor and Fraenkel 1982a; Bickel 1987; Kukich 1988a; Salton 1989; Gersho and Reiter 1990; Cherkassky et al. 1990; Parsaye et al. 1990], natural language interfaces [Veronis 1988a; Means 1988; Berkel 1988; Lee et al. 1990; Deffner et al. 1990a], computer-aided tutoring [Tenczar and Golden 1972], computer-aided language learning [Contant and Brunelle 1992], text-to-speech applications [Kukich 1990; Tsao 1990; Kernighan 1991], augmentative communication systems for the disabled [Alm et al. 1992; Wright and Newell 1991; Demasco and McCoy 1992], pen-based interfaces [Rhyne and Wolf 1991], and even searching for historical word forms in databases of 17th century English [Robertson and Willet 1992].

Most application-specific design considerations are related to three main issues: (1) lexicon issues, (2) computer-human interface issues, and (3) spelling error pattern issues. Lexicon issues include such things as lexicon size and coverage, rates of entry of new terms into the lexicon, and whether morphological processing, such as affix handling, is required. These were discussed in the preceding section on dictionary construction issues.

Computer-human interface issues include such considerations as whether a real-time response is needed, whether the computer can solicit feedback from the user, how much accuracy is required on the first guess, etc. These issues are especially important in command language interfaces where trade-offs must be made to balance the need for accuracy against the need for quick response time and where care must be taken to provide the user with helpful corrections without badgering her with unwanted suggestions. An empirical study by Durham et al. [1983] demonstrated, among other things, that considerable benefits accrued from a simple, speedy algorithm for correcting single-error typos that only corrected about one quarter of the spelling mistakes made by users but did so in an unobtrusive manner. Other work, by Hawley [1982], demonstrated the ease with which spelling correction could be incorporated into the Unix command language interface. In studying alternative methods for correcting recognition errors in pen-based interfaces, Rhyne and Wolf [1993] found that the cost to repair


errors is a critical factor in the design of these interfaces and in their ultimate acceptance by users. Some of the correction methods they are studying include allowing the user to write over an error, allowing for keyboard correction, and providing n best matches for the user to select from.

Issues related to spelling error patterns include such things as what the most common errors are, how many errors tend to occur within a word, whether errors tend to change word length, whether misspellings are typographically, cognitively, or phonetically based, and, in general, whether errors can be characterized by rules or by probabilistic tendencies. Spelling error pattern issues have had perhaps the greatest impact on the design of correction techniques.

2.1 Spelling Error Patterns

Spelling error patterns vary greatly depending on the application task. For example, transcription typing errors, which are for the most part due to motor coordination slips, tend to reflect typewriter keyboard adjacencies, e.g., the substitution of b for n. In contrast, errors introduced by optical-character recognizers are more likely to be based on confusions due to featural similarities between letters, e.g., the substitution of D for O. More subtly, even two similar text entry modes, such as transcription typing and conversational text typing, e.g., electronic mail, may exhibit significantly different error frequency and distribution statistics due to the greater cognitive overhead involved in the latter task. So care must be taken not to overgeneralize findings when discussing spelling error patterns.

Distinctions are sometimes made among three different types of nonword misspellings: (1) typographic errors, (2) cognitive errors, and (3) phonetic errors. In the case of typographic errors (e.g., the → teh, spell → speel), it is assumed that the writer or typist knows the correct spelling but simply makes a motor coordination slip. The source of cognitive errors (e.g., receive → recieve, conspiracy → conspiricy) is presumed to be a misconception or a lack of knowledge on the part of the writer or typist. Phonetic errors (e.g., abyss → abiss, naturally → nacherly) are a special class of cognitive errors in which the writer substitutes a phonetically correct but orthographically incorrect sequence of letters for the intended word. It is frequently impossible to ascribe a single category to a given error. Is recieve necessarily a cognitive error, for example, or might it simply be a typographic transposition error? Similarly, is abiss a phonetic or typographic error? Fortunately, it is often unnecessary to categorize errors in order to develop a useful spelling correction technique because many correction techniques handle typographic and cognitive misspellings equally well. In fact, only a few researchers bother to distinguish among categories of misspellings. Phonetic errors, however, tend to distort spellings more than typographic errors and other cognitive errors, so the limited findings related to these errors are reported below.

Most studies of spelling error patterns were done for the purpose of designing correction techniques for nonword errors (i.e., those for which there is no exact match in the dictionary), so most findings relate only to nonword as opposed to real-word errors (i.e., those that result in another valid word). However, at least two studies that did address real-word errors are included in this discussion. It should be noted that the increasing use of automatic spelling checkers has probably reduced the number of nonword errors found in some genres of text. Consequently, the relative ratio of real-word errors to nonword errors is probably higher today than it was in early studies, at least for some applications.

2.1.1 Basic Error Types

One of the first general findings on human-generated spelling errors was observed by Damerau in 1964 and has been substantiated for many applications since


then. Damerau [1964] found that approximately 80% of all misspelled words contained a single instance of one of the following four error types: insertion, deletion, substitution, and transposition. Misspellings that fall into this large class are often referred to as single-error misspellings; misspellings that contain more than one such error have been dubbed multi-error misspellings.
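Damerau's four error types can be made concrete with a small Python sketch that decides whether a single insertion, deletion, substitution, or transposition maps a candidate word onto a given misspelling; the example word pairs are illustrative assumptions.

```python
# Sketch of recognizing Damerau's four single-error types.

def single_error_type(candidate, misspelling):
    """Return the error type relating the strings, or None if more than one
    such edit (or none) would be needed."""
    c, m = candidate, misspelling
    if c == m:
        return None
    if len(m) == len(c) + 1:                      # one extra letter typed
        for i in range(len(m)):
            if m[:i] + m[i + 1:] == c:
                return "insertion"
    if len(m) == len(c) - 1:                      # one letter dropped
        for i in range(len(c)):
            if c[:i] + c[i + 1:] == m:
                return "deletion"
    if len(m) == len(c):
        diffs = [i for i in range(len(c)) if c[i] != m[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and c[diffs[0]] == m[diffs[1]] and c[diffs[1]] == m[diffs[0]]):
            return "transposition"
    return None

print(single_error_type("the", "teh"))        # transposition
print(single_error_type("spell", "spel"))     # deletion
print(single_error_type("spell", "spelll"))   # insertion
print(single_error_type("spell", "speel"))    # substitution
```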

The 80% single-error rule cannot be taken for granted for all applications, however. Pollock and Zamora [1984] found that only 6% of 50,000 nonword spelling errors in the machine-readable databases they studied were multi-error misspellings. Conversely, Mitton [1987] found that 31% of the misspellings in his 170,016-word corpus of handwritten essays contained multiple errors.

OCR-generated misspellings do not follow the pattern of human-generated misspellings. Most error correction techniques for OCR output assume that the bulk of the errors will be substitution errors. However, Jones et al. [1991] report that "the types of [OCR] errors that occur vary widely, not only from one recognizer to another but also based on font, input quality and other factors" (p. 929). They also report that "a significant fraction of OCR errors are not one-to-one errors (e.g., ri → n or m → iii)" (p. 927). Rhyne and Wolf [1993] classify recognition errors into four categories: (1) substitutions; (2) failures (no character exceeds a minimal recognition threshold); (3) insertions and deletions; and (4) framing errors (one-to-one mapping failures).

2.1.2 Word Length Effects

Another general finding, a corollary to the first, is the observation that most misspellings tend to be within two characters in length of the correct spelling. This has led many researchers, especially within the OCR paradigm, to partition their dictionaries into subdictionaries by word length in order to reduce search time.

Unfortunately, little concrete data exists on the frequency of occurrence of errors by word length. This characteristic

will clearly affect the practical performance of a correction technique since errors in short words are harder to correct, in part because less intraword contextual information is available to the corrector. Furthermore, according to Zipf's law [Zipf 1935], short words occur more frequently than long words, and, according to an empirical study by Landauer and Streeter [1973], high-frequency (i.e., short) words tend to have more single-error neighbors than low-frequency words, thus making it difficult to select the intended correction from its set of neighbors. On the other hand, if spelling errors occur less frequently in short words, then the problem may be less pressing.

Pollock and Zamora's study [Pollock and Zamora 1983] of 50,000 nonword errors indicated that errors in short words are indeed problematic for spelling correction even though their frequency of occurrence may be low. They state: "although 3-4 character misspellings constitute only 9.2% of total misspellings, they generate 42% of the miscorrections" (p. 367). There appears to be wide variance in the proportion of errors that occur in short words across different applications. For example, Yannakoudakis and Fawthrop [1983b] found an even lower frequency of occurrence of errors in short words, about 1.5%, in their analysis of 1,377 errors found in the literature. In contrast, Kukich [1990] analyzed over 2,000 error types in a corpus of TDD conversations and found that over 63% of the errors occurred in words of length 2, 3, and 4 characters. These differences emphasize the need for determining the actual characteristics of the spelling errors of the intended application before designing or implementing a correction system.

2.1.3 First-Position Errors

It is generally believed that few errors tend to occur in the first letter of a word. Only a few studies actually document first-position error statistics. Pollock and Zamora [1983] found that 3.3% of their 50,000 misspellings involved first letters, and Yannakoudakis and Fawthrop [1983b] observed a first-position error


rate of 1.4% in 568 typing errors. Mitton [1987] found that 7% of all the misspellings he studied involved first-position errors. In contrast, Kukich [1992] observed a 15% first-position error rate in a 40,000-word corpus of typed textual conversations.

Discounting first-position errors allows a lexicon to be partitioned into 26 subsets, each containing all words beginning with a single letter, thus greatly reducing search time. Many spelling correction techniques have exploited this characteristic. However, a caveat is in order any time dictionary-partitioning schemes are used: a trade-off must be made between quicker response time and reduced accuracy due to the possibility of missing the correction entirely because it is not in the partition searched.

2.1.4 Keyboard Effects

Some extensive studies of typing behavior were carried out by the LNR Typing Research Group [Gentner et al. 1983]. Rather than developing a spelling correction technique, their goal was to develop a working computer simulation model of typing. As part of this work, Grudin [1983] performed an extensive analysis of the typing errors made by six expert typists and eight novice typists while transcribing magazine articles totaling about 60,000 characters of text. He found large individual differences in both typing speed and types of errors made. For example, error rates ranged from 0.4% to 1.9% for experts and averaged 3.2% for novices; the majority of expert errors were insertions that resulted from hitting two adjacent keys simultaneously while the majority of novice errors were substitutions. But Grudin did uncover some general patterns. After compiling a confusion matrix4 from the 3,800 substitution errors in his corpus, he observed that 58% of all substitution errors involved adjacent typewriter keys. He also found strong letter frequency effects in the adjacent substitution errors, i.e., even after normalizing for frequency in language, a typist is more likely to substitute a higher-frequency letter for a lower-frequency neighbor than vice versa. Grudin went on to develop a computer model of typing capable of generating the same sorts of errors that human typists make [Grudin 1981]. Until recently, few spelling correction techniques have attempted to directly exploit the probabilistic tendencies that arise from keyboard adjacencies and letter frequencies. Those that have are discussed in Section 2.2.5.

4 A confusion matrix is a square matrix whose rows and columns are labeled with the characters of the keyboard and whose i, jth element contains the frequency count of the number of times letter i was mistakenly keyed when letter j was intended in a given corpus.
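Compiling a confusion matrix of the kind described in the footnote above can be sketched as follows; the (intended, typed) letter pairs are illustrative assumptions, not Grudin's data.

```python
# Sketch of compiling a keyboard confusion matrix from observed
# (intended, typed) substitution errors.

from collections import defaultdict

def build_confusion_matrix(pairs):
    """pairs: iterable of (intended_letter, typed_letter) substitution errors."""
    matrix = defaultdict(lambda: defaultdict(int))
    for intended, typed in pairs:
        matrix[intended][typed] += 1
    return matrix

observed_errors = [("n", "b"), ("n", "b"), ("n", "m"), ("e", "r"), ("o", "i")]
matrix = build_confusion_matrix(observed_errors)
print(matrix["n"]["b"])  # 2: 'b' was typed twice when 'n' was intended
```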

2.1.5 Error Rates

Data concerning the frequency of occurrence of spelling errors is not abundant. The few data points that do exist must be qualified by (1) the size of the corpus from which they were drawn; (2) the text entry mode of the corpus, e.g., handwriting, transcription typing, conversational typing, edited machine-readable text, etc.; and (3) the date of the study, since newer studies for some genres, such as edited machine-readable text, probably reflect lower error rates due to the availability of automatic spelling checkers. Furthermore, whether both nonword and real-word errors were included in the study should also be noted.

Grudin's transcription-typing study [Grudin 1983] found average error rates of 1% for expert typists and 3.2% for novices. His corpus consisted of approximately 60,000 characters of typed text. He did not distinguish between errors that resulted in nonwords and errors that resulted in real words.

Two studies of errors in typed textual conversations by deaf TDD (Telecommunications Device for the Deaf) users by Kukich [1992] and Tsao [1990] found nonword spelling error rates of 6% and 5% respectively. The corpus studied by Kukich contained 40,000 words, or, more accurately, strings, and that studied by Tsao contained 130,000 strings.

Pollock and Zamora [1984] examined a


set of machine-readable databases containing 25 million words of text. They found over 50,000 nonword errors, for a spelling error rate of 0.2%. An even lower error rate holds for a corpus of AP (Associated Press) news wire text studied by Church and Gale [1991]. For the purposes of their study, a typo was defined as any lower-case word rejected by the Unix spell program. By this measure, they found that "the AP generates about 1 million words and 500 typos per week" (p. 93), which equates to an error rate of 0.05%.

Spelling error rates of 1.5% and 2.5% for handwritten text have been reported by Wing and Baddeley [1980] and Mitton [1987] respectively. Wing and Baddeley's corpus consisted of 40 essays written by Cambridge college applicants totaling over 80,000 words, of which 1,185 were in error; Mitton's corpus contained 925 high school student essays totaling 170,016 words, of which 4,128 were in error. Both of these studies included both nonword and real-word errors. A quick scan of Wing and Baddeley's data suggests that at least 30% were real-word errors. Mitton found that fully 40% of the misspellings in his corpus were real-word errors. These involved mainly substitution of a wrong function word (e.g., he for her), substitution of a wrong inflected form (e.g., arrive for arrived), and incorrect division of a word (e.g., my self for myself). None of these types of errors are detectable by isolated-word error correctors. As noted in Section 1.3, Peterson's [1986] study of undetected spelling errors indicates that single-error typos alone are likely to result in valid words at least 16% of the time. These findings suggest a genuine need for context-dependent spelling correction methods, at least for some applications. This topic will be taken up in Section 3.

While many manufacturers of OCR equipment advertise character recognition error rates as low as 1–2%, or equivalent word error rates of 5–10%, recent research [Santos et al. 1992; Jones et al. 1991] indicates that these devices tend to exhibit marginal performance levels, e.g., word error rates in the range of 7–16%, in actual field applications (as opposed to laboratory testing under ideal conditions). Nevertheless, one recent study [Nielsen et al. 1992] found that despite word error rates of 8.8% for one pen-based system, information retrieval rates were nearly as good for pen-based-entered documents as they were for the same error-free text. The authors caution readers that this result holds only for documents whose average length was 185 words, and that "very short notes would be difficult to find without the use of further attributes such as time or context of writing" (p. 7).

2.1.6 Phonetic Errors

Examples of applications in which phonetic errors abound include (1) the automatic yellow-pages-like information retrieval service available in the French Minitel system [Veronis 1988b], (2) a directory assistance program for looking up names in a corporate directory [Boivie 1981], (3) a database interface for locating individuals by surname in large credit, insurance, motor vehicle bureau, and law enforcement databases [Oshika et al. 1988], and (4) a natural language interface to an electronic encyclopedia [Van Berkel and DeSmedt 1988]. The predominant phonetic nature of errors in these applications is intuitively explained by the fact that people resort to phonetic spellings for unfamiliar names and words.

Two data points on the frequency of occurrence of phonetic errors have been documented. One is provided by Van Berkel and DeSmedt [1988]. They had 10 Dutch subjects transcribe a tape recording of 123 Dutch surnames randomly chosen from a telephone directory. They found that 38% of the spellings generated by the subjects were incorrect despite being phonetically plausible. The other data point comes from Mitton's study [Mitton 1987]. He found that 44% of all the spelling errors in his corpus of 925 student essays involved homophones.

2.1.7 Heuristic Rules and Probabilistic Tendencies

Three comprehensive studies of spelling error patterns have been carried out with the goal of devising spelling correction techniques. One study, by Yannakoudakis and Fawthrop [1983b], aimed at discovering specific rules that spelling errors tend to follow, with the intent of designing a rule-based spelling correction algorithm. Another study, by Pollock and Zamora [1983], aimed at discovering probabilistic tendencies, such as which letters and which position within a word are most frequently involved in errors, with the intent of devising a similarity key-based technique. A third study, by Kernighan et al. [1990], aimed at compiling error probability tables for each of the four classes of errors, with the intent of exploiting those probabilities directly. There is obviously some overlap among the goals of these three studies. Nevertheless, the information they derived was used to devise and implement three very different spelling correction techniques, each of which is described in the next section. This section contains a brief review of their findings.

Yannakoudakis and Fawthrop [1983a, 1983b] sought a general characterization of misspelling behavior. They compiled a database of 1,377 spelling errors culled from a variety of sources, including mistakes found in the Brown corpus [Kucera and Francis 1967] as well as those found in a 60,000-word corpus of text typed by three University of Bradford adults who "believed themselves to be very bad spellers." They found that a large portion of the errors could be accounted for by a set of 17 heuristic rules, 12 of which related to the misuse of consonants or vowels in graphemes,5 and five of which related to sequence production. For example, heuristics related to consonants included (1) the letter h is frequently

5 A grapheme is a letter sequence corresponding to a phoneme. For example, the word that includes the graphemes th, a, and t.

omitted in words containing the graphemes ch, gh, ph, and rh, as in the misspellings agast and tecniques; and (2) doubling and singling of consonants which frequently occur doubled is a common error. Heuristics related to sequence production included: (3) the most frequent length of a misspelling is one letter short of the correct spelling; (4) typing errors are caused by hitting an adjacent key on the keyboard or by hitting two keys together; (5) short misspellings do not contain more than one error; and so on.

The application motivating Pollock and Zamora's study was that of editing machine-readable text for bibliographic information retrieval services. They extracted over 50,000 errors from 25 million words of scientific and scholarly text taken from seven Chemical Abstracts Service databases. Among other things, they found that (1) 0.2% of the words in their corpus contained spelling errors; (2) 94% of the spelling errors were single errors; (3) 34% of all errors were omissions; (4) 23% of all errors occurred in the third position of a word; and (5) except for a few frequently misspelled words (e.g., the → teh), most misspellings are rarely repeated.

Kernighan et al. [1990] scanned 44 million words of AP news wire text for spelling errors. Using spell to locate nonword strings in the text and a candidate generation technique to locate every possible dictionary entry formed by a single insertion, deletion, substitution, or transposition in the nonword strings, they found over 25,000 misspellings for which there existed exactly one possible correct spelling. (Some nonword strings had no possible corrections because they contained more than one error; others had more than one possible correction, e.g., acress → actress, acres, access.) Their list of 25,000 misspelling/correct-spelling pairs was used to compile error frequency tables (a.k.a. confusion matrices) for each of the four classes of spelling errors: insertions, deletions, substitutions, and transpositions. For example, they determined that a was incorrectly
substituted for e 238 times; s was incorrectly inserted after e 436 times; t was incorrectly deleted after i 231 times; and it was incorrectly typed for ti 48 times. These frequencies were used to estimate the probability of occurrence of each potential error.
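
To make the tabulation concrete, the Python sketch below shows one way such error frequency tables could be compiled from a list of misspelling/correct-spelling pairs. The alignment conventions (conditioning insertions and deletions on the preceding letter, with '#' standing for a word boundary) and the example pairs are illustrative assumptions, not Kernighan et al.'s actual implementation.

    from collections import defaultdict

    # Frequency tables for the four classes of single-error misspellings,
    # tabulated from (misspelling, correct word) pairs.
    sub = defaultdict(int)    # sub[(x, y)]: x was typed when y was intended
    ins = defaultdict(int)    # ins[(prev, x)]: x was inserted after prev
    dele = defaultdict(int)   # dele[(prev, y)]: y was deleted after prev
    trans = defaultdict(int)  # trans[(x, y)]: intended xy was typed as yx

    def tally(miss, word):
        """Classify a single-error misspelling and update the tables."""
        if len(miss) == len(word):
            diffs = [i for i in range(len(word)) if miss[i] != word[i]]
            if len(diffs) == 1:
                i = diffs[0]
                sub[(miss[i], word[i])] += 1                   # substitution
            elif (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                  and miss[diffs[0]] == word[diffs[1]]
                  and miss[diffs[1]] == word[diffs[0]]):
                trans[(word[diffs[0]], word[diffs[1]])] += 1   # transposition
        elif len(miss) == len(word) + 1:
            for i in range(len(miss)):
                if miss[:i] + miss[i+1:] == word:
                    ins[(miss[i-1] if i else '#', miss[i])] += 1   # insertion
                    break
        elif len(miss) == len(word) - 1:
            for i in range(len(word)):
                if word[:i] + word[i+1:] == miss:
                    dele[(word[i-1] if i else '#', word[i])] += 1  # deletion
                    break

    tally("acress", "actress")          # t deleted after c
    tally("esspecially", "especially")  # s inserted after e
    print(dele[('c', 't')], ins[('e', 's')])   # 1 1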

2.1.8 Common Misspellings

Still other sources of data on the nature of spelling errors are published lists of common misspellings and their corrections. Webster's New World Misspeller's Dictionary [Webster 1983] is one such list. Its foreword identifies 12 classes of common errors, such as uncertainty about whether a consonant is doubled, errors resulting from mispronunciations, errors resulting from homonyms, etc., some of which overlap with the Yannakoudakis and Fawthrop [1983b] findings. Very few academic spelling correctors report the use of lists of common misspellings. One exception is the technique devised by Pollock and Zamora, which did incorporate a routine for checking a short list of frequently misspelled words.

2.1.9 Summary of Spelling Error Pattern Findings

In summary, the frequency of occurrence of nonword spelling errors varies greatly depending on application, data entry mode, and date of study. Estimates range from as low as 0.05% in edited news wire text to as high as 38% in phonetically oriented database retrieval applications. Practical OCR word error rates currently range between 7% and 16%. OCR error patterns tend to vary for different OCR devices and type fonts.

Some general findings on the nature of human-generated spelling errors include the facts that: (1) most errors (i.e., roughly 80%) tend to be single instances of insertions, deletions, substitutions, or transpositions; (2) as a corollary, most errors tend to be within one letter in length of the intended word; and (3) few misspellings occur in the first letter of a word. These findings have been exploited

by implementing fast algorithms for correcting single-error misspellings and/or by partitioning dictionaries according to first letter and/or word length to reduce search time. Other general findings of (4) strong keyboard adjacency effects and (5) strong letter frequency effects hold potential for improving correction accuracy. There is general agreement that phonetic errors are harder to correct because they result in greater distortion of the misspelled string from the intended word than single-error typos.

At least one study has examined individual errors to try to identify common mistakes that could be characterized by general rules, such as the tendency to double or undouble consonants. Another study examined errors to identify probabilistic tendencies, such as the fact that over one-third of all errors were omissions. And another study compiled general error probability tables for insertions, deletions, substitutions, and transpositions.

It is worth emphasizing that specific findings do not hold for all applications. It is imperative to collect a sufficiently large sample of errors characteristic of the target application for analysis and testing in order to select or devise a spelling correction technique that is optimally suited to that application. Finally, it is worth noting that one study found that fully 40% of all errors in a corpus were real-word errors, i.e., errors that are not detectable by isolated-word error correction techniques, thus emphasizing the need for research into context-dependent word correction techniques.

2.2 Techniques for Isolated-Word Error Correction

The problem of isolated-word error correction entails three subproblems: (1) detection of an error; (2) generation of candidate corrections; and (3) ranking of candidate corrections. Most techniques treat each subproblem as a separate process and execute them in sequence. The error detection process usually consists of checking to see if an input
string is a valid dictionary word or if its n-grams are all legal. The candidate generation process usually employs a dictionary or a database of legal n-grams to locate one or more potential correction terms. The ranking process usually invokes some lexical-similarity measure between the misspelled string and the candidates or a probabilistic estimate of the likelihood of the correction to rank order the candidates. Some techniques omit the third process, leaving the ranking and final selection to the user. Other techniques, including many probabilistic and neural net techniques, combine all three subprocesses into one step by computing a similarity or probability measure between an input string and each word, or some subset of words, in the dictionary. If the resulting top-ranked dictionary word is not identical to the input string, an error is indicated and a ranked list of potential corrections is computed.

As in detection research, n-gram statistics initially played a central role in text recognition techniques while dictionary-based methods dominated spelling correction techniques. But text recognition researchers quickly discovered that n-gram analysis alone was inadequate to the task of correction, so they began to develop "top-down/bottom-up" approaches that combined the use of dictionaries (the "top-down" aspect) with n-gram statistics (the "bottom-up" aspect). At the same time, n-gram statistics found their way into dictionary-based spelling correction research. In addition, many other clever techniques were invented based on minimum edit distance algorithms, similarity keys, rule-based procedures, probability estimates, and neural nets. A convenient way to organize the approaches is to group them into six main classes:

(1) minimum edit distance techniques;
(2) similarity key techniques;
(3) rule-based techniques;
(4) n-gram-based techniques;
(5) probabilistic techniques;
(6) neural nets.

This organization is somewhat arbitrary as there is some overlap among classes. In particular, the last three classes are closely related in that most implementations rely on some lexical feature-based representation of words and misspellings, such as n-grams. Furthermore, many hybrid techniques combining the approaches of more than one class have been developed.

In the following discussion, each of the techniques is briefly described in terms of the candidate generation and ranking subprocesses. For the sake of providing some insight into how techniques differ in their scope and accuracy, some additional relevant characteristics are reported whenever that information is available. These include: (1) lexicon size; (2) test set size; (3) correction accuracy for single-error misspellings; (4) correction accuracy for multi-error misspellings; (5) word length limitations; and (6) type of errors handled (e.g., OCR-generated vs. human-generated, typographic vs. phonetic, etc.). These characteristics are important because techniques differ greatly in the types of errors they handle, forming nonoverlapping or partially overlapping error coverage sets. Since few techniques have been tested on the same test set and lexicon, direct comparisons are rarely possible. The following sections are meant to convey a general idea of the scope and accuracy of existing isolated-word error correction techniques.

2.2.1 Minimum Edit Distance Techniques

By far the most studied spelling correction algorithms are those that compute a minimum edit distance between a misspelled string and a dictionary entry. The term minimum edit distance was defined by Wagner [1974] as the minimum number of editing operations (i.e., insertions, deletions, and substitutions) required to transform one string into another. The first minimum edit distance spelling correction algorithm was implemented by Damerau [1964]. About the same time, Levenshtein [1966] developed a similar
algorithm for correcting deletions, insertions, and reversals (transpositions). Wagner [1974] first introduced the notion of applying dynamic-programming techniques [Nemhauser 1966] to the spelling correction problem to increase computational efficiency. Wagner and Fischer [1974] also generalized Levenshtein's algorithm to cover multi-error misspellings. Another algorithmic variant, due to Lowrance and Wagner [1975], accounted for additional transformations (such as exchange of nonadjacent letters). Still other variants were developed by Wong and Chandra [1976], Okuda et al. [1976], and Kashyap and Oommen [1981]. The metric that all of these techniques compute is sometimes referred to as a Damerau-Levenshtein metric after the two pioneers. Hawley [1982] tested some of these metrics in his spelling corrector for the Unix command language interface.
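
For reference, a minimal dynamic-programming implementation of a Damerau-Levenshtein distance might look as follows. This is a sketch of the commonly implemented restricted variant (insertions, deletions, substitutions, and transpositions of adjacent letters), not a transcription of any of the published algorithms.

    def damerau_levenshtein(s, t):
        """Minimum number of insertions, deletions, substitutions, and
        adjacent transpositions needed to turn s into t."""
        m, n = len(s), len(t)
        # d[i][j] = distance between the first i letters of s and first j of t
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i-1] == t[j-1] else 1
                d[i][j] = min(d[i-1][j] + 1,        # deletion
                              d[i][j-1] + 1,        # insertion
                              d[i-1][j-1] + cost)   # substitution or match
                if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                    d[i][j] = min(d[i][j], d[i-2][j-2] + 1)   # transposition
        return d[m][n]

    print(damerau_levenshtein("absense", "absence"))   # 1
    # Ranking candidates amounts to sorting dictionary words by this distance:
    # sorted(lexicon, key=lambda w: damerau_levenshtein(misspelling, w))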

Aho's [1990] survey article on pattern-matching algorithms points out that the Unix file comparison tool, diff, is based on a dynamic-programming minimum edit distance algorithm. It goes on to describe other dynamic-programming algorithms in terms of their time complexities. Dynamic-programming algorithms have also been studied extensively for other approximate pattern-matching applications, such as comparing macromolecular sequences (e.g., RNA and DNA), comparing time-warped speech signals for speech recognition, comparing gas chromatograms for spectral analysis, and others. These are described in a collection of articles edited by Sankoff and Kruskal [1983].

Although many minimum edit distance algorithms compute integer distance scores, some algorithms obtain finer-grained scores by assigning noninteger values to specific transformations based on estimates of phonetic similarities or keyboard adjacencies. Veronis [1988a] devised a modified dynamic-programming algorithm that differentially weights graphemic edit distances based on phonemic similarity. This modification is necessary because phonetic misspellings frequently result in greater deviation from the correct orthographic spelling.

In general, minimum edit distance algorithms require m comparisons between the misspelled string and the dictionary, where m is the number of dictionary entries. However, some shortcuts have been devised to minimize the search time required. Going on the assumption that deletions are the most common type of error, Mor and Fraenkel [1982b] achieved an efficient response time for a full-text information retrieval application by storing every word in the dictionary |x| + 1 times, where |x| is the length of the word, each time omitting one letter. They then invoked a hash function to look up misspellings.
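
A minimal sketch of this deletion-indexing idea follows; the three-word dictionary is invented for illustration, and an ordinary Python dictionary stands in for the hash table.

    from collections import defaultdict

    # Store each word under its own spelling and under every variant formed
    # by omitting one letter, so a misspelling caused by a single deletion
    # hashes directly to its correction.
    index = defaultdict(set)

    def add_word(word):
        index[word].add(word)                        # the word itself
        for i in range(len(word)):
            index[word[:i] + word[i+1:]].add(word)   # the |word| omission variants

    for w in ["necessary", "separate", "occurred"]:
        add_word(w)

    print(index["necssary"])   # {'necessary'}: the typed string matches the
                               # variant of "necessary" with one e omitted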

A “reverse” minimum edit distancetechnique was used by Gorin [ 1971] inthe DEC-10 spelling corrector and byDurham et al. [1983] in their command

language corrector. Kernighan et al.[1990] and Church and Gale [ 1991b] alsoused a reverse technique to generate can-didates for their probabilistic spellingcorrector. In reverse techniques, a candi-date set is produced by first generatingevery possible single-error permutationof the misspelled string and then check-ing the dictionary to see if any make upvalid words. This means that given amisspelled string of length n and an al-phabet of size 26, the number of stringsthat must be checked against the dictio-nary is 26( n + 1) insertions plus n dele-tions plus 25n substitutions plus n – 1transpositions, or 53n + 25 strings, as-suming there is only one error in themisspelled string.
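
The reverse technique can be sketched directly from this description; the small lexicon below is invented for illustration.

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def single_error_candidates(s, lexicon):
        """Generate every string reachable from s by one insertion, deletion,
        substitution, or transposition (up to 53n + 25 strings for length n)
        and keep those that are dictionary words."""
        n = len(s)
        variants = set()
        for i in range(n + 1):                       # 26(n + 1) insertions
            for c in ALPHABET:
                variants.add(s[:i] + c + s[i:])
        for i in range(n):                           # n deletions
            variants.add(s[:i] + s[i+1:])
        for i in range(n):                           # 25n substitutions
            for c in ALPHABET:
                if c != s[i]:
                    variants.add(s[:i] + c + s[i+1:])
        for i in range(n - 1):                       # n - 1 transpositions
            variants.add(s[:i] + s[i+1] + s[i] + s[i+2:])
        return variants & lexicon

    lexicon = {"actress", "acres", "access", "across"}
    print(single_error_candidates("acress", lexicon))
    # {'access', 'acres', 'across', 'actress'} (in some order)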

Tries have been explored as an alternative to dynamic-programming techniques for improving search time in minimum edit distance algorithms. Muth and Tharp [1977] devised a heuristic search technique in which a dictionary of "correct spellings are stored character-by-character in a pseudo-binary tree. The search examines a small subset of the database (selected branches of the tree) while checking for insertion, deletion, substitution, and transposition errors"
(p. 285). Dunlavey [1981] described an algorithm he called SPROOF, which stored a dictionary as a trie, i.e., "a huge finite-state machine recognizing all and only the strings in the dictionary" (p. 608). The search proceeded by visiting the nodes of the trie in increasing edit distance until the candidate corrections at the leaves were reached. A similar technique was used by Boivie [1981] in his da directory assistance program, which in turn became the underlying algorithm for Taylor's [1981] grope spelling corrector.

Minimum edit distance techniques have been applied to virtually all spelling correction tasks, including text editing, command language interfaces, natural language interfaces, etc. Their spelling correction accuracy varies with applications and algorithms. Damerau reports a 95% correction rate for single-error misspellings for a test set of 964 misspellings of medium and long words (length 5 or more characters), using a lexicon of 1,593 words. His overall correction rate was 84% when multi-error misspellings were counted. Durham et al. [1983] report an overall 27% correction rate for a very simple, fast, and unobtrusive single-error correction algorithm accessing a keyword lexicon of about 100 entries. As low as this seems, the authors report a high degree of user satisfaction for this command language interface application, due largely to their algorithm's unobtrusiveness. Kukich [1990] evaluated the grope minimum edit distance algorithm, among others, for use in a demanding text-to-speech synthesis application in which 25% of 170 misspellings involved multiple errors and fully 63% of the misspellings involved short words (length 2, 3, or 4 characters). She observed a 62% overall correction rate, using a lexicon of 1,142 words. Other techniques evaluated by Kukich, including the vector distance and neural net techniques described below, outperformed the grope minimum edit distance technique for the same test set and lexicon by more than 10%. Muth and Tharp [1977] report a performance level as high as 97% for their trie-based algorithm on a 1,487-word test set, though they do not specify the size of the lexicon they used or the median length of the words in their test set.

2.2.2 Similarity Key Techniques

The notion behind similarity key techniques is to map every string into a key such that similarly spelled strings will have identical or similar keys. Thus, when a key is computed for a misspelled string, it will provide a pointer to all similarly spelled words (candidates) in the lexicon. Similarity key techniques have a speed advantage because it is not necessary to directly compare the misspelled string to every word in the dictionary.

A very early (1918), often cited similarity key technique, the SOUNDEX system, was patented by Odell and Russell [1918] for use in phonetic spelling correction applications. It has been used in an airline reservation system [Davidson 1962], among other applications. It maps a word or misspelling into a key consisting of its first letter followed by a sequence of digits. Digits are assigned according to the following rules:

A, E, I, O, U, H, W, Y → 0
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6

Zeros are then eliminated and repeated characters are collapsed. So, for example, we can have: Bush → B020 → B2 and Busch → B0220 → B2. However, we could also have: QUAYLE → Q00040 → Q4 while KWAIL → K0004 → K4. It is clear that the SOUNDEX similarity key is a very coarse-grained one. Furthermore, the SOUNDEX system provides no function for ranking candidates. Instead, all candidates are simply presented to the user.
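
A compact implementation of the SOUNDEX key as just described might look as follows; the fixed-length truncation found in some later SOUNDEX variants is not part of the description above and is omitted here.

    SOUNDEX_CODES = {c: d for letters, d in [
        ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
        ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")] for c in letters}

    def soundex(word):
        """First letter, then digit codes with zeros eliminated and
        repeated characters collapsed."""
        word = word.upper()
        key = word[0] + "".join(SOUNDEX_CODES[c] for c in word[1:]
                                if c in SOUNDEX_CODES)
        key = key[0] + key[1:].replace("0", "")   # eliminate zeros
        collapsed = key[0]
        for c in key[1:]:
            if c != collapsed[-1]:                # collapse repeated characters
                collapsed += c
        return collapsed

    print(soundex("Bush"), soundex("Busch"))      # B2 B2
    print(soundex("Quayle"), soundex("Kwail"))    # Q4 K4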

Pollock and Zamora [1984] used the findings of their study of 50,000 spelling errors from seven Chemical Abstracts
Service databases to devise a similarity key technique, called SPEEDCOP, for correcting single-error misspellings. After constructing a key for each word in the lexicon, the lexicon was sorted by key order. A misspelling was corrected by constructing its key and locating it in the sorted key list. Then all nearby words (candidates) within a certain distance in both directions were checked in sequence to find the first word that could have been generated from the misspelling by either an insertion, omission, substitution, or transposition (i.e., the ranking process).

The SPEEDCOP system actually made use of two similarity keys, a skeleton key and an omission key, both of which were carefully tailored by statistical findings to achieve maximum correction power for their application domain, which included few, if any, phonetic misspellings. Both the skeleton key and the omission key consisted of single occurrences of all the letters that appear in a string, arranged in a specific order. In the skeleton key, the order is the first letter of the string followed by the remaining unique consonants in order of occurrence and the unique vowels in order of occurrence. In the omission key, consonants come first, in reverse order of the frequency of occurrence of omitted letters, which, according to Pollock and Zamora's findings, is

RSTNLCHDPGMFBYWVZXQK.

Vowels follow, in order of their occurrence in the string. They also found that spelling errors tend to preserve the order of occurrence of vowels. The omission key for the word chemical is MHCLEIA. The keys for two actual misspellings chemicial and ehemcial are identical, and the key for another misspelling chemcal is very close.
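
The two keys can be computed directly from this description. The sketch below reproduces the MHCLEIA example; how letters missing from the published omission-frequency order (such as J) should be handled is not specified, so the fallback here is an assumption.

    VOWELS = set("AEIOU")
    # Pollock and Zamora's omitted-letter frequency order (most to least omitted).
    OMISSION_ORDER = "RSTNLCHDPGMFBYWVZXQK"

    def unique_in_order(letters):
        seen, out = set(), []
        for c in letters:
            if c not in seen:
                seen.add(c)
                out.append(c)
        return out

    def skeleton_key(word):
        """First letter, then remaining unique consonants in order of
        occurrence, then unique vowels in order of occurrence."""
        word = word.upper()
        cons = [c for c in unique_in_order(word[1:])
                if c not in VOWELS and c != word[0]]
        vows = [c for c in unique_in_order(word)
                if c in VOWELS and c != word[0]]
        return word[0] + "".join(cons) + "".join(vows)

    def omission_key(word):
        """Unique consonants with the most frequently omitted letters last,
        then unique vowels in order of occurrence."""
        word = word.upper()
        uniq = unique_in_order(word)
        cons = sorted((c for c in uniq if c not in VOWELS),
                      key=lambda c: OMISSION_ORDER.find(c), reverse=True)
        vows = [c for c in uniq if c in VOWELS]
        return "".join(cons) + "".join(vows)

    print(skeleton_key("chemical"), omission_key("chemical"))   # CHMLEIA MHCLEIA
    print(omission_key("chemicial"), omission_key("ehemcial"))  # MHCLEIA MHCLEIA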

The SPEEDCOP algorithm consisted of four steps, each of which was applied only if the previous one failed to yield a correction: (1) consult a common misspelling dictionary; (2) apply the skeleton key; (3) apply the omission key; (4) apply a function word routine to check for concatenations of function words. Pollock and Zamora [1984] mention that the last step increased corrections by only 1%–2% but was entirely accurate. They report that the correction accuracy of their technique ranged from 77% to 96% in correcting the single-error misspellings in the seven databases they studied. Single-error misspellings accounted for an average of 94% of all the errors they observed. They report an overall correction rate ranging from 74% to 88%. These tests were based on a lexicon of 40,000 words.

A creative similarity key spelling correction technique was devised by Tenczar and Golden [1972] for the PLATO computer tutoring system. As in the SPEEDCOP technique, a similarity key was constructed for each word in the application lexicon, and the lexicon was sorted by key order. But the PLATO similarity key was based on a set of human cognitive features intuited by the authors, including (1) word length, (2) first character, (3) letter content, (4) letter order, and (5) syllabic pronunciation. Each feature was represented by a bit field in the similarity key, and the authors point out that additional features could easily be included by adding another bit field. Locating the word typed by the student, or its closest approximation, consisted of computing the similarity key for the input word and executing a "binary chop" search through the encoded application lexicon to find an exact or closest match.

Tenczar and Golden [1972, p. 13] tested their technique on a dictionary of common misspellings and found that it corrected 95% of a sample of items taken from the dictionary. And they tested it by checking each of the 500 most commonly used English words against each other and found that it "performed satisfactorily (i.e., pairs which it calls misspellings usually differ by only one letter)."

An algorithm called token reconstruction that computes a fuzzy similarity measure has been patented by Bocast [1991]. Given a misspelled string and a dictionary word, the algorithm returns a
similarity measure that is computed by taking an average of the sum of four empirically weighted indices that measure the longest matching substrings from both ends of the two strings. The technique partitions the dictionary into buckets containing words that have the same starting character c and the same length n. The search space expands from the bucket for starting character c and length n to the buckets for lengths n ± 1, so as to check first for words with small differences in length, and then to buckets for the other characters in the misspelled string. Bocast reports that the algorithm is simple enough to be blindingly fast, and that its 78% first-guess accuracy on a 123-word test set that includes 15 multi-error misspellings exceeds the accuracy of four popular commercial spelling correctors on the same test set.

2.2.3 Rule-Based Techniques

Rule-based techniques are algorithms or heuristic programs that attempt to represent knowledge of common spelling error patterns in the form of rules for transforming misspellings into valid words. The candidate generation process consists of applying all applicable rules to a misspelled string and retaining every valid dictionary word that results. Ranking is frequently done by assigning a numerical score to each candidate based on a predefined estimate of the probability of having made the particular error that the invoked rule corrected.

Yannakoudakis and Fawthrop [1983] devised a knowledge-based spelling correction program based on the set of rules they inferred from their study of 1,377 misspellings described above. Their goal was a general spelling correction system, so they tested their program on a set of 1,534 misspellings using a dictionary of 93,769 words. Because some of their rules incorporated knowledge of the probable length of the correct word based on the misspelling, their technique employed a dictionary that was partitioned into many subsets according to word length as well as first letter. The candidate generation process consisted of searching specific dictionary partitions for words that differ from the misspelling by only one or two errors and that follow any of the rules. When multiple candidates were found, they were ranked by predefined estimates of the probabilities of occurrence of the rules. They found that the correct word was in the partition searched for 1,153 cases, or 75% of the time. For those 1,153 cases, the correct word was returned as the first choice 90% of the time, yielding an overall correction accuracy of 68% (90% of 75%).

In another project, a specialized rule-based spelling correction system was designed by Means [1988] for a natural language data entry system with a high incidence of abbreviation, acronym, and jargon usage. This system first checks a set of morphology rules for common inflectional violations (such as failure to double a final consonant before adding -ing). It next consults a set of abbreviation expansion rules to determine if the misspelled string might expand to a lexicon term. Failing that, it tries all single-error transformations of the misspelled string, including the possibility of an inserted blank (i.e., a split-word error).

2.2.4 N-gram-Based Techniques

Letter n-grams, including trigrams, bigrams, and/or unigrams, have been used in a variety of ways in text recognition and spelling correction techniques. They have been used in OCR correctors to capture the lexical syntax of a dictionary and to suggest legal corrections. They have been used in spelling correctors as access keys into a dictionary for locating candidate corrections and as lexical features for computing similarity measures. They have also been used to represent words and misspellings as vectors of lexical features to which traditional and novel vector distance measures can be applied to locate and rank candidate corrections. Some n-gram-based correction techniques execute the processes of error detection, candidate retrieval, and similarity ranking in three separate steps.
Other techniques collapse the three steps into one operation.

Riseman and Hanson [1974] provide a clear explanation of the traditional use of n-grams in OCR correction. After partitioning a dictionary into subdictionaries by word length, they construct positional binary n-gram matrices for each subdictionary. They note that because these matrices provide answers to the question "is there some word in the dictionary that has letters α and β in positions i and j respectively," the matrices capture the syntax of the dictionary. An OCR output string can be checked for errors by simply checking that all its n-grams have value 1. If a string has a single error, it will have a 0 value in at least one binary n-gram; if more than one 0 value is found, the position of the error is indicated by a matrix index that is common to the 0-value n-grams. Correction candidates can be found by logically intersecting the rows or columns specified by the shared index of the incorrect n-grams. If the intersection results in only one n-gram with value 1, the error can be corrected; if more than one potential n-gram correction is found, the word is rejected as ambiguous.

Riseman and Hanson [1974] tested the technique on a test set of 2,755 6-letter words. They found that positional trigram matrices, which outperformed other positional and nonpositional n-gram matrices, were able to detect 98.6% of the errors in their test set and correct 62.4% of them. This technique has at least one advantage over dictionary lookup correction techniques in that it obviates the need for an exhaustive dictionary search. A minor drawback is that it is possible for a correction to result in a nonword on rare occasions. A more serious drawback is that the technique is designed to handle only substitution errors, and it is not clear how well it would handle other errors, such as insertion, deletion, transposition, and framing errors.
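
A small sketch of positional binary trigram detection follows. The description above does not spell out which position triples the matrices cover, so the sketch simply indexes every triple of positions for one fixed-length subdictionary; the word list is invented.

    from itertools import combinations

    def build_positional_trigrams(words):
        """table[(i, j, k)] holds the letter triples occurring at positions
        i, j, k in some word of this (equal-length) subdictionary; membership
        plays the role of the binary matrix entries."""
        length = len(next(iter(words)))
        return {(i, j, k): {(w[i], w[j], w[k]) for w in words}
                for i, j, k in combinations(range(length), 3)}

    def detect_errors(word, table):
        """Return the position triples whose trigram is absent (value 0);
        a position common to all of them points at the error."""
        return [pos for pos, entries in table.items()
                if (word[pos[0]], word[pos[1]], word[pos[2]]) not in entries]

    words = {"garden", "golden", "gotten", "guides"}
    table = build_positional_trigrams(words)
    print(detect_errors("golden", table))        # []  (no violations)
    print(len(detect_errors("goldxn", table)))   # 10: every triple containing
                                                 # position 4, where the error lies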

Two hardware-based techniques have been proposed that would exploit n-gram databases in parallel. Ullmann [1977] proposed a technique for finding all valid words differing from an input string by up to two insertion, deletion, substitution, and reversal errors by processing binary n-grams in parallel. Henseler et al. [1987] implemented a parallel OCR recognition and correction algorithm on a 16-processor hypercube machine. Their technique used a database of trigram frequencies as opposed to binary trigrams. A test of their technique on a small sample of text yielded 95% initial character recognition accuracy and 100% accuracy after correction.

Angell et al. [1983] have written a seminal paper on the use of trigrams for spelling correction applications. Their technique computes a similarity measure based on the number of (nonpositional binary) trigrams common to a misspelled string and a dictionary word. The similarity measure is the simple function 2(c/(n + n')), where c is the number of trigrams common to both the dictionary word and the misspelling and n and n' are the lengths of the two strings. They refer to this function as "the well-known Dice coefficient." They also suggest creating an inverted index into the dictionary using trigrams as access keys. The trigrams of the misspelled string can be used to retrieve only those dictionary words having at least one trigram in common with the misspelling, and the similarity function needs to be computed only for that subset.

This technique achieved an overall accuracy score of 76% on a test set of 1,544 misspellings using a dictionary of 64,636 words. The authors analyzed its performance by error type and found that it worked "very well indeed for omission and insertion errors, adequately for substitution errors, but very poorly for transposition errors" (p. 259). Although they did not break down the technique's performance by word length, they did note that it cannot be expected to do well on short words because a single error can leave no valid trigrams intact. The mean length of the misspelled strings in their test set was 8.4 characters.

A problem with the Dice coefficient was that it tended to make errors when the
misspelled string was wholly contained in a valid word, or vice versa. For example, given the misspelled string concider, it assigned values of 0.71 and 0.70 respectively to the words cider and consider. Angell et al. noted that the tendency for misspellings to differ in length from their intended correction by at most one character could be exploited by changing the similarity function to (c/max(n, n')). They found that this modified similarity function corrected the containment error problem without affecting the correction rate for multi-error misspellings.
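
Both similarity functions are easy to state in code. The sketch below counts n and n' as the number of distinct trigrams in each string, which is only one plausible reading of the description; the exact values therefore differ from the 0.71 and 0.70 quoted above, but the containment tendency of the Dice coefficient is still visible.

    def trigrams(s):
        return {s[i:i+3] for i in range(len(s) - 2)}

    def dice(s, t):
        """Angell et al.'s 2(c/(n + n')) with c = shared trigrams."""
        a, b = trigrams(s), trigrams(t)
        return 2 * len(a & b) / (len(a) + len(b))

    def modified(s, t):
        """The modified function c/max(n, n')."""
        a, b = trigrams(s), trigrams(t)
        return len(a & b) / max(len(a), len(b))

    for w in ("consider", "cider"):
        print(w, round(dice("concider", w), 2), round(modified("concider", w), 2))
    # consider 0.5 0.5
    # cider 0.67 0.5
    # Dice ranks the contained word cider above consider; the modified
    # measure removes that advantage.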

Similar trigram correction techniques have been suggested or devised by Kohonen [1980] and DeHeer [1982] for text editing and information retrieval applications. A related technique, triphone analysis, was devised by Van Berkel and DeSmedt [1988] specifically for a natural language interface application involving the use of proper nouns that tend to be phonetically misspelled. It first applies a set of grapheme-to-phoneme conversion rules to a string and then makes use of an inverted-triphone index to retrieve other candidate corrections. It was highly successful in a test experiment, correcting 94% of a test set of deviant spellings of 123 Dutch surnames. Its lexicon was a database of 254 Dutch surnames randomly culled from a telephone directory. Van Berkel and DeSmedt compared their phonetic-based technique to some well-known nonphonetic techniques and found that the phonetic technique outperformed the others by anywhere from 4 to 40 percentage points on this test set.

In addition to the trigram-based similarity functions suggested by Angell et al. [1983], a variety of traditional and novel vector distance measures based on representations of words as n-gram vectors has been tested. Most of these techniques collapse all three subprocesses (detection, candidate generation, and ranking) into one process. The basic ideas are: (1) to position each lexicon entry at some point in an n-dimensional lexical-feature space and then (2) to project a misspelled string into that same space and measure its proximity to its nearest neighbors (candidates). Unigrams, bigrams, and trigrams are three possible candidates for the features of the lexical space; words and misspellings can be represented as sparse vectors in such spaces. Hamming distances, dot products, and cosine distances are three possible measures of the proximity of any two vectors in lexical space.

Kukich [1992] evaluated the correction accuracy of these three vector distance metrics using the same test set and lexicon as those used to test the grope minimum edit distance technique (which scored 62% accuracy); the test set contained 170 single and multi-error misspellings, of which nearly two-thirds occurred in short words (less than 5 characters), and the lexicon contained 1,142 words. Using 790-element vectors of unigrams and bigrams to represent words and misspellings, she recorded correction rates of 54% for a dot product metric, 68% for a Hamming distance metric, and 75% for a cosine distance metric.
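
A sketch of the vector space idea with unigram and bigram features and a cosine measure follows. The feature inventory and the tiny lexicon are assumptions for illustration; the 790-element vectors used in the study evidently covered a somewhat larger character set than the 26 letters assumed here.

    import math
    from itertools import product

    LETTERS = "abcdefghijklmnopqrstuvwxyz"
    FEATURES = list(LETTERS) + ["".join(p) for p in product(LETTERS, repeat=2)]
    INDEX = {f: i for i, f in enumerate(FEATURES)}   # 26 + 676 = 702 features

    def vectorize(word):
        """Represent a string as a (sparse) vector of unigram and bigram counts."""
        v = [0.0] * len(FEATURES)
        for ch in word:
            v[INDEX[ch]] += 1
        for i in range(len(word) - 1):
            v[INDEX[word[i:i+2]]] += 1
        return v

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    lexicon = ["spell", "spill", "spoke", "speak"]
    vectors = {w: vectorize(w) for w in lexicon}
    miss = vectorize("spel")
    print(max(lexicon, key=lambda w: cosine(miss, vectors[w])))   # spell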

Correlation matrix memory (CMM) techniques are closely related to vector space techniques. They represent a lexicon of m words as an n × m matrix of n-dimensional lexical-feature vectors. The correction process consists of multiplying the n-dimensional feature vector representing the misspelled string by the n × m-dimensional matrix representing the entire lexicon, yielding an m-dimensional vector in which the ith element represents the ith lexicon entry. The element with the highest value is taken to be the most strongly correlated entry, thus the correct word.

Cherkassky et al. [1992] have done detailed studies comparing the use of bigram and trigram feature vectors in CMM models. They created two test sets of randomly generated single-error misspellings, one for medium-length words (5–7 characters) and one for long words (10–12 characters), and tested them against lexicons ranging in size from 500 to 11,000 words. In particular, they observed excellent correction rates (above 90%) for long words using trigram
feature vectors and fair-to-good correction rates (averaging from 60% to 85% depending on lexicon size) for medium-length words using bigram feature vectors. An even more detailed study by Dahl and Cherkassky [1990] compared the use of unigrams, bigrams, and trigrams on words of varying lengths for each of the four classes of single-error typos. They found that "[b]igram and trigram encodings give better recall rates for all kinds of errors except transpositions for all word lengths" and that a unigram encoding performs much better for transposition errors. Indeed, a lexical study by Deloche and Debili [1980] found that in both French and English, 96% of the anagrams created from lexicons of over 20,000 uninflected word forms had unique solutions.

Some attempts have been made to transform lexical-feature spaces by known mathematical techniques in an effort to better reflect the similarity relations among lexical entries, and thus improve correction accuracy. One such technique, the computation of the Generalized Inverse (GI) matrix, was explored by Cherkassky et al. [1990]. The goal of the GI technique, which is based on finding a minimum least squares error inverse of the lexicon matrix, is to minimize the "crosstalk" or interference that occurs among similar lexicon entries. Later, however, Cherkassky et al. [1991] derived a proof, based on a theorem of linear algebra, that a GI matrix saturates when the number of lexicon entries approaches the dimensionality of the feature space, and a significant decline in correction accuracy follows. This led them to conclude that simple CMM models are more effective and efficient for spelling correction.

Another technique for transforming a feature space, Singular Value Decomposition (SVD), was explored by Kukich [1990]. SVD can be used to decompose a lexicon matrix into a product of three matrices: one representing each of the individual-letter n-grams as a vector of factors; a second, diagonal matrix consisting of a set of singular values that are used to weight each of the factors; and a third matrix representing each of the lexicon entries as a vector of factors. The goal of the decomposition process is to identify and rank the most important factors that determine the relevant similarity relations in the lexical space. The least important factors, which may in fact represent noise in the data, can be discarded, yielding three matrices of reduced rank which better capture the essential similarity relations among lexical entries. SVD has been used successfully to this end in information retrieval experiments [Deerwester et al. 1990]. In those experiments, bibliographic databases consisting of thousands of documents and lexicon entries are represented by sparse term-by-document matrices. The technique has been dubbed Latent Semantic Indexing because the SVD process appears to elicit the inherent semantic relations among terms and documents.

In contrast to the information retrieval application, a spelling correction matrix in which words are represented by bigram and unigram vectors is far less sparse. Once such a matrix has been decomposed into three factor matrices and reduced, a misspelling is corrected by first summing the vectors for each of the individual-letter n-grams in the misspelled string (as represented in the first matrix) and then multiplying the sum vector by the singular-value matrix of weights (represented in the second matrix). The resultant vector determines the location of the misspelled word in the n-dimensional lexical-feature space. Any standard distance measure (such as a dot product or a cosine distance) can then be used to measure the distances between the vector for the misspelled word and the vectors for each of the correctly spelled words (represented by the third matrix) in order to locate and rank the nearest correctly spelled words. In Kukich's studies, SVD matrices yielded improved spelling correction accuracy over undecomposed matrices for a small lexicon (from 76% to 81% for a lexicon of 521 words), but no significant improvement
for larger lexicons. These empirical findings are consistent with the theoretical findings of Cherkassky et al. [1991] regarding the saturation point of Generalized Inverse matrices.
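
A rough numpy sketch of the SVD approach, following the description above, is given below. The unigram/bigram feature extraction, the toy lexicon, and the retained rank k are all assumptions made for illustration, not the parameters of the original study.

    import numpy as np

    def features(word):
        return list(word) + [word[i:i+2] for i in range(len(word) - 1)]

    def build_matrix(words):
        """n-gram-by-word count matrix A (rows = features, columns = words)."""
        feats = sorted({f for w in words for f in features(w)})
        index = {f: i for i, f in enumerate(feats)}
        A = np.zeros((len(feats), len(words)))
        for j, w in enumerate(words):
            for f in features(w):
                A[index[f], j] += 1
        return A, index

    words = ["chemical", "chemistry", "physical", "physics", "musical"]
    A, index = build_matrix(words)

    # Decompose and keep the k largest factors: U holds n-gram factor vectors,
    # S the singular-value weights, W the word factor vectors.
    k = 3
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Wk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    # Project a misspelling: sum its n-gram factor vectors, weight the sum by
    # the singular values, then rank words by cosine similarity to the result.
    miss = "chemcial"
    q = sum(Uk[index[f]] for f in features(miss) if f in index) @ Sk
    sims = Wk @ q / (np.linalg.norm(Wk, axis=1) * np.linalg.norm(q))
    print([words[i] for i in np.argsort(-sims)])   # lexicon ranked by similarity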

Bickel [1987] devised a hybrid unigram and heuristically based vector distance correction technique for a database retrieval application that used employee names as access keys. The lexicon for this application consisted of nearly one thousand personal names. Bickel's technique represented each valid name as a unigram vector. The value of each unigram vector element was set to 0 if that letter did not occur in the name or to an integer between 3 and 9 if the letter did occur in the name. Integer values for each letter of the alphabet were predetermined according to their frequency of occurrence in the lexicon of names, with least frequent letters receiving the highest weights. The assumption was that less frequently occurring letters were more valuable as search information. Potentially misspelled input names were represented by binary unigram vectors. The correction technique assumed the first letter of the name was correct and confined its search to that portion of the lexicon. It derived a similarity measure for the input name and each valid name in the sublexicon by computing the inner product of the two vectors. Bickel found this similarity measure to be more than 95% successful in locating the intended name given another spelling.

2.2.5 Probabilistic Techniques

N-gram-based techniques led naturally into probabilistic techniques in both the text recognition and spelling correction paradigms. Two types of probabilities have been exploited: transition probabilities and confusion probabilities. Transition probabilities represent probabilities that a given letter (or letter sequence) will be followed by another given letter. Transition probabilities are language dependent. They are sometimes referred to as Markov probabilities based on the assumption that language is a Markov source. They can be estimated by collecting n-gram frequency statistics on a large corpus of text from the domain of discourse.

Confusion probabilities are estimates of how often a given letter is mistaken or substituted for another given letter. Confusion probabilities are source dependent. Because different OCR devices use different techniques and features to recognize characters, each device will have a unique confusion probability distribution. For that reason, confusion probabilities are sometimes referred to as channel characteristics. The same OCR device may generate different confusion probability distributions for different font types and point sizes.

A confusion probability model for a specific device can be estimated by feeding the device a sample of text and tabulating error statistics. This process is sometimes referred to as a "training" phase when the results are used in the development of an error-correcting postprocessor. Alternatively, a device can generate a 26-element vector containing a likelihood estimate for each letter of the alphabet at the time a character is recognized. In that case the likelihood estimates may be interpreted as confusion probabilities.

Confusion probabilities based on human errors are simply called error probabilities. They, too, have been estimated by extracting error statistics from large samples of human-generated text containing typos or other spelling errors. One other class of probabilistic information that has been exploited for isolated-word correction techniques is word frequency data, or word unigram probabilities.

Text recognition researchers have intensively explored the use of probabilistic information; spelling correction researchers have only recently joined the fray. Text recognition research has shown that these probabilistic sources alone are insufficient to achieve acceptable error correction rates; however, combining probabilistic information with dictionary techniques results in significantly better error correction ability. A brief history of
probabilistic error correction research follows.

Bledsoe and Browning [1959] pioneered the use of likelihood estimates in text recognition techniques. The Bledsoe-Browning technique entails two phases: (1) an individual-character recognition phase in which a 26-element vector of likelihood estimates is generated for each character in an input word, followed by (2) a whole-word recognition phase in which a dictionary is used to help choose the individual letters whose likelihoods jointly maximize the probability of producing a valid word from the dictionary. In effect, whole-word recognition enhances individual-character recognition by imposing dictionary constraints that help resolve uncertainty and correct errors made at the individual-character recognition level.

The Bledsoe-Browning technique uses Bayes' rule to compute an a posteriori probability for each word in the dictionary from the a priori likelihood probabilities for the individual characters. Letting X represent a dictionary word and Y represent an OCR output string, Bayes' rule states that

P(X|Y) = P(Y|X)P(X) / P(Y)    (1)

where P(X|Y) is the conditional probability that X is the correct word, P(Y|X) is the conditional probability of observing Y when X is the correct word, and P(X) and P(Y) are the independent probabilities of words X and Y. Finding the most probable dictionary word X given the OCR output string Y amounts to maximizing the function

G(Y|X) = log P(Y|X) + log P(X)    (2)

where P(X) is often taken to be the unigram probability of the word X. The a posteriori probability for each dictionary word can be readily computed from the a priori likelihood estimates for individual characters by the equation

log P(Y|X) = Σ (i = 1 to n) log P(Yi|Xi)    (3)

where n is the length of the word and i indexes the individual characters in the word. For example, given an OCR output string DOQ and a dictionary word DOG,

G(DOQ|DOG) = log P(D|D) + log P(O|O) + log P(Q|G) + log P(DOG).    (4)

In the Bledsoe-Browning technique, the likelihood estimates for each character in the output string are supplied by the OCR device; the unigram word frequency is ignored; and the word with the maximum probability is chosen.
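
Equations (2)-(4) translate directly into code. In the sketch below the likelihood vectors and unigram word frequencies are invented purely for illustration; in the Bledsoe-Browning setup the likelihoods would come from the OCR device and the word frequency term would be dropped.

    import math

    likelihoods = [                           # one likelihood vector per character
        {"D": 0.90, "O": 0.05, "B": 0.05},    # of the OCR output string "DOQ"
        {"O": 0.85, "Q": 0.10, "C": 0.05},
        {"Q": 0.50, "G": 0.35, "O": 0.15},
    ]
    dictionary = {"DOG": 0.7, "DOT": 0.2, "BOG": 0.1}   # word unigram probabilities

    def score(word):
        """G(Y|X) = sum_i log P(Yi|Xi) + log P(X), as in equations (2)-(3)."""
        g = sum(math.log(likelihoods[i].get(c, 1e-9)) for i, c in enumerate(word))
        return g + math.log(dictionary[word])

    print(max(dictionary, key=score))   # DOG: the dictionary constraint overrides
                                        # the low likelihood of G in position 3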

The computational demands of the Bledsoe-Browning technique are linear with the size of the dictionary. Hence, the processing of large dictionaries is infeasible, even if they are partitioned by word length. In an application for a handwriting recognizer, Burr [1983] extended the Bledsoe-Browning technique to incorporate morphological processing. This allowed him to use a smaller dictionary of root forms of words along with prefixes and suffixes.

Kahan et al. [1987] have also combined confusion probabilities with dictionary lookup in a postprocessor for OCR output. When an output word is rejected by a spelling checker, they generate successive candidates, based on confusion probabilities obtained through training, until an acceptable dictionary word is found or some threshold is exceeded. They report a 97% character recognition rate ignoring indistinguishable pairs such as 0/O (zero/oh) and 1/l (one/el).

Techniques have been tried that make use of confusion and transition probabilities without the benefit of a dictionary. Hanson et al. [1976] report on experiments using likelihood probabilities in combination with either trinary transition probabilities or positional binary trigrams. They found that using trinary transition probabilities to directly weight the likelihoods of individual characters
in OCR output words reduced the word error rate on a test set of 2,755 6-letter words, but only from 49% to 29%. They concluded that efficient techniques based on transition probabilities would not be effective. They went on to find that positional binary trigrams alone corrected over 50% of the word errors in their test set, and that additional processing using likelihood probabilities further reduced the error rate by an order of magnitude.

Both Toussaint [1978] and Hull and Srihari [1982] provide general overviews of error correction techniques based exclusively on transition and confusion probabilities. Bayes' formula can be used to combine these two kinds of probabilities. Instead of using a word unigram probability for the final term in the formula, the sum of the transition probabilities in the dictionary word can be substituted. But a more efficient and widely used method for combining transition and confusion probabilities is a dynamic-programming technique called the Viterbi algorithm [Forney 1973]. In the Viterbi algorithm, a directed graph, sometimes referred to as a trellis, is used to capture both the structure of a lexicon (transition probabilities) and the channel characteristics of a device (confusion probabilities). A starting node and an ending node represent word boundary markers; intermediate nodes represent likelihood estimates for individual letters; and edges represent transition probabilities between letters. The graph is efficiently traversed via a dynamic-programming algorithm to find the sequence of letters having the highest probability given the likelihood estimates output by the OCR device and the transition probabilities of the language.
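
A minimal Viterbi trellis over letter positions can be sketched as follows. The likelihood and transition tables are invented for illustration, and '#' stands for the word boundary markers mentioned above; a real system would estimate the transition probabilities from a corpus or a lexicon.

    import math

    def viterbi(likelihoods, transitions, alphabet="abcdefghijklmnopqrstuvwxyz"):
        """Return the letter sequence maximizing channel likelihoods times
        language transition probabilities (log-space dynamic programming)."""
        best = {"#": (0.0, "")}                   # state -> (log prob, string so far)
        for like in likelihoods:                  # one likelihood vector per position
            new_best = {}
            for c in alphabet:
                new_best[c] = max(
                    ((p + math.log(transitions.get((prev, c), 1e-9))
                      + math.log(like.get(c, 1e-9)), path + c)
                     for prev, (p, path) in best.items()),
                    key=lambda t: t[0])
            best = new_best
        _, path = max(((p + math.log(transitions.get((c, "#"), 1e-9)), path)
                       for c, (p, path) in best.items()), key=lambda t: t[0])
        return path

    likelihoods = [{"c": 0.9, "e": 0.1}, {"a": 0.6, "o": 0.4}, {"t": 0.8, "l": 0.2}]
    transitions = {("#", "c"): 0.5, ("#", "e"): 0.5, ("c", "a"): 0.6, ("c", "o"): 0.4,
                   ("e", "a"): 0.3, ("e", "o"): 0.7, ("a", "t"): 0.7, ("a", "l"): 0.3,
                   ("o", "t"): 0.5, ("o", "l"): 0.5, ("t", "#"): 1.0, ("l", "#"): 1.0}
    print(viterbi(likelihoods, transitions))   # cat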

Various modifications to the original Viterbi algorithm, e.g., Shinghal and Toussaint [1979a], that limit the depth of a search based on likelihoods or other factors have been developed. A disadvantage of all these techniques is that the highest-probability string is not always a valid word. More importantly, Toussaint, Hull, Srihari, and others have concluded that the correction accuracy of techniques based solely on these two probabilistic sources is only mediocre.

One other technique worth noting for exploiting these two probabilities is a relaxation technique described by Goshtasby and Ehrich [1988]. It calls for setting the initial character probabilities for an OCR output word according to the likelihood estimates for each character. Rosenfeld's relaxation formula [Rosenfeld et al. 1976] is applied to iteratively adjust the likelihood probabilities for each character as a function of the transition probabilities of adjacent characters. The number of confusion candidates for each character can be limited to a small set that exceeds some threshold, thus making the relaxation process itself computationally quite affordable, on the order of 10n^2 for a word of length n. As with the Viterbi algorithm, the relaxation process can converge on a nonword. For example, during one test it converted the correct word biomass to biomess. The authors point out that the process can be improved by incorporating additional knowledge. They suggest using syllable digram information to increase the probabilities of the candidates that form compatible syllables at each iteration. They also note that the relaxation process could be integrated with a dictionary-based technique.

Probabilistic information has been used effectively as a preprocessor for a spelling correction application. Oshika et al. [1988] implemented a Hidden Markov Model (HMM) procedure for preclassification of surnames according to ethnic background. HMMs were automatically constructed for each of five different languages based on transition probabilities of letter sequences in 2,000 sample surnames from each language. The HMMs were then used to automatically classify input names before generating language-specific spelling variants of the names. Oshika et al. report that their preprocessing technique improved retrieval accuracy from 69% to 88% on a test of 160 query names.

Better error correction performance has been achieved by techniques that
combine the use of probabilistic information (bottom-up information) with dictionaries (top-down information). As Shinghal and Toussaint [1979b] note, "Dictionary methods have low error-rates but suffer from large storage demands and high computational complexity. Markov methods have the inverse characteristics" (p. 201). So they invented a technique they call the predictor-corrector method that "achieves the error-correcting capability of dictionary look-up methods at half the cost." The technique calls for sorting the dictionary according to a precomputed value V for each word that is based on the word's transition probabilities. The predictor-corrector algorithm first uses a modified Viterbi algorithm to recognize an input word. Then it computes V for the recognized word and does a binary search of the dictionary for the word with a matching value. If that word is not an exact match, it calculates V for words in the neighborhood and chooses the one with the highest score. In an experiment with a dictionary of 11,603 words and two test sets of 2,521 and 2,256 words, the technique achieved character recognition rates of 96.6% and 96.4%. Unfortunately, Shinghal and Toussaint [1979b] note that the technique worked best when the neighborhood included the whole dictionary, so they suggest the need for a better method for organizing the dictionary. Even so, they note that the technique still cut dictionary-processing time in half because the modified Viterbi algorithm produced the correct word more than half the time.

Srihari et al. [1983] implemented a technique in which the dictionary is represented as a trie and integrated into the Viterbi algorithm search "so as to retain VA [Viterbi algorithm] efficiency while guaranteeing the most likely dictionary word as output" (p. 72). They tested their technique on a sample text of 6,372 words that yielded a dictionary of 1,724 words. They found that their Viterbi algorithm alone was able to correct 35% of the errors made by the OCR device; their dictionary lookup algorithm alone corrected 82%; and their combination Viterbi-dictionary algorithm corrected 87%.

Sinha and Prasada [1988] developed a technique that not only integrated probabilistic and dictionary information but also incorporated a variety of heuristics. Their objective was to tackle the problem that in practice one cannot assume that a dictionary will contain all the words found in a document. So they compiled a "partial dictionary" by starting with a list of the 10,000 most frequent words from the Brown Corpus [Kucera and Francis 1967] and augmenting it "with all the valid words obtained by single character substitutions" (p. 465). Their algorithm takes two passes. It first corrects only those errors for which confusion probabilities produce a word found in the partial dictionary. Then it uses a modified Viterbi algorithm to estimate the most likely correction for words not in the dictionary. The dictionary is searched as a trie, and the system uses various heuristics to rank confusion candidates, limit the search through the trie, and limit the branches of the Viterbi net searched. They tested their technique on a test set of 5,000 words (25,000 characters) of text from typewritten and typeset documents in seven different font types and point sizes. Its overall performance exceeded 98% character recognition accuracy, and its word recognition accuracy was around 97% not counting punctuation errors. They note that the "augmented dictionary approach yields better performance as compared to the unaugmented version in the case of texts that are not very conventional such as technical and literary documents. . . . However, performance [on conventional texts] deteriorates for low performance [ < 90%] OCR [devices] with the augmented dictionary . . . [because] a comparatively larger number of [correction candidates] get generated and many of the high frequency words . . . get mapped to less frequent words in the augmented dictionary" (p. 474).

Jones et al. [1991] have implemented an OCR postprocessor that "integrates a Bayesian treatment of predictions made by a variety of knowledge sources" (p. 926). In a training phase they build statistical models for both the OCR device and the documents to be processed. Their models include letter unigram and digram frequencies, binary letter trigrams and initial-letter n-grams, word unigram frequencies, and some word digram frequencies. They include letter and word digrams involving punctuation and capitalization. They adjust their statistics to reflect undertraining so that, for example, undertrained word digram frequencies do not unduly influence results. Their confusion probability models include non-one-to-one mappings (i.e., mappings in which a single character corresponds to a pair of characters, or vice versa).

The corrector operates in three phases: (1) it generates a ranked list of candidates for each word of input based on confusion probabilities and a trie-based dictionary; (2) it merges words to undo apparent word-splitting errors; and (3) it reranks candidates based on word digram probabilities. Phases 2 and 3 feed back proposed analyses into previous phases.

Jones et al. [1991] tested their system in two experiments. In the first experiment, they trained the system on six software-testing documents totaling 8,170 words and tested on a seventh document containing 2,229 words. The raw OCR word error rate was 16.2%. Their system corrected 89.4% of those errors. In the second experiment, they trained the system on 33 pages of AP news wire text and tested it on 12 additional pages. The raw OCR word error rate on the 12 pages varied from 6.8% to 10.3%, and their system corrected 72.5% of those errors. This system is significant in that it is one of the first to incorporate extra-word contextual knowledge, in the form of word and punctuation digram probabilities, into the correction process.

Recently, techniques that make explicit use of error probabilities have begun to be explored for spelling correction applications. Kashyap and Oommen [1984] were motivated to explore using error probabilities in response to the poor performance of traditional spelling correction techniques on short words (less than six characters). Observing that misspellings frequently involve the mistyping of an adjacent key on the typewriter keyboard, they compiled a table of subjective estimates of the probabilities of such substitution errors. They also ascribed subjective estimates to the probabilities of a single deletion or insertion of any character at various positions within a word. Assuming that every spelling error can be reduced to a combination of such deletions, insertions, and substitutions, they devised a recursive algorithm that used their estimates for calculating the probability that a given misspelled string is a garbled version of a dictionary word. When applied to each word in a dictionary, this algorithm computes a probability ranking over the dictionary of correct words for a misspelled input string.
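The core computation can be sketched as a probabilistic edit-distance recursion over deletion, insertion, and substitution events; the per-operation probabilities below are illustrative stand-ins for Kashyap and Oommen's subjective estimates, and the dynamic-programming formulation is one simple way to realize such a recursion, not their exact algorithm.

```python
# Rank dictionary words by the probability that the observed string is a
# garbled version of each word, given per-operation error probabilities.

P_SUB, P_DEL, P_INS, P_MATCH = 0.01, 0.005, 0.005, 0.97   # toy estimates

def garble_prob(word, observed):
    n, m = len(word), len(observed)
    # dp[i][j]: probability of producing observed[:j] from word[:i]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:
                dp[i + 1][j] += dp[i][j] * P_DEL                  # drop word[i]
            if j < m:
                dp[i][j + 1] += dp[i][j] * P_INS                  # spurious character
            if i < n and j < m:
                p = P_MATCH if word[i] == observed[j] else P_SUB
                dp[i + 1][j + 1] += dp[i][j] * p                  # copy or substitute
    return dp[n][m]

def rank(observed, dictionary):
    return sorted(dictionary, key=lambda w: garble_prob(w, observed), reverse=True)

print(rank("wekk", ["week", "well", "weak", "walk"]))
```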

Kashyap and Oommen [1981] tested their technique on artificially generated misspellings in words of length 3, 4, and 5 characters. Misspellings were generated according to their error probability tables with an average of between 1.65 and 1.90 errors per word. They used a dictionary of the 1,025 most frequent words. They report correction accuracies ranging from 30% to 92% depending on word length and number of errors per word. They report also that these figures compare favorably with those reported by Okuda et al. [1976], which ranged from 28% to 64% for words of similar length and error characteristics. Okuda et al. used a Damerau-Levenshtein minimum edit distance correction technique.

Kernighan et al. [1990] and Church and Gale [1991a] devised an algorithm for correcting single-error misspellings. They used a database of genuine errors extracted from a 44 million-word corpus of AP news wire text to compile four confusion matrices, one for each of the four standard classes of errors (insertions, deletions, substitutions, and transpositions). They then estimated confusion probabilities from these statistics. Their program, called correct, used a reverse minimum edit distance technique to generate a set of single-error candidates and identify a specific error within each candidate, followed by a Bayesian computation to rank the candidates.

Correct starts with a misspelled string detected by spell. First, it generates a set of candidate correct spellings by systematically testing all possible single-error typographic permutations of the misspelled string, checking to see if the result is a valid word in the spell dictionary. A precomputed deletion table and hashing are used to minimize dictionary accesses during candidate generation. Candidates are then ranked by multiplying the a priori probability of occurrence of the candidate word (i.e., its normalized frequency) by the probability of occurrence of the specific error by which the candidate word differs from the misspelled string.
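A minimal sketch of the candidate-ranking step, assuming illustrative word frequencies and error probabilities; the values shown are not Kernighan et al.'s actual estimates, and candidate generation is taken as given.

```python
# Score each single-error candidate by Pr(word) * Pr(error | word -> misspelling).
# Frequencies and confusion-derived error probabilities below are toy numbers.

word_freq = {"across": 0.00012, "actress": 0.00003}
error_prob = {("across", "acress"): 0.000020,    # e.g., deletion of 'o'
              ("actress", "acress"): 0.000117}   # e.g., deletion of 't'

def rank_candidates(misspelling, candidates):
    scored = [(word_freq.get(w, 1e-9) * error_prob.get((w, misspelling), 1e-9), w)
              for w in candidates]
    return sorted(scored, reverse=True)

print(rank_candidates("acress", ["across", "actress"]))
```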

Correct was evaluated by testing it on a set of 332 misspellings from a subsequent month of AP news wire text. Each of these misspellings had exactly two candidate correct spellings, and both correct and three human judges were asked to choose the best candidate. The human judges were allowed to pick "neither," and they were given a surrounding sentence of context. Correct agreed with at least two of the three judges 87% of the time.

Kernighan went on to build a version of correct into a text-to-speech synthesizer for TDD conversations [Kernighan 1991]. Before pronouncing a word, this system's pronunciation component [Coker et al. 1990] computed a pronunciation difficulty index to avoid unnecessarily correcting misspelled words whose pronunciation would be acceptable, such as operater and wud. Note that the latter might be incorrectly changed to wed when would was intended. A misspelling was sent to correct only when the pronunciation difficulty index exceeded a threshold. In an experiment with human listeners, Kernighan's system significantly improved the comprehensibility of synthesized text. Kernighan and Gale [1991] also tested some variations of the technique on a spelling corrector for Spanish.

In a related study, Troy [1990] investigated the effects of combining probabilistic information with a vector distance technique. She used a cosine distance metric, word unigram probabilities, and Kernighan-Church-Gale (KCG) [1990] error probabilities on Kukich's [1990] 170-word test set and 1,142-word lexicon. The cosine distance metric was used to generate a candidate correction set; a dynamic-programming technique was used to identify specific errors within the candidates; and either word unigram probabilities or the KCG error probabilities were used to rerank candidates. Since the cosine distance metric could handle single- and multi-error misspellings, candidates could have multiple errors. Troy observed that the word frequency information did not improve the correction accuracy of the cosine distance metric, but error probability information did improve the cosine distance metric accuracy by three percentage points (from 75% to 78%).
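A sketch of the n-gram cosine similarity used to generate the candidate set follows; the bigram padding, the toy lexicon, and the candidate-set size are illustrative assumptions rather than Troy's exact representation.

```python
import math

# Words are represented as letter-bigram count vectors and compared by cosine
# similarity; the most similar lexicon entries become correction candidates.

def bigram_vector(word):
    padded = "#" + word + "#"
    vec = {}
    for a, b in zip(padded, padded[1:]):
        vec[a + b] = vec.get(a + b, 0) + 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def candidates(misspelling, lexicon, k=3):
    mv = bigram_vector(misspelling)
    return sorted(lexicon, key=lambda w: cosine(mv, bigram_vector(w)), reverse=True)[:k]

print(candidates("gharge", ["charge", "garage", "garbage", "change"]))
```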

2.2.6 Neural Net Techniques

Neural nets are likely candidates for spelling correctors because of their inherent ability to do associative recall based on incomplete or noisy input. Furthermore, because they can be trained on actual spelling errors, they have the potential to adapt to the specific error patterns of their user community, thus maximizing their correction accuracy for that population. One can imagine a neural net learning chip in a personal workstation that continuously adapts its weights to conform to its owner's idiosyncratic spelling error behavior.

The back-propagation algorithm [Rumelhart et al. 1986] is the most widely used algorithm for training a neural net. A typical back-propagation net consists of three layers of nodes: an input layer, an intermediate layer, usually referred to as a hidden layer, and an output layer. Each node in the input layer is connected by a weighted link to every node in the hidden layer. Similarly, each node in the hidden layer is connected by a weighted link to every node in the output layer. Input and output information is represented by on-off patterns of activity on the input and output nodes of the net. A 1 indicates that a node is turned on, and a 0 indicates that a node is turned off. Patterns of continuous real numbers can also be used to indicate that nodes are more or less active.

Processing in a back-propagation net consists of placing a pattern of activity on the input nodes, sending the activity forward through the weighted links to the hidden nodes, where a hidden pattern is computed, and then on to the output nodes, where an output pattern is computed. Weights represent connection strengths between nodes. They act as resistors to modify the amount of activity reaching an upper-level node from a lower-level node. The total activity reaching a node is the sum of the activities of each of the lower-level nodes leading up to it, multiplied by their connection strengths. This sum is usually thresholded to produce a value of 0 or 1 or put through a squashing function to produce a real number between 0.1 and 0.9.

The back-propagation algorithm provides a means of finding a set of weights for the net that allows the net to produce the correct, or nearly correct, output pattern for every input pattern. It is a convergence technique that relies on training from examples. Weights are initialized to small random values. Then, for each input-output pair in a training set, the input pattern is placed on the input nodes of the net, and activation is sent forward through the weights to the hidden and output nodes to produce an output pattern on the output nodes. The actual output pattern is then compared to the desired output pattern, and a difference is computed for each node in the output layer. Working backwards through the net, the difference is used to adjust each weight by a small amount proportional to the error. This procedure is used to cycle repeatedly through all the examples in the training set until the weights converge. The result is a set of weights that produces a nearly correct output pattern for every input pattern in the training set, as well as for similar input patterns that were not in the training set. This last factor is referred to as the net's ability to generalize to noisy and novel inputs.
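A minimal sketch of such a three-layer back-propagation net, assuming sigmoid units and a squared-error delta rule; the layer sizes, learning rate, and initialization range are illustrative choices, not parameters from any of the studies cited here.

```python
import random, math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BackpropNet:
    def __init__(self, n_in, n_hid, n_out, lr=0.5):
        self.lr = lr
        self.w1 = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
        self.w2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_out)]

    def forward(self, x):
        # Activity flows input -> hidden -> output through the weighted links.
        self.x = x
        self.h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        self.o = [sigmoid(sum(w * hi for w, hi in zip(row, self.h))) for row in self.w2]
        return self.o

    def backward(self, target):
        # Output-layer deltas, then hidden-layer deltas propagated back through w2.
        d_out = [(t - o) * o * (1 - o) for t, o in zip(target, self.o)]
        d_hid = [h * (1 - h) * sum(d * self.w2[k][j] for k, d in enumerate(d_out))
                 for j, h in enumerate(self.h)]
        for k, d in enumerate(d_out):
            self.w2[k] = [w + self.lr * d * h for w, h in zip(self.w2[k], self.h)]
        for j, d in enumerate(d_hid):
            self.w1[j] = [w + self.lr * d * xi for w, xi in zip(self.w1[j], self.x)]

net = BackpropNet(n_in=4, n_hid=3, n_out=2)
for _ in range(1000):                       # one tiny training run on a single pair
    net.forward([1, 0, 1, 0]); net.backward([1, 0])
```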

For example, in a spelling correction application, a misspelling represented as a binary n-gram vector might serve as an input pattern to the net. Its corresponding output pattern might be a vector of m elements, where m is the number of words in the lexicon, and only the node corresponding to the correct word is turned on. This type of net is sometimes referred to as a 1-of-m classifier because the goal of the net is to maximize activation on the output node representing the correct word and minimize activation on all other nodes. In fact, the values of the nodes in the resultant output vector are sometimes interpreted as likelihood values. The 1-of-m encoding scheme used for the output pattern is referred to as a local encoding scheme because each word in the lexicon is represented by a single node. In contrast, the binary n-gram encoding scheme used for the input pattern is referred to as a distributed encoding scheme because no one node contains all the information needed to represent a word. Rather, the information is distributed across the whole vector of input nodes.
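A sketch of these two encodings, assuming a 26-letter alphabet and a toy lexicon; the feature inventory is illustrative and does not reproduce the exact vectors used in the studies discussed below.

```python
# Distributed input encoding (binary unigram-plus-bigram features) and
# local 1-of-m output encoding over a small lexicon.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
FEATURES = list(ALPHABET) + [a + b for a in ALPHABET for b in ALPHABET]

def encode_input(word):
    present = set(word) | {a + b for a, b in zip(word, word[1:])}
    return [1 if f in present else 0 for f in FEATURES]

def encode_target(word, lexicon):
    return [1 if w == word else 0 for w in lexicon]   # 1-of-m (local) encoding

lexicon = ["week", "well", "weak"]
x = encode_input("wekk")              # noisy input pattern
t = encode_target("week", lexicon)    # desired output pattern
```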

Neural nets are being intensively explored for OCR applications, not so much for word recognition as for applications of handwritten-number recognition such as postal zipcode reading and bank check and credit card receipt reading [Matan et al. 1992; Keeler and Rumelhart 1992]. These OCR applications are inherently difficult due to the lack of contextual information available. Some neural net word recognition work has been done, however. Burr [1987] implemented a two-phase hand-printed-word recognizer that combines a neural net recognition technique with a probabilistic correction technique. In phase one, his system uses a neural net with 26 output nodes, each representing one letter, and 13 input nodes, each representing one segment of a bar mask, for capturing the features of a handprinted letter. The output of the neural net is a useful approximation to a maximum-likelihood probability distribution over the 26 letters of the alphabet. In phase two, the system uses Bayes' formula to compute a probability estimate for each word in the dictionary based on the likelihood probabilities of the individual characters and the unigram probabilities of the words. In a test experiment, Burr created a set of 208 hand-printed characters, eight instances of each letter in the alphabet. He trained a neural net using half the set and tested it on the other half. The net attained 94% character recognition accuracy on the test set. Then, he tested his word recognition technique by simulating a hand-printed word for each word in a 20,000-word dictionary using instances of letters in the test set. He obtained a word recognition rate of 99.7% for words at least six characters long.

Neural nets have been investigated by both Kukich [1988a, 1988b] and Cherkassky and Vassilas [1989a, 1989b] for name correction applications, by Kukich [1990] for a text-to-speech synthesis application, by Gersho and Reiter [1990] for an address correction application, and by Deffner et al. [1990a, 1990b] for a natural language interface application. The studies by Kukich and by Cherkassky and Vassilas explored the use of the back-propagation algorithm for training nets; Gersho and Reiter's experiments made use of both self-organizing and back-propagation nets; and the system devised by Deffner et al. employed a weighted correlation matrix model to combine a variety of sources of lexical knowledge.

For her name correction experiments Kukich [1988a] used a lexicon of 183 surnames to train a standard back-propagation net. The output layer consisted of 183 nodes, one for each name. The input layer contained 450 nodes in 15 sequential blocks of 30 nodes each, so names of up to 15 characters in length could be represented. Each 30-node block contained one node for each character in a 30-character alphabet. So if the letter C, for example, appeared as the first character in a string, the node representing C would be turned on (set to 1) in the first block, and the remaining 29 nodes in that block would be turned off (set to 0). Thus, it was a positionally encoded, shift-sensitive input vector.
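A sketch of this positional encoding follows; the particular 30-character alphabet used here is an assumption, since the original alphabet is not enumerated in the text.

```python
# Positionally encoded, shift-sensitive input vector: 15 blocks of 30 nodes,
# one block per character position, one node per alphabet symbol.

ALPHABET30 = list("abcdefghijklmnopqrstuvwxyz") + ["'", "-", ".", " "]   # assumed alphabet
MAX_LEN, BLOCK = 15, len(ALPHABET30)

def encode_name(name):
    vec = [0] * (MAX_LEN * BLOCK)
    for i, ch in enumerate(name.lower()[:MAX_LEN]):
        if ch in ALPHABET30:
            vec[i * BLOCK + ALPHABET30.index(ch)] = 1   # one node on per block
    return vec

v = encode_name("Cervantes")
assert len(v) == 450 and sum(v) == len("Cervantes")
```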

The net was trained on hundreds of artificially generated single-error misspellings of each of the 183 names. Tens of hours of training time on the equivalent of a Sun SPARCstation 2TM processor were required for the net to converge. The net was tested on additional randomly generated single-error misspellings. Testing required only a few seconds of CPU time. The net achieved near-perfect (approaching 100%) correction accuracy for all four classes of typos, including the shift-variant insertion and deletion typos. In fact, these latter error types did take slightly longer to learn.

Some of the experimental findings were that: (1) nets trained on names with misspellings learned better than nets trained on perfectly spelled names; (2) as the number of hidden nodes was increased up to 183 nodes, performance increased; above 183 nodes performance leveled off; and (3) nets using local encoding schemes learned more quickly than variants using distributed encoding schemes. Subsequent experiments revealed that nets using the positionally encoded shift-sensitive input scheme and no hidden layer actually learned more quickly and performed at least as well as the optimally configured hidden-layered nets. However, nets that used a binary n-gram (unigram-plus-bigram), hence shift-invariant, input scheme, while exhibiting the same performance increase as the number of hidden nodes was increased, performed miserably with no hidden layer.

TM Sun SPARCstation 2 is a registered trademark of Sun Microsystems, Inc.

The experiments performed by Cherkassky and Vassilas [1989a, 1989b] corroborated many of Kukich's findings. They separately tested both unigram and bigram vector encoding schemes for input to the nets, and they used the same local encoding scheme (one node per lexicon entry) for output; thus their nets were also 1-of-m classifiers. They trained nets on correctly spelled names and tested them on names with artificially generated deletion and substitution errors. Lexicon sizes ranged from 24 to 100 names. They found that nets attained almost 100% correction accuracy for the smallest lexicons, but the choice of learning rate and number of hidden units made significant differences in the performance of the nets. They found also that the optimum number of hidden units was close to the number of names in the lexicon. One might be tempted to conclude from this fact that nets were simply performing a rote lookup operation, and no generalization was taking place. However, this is clearly not the case, as evidenced by the fact that nets were tested on novel misspellings which they had never seen during training. Cherkassky and Vassilas concluded that because the nets were so sensitive to various training parameters, and because they required extensive computational training time, they are less desirable candidates for spelling correctors than correlation matrix models. An overview paper by Cherkassky et al. [1992] summarizes these experiments and discusses the relative merits and shortcomings of correlation matrix models and neural nets.

Kukich [1990] went on to investigate how such nets might scale up for a text-to-speech spelling correction application requiring a larger lexicon and having a significant portion (25%) of multi-error misspellings. A standard three-layer, back-propagation architecture was used, with a 420-element unigram-plus-bigram vector input layer, a 500-node hidden layer, and a locally encoded 1,142-node output layer. Again, the net had to be trained on artificially generated errors because not enough genuine errors were available, but it was tested on the same set of 170 genuine misspellings previously used to test vector space and other techniques. Like the best vector space techniques, the net achieved an accuracy score of 75%. Computational training time requirements were high, however, on the order of hundreds of hours of Sun SPARCstation 2 CPU time. Thus, it appears that the heavy computational training demands of standard three-layer, 1-of-m back-propagation nets impose strict practical limits on scalability with respect to lexicon size, at least with currently available computational power.

However, techniques for reducing training time by partitioning the lexicon or by exploiting alternative network architectures are being explored. One such variation was investigated by Gersho and Reiter [1990] for a database retrieval application involving up to 1,000 street names. They devised a hierarchical two-layer network strategy consisting of a self-organizing Kohonen-type network [Kohonen 1988] and many small back-propagation networks. Input to their system goes first to the Kohonen network to be categorized into one of a few coarse-grained categories and then to the appropriate back-propagation network for more fine-grained categorization. They report accuracy rates ranging from 74% to 97% for a database of 800 street names, and they are continuing to explore larger databases.

The neural net spelling corrector implemented by Deffner et al. [1990a], which is currently being tested on a 5,000-word lexicon, is part of a natural language interface to a database system. During retrieval, certain features (such as syntactic and semantic features) may be fixed by the context of the database query. Lexicon entries and misspellings are represented by feature vectors containing letter n-grams, phonetic features, syntactic features (e.g., is_adjective), and semantic features (e.g., is_color). A Hamming distance metric is used to compute a measure of similarity between a potentially misspelled or incomplete input string and each member of a restricted subset of the lexicon. This technique appears to be a promising step toward incorporating some context dependency into a corrector for domain-specific applications.
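A sketch of feature-vector matching in this spirit follows; the feature set, the lexicon entries, and the context filter standing in for the query-fixed features are all illustrative assumptions.

```python
# Lexicon entries and the input are binary feature vectors (letter n-grams plus
# syntactic/semantic flags); candidates consistent with the query context are
# compared by Hamming distance.

FEATURES = ["tri:red", "tri:blu", "is_adjective", "is_color", "is_noun"]   # assumed features

lexicon = {
    "red":  [1, 0, 1, 1, 0],
    "blue": [0, 1, 1, 1, 0],
    "bed":  [0, 0, 0, 0, 1],
}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def best_match(query_vec, context_fixed=None):
    """Restrict to entries whose flags match the features fixed by the query context."""
    pool = {w: v for w, v in lexicon.items()
            if not context_fixed or all(v[i] for i in context_fixed)}
    return min(pool, key=lambda w: hamming(query_vec, pool[w]))

# A garbled 'red' with the context fixing is_color (feature index 3):
print(best_match([1, 0, 0, 1, 0], context_fixed=[3]))
```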


Table 1.  Accuracy of Some Isolated-Word Spelling Correction Techniques on Kukich's 170-Word Test Set

Technique                                  521-Word    1142-Word    1872-Word
                                           Lexicon     Lexicon      Lexicon

Minimum Edit Distance
  - grope                                  64%         62%          60%
Similarity Key
  - Bocast Token Reconstruction            80%         78%          75%
Simple N-gram Vector Distance
  - Dot Product                            58%         54%          52%
  - Hamming Distance                       69%         68%          67%
  - Cosine Distance                        76%         75%          74%
SVD N-gram Vector Distance
  - Cosine Distance                        81%         76%          74%
Probabilistic
  - Kernighan-Church-Gale Error Probs                  78%
Neural Net
  - Back-propagation Classifier                        75%          75%

2.2.7 Summary of Isolated-Word Error Correction Research

It should be clear from the preceding sections that the ideal isolated-word error corrector does not yet exist. Such a system would have broad lexical coverage (> 100,000 words) and be capable of correcting both single-error and multi-error misspellings in short, medium, and long words in real time with near-perfect first-choice accuracy. Real-time text-to-speech synthesis is one example of such a demanding application. It appears that performance levels approaching 90% correction accuracy for isolated-word text recognition and 80% correction accuracy for isolated-word spelling correction represent the state of the art for practical applications.

In most cases it is impossible to make direct comparisons of techniques. In only a few cases were techniques tested on the same test set and lexicon. Furthermore, techniques vary greatly in the kinds of errors they were designed to correct (e.g., OCR-generated vs. human-generated, typos vs. phonetic errors, single vs. multiple errors, errors in long vs. short words, etc.). Hence, the characteristics of the errors in a test set as well as the size and content of a test dictionary all have great impact on the correction accuracy of a technique.

One small set of direct comparisons is available based on Kukich's test set of 170 human-generated misspellings. This test set may be considered difficult but realistic in that it contains relatively large portions of multi-error misspellings (25%) and short words (63%). The results, shown in Table 1, provide at least a partial ordering of the accuracy of some isolated-word spelling correction techniques.


It may be surprising to some that most of the other techniques outperform the minimum edit distance technique. The increased accuracy of the n-gram vector distance techniques, especially the cosine distance metric, may be due to the fact that the bigram representation it uses captures more useful information than the unigram representation used in the minimum edit distance technique. Bocast's "token reconstruction," which makes use of bidirectional letter sequence information, also apparently captures some useful information. It is interesting that the neural net technique attained nearly the same performance level as the best n-gram vector distance technique despite being trained on artificially generated errors. At the same time, it is notable that computational demands make standard neural net techniques impractical given existing hardware. It is even more interesting that Kernighan et al.'s [1990] error probabilities enhanced the performance of the cosine distance metric despite the fact that those probabilities were estimated from a general source (AP news wire data) while the test set and lexicon came from a specialized domain.

It is not clear whether these techniques approach some theoretical upper bound on isolated-word correction accuracy. It may not be coincidental that human performance on the same test set was nearly identical to the best automatic techniques. Seven human subjects turned in scores ranging from 65% to 83% and averaging 74%. To see why even humans have trouble achieving 100% accuracy levels, consider the problem of guessing the intended word from among the closest candidates for just a few of the misspellings in that test set:

vver → over, ever, very?
ater → after, later, ate, alter?
wekk → week, well, weak?
gharge → charge, garage, garbage?
throught → through, thought, throughout?
thre → there, three, threw, the?
oer → per, or, over, her, ore?

Given isolated misspelled strings such as these, it is difficult to rank candidate corrections based on orthographic similarity alone.

Nevertheless, it does seem likely that isolated-word spelling correction accuracy levels could be pushed a little higher with good sources of probabilistic information. OCR error correction research has shown that although transition and confusion probabilities provide only mediocre correction capability when used alone, they achieve excellent performance levels when coupled with dictionary lookup techniques. OCR researchers have been able to estimate reliable confusion probabilities for OCR devices by feeding text to those devices and tabulating error frequencies, but confusion probabilities for human-generated errors are harder to come by. Indeed, those compiled by Kernighan et al. [1990] remain the only such source. Given that the KCG error probabilities were able to improve the performance of other techniques despite their generality, it seems reasonable to infer that error probabilities that are specifically tuned to their human sources and lexical domains would achieve even better accuracy. Annotated machine-readable corpora of human-generated errors from which error probabilities may be estimated are sorely needed. A starting collection containing mainly British-English misspellings is available from Mitton [1985].

Error probabilities might be integrated into n-gram vector distance techniques by storing their values directly into word vectors, thus creating detailed word models that explicitly represent the specific probabilities of typing any n-gram in any position of a correct word. Alternatively, once neural net learning chips become practical, the same probabilistic knowledge could not only be learned for each individual, but it could also be continuously adapted.

Further research is still needed to address the problem of word boundary infractions as well as the problem of neologisms, i.e., novel terms that arise from creative morphology and from the influx of new proper nouns and other words reflecting a changing environment. As Sinha and Prasada [1988] have aptly pointed out, no dictionary will ever be complete for this reason. Their technique, which relies on the use of transition probabilities when it encounters such terms, seems apropos and warrants further investigation.

Other techniques worth investigating are those that, like Tenczar and Golden's [1972], are modeled after human psychological word recognition processes. Such techniques would attempt to capture gestalt features of words, such as the number of characters extending above and below the horizontal base line, etc. Hull [1987] and Ho et al. [1991] have already begun some work along these lines. There is a wealth of experimental-psychology research to be exploited. Rumelhart and McClelland [1982], McClelland and Rumelhart [1981], and Johnston and McClelland [1980] might serve as starting points for exploring this area.

It is interesting to note that while all techniques fall short of perfect correction accuracy when only the first guess is reported, most researchers report accuracy levels above 90% when the first three guesses are considered. Given a small candidate set that contains the correct word over 90% of the time, it is possible to consider exploiting some contextual information to choose from among the candidates. Thus, the potential exists for current isolated-word error correction techniques to be extended by context-dependent knowledge. Some preliminary experiments along this line, along with other context-dependent spelling correction efforts, are discussed next.

3. CONTEXT-DEPENDENT WORD CORRECTION RESEARCH

Regardless of the progress that is made on isolated-word error correction, there will always remain a residual class of errors that is beyond the capacity of those techniques to handle. This is the class of real-word errors in which one correctly spelled word is substituted for another. Some of these errors result from simple typos (e.g., from → form, form → farm) or cognitive or phonetic lapses (e.g., there → their, ingenious → ingenuous); some are syntactic or grammatical mistakes, including the use of the wrong inflected form (e.g., arrives → arrive, was → were) or the wrong function word (e.g., for → of, his → her); others are semantic anomalies (e.g., in five minuets, lave a message); and still others are due to insertions or deletions of whole words (e.g., the system has been operating system for almost three years, at absolutely extra cost) or improper spacing, including both splits and run-ons (e.g., myself → my self, ad here → adhere). These errors all seem to require information from the surrounding context for both detection and correction. Contextual information would be helpful also for improving correction accuracy for detectable nonword errors.

Devising context-dependent word correction tools has remained an elusive goal. This is due mainly to the seeming intractability of the problem, which appears to call for full-blown natural-language-processing (NLP) capabilities, including robust natural language parsing, semantic understanding, pragmatic modeling, and discourse structure modeling. Up to now, successful NLP systems have been confined to highly restricted domains of discourse, and although a few of these systems have addressed the need to handle ill-formed input, none were designed for general use. Recently, however, progress toward robust syntactic parsing has led to the development of at least two general writing aid tools capable of detecting and correcting errors due to some syntactic-constraint violations. At the same time, advances in statistical language modeling, together with a steady increase in computational power, have led to some promising experiments in the area of statistically based error detection and correction.

The history of research related to context-dependent word correction is reviewed in this section. It is divided into four subsections. The first reviews findings on the frequency and classification of real-word errors. The second reviews prototype NLP systems that address the problem of handling ill-formed input, including two syntactic rule-based writing aid tools that incorporate spelling and grammar checking. The third reviews recent experiments that attempt to exploit statistical language models for context-dependent spelling error detection and correction. The last provides a summary of context-dependent spelling correction work.

3.1 Real-Word Errors: Frequency and Classification

Only a few studies have attempted to estimate the frequency of occurrence of real-word errors. (One obvious reason for the shortage of data on the frequency of occurrence of real-word errors is the lack of automatic tools for detecting such errors.) The significant findings are: Peterson's [1986] estimates suggesting that anywhere from 2% to 16% of all single-error typos might result in real-word errors, Mitton's [1987] analysis of 925 handwritten student essays in which he observed that a full 40% of misspellings were real-word errors, and Wing and Baddeley's [1980] listing of 1,185 handwritten errors in which roughly 30% involved real words.

An extrapolation from Mitton's [1987] statistics to the 40,000-word corpus of typed textual conversations studied by Kukich [1992] dramatizes the need for context-dependent spelling correction. Only 85% of the 2,000+ detectable nonword error types in that corpus were correctable by isolated-word correction techniques (the remaining 15% being word boundary infractions that are not correctable by current techniques). The best isolated-word correction techniques corrected approximately 78% of that 85%, for an overall nonword correction rate of 66%. But if the nonword errors studied represent only 60% of all errors in the corpus (the remaining 40% being real-word errors), then only 66% of 60%, or 40%, of all the errors in the corpus are ultimately being corrected. Clearly, the greatest gain is to be found in the detection and correction of errors in the 40% block of real-word errors.
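The extrapolation amounts to a simple chain of products; a quick check, with the figures taken from the text and the 60/40 nonword/real-word split treated as the assumed quantity:

```python
# Back-of-the-envelope check of the extrapolation above.
nonword_share   = 0.60   # assumed: nonword errors are 60% of all errors
correctable     = 0.85   # detectable nonword errors addressable by isolated-word techniques
accuracy        = 0.78   # best isolated-word correction accuracy on those errors
overall_nonword = correctable * accuracy            # ~0.66 of nonword errors corrected
overall_total   = overall_nonword * nonword_share   # ~0.40 of all errors corrected
print(round(overall_nonword, 2), round(overall_total, 2))
```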

Because real-word spelling and grammatical errors pose practical problems for natural language interfaces, a few studies have been undertaken to determine the expected error rate in such environments. Thompson [1980] analyzed 1,615 database input queries. She found that 446 queries, or almost 28%, contained errors. Of those 446, 61 were nonword spelling errors; 161 were errors due to incomplete lexicon coverage; and 72 were punctuation errors. The remaining real-word errors included 62 grammatical errors and comprised 34% of all errors.

Eastman and McLean [1981] studied 693 natural language queries to a database system. They found that over 12% of them contained co-occurrence violations such as subject-verb disagreement, pronoun-noun disagreement, incorrect verb form, uninflected plural, etc. Another 12% exhibited missing words, and 2% contained extraneous words. Overall, more than 25% of the queries contained real-word grammatical errors. In a more recent study, Young et al. [1991] again found real-word spelling and grammar error rates in excess of 25% in a sample of 426 natural language queries to a document retrieval system. Thus, real-word error rates ranging from 25% to 40% of all errors have been empirically documented.

One approach to handling real-word errors is to view them as violations of natural language processing constraints and to apply NLP tools to the task of detecting and correcting them. NLP researchers traditionally identify at least five levels of processing constraints: (1) a lexical level, (2) a syntactic level, (3) a semantic level, (4) a discourse structure level, and (5) a pragmatic level. Nonword errors would be classified as lexical errors since they violate constraints on the formation of valid words. Errors due to lack of subject-verb number agreement would be classified as syntactic errors because they violate syntactic constraints, as would other errors that result in the substitution of a word whose syntactic category does not fit its context, e.g., The study was conducted be XYZ Marketing Research. Errors that do not necessarily violate syntactic constraints but do result in semantic anomalies, e.g., see you in five minuets, would be classified as semantic errors. Errors that violate the inherent coherence relations in a text, such as an enumeration violation, would be classified as discourse structure errors, e.g., Iris flowers have four parts: standards, petals, and falls. And errors that reflect some anomaly related to the goals and plans of the discourse participants would be classified as pragmatic errors, e.g., Has Mary registered for Intro to Communication Science yet? (where Computer was intended).

It would be helpful to know the relative proportions of real-word errors that fall into each of the five categories because the five levels of natural-language-processing constraints represent increasingly more difficult problems for NLP research. Lexical errors are amenable to isolated-word spelling correction techniques; tools are just beginning to emerge for handling syntactic errors in unrestricted texts; but only restricted-domain prototype systems exist for handling semantic, discourse structure, and pragmatic errors. Ramshaw [1989] astutely points out that the same error may be classified differently with respect to detectability and correctability. He cites the following example from a student advisor expert system, Is there room in CIM 360?, as a case in which lexical constraints may be sufficient to detect the error (since CIM is not a valid course code), but pragmatic knowledge about the student's goals and past course history may be needed to determine whether CIS (Computer and Information Science) or COM (Communication Sciences) was intended.

It would also be helpful to know the relative proportions of real-word errors that depend on local vs. long-distance relations between words. This is because an alternative approach to handling real-word errors is to treat them as irregularities or improbabilities within the framework of statistical language models. Statistical language models based on word bigram probabilities [Gale and Church 1990; Church and Gale 1991b], word trigram probabilities [Bahl et al. 1983; Jelinek et al. 1991], word n-gram probabilities [Bahl et al. 1989; Brown et al. 1990a], syntactic part-of-speech bigram and trigram probabilities [Derouault and Merialdo 1984a; Garside et al. 1987; Church 1988], and collocation probabilities [Choueka 1988; Smadja and McKeown 1990; Smadja 1991a] have been developed already for other speech- and text-processing applications. Statistical techniques based on such models could be used to detect improbable word sequences and to suggest probable corrections. Most of these models are based on local lexical or syntactic relations, i.e., relations among words that occur within a few positions of each other, so error detection and correction techniques based on these models would be limited to handling local errors. Since it is still not clear what proportion of real-word errors fall within this realm, the error coverage potential of statistically based models is yet to be determined.
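A sketch of how a word bigram model could flag improbable sequences follows; the counts, the add-one smoothing, and the threshold are illustrative assumptions tuned to this toy data, not parameters from any of the cited models.

```python
from math import log

# Toy corpus statistics; a real model would be estimated from a large corpus.
bigram_counts  = {("the", "design"): 300, ("design", "and"): 400, ("and", "construction"): 120}
unigram_counts = {"the": 50000, "design": 2000, "and": 60000, "an": 15000, "construction": 800}
V = len(unigram_counts)

def bigram_logprob(w1, w2):
    # Add-one smoothing so unseen pairs get a small, but nonzero, probability.
    return log((bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + V))

def flag_improbable(sentence, threshold=-7.0):
    words = sentence.lower().split()
    return [(a, b) for a, b in zip(words, words[1:]) if bigram_logprob(a, b) < threshold]

print(flag_improbable("the design and construction"))   # -> []
print(flag_improbable("the design an construction"))    # -> flags ('design', 'an'), ('an', 'construction')
```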

One small but insightful study was done by Atwell and Elliott [1987] in conjunction with the University of Lancaster Unit for Computer Research on the English Language (UCREL) probabilistic parsing project [Garside et al. 1987]. These researchers analyzed a small sample of errors to determine the relative proportion of errors in each of the following four categories: (1) nonword errors; (2) real-word errors in locally invalid syntactic contexts; (3) real-word errors in globally invalid syntactic contexts; and (4) syntactically valid real-word errors in invalid semantic contexts. Their sample consisted of 50 errors each from three different sources: (a) published, manually proofread texts, (b) essays by 11- and 12-year-old students, and (c) English written by non-native English speakers. The results of their analysis, shown in Table 2, indicate widely different distributions of errors across text genres, but they do indicate that a significant portion of errors may be detectable as local syntactic violations. The prototype system Atwell and Elliott implemented for detecting such errors is described in Section 3.3.1.

6 Collocations are word pairs (or triples or quadruples, etc.) that exhibit statistically significant affinities for occurring somewhere in each other's nearby, though not necessarily contiguous, lexical environment, e.g., salt and pepper, apple and pie, apple and big, state and union, etc.

Table 2.  Distribution of Spelling Errors (from Atwell & Elliott)

                    Total #     % Nonword    % Local       % Global      % Semantic
Text Source         of Errors   Errors       Syntactic     Syntactic     Errors
                                             Errors        Errors

Published Texts     50          52%          28%           8%            12%
Student Essays      50          36%          38%           16%           10%
Non-Native Text     50          4%           48%           12%           36%

The following few sentences, which were randomly culled from existing texts, contain genuine real-word errors that illustrate some of the problems and potential for various NLP and statistically based detection and correction techniques. Most of the errors in these sentences are readily detectable by humans, though errors of this type are frequently overlooked because inherent human error correction tendencies are so strong.7

7 See Cohen [1980] for a review of some experiments on human ability to detect nonword, real-word, homophonic, and nonhomophonic spelling errors in context. Experimental results indicate that, not surprisingly, nonword errors are easier to detect than real-word errors, and that, perhaps surprisingly, real-word homophones are easier to detect than real-word nonhomophones. The author suggests three possible explanations for the latter result: a priming explanation, a frequency explanation, and a semantic explanation.

Local Syntactic Errors:

(1) The study was conducted mainly be John Black.
(2) The design an construction of the system will take more than a year.
(3) Hopefully, all with continue smoothly in my absence.
(4) Then they an going to go to his house.
(5) I need to notified the bank of [this problem].
(6) He need to go there right now.

Global Syntactic Errors:

(7) Not only we in academia but the scientific community as a whole needs to guard against this.
(8) Won't they heave if next Monday at that time?
(9) I can't pick him cup cuz he might be working late.
(10) This thesis is supported by the fact that since 1989 the system has been operating system with all four units on-line, but . . .

Semantic Errors:

(11) He is trying to fine out.
(12) They are leaving in about fifteen minuets to go to her house.
(13) Can they lave him my message?
(14) Well, I'll call again when they come hoe.

Examples (1) through (10) all contain syntactic violations. In theory, they should be detectable by a sufficiently robust automatic natural language parser. The syntactic violations in (1) through (6) involve local (within 1 or 2 words) incongruities, which should make them detectable not only by a parser but also via a statistical language model.

The mistakes in examples (7) through (10) are more problematic. The subject-verb number disagreement in example (7) turns on a long-term dependency between the complex subject phrase and the verb. This mistake requires a parser capable of building the full syntactic structure of the sentence. Even a parser with that capability might have trouble locating the double error in (8), which precipitates a garden path effect.8 Although the problem in (9) might be detectable by local syntactic-constraint violations, it would be difficult to select between two potential corrections, the phrases pick him up and pick his cup, on syntactic grounds alone. Furthermore, the same mistake would not have been detectable even via syntactic analysis if the sentence had read pick her cup. Deletions and insertions, such as the extra word, system, in example (10), might be problematic for both locally based and globally based syntactic analyzers.

8 A garden path effect occurs when a processor is led down an incorrect search or processing path from which it has difficulty recovering.

The mistakes in examples (11) through (14) would all be undetectable via syntactic analysis alone since they all have valid parses. However, they are readily detectable by humans for their semantic incongruities and/or lexical improbabilities. Unfortunately, there are no general-purpose NLP systems capable of detecting semantic incongruities that might be used for this purpose. There are statistical language models, however, that might hold some promise for detecting not only semantic errors but also local syntactic errors. Intuition suggests that many of the above errors result in low-frequency bigrams, i.e., an construction, with continue, and they an in (2), (3), and (4). However, some seem to require at least trigrams to be detected, e.g., conducted mainly be, been operating system, and to fine out, in (1), (10), and (11). At least one, (7), is resilient to both bigram and trigram improbability detection.

No examples of discourse structure errors or pragmatic errors are included above because none were readily found in a few days' recording of actual errors. Whether this reflects a genuine sparseness of such errors is unknown.

One other characteristic of real-word errors for which little hard data exists is their apparent tendency to involve orthographic intrusion errors, i.e., substitutions of graphemes from nearby words, as in the following examples:

(15) Is there a were to display your plots on a DEC workstation running X? (way → were)
(16) . . . and they would bust bill us 1/12 of that each month. (just → bust)
(17) As noted above, such as system would give them a monopoly. (a → as)
(18) I have access to those people through by boss. (my → by)
(19) We will not meet again until . . . we get out first outside speaker. (our → out)

Although phonemic (aural) lexical intrusions, or slips of the tongue, are well documented in cognitive studies of language production [Fromkin 1980; Garrett 1982], orthographic (visual) lexical intrusions, or slips of the pen, are less well documented [Ellis 1979; 1982; Hotopf 1980]. Studies of these phenomena might certainly be exploited in the context of word correction.

In summary, there is some evidence that real-word errors are ubiquitous, accounting for as much as 40% or more of all textual errors, but more research is needed to determine their true frequency. More research is also needed to characterize or classify such errors. Knowledge of the relative proportions of such errors that violate each of the five natural language processing level constraints, as well as the proportions that entail local vs. global lexical or syntactic violations and the proportions that involve orthographic intrusions, would be helpful to determine which techniques might be most useful in detecting and correcting them. Emerging NLP tools for syntactic processing have potential for detecting and correcting some grammatical errors, while statistical language-modeling tools have potential for handling certain other errors that result in local lexical irregularities. These techniques are described in more detail in the following sections.

3.2 NLP Prototypes for Handling Ill-Formed Input

Natural language researchers were quick to recognize the need for handling ill-formed input as they observed their earliest NLP systems floundering with the naturally erroneous input proffered by their first real users. They soon began incorporating error-handling techniques into prototype NLP systems. Most of the prototypes were designed for specific tasks within restricted domains of discourse, so their error-handling techniques do not scale up well for use on unrestricted text. Nevertheless, this work helps to elucidate some of the issues surrounding the problem of context-dependent spelling and grammar correction, so an overview is included here. Progress has been made toward processing unrestricted text on the syntactic level, so two syntactic rule-based NLP systems for context-dependent spelling and grammar correction are more fully described at the end of this section.

Most NLP systems are parser driven. That is, they view their central task as one of deriving the syntactic structure of the input text (or speech). Sometimes they go on to do further semantic, discourse structure, and pragmatic processing to interpret and act on the input. A typical NLP system consists of a lexicon, a grammar, and a parsing procedure. The lexicon is the set of terms that are relevant to the application domain. The terms are usually annotated with their potential parts of speech as well as morphological information. The grammar is a set of rules that specifies how words, parts of speech, and higher-order syntactic structures (e.g., noun phrases, verb phrases, etc.) may be validly organized into well-formed sentence fragments and sentences. The parsing procedure often consists of looking up each word in the dictionary to determine its potential parts of speech and applying grammar rules to build up higher-order syntactic structures. Since many words have more than one possible part of speech, multiple partial parses often must be built in parallel until later constraints resolve the syntactic ambiguity of the terms.

Not all NLP systems retain clear-cut divisions of lexical, syntactic, and semantic knowledge. Quite a few systems employ some form of a "semantic grammar" in which semantic restrictions on how words can be meaningfully juxtaposed are incorporated directly into the grammar. A few systems incorporate knowledge of syntactic structures and semantic restrictions into their lexicons. In either case, the availability of knowledge of semantic constraints at an earlier stage of the parsing process helps narrow the space of potential parses that must be explored. Those NLP systems that make use of discourse structure and/or pragmatic knowledge usually keep that knowledge in separate data structures that represent models of the organization of the text, the user's plans and goals, and/or the system's knowledge of the domain. The models may be consulted and modified during the parsing process, and the constraints they impose help to reduce further the search space of potential parses.

Under this paradigm, the detection and correction of errors is nontrivial for at least two reasons: (1) lexical and grammatical coverage is necessarily incomplete; and (2) locating the source of a parsing failure is often difficult. For example, a failure to find a string in the system's lexicon could indicate that a spelling error has occurred, but it could also indicate that a novel term has been encountered. Because the occurrence of novel terms (e.g., proper nouns) is so high, most parsers assume that unknown terms are not mistakes, assign a default syntactic category (usually noun) to them, and then attempt to proceed with the parse. If a parsing failure occurs further along in processing, often it is hard to tell whether it is due to the fact that the input violates some rule(s) of the grammar, or the fact that the grammar itself is incomplete, or the fact that an input string has been miscategorized. Furthermore, it is frequently hard to pinpoint the precise rule or word that precipitated the failure, let alone suggest a correction. Consider the previous example: Won't they heave if next Monday at that time?

Even a robust parser might not recognize that a problem exists until the ending punctuation is encountered, at which time it would be hard to guess that the terms heave if are the culprits and that they should be replaced with the terms have it.

Various approaches to dealing with ill-formed natural language input have been tried. They can be loosely classified into three general categories: (1) acceptance-based approaches; (2) relaxation-based approaches; and (3) expectation-based approaches.

3.2.1 Acceptance-Based Techniques

Acceptance-based approaches are grounded in the philosophical observation that despite the fact that ordinary language is replete with errors, people nevertheless have little trouble interpreting it. So acceptance-based techniques assume that errors can simply be ignored as long as some interpretation can be found that is meaningful to the given application. Acceptance-based approaches tend to make heavy use of semantic information and little use of any other level of linguistic information.

Many of the earliest NLP systems, for example, those devised by Waltz [1978] and Schank [1980] and their students, focused on semantic understanding, and so were able to adopt acceptance-based approaches for handling ill-formed input. Under these approaches ill-formed input, such as lack of number agreement between a subject and a verb, was treated on an equal basis with well-formed input, either by ignoring or eliminating from the grammar the syntactic or other grammatical constraints that the input violated or by incorporating rules for processing certain common ill-formed constructions directly into the grammar. These approaches proved successful for their particular applications, in part because erroneous input is frequently semantically unambiguous within a restricted domain. But they do not generalize readily to the task of detecting and correcting real-word errors in unrestricted text for at least two reasons. One is that they often accommodate ill-formed input without even detecting a problem. The other is that they require a semantic model of the domain of discourse. No full-blown semantic model is available for unrestricted text.

A variant of an acceptance-based approach that has some intuitive appeal is the preference semantics approach advocated by Fass and Wilks [1983]. It is intended to address the fact that a large portion of all natural language utterances are either ill-formed or metaphorical to some degree, for example, the statement My car drinks gasoline. They assume that a formal semantic-representation scheme and an encoding procedure are available that enable every input string to be encoded in such a way as to highlight any semantic conflicts or semantic-preference violations. Further processing would then determine the degrees to which either certain terms in the input would have to be changed or the system's semantic model would have to be changed in order to minimize the conflicts. The system would settle on the interpretation resulting in the least change. While this approach has much psychological plausibility, at present it is impractical for processing unrestricted text, again because the rich formal semantic knowledge base it requires is unavailable.

3.2.2 Relaxation-Based Techniques

Relaxation-based approaches are at the opposite end of the error spectrum: they assume that no errors can be ignored. This philosophical bias arose from the fact that many early NLP systems relied almost exclusively on syntactic rules, and as such they simply broke down when such rules were violated. In relaxation-based techniques, when a parsing failure occurs, the system attempts to locate an error by identifying a rule that might have been violated and by determining whether its relaxation might lead to a successful parse. Locating and relaxing rules allow for both detection and correction of errors.

Relaxation-based techniques, which focus mainly on syntactic processing, are the least knowledge intensive of the three approaches and for that reason the most well suited for use on unrestricted text.

In fact, two relaxation-based text-editing systems, the EPISTLE/CRITIQUE system originally designed for editing business correspondence [Heidorn et al. 1982; Richardson and Braden-Harder 1988] and a similar text editor for the Dutch language [Kempen and Vosse 1990], have been tested on unrestricted text. Those systems and their test results are described in more detail later. For now we will simply note that they both employ rule-based parsers to detect and correct certain errors that result in syntactic-rule violations. Neither system makes use of semantic, pragmatic, or discourse structure knowledge, so neither system attempts to detect or correct errors in those categories. The EPISTLE/CRITIQUE system preprocesses input to detect and correct nonword spelling errors before beginning its parsing procedure; certain real-word errors, including commonly confused terms such as who's and whose, as well as a variety of grammatical errors, such as subject-verb number agreement violations and punctuation problems, are detected during the parsing process, and corrections are suggested. The Dutch system preprocesses each input string to generate a rank-ordered cohort set, i.e., a relatively small set of orthographically and phonetically similar words. Presumably, the cohort sets for both nonword and real-word errors will contain the correct word. The parsing procedure is then applied to the original input string, and, when a failure occurs, a rule is relaxed and retried using a candidate term from the cohort set.

Similar relaxation-based techniques have been implemented or proposed by Trawick [1983], Weischedel and Sondheimer [1983], Suri [1991], and Suri and McCoy [1991]. In his REL system, Trawick used syntactic-rule relaxation, plus ordering strategies for trying easy corrections first, plus cohort set generation as a last resort. Weischedel and Sondheimer advocate adopting a principled approach to rule relaxation. They have defined a set of diagnostic metarules to guide the search for potential rules to relax when parsing failures occur. Their proposed metarules detect problems in the following general categories: omitted articles, homonyms, spelling/typographical errors, violated selection restrictions, personification, metonymy, and other violated grammatical tests such as agreement failures. Suri and McCoy have recently begun work on a system for correcting English written by American Sign Language (ASL) users. Since the grammar and syntax of ASL differ significantly from standard English, text written by ASL users often contains errors with respect to standard English that fall into certain patterns. Suri has developed a taxonomy of such errors by analyzing a corpus of writing samples from ASL users. The error taxonomy will guide the development of an error-correcting language-learning tool.

3.2.3 Expectation-Based Techniques

Expectation-based approaches fall somewhere in the middle of the error spectrum—they acknowledge both that errors occur and that people draw on contextual knowledge to correct them. In expectation-based techniques, as the parsing procedure progresses the system builds a list of words it expects to see in the next position based on its syntactic, semantic, and sometimes pragmatic and discourse structure knowledge. If the next term in the input string is not in the list of expected terms, the system assumes an error has been detected and attempts to correct it by choosing one of the words on its expectation list.

Expectation-based techniques have been devised to exploit various levels of linguistic knowledge. One system that exploits both syntactic and semantic expectations for error correction is the case-based parser referred to as CASPAR, DYPAR, and MULTIPAR [Carbonell and Hayes 1983], listed here in successive generations. This parser derives much of its expectation-forming ability from the fact that it operates under the case frame paradigm. Case frames are slot-filler data structures for representing predicates (usually realized as verbs) and their arguments. For example, a GIVE predicate would have a case frame with slots for three arguments, a FROM-slot, a TO-slot, and an OBJECT-slot. Under the case frame paradigm, both the syntactic and semantic structures of a given domain of discourse are modeled in recursive case frames which are incorporated into a grammar and lexicon. As each word in an input string is parsed, it either invokes a new case frame or fills a slot of an existing one. Existing case frames and their unfilled slots set up expectations for upcoming terms. Thus, an expectation list of potential next terms can be generated for each word in an input string as the parse progresses. The size of the expectation list will vary from the whole dictionary, when no case frame has yet been selected, to small sets of potential terms when a partially filled case frame is being processed, to a unique word when a function word is expected. These "reduced-dictionary" expectation lists can be exploited when both nonword and real-word spelling errors are encountered in the input. Case frames also provide significant help in recognizing missing word errors and in parsing valid instances of ellipsis and anaphoric references.
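
To make the slot-filler idea concrete, the following is a minimal sketch, assuming a toy domain, of how a case frame and the expectation lists it generates might be represented; the CaseFrame class, slot names, and lexicon below are illustrative assumptions, not structures taken from CASPAR, DYPAR, or MULTIPAR.

# Minimal, hypothetical sketch of a case-frame data structure of the kind
# described above; class, slot names, and lexicon are illustrative only.
class CaseFrame:
    def __init__(self, predicate, slots):
        self.predicate = predicate                      # e.g., "GIVE"
        self.slots = {name: None for name in slots}

    def unfilled(self):
        # Names of slots that still need a filler.
        return [name for name, value in self.slots.items() if value is None]

    def fill(self, slot, word):
        self.slots[slot] = word

    def expectations(self, lexicon_by_role):
        # Words the parser might expect next, given the unfilled slots.
        expected = set()
        for slot in self.unfilled():
            expected.update(lexicon_by_role.get(slot, set()))
        return expected

# Toy domain lexicon mapping case roles to plausible fillers (assumed).
lexicon_by_role = {
    "FROM": {"John", "Mary"},
    "TO": {"John", "Mary"},
    "OBJECT": {"book", "letter"},
}

give = CaseFrame("GIVE", ["FROM", "TO", "OBJECT"])
give.fill("FROM", "John")
print(give.expectations(lexicon_by_role))   # a "reduced dictionary" for the next word

Filling a slot shrinks the expectation list, which is the sense in which a partially filled case frame constrains candidate corrections.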

The most recent version of this parser, MULTIPAR [Minton et al. 1985], pursues multiple parses in parallel. It exploits both expectation-based and relaxation-based error recovery techniques.

When the parsing procedure reaches a roadblock, it invokes as many recovery strategies as it needs to achieve at least one valid parse, including nonword and real-word spelling correction, missing-word insertion, and others. The order in which recovery strategies are invoked is governed by a control strategy to ensure that strategies leading to less-deviant errors are tried first.

Another expectation-based parser, the NOMAD system [Granger 1983], makes similar use of syntactic and semantic expectations to correct nonword and real-word errors, to constrain possible word senses, and even to guess the meanings of unknown words. For example, NOMAD is able to guess that the unknown term scudded in the input Enemy scudded bombs at us is probably a verb whose action is in the PROPEL class.

Expectation-based techniques have also been devised to exploit pragmatic and discourse structure knowledge. A system devised by Ramshaw [1989] exploits pragmatic knowledge to detect and correct real-word errors in user input to an expert system. The expert system has a rich pragmatic model "based in part on a library of domain plans for how agents can deal with situations of the given type" (p. 23). Input errors are detected when the system is unable to construct a complete interpretation of an input that is compatible with one of the domain plans. Ramshaw's system uses a wildcarding technique to successively replace each word in the input with an open slot and then combine the syntactic and semantic constraints from the partially interpreted input with the predictions from the pragmatic model to yield a ranked set of corrections. Two other systems, one by McCoy [1989] and one by Carberry [1984], prescribe methods for generating responses to pragmatically ill-formed input. Both rely on heuristics for building a model of the user's plans and goals based on information that can be inferred from the preceding dialogue.

Expectations that can be derived from the structure of the dialogue have also been used to correct errors and resolve ambiguities introduced by speech recognizers. Speech-to-text systems are faced with a compound problem: not only are current voice recognition systems subject to a high rate of error in classifying speech signals into phonemes, but many unique phoneme sequences yield lexically ambiguous homophones (e.g., hear, here). Fink and Biermann [1986] describe a connected-speech-understanding system capable of executing commands spoken by a user. Their system learns dialogue scripts as it is being used and then exploits the dialogue expectations it has learned to resolve subsequent ill-formed and ambiguous input.

In summary, expectation-based techniques appeal to the intuition that contextual linguistic knowledge at all levels, lexical, syntactic, semantic, discourse structure, and pragmatic, helps to constrain the list of possible word choices when natural ambiguities and errors occur in everyday language. Unfortunately, general computational processing techniques exist for only two of those levels, the lexical and syntactic levels. Techniques for representing and processing semantic, pragmatic, and discourse structure knowledge at sufficiently rich levels of detail are currently impractical beyond limited domains of discourse. So, like acceptance-based techniques, expectation-based techniques are not practical yet for attacking the problems of context-dependent spelling and grammatical-error correction for unrestricted text. However, the less knowledge-intensive relaxation-based techniques, which concentrate almost exclusively on syntax, have made some progress in this area, and the extent of their success is described next.

3.2.4 Parser-Based Writing Aid Tools

One obvious application for context-dependent spelling correction is in writing aid and text critiquing systems. The Unix-based Writer's Workbench system [Cherry and Macdonald 1983] is perhaps the earliest such system. It consists of a set of automatic tools that include, among others, the spell isolated-word spelling corrector, a grep-based pattern-matching tool for detecting some 800 commonly misspelled words (including some real-word misspellings), a word usage tool that describes the proper usage of some 300 words and phrases, a tool for detecting common instances of poor writing style, such as the use of sexist language and stilted phrases (e.g., more preferable, collaborate together, with the exception of), and the style program, which computes a number of statistics about a document including a readability score, sentence lengths, percentage of words in each part-of-speech category, etc. Style makes use of a program called parts that guesses the parts of speech of the words in a document. Its authors claim that it achieves 95% accuracy despite the fact that it uses only a small dictionary of function words, irregular verbs, and suffixes to classify some words and then looks for relations among those words to classify the rest.
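
As a rough illustration of the kind of lightweight part-of-speech guessing attributed to the parts program, the sketch below classifies words from a small function-word list and a few suffix rules, defaulting to noun otherwise; the word lists and rules are invented for the example and are far simpler than the real tool.

# Hedged sketch of suffix/function-word POS guessing in the spirit of `parts`;
# the table entries are illustrative, not the program's actual rules.
FUNCTION_WORDS = {"the": "DET", "a": "DET", "of": "PREP", "in": "PREP",
                  "is": "VERB", "was": "VERB"}
SUFFIX_RULES = [("ly", "ADV"), ("ing", "VERB"), ("tion", "NOUN"), ("ed", "VERB")]

def guess_pos(word):
    w = word.lower()
    if w in FUNCTION_WORDS:
        return FUNCTION_WORDS[w]
    for suffix, tag in SUFFIX_RULES:
        if w.endswith(suffix):
            return tag
    return "NOUN"        # default for unrecognized words, as EPISTLE also does

print([guess_pos(w) for w in "the revision was quickly completed".split()])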

Perhaps the first system to attempt to bring robust natural language parsing to bear on the problems of detecting and correcting real-word spelling and grammar errors was the IBM EPISTLE system [Heidorn et al. 1982], later known as CRITIQUE [Richardson and Braden-Harder 1988]. EPISTLE was originally designed to diagnose five classes of grammatical errors:

(1) subject-verb disagreement: (e.g., Mr. Jones, as well as his assistant, are entitled ...);

(2) wrong pronoun case: (e.g., If I were him ...);

(3) noun-modifier disagreement: (e.g., These report must be checked ...);

(4) nonstandard verb forms: (e.g., The completed manuscript was wrote by ...); and

(5) nonparallel structures: (e.g., We will accept the funds, send receipts to the payers, and crediting their accounts ...).

Note that EPISTLE's robust parsing approach has the advantage of being able to detect even problems that turn on long-term syntactic dependencies.


The EPISTLE parser and grammar were designed and tested reiteratively around a corpus of 411 business letters consisting of 2,254 sentences whose average length was about 19 words. EPISTLE made use of an on-line dictionary of over 100,000 words. By 1982, EPISTLE's core grammar of augmented phrase structure rules consisted of about 300 rules. Its rule-based parser attempted to arrive at a unique parse for each sentence by invoking first its core grammar rules. When multiple parses were found, a general metric for ranking parses [Heidorn 1982] was invoked to select one. When the core grammar failed to produce a complete parse, another technique, called fitted parsing [Jensen et al. 1983], was invoked. The fitted-parsing technique consisted of a set of heuristics for choosing a head constituent [9] and attaching remaining constituents to form a complete sentence. In 1981, 64% of the sentences in the test corpus could be parsed using EPISTLE's core grammar alone. By mid-1982, when the fitted-parsing technique was introduced, the whole corpus could be successfully parsed, 27% via fitted parsing.

Processing of text in EPISTLE began with a dictionary lookup phase that assigned parts of speech to input words, and in the process detected nonword spelling errors and certain awkward phrases. Unrecognized words were assigned a default part of speech, usually noun. Next came the parsing phase, during which the detection and correction of real-word grammatical errors occurred. The parsing process might require multiple passes; if a successful parse was not obtained on the first pass, it was retried with selected grammatical constraints being relaxed. The result of the parsing phase was a parse tree that highlighted any relaxed constraints and suggested corrections. A final EPISTLE processing phase diagnosed stylistic problems such as sentences that were too long.

9 Roughly speaking, a constituent is a portion of the syntactic tree structure of a sentence that comprises a complete phrasal unit, such as a noun phrase, verb phrase, or prepositional phrase.

EPISTLE's authors acknowledged that its five original error classes "do not cover all possible grammar errors in English, but they do address those that are most often mentioned in the literature and most frequently found in the observed correspondence" (p. 319) [Heidorn et al. 1982]. Over the years, EPISTLE's grammar and error coverage were extended so that by 1985, its revised version, named CRITIQUE [Richardson and Braden-Harder 1988], covered over 100 grammar and style errors and had been tested on four text genres: business correspondence, freshman student essays, ESL (English as a Second Language), and professional writing. CRITIQUE's output on 10 randomly selected samples from each of these genres was analyzed, and its advice was classified into three categories: correct, useful, or wrong. The useful category designated cases in which CRITIQUE detected a problem correctly but was not exactly correct in its diagnosis. Results of the analysis are shown in Table 3.

Perhaps the lower performance level for professional text might be due in part to a lack of coverage of the lexicon and grammatical structure of the professional domain. Unfortunately, we are not told how many errors CRITIQUE overlooked. The authors indicate they believe CRITIQUE to be "most helpful on straightforward texts before they are significantly revised" and that "In general, CRITIQUE also appears to be more accurate on texts with a shorter average sentence length" [Richardson and Braden-Harder 1988, pp. 201-202].

Table 3. CRITIQUE's Error Correction Ability

Genre               Average Sentence Length   Correct Advice   Correct + Useful Advice
Business Corr.      18 words                  73%              82%
Student Essays      16 words                  72%              86%
ESL Text            21 words                  54%              87%
Professional Text   22 words                  39%              41%

A parser-based text-editing system for the Dutch language has produced some encouraging results [Kempen and Vosse 1990]. Although its authors are careful to point out that the system was not designed to be a full-scale grammar and style checker, it does appear to function well as a general-purpose tool for detecting and correcting both nonword and real-word syntactic errors. It incorporates the triphone-based isolated-word spelling corrector of Van Berkel and De Smedt described in Section 2.2.4 and a unification-based parser consisting of some 500 augmented phrase structure rules. It also has access to a 100,000-word dictionary that is cross-indexed to inflected forms, giving it a total lexicon of about 250,000 words.

It processes a document in four stages: (1) preprocessing, (2) word-level analysis, (3) sentence-level analysis, and (4) text resynthesis. The preprocessing stage removes typesetting markup symbols and corrects punctuation. The word-level analysis stage calls upon the Van Berkel and De Smedt spelling corrector to generate a small set (~5) of candidate corrections, including homophones, for each word type in the document. In the sentence-level analysis stage, a unification-based parser is called to produce a high-level parse. Unification rules are relaxed to allow for syntactic violations, which are then flagged. When violations are detected, the parser attempts to construct a valid parse using candidates from the correction set. Finally, in the text resynthesis stage the system regenerates the original document with suggested corrections and error diagnostic messages.

Kempen and Vosse [1990] evaluated the system on a standard 150-sentence spelling correction exam for typists that contained 75 correct sentences and 75 incorrect sentences; 59 of the incorrect sentences had nonword errors; 14 had real-word syntactic errors; and 2 had real-word idiomatic errors. The system gave no false alarms on the 75 correct sentences; it detected 56 of the 59 nonword errors [10] and corrected 55 of them; it detected 9 of the 14 syntactic errors and corrected 7 of them; it overlooked the 2 idiomatic errors. All this required under 10 seconds of CPU time on a DECstation 3100. A more recent version of the system, which will soon be available commercially [Vosse 1992], was tested on the same test set. It left only 3 errors undetected, correcting the other 72 and causing no false alarms. This system was also tested on a random sample of 1,000 lines from two texts on employment legislation submitted for publication. The sample contained 6,000 words with 30 spelling errors. The system detected 28 of the errors and accurately corrected 14 of them while producing 18 false alarms.

Recent research has led to significant advances toward robust natural language parsing of unrestricted English text [11]. Given this progress, some useful commercial systems for real-word grammatical-error correction based on robust parsing and syntactic-constraint relaxation techniques should be appearing in the not-too-distant future.

10 The nonwords were overlooked because the system's morphological analyzer construed them as legal compounds.

11 See, for example, Hindle's [1983] Fidditch parser, Abney's [1990] CASS parser, and Sleator's [1992] Link Grammar.

3.3 Statistically Based Error Detection and Correction Experiments

An alternative approach to context-dependent spelling correction is the statistical language-modeling approach. Statistical language models (SLMs) are essentially tables of conditional probability estimates for some or all words in a language that specify a word's likelihood to occur within the context of other words. For example, a word trigram SLM specifies the probability distribution for the next word conditioned on the previous two words; a part-of-speech-bigram SLM specifies the probability distribution for the part of speech of the next word conditioned on the part of speech of the previous word; and a collocation SLM specifies the probability distributions for certain words to occur within the vicinity (e.g., within five places in either direction) of another linguistically related word. Some of the various speech and text-processing applications for which SLMs have been compiled include speech recognition [Bahl et al. 1983; Jelinek et al. 1991], information retrieval [Choueka 1988], text generation [Smadja 1990a, 1991b], syntactic tagging and parsing [Garside et al. 1987; Church 1988], machine translation [Brown et al. 1990b], semantic disambiguation [Brown et al. 1991], and a stenotype transcription application [Derouault and Merialdo 1984b].
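
For readers unfamiliar with the mechanics, the following is a minimal sketch of a word trigram SLM estimated by maximum likelihood from a toy corpus; real models are built from tens of millions of words and need the smoothing techniques discussed below, and nothing here is specific to any of the systems cited above.

# Minimal sketch: maximum-likelihood word-trigram probabilities from a toy corpus.
from collections import Counter

corpus = "the doctor saw the nurse and the nurse saw the doctor".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    # Maximum-likelihood estimate of P(w3 | w1, w2).
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(trigram_prob("saw", "the", "nurse"))   # 0.5 in this toy corpus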

The probability estimates comprising SLMs are derived from large textual corpora, where "large" is currently defined as tens of millions of words. Only recently have the computational resources become available and the sizes of machine-readable corpora begun to approach the dimensions required to make the compilation and use of such models feasible. Note, for example, that the size of a word trigram probability matrix grows cubically with the size of the lexicon, so that a full-scale word trigram model for a lexicon of only 5,000 words would have 5,000³ = 125 billion entries. Clearly, the majority of those entries should be assigned zero probabilities. But because of Zipf's law [Zipf 1935], which states that the frequency of occurrence of a term is roughly inversely proportional to its rank, there will always be a large tail of terms that occur only once. So even with the availability of corpora as large as 100 million words, hence 100 million trigrams, computing nonzero probabilities is a nontrivial statistical estimation problem. To alleviate this problem, some clever estimation techniques have been devised [Bahl et al. 1983, 1989; Brown et al. 1990a], and some comparative studies of estimation procedures have been carried out [Gale and Church 1990; Church and Gale 1991b]. The latter studies were done in the context of word bigram models, which are somewhat more manageable in size than word trigram models.

Other more manageably sized SLMs have been explored. For example, part-of-speech (POS) bigram and trigram models have been compiled [Garside et al. 1987; Church 1988; Derouault and Merialdo 1984a] based on expanded part-of-speech category sets of approximately 100 categories (for example, plural-possessive-personal-pronoun, e.g., our; reflexive-indefinite-pronoun, e.g., oneself; superlative-degree-adverb, e.g., most; superlative-general-adverb, e.g., best; etc.). Note that a POS bigram model based on 100 categories will have at most 10,000 POS bigram entries for which probabilities must be compiled. Manageably sized collocation-based SLMs have been compiled also [Choueka 1988; Smadja 1991a]. Since the goal of collocation-based models is to locate and estimate probabilities of co-occurrence for only those terms that enter into statistically significant collocation relations, the resulting models contain fewer table entries. The price that is paid for the computational manageability of these smaller models is reduced predictive power: POS n-gram models are able to predict only the part of speech of the next word as opposed to the word itself, and collocation models provide predictions for only some terms. Nevertheless, by combining POS n-gram probabilities or collocation probabilities with other lexical statistics, such as unigram word frequency or relative probability that a word will have a particular part of speech, these models have been put to productive use in certain applications, including the syntactic tagging and parsing, stenotype transcription, information retrieval, and text generation applications cited above.

The statistical language-modeling approach to context-dependent word correction shares a basic premise with expectation-based NLP approaches: namely, that contextual information can be used to help set a priori expectations for possible word choices. Thus, low-probability word sequences might be used to detect real-word errors, and high-probability word sequences might be used to rank correction candidates. Word-based SLMs (as opposed to SLMs based on syntactic POS n-grams) might even capture some inherent semantic relations, e.g., the semantic relation between the words doctor and nurse, without the need for an explicit, handcrafted, formal semantic model. Indeed, all SLMs are compiled automatically. One limitation of both word-based and POS-based bigram and trigram SLMs is that they are confined to representing only local relations (i.e., two or three contiguous terms).

It is not immediately clear how successful various SLMs might be at either detecting or correcting real-word errors. Three recent studies, one using a POS bigram model [Atwell and Elliott 1987], one using a word bigram model [Gale and Church 1990; Church and Gale 1991a], and one using a word trigram model [Mays et al. 1991], were designed specifically to shed some light on these questions. In the first study, researchers attempted to determine how well a POS bigram model could detect real-word errors. In the second study, the ability of a word bigram model to improve the accuracy of a nonword corrector was tested. In the third study, the ability of a word trigram model both to detect and correct real-word errors was tested. The results of all three studies are reviewed here.

3.3.1 Detecting Real-Word Errors via Part-of-Speech Bigram Probabilities

One successful application of statistical-language modeling to a natural language text-processing problem is the syntactic tagging and parsing project undertaken by the University of Lancaster's Unit for Computer Research on the English Language (UCREL) [Garside et al. 1987]. The UCREL prototype system includes a probabilistic word tagger called CLAWS (Constituent Likelihood Automatic Word-tagging System) that makes use of a syntactic-tag set consisting of 133 part-of-speech categories and a part-of-speech bigram SLM. POS bigram transition probabilities were estimated from a tagged corpus of one million words. CLAWS first assigns one or more syntactic tags to each word in a sentence by looking up a word's potential parts of speech in a dictionary. Approximately 65% of all words (i.e., tokens) receive only one tag when processed in this manner. Remaining ambiguities are resolved by selecting the tag of highest probability according to POS bigram transition probabilities. This is accomplished by computing the path of maximum probability for a sequence of one or more ambiguously tagged words bounded at either end by unambiguously tagged words. CLAWS ultimately assigns an unambiguous part-of-speech tag to every word with an overall accuracy rate of 96-97%.
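
The disambiguation step can be sketched as a search for the highest-probability tag path through an ambiguous span; the sketch below simply enumerates the paths over a toy tag set and toy transition probabilities (a real tagger such as CLAWS uses a 133-tag set and dynamic programming rather than enumeration).

# Sketch: pick the maximum-probability tag path through an ambiguous span,
# scored with POS-bigram transition probabilities (toy values, not CLAWS's).
from itertools import product

trans = {
    ("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.01,
    ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.2,
    ("VERB", "DET"): 0.3,  ("VERB", "NOUN"): 0.1,
}

def best_tag_path(candidate_tags):
    # candidate_tags: one set of possible tags per word; the first and last
    # words are assumed unambiguous, as in the spans CLAWS disambiguates.
    best_path, best_score = None, 0.0
    for path in product(*candidate_tags):
        score = 1.0
        for t1, t2 in zip(path, path[1:]):
            score *= trans.get((t1, t2), 1e-6)   # tiny floor for unseen bigrams
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

# Middle word is ambiguous between NOUN and VERB; the bigram model resolves it.
print(best_tag_path([{"DET"}, {"NOUN", "VERB"}, {"VERB"}]))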

Two UCREL researchers, Atwell and Elliott [1987], set out to explore the utility of CLAWS for detecting and correcting real-word errors. Prior to experiments, they studied a small sample of errors to determine what proportion fell into each of the following categories: (1) nonword errors; (2) real-word errors in locally invalid syntactic contexts; (3) real-word errors in globally invalid syntactic contexts; and (4) syntactically valid real-word errors in invalid semantic contexts. They hypothesized that CLAWS might be helpful in detecting and correcting errors in the second and perhaps third categories. The results of their brief study (shown in Table 2 in Section 3.1) indicated that nearly half of the errors they found in published text were real-word errors, and over three-quarters of those were due to either local or global syntactic violations. So they went on to test the ability of CLAWS to detect these types of errors.

By manually examining a corpus of 13,500 words of text taken from four different sources, Atwell and Elliott [1987] and their colleagues identified 502 real-word errors. Of those, 420 errors fell into categories 2 or 3, and 82 fell into category 4. Since category 4 errors would be impossible to detect via syntactic analysis alone, these errors were separated from the others. Note that category 3 errors, which turn on global syntactic relations, are also unlikely to be detected by POS bigram probabilities, but they were not separated from the others.

Next, Atwell and Elliott [1987] ran the CLAWS tagger to assign syntactic-category tags based on POS bigram transition probabilities to the entire 13,500-word corpus and to record the resulting POS bigram transition probabilities. They hypothesized that a sequence of two low POS bigram probabilities might indicate an occurrence of an error, and they set about determining an optimal threshold setting for those probabilities such that the number of errors detected would be maximized while the number of false alarms would be minimized. The optimal setting they found resulted in a detection rate of 62% and a precision rate of 35%. That is, 260 of the 420 errors were detected, and 753 words in all were flagged as errors. Unfortunately, changing the thresholds slightly only raised the precision rate to 38% while sacrificing the detection rate to the level of 47%.
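
A minimal sketch of the detection heuristic itself, flagging a word when the transition probabilities on both sides of it fall below a threshold, might look as follows; the probabilities and the threshold are toy values, not those estimated by Atwell and Elliott.

# Sketch: flag a word as a possible real-word error when both surrounding
# POS-bigram transition probabilities are low (toy probabilities/threshold).
def flag_suspects(tags, trans_prob, threshold=0.05):
    # tags: one POS tag per word; trans_prob(t1, t2) -> P(t2 | t1).
    suspects = []
    for i in range(1, len(tags) - 1):
        left = trans_prob(tags[i - 1], tags[i])
        right = trans_prob(tags[i], tags[i + 1])
        if left < threshold and right < threshold:
            suspects.append(i)      # index of the suspicious word
    return suspects

toy = {("DET", "NOUN"): 0.6, ("NOUN", "VERB"): 0.4, ("DET", "VERB"): 0.02,
       ("VERB", "NOUN"): 0.1, ("VERB", "VERB"): 0.01}
prob = lambda a, b: toy.get((a, b), 0.001)

# An implausible DET -> VERB -> VERB run makes the second word a suspect.
print(flag_suspects(["DET", "VERB", "VERB", "NOUN"], prob))   # [1]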

These results indicate that raw POS bigram transition probability scores alone perform relatively poorly as a diagnostic tool because it is impossible to choose a threshold that ensures the detection of a high number of errors without permitting an unreasonable number of false alarms. However, the original implementation of CLAWS did not assign probabilities to the candidate tags for words (except for marking some tags as extremely rare). So it might be possible to add candidate tag probabilities, as in some other POS n-gram models [Church 1988; Derouault and Merialdo 1984a], thereby improving the error-detecting capabilities of CLAWS. Nevertheless, there is an inherent limitation in the correction power of POS bigram probabilities. That is, if multiple candidates share the same preferred POS tag, the tag loses its discrimination power.

Atwell and Elliott [1987] considered their results to be reasonably promising, but they acknowledged a need to incorporate other sources of contextual knowledge into a real-word spelling corrector. They have proposed a hypothetical context-dependent spelling correction scheme in which an isolated-word spelling corrector would be used to generate first a relatively small candidate set (~5 words) for every word in a sentence. Following that, a relative likelihood rating would be ascribed to each candidate by considering five factors: (1) the degree of similarity of the candidate to the typed word; (2) part-of-speech bigram transition probabilities; (3) probabilities of individual words; (4) collocation probabilities; and (5) domain-dependent lexical preferences. They did not try to implement this scheme because they deemed it to be too computationally expensive.
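
Although the scheme was never implemented, its flavor can be suggested with a hedged sketch that combines the five factors as a weighted sum; the weights, the use of a string-similarity ratio, and the toy candidate scores below are assumptions, since the proposal names the factors but gives no formula.

# Hedged sketch of a five-factor likelihood rating; weights and factor
# implementations are invented for illustration, not Atwell and Elliott's.
import difflib

def rate(candidate, typed, pos_bigram_score, unigram_prob,
         collocation_score=0.0, domain_pref=0.0,
         weights=(0.4, 0.2, 0.2, 0.1, 0.1)):
    similarity = difflib.SequenceMatcher(None, candidate, typed).ratio()
    factors = (similarity, pos_bigram_score, unigram_prob,
               collocation_score, domain_pref)
    return sum(w * f for w, f in zip(weights, factors))

# Rank a small candidate set for the typed word "form" (toy context scores).
candidates = {"form": (0.3, 0.02), "from": (0.6, 0.05), "farm": (0.1, 0.01)}
ranked = sorted(candidates,
                key=lambda c: rate(c, "form", *candidates[c]), reverse=True)
print(ranked)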

3.3.2 Improving Nonword Correction Accuracy via Word Bigram Probabilities

A more general (though still local) probabilistic approach to word correction relies on the use of word bigram or trigram probabilities rather than just POS bigram probabilities. Although the word n-gram approach has the disadvantage of requiring far more data (e.g., a 50,000-word dictionary gives rise to 50,000² word bigrams vs. 133² POS bigrams), it has the potential for detecting and correcting not only syntactic anomalies but also semantic anomalies that may be syntactically valid. Word n-gram approaches to context-dependent spelling correction have been explored in at least two studies. One study, by Church and Gale [1991a], demonstrated the potential of word bigrams to improve the accuracy of isolated-word spelling correctors. It also demonstrated the importance of obtaining careful estimates of word bigram probabilities [Gale and Church 1990]. Another, by Mays et al. [1991], demonstrated the detection/correction power of word trigrams. It also identified an optimum setting for an a priori likelihood estimate that the actual word typed was the correct word.

The Church and Gale [1991a] study did not address the detection problem. Instead, it tested a context-sensitive technique for improving the accuracy of the correct isolated-word spelling correction program (described in Section 2.2.5). Church and Gale's technique took as input a nonword together with its immediate left and right contextual neighbors. First, correct was used to generate a set of correction candidates and their probabilities for the nonword. Next, word bigram probabilities were used to estimate left and right context conditional probabilities for each candidate. Finally, candidates were scored and ranked by computing a simple Bayesian combination of the three probabilities for the candidate itself and its left and right context conditional probabilities. Results obtained from the context-sensitive ranking were then compared to results obtained from the correct ranking alone. Experimental results indicated that, for a test set of 329 errors, the context-sensitive ranking did in fact achieve an improvement in correction accuracy over the context-free ranking. The improvement was from 87% to almost 90%, which the authors found to be significant.
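
The ranking step can be sketched as follows, with a candidate's corrector-assigned probability multiplied by word bigram probabilities for its left and right neighbors; the numbers are illustrative stand-ins for estimates from a large corpus, and the function is not Church and Gale's implementation.

# Sketch: combine an isolated-word corrector's candidate score with left and
# right word-bigram context probabilities (toy values).
def context_score(candidate_score, left, candidate, right, bigram_prob):
    return (candidate_score
            * bigram_prob(left, candidate)
            * bigram_prob(candidate, right))

bigrams = {("a", "verse"): 0.003, ("verse", "by"): 0.004,
           ("a", "versa"): 0.0001, ("versa", "by"): 0.00005}
bprob = lambda w1, w2: bigrams.get((w1, w2), 1e-7)

# Candidates for the nonword "vers" seen in the context "... a vers by ...".
candidates = {"verse": 0.6, "versa": 0.4}
ranked = sorted(candidates,
                key=lambda c: context_score(candidates[c], "a", c, "by", bprob),
                reverse=True)
print(ranked)   # the context probabilities favor "verse"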

Gale and Church [1990] also found that the performance of the context-sensitive technique was highly sensitive to the method used to estimate conditional probabilities given word unigram and bigram frequencies. Their unigram and bigram frequencies were tabulated from a 44 million-word corpus of AP news wire text. Of all the estimation techniques they tried, including various combinations of maximum-likelihood estimation, expected-likelihood estimation, and Good-Turing estimation [Good 1953; Church and Gale 1991b], the only combination to achieve a significant improvement in performance was one that combined a Good-Turing estimation with an expected-likelihood estimation.
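
For reference, the following is a minimal sketch of the textbook Good-Turing re-estimation formula, in which an n-gram seen r times receives the adjusted count (r + 1) N_{r+1} / N_r, where N_r is the number of distinct n-grams seen exactly r times; it is not the exact variant Gale and Church combined with expected-likelihood estimation.

# Sketch of textbook Good-Turing count re-estimation for n-grams.
from collections import Counter

def good_turing_counts(ngram_counts):
    freq_of_freq = Counter(ngram_counts.values())          # N_r
    adjusted = {}
    for ngram, r in ngram_counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N_{r+1} is zero (sparse high counts).
        adjusted[ngram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

counts = Counter({("the", "doctor"): 3, ("the", "nurse"): 1,
                  ("a", "nurse"): 1, ("the", "chart"): 2})
print(good_turing_counts(counts))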

3.3.3 Detecting and Correcting Real-Word Errors via Lexical Trigram Probabilities

The study by Mays et al. [1991] addressed the simultaneous detection and correction capability of word trigrams. It employed a word trigram SLM based on a 20,000-word lexicon that was compiled originally for IBM's speech recognition experiments. It also employed a set of 100 sentences containing only words found in that lexicon. The researchers created a set of 8,628 erroneous sentences, each containing one real-word error, by generating every possible single-error real-word misspelling of each word in all 100 sentences and by substituting "all of the misspellings for all of the words exactly once." This error set, together with the 100 original sentences, was used to test the ability of word trigrams to detect and correct real-word errors.

Their combined detection/correction technique consisted of computing a Bayesian probability score for each sentence in a cohort set, where the cohort set comprised one of the 8,728 test sentences together with all other sentences generated by replacing each of its words with all of its potential real-word misspellings, including the correct spelling. Given a cohort set, they could determine how often some other sentence had a higher probability than the erroneous sentence (error detection) and how often the correct sentence from the original 100 had the highest probability (error correction).

A Bayesian probability score for each sentence was computed by taking the product of the word trigram probabilities of each word in the sentence multiplied by an a priori probability for each word. The a priori probability represents the likelihood that a typed input word is actually the intended word. The set of potential real-word alternatives to the typed input word is called the confusion set and is made up of all possible single-error real-word transformations of the typed input word. The authors assume that the a priori probability for a typed input word is some constant α, and that the remaining probability (1 − α) is divided equally among the other words in the confusion set. If α is set too high the result will be a tendency to retain typed input words even if they are incorrect; if α is set too low the result will be a tendency to change typed input words even if they are correct. The authors tested a range of values for α and found that the optimum value was between 0.99 and 0.999.
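
The scoring just described can be sketched as follows: a candidate sentence's trigram probability is multiplied by a prior that gives each typed word probability α and spreads 1 − α over its confusion set; the trigram model and confusion sets below are toy stand-ins for the 20,000-word IBM model, not the study's actual data.

# Sketch: Bayesian sentence score = trigram-model probability x alpha-based
# prior over each word's confusion set (toy model and confusion sets).
def sentence_score(candidate, typed, trigram_prob, confusion_sets, alpha=0.99):
    padded = ["<s>", "<s>"] + list(candidate)
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(padded[i - 2], padded[i - 1], padded[i])
    for cand_w, typed_w in zip(candidate, typed):
        if cand_w == typed_w:
            p *= alpha
        else:
            p *= (1 - alpha) / max(len(confusion_sets.get(typed_w, ())), 1)
    return p

def toy_trigram(w1, w2, w3):
    # Toy model: favor two plausible word pairs, flat probability otherwise.
    good = {("made", "from"), ("from", "scratch")}
    return 0.2 if (w2, w3) in good else 0.001

confusion = {"form": {"from", "farm"}}               # toy confusion set
typed = ["made", "form", "scratch"]
for cand in (typed, ["made", "from", "scratch"]):
    print(cand, sentence_score(cand, typed, toy_trigram, confusion))

In this toy run the alternative sentence with "from" outscores the typed sentence, which is the sense in which the same score both detects and corrects the real-word error.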


Mays et al. [1991] obtained overall test results that yielded a detection score of 76% and a correction score of 74%. These results are informative because they provide an approximate upper bound on the capacity of lexical trigram SLMs to perform real-word error detection and correction on sentences containing single real-word errors consisting of single-error typos. A system that could detect 76% of real-word errors would probably be very useful. However, it is not yet clear what level of performance could be achieved in practice because it is unclear to what extent genuine error data conform to the one-error-per-sentence, one-typo-per-word random error distribution of the artificially generated test.

Furthermore, the computational requirements of the technique were manageable under the artificial test conditions since sentences could be pregenerated and Bayesian probability scores could be precomputed. But computational requirements would certainly be problematic if cohort sets for multitypo, multiword errors had to be generated and their Bayesian scores computed on the fly. Certainly, candidate sets for individual words could be pregenerated and ranked, but it somehow seems excessive to generate a complete cohort set of all possible sentence alternatives in order to detect a possible error, especially if multiword errors are allowed. Intuition suggests that localizing the error somehow before generating the complete sentential cohort set would be useful. Might word trigrams perhaps be more successful than POS n-grams for localizing errors? Might unigram probabilities (based on word frequencies) be factored into the a priori probabilities ascribed to input words and the words in their confusion sets to better reflect error likelihoods? Obviously, there is still room for further investigation and invention.

3.3.4 Related Neural Net Research

One other area of research that has barely begun to be explored for context-dependent word recognition and error correction is that of neural net (NN) modeling. Neural net models have much in common with statistical language models in that both attempt to capture contextual expectations by modeling the conditional probability distributions that represent the strengths of associations between terms. Their main difference is that in SLMs expectations are explicitly represented as conditional probabilities, whereas in NNs they are implicitly represented as weights in the nets. Another non-negligible difference is the intense computational storage and processing requirements of neural nets. For SLMs, hundreds of thousands of conditional probabilities for lexicons of many thousands of words can be incrementally compiled and stored on disks. In contrast, for NNs, thousands of nodes representing words and hundreds of thousands of weights representing strengths of associations between words must be fully contained in working memory during training and processing. Unfortunately, those demands make NN models impractical for vocabularies larger than a few thousand words at most, at least until specialized neural net hardware and software or clever neural net modularization techniques become available.

Nevertheless, some peripherally relevant neural net experiments have been carried out by restricting NN models to represent only syntactic parts of speech or by severely limiting vocabulary size. One of the earliest such experiments was performed by Hanson and Kegl [1987], who used the syntactically tagged Brown corpus [Kucera and Francis 1967] to successfully train a neural net to predict the next part of speech based on the parts of speech of previous words in the sentence. More recently, Allen and Kamm [1990] trained a net to recognize words in an input stream of continuous phonemes, using a limited vocabulary of 72 words. Most recently, Gallant [1991] proposed a neural net model that employs a context vector representing the strengths of associated terms for use in a word sense disambiguation task. These experiments and proposals only hint at the possibilities for neural nets to represent and exploit contextual information. However, those possibilities will not be fully explored until more practical neural net computational tools and techniques become available.

3.4 Summary of Context-Dependent Word Correction Work

Research in context-dependent spelling correction techniques is in its infancy. Despite limited empirical studies of the pervasiveness and character of real-word errors in text-processing applications, the need for context-dependent spelling correction and word recognition techniques is clear. The few data points that do exist suggest that real-word errors account for anywhere from 25% to well over 50% of all textual errors depending on the application. One study indicates further that as much as 75% of real-word errors involve syntactic violations. This latter finding, which begs further empirical confirmation, suggests the value of exploring syntactic techniques for detecting and correcting such errors.

The two main approaches being explored for exploiting contextual information are traditional NLP (Natural Language Processing) and SLM (Statistical Language Modeling). For practical use on unrestricted domains, NLP techniques are currently limited to syntax-driven approaches. One research prototype NLP system, EPISTLE, offered accurate or useful corrections 82%, 87%, and 41% of the time when tested on corpora containing genuine errors in the genres of business correspondence, student essays, and professional text, respectively. Another syntactic rule-based parser/corrector for the Dutch language detected and accurately corrected 72 of 75 errors, including 14 real-word errors, on a standard secretarial typing exam, without generating any false alarms.

SLM techniques may be explicitly syntactic in their approach, as when they exploit part-of-speech statistics, or they may be implicitly semantic, as when they exploit word co-occurrence statistics, which tend to reflect latent semantic relations. One study demonstrated the effectiveness of word bigram statistics for improving the accuracy of a nonword corrector. But the problem of detecting real-word errors remains a significant challenge. One real-word error detection study demonstrated that a syntactic part-of-speech bigram model was able to detect 62% of 420 real-word errors that involved local or global syntactic-constraint violations, but not without generating nearly twice as many false alarms. Another study demonstrated that a word trigram model accurately detected real-word errors 76% of the time and accurately corrected those errors 74% of the time, but not without significant computational overhead resulting from the need to generate and test a large number of alternative sentences for each potentially erroneous input sentence. Further research is needed to devise computationally feasible real-word error detection techniques that are sensitive enough to attain high correction rates and discriminating enough to maintain low false alarm rates.

Neural net approaches will remain constrained to limited vocabularies until efficient neural net hardware and software become available. Most importantly, these and other developments now point toward a host of new techniques that could add significant intelligence to our computational text- and speech-processing systems. Hence the topic of the epilogue.

FUTURE DIRECTIONS

Words are a basic unit of communication common to all natural language text and speech-processing activities. But the physical signals that embody words, whether realized in electronic, acoustic, optical, or other forms, frequently arrive at their destinations in imperfect condition. Therefore, automatic spelling correction, or, more generally, automatic word recognition, is an essential, nontrivial, practical problem at the heart of all computational text- and speech-processing applications. Before any text understanding, speech recognition, machine translation, computer-aided learning, optical character, or handwriting recognition system can achieve a marketable performance level, it must tackle the pervasive problem of dealing with noisy, ill-fitting, novel, and otherwise unknown input words.

Now is a particularly exciting time to be studying the problem of recognizing and correcting words in text and speech. In addition to the variety of techniques that are beginning to show potential, including rule-based, statistical, and neural net techniques, various efforts are underway, under the auspices of the Association for Computational Linguistics, to accumulate a "critical mass" of sharable text, including large textual corpora, machine-readable dictionaries, transcriptions of speech, bilingual corpora, etc., and to develop tools for annotating that text [Walker 1991; Liberman and Walker 1989]. Past research has shown that by bootstrapping from manually annotated corpora (with part-of-speech tags, for example), automatic techniques can be developed for annotating ever larger corpora, which can then be used for exploring higher-level text-processing techniques. Some manual efforts are underway, and some automatic tools are already being developed for some of the lower- and mid-level tasks involved in text processing, including formatting, tagging, aligning, and extracting lexical, syntactic, and even latent semantic statistics. These tools and their resultant annotated corpora and statistical language models will pave the way for developing hybrid rule-based, statistical, and neural net systems for higher-level text-processing tasks, including real-word error detection and correction, parsing, generation, semantic understanding, and even pragmatic and discourse modeling. As the increasing body of tools, annotated corpora, and statistical language models becomes generally available, researchers will be able to build on each other's results for ever larger domains of discourse. In short, the synergistic effects resulting from the shared use and combination of these techniques and resources could be explosive.

So what are some of the possible error detection and correction techniques that might be explored? Hybrid rule-based/statistical parsers capable of probabilistic error detection and correction are inevitable. The prerequisite for such systems is the general availability of lexical and part-of-speech n-gram statistical language models. As mentioned in Section 3.4, a variety of potential statistical techniques has yet to be explored for localizing errors. These include the use of word n-grams and the use of a combination of part-of-speech trigram statistics and part-of-speech probabilities for individual words. Other creative combinations of statistics will certainly be invented. Abandoning the assumption that the word is the basic unit of processing will open up a host of word recognition, error correction, and other processing possibilities. Not only might lexical or part-of-speech n-grams be treated as basic units, but so also might complex units built of small syntactic structures that are fully or only partially lexically instantiated [12]. Statistical language models as well as annotated dictionaries based on such syntactic structures will certainly become available as corpora-tagging efforts progress.

12 See Joshi's [1985] Tree-Adjoining-Grammar (TAG) formalism, for example.

In the long term, the arrival of high-speed hardware and software for large-scale neural net processing and the availability of sufficiently large tagged training corpora could mark a watershed in computational language processing. Neural nets could be trained to capture and adaptively reflect the changing lexical, syntactic, and error probability statistics of given discourse domains or user populations. Even in the short term, the possibility exists for hybrid neural net/statistical techniques to improve the error detection capabilities of existing statistical techniques by addressing the precision-recall problem. For example, a neural net might be trained to improve the discrimination capability of part-of-speech transition probability statistics by distinguishing between genuine errors and false alarms.

A rough diagram for a hypothetical, large-scale, expectation-based word recognition/error detection system that would exploit many of the sources of linguistic knowledge available to humans might look something like the model in Figure 1.

The central premise, hence the central component, of the model is the postulation of a lexical processing unit whose job is to intelligently combine probabilistic estimates of the identity of a word in a text input stream. The estimates come from three primary sources: the input signal itself, feedback from a syntactic processing unit, and feedback from a semantic processing unit.

Figure 1. A Context-Dependent Word Recognition Model. [Diagram: a Text Input Stream feeds an Isolated-Word Recognizer, whose probabilistically ranked candidate list passes to a Lexical Processing Unit (with a Lexical Activation Unit); that unit exchanges rankings with a Syntactic Processing Unit (containing a Probabilistic Parser) and a Semantic Processing Unit.]

In the first phase of processing, a token from the text input stream, which may be a correct word, a nonword, or a real-word error, is fed to the Isolated-Word Recognizer. This unit produces a probabilistically ranked list of word candidates based on orthographic, phonetic, and visual similarities as well as error probabilities. Thus, it is the locus of nonword spelling error detection, and it is also the source of candidate corrections for real-word errors. The output of this unit could be thought of as a large vector representing the entire lexicon, or it could be treated as a truncated list of the top-ranked lexicon entries given the input signal and the orthographic and phonemic characteristics of the lexicon and the error probability characteristics of the linguistic domain. In either case, that output becomes input to the Lexical Processing Unit.

The Lexical Processing Unit contains a Lexical Trigram Model, a Collocation Model, and a Lexical Activation Unit. The latter retains some short-term memory for the past few words that have been processed, which it uses in combination with conditional probabilities from the lexical trigram and collocation models to rerank the candidate list to take lexical context into account. The output of the Lexical Processing Unit is a new list of word candidates whose probabilistic ranking is dependent on lexical context.

That output can be simultaneously sent to both a Syntactic Processing Unit and a Semantic Processing Unit. These units are the locus of real-word error detection capabilities. Input to the Syntactic Processing Unit goes to a Probabilistic Parser. The parser makes use of a Lexical Part-of-Speech Model to ascribe part-of-speech probabilities to the ranked input word candidates. It also employs some short-term memory of the parts of speech of previously processed words so that it can exploit a Part-of-Speech Trigram Model during parsing. In the process of parsing it recomputes the probabilities of the words in the candidate list based on syntactic considerations. A potential error is signaled if the top-ranked candidate changes.

A similar process is carried out by the Semantic Processing Unit. Its Semantic Interpreter may employ a Latent Semantic Association Model that was precomputed from lexical statistics, Local and Global Discourse Models that retain short- and long-term memories of the topic and focus of the discourse represented by previously processed terms, and a Pragmatic Model that represents plans and goals of the system and user. Like the Syntactic Processing Unit, the Semantic Processing Unit also recomputes the probabilities of the words in the candidate list. Again, a change in the top-ranked candidate may signal a potential real-word error.

The output streams of ranked candidates and their probabilities from both the Syntactic Processing Unit and the Semantic Processing Unit are fed back into the Lexical Processing Unit to be recombined with the unchanged ranked list from the Isolated-Word Recognizer. The newly ranked candidate list produced by the Lexical Processing Unit is sent again to both the Syntactic and Semantic Processing Units, which in turn recompute their probability rankings and feed them back to the Lexical Processing Unit. This process is repeated until all three units agree on the top-ranked candidate. Of course, on rare occasions unresolvable conflicts may occur, much like the conflicts of meaning and form found in Escherian paintings, so some mechanism must be incorporated to detect such conflicts.
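
As a hedged sketch of this feedback loop, the fragment below lets placeholder syntactic and semantic units repeatedly rerank a candidate list until they agree on a top candidate, with an iteration cap standing in for the conflict-detection mechanism the text calls for; none of the reranking functions corresponds to an actual component of the model.

# Sketch of the iterative reranking loop in Figure 1; all units are placeholders.
def recognize(candidates, rerankers, max_iters=10):
    # candidates: dict word -> probability from the isolated-word recognizer;
    # rerankers: functions that take and return such a dict.
    ranking = dict(candidates)
    for _ in range(max_iters):
        reranked = [f(ranking) for f in rerankers]
        tops = {max(r, key=r.get) for r in reranked}
        # Recombine the units' scores in the lexical processing unit.
        ranking = {w: sum(r[w] for r in reranked) / len(reranked)
                   for w in ranking}
        if len(tops) == 1 and max(ranking, key=ranking.get) in tops:
            return max(ranking, key=ranking.get)
    return None   # unresolved conflict

# Toy units: both the syntactic and semantic rerankers prefer "piece".
syntactic = lambda r: {w: p * (2.0 if w == "piece" else 0.5) for w, p in r.items()}
semantic  = lambda r: {w: p * (1.5 if w == "piece" else 0.8) for w, p in r.items()}
print(recognize({"peace": 0.6, "piece": 0.4}, [syntactic, semantic]))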

The individual components of the model might be implemented separately using different techniques. For example, a combination of rule-based and case-based techniques might be appropriate for the Semantic Interpreter; a combination of rule-based and stochastic techniques might be appropriate for the Probabilistic Parser; and neural net technology might someday be appropriate for the Lexical Activation Unit.

The intent of the model in Figure 1 is only to spark the imagination. Many other possible organizations and realizations of the modules in this rough model might be explored, and many practical implementation problems will pose interesting research problems. Indeed, the future of research on real-word recognition and error correction holds much promise for a wide array of creative solutions and useful applications.

ACKNOWLEDGMENTS

This article has benefited significantly from the contributions (and patience) of the editors, Sal March and Dick Muntz, from the helpful input of friends and colleagues who reviewed it, including Vladimir Cherkassky, Fred Damerau, Bill Gale, Joel Gannett, Don Gentner, Roger Mitton, Frank Smadja, and Lynn Streeter, and from many pleasant discussions and/or collaborations with Steve Abney, Al Aho, Alex Bocast, Dave Burr, Vladimir Cherkassky, Ken Church, David Copp, Scott Deerwester, Sue Dumais, Bill Gale, Jonathan Grudin, Stu Haber, Cheryl Jacques, Mark Jones, Len Kasday, Mark Kernighan, Tom Landauer, Michael Littman, Pam Troy Moyer, Sarvar Patel, Debbie Schmitt, Frank Smadja, Scott Stornetta, Jim Tobias, Jean Veronis, and Don Walker.

Received February 1992; accepted June 1992
