
Gauging Similarity with n-Grams: Language-Independent Categorization of Text

Marc Damashek

The author is at the Department of Defense, Fort George G. Meade, MD 20755-6000, USA.

A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.

I report here on a simple, effective means of gauging similarity of language and content among text-based documents. The technique, known as Acquaintance, is straightforward; a workable software system can be implemented in a few days' time. It yields a similarity measure that makes sorting, clustering, and retrieving feasible in a large multilingual collection of documents that span an unrestricted range of topics. It makes no use of words per se to achieve its goals, nor does it require prior information about document content or language. It has been put to practical use in a demanding government environment over a period of several years, where it has demonstrated the ability to deal with error-laden multilingual texts.

Sorting and categorizing the enormous amount of text now available in machine-readable form has become a pressing problem. To complicate matters, much of that text is imperfect, having been derived from existing paper documents by means of an error-prone scanning and character recognition process.

Over the past few decades, many document categorization and retrieval methods [for example, (1-3) and references therein] have relied on the self-evident utility of words, sentences, and paragraphs for sorting, categorizing, and retrieving text (4), and various means of suppressing uninformative words, removing prefixes, suffixes, and endings, interpreting inflected forms, and performing related tasks have been developed. Depending on the application, these methods share a number of potential drawbacks: They require a linguist (or a polyglot) for initial setup and subsequent tuning, they are vulnerable to variant spellings, misspellings, and random character errors (garbles), and they tend to be both language-specific and topic-specific.

A potentially more robust alternative, the purely statistical characterization of text in terms of its constituent n-grams (sequences of n consecutive characters) (5, 6), has sporadically been applied to textual analysis and document processing (7). Recent examples include spelling and error correction (8-14), text compression (15), language identification (16, 17), and text search and retrieval (18-21).

The literature offers no convincing evidence of the usefulness of either approach for the purpose of categorizing text according to topic in a completely unrestricted multilingual environment, that is, an environment that encompasses many different documents containing a nonnegligible number of character errors. The present paper is intended to provide such a demonstration.


Methodology

Except for the Japanese example below, all text shown here has been reduced to a 27-character alphabet (uppercase A through Z plus space). The alphabet size only weakly affects the ultimate outcome (22), so there is no loss of generality; consistent results are also obtained for a range of n-gram lengths (22). I have worked with 5-grams for the English-language examples and 6-grams for the Japanese (23), but it should be borne in mind that the size of the alphabet and the n-gram length are both flexible.

Given an alphabet and value of n, a naive calculation of the number of possible n-grams can be misleading. It is immaterial that, for example, 27^5 = 14,348,907 distinguishable 5-grams can be formed using 27 characters, because most of them are never encountered. Huge reserves of computer memory for n-gram statistics are therefore unnecessary.

An entire document can be represented as a vector whose components are the relative frequencies of its distinct constituent n-grams (the exhaustive list of constituent n-grams comprises all n-character sequences produced by an n-character-wide window displaced along the text one character at a time, and contains many duplications). Let the document contain J distinct n-grams, with $m_i$ occurrences of n-gram number i. Then the weight assigned to the ith vector component will be

$$x_i = \frac{m_i}{\sum_{j=1}^{J} m_j} \qquad (1)$$

so that, by construction,

$$\sum_{j=1}^{J} x_j = 1 \qquad (2)$$
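A minimal sketch of this weighting in Python (the function name, the use of a plain dictionary, and the default n = 5 are illustrative choices, not taken from the paper):

    from collections import Counter

    def ngram_vector(text, n=5):
        """Relative frequency of each distinct n-gram in `text` (Eqs. 1 and 2)."""
        # Slide an n-character window along the text, one character at a time.
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        counts = Counter(grams)          # m_i for each distinct n-gram
        total = len(grams)               # the sum over j of m_j
        return {g: m / total for g, m in counts.items()}   # components x_i sum to 1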

Because both the size of the alphabet and the length of the n-grams are arbitrary, document vectors can be stored conveniently by indexing ["hashing" (24)] each n-gram in a consistent manner; numerical values of vector components are stored and retrieved using these indices as pointers to memory. For the present work, I have used an 18-bit index (hash key) and ignored collisions (relatively infrequent instances of different n-grams being mapped to the same key).

Documents are characterized as follows: (i) Step the n-gram window through the document, one character at a time. (ii) Convert each n-gram into an indexing key. (iii) Concatenate all such keys into a list and note its length. (iv) Order the list by key value [efficient algorithms will do this in linear time (25)]. (v) Count and store the number of occurrences of each distinct key while removing duplicates from the list. (vi) Divide the number of occurrences of each distinct key by the length of the original list.




Because every character of a document (except for the last n - 1 characters) is the initial character of some n-gram (which may not necessarily be distinct from previous n-grams encountered in the same document), the number of distinct n-grams will initially closely track the document size in characters. Eventually, as the substance and tone of the document become established, fewer and fewer new n-grams will be introduced, and the initial rise will slow considerably (Fig. 1).
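A sketch of the six characterization steps under the stated assumptions (an 18-bit key with collisions ignored); the particular hash function here is an arbitrary stand-in, not the one used by the author:

    import hashlib
    from itertools import groupby

    KEY_BITS = 18  # 2**18 possible hash keys, as in the text

    def ngram_keys(text, n=5):
        """Steps (i) to (iii): step the window through the text and hash each n-gram."""
        keys = []
        for i in range(len(text) - n + 1):
            digest = hashlib.md5(text[i:i + n].encode("utf-8")).digest()
            keys.append(int.from_bytes(digest[:4], "big") & ((1 << KEY_BITS) - 1))
        return keys

    def characterize(text, n=5):
        """Steps (iv) to (vi): sort the key list, count duplicates, divide by its length."""
        keys = ngram_keys(text, n)
        total = len(keys)          # length of the original (unsorted) list
        keys.sort()                # step (iv); a counting sort would be linear time
        return {k: len(list(group)) / total     # steps (v) and (vi)
                for k, group in groupby(keys)}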

In gauging similarity, I make the basic assumption that two documents whose n-gram vectors are "similar" in some useful sense are likely to deal with related subject matter, and that documents whose vectors are dissimilar are likely to have little to do with one another. As a tentative first step, consider the normalized dot product S between document vectors. For documents m and n drawn from a set of size M (m, n ∈ 1, ..., M),

$$S_{mn} = \frac{\sum_{j=1}^{J} x_{mj}\, x_{nj}}{\left(\sum_{j=1}^{J} x_{mj}^{2} \sum_{j=1}^{J} x_{nj}^{2}\right)^{1/2}} = \cos\theta_{mn} \qquad (3)$$

Here $x_{mj}$ is the relative frequency with which key j (out of a total of J possibilities) occurs in document m. The score given by Eq. 3 is the cosine of the angle $\theta_{mn}$ between two vectors in the high-dimensional document space, as viewed from the absolute origin. Points (documents) in this vector space are constrained to lie in the hyperplane (subspace) defined by Eq. 2.

This result can easily be visualized in a low-dimensional space. Taking J = 3 (ordinary three-space), Eq. 2 means that all document points lie in the plane $x_1 + x_2 + x_3 = 1$ (Fig. 2), and Eq. 3 is in fact the cosine of the angle between document vectors m and n as viewed from the origin.
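Equation 3 applied to the sparse dictionaries produced by the sketches above (again an illustrative rendering, not the author's code; the example strings are invented):

    from math import sqrt

    def cosine_score(x, y):
        """Normalized dot product of two sparse document vectors (Eq. 3)."""
        dot = sum(v * y.get(k, 0.0) for k, v in x.items())
        norm = sqrt(sum(v * v for v in x.values()) * sum(v * v for v in y.values()))
        return dot / norm if norm else 0.0

    # Related texts should score higher with each other than with an unrelated one.
    a = characterize("THE MUSEUM EXHIBIT OPENS IN JANUARY AND CLOSES IN MARCH")
    b = characterize("THE MUSEUM EXHIBIT OF DRAWINGS CLOSES IN FEBRUARY")
    c = characterize("QUARTERLY EARNINGS ROSE ON STRONG MEMORY CHIP DEMAND")
    assert cosine_score(a, b) > cosine_score(a, c)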

Equation 3 can provide a gross measure of similarity; in particular, language discrimination is excellent. As a simple example (Fig. 3), I intercompared samples of text in 31 different languages (averaging about 5000 characters each) and displayed the results in a way that divulges clustering among the similarity scores (26). Similarity scores below a threshold determined by the calibration procedure described below were discarded. Within each independent class of languages, which by definition has no above-threshold links to other classes (the five African languages in the lower left corner, for example), the algorithm represents similarity by proximity (27). In this representation, two independent samples from the same language would typically be offset from one another by about 10% of the radius of one of the circular icons. Solely on the basis of their 5-gram content, these samples have been accurately grouped by language family.

The metric Eq. 3 fails at subtler tasks such as topic discrimination, however, because plain-language document vectors are usually dominated by uninformative components (in English, for example, n-grams derived from "is the", "and the", and "with the"), and the simple dot product Eq. 3 is driven primarily by the very strongest vector components. This weakness is shared by conventional word-based vector-space systems, which is why they usually use (language- and domain-dependent) stop lists.

An effective solution to this problem is to translate the origin of the vector space to a location that characterizes the information one wishes to ignore and to compute a similarity score referred to that new vantage point. Because a judiciously chosen origin can represent information common to a given set of documents, it can implicitly define the context in which discrimination is to take place. This is an important step because similarity comparisons become meaningful only when those document characteristics to which the measuring system is sensitive (be they, for example, overall formatting, alphabet type, specific language, or primary topic) have been identified and accounted for.

Fig. 1. Number of distinct n-grams (n = 5) as a function of document length for 5050 broad-ranging English-language magazine articles.

Fig. 2. A realization of the similarity score (Eq. 3) in a three-dimensional document space. Documents are constrained to lie in the plane $x_1 + x_2 + x_3 = 1$.


Fig. 3. Clustering of 31 language samples based solely on the normalized dot product (Eq. 3). Proximity within each of the six disjoint classes connotes similarity (only relative distance is meaningful; there are no axes).


One straightforward way to specify this new origin is to average a particular set of document vectors, for example, the document vectors that belong to an identifiable cluster. We may call this average the centroid of the set, whether it be the mean, median, or some other measure of commonality (it is an open question as to which of these might be most effective in any specific application; preliminary experiments indicate that the end result is not a sensitive function of the adopted measure). Note that each of the many axes in the document vector space is associated with a unique n-gram (apart from an arbitrarily small collision factor), and that the proposed transformation is neither a rescaling nor a rotation of the axes, merely a translation. Consequently, documents that belong to sets dominated by nominally different primary topics (for example, health care reform and communicable diseases) but that are related to one another at a subordinate topic level (for example, AIDS epidemiology) can still be recognized as such if the various primary clusters are first "superimposed" by subtracting the corresponding cluster centroid from the documents that define each cluster (thereby centering the clusters themselves about a single origin).

For simplicity, I have chosen the centroid of a given set to be the arithmetic mean of its document vectors.

Fig. 4. Cumulative distributions of test-pair scores based on a broad sample of English-language magazine text. (Horizontal axis: similarity score; vertical axis: percentage of "twin" and "nontwin" pairs scoring below each value.)

Fig. 5. Rank among retrieval systems versus number of queries in which that rank was achieved for (A) the ad hoc task (brief queries) and (B) the routing task (exemplar-based retrieval). (Horizontal axis: rank among systems under test, in percentiles.)

In J-dimensional space, intercomparison of the M documents of one set, represented by vectors $x_m$, m ∈ 1, ..., M, with the N documents of a second set (the second set may of course be identical with the first), represented by vectors $y_n$, n ∈ 1, ..., N, yields the modified score

$$S_{mn} = \frac{\sum_{j=1}^{J} (x_{mj} - \mu_j)(y_{nj} - \nu_j)}{\left[\sum_{j=1}^{J} (x_{mj} - \mu_j)^{2} \sum_{j=1}^{J} (y_{nj} - \nu_j)^{2}\right]^{1/2}} = \cos\theta_{mn} \qquad (4)$$

where

$$\mu_j = \frac{1}{M} \sum_{m=1}^{M} x_{mj} \qquad (5)$$

and

$$\nu_j = \frac{1}{N} \sum_{n=1}^{N} y_{nj} \qquad (6)$$

Let $x' = x - \mu$ and $y' = y - \nu$ ($\mu$ and $\nu$ are the centroid vectors associated with the two documents x and y, respectively). By definition, then,

$$\sum_{j=1}^{J} x'_{j} = 0 \qquad \text{and} \qquad \sum_{j=1}^{J} y'_{j} = 0$$

These two conditions constrain all document vectors to a hyperplane parallel to the one shown in Fig. 2 but passing through the absolute origin. The superimposition of clusters referred to above is enforced by the vector differences in the numerator and denominator of Eq. 4.

The score defined by Eq. 4 is sensitive to "noise" (for example, garbled text, stylistic differences among authors, and residual fluctuations in common elements after subtraction of the centroid) in documents that lie close to the origin, with the result that small perturbations of a document vector can drastically alter its similarity scores with other documents (because a small change in some vector component can cause a large change in $\theta_{mn}$). However, because a document close to the origin (in terms of some typical cluster radius) contains little information beyond that represented by the origin itself, and in that sense can be considered fully characterized, there is little to be gained by pursuing more subtle comparisons involving that document. In practice, one can penalize scores involving such documents by associating an overall multiplicative factor with every document vector, such that the closer the document is to the current origin, the smaller the factor becomes. The net result is a similarity score $\lambda_m \lambda_n S_{mn}$, where $\lambda_m$ and $\lambda_n$ go smoothly from 0 to 1 as the length of the respective document vector increases (28), with

$$\lambda = f(r), \qquad r^{2} = \sum_{j=1}^{J} (x'_{j})^{2} \qquad (7)$$
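A sketch of the centroid-subtracted, damped comparison (Eqs. 4 through 7), using the logistic damping form quoted in note 28; the dense-array layout and the parameter values r0 and delta are assumptions made for illustration:

    import numpy as np

    def damped_scores(X, Y, r0=0.05, delta=0.01):
        """Matrix of scores lambda_m * lambda_n * S_mn (Eqs. 4 and 7).

        X: (M, J) array of document vectors from the first set (rows obey Eq. 2).
        Y: (N, J) array of document vectors from the second set.
        """
        Xc = X - X.mean(axis=0)           # subtract the centroid mu_j (Eq. 5)
        Yc = Y - Y.mean(axis=0)           # subtract the centroid nu_j (Eq. 6)
        rx = np.linalg.norm(Xc, axis=1)   # Euclidean length r of each centered vector
        ry = np.linalg.norm(Yc, axis=1)
        lam_x = 1.0 / (1.0 + np.exp((r0 - rx) / delta))   # note 28's damping factor
        lam_y = 1.0 / (1.0 + np.exp((r0 - ry) / delta))
        cos = (Xc @ Yc.T) / (np.outer(rx, ry) + 1e-12)    # Eq. 4: cos(theta_mn)
        return np.outer(lam_x, lam_y) * cos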

How are these similarity scores distributed over a wide-ranging corpus of documents? Can one distinguish among related documents, unrelated documents, and intermediate cases (29)? Rather than rely on human judges to establish "ground truth," I create a set of closely related document pairs by partitioning each member of a test collection in a special way, extracting alternate sentences into two new "twin" documents (22). With such a test set, one can establish the performance of a similarity metric against documents known to be highly similar (twins) and documents assumed on average to be dissimilar (nontwins). The behavior of intermediate cases can plausibly be interpolated into a number of broad similarity categories, each with associated confidence levels.

For the present test, I partitioned 4000 diverse English-language magazine articles chosen at random from a commercial full-text CD-ROM. The original articles were at least 1000 characters long and produced twins containing at least 20 sentences apiece. Each of these altered documents was compared with all others by its modified score (4), producing two score distributions (document versus twin, document versus nontwin) (Fig. 4). If in fact these test sets faithfully model strongly related and unrelated documents, then the risk of severely misclassifying strongly related documents by calling them totally unrelated (and vice versa) is low (less than 1%). By interpolating between the two distributions, one can likewise reduce the likelihood of subtler misclassifications to an acceptable level.
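A sketch of the twin-document construction used for this calibration; the crude sentence splitter is my own assumption, since the paper does not specify one:

    import re

    def make_twins(document):
        """Split a document into two "twin" documents of alternating sentences."""
        # Naive segmentation on terminal punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", document.strip())
        twin_a = " ".join(sentences[0::2])   # sentences 1, 3, 5, ...
        twin_b = " ".join(sentences[1::2])   # sentences 2, 4, 6, ...
        return twin_a, twin_b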

Formal Evaluation

The annual Text Retrieval Conference (TREC) sponsored by the National Institute of Standards and Technology (30) is a well-attended forum for the comparison of document retrieval and categorization methodologies, with more than 90 participants in 1994 from government, industry, and academia. Two high-volume test protocols (run against over a million broad-ranging English-language documents) have been devised to assess participating systems, which are ranked according to their conformity with human evaluators' judgments of the "relevance" of retrieved documents to a prescribed set of queries. The results of these assessments are characterized in terms of recall (the fraction of all "relevant" documents in a corpus that are actually retrieved) and precision (the fraction of the documents retrieved that are tagged "relevant"). Obviously, the significance of such results depends on the care with which "relevance" is defined and determined (31).

The purpose of taking part in this year's TREC activity (32) was to compare the performance of Acquaintance with that of state-of-the-art document retrieval systems.


The two main tasks undertaken by participants were (i) document retrieval based on brief stylized descriptions of the desired documents (known as the ad hoc task), and (ii) document retrieval based on the full text of exemplars certified to be of interest (known as the routing task).

The performance of Acquaintance on the ad hoc task is shown in Fig. 5A. Out of 50 queries considered, the measured precision exceeded the median (across 34 participating systems) only twice. It was far below the median in almost all other cases, although it fared better than 10% of the participants more than half the time. Such queries fail to model desired documents well enough to serve as the sole input for retrieval by Acquaintance.

Performance on the exemplar-based routing task is plotted in Fig. 5B. Acquaintance scored at least as well as half of the 34 systems addressing this task in more than one-third of the queries (18 out of 50); it scored better than two-thirds of the systems in one-fifth of the queries (10 out of 50).

In the latter task, and in terms of these widely adopted metrics, it would appear that Acquaintance can perform on a par with some of the best existing retrieval systems. Aside from the utter simplicity of its approach, however, one feature that sets it apart is its complete language independence: At no time was the system informed that it was processing English. Not surprisingly, then, useful practical results have been obtained during the past several years of development, with no modification of the algorithm, in close to two dozen languages.

Examples

The following three examples illustrate the use of Acquaintance as a basis for blind clustering, the grouping of documents according to language, topic, and subtopic (and even finer subdivisions) with no prior information about document content. Other applications, such as sorting (redirecting documents, that is, forwarding them to end users, in accordance with previously specified categories), categorization (labeling documents, but not necessarily redirecting them, in accordance with previously specified categories), and retrieval, can be viewed as procedural modifications of this basic process. In sorting and categorization, incoming documents are compared with previously established references, which may be existing documents or groups of documents (and which may well have been identified by previous stages of processing).

For retrieval, Acquaintance facilitates an iterative refinement process in which one maps relationships and labels entire clusters of documents at each stage of retrieval. (It is not necessary to intercompare all documents in a large corpus beforehand because relationships can quickly be mapped among just the documents retrieved at any stage.) Query-matching requirements in the initial stage of a search can be relaxed significantly (so as to enhance recall), and entire clusters of inappropriate documents can be discarded strictly on the basis of their labels (thereby enhancing precision). If appropriate documents are identified in this way, they can be merged and resubmitted as a far more focused follow-up query, and the mapping and labeling procedure can be repeated. In the process, the context (that is, the origin in document space) can be redefined after each round of retrieval, if desired, enhancing discrimination in successive rounds (33).
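One round of that refinement loop might look like the following sketch; the threshold value, the dense corpus layout, and the function name are assumptions for illustration only:

    import numpy as np

    def retrieval_round(query_vec, corpus, threshold=0.3):
        """Score a query against a context-centered corpus, keep above-threshold
        documents, and merge them into a more focused follow-up query.

        query_vec: (J,) n-gram frequency vector of the current query.
        corpus:    (N, J) n-gram frequency vectors of the documents searched.
        """
        origin = corpus.mean(axis=0)      # context = centroid of the current corpus
        q = query_vec - origin
        docs = corpus - origin
        cos = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
        hits = np.flatnonzero(cos > threshold)    # relaxed first-pass retrieval
        # Merge the retrieved documents into a far more focused follow-up query.
        next_query = corpus[hits].mean(axis=0) if hits.size else query_vec
        return hits, next_query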

For the first example, Acquaintance processed two days' worth of Associated Press (AP) wire service news articles (19 and 20 February 1990) from the TREC collection. All possible pairs of articles were intercompared, and the resulting scores were used to construct Fig. 6, which reflects the interrelationships among a subset of the 392 articles (26). Guided by a related n-gram text-profiling technique (34), clusters were automatically labeled with an informative set of "highlights" (words and phrases that characterize a document or group of documents in the context of a specified set; the context in this case was the full set of 392 articles). The resulting labels clearly delineate the topic or topics addressed by those documents. Two ostensibly related articles have been further labeled (by explicit user request) with excerpted text.

Fig. 6. A subset of AP wire service articles automatically grouped by topic with no prior information on document content. The clusters have been labeled with n-gram-derived word and phrase highlights (34). In addition, text excerpts label two of the articles to the right of center.


To illustrate the robustness of the algorithm in the face of severe degradation in text quality, I compared the roster of a particular cluster of news items (museum and art gallery announcements) identified within the full data set used in the preceding example with that of the corresponding cluster in an artificially corrupted version of that same text (the clean and corrupted text versions of the file were processed completely independently of one another). The character error rate was approximately 15%: On average, 15 of every 100 original characters were randomly modified either by substitution, addition, or deletion. The cluster of corrupted announcements was found still to contain 17 of the 20 members originally found in the clean version. Figure 7 shows four corresponding pairs from the two clusters.
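The corruption itself is easy to simulate; the exact procedure used for the experiment is not given, so the following is one plausible variant:

    import random
    import string

    def corrupt(text, rate=0.15, alphabet=string.ascii_uppercase + " "):
        """Randomly substitute, add, or delete characters at roughly the given rate."""
        out = []
        for ch in text:
            if random.random() < rate:
                op = random.choice(("substitute", "add", "delete"))
                if op == "substitute":
                    out.append(random.choice(alphabet))
                elif op == "add":
                    out.append(ch)
                    out.append(random.choice(alphabet))
                # on "delete", simply drop the character
            else:
                out.append(ch)
        return "".join(out)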

Figure 8 illustrates the results of blind clustering of Japanese newspaper text. The test set of 394 articles was drawn from the Defense Advanced Research Projects Agency (DARPA) TIPSTER Japanese collection [16-bit shifted Japanese Industrial Standards (JIS) code], which deals exclusively with the general topic of joint ventures. Successful discrimination must therefore take place at what would commonly be viewed as the subtopic level or better.

The figure provides a view into one branch of a set of 119 computer-related articles (other prominent subject areas covered by the full set included joint ventures between Japanese and German auto manufacturers, and the world of haute couture). The articles in Fig. 8 relate mainly to computer memory and report on developments among companies such as Intel, Motorola, NKK, and Siemens. The articles within each cluster are mutually consistent to the same extent as the English-language examples shown above.

Summary

The capabilities of the Acquaintance technique include (i) language sorting based solely on brief reference samples and an "unknown" sample some dozens of characters long, (ii) topical partitioning of a large collection of documents in any language, which need not be specified in advance, with no prior specification of subject matter and with no user intervention, (iii) document sorting in any language, based on reference documents constructed in a self-consistent manner, and (iv) natural-language exemplar-based document retrieval in any language, at a performance level that compares favorably with state-of-the-art retrieval systems. These derive directly from an n-gram-based measure of document similarity and a means of automatically mitigating the effects of uninformative text. The latter has made possible a practical definition of the context in which document similarity is to be judged and is a generally applicable vector-space technique, rather than being specifically tailored to n-grams.

An off-the-shelf personal computer can fully intercompare hundreds to thousands of documents in a matter of minutes (35). The time for full intercomparison is of course quadratic in the number of items compared, but the iterative refinement process described above obviates the need for full intercomparison of more than several thousand documents at a time in all but the very highest volume applications. Sorting and retrieval are linear in the number of documents investigated.

The time required to set up and tune a specific application is measured in minutes to hours, rather than weeks to months. The approach lends itself well to experimentation and is straightforward to adapt to a wide variety of problems. Research into the characteristics, capabilities, and limitations of this technique is ongoing.

The results obtained thus far underscore the desirability of distinguishing between topic and meaning in text processing. Although words in isolation may well be ambiguous, one can assume that documents like the ones discussed here are intended to communicate useful information. The author of a coherent document will attempt to mitigate ambiguities by supplying related words and phrases, and therefore n-grams, as necessary, no matter what the language. The evidence presented here suggests that the mere presence of such terms, rather than their linguistic interrelationships, can usefully constrain the topic.

Fig. 8. One branch of a set of computer-related Japanese newspaper articles. The articles in this branch deal mainly with computer memory hardware.

Fig. 7. Comparison of four articles from a prominent cluster with their counterparts, which were also grouped into a significant cluster. The average character error rate in the latter case is 15%.

High Museum of Art: "French Ceramics: Masterpieces From Lorraine." Through Jan. 6. "Poster Art of the Soviet Union: A Window Into Soviet Life." Through Feb. 8.

Baltimore Museum of Art: "Ndebele Beadwork." Through Jan. 13. "Lalique: A Century of Glass for a Modern World." Through Jan. 13.

Walters Art Gallery: "Islamic Art and Patronage: Selections From Kuwait." Through Feb. 17.

Museum of Fine Arts: "Rediscovering Pompeii." Through Jan. 27. "Adolph Menzel, 1815-1905: Master Drawings From Berlin." Through Jan. 27. "The Pen and the Sword: Winslow Homer, Thomas Nast and the American Civil War." Through Feb. 3. "The Sculpture of Indonesia." Through March 17.

Nelson-Atkins Museum of Art: "Organic Abstraction." Through Feb. 10. "The Modern Poster: The Museum of Modern Art." Through Feb. 10. "South Asian Textiles From the Permanent Collection: Woven Patterns." Through Feb. 17.

Hgh Musum of ArrIt:' French Cewraamic: Masterpi4eces F~rom; Lorraine.bh'H Through

Jan. 6.'-'Poster XIA-rFtt ofU thC bSovietH Uion: A Windw Ito Soiet

Life.' '1 Thoug-h Feby.N 8.

J N +BaltimoeMuseum of Art:H o '-'NdbLl{e Beadwork.' ('% Thrjough Jan 13. ''Lliqqu-:1 A Centuy ofGlYa=s for a Modern &W#orld.'' Throug Jan).w 13.u $

WatetrRs Ar Galler:''Islamic A[r!t and Pat[rqoHnbag: SlecUt7ions From Kuwait.'-'

Through Feib. 17.

M Museum Wof Fu'i!nez aArts:--RediscoeriDn)g PoWopei.'' Troungh Jan. 27.'-'Adolph M(nbzl, 181(5c-1905: Mast%er Drawvings FKrom/

uBerlin.o '7'hrough cJUan. 27.

TherPienY~armidd te Sw5or(:7E MWinldoG*wc Homer, ThomasHNNastandg thAmerican CliSviWar.' Through Fe.n3.

''The Sculkpture of IndoneWsXia'' ThroughS =Marc+h 17.

T oNelson-Atkins Museum of ArtV:s''Organic bstraction.'' iTkhrUorugh Feb. 0.'-'The ModePrfn Posterj:L HThLe Musge6um of\ 2Mo.d8ern

rt. '=hyrougWh FebH. 120.iv' Sou$thAsan TexXt+iles\ 'From ith Permanent Collection:

WuoIven PattGernnmse.' Thrfough Feb. t1(7.


The distinction between topic and meaning is of practical interest because many document-handling tasks are facilitated by the ability to sort solely according to topic, deferring the appreciation of meaning to a late stage of processing. We now possess a versatile tool to do just that.

REFERENCES AND NOTES

1. G. Salton, Science 253, 974 (1991).
2. ____ and C. Buckley, ibid., p. 1012.
3. G. Salton, J. Allan, C. Buckley, A. Singhal, ibid. 264, 1421 (1994).
4. "The first step of any language analysis system is necessarily recognizing and identifying individual text words." G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer (Addison-Wesley, Reading, MA, 1989), p. 379.
5. C. E. Shannon, The Mathematical Theory of Communication (Univ. of Illinois Press, Urbana, IL, 1949).
6. ____, Bell Syst. Tech. J. 30, 50 (1951).
7. The term n-gram has occasionally been used to mean a sequence of n words or other composite units. Throughout this paper, except for the Japanese example, n-gram will refer exclusively to a sequence of n characters; in the case of Japanese, it refers to n bytes of text.
8. C. Y. Suen, IEEE Trans. Pattern Anal. Mach. Intell. 1, 164 (1979).
9. A. Zamora, J. Am. Soc. Inf. Sci. 31, 51 (1980).
10. J. L. Peterson, Commun. ACM 23, 676 (1980).
11. E. M. Zamora, J. J. Pollock, A. Zamora, Inf. Process. Manage. 17, 305 (1981).
12. J. J. Hull and S. N. Srihari, IEEE Trans. Pattern Anal. Mach. Intell. 4, 520 (1982).
13. J. J. Pollock, J. Doc. 38, 282 (1982).
14. R. C. Angell, G. E. Freund, P. Willett, Inf. Process. Manage. 19, 255 (1983).
15. E. J. Yannakoudakis, P. Goyal, J. A. Huggill, ibid. 18, 15 (1982).
16. J. C. Schmitt, U.S. Patent No. 5,062,143 (1990).
17. W. B. Cavnar and J. M. Trenkle, "N-Gram-Based Text Categorization," Proceedings of the 1994 Symposium on Document Analysis and Information Retrieval (Univ. of Nevada, Las Vegas, 1994), p. 161.
18. P. Willett, J. Doc. 35, 296 (1979).
19. C. P. Mah and R. J. D'Amore, "DISCIPLE Final Report," PAR Rep. 83-121 (PAR Technology Corporation, New Hartford, NY, 1983).
20. J. Scholtes, in International Joint Conference on Neural Networks, Singapore (IEEE Press, New York, 1991), vol. 1, p. 95.
21. W. B. Cavnar, in (30), pp. 171-179.
22. S. M. Huffman, in preparation.
23. The Japanese text is encoded as two-byte integers (specifically, 16-bit shifted JIS code), and the current processing system is byte-oriented; a 6-gram in Japanese therefore refers to a sequence of three kanji or kana.
24. D. E. Knuth, Sorting and Searching, vol. 3 of The Art of Computer Programming (Addison-Wesley, Reading, MA, 1973), sec. 6.4.
25. T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduction to Algorithms (McGraw-Hill, New York, 1990), sec. 9.3.
26. J. D. Cohen, in preparation.
27. The visualization algorithm assumes that the various relationships are transitive: The fact that French resembles Spanish and that Spanish resembles Portuguese, for example, strengthens the imputed relationship between French and Portuguese. Only relative distance is meaningful in Fig. 3, that is, there are no axes in this diagram. Note that Fig. 3 does not represent the same space as Fig. 2. It is effectively a distillation in two dimensions of some of the useful information found in the high-dimensional document vector space.
28. The specific form used for the work illustrated here is $\lambda_r = \{1 + \exp[(r_0 - r)/\Delta]\}^{-1}$, but any similarly behaved function would likely do as well. Note that Eq. 2 implies that the maximum possible Euclidean length of a vector is 1.
29. To skirt the contentious issue of "relevance" [L. Schamber, M. B. Eisenberg, M. S. Nilan, Inf. Process. Manage. 26, 755 (1990)], adopt an operational definition of "strong relatedness": Two documents are strongly related if (but not only if) they have been produced by the document-splitting procedure described in this section. As with all operational definitions, the utility of this definition is ultimately to be judged by the utility of its consequences.
30. D. K. Harman, Ed., The Second Text Retrieval Conference (TREC-2) (NIST Spec. Publ. 500-215, National Institute of Standards and Technology, Gaithersburg, MD, 1994).
31. "The possibility, for example, that two irrelevant documents might become relevant if put together has never been adequately considered, so far as I know." D. R. Swanson, J. Am. Soc. Inf. Sci. 26, 755 (1990).
32. D. K. Harman, Ed., The Third Text Retrieval Conference (TREC-3) (National Institute of Standards and Technology, Gaithersburg, MD, in preparation).
33. This process is ideally suited to the ad hoc task in the TREC environment but was not implemented in time for the 1994 conference.
34. J. D. Cohen, J. Am. Soc. Inf. Sci. 46, 162 (1995).
35. In a typical run, 941 newspaper articles were exhaustively compared with one another, generating a total of 443,211 pairwise scores, in 726 seconds, including the time required to save the scores to disk (Apple Macintosh Quadra 950).
36. I thank J. D. Cohen, S. M. Huffman, J. M. Kubina, and C. Pearce for their enthusiastic contributions to this project, and D. E. Brown and M. W. Goldberg for their support and encouragement. The technique described in this paper is the subject of a pending U.S. patent and of France's patent no. 2,694,984.

