
On the Law of Zipf-Mandelbrot for Multi-Word Phrases

L. Egghe, LUC, Universitaire Campus, B-3590 Diepenbeek, Belgium, and UIA, Universiteitsplein 1, B-2610 Wilrijk, Belgium. E-mail: [email protected]

This article studies the probabilities of the occurrence of multi-word (m-word) phrases (m = 2,3,...) in relation to the probabilities of occurrence of the single words. It is well known that, in the latter case, the law of Zipf is valid (i.e., a power law). We prove that in the case of m-word phrases (m ≥ 2), this is not the case. We present two independent proofs of this. We furthermore show that, in case we want to approximate the found distribution by Zipf's law, we obtain exponents b_m in this power law for which the sequence (b_m)_{m∈N} is strictly decreasing. This explains experimental findings of Smith and Devine (1985), Hilberg (1988), and Meyer (1989a,b). Our results should be compared with a heuristic finding of Rousseau, who states that the law of Zipf-Mandelbrot is valid for multi-word phrases. He, however, uses other (less classical) assumptions than we do.

I. Introduction

Zipf's law (Zipf, 1949/1965) states that if the words in a text are ordered in decreasing order of occurrence in this text, then the product of the rank r of a word and the number of times it occurs, f(r), is a constant of that text:

r·f(r) = c. (1)

So

f(r) = c/r. (2)

Dividing by the total number of words in the text yields the probability P(r) that the word with rank r occurs:

P(r) = C/r, (3)

where C is a constant. More generally, one can state

P(r) = C/r^b, (4)

where b > 0; see Egghe and Rousseau (1990). In the same notation, the law of Mandelbrot states

P(r) = A/(1 + Br)^b. (5)

Again, A and B are constants. This law can be found in Mandelbrot (1954, 1977a, 1977b); see also Egghe and Rousseau (1990). It is clear that laws (4) and (5) are asymptotically the same, i.e., when ranks are high:

A/(1 + Br)^b = A/(((1/r) + B)·r)^b ≈ A/(B^b·r^b) = C/r^b, C = A/B^b.

The main part of this article deals only with high ranks r, and we will therefore refer to these laws as the Zipf-Mandelbrot laws. The laws of Zipf and Mandelbrot are very important in information science. Indeed, they express very clearly how "skew" the distribution of the use of words in a text is. Although this is a property of all the so-called informetric laws (such as the ones of Lotka, Bradford, Leimkuhler, and others; see, e.g., Egghe & Rousseau, 1990), the laws of Zipf and Mandelbrot especially have important applications. Since these applications will also underline the importance of the laws of Zipf and Mandelbrot for multi-word phrases (to be studied in the sequel), we will briefly go into them.
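This asymptotic equivalence is easy to check numerically. The following sketch (the parameter values A = 1, B = 0.5, and b = 1.2 are illustrative assumptions, not values from the text) shows the ratio of law (5) to its asymptotic form C/r^b tending to 1 as the rank grows:

```python
# Illustrative (assumed) parameters for Mandelbrot's law P(r) = A / (1 + B*r)^b.
A, B, b = 1.0, 0.5, 1.2
C = A / B**b  # constant of the asymptotic Zipf form C / r^b

def mandelbrot(r):
    return A / (1.0 + B * r)**b

def zipf(r):
    return C / r**b

# The ratio of the two laws tends to 1 as the rank r grows.
ratios = [mandelbrot(r) / zipf(r) for r in (10**2, 10**4, 10**6)]
print(ratios)
```

The deviation from 1 shrinks like 1/(Br), so the two laws are interchangeable exactly in the high-rank regime used below.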

It is well known that, if symbols (e.g., letters, numbers) have an unequal chance of appearance in texts, one can find unequal-length coding such that the length becomes optimal (i.e., minimal) with respect to these unequal chances. In its purest form, such a compression technique is called the Huffman compression technique; see Huffman (1952). The same can be said about words in texts. Their unequal

Permanent address: LUC, Universitaire Campus, B-3590 Diepenbeek, Belgium.

© 1999 John Wiley & Sons, Inc.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 50(3):233–241, 1998 CCC 0002-8231/99/030233-09

appearance in texts is ruled by the laws of Zipf or Mandelbrot. This knowledge leads to compression of texts; it also leads to the treatment of abbreviations. Such types of applications can be found in Heaps (1978), Aalbersberg, Brandsma, and Corthout (1991), Schuegraf and Heaps (1973), and Jones (1979).

Knowing the exact form of the law of Zipf or Mandelbrot also enables researchers to draw conclusions on the stylistic type of texts. This, in turn, can lead to the proof (or disproof) of authorship or to the determination of the order in which different texts of the same author have been written. References to the first application are: Allen (1974), Brinegar (1963), McColly and Weier (1983), Michaelson and Morton (1971-1972), Smith (1983), Ule (1982), and Wickmann (1976). A reference to the second application is Cox and Brandwood (1959). A general treatment of these problems can be found in Herdan (1964).

The laws of Zipf and Mandelbrot are also important in information retrieval (IR). Once known and used in the calculation of the entropy formula, one gets a measure of the selectivity of an IR result. This application to IR is intuitively clear: The law of Zipf-Mandelbrot deals with the unequal appearance of words in texts and hence, indirectly, with the "value" of key words in IR. In the same way, and since indexing and retrieval are similar ("dual") notions (see Egghe & Rousseau, 1997), the same can be said about the indexing value of key words. The law of Zipf-Mandelbrot is a basic ingredient in this in the sense that, via the calculation of the entropy, it determines the separation properties of key words. For this important application to automatic indexing, see Salton and McGill (1984).

The laws of Zipf and Mandelbrot are also important in the fractal aspects of information science and of linguistics. Fractals were introduced by Mandelbrot (see Mandelbrot, 1954, 1977a, 1977b) in order to estimate the complexity of certain systems. The main idea behind fractals is the fact that, in real life, it is not possible to measure distances between objects, since this is related to the scale of visualization of the object, and since multiplying the scale by, say, a, leads to an increase of the distance larger than multiplying the old distance by a. A typical example is the problem of measuring the length of coastlines. For example (see Feder, 1988), put simply, if we multiply the scale of the map of the coast of Australia by 2, we will find that the length has been multiplied by 1.52 times 2! D = 1.52 is called the fractal dimension of the coastline and denotes, in a way, its complexity. Mandelbrot was able, under simplified circumstances, to calculate the fractal dimension D of a text and found that D = 1/b, where b is the exponent appearing in his law (5). For a more understandable proof of this, we refer the reader to Egghe and Rousseau (1990); in fact, its derivation is a simplified version of the argument that is given in Section IV.

Let us finally mention applications of the laws of Zipf and Mandelbrot in speech recognition; see Chen (1989, 1991).

We hope to have indicated the incredible importance of the law of Zipf-Mandelbrot in the information sciences and in linguistics. In all of these applications, one is limited to the consideration of single-term key words. However, multi-word phrases are very important objects in all of these applications, and their importance is continuously increasing. It is, e.g., clear that, from the point of view of compression of databases, multi-word phrases are more important than single words. Multi-word phrases are abbreviated more often than single words. The optimal allocation of abbreviations to such multi-word phrases, however, can only be studied if one knows the underlying law of Zipf-Mandelbrot for multi-word phrases.

Knowing these types of laws also opens perspectives in authorship determination: The use of multi-word phrases in texts reveals more of the style of an author than single terms; they have an "added value" above the value of knowing that the separate words have been used.

Perhaps the most important applications of laws of Zipf-Mandelbrot for multi-word phrases are to be expected in IR. This area experiences a fast evolution in IR techniques, partially due to the steep increase of the importance of the Internet and, with this, of search engines (e.g., on the WWW) to be used by, potentially, every person (the use of search engines is certainly not restricted to scientists or librarians, as it was, say, two decades ago, when only "scientific" databases existed). With the steep increase of information, one is badly in need of refined search techniques. Single words alone do not suffice as key words. The separating power of multi-word keys is much higher than the cumulated power of the separate words. Knowing the law of Zipf-Mandelbrot for multi-word phrases might then lead to a scientific study of these "separation" values of multi-word phrases. As remarked above in the case of single key words, the law of Zipf-Mandelbrot for multi-word keys will lead us to a theory of automatic indexing of multi-word phrases.

Finally, knowing the type of law of Zipf-Mandelbrot will lead us to the determination of the complexity of texts, also considered to be composed of multi-word phrases, which gives a better idea of the real nature of the text: A text is more than the set of its words; the study of multi-word phrases might help in the determination of the many complex aspects of texts.

Let us now investigate what type of distribution describes the occurrence of 2-word (3-word, ...) phrases in texts. As in the case of single words, all combinations of 2 (or more) consecutive words are taken into account. Smith and Devine (1985) stated that the occurrence of multi-word phrases can be described by distributions of the Zipf-Mandelbrot type (5). They furthermore found experimentally that the exponent b decreases the larger the phrases one considers. So it was found that b ≈ 1 for single words (i.e., classical Zipf-Mandelbrot), that b ≈ 2/3 for 2-word phrases, and that b ≈ 2/5 for 3-word phrases. Similar observations were made by Hilberg (1988) and Meyer (1989a, 1989b). In


these findings, English, German, and even Chinese texts were the subject.

A rationale behind this was given by Mandelbrot for single words but was lacking for multi-word phrases, until Rousseau (1997) studied the problem from a theoretical point of view. The argument is fractal, thereby extending Mandelbrot's original arguments. (See Egghe and Rousseau, 1990, from p. 306 on, for a detailed description of Mandelbrot's arguments.)

To get an idea of what Rousseau is doing, we will briefly introduce Mandelbrot's argument for single words. This will also be a good introduction to our derivation of the similar argument for 2-word phrases (given in Section IV). Mandelbrot considers texts of words, formed by using an alphabet of N letters. He also assumes that each letter has an equal chance to occur, denoted by ρ. Words are formed by separating strings of letters by a blank. Blanks are possible by requiring that ρ < 1/N. Hence 1 − Nρ is the probability to have, at a certain spot in the text, a blank. By assuming independence between the letters in a word (which is a very much simplifying assumption), we then find that the probability for the occurrence of a word of length k (i.e., consisting of k letters) is

P1(k) = ρ^k(1 − Nρ). (6)

The same argument yields the probability to have a 2-word phrase consisting of k letters in total:

P2(k) = ρ^k(1 − Nρ)^2. (7)

In total, there are (k − 1)N^k possible 2-word phrases with k symbols and, although they will not all occur in the text, we do not follow the argument of Rousseau when he assumes that only ε_2·N^{k+s(2)} of them exist. The only reason why ε_2 is used (a constant, as is s(2)) is that he gets rid of the k-dependent factor k − 1 in (k − 1)N^k. There is no argument for such a "simplification," and in Section IV we will, in fact, work with the formula (k − 1)N^k itself. Although, indeed, not all of these 2-word phrases occur, their probability of occurrence is governed by formula (7) and, hence, there is no need to add other (discrete) restrictions.
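Both the count (k − 1)N^k and the consistency of the probabilities (7) can be checked directly; the alphabet size N = 3, total length k = 5, and letter probability ρ = 0.2 below are illustrative choices, not values from the text:

```python
import itertools

N, k = 3, 5  # illustrative alphabet size and total number of letters
letters = range(N)

# A 2-word phrase with k letters = a split position (first word has
# j = 1..k-1 letters) together with a choice of the k letters themselves.
phrases = [(w[:j], w[j:]) for j in range(1, k)
           for w in itertools.product(letters, repeat=k)]
assert len(phrases) == (k - 1) * N**k  # the count used here and in Section IV

# Summing formula (7) over all 2-word phrases of all total lengths k >= 2
# gives sum_k (k-1) * N^k * rho^k * (1 - N*rho)^2 = (N*rho)^2, a
# geometric-series identity that the partial sums approach quickly.
rho = 0.2  # must satisfy rho < 1/N
total = (1 - N * rho)**2 * sum((j - 1) * (N * rho)**j for j in range(2, 200))
print(len(phrases), total)
```

The enumeration also makes the factor k − 1 concrete: it is exactly the number of ways to place the blank between the two words.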

The main part of the article, however, follows another approach. We will show that, supposing Mandelbrot's law to be valid for single words, the corresponding law for multi-word phrases cannot be of this "power law" type. We will arrive at concrete expressions in the case of high ranks, supposing independence between the occurrences of the words. This is only a simplifying approximation, in the same way as independence of occurrence of letters in single words was used by Mandelbrot, and of words in multi-word phrases by Rousseau (1997) (admitting a correction factor in the latter paper).

So, we arrive at the impossibility for the power law (7) to be valid in the case of multi-word phrases, and this is already found in the simplest models of independence between the words. We furthermore prove that, if we approximate the found laws by a power law of type (7), the b_m's are decreasing with m. This represents a rationale for the statistical findings in Smith and Devine (1985), Hilberg (1988), and Meyer (1989a, 1989b).

As said, the article closes with an examination of the fractal argument of Mandelbrot in the case of 2-word phrases. Here, we do not use the assumptions of Rousseau. We show that the same type of law is found as in the above-described approach, giving another rationale for it.

II. The Case of 2-Word Phrases

Although we have a general solution for m-word phrases (m = 2,3,...), we will deal with the 2-word case separately. The reason is that the general argument is rather intricate (see Appendix), but that this argument is similar to the much simpler one in the 2-word case. Denote by

P1(r) = C/r^b (8)

the law of Zipf-Mandelbrot, supposed to be valid in the single-word case. Hence P1(r) denotes the probability of occurrence of a word that has rank r (ranking according to single words, of course). We use independence of occurrence of two consecutive words as a simplification (we note, again, that also in Mandelbrot, in order to obtain (8), independence between letters is used as a simplification). Hence, we have

P2(r1, r2) = P1(r1)·P1(r2), (9)

where P2(r1, r2) denotes the probability of occurrence of a 2-word phrase that consists of a word with rank r1 and one with rank r2 in the single-word case.

Note that Σ_{r1,r2} P2(r1, r2) = 1, since Σ_r P1(r) = 1. We have solved our problem if we can find an estimate of the final rank r of this 2-word phrase, i.e., find r such that

P2(r) = P2(r1, r2), (10)

where r denotes the ranking of a 2-word phrase consisting of a word with rank r1 and one with rank r2 in the single-word case (8). It is already certain that this is not an ill-posed problem: r1 and r2 determine P1(r1) and P1(r2), hence (9), hence the ranking according to the value of P2(r1, r2), hence r (although ties are possible). We will only be able to solve this problem in case r1 and r2 are high. In any case, we have

r = #{(r′1, r′2) | P1(r′1)·P1(r′2) ≥ P1(r1)·P1(r2)} (11)


(# = number of elements in). The inequality

P1(r′1)·P1(r′2) ≥ P1(r1)·P1(r2) (12)

is, using (8), equivalent with

r1·r2 ≥ r′1·r′2. (13)

(Note that B = (N − 1)/N, as follows from the fractal argument of Mandelbrot, but we do not need this result here.) We put r′1 = r1 + i, count the number of possible i's, and then the value of

r′2 ≤ r1·r2/(r1 + i), (14)

according to (13). Only i-values that yield

r1·r2/(r1 + i) ≥ 1

are possible, since r′2 is a rank. Hence,

i ≤ r1(r2 − 1). (15)

Of course, r′1 is also a rank; hence,

i ≥ −(r1 − 1). (16)

Formulae (15), (16), and (14) yield in (11)

r ≈ Σ_{i=−(r1−1)}^{r1(r2−1)} r1·r2/(r1 + i). (17)

But, using the integral test for series,

Σ_{i=−(r1−1)}^{r1(r2−1)} 1/(r1 + i) ≈ ∫_{−(r1−1)}^{r1(r2−1)} di/(r1 + i) = ln(r1·r2). (18)

Formulae (17) and (18) yield

r ≈ r1·r2·ln(r1·r2). (19)

Formulae (8), (9), (10), and (19) now yield

P2(r) ≈ P2(r1·r2·ln(r1·r2)) ≈ C^2/(r1·r2)^b. (20)

Denote by w(x) the inverse function of the injective function x → x ln x; then we have that

P2(r) = C^2/(w(r))^b, (21)

clearly indicating that, even asymptotically, P2 (i.e., the analogue of Zipf-Mandelbrot's law for 2-word phrases) cannot be a power function as in the case of (4) or (5).
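The integral-test estimate (18) behind this derivation can be checked numerically (the ranks r1 = 50 and r2 = 80 below are illustrative). The sum and the logarithm differ by a bounded constant (roughly Euler's constant), which is negligible against ln(r1·r2) when the ranks are large:

```python
import math

r1, r2 = 50, 80  # illustrative single-word ranks, taken fairly large

# Left side of (18): the exact sum over the admissible i from (15)-(16).
s = sum(1.0 / (r1 + i) for i in range(-(r1 - 1), r1 * (r2 - 1) + 1))

# Right side of (18): the integral, which evaluates to ln(r1*r2).
approx = math.log(r1 * r2)

print(s, approx)  # close; the gap is a bounded constant, not growing with rank
```

Since the gap stays bounded while ln(r1·r2) grows, the approximation (18), and hence the rank formula (19), becomes relatively exact at high ranks.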

However, forcing (21) into a power law (in r), as is the case with statistical fitting of data, shows that the exponent b′ in this new law is inferior to b. Indeed, writing

C^2/(w(r))^b ≈ D/r^{b′}, (22)

where D is also a constant, yields

b′ ≈ ln(D/C^2)/ln r + b·ln(w(r))/ln r. (23)

Hence, in terms of r1 and r2, and by definition of w, we have

b′ ≈ ln(D/C^2)/ln(r1·r2·ln(r1·r2)) + b·ln(r1·r2)/ln(r1·r2·ln(r1·r2)).

So

b′ ≈ ln(D/C^2)/ln(r1·r2·ln(r1·r2)) + b/(1 + ln(ln(r1·r2))/ln(r1·r2)). (24)

Since r1 and r2 are large, the first term can be made as small as we wish. Furthermore, in the second term, the value of the coefficient of b is clearly below 1. In conclusion: b′ < b in fitting. Note that we cannot put an equality in formula (22), since this would then not result in a power law D/r^{b′} with D a constant. We can also conclude from (24), since r1 and r2 are high, that

b′ ≈ b/(1 + ln(ln(r1·r2))/ln(r1·r2)). (25)

Summarizing this section on 2-word phrases, we have the following results:

Theorem

Under the assumption of independence of the occurrence of consecutive words, we obtain the following law for the occurrence of 2-word phrases in texts. Let P2(r) denote the


probability that a 2-word phrase at rank r occurs. Then, if r is large,

P2(r) = C^2/(w(r))^b,

where w is the inverse of the function x → x ln x, and where we suppose the law of Zipf-Mandelbrot

P1(r1) = C/r1^b

to be valid for single words. The rank r for 2-word phrases is obtained from the ranks r1 and r2 of the single words via the formula (for large r1 and r2)

r ≈ r1·r2·ln(r1·r2)

(hence w(r) ≈ r1·r2). Finally, when approximating (21) by a power of the type (22), we find that

b′ ≈ b/(1 + ln(ln(r1·r2))/ln(r1·r2)) < b.
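The theorem's rank formula can be verified by brute-force counting: by (11) and (13), the rank of the pair (r1, r2) is the number of pairs (r′1, r′2) of positive integer ranks with r′1·r′2 ≤ r1·r2. The ranks below are illustrative; agreement with r1·r2·ln(r1·r2) is within a few percent:

```python
import math

r1, r2 = 60, 70  # illustrative large single-word ranks
n = r1 * r2

# Exact count of pairs (x, y) with x*y <= n: for each x there are
# floor(n/x) admissible values of y.
rank = sum(n // x for x in range(1, n + 1))

predicted = n * math.log(n)  # formula (19): r ≈ r1*r2*ln(r1*r2)
print(rank, round(predicted))
```

The exact count slightly exceeds the prediction (the discrepancy grows only linearly in n, against n ln n for the main term), consistent with (19) being a high-rank approximation.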

III. The Case of m-Word Phrases

In this section, we study the probability of occurrence of m-word phrases (m = 2,3,4,...). The general case is similar to the case m = 2, though a lot more intricate. We again suppose the law of Zipf-Mandelbrot (8) to be valid for single words and we use high ranks. In the Appendix, we prove the following result.

Theorem

Let Pm(r) denote the probability that an m-word phrase at rank r occurs. Then, if r is large,

Pm(r) = C^m/(w_m(r))^b, (26)

where w_m is the inverse of the function

x → x·ln^{m−1}(x)/(m − 1). (27)

This rank r for m-word phrases is obtained from the ranks r1, r2, ..., rm of the single words via the formula (for large r1, ..., rm)

r ≈ (r1···rm/(m − 1))·ln^{m−1}(r1···rm) (28)

(hence w_m(r) ≈ r1···rm). Finally, when approximating (26) by a power law of type

D/r^{b′_m}, (29)

we have that

b′_m ≈ b/(1 + ln(ln^{m−1}(r1···rm)/(m − 1))/ln(r1···rm)) < b. (30)

If we take r1 ≈ ··· ≈ rm in the formula, it can be shown that the sequence b′_2, b′_3, ... is strictly decreasing.
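The same brute-force check works for m = 3: by (A3), the rank of a triple is the number of triples (r′1, r′2, r′3) with r′1·r′2·r′3 ≤ n = r1·r2·r3, and formula (28) predicts roughly (n/2)·ln²n. The ranks below are illustrative; since the neglected corrections are only logarithmically smaller than the main term, the agreement is rough:

```python
import math

r1, r2, r3 = 20, 25, 30  # illustrative single-word ranks
n = r1 * r2 * r3

# Exact count of triples (x, y, z) with x*y*z <= n: for each admissible
# pair (x, y) there are floor(n/(x*y)) values of z.
rank = sum(n // (x * y)
           for x in range(1, n + 1)
           for y in range(1, n // x + 1))

predicted = n * math.log(n)**2 / 2  # formula (28) with m = 3
print(rank, round(predicted))
```

As for m = 2, the count exceeds the leading-order prediction, and the relative gap shrinks (slowly, like 1/ln n) as the ranks grow.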

All these findings are in accordance with the experimental findings in Smith and Devine (1985), Hilberg (1988), and Meyer (1989a, 1989b), but we stress the fact that the Zipf-Mandelbrot exponents b′_m only apply to a power law that approximates the non-power law (26). So, we have presented a rationale for these experimental findings and found that the power law of Zipf-Mandelbrot is not correct in the case of m-word phrases. The formula

r ≈ (r1···rm/(m − 1))·ln^{m−1}(r1···rm) (28)

has independent interest: It gives a formula for the rank r of an m-word phrase as a function of the ranks r1, ..., rm of the single words, supposing the Zipf-Mandelbrot law for single words. Note that this formula is independent of the parameters C and b in this law and only uses the fact that a power law applies for single words! This finding in itself makes it clear that a classical Zipf-Mandelbrot power law cannot apply in the case of m-word phrases (m = 2,3,...).

IV. The Fractal Argument of Mandelbrot for 2-Word Phrases

We suppose that we have an alphabet of N letters. Words are formed with these letters, and these words are separated by a blank. In total, there are (k − 1)N^k possible 2-word phrases consisting of k letters in total. We put ρ as the probability for a letter to occur. Since there are also blanks, we have that ρ < 1/N. The probability to have a 2-word phrase consisting of k letters hence is

P2 = ρ^k(1 − Nρ)^2 (31)

(1 − Nρ is the probability to have a blank). Since this is decreasing in k, we can estimate the rank r of this word as follows (cf. the Mandelbrot argument for single words; see, e.g., Egghe & Rousseau, 1990):

N^2 + 2N^3 + ··· + (k − 2)N^{k−1} < r ≤ N^2 + 2N^3 + ··· + (k − 1)N^k. (32)

Indeed, 2-word phrases consist of at least 2 letters.¹ A long calculation, but one only involving sums of geometric series, yields

N + 2N^2 + ··· + iN^i = N/(N − 1)^2·(iN^{i+1} − (i + 1)N^i + 1) (33)

for all i. Applied in (32), this yields

(k − 2)N^{k−1} − (k − 1)N^{k−2} < ((N − 1)/N)^2·r − 1 ≤ (k − 1)N^k − kN^{k−1}. (34)

To fix r, we will take the average of both sides:

((N − 1)/N)^2·r − 1 ≈ (1/2)·[(k − 2)N^{k−1} − (k − 1)N^{k−2} + (k − 1)N^k − kN^{k−1}]
= (1/2)·N^{k−2}·[(N^2 − 1)(k − 1) − 2N],

hence

((N − 1)/N)^2·r − 1 ≈ (1/2)·N^{k−2}·(N^2 − 1)(k − 1). (35)

This is the (average) rank of 2-word phrases consisting of k letters in total. From (31), it follows that

P2(r) = ρ^k(1 − Nρ)^2 = ρ^{k−1}·ρ(1 − Nρ)^2.

Hence

k − 1 = ln(P2(r)/(ρ(1 − Nρ)^2))/ln ρ. (36)

Formula (36) in (35) yields

ln[(P2(r)/(ρ(1 − Nρ)^2))^{ln N/ln ρ}]·(P2(r)/(ρ(1 − Nρ)^2))^{ln N/ln ρ}
= (2N ln N/(N^2 − 1))·[((N − 1)/N)^2·r − 1] ≈ (2(ln N)(N − 1)/(N(N + 1)))·r (37)

for large r. So

(P2(r)/(ρ(1 − Nρ)^2))^{ln N/ln ρ} = w(ar), (38)

with

a = 2(ln N)(N − 1)/(N(N + 1)),

a fixed number, and where w is (as in formula 21) the inverse of the function x → x ln x. We finally have

P2(r) = D/(w(ar))^b, (39)

where

D = ρ(1 − Nρ)^2

and

b = −ln ρ/ln N (40)

(cf. b = 1/D_S, with D_S the fractal dimension of the text; cf. Egghe & Rousseau, 1990, p. 307).

Up to the appearance of a in (38), this formula is exactly the same as the one obtained in Section II (formula 21). For high r, we can even get rid of a as follows: If we put

x = (P2(r)/(ρ(1 − Nρ)^2))^{ln N/ln ρ}, (41)

then (37) reads

x ln x = ar. (42)

So

r = (1/a)·x ln x = (1/a)·x ln((1/a)·x) − (1/a)·x ln(1/a),

hence

r ≈ (1/a)·x ln((1/a)·x), (43)

hereby using that

¹ This also corrects the argument in Egghe and Rousseau (1990), p. 306, for single words: One should start with N instead of 1 in the analogous inequality, since words have at least 1 letter. This correction yields (N − 1)/N instead of N as the coefficient of r in the formula for P.


(1/a)·x ln(1/a) ≪ (1/a)·x ln((1/a)·x),

since

(1/a)·x ln((1/a)·x) = (1/a)·x ln(1/a) + (1/a)·x ln x ≫ (1/a)·x ln(1/a).

Here we use that r, and hence by (42) x, is large (a is fixed). From (43), it follows that

(1/a)·x ≈ w(r). (44)

Hence

(1/a)·(P2(r)/(ρ(1 − Nρ)^2))^{ln N/ln ρ} ≈ w(r),

yielding

P2(r) ≈ E/(w(r))^b,

with b still as in (40) and with

E = ρ(1 − Nρ)^2·a^{ln ρ/ln N}. (45)

Formula (45) is exactly of the form of formula (21). This gives a second rationale for the validity of (21) or (45) in the case of 2-word phrases, hence disproving again the validity of the Zipf-Mandelbrot power law. Analogous arguments can be given for m-word phrases. We leave this to the reader.
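The geometric-series identity (33), on which the rank estimates (32)-(35) rest, can be verified in exact integer arithmetic (multiplying through by (N − 1)² to avoid the division):

```python
# Check N + 2N^2 + ... + i*N^i = N/(N-1)^2 * (i*N^(i+1) - (i+1)*N^i + 1),
# i.e. identity (33), for a range of alphabet sizes N and indices i.
for N in range(2, 8):
    for i in range(1, 30):
        lhs = sum(j * N**j for j in range(1, i + 1))
        assert (N - 1)**2 * lhs == N * (i * N**(i + 1) - (i + 1) * N**i + 1)
print("identity (33) holds for the tested N and i")
```

Since both sides are polynomials in N of bounded degree for each i, such spot checks over many values give strong evidence for the identity the text proves by summing geometric series.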

Acknowledgment

The author is grateful to Prof. Dr. R. Rousseau for interesting discussions on the topic of this article.

Appendix

Proof of the Theorem in Section III on the Probability of Occurrence and on the Ranks of m-Word Phrases

Let the m-word phrase consist of m single words with ranks (amongst the single words) of occurrence r1, ..., rm. Supposing independence as before, we obtain the rank r of the m-word phrase (amongst all the m-word phrases) as the number of vectors (r′1, ..., r′m) for which

P1(r′1)···P1(r′m) ≥ P1(r1)···P1(rm), (A1)

where

P1(x) = C/x^b (A2)

is the classical Zipf-Mandelbrot law for single words. (A1) and (A2) yield the condition

r′1···r′m ≤ r1···rm. (A3)

We have to count all possibilities. For r′m this is

r′m ≤ r1···rm/(r′1···r′_{m-1}), (A4)

and for r′1, ..., r′_{m-1} we put

r′ℓ = rℓ + kℓ, (A5)

ℓ = 1, ..., m − 1. Of course,

kℓ ≥ −(rℓ − 1) (A6)

for all ℓ = 1, ..., m − 1, since ranks must be larger than, or equal to, 1. For the same reason, (A4) implies

r1···rm/((r1 + k1)···(r_{m-1} + k_{m-1})) ≥ 1,

hence

k_{m-1} ≤ [r1···rm − (r1 + k1)···(r_{m-2} + k_{m-2})·r_{m-1}]/[(r1 + k1)···(r_{m-2} + k_{m-2})]. (A7)

Applying (A7) and (A6) for ℓ = m − 1 yields

k_{m-2} ≤ [r1···rm − (r1 + k1)···(r_{m-3} + k_{m-3})·r_{m-2}]/[(r1 + k1)···(r_{m-3} + k_{m-3})]. (A8)

Continuing in this way, we obtain

kℓ ≤ [r1···rm − (r1 + k1)···(r_{ℓ-1} + k_{ℓ-1})·rℓ]/[(r1 + k1)···(r_{ℓ-1} + k_{ℓ-1})] (A9)

for all ℓ = 2, ..., m − 1. Finally, using this for ℓ = 2 and (A6) for ℓ = 2, we obtain

k1 ≤ r1(r2···rm − 1). (A10)

The formulae (A5), (A9), and (A10) give us all possibilities and hence, using (A4), we obtain

r ≈ r1···rm·Σ_{k1} 1/(r1 + k1) ··· Σ_{k_{m-1}} 1/(r_{m-1} + k_{m-1}), (A11)


where each kℓ ranges between the values indicated in (A6), (A9), and (A10) (for ℓ = 1). Denoting the upper values in (A9) by aℓ and approximating the sums by integrals (using the integral test for series), we find

r ≈ r1···rm·∫_{−(r1−1)}^{r1(r2···rm−1)} dk1/(r1 + k1) ··· ∫_{−(r_{m-ℓ′}−1)}^{a_{m-ℓ′}} dk_{m-ℓ′}/(r_{m-ℓ′} + k_{m-ℓ′}) ··· ∫_{−(r_{m-1}−1)}^{a_{m-1}} dk_{m-1}/(r_{m-1} + k_{m-1}), (A12)

where ℓ′ denotes an arbitrary value between 1 and m − 1 (these cases are given explicitly). This intricate form is calculated as follows. Case ℓ′ = 1:

∫_{−(r_{m-1}−1)}^{a_{m-1}} dk_{m-1}/(r_{m-1} + k_{m-1}) = [ln(r_{m-1} + k_{m-1})] evaluated from k_{m-1} = −(r_{m-1} − 1) to k_{m-1} = a_{m-1},

with a_{m-1} given by the upper bound in (A9) for ℓ = m − 1. This gives the value

ln[r1···rm/((r1 + k1)···(r_{m-2} + k_{m-2}))].

After calculating the further factors this way, it becomes clear what the general formula is going to be. We will prove it by complete induction: Suppose that we have the result

(1/(ℓ′ − 1))·ln^{ℓ′−1}[r1···rm/((r1 + k1)···(r_{m-ℓ′} + k_{m-ℓ′}))]

after l 9 2 1 steps. Stepl 9 is then

∫_{−(r_{m-ℓ′}−1)}^{a_{m-ℓ′}} (1/(ℓ′ − 1))·ln^{ℓ′−1}[r1···rm/((r1 + k1)···(r_{m-ℓ′} + k_{m-ℓ′}))]/(r_{m-ℓ′} + k_{m-ℓ′}) dk_{m-ℓ′}

= ∫_{−(r_{m-ℓ′}−1)}^{a_{m-ℓ′}} (1/(ℓ′ − 1))·ln^{ℓ′−1}[r1···rm/((r1 + k1)···(r_{m-ℓ′-1} + k_{m-ℓ′-1}))]/(r_{m-ℓ′} + k_{m-ℓ′}) dk_{m-ℓ′}
− ∫_{−(r_{m-ℓ′}−1)}^{a_{m-ℓ′}} (1/(ℓ′ − 1))·ln^{ℓ′−1}(r_{m-ℓ′} + k_{m-ℓ′})/(r_{m-ℓ′} + k_{m-ℓ′}) dk_{m-ℓ′},

where a_{m-ℓ′} is given as the upper bound in (A9) for ℓ = m − ℓ′. This gives the result

(1/ℓ′)·ln^{ℓ′}[r1···rm/((r1 + k1)···(r_{m-ℓ′-1} + k_{m-ℓ′-1}))],

concluding the induction step. By (A12) and the above result, we can now conclude that

r ≈ (r1···rm/(m − 1))·ln^{m−1}(r1···rm). (A13)

Since we have, by definition of Pm(r), that

Pm(r) = P1(r1)···P1(rm), (A14)

we now have that

Pm((r1···rm/(m − 1))·ln^{m−1}(r1···rm)) = C^m/(r1···rm)^b, (A15)

or else, by (A13)

Pm(r) = C^m/(w_m(r))^b, (A16)

where w_m is the inverse function of the injective function

x → x·ln^{m−1}(x)/(m − 1). (A17)

An analogous argument, as in the case m = 2, now yields that, if we put

C^m/(w_m(r))^b ≈ Dm/r^{b′_m} (A18)

(cf. (22)), we find

b′_m ≈ b/(1 + ln(ln^{m−1}(r1···rm)/(m − 1))/ln(r1···rm)). (A19)

It is clear that b′_m < b for every m = 2,3, .... Since formula (A19) is valid for all r1, ..., rm which are large, we can study the behavior of (A19) in case r1 ≈ ··· ≈ rm (denoted as r0) and in case r0 is large. It is clear that the sequence b′_m decreases if the sequence am increases, where


am = ln(ln^{m−1}(r1···rm)/(m − 1))/ln(r1···rm). (A20)

Now we have

am = ln(ln^{m−1}(r0^m)/(m − 1))/ln(r0^m) = [(m − 1)·ln(m ln r0) − ln(m − 1)]/(m ln r0). (A21)

Now a_{m+1} > am iff

ln r0 > m^{m²+m−1}/((m − 1)^{m+1}·(m + 1)^{m²}), (A22)

which is satisfied for all r0 ≥ 3 on (and this is so, since we supposed r0 to be large). Indeed, the critical r0-values for a_{m+1} > am are (m = 2,3,4,5,...):

a_{m+1} > am from r0 on:

m = 2: r0 = 2.480
m = 3: r0 = 1.100
m = 4: r0 = 1.020
m = 5: r0 = 1.004
...

This concludes the proof that the sequence b′_m is decreasing in m.
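The monotonicity just proved can also be checked directly from (A21) and (A19); b = 1 and r0 = 3 below are illustrative choices:

```python
import math

def a(m, r0):
    # a_m from (A21): ((m-1)*ln(m*ln r0) - ln(m-1)) / (m*ln r0)
    t = math.log(r0)
    return ((m - 1) * math.log(m * t) - math.log(m - 1)) / (m * t)

b, r0 = 1.0, 3.0  # illustrative exponent and common single-word rank
a_vals = [a(m, r0) for m in range(2, 11)]
b_vals = [b / (1 + am) for am in a_vals]  # b'_m from (A19) with r1 = ... = rm = r0

# a_m increases with m, so the fitted exponents b'_m strictly decrease.
assert all(x < y for x, y in zip(a_vals, a_vals[1:]))
assert all(x > y for x, y in zip(b_vals, b_vals[1:]))
print([round(v, 3) for v in b_vals])
```

This reproduces qualitatively the experimentally observed decrease of the fitted Zipf-Mandelbrot exponent with the phrase length m.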

References

Aalbersberg, IJ., Brandsma, E., & Corthout, M. (1991). Full-text document retrieval: From theory to applications. In Kempen & de Vroomen (Eds.), Informatiewetenschap 1991 (pp. 3-17). Leiden: Stinfon.

Allen, J.R. (1974). Methods of author identification through stylistic analysis. The French Review, XLVII(5), 904-916.

Brinegar, C. (1963). Mark Twain and the Quintus Curtius Snodgrass letters: A statistical test of authorship. Journal of the American Statistical Association, 58, 85-96.

Chen, Y.-S. (1989). Zipf's law in text modeling. International Journal of General Systems, 15, 233-252.

Chen, Y.-S. (1991). Statistical models of text in continuous speech recognition. Kybernetes, 20(5), 29-40.

Cox, D.R., & Brandwood, L. (1959). On a discriminatory problem connected with the works of Plato. Journal of the Royal Statistical Society [B], 21, 195-200.

Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Quantitative methods in library, documentation and information science. Amsterdam: Elsevier.

Egghe, L., & Rousseau, R. (1997). Duality in information retrieval and the hypergeometric distribution. Journal of Documentation, 53(5), 488-496.

Feder, J. (1988). Fractals. New York: Plenum.

Heaps, H.S. (1978). Information retrieval: Computational and theoretical aspects. New York: Academic Press.

Herdan, G. (1964). Quantitative linguistics. London: Butterworths.

Hilberg, W. (1988). Das Netzwerk der menschlichen Sprache und Grundzüge einer entsprechend gebauten Sprachmaschine. ntzArchiv, 10, 133-146.

Huffman, D.A. (1952). A method for the construction of minimum redundancy codes. Proceedings of the IRE, 40, 1098-1101.

Jones, D.S. (1979). Elementary information theory (Oxford applied mathematics and computing series). Oxford: Clarendon Press.

Mandelbrot, B. (1954). Structure formelle des textes et communication. Word, 10, 1-27.

Mandelbrot, B.B. (1977a). Fractals, form, chance and dimension. San Francisco: Freeman.

Mandelbrot, B.B. (1977b). The fractal geometry of nature. New York: Freeman.

McColly, W., & Weier, D. (1983). Literary attribution and likelihood-ratio tests: The case of the middle English PEARL-poems. Computers and the Humanities, 17, 65-75.

Meyer, J. (1989a). Gilt das Zipfsche Gesetz auch für die chinesische Schriftsprache? ntzArchiv, 11, 13-16.

Meyer, J. (1989b). Statistische Zusammenhänge in Texten. ntzArchiv, 11, 287-294.

Michaelson, S., & Morton, A.Q. (1971-1972). Last words. A test of authorship for Greek authors. New Testament Studies, 18, 192-208.

Rousseau, R. (1997). A fractal approach to word occurrences in texts: The Zipf-Mandelbrot law for multi-word phrases. Kelpro Bulletin (to appear).

Salton, G., & McGill, M.J. (1984). Introduction to modern information retrieval. Auckland, New Zealand: McGraw-Hill.

Schuegraf, E.J., & Heaps, H.S. (1973). Selection of equifrequent word fragments for information retrieval. Information Storage and Retrieval, 9, 697-711.

Smith, F.J., & Devine, K. (1985). Storing and retrieving word phrases. Information Processing and Management, 21, 215-224.

Smith, M.W.A. (1983). Recent experience and new developments of methods for the determination of authorship. Association for Literary and Linguistic Computing Bulletin, 11(3), 73-82.

Ule, L. (1982). Recent progress in computer methods of authorship determination. Association for Literary and Linguistic Computing Bulletin, 10(3), 73-89.

Wickmann, D. (1976). On disputed authorship, statistically. Association for Literary and Linguistic Computing Bulletin, 4(1), 32-41.

Zipf, G.K. (1965). Human behavior and the principle of least effort. New York: Hafner. (Original work published 1949, Cambridge: Addison-Wesley)
