
Prediction and Entropy of Printed English

By C. E. SHANNON

(Manuscript Received Sept. 15, 1950)

A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.

1. INTRODUCTION

In a previous paper¹ the entropy and redundancy of a language have been defined. The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language. The redundancy, on the other hand, measures the amount of constraint imposed on a text in the language due to its statistical structure, e.g., in English the high frequency of the letter E, the strong tendency of H to follow T or of U to follow Q. It was estimated that when statistical effects extending over not more than eight letters are considered the entropy is roughly 2.3 bits per letter, the redundancy about 50 per cent.

Since then a new method has been found for estimating these quantities, which is more sensitive and takes account of long range statistics, influences extending over phrases, sentences, etc. This method is based on a study of the predictability of English: how well can the next letter of a text be predicted when the preceding N letters are known? The results of some experiments in prediction will be given, and a theoretical analysis of some of the properties of ideal prediction. By combining the experimental and theoretical results it is possible to estimate upper and lower bounds for the entropy and redundancy. From this analysis it appears that, in ordinary literary English, the long range statistical effects (up to 100 letters) reduce the entropy to something of the order of one bit per letter, with a corresponding redundancy of roughly 75%. The redundancy may be still higher when structure extending over paragraphs, chapters, etc. is included. However, as the lengths involved are increased, the parameters in question become more erratic and uncertain, and they depend more critically on the type of text involved.

1 C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, v. 27, pp. 379-423, 623-656, July, October, 1948.


2. ENTROPY CALCULATION FROM THE STATISTICS OF ENGLISH

One method of calculating the entropy H is by a series of approximations F_0, F_1, F_2, ..., which successively take more and more of the statistics of the language into account and approach H as a limit. F_N may be called the N-gram entropy; it measures the amount of information or entropy due to statistics extending over N adjacent letters of text. F_N is given by

F_N = -\sum_{i,j} p(b_i, j) \log_2 p_{b_i}(j)
    = -\sum_{i,j} p(b_i, j) \log_2 p(b_i, j) + \sum_i p(b_i) \log_2 p(b_i)    (1)


in which: b_i is a block of N-1 letters [(N-1)-gram];

j is an arbitrary letter following b_i;

p(b_i, j) is the probability of the N-gram b_i, j;

p_{b_i}(j) is the conditional probability of letter j after the block b_i, and is given by p(b_i, j)/p(b_i).

The equation (1) can be interpreted as measuring the average uncertainty (conditional entropy) of the next letter j when the preceding N-1 letters are known. As N is increased, F_N includes longer and longer range statistics and the entropy, H, is given by the limiting value of F_N as N → ∞:

H = \lim_{N \to \infty} F_N.    (2)
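As an illustrative aside (not part of the original paper), the second form of (1) lends itself to direct computation: F_N can be estimated from any long sample of text by replacing the probabilities with observed relative frequencies. The following minimal Python sketch does this; the file name sample.txt and the decision to keep the space as a 27th symbol are assumptions of the sketch.

# Minimal sketch: estimate F_N of equation (1) from observed N-gram frequencies.
# sample.txt is a placeholder for any long sample of English text.
from collections import Counter
from math import log2

def entropy(counter):
    total = sum(counter.values())
    return -sum(c / total * log2(c / total) for c in counter.values())

def f_n(text, n):
    # Second form of (1): F_N = H(N-grams) - H((N-1)-grams).
    text = "".join(ch for ch in text.upper() if ch.isalpha() or ch == " ")
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    if n == 1:
        return entropy(ngrams)
    blocks = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    return entropy(ngrams) - entropy(blocks)

sample = open("sample.txt").read()
print([round(f_n(sample, n), 2) for n in (1, 2, 3)])  # estimates of F_1, F_2, F_3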

The N-gram entropies F_N for small values of N can be calculated from standard tables of letter, digram and trigram frequencies.² If spaces and punctuation are ignored we have a twenty-six letter alphabet and F_0 may be taken (by definition) to be log_2 26, or 4.7 bits per letter. F_1 involves letter frequencies and is given by

F_1 = -\sum_{i=1}^{26} p(i) \log_2 p(i) = 4.14 bits per letter.    (3)

The digram approximation F_2 gives the result

F_2 = -\sum_{i,j} p(i, j) \log_2 p_i(j)
    = -\sum_{i,j} p(i, j) \log_2 p(i, j) + \sum_i p(i) \log_2 p(i)    (4)
    = 7.70 - 4.14 = 3.56 bits per letter.

2 Fletcher Pratt, "Secret and Urgent," Blue Ribbon Books, 1942.


The trigram entropy is given by

F_3 = -\sum_{i,j,k} p(i, j, k) \log_2 p_{ij}(k)
    = -\sum_{i,j,k} p(i, j, k) \log_2 p(i, j, k) + \sum_{i,j} p(i, j) \log_2 p(i, j)    (5)
    = 11.0 - 7.7 = 3.3 bits per letter.


In this calculation the trigram table² used did not take into account trigrams bridging two words, such as WOW and OWO in TWO WORDS. To compensate partially for this omission, corrected trigram probabilities p(i, j, k) were obtained from the probabilities p'(i, j, k) of the table by the following rough formula:

p(i, j, k) = (2.5/4.5) p'(i, j, k) + (1/4.5) r(i) p(j, k) + (1/4.5) p(i, j) s(k)

where r(i) is the probability of letter i as the terminal letter of a word and s(k) is the probability of k as an initial letter. Thus the trigrams within words (an average of 2.5 per word) are counted according to the table; the bridging trigrams (one of each type per word) are counted approximately by assuming independence of the terminal letter of one word and the initial digram in the next, or vice versa. Because of the approximations involved here, and also because of the fact that the sampling error in identifying probability with sample frequency is more serious, the value of F_3 is less reliable than the previous numbers.

Since tables of N-gram frequencies were not available for N > 3, F_4, F_5, etc. could not be calculated in the same way. However, word frequencies have been tabulated³ and can be used to obtain a further approximation. Figure 1 is a plot on log-log paper of the probabilities of words against frequency rank. The most frequent English word "the" has a probability .071 and this is plotted against 1. The next most frequent word "of" has a probability of .034 and is plotted against 2, etc. Using logarithmic scales both for probability and rank, the curve is approximately a straight line with slope -1; thus, if p_n is the probability of the nth most frequent word, we have, roughly

p_n = .1/n    (6)

Fig. 1 - Relative frequency against rank for English words.

3 G. Dewey, "Relative Frequency of English Speech Sounds," Harvard University Press, 1923.
4 G. K. Zipf, "Human Behavior and the Principle of Least Effort," Addison-Wesley Press, 1949.

Zipf⁴ has pointed out that this type of formula, p_n = k/n, gives a rather good approximation to the word probabilities in many different languages. The formula (6) clearly cannot hold indefinitely since the total probability \sum p_n must be unity, while \sum .1/n is infinite. If we assume (in the absence of any better estimate) that the formula p_n = .1/n holds out to the n at which the


total probability is unity, and that p_n = 0 for larger n, we find that the critical n is the word of rank 8,727. The entropy is then:


-\sum_{n=1}^{8727} p_n \log_2 p_n = 11.82 bits per word,    (7)

or 11.82/4.5 = 2.62 bits per letter since the average word length in English is 4.5 letters. One might be tempted to identify this value with F_{4.5}, but actually the ordinate of the F_N curve at N = 4.5 will be above this value. The reason is that F_4 or F_5 involves groups of four or five letters regardless of word division. A word is a cohesive group of letters with strong internal statistical influences, and consequently the N-grams within words are more restricted than those which bridge words. The effect of this is that we have obtained, in 2.62 bits per letter, an estimate which corresponds more nearly to, say, F_5 or F_6.
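A brief computational sketch of the cutoff-Zipf estimate just described (an editorial illustration, not Shannon's own calculation): it accumulates p_n = .1/n until the total probability reaches one and then evaluates the entropy per word and per letter. Because the real word-frequency tables depart from the idealized .1/n law, the rank and entropy it prints are only rough counterparts of the figures quoted above.

# Sketch of the cutoff-Zipf word-entropy estimate, assuming p_n = .1/n exactly.
from math import log2

total, rank = 0.0, 0
while total < 1.0:
    rank += 1
    total += 0.1 / rank
probs = [0.1 / n for n in range(1, rank + 1)]
h_word = -sum(p * log2(p) for p in probs)
print(rank)                      # critical rank at which the probabilities sum to one
print(round(h_word, 2))          # entropy in bits per word
print(round(h_word / 4.5, 2))    # bits per letter, at 4.5 letters per word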

A similar set of calculations was carried out including the space as an additional letter, giving a 27-letter alphabet. The results of both 26- and 27-letter calculations are summarized below:

            F_0     F_1     F_2     F_3     F_word
26 letter   4.70    4.14    3.56    3.3     2.62
27 letter   4.76    4.03    3.32    3.1     2.14

The estimate of 2.3 for F_8, alluded to above, was found by several methods, one of which is the extrapolation of the 26-letter series above out to that point. Since the space symbol is almost completely redundant when sequences of one or more words are involved, the values of F_N in the 27-letter case will be 4.5/5.5 or .818 of F_N for the 26-letter alphabet when N is reasonably large.

3. PREDICTION OF ENGLISH

The new method of estimating entropy exploits the fact that anyone speaking a language possesses, implicitly, an enormous knowledge of the statistics of the language. Familiarity with the words, idioms, cliches and grammar enables him to fill in missing or incorrect letters in proof-reading, or to complete an unfinished phrase in conversation. An experimental demonstration of the extent to which English is predictable can be given as follows: Select a short passage unfamiliar to the person who is to do the predicting. He is then asked to guess the first letter in the passage. If the guess is correct he is so informed, and proceeds to guess the second letter. If not, he is told the correct first letter and proceeds to his next guess. This is continued through the text. As the experiment progresses, the subject writes down the correct text up to the current point for use in predicting future letters. The result of a typical experiment of this type is given below. Spaces were included as an additional letter, making a 27-letter alphabet. The first line is the original text; the second line contains a dash for each letter correctly guessed. In the case of incorrect guesses the correct letter is copied in the second line.


(1) THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
(2) ----ROO------NOT-V-----I------SM----OBL----

(1) READING LAMP ON THE DESK SHED GLOW ON
(2) REA----------O------D----SHED-GLO--O--

(1) POLISHED WOOD BUT LESS ON THE SHABBY RED CARPET
(2) P-L-S------O--BU--L-S--O------SH-----RE--C------    (8)


Of a total of 129 letters, 89 or 69% were guessed correctly. The errors, as would be expected, occur most frequently at the beginning of words and syllables where the line of thought has more possibility of branching out. It might be thought that the second line in (8), which we will call the reduced text, contains much less information than the first. Actually, both lines contain the same information in the sense that it is possible, at least in principle, to recover the first line from the second. To accomplish this we need an identical twin of the individual who produced the sequence. The twin (who must be mathematically, not just biologically identical) will respond in the same way when faced with the same problem. Suppose, now, we have only the reduced text of (8). We ask the twin to guess the passage. At each point we will know whether his guess is correct, since he is guessing the same as the first twin and the presence of a dash in the reduced text corresponds to a correct guess. The letters he guesses wrong are also available, so that at each stage he can be supplied with precisely the same information the first twin had available.
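The reduction and recovery just described can be made concrete with a short sketch (an editorial illustration; the predictor used here is an arbitrary stand-in, not the human subject of the experiment). The same deterministic predictor is applied once to produce the reduced text and once, playing the role of the twin, to reconstruct the original from it.

# Sketch of reduction to "reduced text" and recovery by an identical predictor.
def predict(known_text):
    # Stand-in predictor: always guesses the most frequent symbol, the space.
    # Any deterministic function of the known text would serve equally well.
    return " "

def reduce_text(original):
    out = []
    for i, ch in enumerate(original):
        out.append("-" if predict(original[:i]) == ch else ch)  # dash = correct guess
    return "".join(out)

def recover_text(reduced):
    out = ""
    for ch in reduced:
        out += predict(out) if ch == "-" else ch  # the "twin" repeats the same guess
    return out

original = "THE ROOM WAS NOT VERY LIGHT"
reduced = reduce_text(original)
assert recover_text(reduced) == original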

Fig. 2 - Communication system using reduced text.

The need for an identical twin in this conceptual experiment can be eliminated as follows. In general, good prediction does not require knowledge of more than N preceding letters of text, with N fairly small. There are only a finite number of possible sequences of N letters. We could ask the subject to guess the next letter for each of these possible N-grams. The complete list of these predictions could then be used both for obtaining the reduced text from the original and for the inverse reconstruction process.

To put this another way, the reduced text can be considered to be an encoded form of the original, the result of passing the original text through a reversible transducer. In fact, a communication system could be constructed in which only the reduced text is transmitted from one point to the other. This could be set up as shown in Fig. 2, with two identical prediction devices.

An extension of the above experiment yields further information concerning the predictability of English. As before, the subject knows the text up to the current point and is asked to guess the next letter. If he is wrong, he is told so and asked to guess again. This is continued until he finds the correct letter. A typical result with this experiment is shown below. The


first line is the original text and the numbers in the second line indicate the guess at which the correct letter was obtained.

(1) THERE IS NO REVERSE ON A MOTORCYCLE A
(2) 1 1 1 5 11 2 11 2 1115 1 17 1 1 1 21 3 21 22 7 1 1 1 1 4 1 1 1 1 1 3 1

(1) FRIEND OF MINE FOUND THIS OUT
(2) 8 6 1 3 1 11 1 11 1 1 1 11 6 2 1 1 11 1 1 2 11 1 1 1 1

(1) RATHER DRAMATICALLY THE OTHER DAY
(2) 4 1 1 1 1 11 11 5 1 1 1 1 1 1 1 1 1 11 6 1 11 1 1 1 1 11 1 1 1 1    (9)

Out of 102 symbols the subject guessed right on the first guess 79 times, on the second guess 8 times, on the third guess 3 times, the fourth and fifth guesses 2 each, and only eight times required more than five guesses. Results of this order are typical of prediction by a good subject with ordinary literary English. Newspaper writing, scientific work and poetry generally lead to somewhat poorer scores.

The reduced text in this case also contains the same information as the original. Again utilizing the identical twin we ask him at each stage to guess as many times as the number given in the reduced text and recover in this way the original. To eliminate the human element here we must ask our subject, for each possible N-gram of text, to guess the most probable next letter, the second most probable next letter, etc. This set of data can then serve both for prediction and recovery.

Just as before, the reduced text can be considered an encoded version of the original. The original language, with an alphabet of 27 symbols, A, B, ..., Z, space, has been translated into a new language with the alphabet 1, 2, ..., 27. The translating has been such that the symbol 1 now has an extremely high frequency. The symbols 2, 3, 4 have successively smaller frequencies and the final symbols 20, 21, ..., 27 occur very rarely. Thus the translating has simplified to a considerable extent the nature of the statistical structure involved. The redundancy which originally appeared in complicated constraints among groups of letters has, by the translating process, been made explicit to a large extent in the very unequal probabilities of the new symbols. It is this, as will appear later, which enables one to estimate the entropy from these experiments.
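The translation into the 27-symbol rank alphabet can likewise be sketched in a few lines (again an editorial illustration with a stand-in predictor that ignores its context; a real subject or N-gram table would rank the symbols conditionally on the preceding text). The point is only that the mapping is reversible and that it concentrates probability heavily on symbol 1.

# Sketch of the rank "translation": each letter is replaced by its rank in the
# predictor's ordering of the 27 symbols, and the mapping is exactly invertible.
ALPHABET = " ETAONIRSHDLUCMFWYGPBVKXJQZ"   # a rough frequency order, space first

def ranking(known_text):
    return ALPHABET                        # stand-in: the context is ignored here

def encode(text):
    return [ranking(text[:i]).index(ch) + 1 for i, ch in enumerate(text)]

def decode(ranks):
    out = ""
    for r in ranks:
        out += ranking(out)[r - 1]
    return out

message = "THE OTHER DAY"
ranks = encode(message)
assert decode(ranks) == message
print(ranks.count(1), "of", len(ranks), "symbols are mapped into 1")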

In order to determine how predictability depends on the number N of preceding letters known to the subject, a more involved experiment was carried out. One hundred samples of English text were selected at random from a book, each fifteen letters in length. The subject was required to guess the text, letter by letter, for each sample as in the preceding experiment. Thus one hundred samples were obtained in which the subject had available 0, 1, 2, 3, ..., 14 preceding letters. To aid in prediction the subject made such use as he wished of various statistical tables, letter, digram and trigram


tables, a table of the frequencies of initial letters in words, a list of the frequencies of common words, and a dictionary. The samples in this experiment were from "Jefferson the Virginian" by Dumas Malone. These results, together with a similar test in which 100 letters were known to the subject, are summarized in Table I. The column corresponds to the number of preceding letters known to the subject plus one; the row is the number of the guess. The entry in column N at row S is the number of times the subject guessed the right letter at the Sth guess when (N-1) letters were known. For example,

TABLE I

1    2    3   4   5   6   7   8   9   10   11   12   13   14   15   100

I 18.2 29.2 36 47 51 58 48 66 66 67 62 58 66 721

60 802 10.7 14.8 20 18 13 19 17 15 13 10 9 14 9 6 18 73 8.6 10.0 12 14 8 5 3 5 9 4 7 7 4 9 j 54 6.7 8.6 7 3 4 1 4 4 4 4 5 6 4 3 5 35 6.5 7.1 1 1 3 4 3 6 1 6 5 2 3 46 5.8 5.5 4 5 2 3 2 1 4 2 3 4 1 27 5.6 4.5 3 3 2 2 8 1 1 1 4 1 4 18 5.2 3.6 2 2 1 1 2 1 1 1 1 2 1 39 5.0 3.0 4 5 1 4 2 1 1 2 1 1

10 4.3 2.6 2 1 3 3 1 211 3.1 2.2 2 2 2 1 1 3 1 1 2 112 2.8 1.9 4 2 1 1 1 2 1 1 1 113 2.4 1.5 1 1 1 1 1 1 1 1 1 114 2.3 1.2 ,-;'1 1 1 115 2.1 1.0 1 1 1 1 116 2.0 .9 1 1 117 1.6 .7 1 2 1 1 1 2 218 1.6 .5 119 1.6 .4 1 1 1 120 1.3 .3 1 1 121 1.2 .222 .8 .123 .3 .124 .1 .025 .126 .127 .1

the entry 19 in column 6, row 2, means that with five letters known the correct letter was obtained on the second guess nineteen times out of the hundred. The first two columns of this table were not obtained by the experimental procedure outlined above but were calculated directly from the known letter and digram frequencies. Thus with no known letters the most probable symbol is the space (probability .182); the next guess, if this is wrong, should be E (probability .107), etc. These probabilities are the frequencies with which the right guess would occur at the first, second, etc., trials with best prediction. Similarly, a simple calculation from the digram table gives the entries in column 2 when the subject uses the table to best


advantage. Since the frequency tables are determined from long samples of English, these two columns are subject to less sampling error than the others.

It will be seen that the prediction gradually improves, apart from some statistical fluctuation, with increasing knowledge of the past, as indicated by the larger numbers of correct first guesses and the smaller numbers of high rank guesses.

One experiment was carried out with "reverse" prediction, in which the subject guessed the letter preceding those already known. Although the task is subjectively much more difficult, the scores were only slightly poorer. Thus, with two 101-letter samples from the same source, the subject obtained the following results:

No. of guess   1    2   3   4   5   6   7   8   >8
Forward       70   10   7   2   2   3   3   0    4
Reverse       66    7   4   4   6   2   1   2    9

Incidentally, the N-gram entropy F_N for a reversed language is equal to that for the forward language, as may be seen from the second form in equation (1). Both terms have the same value in the forward and reversed cases.

4. IDEAL N-GRAM PREDICTION

The data of Table I can be used to obtain upper and lower bounds to the N-gram entropies F_N. In order to do this, it is necessary first to develop some general results concerning the best possible prediction of a language when the preceding N letters are known. There will be for the language a set of conditional probabilities p_{i_1, i_2, ..., i_{N-1}}(j). This is the probability, when the (N-1) gram i_1, i_2, ..., i_{N-1} occurs, that the next letter will be j. The

best guess for the next letter, when this (N-1) gram is known to have occurred, will be that letter having the highest conditional probability. The second guess should be that with the second highest probability, etc. A machine or person guessing in the best way would guess letters in the order of decreasing conditional probability. Thus the process of reducing a text with such an ideal predictor consists of a mapping of the letters into the numbers from 1 to 27 in such a way that the most probable next letter [conditional on the known preceding (N-1) gram] is mapped into 1, etc. The frequency of 1's in the reduced text will then be given by

q_1^N = \sum p(i_1, i_2, ..., i_{N-1}, j)    (10)

where the sum is taken over all (N-1) grams i_1, i_2, ..., i_{N-1}, the j being the one which maximizes p for that particular (N-1) gram. Similarly, the frequency of 2's, q_2^N, is given by the same formula with j chosen to be that letter having the second highest value of p, etc.
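A small sketch of equation (10) (an editorial illustration on a toy table, not English data): for each (N-1)-gram the conditional probabilities are sorted in decreasing order, and the q_i^N are obtained by summing the table by rank.

# Sketch of equation (10): rank frequencies q_1, q_2, ... of the ideal predictor.
def rank_frequencies(ngram_probs):
    # ngram_probs maps (block, next_letter) -> p(block, next_letter).
    rows = {}
    for (block, letter), p in ngram_probs.items():
        rows.setdefault(block, []).append(p)
    q = {}
    for probs in rows.values():
        for rank, p in enumerate(sorted(probs, reverse=True), start=1):
            q[rank] = q.get(rank, 0.0) + p
    return [q.get(r, 0.0) for r in range(1, max(q) + 1)]

toy = {("T", "H"): 0.30, ("T", "O"): 0.10, ("T", "E"): 0.10,
       ("E", " "): 0.35, ("E", "R"): 0.10, ("E", "D"): 0.05}
print(rank_frequencies(toy))   # -> roughly [0.65, 0.20, 0.15]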

On the basis of N-grams, a different set of probabilities for the symbols


in the reduced text, q_1^{N+1}, q_2^{N+1}, ..., q_{27}^{N+1}, would normally result. Since this prediction is on the basis of a greater knowledge of the past, one would expect the probabilities of low numbers to be greater, and in fact one can prove the following inequalities:

\sum_{i=1}^{S} q_i^{N+1} \ge \sum_{i=1}^{S} q_i^N,    S = 1, 2, ....    (11)

This means that the probability of being right in the first S guesses when the preceding N letters are known is greater than or equal to that when only (N-1) are known, for all S. To prove this, imagine the probabilities p(i_1, i_2, ..., i_N, j) arranged in a table with j running horizontally and all the N-grams vertically. The table will therefore have 27 columns and 27^N

rows. The term on the left of (11) is the sum of the S largest entries in each row, summed over all the rows. The right-hand member of (11) is also a sum of entries from this table in which S entries are taken from each row but not necessarily the S largest. This follows from the fact that the right-hand member would be calculated from a similar table with (N-1) grams rather than N-grams listed vertically. Each row in the N-1 gram table is the sum of 27 rows of the N-gram table, since:

p(i_2, i_3, ..., i_N, j) = \sum_{i_1=1}^{27} p(i_1, i_2, ..., i_N, j).    (12)

The sum of the S largest entries in a row of the N-1 gram table will equal the sum of the 27S selected entries from the corresponding 27 rows of the N-gram table only if the latter fall into S columns. For the equality in (11) to hold for a particular S, this must be true of every row of the N-1 gram table. In this case, the first letter of the N-gram does not affect the set of the S most probable choices for the next letter, although the ordering within the set may be affected. However, if the equality in (11) holds for all S, it follows that the ordering as well will be unaffected by the first letter of the N-gram. The reduced text obtained from an ideal N-1 gram predictor is then identical with that obtained from an ideal N-gram predictor.

Since the partial sums

Q_S^N = \sum_{i=1}^{S} q_i^N,    S = 1, 2, ..., 27    (13)


are monotonic increasing functions of N, ≤ 1 for all N, they must all approach limits as N → ∞. Their first differences must therefore approach limits as N → ∞, i.e., the q_i^N approach limits, q_i^∞. These may be interpreted as the relative frequency of correct first, second, ..., guesses with knowledge of the entire (infinite) past history of the text.

5 Hardy, Littlewood and Polya, "Inequalities," Cambridge University Press, 1934.


The ideal N-gram predictor can be considered, as has been pointed out, to be a transducer which operates on the language, translating it into a sequence of numbers running from 1 to 27. As such it has the following two properties:

1. The output symbol is a function of the present input (the predicted next letter when we think of it as a predicting device) and the preceding (N-1) letters.

2. It is instantaneously reversible. The original input can be recovered by a suitable operation on the reduced text without loss of time. In fact, the inverse operation also operates on only the (N-1) preceding symbols of the reduced text together with the present output.

The above proof that the frequencies of output symbols with an N-1 gram predictor satisfy the inequalities

\sum_{i=1}^{S} q_i^{N+1} \ge \sum_{i=1}^{S} q_i^N,    S = 1, 2, ..., 27    (14)

can be applied to any transducer having the two properties listed above. In fact we can imagine again an array with the various (N-1) grams listed vertically and the present input letter horizontally. Since the present output is a function of only these quantities there will be a definite output symbol which may be entered at the corresponding intersection of row and column. Furthermore, the instantaneous reversibility requires that no two entries in the same row be the same. Otherwise, there would be ambiguity between the two or more possible present input letters when reversing the translation. The total probability of the S most probable symbols in the output,

say \sum_{i=1}^{S} r_i, will be the sum of the probabilities for S entries in each row, summed over the rows, and consequently is certainly not greater than the sum of the S largest entries in each row. Thus we will have

\sum_{i=1}^{S} q_i^N \ge \sum_{i=1}^{S} r_i,    S = 1, 2, ..., 27.    (15)

In other words, ideal prediction as defined above enjoys a preferred position among all translating operations that may be applied to a language and which satisfy the two properties above. Roughly speaking, ideal prediction collapses the probabilities of various symbols to a small group more than any other translating operation involving the same number of letters which is instantaneously reversible.

Sets of numbers satisfying the inequalities (15) have been studied by Muirhead in connection with the theory of algebraic inequalities.⁵ If (15) holds when the q_i^N and r_i are arranged in decreasing order of magnitude, and


also \sum_{i=1}^{27} q_i^N = \sum_{i=1}^{27} r_i (this is true here since the total probability in each case is 1), then the first set, q_i^N, is said to majorize the second set, r_i. It is known that the majorizing property is equivalent to either of the following properties:

1. The r_i can be obtained from the q_i^N by a finite series of "flows." By a flow is understood a transfer of probability from a larger q to a smaller one, as heat flows from hotter to cooler bodies but not in the reverse direction.

2. The r_i can be obtained from the q_i^N by a generalized "averaging" operation. There exists a set of non-negative real numbers, a_{ij}, with \sum_j a_{ij} = \sum_i a_{ij} = 1 and such that

r_i = \sum_j a_{ij} q_j^N.    (16)
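A brief numerical check of the flow property (an editorial illustration with arbitrary numbers): moving probability from a larger q to a smaller one leaves every partial sum no larger than before, so the original set majorizes the result.

# Check that a single equalizing "flow" produces a set majorized by the original.
q = [0.60, 0.25, 0.10, 0.05]
r = [0.50, 0.35, 0.10, 0.05]         # a flow of 0.10 from q_1 to q_2
assert all(sum(q[:s]) >= sum(r[:s]) for s in range(1, len(q) + 1))
assert abs(sum(q) - sum(r)) < 1e-12  # the total probability is unchanged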


5. ENTROPY BOUNDS FROM PREDICTION FREQUENCIES

If we know the frequencies of symbols in the reduced text with the ideal N-gram predictor, q_i^N, it is possible to set both upper and lower bounds to the N-gram entropy, F_N, of the original language. These bounds are as follows:

\sum_{i=1}^{27} i (q_i^N - q_{i+1}^N) \log_2 i \le F_N \le -\sum_{i=1}^{27} q_i^N \log_2 q_i^N.    (17)

The upper bound follows immediately from the fact that the maximum possible entropy in a language with letter frequencies q_i^N is -\sum q_i^N \log_2 q_i^N. Thus the entropy per symbol of the reduced text is not greater than this. The N-gram entropy of the reduced text is equal to that for the original language, as may be seen by an inspection of the definition (1) of F_N. The sums involved will contain precisely the same terms although, perhaps, in a different order. This upper bound is clearly valid, whether or not the prediction is ideal.

The lower bound is more difficult to establish. It is necessary to show that with any selection of N-gram probabilities p(i_1, i_2, ..., i_N), we will have

\sum_{i=1}^{27} i (q_i^N - q_{i+1}^N) \log_2 i \le F_N.    (18)

The left-hand member of the inequality can be interpreted as follows: Imagine the q_i^N arranged as a sequence of lines of decreasing height (Fig. 3). The actual q_i^N can be considered as the sum of a set of rectangular distributions as shown. The left member of (18) is the entropy of this set of distributions. Thus, the ith rectangular distribution has a total probability of


i (q_i^N - q_{i+1}^N). The entropy of the distribution is \log_2 i. The total entropy is then

\sum_{i=1}^{27} i (q_i^N - q_{i+1}^N) \log_2 i.

The problem, then, is to show that any system of probabilities p(i_1, ..., i_N), with best prediction frequencies q_i, has an entropy F_N greater than or equal to that of this rectangular system, derived from the same set of q_i.

Fig. 3 - Rectangular decomposition of a monotonic distribution.

The q_i, as we have said, are obtained from the p(i_1, ..., i_N) by arranging each row of the table in decreasing order of magnitude and adding vertically. Thus the q_i are the sum of a set of monotonic decreasing distributions. Replace each of these distributions by its rectangular decomposition. Each one is replaced then (in general) by 27 rectangular distributions; the q_i are the sum of 27 x 27^N rectangular distributions, of from 1 to 27 elements, and all starting at the left column. The entropy for this set is less than or equal to that of the original set of distributions, since a termwise addition of two or

more distributions always increases entropy. This is actually an application


of the general theorem that H_y(x) ≤ H(x) for any chance variables x and y. The equality holds only if the distributions being added are proportional. Now we may add the different components of the same width without changing the entropy (since in this case the distributions are proportional). The result is that we have arrived at the rectangular decomposition of the q_i by a series of processes which decrease or leave constant the entropy, starting with the original N-gram probabilities. Consequently the entropy of the original system F_N is greater than or equal to that of the rectangular decomposition of the q_i. This proves the desired result.

It will be noted that the lower bound is definitely less than F_N unless each row of the table has a rectangular distribution. This requires that for each


possible (N-1) gram there is a set of possible next letters each with equal probability, while all other next letters have zero probability.

It will now be shown that the upper and lower bounds for F_N given by (17) are monotonic decreasing functions of N. This is true of the upper bound since the q_i^{N+1} majorize the q_i^N and any equalizing flow in a set of probabilities increases the entropy. To prove that the lower bound is also monotonic decreasing we will show that the quantity

U = \sum_i i (q_i - q_{i+1}) \log_2 i    (20)

is increased by an equalizing flow among the q_i. Suppose a flow occurs from q_i to q_{i+1}, the first decreased by Δq and the latter increased by the same amount. Then three terms in the sum change and the change in U is given by

ΔU = [-(i - 1) \log_2 (i - 1) + 2 i \log_2 i - (i + 1) \log_2 (i + 1)] Δq.    (21)

The term in brackets has the form -f(x - 1) + 2 f(x) - f(x + 1) where f(x) = x \log_2 x. Now f(x) is a function which is concave upward for positive x, since f''(x) = 1/x > 0. The bracketed term is twice the difference between the ordinate of the curve at x = i and the ordinate of the midpoint of the chord joining i - 1 and i + 1, and consequently is negative. Since Δq also is negative, the change in U brought about by the flow is positive. An even simpler calculation shows that this is also true for a flow from q_1 to q_2 or from q_26 to q_27 (where only two terms of the sum are affected). It follows that the lower bound based on the N-gram prediction frequencies q_i^N is greater than or equal to that calculated from the N + 1 gram frequencies q_i^{N+1}.
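The sign of the bracketed term can be verified directly (an editorial check, not part of the paper): with f(x) = x log_2 x the combination -f(i-1) + 2f(i) - f(i+1) is negative for every i from 2 to 27.

# Numerical check of the concavity argument behind (21).
from math import log2

def f(x):
    return x * log2(x) if x > 0 else 0.0

assert all(-f(i - 1) + 2 * f(i) - f(i + 1) < 0 for i in range(2, 28))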


6. EXPERIMENTAL BOUNDS FOR ENGLISH

Working from the data of Table I, the upper and lower bounds were calculated from relations (17). The data were first smoothed somewhat to overcome the worst sampling fluctuations. The low numbers in this table are the least reliable and these were averaged together in groups. Thus, in column 4, the 47, 18 and 14 were not changed but the remaining group totaling 21 was divided uniformly over the rows from 4 to 20. The upper and lower bounds given by (17) were then calculated for each column giving the following results:


Column   1     2     3    4    5    6    7    8    9    10   11   12   13   14   15   100
Upper    4.03  3.42  3.0  2.6  2.7  2.2  2.8  1.8  1.9  2.1  2.2  2.3  2.1  1.7  2.1  1.3
Lower    3.19  2.50  2.1  1.7  1.7  1.3  1.8  1.0  1.0  1.0  1.3  1.3  1.2  .9   1.2  .6
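As an editorial illustration, the two bounds of (17) can be evaluated in a few lines of Python. The q values below are the first column of Table I (no letters known, i.e. the 27 letter-and-space probabilities in decreasing order); because of rounding, and because the experimental columns were smoothed as described above, the printed bounds are reproduced only approximately.

# Sketch of relations (17): upper and lower bounds on F_N from the rank
# frequencies q_1 >= q_2 >= ... >= q_27 of the reduced text.
from math import log2

q = [.182, .107, .086, .067, .065, .058, .056, .052, .050, .043, .031, .028,
     .024, .023, .021, .020, .016, .016, .016, .013, .012, .008, .003, .001,
     .001, .001, .001]

upper = -sum(p * log2(p) for p in q if p > 0)
lower = sum(i * (q[i - 1] - (q[i] if i < len(q) else 0.0)) * log2(i)
            for i in range(1, len(q) + 1))
print(round(upper, 2), round(lower, 2))   # compare with column 1 above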


It is evident that there is still considerable sampling error in these figures due to identifying the observed sample frequencies with the prediction probabilities. It must also be remembered that the lower bound was proved only for the ideal predictor, while the frequencies used here are from human prediction. Some rough calculations, however, indicate that the discrepancy between the actual F_N and the lower bound with ideal prediction (due to the failure to have rectangular distributions of conditional probability) more than compensates for the failure of human subjects to predict in the ideal manner. Thus we feel reasonably confident of both bounds apart from sampling errors. The values given above are plotted against N in Fig. 4.

Fig. 4 - Upper and lower experimental bounds for the entropy of 27-letter English.

ACKNOWLEDGMENT

The writer is indebted to Mrs. Mary E. Shannon and to Dr. B. M. Oliver for help with the experimental work and for a number of suggestions and criticisms concerning the theoretical aspects of this paper.
