NOVEMBER 1984 VOL. 9, NO. 12
$3.50 IN UNITED STATES / $4.25 IN CANADA / £2.10 IN U.K.
A MCGRAW-HILL PUBLICATION 0360-5280
THE SMALL SYSTEMS JOURNAL
Hugh Kenner and Joseph O'Rourke teach English and computer science, respectively, at The Johns Hopkins University, Baltimore, MD 21218.

English letter-combination frequencies can be used to generate random text that mimics the frequencies found in a sample. Though nonsensical, these pseudo-texts have a haunting plausibility, preserving as they do many recognizable mannerisms of the texts from which they are derived. For example, the following text was generated by the first sentences of this article:

English letter-combination frequencies from text was generived. For example. Though nonsentencies from text was the text was generated to generisms of that mimics the first sentencies from text the texts have a have a sample, they article:

The nature of such texts has been little explored, in part because it's been difficult to get samples. Claude Shannon generated "approximations to English" by hand in 1948, but the laborious calculation it involved prevented extensive study. This is clearly a task for a computer, but programs have been hampered by the need for impractical amounts of memory. We offer a Pascal program, Travesty, to fabricate pseudo-text quickly from any input text. Students of style and linguistics will see possibilities. So may programmers, since Travesty contains a feature that can greatly speed up general pattern-matching procedures. We add a special-case version that is
NOVEMBER 1984 BYTE 129
(Illustrated by Joan Hall, with apologies to James Joyce and Henry James.)
even speedier. To make clear what Travesty does, we'll first discuss language statistics and what they imply.

LANGUAGE STATISTICS

Finish typing a page of English prose, and the key you hit most often will have been the space bar. Either "e" or "t" will rank second. You did not make those decisions, the language did. In fact, the language makes three-quarters of your writing decisions for you. Not only do the letters observe preferred frequencies, they keep preferred company. A familiar example: write "Q", and (unless you are drafting a QANTAS ad or some comments on Iraq) the next character is almost sure to be "u". If probability coerces the successor to a single letter, what follows a letter pair is even more tightly bound. Write "th", and the probability is very high that what follows will be "e". If it is, then the character after "e" is most likely to be either a space or an "r". Pairs like "th" are called digrams; triplets like "the" are trigrams. They have frequencies, like letters. The most common English digram is "he"; you will find it three times in the sentence you are reading now, 15 times in this paragraph. And you will guess correctly that as we move up from single letters to digrams and trigrams, the probabilities that govern the next character grow ever more rigorous. By the time we've reached, say, pentagrams, has the author any choice at all?
Yes, he has; otherwise Henry James could have had no way to be Henry James, or James Joyce to be James Joyce. At a fairly low level, the statistics of English would have taken over from both of them, and neither would have been distinguishable from The New York Times.

But that is not what happens. True, even with a James or a Joyce holding the pen, the statistics do not lie dormant. However, they no longer derive from the undifferentiated language, i.e., from a large sample of everything we can find. The significant statistics derive from the personal habits of James, or Joyce, or Jack London, or J. D. Salinger. Each of these writers, amazingly, had his own way with trigrams, tetragrams, pentagrams, matters to which he surely gave no thought.
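The letter, digram, and successor frequencies discussed above are easy to tabulate. The following is a minimal sketch in Python (not the authors' Pascal); the function names and the short sample sentence are ours, chosen only for illustration.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def successor_counts(text, context):
    """Count the characters that follow each occurrence of context."""
    k = len(context)
    return Counter(
        text[i + k]
        for i in range(len(text) - k)
        if text[i:i + k] == context
    )

# Illustrative sample only; any text will do.
sample = "the theory is that the other three authors thought otherwise"

letters = ngram_counts(sample, 1)          # single-character frequencies
digrams = ngram_counts(sample, 2)          # pair frequencies
after_th = successor_counts(sample, "th")  # what follows "th"

print(letters.most_common(3))
print(digrams.most_common(3))
print(after_th.most_common())
```

Run on a longer sample, the same tables show how sharply the choice of next character narrows as the context grows from one letter to two or three.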
This line of reasoning brings us to the unexpected fact that essentially random nonsense can preserve many "personal" characteristics of a source text. Travesty (listing 1), a program suitable for small systems, will scan a sample text and generate, from the sample's n-gram statistics, a "nonsense" imitation through which the original text, and even its authorship, is disconcertingly recognizable.

For example, we provided Travesty with 29 names of towns taken from a gazetteer of England and called for third-order (trigram) analysis. It promptly churned out a couple thousand characters. These letter groups included (1) many input words regurgitated; (2) some uninteresting letter strings that we agreed to call "garbage" (on the principle that a weed is a flower you don't want); and (3) some wondrously plausible names for English towns that don't exist but ought to. They included Bambudge, Nettlewett, Gidge, Hample, Bognorton, Chire, Clop, Tootinton, Bleweth, and Eastle. (If any of these is a real name, that's by accident; none was on our input list.) And fancy being Mayor of Clop!

The connection of the output to the source can be stated exactly: for an order-n scan, every n-character sequence in the output occurs somewhere in the input, and at about the same frequency. That is all, yet it is enough to account for an eerie similarity. Every string of three letters in our pseudo-place-names, "ttl" or "dge", for instance, was lifted out of a string of characters and spaces that consisted simply of the 29 input words typed one after another with one space after each.

Figure 1 shows one of the thousands of machine-generated derivations Travesty can extract from a 75-word sample of James Joyce's Ulysses. This passage is an order-4 scan; every four-character sequence in the output comes from somewhere in the input.

FREQUENCY ARRAYS
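The order-n property stated above (every n-character sequence in the output occurs somewhere in the input) can be illustrated with a short Python sketch. It is not the Travesty listing: it keeps a dictionary of successors rather than scanning arrays as a memory-conscious Pascal program would, and it assumes the sample is treated as circular so that no context ever lacks a successor. The function name, parameters, and seed are ours.

```python
import random

def travesty(text, order, length, seed=None):
    """Return `length` characters whose every `order`-character window
    also occurs in `text` (with the sample treated as circular)."""
    if order < 2 or len(text) < order:
        raise ValueError("need order >= 2 and a sample at least that long")
    rng = random.Random(seed)
    context_len = order - 1

    # Assumption: wrap the sample around so every context has a successor.
    wrapped = text + text[:context_len]

    # Map each (order-1)-character context to the characters that follow it.
    # Repeats are kept, so frequent successors are chosen more often.
    successors = {}
    for i in range(len(text)):
        context = wrapped[i:i + context_len]
        successors.setdefault(context, []).append(wrapped[i + context_len])

    # Begin with the opening characters of the sample, then extend.
    out = list(text[:context_len])
    while len(out) < length:
        context = "".join(out[-context_len:])
        out.append(rng.choice(successors[context]))
    return "".join(out[:length])

sample = ("English letter-combination frequencies can be used to generate "
          "random text that mimics the frequencies found in a sample. ")
print(travesty(sample, 4, 120, seed=1984))
```

Because duplicates are kept in each successor list, a character that follows a given context twice as often in the sample is twice as likely to be chosen, which is what keeps the output frequencies close to those of the input.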
There is a lot of fun to be had here. There is also much for the student of language and literature to investigate. To what degree can personal "style" be described as a manifestation of letter frequencies? Such a question, though not new, was merely tantalizing before the modern computer; even more so before procedures were discovered, quite recently, that didn't demand impossible amounts of machine memory.

Brian P. Hayes, associate editor of (continued)

REMARKS ON THE TRAVESTY LISTING

Pascal input/output (I/O) conventions are, to say the least, poorly standardized. We have three Pascal systems available (Turbo Pascal for CP/M and MS-DOS, Lucidata Pascal for CP/M and HDOS, and Berkeley Pascal running under UNIX), and we haven't been able to write a version of Travesty that will run on all three unmodified. Judging that Turbo is the rising young comer, we list the Turbo Pascal version, with notes on such problem areas as we know about. This version might run on UCSD Pascal too, but we've not been able to try it. Since it avoids features unique to Turbo and UCSD, it ought to be transportable to any decent Pascal system at the cost of a little attention to input and output.

Line numbers are, of course, for reference only: don't type them into your Pascal listing.

23 This value is safe and may even be increased, but remember that you'll have two arrays this size. How big you can make ArraySize depends on your system's memory requirements. Turbo Pascal, when compiled to disk to get the compiler itself out of the way, permitted ArraySize = 14,000 on a 64K-byte CP/M system. That's about 2300 words of input text. On an MS-DOS system with 196K bytes, maximum ArraySize increased to 21,000, or 3500 words of text, independent of whether compilation was to memory or to disk.

33 If your Pascal doesn't know about the TEXT type, change this line to f : file of char.

40 If your Pascal system has a RANDOM function, you can drop lines 40 to 44 altogether. Then change line 239 to read toss := random(total) + 1;. You should also delete lines 38, 52, and 53.

49 Many versions of Pascal don't recognize STRING types unless they have been declared:
Type STRING = PACKED ARRAY[1 .. 12] OF CHAR;
Then change line 49 to InFile : STRING.

62 Some Pascals will require you to declare a variable i and say, FOR i := 1 TO 12 DO READ(InFile[i]);.

63 Berkeley Pascal doesn't use the ASSIGN command. You'd omit this line and change line 64 to reset (f, infile);.

Also, you will probably want output to a disk file, and you'll have to set that up yourself. Add a second TEXT variable, g, to line 33 and a second STRING variable, OutFile, to line 49. Then insert after line 64 a request for the name of the OutFile, and ASSIGN it to g in whatever way your system provides. And if your system requires files to be explicitly closed, add a statement line, CLOSE (g), just before the final END. (Don't forget the semicolon at the end of the line above it.)

NOTES ON HELLBAT

To change Travesty into Hellbat, procedures InitSkip and Match are replaced by the versions given in listing 2, and numerous lines are deleted as shown below. Note that WriteCharacter now receives its characters from Match and has only formatting duties to perform. If your Pascal has its own RANDOM function, make the deletions listed in the section on Travesty for line 40; the major change, applied above to the WriteCharacter procedure, should instead be made to the line in the new Match procedure that invokes Random. Lines to delete for Hellbat include 28, 72 to 80, 269, 273 (all references to FreqArray), and 232 to 245 (process for getting a character).
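The note on line 40 mentions a toss computed as random(total) + 1, and the Hellbat notes mention a frequency array, but the listing itself is not reproduced here. The Python sketch below is therefore only a guess at how such a toss might select a character: it assumes the toss is an integer from 1 to the total count, and that the table of counts is walked until the running sum reaches it. The function names and the example counts are hypothetical.

```python
import random

def pick_by_toss(freq, toss):
    """Return the character whose cumulative count first reaches toss.

    freq maps characters to counts; toss is an integer from 1 to the total.
    """
    total = sum(freq.values())
    if not 1 <= toss <= total:
        raise ValueError("toss must lie between 1 and the total count")
    for ch in sorted(freq):      # fixed order, like walking an array
        toss -= freq[ch]
        if toss <= 0:
            return ch

def random_successor(freq, rng=random):
    """Choose a character with probability proportional to its count."""
    total = sum(freq.values())
    toss = rng.randrange(total) + 1   # assumed analogue of random(total) + 1
    return pick_by_toss(freq, toss)

# Hypothetical successor counts for some context.
counts = {"e": 5, "a": 1, "o": 2, "r": 1}
print(random_successor(counts))
```

Each character owns as many toss values as its count, so a character seen five times out of nine is returned for five of the nine possible tosses.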
Listing 1: Travesty, a program for generating pseu