Transcript
Page 1: Applied Combinatorics on Words - Assetsassets.cambridge.org/97805218/48022/frontmatter/9780521848022... · Applied Combinatorics on Words ... Chapter 5 Inference of Network Expressions

ENCYCLOPEDIA OF MATHEMATICS AND ITS APPLICATIONS

Editorial Board

P. Flajolet, M. Ismail, E. Lutwak

Volume 104

Applied Combinatorics on Words

© Cambridge University Press www.cambridge.org

Cambridge University Press0521848024 - Applied Combinatorics on WordsM. LothaireFrontmatterMore information


ENCYCLOPEDIA OF MATHEMATICS AND ITS APPLICATIONS

All the titles listed below can be obtained from good booksellers or from Cambridge University Press. For a complete series listing visit http://publishing.cambridge.org/stm/mathematics/eom/

60 Jan Krajicek Bounded arithmetic, propositional logic, and complexity theory
61 H. Gromer Geometric applications of Fourier series and spherical harmonics
62 H. O. Fattorini Infinite dimensional optimization and control theory
63 A. C. Thompson Minkowski geometry
64 R. B. Bapat and T. E. S. Raghavan Nonnegative matrices and applications
65 K. Engel Sperner theory
66 D. Cvetkovic, P. Rowlinson and S. Simic Eigenspaces of graphs
67 F. Bergeron, G. Labelle and P. Leroux Combinatorial species and tree-like structures
68 R. Goodman and N. Wallach Representations of the classical groups
69 T. Beth, D. Jungnickel and H. Lenz Design theory volume I 2 ed.
70 A. Pietsch and J. Wenzel Orthonormal systems and Banach space geometry
71 George E. Andrews, Richard Askey and Ranjan Roy Special functions
72 R. Ticciati Quantum field theory for mathematicians
76 A. A. Ivanov Geometry of sporadic groups I
78 T. Beth, D. Jungnickel and H. Lenz Design theory volume II 2 ed.
80 O. Stormark Lie’s structural approach to PDE systems
81 C. F. Dunkl and Y. Xu Orthogonal polynomials of several variables
82 J. Mayberry The foundations of mathematics in the theory of sets
83 C. Foias, R. Temam, O. Manley and R. Martins da Silva Rosa Navier–Stokes equations and turbulence
84 B. Polster and G. Steinke Geometries on surfaces
85 D. Kaminski and R. B. Paris Asymptotics and Mellin–Barnes integrals
86 Robert J. McEliece The theory of information and coding 2 ed.
87 Bruce A. Magurn An algebraic introduction to K-theory
88 Teo Mora Solving polynomial equation systems I
89 Klaus Bichteler Stochastic integration with jumps
90 M. Lothaire Algebraic combinatorics on words
91 A. A. Ivanov & S. V. Shpectorov Geometry of sporadic groups 2
92 Peter McMullen & Egon Schulte Abstract regular polytopes
93 G. Gierz et al. Continuous lattices and domains
94 Steven R. Finch Mathematical constants
95 Youssof Jabri The mountain pass theorem
96 George Gasper & Mizan Rahman Basic hypergeometric series 2 ed.
97 Maria Cristina Pedicchio & Walter Tholen Categorical foundations


ENCYCLOPEDIA OF MATHEMATICS AND ITS APPLICATIONS

Applied Combinatorics on Words

M. LOTHAIRE


CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

Cambridge University Press
The Edinburgh Building, Cambridge CB2 2RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521848022

© Cambridge University Press 2005

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2005

Printed in the United Kingdom at the University Press, Cambridge

A catalogue record for this book is available from the British Library

Library of Congress Cataloguing in Publication data

ISBN-13 978-0-521-84802-2 hardback
ISBN-10 0-521-84802-4 hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this book, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.


Contents

Preface

Chapter 1 Algorithms on Words

1.0 Introduction
1.1 Words
1.2 Elementary algorithms
1.3 Tries and automata
1.4 Pattern matching
1.5 Transducers
1.6 Parsing
1.7 Word enumeration
1.8 Probability distributions on words
1.9 Statistics on words
Problems
Notes

Chapter 2 Structures for Indexes

2.0 Introduction
2.1 Suffix trie
2.2 Suffix tree
2.3 Contexts of factors
2.4 Suffix automaton
2.5 Compact suffix automaton
2.6 Indexes
2.7 Finding regularities
2.8 Pattern matching machine
Problems
Notes


Chapter 3 Symbolic Natural Language Processing

3.0 Introduction
3.1 From letters to words
3.2 From words to sentences
Notes

Chapter 4 Statistical Natural Language Processing

4.0 Introduction
4.1 Preliminaries
4.2 Algorithms
4.3 Application to speech recognition
Notes

Chapter 5 Inference of Network Expressions

5.0 Introduction
5.1 Inferring simple network expressions: models and related problems
5.2 Algorithms
5.3 Inferring network expressions with spacers
5.4 Related issues
5.5 Open problems
Notes

Chapter 6 Statistics on Words with Applications to Biological Sequences

6.0 Introduction
6.1 Probabilistic models for biological sequences
6.2 Overlapping and nonoverlapping occurrences
6.3 Word locations along a sequence
6.4 Word count distribution
6.5 Renewal count distribution
6.6 Occurrences and counts of multiple patterns
6.7 Some applications to DNA sequences
6.8 Some probabilistic and statistical tools
Notes

Chapter 7 Analytic Approach to Pattern Matching

7.0 Introduction
7.1 Probabilistic models


7.2 Exact string matching
7.3 Generalized string matching
7.4 Subsequence pattern matching
7.5 Generalized subsequence problem
7.6 Self-repetitive pattern matching
Problems
Notes

Chapter 8 Periodic Structures in Words

8.0 Introduction
8.1 Definitions and preliminary results
8.2 Counting maximal repetitions
8.3 Basic algorithmic tools
8.4 Finding all maximal repetitions in a word
8.5 Finding quasi-squares in two words
8.6 Finding repeats with a fixed gap
8.7 Computing local periods of a word
8.8 Finding approximate repetitions
Notes

Chapter 9 Counting, Coding, and Sampling with Words

9.0 Introduction
9.1 Counting: walks in sectors of the plane
9.2 Sampling: polygons, animals, and polyominoes
9.3 Coding: trees and maps
Problems
Notes

Chapter 10 Words in Number Theory

10.0 Introduction
10.1 Morphic and automatic sequences: definitions and generalities
10.2 d-Kernels and properties of automatic sequences
10.3 Christol’s algebraic characterization of automatic sequences
10.4 An application to transcendence in positive characteristic
10.5 An application to transcendental power series over the rationals
10.6 An application to transcendence of real numbers


10.7 The Tribonacci word
10.8 The Rauzy fractal
10.9 An application to simultaneous approximation
Problems
Notes

References

General Index


Preface

A series of important applications of combinatorics on words has emerged with the development of computerized text and string processing, especially in biology and in linguistics. The aim of this volume is to present, in a unified treatment, some of the major fields of applications. The main topics that are covered in this book are

1. Algorithms for manipulating text, such as string searching, pattern matching, and testing a word for special properties.

2. Efficient data structures for retrieving information on large indexes, including suffix trees and suffix automata.

3. Combinatorial, probabilistic, and statistical properties of patterns in finite words, and more general patterns, under various assumptions on the sources of the text.

4. Inference of regular expressions.

5. Algorithms for repetitions in strings, such as maximal runs or tandem repeats.

6. Linguistic text processing, especially analysis of the syntactic and semantic structure of natural language. Applications to language processing with large dictionaries.

7. Enumeration, generation, and sampling of complex combinatorial structures by their encodings in words.

This book is actually the third of a series of books on combinatorics on words. Lothaire’s “Combinatorics on Words” appeared in its first printing in 1984 as Volume 17 of the Encyclopedia of Mathematics. It was based on the impulse of M. P. Schützenberger’s scientific work. Since then, the theory has developed into a large scientific domain. The book was reprinted in 1997 in the Cambridge Mathematical Library. Lothaire is a nom de plume for a group of authors initially consisting of former students of Schützenberger. Over the years, it has grown into a broader community coordinated by the editors. A second volume of Lothaire’s series, entitled “Algebraic Combinatorics on Words”, appeared in 2002. It contains both complements and new developments that have emerged since the publication of the first volume.


The content of this volume is quite applied, in comparison with the two previous ones. However, we have tried to follow the same spirit, namely to present introductory expositions, with full descriptions and numerous examples. Refinements are frequently deferred to problems, or mentioned in Notes. There is presently no similar book that covers these topics in this way.

Although each chapter has a different author, the book is really a cooperative work. A set of common notation has been agreed upon. Algorithms are presented in a consistent way using transparent conventions. There is also a common general index, and a common list of bibliographic references.

This book is independent of Lothaire’s other books, in the sense that no knowledge of the other volumes is assumed.

The book has been written with the objective of being readable by a large audience. The prerequisites are those of a general scientific education. Some chapters may require a more advanced preparation. A graduate student in science or engineering should have no difficulty in reading all the chapters. A student in linguistics should be able to read part of it with profit and interest.

Outline of contents.

The general organization is shown in Figure 0.1 and is described as follows.

The first two chapters are devoted to core algorithms. The first, “Algorithms on Words”, is quite general, and is used in all other chapters. The second chapter, “Structures for Indexes”, is fundamental for all advanced algorithmic treatment, and more technical.

Among the applications, a first domain is linguistics, represented by two chapters entitled “Symbolic Natural Language Processing” and “Statistical Natural Language Processing”.

A second application is biology. This is covered by two chapters, entitled “Inference of Network Expressions” and “Statistics on Words with Applications to Biological Sequences”.

The next block is composed of two chapters dealing with algorithmics, a subject which is of interest on its own in theoretical computer science, but is also related to biology and linguistics. One chapter is entitled “Analytic Approach to Pattern Matching” and deals with generalized pattern matching algorithms. A chapter entitled “Periodic Structures in Words” describes algorithms used for discovering and enumerating repetitions in words.

A final block is devoted to applications to mathematics (and theoretical physics). It is represented by two chapters. The first chapter, entitled “Counting, Coding, and Sampling with Words”, deals with the use of words for coding combinatorial structures. Another chapter, entitled “Words in Number Theory”, deals with transcendence, fractals, and dynamical systems.

Core algorithms: Algorithms on words; Structures for indexes
Mathematics: Counting, coding, and sampling; Words in number theory
Algorithmics: Analytic approach to pattern matching; Periodic structures in words
Bioinformatics: Inference of network expressions; Statistics on words with applications
Natural languages: Symbolic language processing; Statistical language processing

Figure 0.1. Overall structure of “Applied Combinatorics on Words”.

Description of contents.

Basic algorithms, as needed later, and notation are given in Chapter 1, “Algorithms on Words”, written by Jean Berstel and Dominique Perrin. This chapter also contains basic concepts on automata, grammars, and parsing. It ends with an exposition of probability distributions on words. The concepts and methods introduced are used in all the other chapters.

Chapter 2, entitled “Structures for Indexes” and written by Maxime Crochemore, presents data structures for the compact representation of the suffixes of a text. These are used in several subsequent chapters. Compact suffix trees are presented, and construction of these trees in linear time is carefully described. The theory and algorithmics for suffix automata are presented next. The main application, namely the construction of indexes, is described next. Many other applications are given, such as detection of repetitions or forbidden words in a text, use as a pattern matching machine, and search for conjugates.

The first domain of applications, linguistics, is represented by Chapters 3 and 4. Chapter 3, entitled “Symbolic Natural Language Processing”, is written by Eric Laporte. In language processing, a text or a discourse is a sequence of sentences; a sentence is a sequence of words; a word is a sequence of letters. The most universal levels are those of sentence, word, and letter (or phoneme), but intermediate levels exist, and can be crucial in some languages, between word and letter: a level of morphological elements (e.g. suffixes), and the level of syllables. The discovery of this piling up of levels, and in particular of word level and phoneme level, delighted structuralist linguists in the twentieth century. They termed this inherent, universal feature of human language “double articulation”.

This chapter is organized around the main levels of any language modelling: first, how words are made from letters; second, how sentences are made from words. It surveys the basic operations of interest for language processing, and for each type of operation, it examines the formal notions and tools involved. The main originality of this presentation is the systematic and consistent use of finite state automata at every level of the description. This point of view is reflected in some practical implementations of natural language processing systems.

Chapter 4, entitled “Statistical Natural Language Processing”, is written by Mehryar Mohri. It presents the use of statistical methods in natural language processing. The main tool developed is the notion of weighted transducers. The weights are numbers in some semiring that can represent probabilities. Applications to speech processing are discussed.

The block of applications to biology is concerned with analysis of word occurrences, pattern matching, and connections with genome analysis. It is covered by the next two chapters, and to some extent also by the algorithmics block.

Chapter 5, “Inference of Network Expressions”, is written by Nadia Pisanti and Marie-France Sagot. This chapter introduces various mathematical models and algorithms for inferring regular expressions without Kleene star that appear repeated in a word or are common to a set of words. Inferring a network expression means discovering such expressions, which are initially unknown, from the word(s) where the repeated (or common) expressions will be sought. This is in contrast to the string searching problem considered in other chapters. It has many applications, notably in molecular biology, system security, text mining, etc. Because of the richness of the mathematical and algorithmic problems posed by molecular biology, we concentrate on applications in this area. Applications to biology motivate us also to consider network expressions that appear repeated not exactly but approximately.

Chapter 6 is written by Gesine Reinert, Sophie Schbath and Michael Waterman, and entitled “Statistics on Words with Applications to Biological Sequences”. Properties of words in sequences have been of considerable interest in many fields, such as coding theory and reliability theory, and most recently in the analysis of biological sequences. The latter will serve as the key example in this chapter.

Two main aspects of word occurrences in biological sequences are: where do they occur and how many times do they occur? An important problem, for instance, was to determine the statistical significance of a word frequency in a DNA sequence. The naive idea is the following: a word may be significantly rare in a DNA sequence because it disrupts replication or gene expression (perhaps a negative selection factor), whereas a significantly frequent word may have a fundamental activity with regard to genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are certain biological palindromes corresponding to restriction sites avoided for instance in E. coli, and the Cross-over Hotspot Instigator sites in several bacteria.
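The "how many times" question already depends on a convention: whether occurrences of a word are allowed to overlap. The following is a minimal Python sketch of the two counts, assuming a plain string as the sequence; the function names and the sample strings are illustrative and are not taken from the book.

```python
def count_overlapping(text, word):
    """Count every start position at which `word` occurs in `text`."""
    if not word:
        return 0
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        # Restart one position later so overlapping matches are kept.
        start = text.find(word, start + 1)
    return count


def count_nonoverlapping(text, word):
    """Count occurrences left to right, skipping past each match."""
    if not word:
        return 0
    # str.count counts non-overlapping occurrences.
    return text.count(word)


print(count_overlapping("GATATAT", "ATAT"))     # 2 (positions 1 and 3)
print(count_nonoverlapping("GATATAT", "ATAT"))  # 1
```

For a self-overlapping word such as ATAT the two counts differ, which is one reason the two notions need to be treated separately.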

Statistical methods for studying the distribution of the word locations along a sequence and word frequencies have also been an active field of research; the goal of this chapter is to provide an overview of the state of this research.

Because DNA sequences are long, asymptotic distributions were proposed first. Exact distributions exist now, motivated by the analysis of genes and protein sequences. Unfortunately, exact results are not suited in practice to long sequences because of heavy numerical calculation, but they allow the user to assess the quality of the stochastic approximations when no approximation error can be provided. For example, BLAST is probably the best-known algorithm for DNA matching, and it relies on a Poisson approximation. This is another motivation for the statistical analysis given in this chapter.

The algorithmics block is composed of two chapters. In Chapter 7, entitled “Analytic Approach to Pattern Matching”, and written by Philippe Jacquet and Wojciech Szpankowski, pattern matching is considered for various types of patterns, and for various types of sources. Single patterns, sequences of patterns, and sequences of patterns with separation conditions are considered. The sources are Bernoulli and Markov, and also more general sources arising from dynamical systems. The derivation of the equations is heavily based on combinatorics on words and formal languages.

Chapter 8, written by Roman Kolpakov and Gregory Koucherov and entitled “Periodic Structures in Words”, deals with the algorithmic problem of detecting, counting, and enumerating repetitions in a word. The interest of this lies in text processing, compression, and genome analysis, where tandem repeats may have a particular significance. Linear time algorithms exist for detecting tandem repeats, but since there may be quadratically many repetitions, maximal repetitions or “runs” are of importance, and are considered in this chapter.
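To make the remark about quadratically many repetitions concrete, here is a small Python sketch that counts occurrences of squares (tandem repeats of the form uu) by brute force. It is a naive illustration only, not one of the linear time algorithms of the chapter, and the function name is an assumption made for this example.

```python
def count_square_occurrences(word):
    """Count pairs (i, p) such that word[i:i+2p] is a square uu with |u| = p."""
    n = len(word)
    count = 0
    for p in range(1, n // 2 + 1):          # p is the length of u
        for i in range(n - 2 * p + 1):      # i is the start of the square
            if word[i:i + p] == word[i + p:i + 2 * p]:
                count += 1
    return count


print(count_square_occurrences("a" * 6))   # 9
print(count_square_occurrences("a" * 20))  # 100
print(count_square_occurrences("abab"))    # 1
```

On the word consisting of n copies of the same letter (n even), the count is (n/2)^2, so it grows quadratically, whereas the whole word forms a single maximal repetition of period 1.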

A final block is concerned with applications to mathematics. Chapter 9, written by Dominique Poulalhon and Gilles Schaeffer, is entitled “Counting, Coding, and Sampling with Words”. Its aim is to give typical descriptions of the interaction of combinatorics on words with the treatments of combinatorial structures. The chapter is focused on three aspects of enumeration: counting elements of a family according to their size, generating them uniformly at random, and coding them as compactly as possible by binary words. These aspects are respectively illustrated on examples taken from classical combinatorics (walks on lattices), from statistical physics (convex polyominoes and directed animals), and from graph algorithmics (planar maps). The rationale of the chapter is that nice enumerative properties are the visible traces of structural properties, and that making the latter explicit in terms of words of simple languages is a way of solving simultaneously and simply the above three problems.

Chapter 10 is written by Jean-Paul Allouche and Valérie Berthé. It is entitled “Words in Number Theory”. This chapter is concerned with the interconnection between combinatorial properties of infinite words, such as repetitions, and transcendental numbers. A second part considers a famous infinite word, called the Tribonacci word, to investigate and illustrate connections between combinatorics on words and dynamical systems, quasicrystals, the Rauzy fractal, rotation on the torus, etc. Relations to the cut and project method are described, and an application to simultaneous approximation is given.
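For readers who have not met the Tribonacci word, the following Python sketch generates a prefix of it. It assumes the usual definition as the fixed point of the morphism a → ab, b → ac, c → a starting from the letter a; this definition is not spelled out in the preface, and the function name is chosen for this example.

```python
def tribonacci_prefix(n):
    """Return the first n letters of the Tribonacci word over {a, b, c}."""
    # Assumed standard definition: iterate the morphism a->ab, b->ac, c->a.
    morphism = {"a": "ab", "b": "ac", "c": "a"}
    word = "a"
    # Each iterate is a prefix of the next one, because the image of "a"
    # starts with "a"; so iterating until the word is long enough is safe.
    while len(word) < n:
        word = "".join(morphism[letter] for letter in word)
    return word[:n]


print(tribonacci_prefix(13))  # abacabaabacab
```

The successive iterates have lengths 1, 2, 4, 7, 13, 24, ..., each being the sum of the three preceding lengths.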

Acknowledgements.

Gesine Reinert, Sophie Schbath and Michael Waterman would like to thank Simon Tavaré for many helpful comments. Thanks go also to Xueying Xie for pointing out inconsistencies in a previous version concerning testing for the order of a Markov chain.

Their work was supported in part by Sandia National Laboratories, operated by Lockheed Martin for the U.S. Department of Energy under contract no. DE-AC04-94AL85000, and by the Mathematics, Information, and Computational Science Program of the Office of Science of the U.S. Department of Energy. Gesine Reinert was supported in part by EPSRC grant aGR/R52183/01. Michael Waterman was partially supported by Celera Genomics.

An earlier and shorter version of Chapter 6 appeared as “Probabilistic and statistical properties of words: an overview” in the Journal of Computational Biology, Vol. 7 (2000), pp. 1–46. The authors thank Mary Ann Liebert, Inc. Publishers for permission to include that material here.

Philippe Jacquet and Wojciech Szpankowski thank J. Bourdon, P. Flajolet, M. Régnier and B. Vallée for collaborating on pattern matching problems, co-authoring papers, and commenting on this chapter. They also thank M. Drmota and J. Fayolle for reading the chapter and providing useful comments.

W. Szpankowski acknowledges NSF and NIH support through grants CCR-0208709 and R01 GM068959-01.

J.-P. Allouche and Valérie Berthé would like to express their gratitude to P. Arnoux, A. Remondiere, D. Jamet and A. Siegel for their careful reading and their numerous suggestions.

Jean Berstel
Dominique Perrin

Marne-la-Vallée, June 23, 2004
