
Contents

Acknowledgments

I Methods

II Topics

1 Enumerative Combinatorics on Words
  Dominique Perrin and Antonio Restivo
  1.1 Introduction
  1.2 Preliminaries
    1.2.1 Generating series
    1.2.2 Automata
  1.3 Conjugacy
    1.3.1 Periods
    1.3.2 Necklaces
    1.3.3 Circular codes
  1.4 Lyndon words
    1.4.1 The Factorization Theorem
    1.4.2 Generating Lyndon words
  1.5 Eulerian graphs and de Bruijn cycles
    1.5.1 The BEST Theorem
    1.5.2 The Matrix-tree Theorem
    1.5.3 Lyndon words and de Bruijn cycles
  1.6 Unavoidable sets
    1.6.1 Algorithms
    1.6.2 Unavoidable sets of constant length
    1.6.3 Conclusion
  1.7 The Burrows-Wheeler Transform
    1.7.1 The inverse transform
    1.7.2 Descents of a permutation
  1.8 The Gessel-Reutenauer bijection
    1.8.1 Gessel-Reutenauer bijection and de Bruijn cycles
  1.9 Suffix arrays
    1.9.1 Suffix arrays and Burrows-Wheeler transform
    1.9.2 Counting suffix arrays

References


Acknowledgments

Thanks if you did anything.


Part I

Methods


Part II

Topics


Chapter 1
Enumerative Combinatorics on Words

Dominique Perrin and Antonio Restivo

Université Paris-Est, Marne-la-Vallée and University of Palermo


1.1 Introduction

Combinatorics on words is a field with deep historical roots that has seen substantial growth. Its roots are to be found in the early results of Axel Thue on square-free words and in the development of combinatorial group theory (see [4] for an introduction to the early developments of combinatorics on words). The present interest in the field is driven by its connections with topics outside pure mathematics, notably bioinformatics.

Enumerative combinatorics on words is itself a branch of enumerative combinatorics, centered on the simplest structure constructor, since words are the same as finite sequences.

In this chapter, we have tried to cover a variety of aspects of enumerative combinatorics on words. We have focused on the problems of enumeration connected with conjugacy classes. This includes many interesting combinatorial aspects of words like Lyndon words and de Bruijn cycles. One of the highlights of the chapter is the connection between these two concepts via the theorem of Fredricksen and Maiorana.

We have put aside some important aspects of enumerative combinatorics on words which would deserve another complete chapter. This includes the enumeration of various families of words subject to a restriction. For example, the enumeration of square-free words is an important problem for which only asymptotic results are known. It is known for example that the number $s_n$ of ternary square-free words of length $n$ satisfies $\lim_{n\to\infty} s_n^{1/n} = 1.302\ldots$ (see [39] or [16]). Other examples of interest include unbordered words or words avoiding more general patterns (on this notion, see [31]).

The chapter is organized as follows.

In Section 1.2, we introduce some basic definitions concerning words used in the sequel. We also introduce basic notions concerning generating series and automata. Both are powerful tools for the enumeration of words.

In Section 1.3, we introduce the notion of conjugacy and the related notions of necklaces and circular codes. These notions play a role in almost all the remaining sections of the chapter. We review some classical formulas such as Witt's Formula or Manning's Formula for the zeta function of a set of words.


8 Handbook of Enumerative Combinatorics

In Section 1.4, we introduce Lyndon words and prove the important Factorization Theorem (Theorem 1.4.1). We also discuss the problem of generating Lyndon words and present algorithms for generating them in alphabetic order.

In Section 1.5, we introduce the notion of de Bruijn cycle and its relation with Eulerian graphs. We prove the so-called BEST Theorem enumerating the spanning trees in an Eulerian graph and apply it to the enumeration of de Bruijn cycles. We finally present the theorem of Fredricksen and Maiorana [17], which beautifully connects Lyndon words and de Bruijn cycles (Theorem 1.5.6).

In Section 1.6, we introduce unavoidable sets. We prove that, on any alphabet, there exist unavoidable sets of words of length $n$ which are a set of representatives of the conjugacy classes of words of length $n$ (Theorem 1.6.1).

In Section 1.7, we introduce a transformation on words known as the Burrows-Wheeler transformation. This transformation is used in text compression. It is closely related to conjugacy.

We show in Section 1.8 that the Burrows-Wheeler transformation is closely related to a well-known bijection on words, the Gessel-Reutenauer bijection. We also prove some results due to Higgins [23] which generalize the theorem of Fredricksen and Maiorana (Theorem 1.8.5).

In Section 1.9, we show that the Burrows-Wheeler transformation is also related to a well-known concept in string processing, the so-called suffix arrays. We end the section with several results due to Schürmann and Stoye [38] concerning the enumeration of suffix arrays.

Acknowledgments The authors wish to thank Nicolas Auger, Maxime Crochemore, Francesco Dolce, Gregory Kucherov, Eduardo Moreno, Giovanna Rosone and Christophe Reutenauer, who have read the manuscript and made corrections. They also thank the referee, who has helped to substantially improve the presentation. The support of ANR project Eqinocs is acknowledged by the first author.

1.2 Preliminaries

We briefly introduce the basic terminology on words. Let $A$ be a finite set, usually called the alphabet. The elements of $A$ are called letters.

A word $w$ on the alphabet $A$ is denoted $w = a_1a_2\cdots a_n$ with $a_i \in A$. The integer $n$ is the length of $w$. We denote as usual by $A^*$ the set of words over $A$ and by $\varepsilon$ the empty word. For a word $w$, we denote by $|w|$ the length of $w$. We use the notation $A^+ = A^* \setminus \{\varepsilon\}$. The set $A^*$ is a monoid. Indeed, the concatenation of words is associative, and the empty word is a neutral element for concatenation. The set $A^+$ is sometimes called the free semigroup over $A$, while $A^*$ is called the free monoid.

A word $w$ is called a factor (resp. a prefix, resp. a suffix) of a word $u$ if there exist words $x,y$ such that $u = xwy$ (resp. $u = wy$, resp. $u = xw$). The factor (resp. the prefix, resp. the suffix) is proper if $xy \ne \varepsilon$ (resp. $y \ne \varepsilon$, resp. $x \ne \varepsilon$). The prefix of length $k$ of a word $w$ is also denoted by $w[0..k-1]$.

[Figure 1.2.1: The tree of the free monoid on two letters. Root: ε; level 1: a, b; level 2: aa, ab, ba, bb; level 3: aaa, aab, aba, abb, baa, bab, bba, bbb; and so on.]

The set of words over a finite alphabet $A$ can be conveniently seen as a tree. Figure 1.2.1 represents the set $\{a,b\}^*$ as a binary tree. The vertices are the elements of $A^*$. The root is the empty word $\varepsilon$. The children of a node $x$ are the words $xa$ for $a \in A$. Every word $x$ can also be viewed as the path leading from the root to the node $x$. A word $x$ is a prefix of a word $y$ if and only if it is an ancestor of $y$ in the tree. Given two words $x$ and $y$, the longest common prefix of $x$ and $y$ is the nearest common ancestor of $x$ and $y$ in the tree.

The set of factors of a word $x$ is denoted $F(x)$. We denote by $F(X)$ the set of factors of words in a set $X \subset A^*$.

The lexicographic order, also called alphabetic order, is defined as follows. Given two words $x,y$, we have $x < y$ if $x$ is a proper prefix of $y$ or if there exist factorizations $x = uax'$ and $y = uby'$ with $a,b$ letters and $a < b$. This is the usual order in a dictionary. By contrast, $x < y$ in the radix order if $|x| < |y|$, or if $|x| = |y|$ and $x < y$ in the lexicographic order.

A border of a word $w$ is a nonempty word which is both a prefix and a suffix of $w$. A word $w$ is unbordered if its only border is $w$ itself. For example, $a$ is a border of $aba$, and $aabab$ is unbordered.
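The two orders and the notion of border translate directly into a few lines of Python. The following is only a sketch under our own naming (`borders`, `is_unbordered` and `radix_key` are hypothetical helpers, not notation from the text); it relies on the fact that Python compares strings lexicographically, a proper prefix being smaller.

```python
def borders(w: str) -> list[str]:
    """All borders of w: nonempty words that are both a prefix and a suffix of w."""
    return [w[:k] for k in range(1, len(w) + 1) if w[:k] == w[-k:]]


def is_unbordered(w: str) -> bool:
    """A word is unbordered if its only border is the word itself."""
    return borders(w) == [w]


def radix_key(w: str) -> tuple[int, str]:
    """Sort key for the radix order: compare lengths first, then lexicographically."""
    return (len(w), w)


words = ["b", "ab", "a", "aab", "ba"]
print(sorted(words))                 # lexicographic order
print(sorted(words, key=radix_key))  # radix order
print(borders("aba"), is_unbordered("aabab"))
```

On this sample the lexicographic order places `aab` before `b`, while the radix order lists the one-letter words first.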

1.2.1 Generating series

For a set $X$ of words, we denote by $f_X(z) = \sum_{n\ge 0} \mathrm{Card}(X \cap A^n)\, z^n$ the generating series of $X$.

Operations on sets can be transferred to their generating series. First, if $X,Y$ are disjoint, then

$$ f_{X\cup Y}(z) = f_X(z) + f_Y(z). \tag{1.2.1} $$

Next, the product $XY$ of two sets $X,Y$ is defined by $XY = \{xy \mid x \in X, y \in Y\}$. We say that the product is unambiguous if $xy = x'y'$ for $x,x' \in X$ and $y,y' \in Y$ implies $x = x'$ and $y = y'$. If the product of $X,Y$ is unambiguous, then

$$ f_{XY}(z) = f_X(z)\, f_Y(z). \tag{1.2.2} $$

A set $X \subset A^+$ is a code if the factorization of a word in words of $X$ is unique. Formally, $X$ is a code if $x_1x_2\cdots x_n = y_1y_2\cdots y_m$ with $x_i, y_j \in X$ and $n,m \ge 1$ implies $n = m$ and $x_i = y_i$ for $1 \le i \le n$.

As a particular case, a prefix code is a set which does not contain any proper prefix of one of its elements. The submonoid generated by a prefix code $X$ is right unitary, that is to say that $u, uv \in X^*$ implies $v \in X^*$. Conversely, any right unitary submonoid is generated by a prefix code.

If $X$ is a code, then

$$ f_{X^*}(z) = \frac{1}{1 - f_X(z)}. \tag{1.2.3} $$

In fact, since the sets $X^n, X^m$ are disjoint for $n \ne m$, we have $f_{X^*}(z) = \sum_{n\ge 0} f_{X^n}(z)$. By unique decomposition, we also have $f_{X^n}(z) = (f_X(z))^n$. Thus $f_{X^*}(z) = \sum_{n\ge 0} f_X(z)^n$, whence the result.

Example 1 Let $X = \{a, ba\}$. The set $X$ is a prefix code. We have $\mathrm{Card}(X^k \cap A^n) = \binom{k}{n-k}$. Indeed, a word in $X^k \cap A^n$ is a product of $n-k$ words $ba$ and $2k-n$ words $a$. It is determined by the choice of the positions of the $n-k$ words $ba$ among $k$ possible ones.

On the other hand, $\mathrm{Card}(X^* \cap A^n) = F_{n+1}$ where $F_n$ is the Fibonacci sequence defined by $F_0 = 0$, $F_1 = 1$ and $F_{n+1} = F_n + F_{n-1}$ for $n \ge 1$ (the first values are given in Table 1.2.1). This is a consequence of the fact that $f_{X^*}(z) = \frac{1}{1 - z - z^2}$ by Equation (1.2.3). Since $f_{X^*}(z) = \sum_{k\ge 0} f_{X^k}(z)$, we obtain the well-known identity relating Fibonacci numbers and binomial coefficients

$$ F_{n+1} = \sum_{k\le n} \binom{k}{n-k} \tag{1.2.4} $$

which sums binomial coefficients along the parallels to the first diagonal in Pascal's triangle (see Table 1.2.2).

n     0  1  2  3  4  5  6   7   8   9  10  11   12   13
F_n   0  1  1  2  3  5  8  13  21  34  55  89  144  233

Table 1.2.1 The first values of the Fibonacci sequence.

1
1   1
1   2   1
1   3   3   1
1   4   6   4   1
1   5  10  10   5   1
1   6  15  20  15   6   1

Table 1.2.2 Pascal's triangle. The sums marked along the parallels to the first diagonal are 1, 1, 2, 3, 5.
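Example 1 can be checked by brute force for small lengths. The sketch below is an illustration under our own assumptions: the helper names are hypothetical, and membership in $\{a,ba\}^*$ is tested through the elementary observation that a word belongs to this set exactly when every $b$ is immediately followed by an $a$.

```python
from itertools import product
from math import comb


def in_X_star(w: str) -> bool:
    """Membership in {a, ba}*: every b must be immediately followed by an a."""
    return all(i + 1 < len(w) and w[i + 1] == "a"
               for i, c in enumerate(w) if c == "b")


def count_X_star(n: int) -> int:
    """Brute-force Card(X* ∩ A^n) over the alphabet {a, b}."""
    return sum(in_X_star("".join(t)) for t in product("ab", repeat=n))


def fib(n: int) -> int:
    """Fibonacci numbers with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


for n in range(10):
    # Right-hand side of Identity (1.2.4); comb(k, r) is 0 when r > k.
    diagonal_sum = sum(comb(k, n - k) for k in range(n + 1))
    assert count_X_star(n) == fib(n + 1) == diagonal_sum
```

The loop compares three quantities for each $n < 10$: the brute-force count, $F_{n+1}$, and the diagonal sum of binomial coefficients.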

Example 2 The Dyck set is the set of words on the alphabet $\{a,b\}$ having an equal number of occurrences of $a$ and $b$. It is a right unitary submonoid and thus it is generated by a prefix code $D$ called the Dyck code. Let $D_a$ (resp. $D_b$) be the set of words of $D$ beginning with $a$ (resp. $b$). We have

$$ D_a = a D_a^* b \quad\text{and}\quad D_b = b D_b^* a. \tag{1.2.5} $$

Let us verify the first one. The second one is symmetrical. Clearly any $d \in D_a$ ends with $b$. Set $d = ayb$. Then $y$ has the same number of occurrences of $a$ and $b$ and thus $y \in D^*$. Set $y = y_1\cdots y_n$ with $y_i \in D$. If some $y_i$ begins with $b$, then $ay_1\cdots y_{i-1}b$ is a proper prefix of $d$ which belongs to $D^*$, a contradiction with the fact that $D$ is a prefix code. Thus all $y_i$ are in $D_a$ and $d \in aD_a^*b$. Conversely, any word in $aD_a^*b$ is clearly in $D_a$.

Since all products in (1.2.5) are unambiguous, we obtain $f_{D_a}(z) = z^2 f_{D_a^*}(z)$. Since $D_a$ is a code, by (1.2.3), this implies $f_{D_a}(z) = z^2/(1 - f_{D_a}(z))$. We conclude that $f_{D_a}(z)$ is the solution of the equation

$$ y(z)^2 - y(z) + z^2 = 0 \tag{1.2.6} $$

such that $y(0) = 0$. Thus, we obtain the formula

$$ f_{D_a}(z) = \frac{1 - \sqrt{1 - 4z^2}}{2}. \tag{1.2.7} $$

Finally, since $D = D_a \cup D_b$ and $f_{D_a}(z) = f_{D_b}(z)$ for reasons of symmetry, we obtain

$$ f_D(z) = 1 - \sqrt{1 - 4z^2}. \tag{1.2.8} $$

Using the binomial formula, we obtain $\mathrm{Card}(D \cap A^{2n}) = -(-4)^n \binom{1/2}{n}$. An elementary computation shows that $\binom{1/2}{n} = \frac{2(-1)^{n-1}}{n4^n}\binom{2n-2}{n-1}$. Thus

$$ \mathrm{Card}(D \cap A^{2n}) = \frac{2}{n}\binom{2n-2}{n-1}. \tag{1.2.9} $$

As a consequence, and since $D_a = aD_a^*b$ by (1.2.5), we obtain the important and well-known fact that

$$ \mathrm{Card}(D_a^* \cap A^{2n}) = \frac{1}{n+1}\binom{2n}{n}. \tag{1.2.10} $$

These numbers are called the Catalan numbers (see Table 1.2.3).

n   1  2  3  4   5   6    7    8     9    10
    1  1  2  5  14  42  132  429  1430  4862

Table 1.2.3 The first Catalan numbers.
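Formulas (1.2.9) and (1.2.10) can be confirmed by enumeration for small $n$. The following sketch uses our own hypothetical helper names and tracks the difference between the numbers of $a$'s and $b$'s read so far: a word is in $D$ when this difference first returns to 0 at the very end, and in $D_a^*$ when it ends at 0 without ever becoming negative.

```python
from itertools import product
from math import comb


def is_dyck_code_word(w: str) -> bool:
    """w is in D: nonempty, balanced, and no proper nonempty prefix is balanced."""
    height = 0
    for i, c in enumerate(w):
        height += 1 if c == "a" else -1
        if height == 0 and i < len(w) - 1:
            return False  # a proper prefix is already balanced
    return len(w) > 0 and height == 0


def is_in_Da_star(w: str) -> bool:
    """w is in D_a^*: balanced, and every prefix has at least as many a's as b's."""
    height = 0
    for c in w:
        height += 1 if c == "a" else -1
        if height < 0:
            return False
    return height == 0


def count(pred, length: int) -> int:
    """Number of words of the given length over {a, b} satisfying pred."""
    return sum(pred("".join(t)) for t in product("ab", repeat=length))


for n in range(1, 7):
    assert count(is_dyck_code_word, 2 * n) == 2 * comb(2 * n - 2, n - 1) // n  # (1.2.9)
    assert count(is_in_Da_star, 2 * n) == comb(2 * n, n) // (n + 1)           # (1.2.10)
```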

1.2.2 Automata

An automaton on the alphabet $A$ is given by a set $Q$ of states, a set $E \subset Q \times A \times Q$ of edges, a set $I$ of initial states and a set $T$ of terminal states. The automaton is denoted $\mathcal{A} = (Q,E,I,T)$, or $(Q,I,T)$ if $E$ is understood.

[Figure 1.2.2: An automaton. The diagram shows two states, 1 and 2, and three edges labeled a, b, a.]

Example 3 Figure 1.2.2 represents an automaton with two states and three edges. The initial states are indicated with an incoming edge and the terminal ones with an outgoing edge. Here state 1 is both the unique initial and terminal state.

A path in the automaton is a sequence of consecutive edges $(p_i, a_i, p_{i+1})$ for $1 \le i \le n$. The integer $n$ is the length of the path. The word $w = a_1a_2\cdots a_n$ is its label. We denote such a path by $p_1 \xrightarrow{w} p_{n+1}$. A path $i \xrightarrow{w} t$ is successful if $i \in I$ and $t \in T$. The set recognized by the automaton is the set of labels of successful paths. The automaton is said to be unambiguous if for each word $w$ there is at most one successful path labeled $w$. Thus, an unambiguous automaton defines a bijection between the set of successful paths and the set of their labels. As a particular case, an automaton is deterministic if it has at most one initial state and, for each state $p$, at most one edge labeled by a given letter starting at $p$.

Example 4 The automaton represented in Figure 1.2.2 recognizes the set $\{a,ba\}^*$ of Example 1. It is deterministic and thus unambiguous.

The adjacency matrix of the automaton $\mathcal{A} = (Q,E,I,T)$ is the $Q \times Q$-matrix with integer coefficients defined by

$$ M_{p,q} = \mathrm{Card}\{e \in E \mid e = (p,a,q) \text{ for some } a \in A\}. $$

It is clear that for each $n \ge 1$, $(M^n)_{p,q}$ is the number of paths of length $n$ from $p$ to $q$. Thus we have the following useful statement.

Proposition 1 Let $\mathcal{A} = (Q,I,T)$ be an unambiguous automaton, let $M$ be its adjacency matrix and let $X$ be the set recognized by $\mathcal{A}$. For each $n \ge 1$,

$$ \mathrm{Card}(X \cap A^n) = \sum_{i\in I,\, t\in T} (M^n)_{i,t}. $$

Example 5 The adjacency matrix of the automaton represented in Figure 1.2.2 is

$$ M = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}. $$

It is easy to verify that

$$ M^n = \begin{bmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{bmatrix}. $$

Thus, by Proposition 1, we have $\mathrm{Card}(\{a,ba\}^* \cap A^n) = F_{n+1}$, as already seen in Example 1.
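Proposition 1 turns counting into matrix powering. As a small illustration, assuming states 1 and 2 are stored at indices 0 and 1 and using hypothetical helpers `mat_mul`, `mat_pow` and `count_words` (plain lists of rows, no external library), one may compute the counts of Example 5 as follows.

```python
def mat_mul(x, y):
    """Multiply two square integer matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Compute m**n for n >= 0 by repeated multiplication."""
    size = len(m)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Adjacency matrix of Example 5 (states 1 and 2 are indices 0 and 1).
M = [[1, 1],
     [1, 0]]
initial, terminal = [0], [0]


def count_words(n):
    """Proposition 1: sum of the entries (M^n)[i][t] for i initial and t terminal."""
    Mn = mat_pow(M, n)
    return sum(Mn[i][t] for i in initial for t in terminal)


print([count_words(n) for n in range(1, 11)])
# [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```

Repeated multiplication is enough here; for large $n$, fast exponentiation by squaring would be the natural replacement.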

1.3 Conjugacy

We define necklaces and primitive necklaces. We first enumerate primitive necklaces (Witt's Formula, Proposition 4) and then arbitrary ones (Proposition 6). See [30] for a more detailed presentation. These notions have been extended to more general structures (see in particular the case of partial words in [6]).

1.3.1 Periods

An integer $p \ge 1$ is a period of a word $w = a_1a_2\cdots a_n$, where $a_i \in A$, if $a_i = a_{i+p}$ for $i = 1,\ldots,n-p$. The smallest period of $w$ is called the minimal period of $w$.

Proposition 2 (Fine, Wilf) If $p,q$ are periods of a word $w$ of length $\ge p + q - \gcd(p,q)$, then $w$ has period $\gcd(p,q)$.

Proof. Set $w = a_1a_2\cdots a_n$ with $a_i \in A$ and $d = \gcd(p,q)$. We may assume that $p \ge q$. Assume first that $d = 1$. Let us show that $p-q$ is a period of $w$. Let $i$ be such that $1 \le i \le n-p+q$. If $i \le n-p$, we have $a_i = a_{i+p} = a_{i+p-q}$. Otherwise, we have $i > n-p$ and thus $i > q-1$, since $n \ge p+q-1$. Then $a_i = a_{i-q} = a_{i+p-q}$. Thus $w$ has period $p-q$. Since $\gcd(p,q) = \gcd(p-q,q)$, we obtain by induction on $p+q$ that $w$ has period 1.

In the general case, we consider the alphabet $B = A^d$. On this alphabet $w$ has periods $p/d$, $q/d$ and length $n/d \ge p/d + q/d - 1$. By the first part, it has period 1 as a word on the alphabet $B$ and thus period $d$ on the alphabet $A$.

Example 6 The word $w = abaababaaba$ has periods 5 and 8 and length $11 = 5+8-2$. By Proposition 2, no word of length 12 can have periods 5 and 8 without having period 1.

More generally, let $x_n$ be the Fibonacci sequence of words defined by $x_1 = b$, $x_2 = a$ and $x_{n+1} = x_nx_{n-1}$ for $n \ge 2$. For $n \ge 3$, let $y_n$ be the word $x_n$ minus its two last letters. The word $y_7$ is the word $w$ above. Then, for $n \ge 6$, $y_{n+1}$ has periods $F_n$ and $F_{n-1}$. Indeed, $y_{n+1} = x_ny_{n-1}$ shows that $y_{n+1}$ has period $F_n$. Moreover,

$$ y_{n+1} = x_ny_{n-1} = x_{n-1}x_{n-2}x_{n-2}y_{n-3} = x_{n-1}x_{n-2}x_{n-3}x_{n-4}y_{n-3} = x_{n-1}x_{n-1}x_{n-4}y_{n-3}, $$

which shows that $F_{n-1}$ is a period since $x_{n-4}y_{n-3}$ is a prefix of $x_{n-2}$ and thus of $x_{n-1}$. Since $|y_{n+1}| = F_n + F_{n-1} - 2$, this shows that the bound of Proposition 2 is the best possible.
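The sharpness of the Fine and Wilf bound can be observed experimentally on the Fibonacci words. The sketch below uses 0-based string indices and two hypothetical helpers of our own, `has_period` and `fibonacci_word`; it checks the word $w$ of Example 6 and then the words $y_{n+1}$ for a few more values of $n$.

```python
from math import gcd


def has_period(w: str, p: int) -> bool:
    """p >= 1 is a period of w if w[i] == w[i + p] whenever both positions exist."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def fibonacci_word(n: int) -> str:
    """x_1 = b, x_2 = a, x_{n+1} = x_n x_{n-1}."""
    prev, cur = "b", "a"
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur if n >= 2 else prev


w = fibonacci_word(7)[:-2]  # the word y_7
assert w == "abaababaaba"
assert has_period(w, 5) and has_period(w, 8)
assert not has_period(w, gcd(5, 8))  # length 11 = 5 + 8 - 2 is one letter too short

for n in range(6, 15):
    y = fibonacci_word(n + 1)[:-2]
    p, q = len(fibonacci_word(n)), len(fibonacci_word(n - 1))  # F_n and F_{n-1}
    assert len(y) == p + q - 2
    assert has_period(y, p) and has_period(y, q)
    assert gcd(p, q) == 1 and not has_period(y, 1)
```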

A word $w \in A^+$ is primitive if $w = u^n$ for $u \in A^+$ implies $n = 1$.

Two words $x,y$ are conjugate if there exist words $u,v$ such that $x = uv$ and $y = vu$. Thus conjugate words are just cyclic shifts of one another. Conjugacy is thus an equivalence relation. The conjugacy class of a word of length $n$ and minimal period $p$ has $p$ elements if $p$ divides $n$ and has $n$ elements otherwise. In particular, we note the following result.

Proposition 3 A primitive word of length $n$ has $n$ distinct conjugates.
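Both notions are easy to experiment with. In the following sketch, `conjugates` and `is_primitive` are hypothetical helper names of ours, and primitivity is tested naively by trying every proper divisor of the length as a possible root length.

```python
def conjugates(w: str) -> set[str]:
    """The conjugacy class of w: all cyclic shifts vu of w = uv."""
    return {w[i:] + w[:i] for i in range(len(w))}


def is_primitive(w: str) -> bool:
    """A nonempty word is primitive if it is not a power u^n with n >= 2."""
    n = len(w)
    return n > 0 and not any(
        n % d == 0 and w[:d] * (n // d) == w for d in range(1, n)
    )


for w in ["aabab", "abab", "aaaa", "abaab"]:
    print(w, is_primitive(w), len(conjugates(w)))
```

The primitive words `aabab` and `abaab` have 5 conjugates each, in accordance with Proposition 3, while `abab` has 2 and `aaaa` has 1.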

1.3.2 Necklaces

A class of conjugacy is often called a necklace, represented on a circle (read clockwise, see Figure 1.3.3).

Let $p(n,k)$ be the number of primitive necklaces of length $n$ on $k$ letters. Every word of length $n$ is in a unique way a power of a primitive word of length $d$ with $d$ dividing $n$, and such a word has $d$ distinct conjugates. Thus, for any $n \ge 1$,

$$ k^n = \sum_{d \mid n} d\, p(d,k). \tag{1.3.11} $$

This can be written, using generating series, as a formula called the Cyclotomic Identity:

$$ \frac{1}{1-kz} = \prod_{n\ge 1} \frac{1}{(1-z^n)^{p(n,k)}}. \tag{1.3.12} $$

[Figure 1.3.3: The six primitive necklaces of length 5 on the alphabet $\{a,b\}$.]

Indeed, taking the logarithm of both sides in Equation (1.3.12), we obtain

$$ \sum_{n\ge 1} \frac{k^n z^n}{n} = \sum_{n\ge 1} -p(n,k)\log(1-z^n) = \sum_{n\ge 1} p(n,k) \sum_{m\ge 1} \frac{z^{nm}}{m} = \sum_{n\ge 1} \sum_{n=de} p(d,k)\, \frac{z^n}{e} $$

and thus $k^n/n = \sum_{n=de} p(d,k)/e$, whence Formula (1.3.11).

We are going to find a converse giving an expression for the numbers $p(n,k)$. This solution of the system of linear equations (1.3.11) uses the following function. The Möbius function is defined by $\mu(1) = 1$ and for $n > 1$

$$ \mu(n) = \begin{cases} (-1)^i & \text{if $n$ is the product of $i$ distinct prime numbers} \\ 0 & \text{otherwise.} \end{cases} $$

Table 1.3.4 gives the first values of the Möbius function.

n      1   2   3   4   5   6   7   8   9  10
µ(n)   1  −1  −1   0  −1   1  −1   0   0   1

Table 1.3.4 The values of $\mu(n)$ for $n \le 10$.

Proposition 4 (Witt's Formula) The number of primitive necklaces of length $n$ on $k$ letters is $p(n,k) = \frac{1}{n} \sum_{d \mid n} \mu(n/d)\, k^d$.

n        1    2    3    4    5    6    7    8    9
p(n,1)   1    0    0    0    0    0    0    0    0
p(n,2)   2    1    2    3    6    9   18   30
p(n,3)   3    3    8   18   48  116  312
p(n,4)   4    6   20   60  204  670
p(n,5)   5   10   40  150  624
p(n,6)   6   15   70  315
p(n,7)   7   21  112
p(n,8)   8   28
p(n,9)   9

Table 1.3.5 The number $p(n,k)$ of primitive necklaces of length $n$ on $k$ letters for $2 \le k+n \le 10$.

Table 1.3.5 gives the first values of $p(n,k)$. We prove some properties of the Möbius function before giving the proof of Proposition 4.

Proposition 5 One has

$$ \sum_{d \mid n} \mu(d) = \begin{cases} 1 & \text{if } n = 1 \\ 0 & \text{otherwise.} \end{cases} $$

Proof. Indeed, for $n \ge 2$, let $n = p_1^{k_1}\cdots p_m^{k_m}$ and $d = p_1^{\ell_1}\cdots p_m^{\ell_m}$ be the prime decompositions of $n,d$. Then $\mu(d) \ne 0$ if and only if all $\ell_i$ are 0 or 1, and then $\mu(d) = (-1)^t$ with $t = \sum_{i=1}^m \ell_i$. Moreover, there are $\binom{m}{t}$ possible choices giving the same sum $t$. Thus

$$ \sum_{d \mid n} \mu(d) = \sum_{t=0}^{m} (-1)^t \binom{m}{t} = 0 $$

since, by the binomial identity, the last expression is $(1-1)^m$.

For two functions $\alpha,\beta$ from $\mathbb{N}\setminus\{0\}$ into a ring $R$, their convolution product is the function $\alpha * \beta : \mathbb{N}\setminus\{0\} \to R$ defined by

$$ \alpha * \beta\,(n) = \sum_{de=n} \alpha(d)\beta(e). $$

This product is associative with neutral element the function $\mathbf{1}$ with value 1 on 1 and 0 elsewhere. By Proposition 5 the function $n \mapsto \sum_{d \mid n} \mu(d)$ is the function $\mathbf{1}$. This shows that the Möbius function is the inverse for the convolution product of the constant function equal to 1.

Proof of Proposition 4. Set $\alpha(n) = k^n$ and $\beta(n) = n\,p(n,k)$. Since $k^n = \sum_{d \mid n} d\,p(d,k)$ by Equation (1.3.11), we have $\alpha = \beta * \gamma$ where $\gamma$ is the constant function equal to 1. Since $\gamma * \mu = \mathbf{1}$, the convolution product of both sides by the Möbius function gives $\alpha * \mu = \beta$, that is $n\,p(n,k) = \sum_{n=de} \mu(d)k^e$.
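Witt's Formula is straightforward to evaluate and to compare with a direct enumeration. The code below is a sketch under our own assumptions: the function names are hypothetical, the Möbius function is computed by trial division, letters are represented by the integers $0,\ldots,k-1$, and a necklace is counted as primitive when its conjugacy class has exactly $n$ elements.

```python
from itertools import product


def mobius(n: int) -> int:
    """Möbius function by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # the square of a prime divides the original n
            result = -result
        p += 1
    return -result if n > 1 else result


def witt(n: int, k: int) -> int:
    """p(n, k) as given by Proposition 4."""
    total = sum(mobius(n // d) * k**d for d in range(1, n + 1) if n % d == 0)
    return total // n


def primitive_necklaces_brute(n: int, k: int) -> int:
    """Count conjugacy classes with exactly n elements among words of length n."""
    classes = set()
    for t in product(range(k), repeat=n):
        shifts = {t[i:] + t[:i] for i in range(n)}
        if len(shifts) == n:
            classes.add(min(shifts))
    return len(classes)


assert all(witt(n, k) == primitive_necklaces_brute(n, k)
           for n in range(1, 8) for k in range(1, 4))
print([witt(n, 2) for n in range(1, 9)])
# [2, 1, 2, 3, 6, 9, 18, 30]
```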

Recall that Euler's totient function $\varphi$ is defined as follows. The value of $\varphi(n)$ for $n \ge 1$ is the number of integers $k$ with $1 \le k \le n$ such that $\gcd(n,k) = 1$. In other words, for $n \ge 2$, $\varphi(n)$ is the number of integers $k$ with $1 \le k < n$ which are prime to $n$ (see Table 1.3.6). One has $n = \sum_{d \mid n} \varphi(d)$. Indeed, for each divisor $d$ of $n$ the set $M_d$ of integers $m$ with $1 \le m \le n$ such that $\gcd(n,m) = d$ has $\varphi(n/d)$ elements. Thus $n = \sum_{d \mid n} \mathrm{Card}(M_d) = \sum_{d \mid n} \varphi(n/d) = \sum_{d \mid n} \varphi(d)$.

n      1  2  3  4  5  6  7  8  9  10
ϕ(n)   1  1  2  2  4  2  6  4  6   4

Table 1.3.6 The values of the Euler function $\varphi(n)$ for $n \le 10$.

Let $c(n,k)$ be the number of necklaces of length $n$ on $k$ letters. Table 1.3.7 gives the first values of the numbers $c(n,k)$.

n        1    2    3    4    5    6    7    8    9
c(n,1)   1    1    1    1    1    1    1    1    1
c(n,2)   2    3    4    6    8   14   20   36
c(n,3)   3    6   11   24   51  130  315
c(n,4)   4   10   24   70  208  700
c(n,5)   5   15   45  165  629
c(n,6)   6   21   76  336
c(n,7)   7   28  119
c(n,8)   8   36
c(n,9)   9

Table 1.3.7 The values of the number $c(n,k)$ of necklaces of length $n$ on $k$ letters for $2 \le k+n \le 10$.

The values in Table 1.3.7 can be easily computed from those of Table 1.3.5 using the fact that $c(n,k) = \sum_{d \mid n} p(d,k)$. The following statement gives a direct way to compute the numbers $c(n,k)$ (see [21], where it is credited to MacMahon).

Proposition 6 $c(n,k) = \frac{1}{n} \sum_{d \mid n} \varphi(n/d)\, k^d$.

Proof. Consider the multiset formed by taking, for each necklace of length $n$, the $n$ circular shifts of one of its words (each word of length $n$ may appear several times). The total number of elements is $n\,c(n,k)$. On the other hand, each word $w = a_0\cdots a_{n-1}$ of length $n$ appears with a multiplicity which is the number of integers $p$ with $0 \le p < n$ such that $w = a_p\cdots a_{n-1}a_0\cdots a_{p-1}$, that is, which are a period of $w^2$. But $p$ is a period of $w^2$ if and only if $w$ is a power of a word of length $\gcd(n,p)$. Thus

$$ n\,c(n,k) = \sum_{0\le p<n} k^{\gcd(n,p)}. \tag{1.3.13} $$

Since there are $\varphi(n/d)$ integers $p$ with $0 \le p < n$ such that $d = \gcd(n,p)$, the result follows.
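The two expressions for $c(n,k)$ appearing in this proof can be compared with a direct enumeration of conjugacy classes. The following sketch makes its own assumptions: all function names are hypothetical, letters are the integers $0,\ldots,k-1$, and each necklace is represented by its lexicographically smallest word. Note that `gcd(n, 0)` equals `n` in Python, which matches the term $p = 0$ of Equation (1.3.13).

```python
from itertools import product
from math import gcd


def totient(n: int) -> int:
    """Euler's totient: the number of k in 1..n with gcd(n, k) = 1."""
    return sum(1 for k in range(1, n + 1) if gcd(n, k) == 1)


def necklaces_formula(n: int, k: int) -> int:
    """c(n, k) as given by Proposition 6."""
    return sum(totient(n // d) * k**d for d in range(1, n + 1) if n % d == 0) // n


def necklaces_gcd_sum(n: int, k: int) -> int:
    """c(n, k) as given by Equation (1.3.13)."""
    return sum(k ** gcd(n, p) for p in range(n)) // n


def necklaces_brute(n: int, k: int) -> int:
    """Count conjugacy classes of words of length n by listing them."""
    return len({min(t[i:] + t[:i] for i in range(n))
                for t in product(range(k), repeat=n)})


for n in range(1, 8):
    for k in range(1, 4):
        assert necklaces_formula(n, k) == necklaces_gcd_sum(n, k) == necklaces_brute(n, k)

print([necklaces_formula(n, 2) for n in range(1, 9)])
# [2, 3, 4, 6, 8, 14, 20, 36]
```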

We illustrate the proof of Proposition 6 in the following example.

Example 7 Let $A = \{a,b\}$. The multiset of circular shifts of words of length 4 is the multiset of $6 \times 4 = 24$ elements represented below.

aaaa  aaaa  aaaa  aaaa
aaab  aaba  abaa  baaa
aabb  abba  bbaa  baab
abab  baba  abab  baba
abbb  babb  bbab  bbba
bbbb  bbbb  bbbb  bbbb

The words appearing more than once are $abab, baba$, which appear twice, and $aaaa, bbbb$, which appear 4 times.

The following array gives for each value of $p = 1,2,3$ the set of words $w$ of length 4 such that $p$ is a period of $w^2$ (for $p = 0$ it is the set of all words of length 4).

p   words                                              gcd(p,4)
0   aaaa, aaab, aaba, aabb, abaa, abab, abba, abbb,
    baaa, baab, baba, babb, bbaa, bbab, bbba, bbbb     4
1   aaaa, bbbb                                         1
2   aaaa, abab, baba, bbbb                             2
3   aaaa, bbbb                                         1

The value of $d = \gcd(p,4)$ is indicated on the right. Each word in a row is determined by its prefix of length $d$, so the row indexed by $p$ contains $2^d$ elements, corresponding to the binary words of length $d$. In this way we have illustrated Equation (1.3.13) since, summing the cardinalities of the sets in each row, we obtain $24 = 16+2+4+2$.

1.3.3 Circular codes

A circular code is a set of words $X$ on the alphabet $A$ such that any necklace has a unique factorization in words of $X$. In particular, a circular code is a code.

Formally, $X$ is a circular code if for $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ in $X$ the equality $s x_2 \cdots x_n p = y_1 \cdots y_m$ with $x_1 = ps$ and $s$ nonempty implies $n = m$, $p = 1$ and $x_i = y_i$ for $1 \le i \le n$.


Example 8 The set $X = \{a, ba\}$ is a circular code. Indeed, there is at most one way to paste every occurrence of $b$ with the $a$ following it.

Example 9 The set $X = \{ab, ba\}$ is not a circular code. Indeed, the necklace of $ab$ has two possible factorizations.

It can be shown that a submonoid $M$ of $A^*$ is generated by a circular code if and only if it satisfies the following condition for any $u,v \in A^*$.

$$uv, vu \in M \Leftrightarrow u, v \in M. \qquad (1.3.14)$$

For a proof, see [5, Chapter 7]. Note that (1.3.14) implies for any $u \in A^*$ and $n \ge 1$

$$u^n \in M \Leftrightarrow u \in M. \qquad (1.3.15)$$

Let $S$ be a set of words on the alphabet $A$ and let $s_n = \mathrm{Card}(S \cap A^n)$ in such a way that $f_S(z) = \sum_{n \ge 0} s_n z^n$.

The zeta function of $S$ is the series

$$\zeta_S(z) = \exp \sum_{n \ge 1} \frac{s_n}{n} z^n.$$

The following is due to Manning (see [5, Chapter 7]). The proof uses an argument due to [41].

Theorem 1.3.1 Let $X$ be a circular code and let $S$ be the set of words having a conjugate in $X^*$. Then

$$\zeta_S(z) = \frac{1}{1 - f_X(z)} \qquad (1.3.16)$$

or equivalently

$$f_S(z) = \frac{z f_X'(z)}{1 - f_X(z)}. \qquad (1.3.17)$$

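Proof. For $x \in X$, denote by $g_{n,x}$ the number of words of the form $w = syp$ of length $n$ with $y \in X^*$ and $x = ps$ with $p$ nonempty. Since $X$ is circular, the triple $(s,y,p)$ is uniquely determined by $w$. Conversely, every word of $S \cap A^n$ is of this form for some $x \in X$. Thus $g_{n,x} = |x|\,\mathrm{Card}(X^* \cap A^{n-|x|})$ and $\mathrm{Card}(S \cap A^n) = \sum_{x \in X} g_{n,x}$. We obtain

$$\mathrm{Card}(S \cap A^n) = \sum_{x \in X} g_{n,x} = \sum_{x \in X} |x|\,\mathrm{Card}(X^* \cap A^{n-|x|}) = \sum_{m=0}^{n} m\,\mathrm{Card}(X \cap A^m)\,\mathrm{Card}(X^* \cap A^{n-m}).$$

This shows that $f_S(z) = z f_X'(z) f_{X^*}(z)$, whence Formula (1.3.17) since $f_{X^*}(z) = 1/(1 - f_X(z))$. Formula (1.3.16) is equivalent to (1.3.17), as one sees by taking the derivative of the logarithm of each side of (1.3.16) and multiplying by $z$.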


Let $u_n = \mathrm{Card}(X \cap A^n)$ in such a way that $f_X(z) = \sum_{n \ge 0} u_n z^n$. Using Formula (1.3.17), we obtain for any $n \ge 1$ the formula known as Newton's Formula in the context of symmetric functions

$$s_n = n u_n + \sum_{1 \le i \le n-1} s_i u_{n-i}. \qquad (1.3.18)$$

Indeed, since from Equation (1.3.17) we have $f_S(z) = \frac{z f_X'(z)}{1 - f_X(z)}$, we deduce that $f_S(z) = z f_X'(z) + f_S(z) f_X(z)$, whence Formula (1.3.18).

Let now $P$ be the set of primitive necklaces in $S$ and let $p_n = \mathrm{Card}(P \cap A^n)$. Since a word of $S$ of length $n$ is a power of a primitive word of length $d$ with $d$ dividing $n$, and since this word has $d$ conjugates, we have the following equality, generalizing Equation (1.3.11)

$$s_n = \sum_{d|n} d\,p_d. \qquad (1.3.19)$$

Like Equation (1.3.11), Equation (1.3.19) can be written as an equation relating power series and giving a generalization of the Cyclotomic Identity (1.3.12), namely,

$$f_{X^*}(z) = \prod_{n \ge 1} \frac{1}{(1 - z^n)^{p_n}}. \qquad (1.3.20)$$

Let $c_n$ be the total number of necklaces in $S$, primitive or not. A word of length $n$ in $S$ is in a unique way a power of a primitive word of $S$. Thus $c_n = \sum_{d|n} p_d$. We give below two examples of computation of $s_n$, $p_n$, $c_n$.

Example 10 Let $S$ be the set of representatives of necklaces on $A = \{a,b\}$ without consecutive occurrences of $b$. Then $S$ is the set of words having a conjugate in $X^*$ where $X$ is the circular code $X = \{a, ba\}$. Thus, by Theorem 1.3.1, we have

$$\zeta_S(z) = \frac{1}{1 - z - z^2}.$$

By Newton's Formula, since $u_1 = u_2 = 1$ and $u_n = 0$ for $n \ge 3$, we have $s_{n+1} = s_n + s_{n-1}$ for $n \ge 2$.

We obtain the values indicated in Table 1.3.8. The 3 necklaces of length 5 without $bb$ (in agreement with $c_5 = 3$) are represented in Figure 1.3.4.

n     1  2  3  4   5   6   7   8   9   10   11   12   13
s_n   1  3  4  7  11  18  29  47  76  123  199  322  521
p_n   1  1  1  1   2   2   4   5   8   11   18   25   40
c_n   1  2  2  3   3   5   5   8  10   15   19   31   41

Table 1.3.8 The values of $s_n$, $p_n$, $c_n$ for $n \le 13$.


Figure 1.3.4 The 3 necklaces of length 5 on the alphabet $\{a,b\}$ without $bb$.

Example 11 Let next $S$ be the set of representatives of necklaces on $A = \{a,b\}$ without occurrence of $bbb$. Then $S$ is the set of words having a conjugate in $X^*$ where $X$ is the circular code $X = \{a, ba, bba\}$. Thus

$$\zeta_S(z) = \frac{1}{1 - z - z^2 - z^3}$$

and $s_{n+1} = s_n + s_{n-1} + s_{n-2}$ for $n \ge 3$. We obtain the values indicated in Table 1.3.9. The 5 necklaces of length 5 without $bbb$ (in agreement with $c_5 = 5$) are represented in Figure 1.3.5.

n     1  2  3   4   5   6   7    8    9   10   11    12    13
s_n   1  3  7  11  21  39  71  131  241  443  815  1499  2757
p_n   1  1  2   2   4   5  10   15   26   42   74   121   212
c_n   1  2  3   4   5   9  11   19   29   48   75   132   213

Table 1.3.9 The values of $s_n$, $p_n$, $c_n$ for the set of necklaces without $bbb$.

Figure 1.3.5 The 5 necklaces of length 5 on the alphabet $\{a,b\}$ without $bbb$.

The formulae of this section generalize those of the previous one. MacMahon's Identity (1.3.13) also generalizes to

$$c_n = \frac{1}{n} \sum_{d|n} \varphi(n/d)\,s_d$$

where $\varphi$ denotes Euler's totient function. This allows a direct computation of the $c_n$.
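The computations of Examples 10 and 11 can be reproduced mechanically. The following Python sketch is an illustration with names of our own choosing: it takes the numbers $u_m$ of words of each length in the circular code $X$, computes $s_n$ by Newton's Formula (1.3.18), recovers $p_n$ by solving Equation (1.3.19) for its last term, and obtains $c_n$ both as $\sum_{d|n} p_d$ and by the totient formula above.

```python
from math import gcd

def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]

def euler_phi(m):
    """Euler's totient function."""
    return sum(1 for i in range(1, m + 1) if gcd(i, m) == 1)

def necklace_counts(u, N):
    """Return the lists (s, p, c) indexed from 0, for lengths 0..N.

    u[m] is the number of words of length m in the circular code X
    (u[0] = 0); missing entries are taken to be 0.
    """
    u = list(u) + [0] * (N + 1 - len(u))
    s = [0] * (N + 1)
    p = [0] * (N + 1)
    c = [0] * (N + 1)
    for n in range(1, N + 1):
        # Newton's Formula (1.3.18)
        s[n] = n * u[n] + sum(s[i] * u[n - i] for i in range(1, n))
        # Equation (1.3.19), solved for the term d = n
        p[n] = (s[n] - sum(d * p[d] for d in divisors(n) if d < n)) // n
        c[n] = sum(p[d] for d in divisors(n))
    return s, p, c

def necklaces_by_totient(s, n):
    """c_n computed directly from the s_d with the totient formula."""
    return sum(euler_phi(n // d) * s[d] for d in divisors(n)) // n
```

With `u = [0, 1, 1]` (the code $\{a, ba\}$) one recovers the rows of Table 1.3.8, and with `u = [0, 1, 1, 1]` (the code $\{a, ba, bba\}$) those of Table 1.3.9.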

1.4 Lyndon words

A Lyndon word is a primitive word which is less than all its other conjugates in the alphabetic order. We denote by $L$ the set of Lyndon words.

The first Lyndon words on $\{a,b\}$ are

a,b

ab

aab,abb

aaab,aabb,abbb

aaaab,aaabb,aabab,aabbb,ababb,abbbb

We first give the following equivalent definition.

Proposition 7 A word is a Lyndon word if and only if it is strictly smaller than any of its nonempty proper suffixes.

Proof. The condition is sufficient. Indeed, let $w = uv$ with $u,v$ nonempty. Since $w < v$, we have $w < vu$.

It is also necessary. For $w \in L$ let $w = uv$ with $u,v$ nonempty. Assume first that $v$ is a prefix of $w$ and thus that $w = vt$. Since $w$ is a Lyndon word, $w < tv$. But $uv < tv$ implies $u < t$ and thus $vu < vt$, a contradiction. Thus $v$ is not a prefix of $w$. But then $v < w$ implies that $vu < w$, a contradiction. We conclude that $w < v$.

Note that, as a consequence, a Lyndon word is unbordered. Indeed, if $u$ is both a nonempty suffix and prefix of $w$, then $u \le w$ and thus $u = w$ by Proposition 7.

The next statement gives a recursive way to build Lyndon words.

Proposition 8 If $\ell, m \in L$ with $\ell < m$, then $\ell m$ is a Lyndon word.

Proof. Let us first show that $\ell m < m$. If $\ell$ is a prefix of $m$, then $m = \ell m'$. Then $m < m'$ implies $\ell m < \ell m' = m$. Otherwise, $\ell < m$ implies $\ell m < m$.

Let $v$ be a nonempty proper suffix of $\ell m$. If $v$ is a suffix of $m$, then by Proposition 7, $m \le v$ and thus $\ell m < m \le v$. Otherwise, we have $v = v'm$. Then $\ell < v'$ and thus $\ell m < v'm = v$. By Proposition 7, we conclude that $\ell m \in L$.

For example, we have $aab, ab \in L$ with $aab < ab$ and consequently $aabab \in L$.
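Proposition 7 gives a direct, if naive, membership test for $L$. The sketch below is our own illustration; it relies on the fact that Python compares strings in the alphabetic order used here, a proper prefix being smaller than the word itself. It tests the suffix condition and lists the Lyndon words of a given length by exhaustive enumeration.

```python
from itertools import product

def is_lyndon(w):
    """Proposition 7: w is strictly smaller than each nonempty proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def lyndon_words(alphabet, n):
    """Lyndon words of length n on the given alphabet, in increasing order."""
    letters = sorted(alphabet)
    return ["".join(t) for t in product(letters, repeat=n) if is_lyndon("".join(t))]
```

For instance, `lyndon_words("ab", 5)` returns the six words of length 5 listed above, and `is_lyndon("aab" + "ab")` is true, in agreement with Proposition 8.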


1.4.1 The Factorization Theorem

The following result is due to Lyndon (see [30] for more references). It motivated Knuth to call Lyndon words prime words in [26].

Theorem 1.4.1 Any word factorizes uniquely as a nonincreasing product of Lyndon words.

The proof uses the following result.

Lemma 1 Let $\ell_1, \ldots, \ell_m$ be a nonincreasing sequence of Lyndon words and let $w = \ell_1 \cdots \ell_m$. Then $\ell_1$ is the longest prefix of $w$ which is a Lyndon word and $\ell_m$ is the minimal nonempty suffix of $w$.

Proof. Assume that $\ell \in L$ is a prefix of $w$ longer than $\ell_1$. We have $\ell = \ell_1 \cdots \ell_i u$ with $i \ge 1$ and $u$ a nonempty prefix of $\ell_{i+1}$. Then $\ell < u \le \ell_{i+1} \le \ell_1 < \ell$, a contradiction.

Next, let $v$ be the minimal nonempty suffix of $w$. Then $v$ is in $L$ by Proposition 7. There is an index $j$, a nonempty suffix $s$ of $\ell_j$ and a word $t$ such that $v = st$. Then $\ell_m \le \ell_j \le s \le st = v \le \ell_m$ which implies $v = \ell_m$.

Proof of Theorem 1.4.1. We have to show that any word $w$ can be written in a unique way $w = \ell_1 \cdots \ell_m$ with $\ell_1, \ldots, \ell_m \in L$ and $\ell_1 \ge \ldots \ge \ell_m$.

Existence: Since the letters are in $L$, any word has a factorization in Lyndon words. Consider a factorization $w = \ell_1 \cdots \ell_m$ with $m$ minimal. If $\ell_i < \ell_{i+1}$ for some $i$, then $w = \ell_1 \cdots \ell_{i-1} (\ell_i \ell_{i+1}) \cdots \ell_m$ is a factorization in Lyndon words with $m-1$ factors, since $\ell_i \ell_{i+1} \in L$ by Proposition 8. This contradicts the minimality of $m$, and thus the factorization is nonincreasing.

Uniqueness: Assume that $\ell_1 \cdots \ell_m = \ell'_1 \cdots \ell'_{m'}$ with $\ell_i, \ell'_i \in L$, $\ell_1 \ge \ldots \ge \ell_m$ and $\ell'_1 \ge \ldots \ge \ell'_{m'}$. By Lemma 1, we have $\ell_1 = \ell'_1$, which gives the conclusion by induction on $m$.

We illustrate Theorem 1.4.1 by giving below the factorization of the word $abracadabra$.

(abracad)(abr)(a)
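The factorization of Theorem 1.4.1 can be computed in linear time. The sketch below uses Duval's classical algorithm, which is not described in the text above and is named here as a swapped-in technique: it scans the word while maintaining the longest prefix that is a sesquipower of a Lyndon word, and outputs the Lyndon factors as soon as the scan fails. The function name is ours.

```python
def lyndon_factorization(w):
    """Nonincreasing factorization of w into Lyndon words (Duval's algorithm)."""
    factors = []
    i, n = 0, len(w)
    while i < n:
        # w[i:j] is a sesquipower of a Lyndon word of length j - k
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            # a larger letter restarts the period, an equal one continues it
            k = i if w[k] < w[j] else k + 1
            j += 1
        # output the full copies of the Lyndon word of length j - k
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors
```

On the word of the example, `lyndon_factorization("abracadabra")` returns `['abracad', 'abr', 'a']`.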

Let $P$ be the set of prefixes of Lyndon words, also called preprime words in [26].

We call a word minimal if it is minimal for the lexicographic order in its conjugacy class. Clearly, a word is minimal if and only if it is a power of a Lyndon word.

A sesquipower of a word $x$ is a word $w = x^n p$ with $n \ge 1$ and $p$ a proper prefix of $x$. Set $m = |w|$. The word $w$ is determined by $x$ and $m$. It is called the $m$-extension of $x$.

The following result appears in Duval [13].

Proposition 9 The set $P$ is the set of sesquipowers of Lyndon words distinct from the maximal letter.

The proof uses the following lemma.


Lemma 2 For any word $p$ and letter $a$ such that $pa$ is a prefix of a minimal word and for any letter $b$ such that $a < b$, the word $pb$ is in $L$.

Proof. Let $x$ be a Lyndon word such that $pa$ is a prefix of $x^n$ for some $n \ge 1$. Then $p = x^{n-1} q$ and $x = qar$.

We first show that if $a < b$, then $qb \in L$. Indeed, this is true if $q$ is empty. Otherwise, let $t$ be a proper suffix of $q$. Then $tar$ is a proper suffix of $x$. By Proposition 7, this implies $x < tar$ and therefore $q < t$. Thus $qb < tb$. Since any proper suffix of $qb$ is of this form, this shows that $qb \in L$ by Proposition 7 again.

Now, since $x < qb$, we have $x^m qb \in L$ for any $m \ge 1$ by Proposition 8.

Proof of Proposition 9. Let $x$ be a Lyndon word distinct from the maximal letter. Any sesquipower $w$ of $x$ is a prefix of a power $x^n$ of $x$. By hypothesis, we can write $x = paq$ with $a$ not the maximal letter. Then, by Lemma 2, for any letter $b > a$, we have $x^n pb \in L$ and thus $w$ is in $P$.

Conversely, we use an induction on the length of $w \in P$. If $|w| = 1$, then $w \in L$. Assume $|w| > 1$. Set $w = va$ with $a \in A$. By induction hypothesis, $v = y^n p$ with $y \in L$, $n \ge 1$ and $p$ a proper prefix of $y$. Set $y = pbu$ with $b \in A$. Since $w$ is a prefix of a Lyndon word, we have $pb \le pa$ and thus $b \le a$. If $a = b$, then $w$ is a sesquipower of $y$.

Finally if $b < a$, $w$ is a Lyndon word by Lemma 2.

Observe that the Lyndon word $x$ such that $w$ is a sesquipower of $x$ is unique. Indeed, assume that $w$ is a sesquipower of $x, x' \in L$. Assuming that $|x| < |x'|$, we have $x' = x^k p$ with $p$ a nonempty prefix of $x$. Then $p \le x < x' < p$, a contradiction.

1.4.2 Generating Lyndon words

Proposition 9 can be used to generate Lyndon words of a given length in alphabetic order (this algorithm is due to Fredricksen and Maiorana [17], and independently to Duval [14], see [26]). The idea is to generate all preprime words of this length. This generation problem has been considered in several contexts (see [37], [34] or [26] in particular).

The algorithm SESQUIPOWERS is represented below. We use the alphabet $\{0, \ldots, k-1\}$. This algorithm visits all preprime words $a_1 \cdots a_n$ of length $n$ with an index $j$ such that $a_1 \cdots a_n$ is an extension of $a_1 \cdots a_j$ (we say equivalently that the algorithm visits $a_1 a_2 \cdots a_n$ with index $j$ or that the algorithm visits $a_1 a_2 \cdots a_j$).


SESQUIPOWERS(n,k)
 1  for i ← 1 to n do
 2      a_i ← 0
 3  j ← 1
 4  while true do
 5      ⊲ Visit a_1 · · · a_n with index j
 6      j ← n
 7      while a_j = k−1 do
 8          j ← j−1
 9      if j = 0 then
10          return
11      a_j ← a_j + 1
12      ⊲ Now a_1 · · · a_j ∈ L
13      for i ← j+1 to n do
14          a_i ← a_{i−j}
15      ⊲ Make n-extension

The assignment at line 11 makes $a_1 \cdots a_j$ a Lyndon word (by Lemma 2). The loop at lines 12-15 realizes the $n$-extension of the word $a_1 \cdots a_j$.

In particular, the sequence of words $a_1 a_2 \cdots a_j$ visited by the algorithm is the sequence of Lyndon words of length at most $n$ in increasing order and the sequence of words $a_1 a_2 \cdots a_n$ visited with index $n$ is the sequence of Lyndon words of length $n$ in increasing order.

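We illustrate this on an example. Consider the list in alphabetic order of the words of length 5 visited by the algorithm, that is, the words in $P$ of length 5 together with $bbbbb$, the 5-extension of the maximal letter (we read the list from top to bottom and then from left to right). The index $j$ of the marked letter is given in parentheses after each word.

aaaaa (1)   aabab (5)   abbab (3)
aaaab (5)   aabba (4)   abbba (4)
aaaba (4)   aabbb (5)   abbbb (5)
aaabb (5)   ababa (2)   bbbbb (1)
aabaa (3)   ababb (5)

The 6 Lyndon words of length 5 are those with the marked letter at the last position.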
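The algorithm SESQUIPOWERS translates almost line by line into Python. The sketch below is an illustration under two assumptions of ours: the words are returned as tuples over $\{0, \ldots, k-1\}$, and a sentinel $a_0 = -1$ is added so that the inner loop stops at $j = 0$ (the pseudocode leaves $a_0$ unspecified).

```python
def sesquipowers(n, k):
    """Yield the pairs (word, j) visited by SESQUIPOWERS(n, k), in increasing order."""
    a = [-1] + [0] * n          # a[1..n] is the current word, a[0] is a sentinel
    j = 1
    while True:
        yield tuple(a[1:]), j   # visit a_1 ... a_n with index j
        j = n
        while a[j] == k - 1:    # skip the trailing maximal letters
            j -= 1
        if j == 0:
            return
        a[j] += 1               # now a_1 ... a_j is a Lyndon word
        for i in range(j + 1, n + 1):
            a[i] = a[i - j]     # make the n-extension

def lyndon_words_of_length(n, k):
    """Lyndon words of length n: the words visited with index n."""
    return [w for w, j in sesquipowers(n, k) if j == n]
```

For $n = 5$ and $k = 2$, writing $a$ for 0 and $b$ for 1, the generator produces the 14 words of the list above with the same indices, and `lyndon_words_of_length(5, 2)` has 6 elements.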

A possible variant of this algorithm enumerates preprime words in decreasing order.


SESQUIPOWERSBIS(n,k)
 1  for i ← 1 to n do
 2      a_i ← k−1
 3  a_{n+1} ← −1
 4  j ← 1
 5  while true do
 6      ⊲ Visit a_1, . . . , a_n with index j
 7      if a_j = 0 then
 8          return
 9      a_j ← a_j − 1
10      for h ← j+1 to n do
11          a_h ← k−1
12      j ← 1
13      h ← 2
14      while a_{h−j} ≤ a_h do
15          ⊲ Now a_1 · · · a_{h−1} is the (h−1)-extension of a_1 · · · a_j
16          if a_{h−j} < a_h then
17              j ← h
18          h ← h+1

At line 9, the assignment realizes the inverse of the operation at line 11 of SESQUIPOWERS. The loop at lines 14-18 implements the computation of the index $j$ such that $a_1 \cdots a_n$ is a sesquipower of $a_1 \cdots a_j$. It is guaranteed to always end by the assignment of line 3.
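The decreasing variant can be sketched in Python in the same way. As before, this is an illustration with a name of our own; the list `a` has one extra cell holding the sentinel $a_{n+1} = -1$ of line 3, and index 0 is unused.

```python
def sesquipowers_decreasing(n, k):
    """Yield the pairs (word, j) visited by SESQUIPOWERSBIS(n, k), in decreasing order."""
    a = [None] + [k - 1] * n + [-1]   # a[1..n] is the word, a[n+1] = -1 is the sentinel
    j = 1
    while True:
        yield tuple(a[1:n + 1]), j    # visit a_1 ... a_n with index j
        if a[j] == 0:
            return
        a[j] -= 1
        for h in range(j + 1, n + 1):
            a[h] = k - 1
        # recompute the index j of the new word
        j, h = 1, 2
        while a[h - j] <= a[h]:
            # here a_1 ... a_{h-1} is the (h-1)-extension of a_1 ... a_j
            if a[h - j] < a[h]:
                j = h
            h += 1
```

For $n = 3$ and $k = 2$ the words are visited in the order 111, 011, 010, 001, 000, with indices 1, 3, 2, 3, 1.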

Recently, Kociumaka, Radoszewski and Rytter have presented a polynomial time algorithm to compute the $k$-th Lyndon word [27].

1.5 Eulerian graphs and de Bruijn cycles

A de Bruijn cycle of order $n$ on $k$ letters is a necklace of length $k^n$ such that every word of length $n$ on $k$ letters appears exactly once as a factor. For example

aabb

aaababbb

aaaabaabbababbbb

aaaaabaaabbaababaabbbababbabbbbb

are de Bruijn cycles of order 2, 3, 4, 5.

The de Bruijn graph of order $n$ on an alphabet $A$ is the following labeled graph. It has $A^{n-1}$ as set of vertices. Its edges are the pairs $(u,v)$ such that $u = aw$, $v = wb$ with $a,b \in A$. Such an edge is labeled $b$. The de Bruijn graphs of orders 3 and 4 on the alphabet $\{a,b\}$ are represented in Figure 1.5.6 and Figure 1.5.7. A cycle in a graph is an Euler cycle if it uses each edge of the graph exactly once. A finite graph is Eulerian if it has an Euler cycle.

Figure 1.5.6 The de Bruijn graph of order $n = 3$.

Figure 1.5.7 The de Bruijn graph of order $n = 4$.

It is easy to verify that the de Bruijn cycles of order $n$ are the labels of Euler cycles in the de Bruijn graph of order $n$. The following result shows the existence of de Bruijn cycles of any order.

Theorem 1.5.1 A strongly connected finite graph is Eulerian if and only if each vertex has an indegree equal to its outdegree.

Proof. The condition is necessary since an Euler cycle enters each vertex as many times as it comes out of it.

Conversely, we use an induction on the number of edges of the graph $G$. If there are no edges, the property is true. Let $C$ be a cycle with the maximal possible number of edges not using twice the same edge. Assume that $C$ is not an Euler cycle. Then, since $G$ is strongly connected, there is a vertex $x$ which is on $C$ and in a non-trivial strongly connected component $H$ of $G \setminus C$. Every vertex of $H$ has an indegree equal to its outdegree. So, by induction hypothesis, $H$ contains an Euler cycle $D$. The cycles $C$ and $D$ have a vertex in common and thus can be combined to form a cycle larger than $C$, a contradiction.


We denote by $d^-(v)$ the indegree of $v$ (which is the number of edges entering $v$) and by $d^+(v)$ its outdegree (which is the number of edges coming out of $v$).

A variant of an Euler cycle is that of Euler path. It is a path using all the edges exactly once. It is easy to deduce from Theorem 1.5.1 that a graph has an Euler path from $x$ to $y$ if and only if $d^+(x) - d^-(x) = d^-(y) - d^+(y) = 1$ and $d^+(z) = d^-(z)$ for all other vertices.

The computation of an Euler cycle along the lines of the proof of Theorem 1.5.1 is an interesting exercise in recursive programming. It is realized by the following function EULER.

EULER(s,t)
1  if there exists an edge e = (s,x) still unmarked then
2      MARK(e)
3      c ← (e, EULER(x,t))
4      return (EULER(s,s), c)
5  else return empty

The proof of correctness of this algorithm uses the following steps. The function computes an Eulerian path from $s$ (the source) to $t$ (the target). It uses marks on the edges of the graph which are initially all unmarked.

It chooses an edge $e = (s,x)$ leaving $s$. If there is an Euler path from $s$ to $t$ beginning with $e$, the solution is (e, EULER(x,t)). Else the solution is (EULER(s,s), e, EULER(x,t)).
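The function EULER can be tried on de Bruijn graphs with a few lines of Python. The following is a minimal sketch under assumptions of ours: the graph is stored as adjacency lists of pairs (target, label), marking an edge is done by removing it from its list, and the target parameter $t$ is dropped since the recursion never inspects it. The result is the sequence of labels of the path.

```python
from itertools import product

def de_bruijn_graph(n, alphabet):
    """Adjacency lists of the de Bruijn graph of order n: u = aw --b--> v = wb."""
    graph = {}
    for t in product(alphabet, repeat=n - 1):
        u = "".join(t)
        graph[u] = [((u + b)[1:], b) for b in alphabet]
    return graph

def euler(graph, s):
    """Labels of a path from s using all edges reachable as in EULER(s, t)."""
    if not graph[s]:
        return []                      # no unmarked edge leaves s
    x, label = graph[s].pop()          # choose and mark an edge e = (s, x)
    c = [label] + euler(graph, x)      # c <- (e, EULER(x, t))
    return euler(graph, s) + c         # (EULER(s, s), c)

def de_bruijn_cycle(n, alphabet):
    """A de Bruijn cycle of order n, read as the label of an Euler cycle."""
    graph = de_bruijn_graph(n, alphabet)
    return "".join(euler(graph, alphabet[0] * (n - 1)))
```

The recursion depth is bounded by the number of edges, so this sketch is only meant for small orders. Which de Bruijn cycle is obtained depends on the order in which the edges are chosen.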

The following result is due to van Aardenne-Ehrenfest and de Bruijn [1]. We are going to see a derivation of it using linear algebra.

Theorem 1.5.2 The number of de Bruijn cycles of order $n$ on an alphabet with $k$ letters is

$$N(n,k) = k^{-n} (k!)^{k^{n-1}}. \qquad (1.5.21)$$

In particular, for $k = 2$, there are $2^{2^{n-1}-n}$ de Bruijn cycles of order $n$. Table 1.5.10 lists some values of the numbers $N(n,k)$. The result for $k = 2$ was obtained as early as 1894 by Flye Sainte-Marie (see [4] for a historical survey).

Observe that $N(1,k) = (k-1)!$. This is in agreement with the fact that de Bruijn cycles of order 1 are the circular permutations of the $k$ letters.
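Formula (1.5.21) can be compared with an exhaustive count for very small parameters. The Python sketch below is our own illustration: the brute-force function enumerates all words of length $k^n$, keeps those in which every factor of length $n$ read cyclically is distinct, and divides by $k^n$ since such a word is primitive and therefore has $k^n$ distinct conjugates.

```python
from itertools import product
from math import factorial

def de_bruijn_count(n, k):
    """N(n,k) computed with Formula (1.5.21)."""
    numerator = factorial(k) ** (k ** (n - 1))
    quotient, remainder = divmod(numerator, k ** n)
    assert remainder == 0
    return quotient

def de_bruijn_count_brute_force(n, k):
    """Count de Bruijn cycles of order n on k letters by exhaustive search."""
    length = k ** n
    count = 0
    for w in product(range(k), repeat=length):
        doubled = w + w[:n - 1]
        factors = {doubled[i:i + n] for i in range(length)}
        if len(factors) == length:
            count += 1
    return count // length     # each cycle is counted once per conjugate
```

The exhaustive search is exponential in $k^n$ and is only usable for cases such as $(n,k) = (3,2)$, $(4,2)$ or $(2,3)$.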

1.5.1 The BEST Theorem

The following result, known as the BEST Theorem, is due to van Aardenne-Ehrenfest and de Bruijn [1], and also to Smith and Tutte [40]. For a graph $G$ on a set $V$ of vertices, denote $\pi(G) = \prod_{v \in V} (d^+(v) - 1)!$. A spanning tree of $G$ oriented towards a vertex $v$ is a set of edges $T$ such that, for any vertex $w$, there is a unique path from $w$ to $v$ using the edges in $T$.

n        1    2       3        4    5
N(n,2)   1    1       2        16   2048
N(n,3)   2    24      373248
N(n,4)   6    20736
N(n,5)   24

Table 1.5.10 Some values of the number $N(n,k)$ of de Bruijn cycles of order $n$ on $k$ letters, computed from Formula (1.5.21).

Theorem 1.5.3 Let $G$ be an Eulerian graph. Let $v$ be a vertex of $G$ and let $t(G)$ be the number of spanning trees oriented towards $v$. The number of Euler cycles of $G$ is $t(G)\pi(G)$.

Proof. Let $\mathcal{E}$ be the set of Euler cycles and let $\mathcal{E}_v$ be the set of Euler paths from vertex $v$ to itself. Since each Euler cycle passes $d^+(v)$ times through $v$, we have $\mathrm{Card}(\mathcal{E}_v) = d^+(v)\,\mathrm{Card}(\mathcal{E})$.

Let $\mathcal{T}_v$ be the set of spanning trees of $G$ oriented towards $v$. We define a map $\varphi_v : \mathcal{E}_v \to \mathcal{T}_v$ as follows. Let $P$ be an Euler path from $v$ to $v$. We define $T = \varphi_v(P)$ as the set of edges of $G$ used in $P$ to leave a vertex $w \ne v$ for the last time. Let us verify that $T$ is a spanning tree oriented towards $v$.

Indeed, for each $w \ne v$, there is a unique edge in $T$ going out of $w$. Continuing in this way, we reach $v$ in a finite number of steps. Thus there is a unique path from $w$ to $v$.

Conversely, starting from a spanning tree $T$ oriented towards $v$, we build an Euler path $P$ from $v$ to $v$ as follows. We first use any edge going out of $v$. Next, from a vertex $w$, we use any edge previously unused and distinct from the edge in $T$, as long as such an edge exists. There results an Euler path $P$ from $v$ to $v$ which is such that $\varphi_v(P) = T$. This shows that $\mathrm{Card}(\varphi_v^{-1}(T)) = d^+(v)! \prod_{w \ne v} (d^+(w) - 1)!$. Consequently

$$\mathrm{Card}(\mathcal{E}) = \mathrm{Card}(\mathcal{E}_v)/d^+(v) = t(G)\pi(G).$$

We illustrate Theorem 1.5.3 on the example of the de Bruijn graph of order 3 (Figure 1.5.6).

Example 12 Figure 1.5.8 represents the two possible spanning trees oriented towards $bb$ in the de Bruijn graph of order 3. Following the Eulerian path in the de Bruijn graph of order 3 (see Figure 1.5.6), using in turn each of these spanning trees, starting and ending at the root, we obtain the two possible de Bruijn words

aaababbb, abaaabbb.


Figure 1.5.8 The two spanning trees of the de Bruijn graph of order $n = 3$ oriented towards $bb$.

1.5.2 The Matrix-tree Theorem

Let $G$ be a multigraph on a set $V$ of vertices. Let $M$ be its adjacency matrix defined by $M_{vw} = \mathrm{Card}(E_{vw})$ with $E_{vw}$ the set of edges from $v$ to $w$. Let $D$ be the diagonal matrix defined by $D_{vv} = \sum_{w \in V} M_{vw}$ and let $L = D - M$ be the Laplacian matrix of $G$. Note that the sum of the elements of each row of $L$ is 0. We denote by $K_v(G)$ the determinant of the matrix $C_v$ obtained by suppressing the row and the column of index $v$ in the matrix $L$.

The following result is due to Borchardt [8].

Theorem 1.5.4 (Matrix-Tree Theorem) For any $v \in V$ the number of spanning trees of $G$ oriented towards $v$ is $K_v(G)$.

Proof. Denote by $N_v(G)$ the number of spanning trees oriented towards $v$.

We use an induction on the number of edges of $G$. The result holds if there are no edges. Indeed, if there is no edge leading to $v$, then $N_v(G) = 0$. On the other hand, since the sum of each row of $C_v$ is 0, we have $K_v(G) = 0$. Thus $N_v(G) = K_v(G)$.

Consider now an edge $e$ from $w$ to $v$. Let $G'$ be the graph obtained by deleting this edge and $G''$ the graph obtained by merging $v$ and $w$.

We have

$$N_v(G) = N_v(G') + N_v(G''). \qquad (1.5.22)$$

Indeed, the first term of the right hand side counts the number of spanning trees oriented towards $v$ not containing the edge $e$ and the second one the remaining spanning trees. Similarly, we have

$$K_v(G) = K_v(G') + K_v(G''). \qquad (1.5.23)$$

Indeed, assume $v, w$ to be the first and second indices. The Laplacian matrices of the graphs $G$ and $G''$ have the form

$$L = \begin{pmatrix} a & b & x \\ c & d & y \\ z & t & U \end{pmatrix}, \qquad L'' = \begin{pmatrix} a+b+c+d & x+y \\ z+t & U \end{pmatrix},$$

the Laplacian matrix $L'$ of $G'$ being the same as $L$ with $c+1, d-1$ instead of $c, d$. Then

$$K_v(G) = \begin{vmatrix} d & y \\ t & U \end{vmatrix}, \qquad K_v(G') = \begin{vmatrix} d-1 & y \\ t & U \end{vmatrix}, \qquad K_v(G'') = \det(U),$$

and thus Formula (1.5.23) by the linearity of determinants. By induction hypothesis, we have $K_v(G') = N_v(G')$ and $K_v(G'') = N_v(G'')$. By (1.5.22) and (1.5.23) this shows that $K_v(G) = N_v(G)$.

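Example 13 For the graph $G$ of Figure 1.5.6, with the vertices taken in the order $aa, ab, ba, bb$, we have (the matrix $C$ is obtained from $L$ by suppressing the first row and the first column of $L$)

$$L = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 2 & -1 & -1 \\ -1 & -1 & 2 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix}, \qquad C = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & 0 \\ 0 & -1 & 1 \end{pmatrix}.$$

One has $\det(C) = 2$, in agreement with Theorem 1.5.4 since, by Example 12, the graph $G$ has 2 spanning trees oriented towards $bb$ (the graph $G$ being regular, this number is the same for every vertex).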
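The Matrix-Tree Theorem and the BEST Theorem can be combined in a short computation. The sketch below is an illustration with names of our own: it builds the Laplacian matrix of the de Bruijn graph of order $n$ on $\{0, \ldots, k-1\}$, computes the determinant of a minor exactly with rational arithmetic, and multiplies by $\pi(G) = ((k-1)!)^{k^{n-1}}$ to count Euler cycles.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def laplacian_de_bruijn(n, k):
    """Laplacian L = D - M of the de Bruijn graph of order n on k letters."""
    vertices = list(product(range(k), repeat=n - 1))
    index = {v: i for i, v in enumerate(vertices)}
    L = [[0] * len(vertices) for _ in vertices]
    for u in vertices:
        for b in range(k):
            v = (u + (b,))[1:]           # edge u = aw --> v = wb
            L[index[u]][index[u]] += 1   # contributes to D_uu
            L[index[u]][index[v]] -= 1   # contributes to -M_uv
    return L

def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    size, det = len(a), Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if a[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for j in range(col, size):
                a[r][j] -= factor * a[col][j]
    return int(det)

def spanning_trees(n, k, v=0):
    """K_v(G): determinant of L with the row and column of index v removed."""
    L = laplacian_de_bruijn(n, k)
    minor = [[x for j, x in enumerate(row) if j != v]
             for i, row in enumerate(L) if i != v]
    return determinant(minor)

def euler_cycles(n, k):
    """Number of Euler cycles t(G) * pi(G) given by the BEST Theorem."""
    return spanning_trees(n, k) * factorial(k - 1) ** (k ** (n - 1))
```

For $n = 3$ and $k = 2$, `laplacian_de_bruijn` returns the matrix $L$ of Example 13 and `spanning_trees(3, 2)` is 2, whichever vertex is removed.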

It is possible to deduce the explicit formula for the number of de Bruijn cycles of Theorem 1.5.2 from the Matrix-Tree Theorem.

We denote by $G^*$ the edge graph of a graph $G$. Its set of vertices is the set $E$ of edges of $G$ and its set of edges is the set of pairs $(e,f) \in E \times E$ such that the end of $e$ is the origin of $f$. It is easy to verify that the edge graph of the de Bruijn graph $G_n$ can be identified with $G_{n+1}$.

A graph is regular of degree $k$ if any vertex has $k$ incoming edges and $k$ outgoing edges. If $G$ is regular, the number $t(G)$ of spanning trees oriented towards a vertex $v$ does not depend on $v$.

The following result is due to Knuth [24] (see also [25], Exercise 2.3.4.2).

Theorem 1.5.5 Let $G$ be a regular graph of degree $k$ with $m$ vertices. Then

$$t(G^*) = k^{m(k-1)-1}\,t(G).$$

The proof uses the Matrix-Tree Theorem.

It is easy to prove Formula (1.5.21) by induction on $n$ using this result (and the preceding ones). Indeed, by Theorem 1.5.3, and since $G_n$ has $k^{n-1}$ vertices, we have

$$N(n,k) = ((k-1)!)^{k^{n-1}}\,t(G_n).$$

Thus (1.5.21) is equivalent to

$$t(G_n) = k^{-n} k^{k^{n-1}}. \qquad (1.5.24)$$

Assuming (1.5.24) and using Theorem 1.5.5, we have

$$t(G_{n+1}) = k^{k^{n-1}(k-1)-1}\,t(G_n) = k^{k^n - k^{n-1} - 1}\,k^{-n} k^{k^{n-1}} = k^{-n-1} k^{k^n}$$

which proves that (1.5.24) holds for $n+1$.

1.5.3 Lyndon words and de Bruijn cycles

The following beautiful result is due to Fredricksen and Maiorana [17].

Theorem 1.5.6 Let $\ell_1 < \ell_2 < \ldots < \ell_m$ be the increasing sequence of Lyndon words of length dividing $n$. The word $\ell_1 \ell_2 \cdots \ell_m$ is a de Bruijn cycle of order $n$.

The original statement contains the additional claim that the de Bruijn cycle obtained in this way is lexicographically minimal. We shall obtain this as a consequence of a variant of Theorem 1.5.6 (see Theorem 1.5.7 below).

For example, if $n = 4$ and $A = \{a,b\}$, then

aaaabaabbababbbb = a aaab aabb ab abbb b

is a de Bruijn cycle of order 4.

We will use the following lemma.

Lemma 3 Let $w$ be a prefix of length $n$ of a Lyndon word and let $\ell$ be its longest prefix in $L$. Then $w$ is the $n$-extension of $\ell$.

Proof. Set $w = \ell s$ and let $v$ be such that $wv \in L$. Set also $r = |\ell|$, $n = |w|$ and $wv = a_1 \cdots a_m$ with $a_i \in A$. By Proposition 7, we have $wv < sv$. Thus there is some index $t$ with $1 \le t \le |sv|$ such that $a_j = a_{j+r}$ for $1 \le j \le t-1$ and $a_t < a_{t+r}$. If $t \le n-r$, by Lemma 2, the word $a_1 \cdots a_{t+r}$ is a prefix of $w$ which is a Lyndon word longer than $\ell$, a contradiction. Thus $t > n-r$ and $a_j = a_{j+r}$ for $1 \le j \le n-r$. This implies that $r$ is a period of $w$ and thus the conclusion.

Proof of Theorem 1.5.6. Since $\ell_1 \ell_2 \cdots \ell_m$ has length $k^n$, we only need to prove that any word $w = a_1 \cdots a_n$ of length $n$ appears as a factor of $\ell_1 \cdots \ell_m \ell_1 \ell_2$. We denote by $a$ the first letter of the alphabet and by $z$ the largest one. We consider the following cases.

(a) Assume first that $w$ is primitive and that $w = uv$ with $vu = \ell_k$ and that $u$ is not a power of $z$. Set $u = pbq$ with $p \in z^*$ and $b$ a letter $b < z$. By Lemma 2, $vpz \in L$. By repeated use of Proposition 8, $vz^{|u|}$ is a Lyndon word. Thus $\ell_{k+1} \le vz^{|u|}$. This implies that $v$ is a prefix of $\ell_{k+1}$ and thus $w$ is a factor of $\ell_k \ell_{k+1}$.

(b) Assume next that $w = uv$ is primitive, that $u \in z^*$ and that $vu \in L$. We can first rule out the case where $v \in a^*$. Indeed, $z^j a^{n-j}$ is a factor of $\ell_{m-1} \ell_m \ell_1 \ell_2$. Let $k$ be the least index such that $v \le \ell_k$ (the existence of $k$ follows from the fact that $vu = \ell_j$ for some $j$). Then $\ell_k \le vu$ and thus $v$ is a prefix of $\ell_k$. Let $v' \le v$ be the Lyndon word such that $v$ is a sesquipower of $v'$.

(b1) Assume first that $v' \ne \ell_{k-1}$. Let $v''$ be the word $v'$ with its last letter changed into $a$. The word visited before $v'$ by Algorithm SESQUIPOWERS(n,k) is, in view of Algorithm SESQUIPOWERSBIS(n,k), the word $v'' z^{n-|v'|}$. Thus $\ell_{k-1}$ ends with $u$, $\ell_k$ begins with $v$ and thus $w = uv$ is a factor of $\ell_{k-1} \ell_k$.

(b2) Otherwise, $v' = \ell_{k-1}$. For the same reason as above, $u$ is a suffix of $\ell_{k-2}$. Since $v$ is a sesquipower of $v'$, it is a prefix of $v'v$ and thus also a prefix of $\ell_{k-1} \ell_k$. Thus $w$ is a factor of $\ell_{k-2} \ell_{k-1} \ell_k$.

(c) Assume finally that $w = (uv)^d$ with $d$ dividing $n$ and $vu = \ell_k$.

(c1) If $u \notin z^*$ then $\ell_{k+1} \le (vu)^{d-1} v z^{|u|}$ since the latter is a Lyndon word. Thus $w$ is a factor of $\ell_k \ell_{k+1}$.

(c2) Otherwise, $\ell_{k-1}$ ends with at least $(d-1)|w|$ letters $z$ and $\ell_{k+1} \le (vu)^{d-1} z^{|w|}$. Thus $w$ is a factor of $\ell_{k-1} \ell_k \ell_{k+1}$.

We illustrate the cases in the proof for $n = 6$ and $A = \{a,b\}$. Table 1.5.11 gives the sequence $\ell_k$.

k     1  2       3       4       5       6    7       8
ℓ_k   a  aaaaab  aaaabb  aaabab  aaabbb  aab  aababb  aabbab

k     9       10  11      12   13      14
ℓ_k   aabbbb  ab  ababbb  abb  abbbbb  b

Table 1.5.11 The Lyndon words of length dividing 6.

(a) Let $w = aabaaa$. Then $u = aab$, $v = aaa$ and $vu = \ell_2$. We find $w$ as a factor of $\ell_2 \ell_3$.

(b1) Let $w = baaaab$. Then $u = b$ and $v = aaaab$. We find $k = 3$, $v' = v$ and $w$ is a factor of $\ell_2 \ell_3$.

(b2) Let $w = bbabab$. Then $u = bb$, $v = abab$. We find $k = 11$. We have $v' = ab$ and we find $w$ as a factor of $\ell_9 \ell_{10} \ell_{11}$.

(c1) Let $w = (aba)^2$. Then $u = ab$, $v = a$, so that $vu = aab = \ell_6$ and $k = 6$. We find $w$ as a factor of $\ell_6 \ell_7$.

(c2) Let $w = (bab)^2$. Then $u = b$, $v = ab$ and $k = 12$. We find $w$ as a factor of $\ell_{11} \ell_{12} \ell_{13}$.
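Theorem 1.5.6 lends itself to a direct numerical check. The sketch below is an illustration with function names of our own; it lists the Lyndon words of length dividing $n$ by exhaustive enumeration, using the suffix test of Proposition 7 (the algorithm SESQUIPOWERS would do this more efficiently), concatenates them in increasing order and verifies the de Bruijn property.

```python
from itertools import product

def is_lyndon(w):
    """Proposition 7: w is strictly smaller than each nonempty proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def lyndon_words_of_length_dividing(n, alphabet):
    """Increasing sequence of the Lyndon words whose length divides n."""
    words = ("".join(t)
             for d in range(1, n + 1) if n % d == 0
             for t in product(alphabet, repeat=d))
    return sorted(w for w in words if is_lyndon(w))

def is_de_bruijn_cycle(word, n, alphabet):
    """True if every word of length n occurs exactly once cyclically in word."""
    if len(word) != len(alphabet) ** n:
        return False
    doubled = word + word[:n - 1]
    return len({doubled[i:i + n] for i in range(len(word))}) == len(word)

def lyndon_de_bruijn_cycle(n, alphabet):
    """The concatenation l_1 l_2 ... l_m of Theorem 1.5.6."""
    return "".join(lyndon_words_of_length_dividing(n, alphabet))
```

For $n = 6$ on $\{a,b\}$, `lyndon_words_of_length_dividing` returns the 14 words of Table 1.5.11, and for $n = 4$ the concatenation is the word $aaaabaabbababbbb$ given above.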

Let $X$ be a set of words. A de Bruijn cycle of order $n$ relative to $X$ is a necklace such that every word of $X$ of length $n$ appears exactly once as a factor. The usual notion of de Bruijn cycle is relative to $X = A^*$.

Consider for example the set $X$ of words on $\{a,b\}$ which are representatives of necklaces without consecutive occurrences of $b$ (see Example 10). Then $aaab$ is a de Bruijn cycle of order 3 relative to $X$ and $aaaabab$ of order 4.

The following result, due to Moreno [34], gives a family of sets $X$ for which there are de Bruijn cycles of any order relative to $X$. Let $\ell_1 < \ell_2 < \ldots < \ell_m$ be the increasing sequence of Lyndon words of length dividing $n$. For $s < m$, we denote by $X_s$ the set of words such that no factor has a conjugate in $\{\ell_1, \ldots, \ell_s\}$.

Theorem 1.5.7 For any $s < m$, the sequence $\ell_s \ell_{s+1} \cdots \ell_m$ is a de Bruijn cycle of order $n$ relative to $X_s$.

One can deduce from this result the fact that the de Bruijn cycle given by Theorem 1.5.6 is the minimal one for the alphabetic order (see [35]).

As another variant of Theorem 1.5.6, let us quote the following result due to Yu Hin Au [2]: concatenating the Lyndon words of length $n$ in increasing order, one obtains a word which contains cyclically all primitive words of length $n$ exactly once. For example, for $n = 4$ and $A = \{a,b\}$, one obtains the word $aaab\,aabb\,abbb$ which contains cyclically all twelve primitive words of length 4.

1.6 Unavoidable sets

A word $t$ is said to avoid a word $p$ if $p$ is not a factor of $t$, i.e. if the pattern $p$ does not appear in the text $t$. For example the word $abracadabra$ avoids $baba$. The set of all words avoiding a given set $X$ of words has been of interest in several contexts including the notion of a system of finite type in symbolic dynamics (see [29] for example). This notion has been extended to many other situations (see in particular the case of partial words in [7]).

Let $A$ be a finite alphabet. An unavoidable set on $A$ is a set $I \subset A^*$ of words on the alphabet $A$ such that any two-sided infinite word $(a_n)_{n \in \mathbb{Z}}$ on the alphabet $A$ admits at least one factor in $I$. It is of course equivalent to ask that any one-sided infinite word has a factor in $I$ or also, since the alphabet is finite, that the set of words that avoid $I$ is finite (see [31] for an exposition of the properties of unavoidable sets).

Example 14 Let $A = \{a,b\}$. The set $U = \{a, b^{10}\}$ is unavoidable since any word of length 10 either has a letter equal to $a$ or is the word $b^{10}$. On the contrary, the set $V = \{aa, b^{10}\}$ is avoidable. Indeed, the infinite word $(ab)^\omega = ababababab\ldots$ has no factor in $V$.

Proposition 10 On a finite alphabet, any unavoidable set contains a finite unavoid-able set.

Proof. Indeed, letX be an unavoidable set and letSbe the set of words avoidingX.SinceX is unavoidable,S is finite. Letn be the maximal length of the words ofS. LetZ be the set of words ofX of length at mostn+1. Every word of lengthn+1 has afactor inX which is actually inZ. ThusZ is unavoidable.

The following gives an equivalent definition of unavoidable sets which holds for finite sets. It will be used below.

Proposition 11 Let $I \subset A^*$ be a finite set of words. The following conditions are equivalent.

(i) The set I is unavoidable.

(ii) Each two-sided infinite periodic word has at least one factor in I.

Proof. It is enough to show that (ii) ⇒ (i). Let $(a_n)_{n \in \mathbb{Z}}$ be a two-sided infinite sequence of letters. Let $u \in A^*$ be a word longer than any word in I and having an infinite number of occurrences in the sequence $(a_n)_{n \in \mathbb{Z}}$. This sequence has at least one factor of the form uvu. By the hypothesis, the infinite periodic word $\cdots uvuvuvuv \cdots$ has a factor $w \in I$. The word w is a factor of at least one of the words uv or vu. It is thus also a factor of the sequence $(a_n)_{n \in \mathbb{Z}}$ and thus I is unavoidable.

This statement is false if I is infinite. For example, on a three-letter alphabet, the set of squares is avoidable, but every periodic word obviously contains a square.

1.6.1 Algorithms

To check in practice that a given finite set X is unavoidable, there are two possible algorithms.

The first one consists in computing a graph G = (P,E), where P is the set of prefixes of X and E is the set of pairs (p,s) for which there is a letter $a \in A$ such that s is the longest suffix of pa which is in P.

Proposition 12 A finite set X is unavoidable if and only if every cycle in G contains a vertex in X.

Proof. For each integer $n \ge 0$ and vertices $u, v \in P$, there is a path of length n from u to v if and only if there exists a word y of length n such that v is the longest suffix of uy in P. This can be proved by induction on n. It follows that there is a path of length n from ε to a vertex $x \in X$ if and only if $A^*X \cap A^n \ne \emptyset$.

Figure 1.6.9 The graph for X = {a, bb}.

Example 15 For X = {a, bb}, the word graph is given in Figure 1.6.9. By inspection, the set X is unavoidable.
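The test of Proposition 12 can be sketched in Python as follows. This is a minimal illustration under one stated assumption: we mark as forbidden every prefix that has a suffix in X, which coincides with the vertices in X when no word of X is a proper factor of another one. The function name `is_unavoidable` is ours.

```python
def is_unavoidable(X, alphabet):
    X = set(X)
    # P: all prefixes of words of X (the empty word is always included).
    P = {""} | {x[:i] for x in X for i in range(len(x) + 1)}

    def longest_suffix(w):
        # Longest suffix of w belonging to P; the empty suffix is in P.
        for i in range(len(w) + 1):
            if w[i:] in P:
                return w[i:]

    # Assumption: a vertex is forbidden as soon as it has a suffix in X.
    forbidden = {p for p in P if any(p[i:] in X for i in range(len(p) + 1))}
    succ = {p: {longest_suffix(p + a) for a in alphabet} for p in P}

    # Repeatedly delete vertices with no remaining successor; what is
    # left is nonempty iff some cycle avoids all forbidden vertices.
    alive = P - forbidden
    changed = True
    while changed:
        changed = False
        for p in list(alive):
            if not (succ[p] & alive):
                alive.remove(p)
                changed = True
    return not alive

print(is_unavoidable({"a", "bb"}, "ab"))         # Example 15
print(is_unavoidable({"aa", "b" * 10}, "ab"))    # the set V of Example 14
```

The pruning loop is a simple way to detect a cycle; a depth-first search would serve equally well.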

The second algorithm is sometimes easier to write down by hand. Say that a set Y of words is obtained from a finite set of words X by an elementary derivation if

(i) either there exist words $u, v \in X$ such that u is a proper factor of v, and

$Y = X \setminus \{v\}$

(ii) or there exists a word $x = ya \in X$ with $a \in A$ such that, for each letter $b \in A$, there is a suffix z of y such that $zb \in X$, and

$Y = (X \cup \{y\}) \setminus \{x\}$

A derivation is a sequence of elementary derivations. We say that Y is derived from X if Y is obtained from X by a derivation.

Example 16 Let X = {aaa, b}. Then we have the derivations

X → {aa, b} → {a, b} → {ε, b} → {ε}

where the first three arrows follow case (ii) and the last one case (i).

The following result shows in particular that if Y is derived from X, then X is unavoidable if and only if Y is unavoidable. We denote by $S_X$ the set of two-sided infinite words avoiding X.

Proposition 13 If Y is derived from X, then $S_X = S_Y$.

Proof. It is enough to consider the case of an elementary derivation. In the first case, where $Y = X \setminus \{v\}$ and v has a factor in X, then clearly $S_X = S_Y$. In the second case, we clearly have $S_Y \subset S_X$ since Y is obtained by replacing an element of X by one of its factors. Conversely, assume by contradiction the existence of some $s \in S_X \setminus S_Y$. The only possible factor of s in Y is y. Let b be the letter following y in s. Then s has a factor in X, namely zb where z is the suffix of y such that $zb \in X$, whose existence is granted by the definition of a derivation. This is a contradiction.

The notion of a derivation gives a practical method to check whether a set is unavoidable. We have indeed the following result.

Proposition 14 A finite set X is unavoidable if and only if there is a derivation from X to the set {ε}.

Proof. Let X ≠ {ε} be unavoidable. We prove the existence of a derivation to {ε} by induction on the sum l(X) of the lengths of words in X. If ε ∈ X, we may derive {ε} from X. Thus assume ε ∉ X, and let w be a word of maximal length avoiding X. For each $b \in A$ there is a word $x_b = zb \in X$ which is a suffix of wb. Let $x_a = ya$ be the longest of the words $x_b$. Then the hypotheses of case (ii) are satisfied and thus there is a derivation from X to a set Y with l(Y) < l(X). The converse is clear by Proposition 13.

In practice, there is a shortcut which is useful to perform derivations. It is described in the following transformation from X to Y.

(iii) there is a word y such that $ya \in X$ for each $a \in A$ and

$Y = (X \cup \{y\}) \setminus \{ya \mid a \in A\}$

It is clear that such a set Y can be derived from X and thus, we do not change the definition of derivations by adding case (iii) to the definition of elementary derivations. We use this new definition in the following example.

Example 17 Let X = {aaa, aba, abb, bbb}. We have the following sequence of derivations:

{aaa, aba, abb, bbb} → {aaa, ab, bbb} → {aa, ab, bbb} → {a, bbb} → {a, bb} → {a, b} → {ε}

Derivations could of course be performed on the left rather than on the right.
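Derivations lend themselves to a direct implementation. The sketch below applies cases (i) and (ii) greedily until none applies; by Propositions 13 and 14 the outcome {ε} should not depend on the order of the steps, but the intermediate sets may differ from those of the examples. The names `elementary_step`, `derivation` and `derives_to_epsilon` are ours.

```python
def elementary_step(X, alphabet):
    """Return a set obtained from X by one elementary derivation, or None."""
    for v in sorted(X):
        # Case (i): some other word u of X is a (proper) factor of v.
        if any(u != v and u in v for u in X):
            return X - {v}
    for x in sorted(X):
        if x:
            y = x[:-1]
            # Case (ii): for each letter b, some suffix z of y has zb in X.
            if all(any(y[i:] + b in X for i in range(len(y) + 1))
                   for b in alphabet):
                return (X - {x}) | {y}
    return None

def derivation(X, alphabet):
    X = frozenset(X)
    steps = [X]
    while True:
        Y = elementary_step(X, alphabet)
        if Y is None:
            return steps
        X = Y
        steps.append(X)

def derives_to_epsilon(X, alphabet):
    # The empty string plays the role of the empty word ε.
    return derivation(X, alphabet)[-1] == {""}

for step in derivation({"aaa", "b"}, "ab"):
    print(sorted(step))
```

Each step either removes a word or shortens one, so the loop terminates.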

1.6.2 Unavoidable sets of constant length

In the sequel, we will be interested in unavoidable sets made of words having all the same length n. The following proposition is easy to prove.

Proposition 15 Let A be a finite alphabet and let I be an unavoidable set of words of length n on A. The cardinality of I is at least equal to the number of conjugacy classes of words of length n on the alphabet A.

Proof. Let $u \in A^*$ be a word of length n. The factors of length n of the word $u^\omega$ are the elements of the conjugacy class of u. Thus I must contain at least one element of this class.
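The lower bound of Proposition 15 is easy to evaluate numerically. The sketch below counts conjugacy classes of words of length n on k symbols in two ways: with the standard orbit-counting (Burnside) formula, and by direct enumeration. The function names are ours.

```python
from itertools import product
from math import gcd

def conjugacy_class_count(n, k):
    # Orbit-counting formula for the cyclic group acting on words of length n:
    # the rotation by i positions fixes k**gcd(i, n) words.
    return sum(k ** gcd(i, n) for i in range(1, n + 1)) // n

def conjugacy_class_count_brute(n, k):
    # Represent each class by its smallest rotation and count representatives.
    reps = set()
    for t in product(range(k), repeat=n):
        reps.add(min(t[i:] + t[:i] for i in range(n)))
    return len(reps)

print(conjugacy_class_count(7, 2), conjugacy_class_count_brute(7, 2))
```

For n = 7 and k = 2 both functions return 20, the number of rows of Table 1.6.12.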

We are going to prove the following result, which shows that the lower bound c(n,k) on the size of unavoidable sets of words of length n on k symbols is reached for all n, k ≥ 1.

Theorem 1.6.1 For all n, k ≥ 1, there exists an unavoidable set formed of c(n,k) words of length n on k symbols.

This result was obtained by J. Mykkeltveit [36], solving a conjecture of Golomb. His proof uses exponential sums (see below). Later, and independently, it was solved by Champarnaud, Hansel and Perrin [11] using Lyndon words. We shall present this proof here.

It may be convenient for the reader to reformulate the statement in terms of graphs. A feedback vertex set in a directed graph G is a set F of vertices containing at least one vertex from every cycle in G. Consider, for n ≥ 1, the de Bruijn graph $G_{n+1}$ of order n+1 on the alphabet A, whose vertices are the words of length n on A and whose edges are the pairs (au, ub) for all $a, b \in A$ and $u \in A^{n-1}$. It is easy to see that a set of words of length n is unavoidable if and only if the corresponding set of vertices is a feedback vertex set of the graph $G_{n+1}$. Thus, the problem of determining an unavoidable set of words of length n of minimal size is the same as determining the minimal size of a feedback vertex set in $G_{n+1}$. The problem is, for general directed graphs, known to be NP-complete (see [18] for example).

As a preparation to a proof of Theorem 1.6.1, we introduce the following notions. A division of a word w is a pair $(\ell^i, u)$ such that $w = \ell^i u$ where $\ell \in L$, $i \ge 1$ and $u \in A^*$ with $|u| < |\ell|$.

By Proposition 9 each word in P admits at least one division. We say that a Lyndon word $\ell \in L$ meets the word w if there is a division of w of the form $(\ell^i, u)$. It is clear that for any $\ell \in L$ there is at most one such division of w.

The main division of $w \in P$ is the division $(\ell^i, u)$ where $\ell$ is the shortest Lyndon word which meets w. The word $\ell^i$ is the principal part of w, denoted by p(w), and u is the remainder, denoted by r(w).

For example, with a < b, the word aabaabbba admits two divisions, which are (aabaabbb, a) and (aabaabb, ba). The first one corresponds to its decomposition as a sesquipower of a Lyndon word. The second one is its main division.
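The divisions of a word can be listed by trying each Lyndon prefix in turn. The following sketch is a direct transcription of the definitions, with names of our choosing; it returns triples $(\ell, i, u)$ ordered by increasing length of $\ell$, so that the first one, when it exists, is the main division.

```python
def is_lyndon(w):
    # Strictly smaller than every proper nonempty suffix.
    return all(w < w[i:] for i in range(1, len(w)))

def divisions(w):
    """All triples (l, i, u) with w = l^i u, l a Lyndon word, i >= 1, |u| < |l|."""
    out = []
    for k in range(1, len(w) + 1):
        l, i = w[:k], len(w) // k
        # |u| < |l| forces i to be the integer quotient of |w| by |l|.
        if is_lyndon(l) and w.startswith(l * i):
            out.append((l, i, w[k * i:]))
    return out

def main_division(w):
    # Assumes w admits a division (e.g. w is a prefix of a minimal word).
    l, i, u = divisions(w)[0]
    return l * i, u

print(divisions("aabaabbba"))
print(main_division("aabaabbba"))
```

On the word aabaabbba this reproduces the two divisions of the example above.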

Let n ≥ 1 be an integer and let $M_n$ be the set of minimal words of length n. For each $m \in M_n$, let p(m) be its principal part and r(m) its remainder. Let $I_n$ be the set

$I_n = \{r(m)p(m) \mid m \in M_n\}$

We remark that any minimal word which is not primitive appears in $I_n$.

Example 18 Table 1.6.12 lists the elements of $M_7$ and $I_7$, with the remainder of each word of $M_7$ shown in brackets.

M7            I7
aaaaaaa       aaaaaaa
aaaaaab       aaaaaab
aaaaab[b]     baaaaab
aaaab[ab]     abaaaab
aaaab[bb]     bbaaaab
aaab[aab]     aabaaab
aaab[abb]     abbaaab
aaab[bab]     babaaab
aaab[bbb]     bbbaaab
aabaab[b]     baabaab
aabab[ab]     abaabab
aabab[bb]     bbaabab
aabb[abb]     abbaabb
aabb[bab]     babaabb
aabb[bbb]     bbbaabb
ababab[b]     bababab
ababb[bb]     bbababb
abbabb[b]     babbabb
abbb[bbb]     bbbabbb
bbbbbbb       bbbbbbb

Table 1.6.12 The sets M7 and I7

The object of what follows is to show that $I_n$ is an unavoidable set. By Proposition 15, the number of elements of $I_n$ is the minimal possible number of elements of an unavoidable set of words of length n.
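For small n the set $I_n$ can be computed and its unavoidability tested through the feedback vertex set reformulation in the de Bruijn graph. The sketch below is an illustration with names of our choosing (`minimal_words`, `I_set`, `is_feedback_set`); it is a brute-force check, not the proof given in the text.

```python
from itertools import product

def is_lyndon(w):
    return all(w < w[i:] for i in range(1, len(w)))

def main_division(w):
    # Shortest Lyndon prefix l with w = l^i u and |u| < |l|.
    for k in range(1, len(w) + 1):
        l, i = w[:k], len(w) // k
        if is_lyndon(l) and w.startswith(l * i):
            return l * i, w[k * i:]

def minimal_words(n, alphabet="ab"):
    # Words that are minimal in their conjugacy class.
    words = ("".join(t) for t in product(sorted(alphabet), repeat=n))
    return [w for w in words if all(w <= w[i:] + w[:i] for i in range(n))]

def I_set(n, alphabet="ab"):
    result = set()
    for m in minimal_words(n, alphabet):
        p, r = main_division(m)
        result.add(r + p)
    return result

def is_feedback_set(F, n, alphabet="ab"):
    # Vertices: words of length n; edges au -> ub. F is a feedback vertex
    # set iff the graph without F is acyclic, tested by pruning sinks.
    alive = {"".join(t) for t in product(alphabet, repeat=n)} - set(F)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if not any(v[1:] + b in alive for b in alphabet):
                alive.discard(v)
                changed = True
    return not alive

I7 = I_set(7)
print(len(I7), is_feedback_set(I7, 7))
```

For n = 7 this yields the 20 words of the column I7 of Table 1.6.12.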

Theorem 1.6.1 will be obtained as a consequence of the following one, giving a construction of the minimal unavoidable sets.

Theorem 1.6.2 Let A be a finite alphabet and let n ≥ 1. Let $M_n$ be the set of words on the alphabet A of length n which are minimal in their conjugacy class. For every word $m \in M_n$, let p(m) be the principal part of m and let r(m) be its remainder. Then the set

$I_n = \{r(m)p(m) \mid m \in M_n\}$

is an unavoidable set.

To prove Theorem 1.6.2, we need some preliminary results.

Proposition 16 Let $\ell$ and m be two Lyndon words, with $\ell$ a prefix of m. Let $s \in A^*$ be a proper suffix of m, with $|s| < |\ell|$. Then for all i > 0, the word $w = \ell^i s$ is a Lyndon word.

Proof. Let t be a proper suffix of w. Three cases may arise.

1. One has $|t| \le |s|$. Then t is a proper suffix of the Lyndon word m and thus $t > m \ge \ell$, and since $|t| < |\ell|$, we have $t > \ell^i s = w$.

2. One has $|t| > |s|$ and the word t factorizes as $t = \ell^j s$, with $0 \le j < i$. Since s is a proper suffix of m, we have $s > m \ge \ell$. Consequently $t = \ell^j s > \ell^{j+1}$ and since $|s| < |\ell|$, we have $t > \ell^i s = w$.

3. One has $|t| > |s|$ and the word t factorizes as $t = s't'$, where $s'$ is a proper suffix of $\ell$. Since $\ell \in L$, one has $s' > \ell$, and consequently $t = s't' > \ell^i s = w$.

In all cases t > w and thus w is a Lyndon word.

Proposition 17 Let w be a prefix of a minimal word and let $(\ell^i, u)$ be its main division. Let $u' \in A^*$ be a word of the same length as u such that the word $w' = \ell^i u'$ is also a prefix of a minimal word. Then the main division of $w'$ is the pair $(\ell^i, u')$.

Proof. Let $(m^j, v)$ be the main division of $w'$. We have $w' = m^j v$ with $|v| < |m|$. Since $(\ell^i, u')$ is a division of $w'$, the word m is a prefix of $\ell$. We are going to show by contradiction that m cannot be a proper prefix of $\ell$.

Suppose that m is a proper prefix of $\ell$. Since the factorization of a minimal word as a power of a Lyndon word is unique, we cannot have the equality $m^j = \ell^i$. Suppose first that $|m^j| < |\ell^i|$. Since $w' = m^j v = \ell^i u'$, the word $m^j$ is a proper prefix of the word $\ell^i$. Thus there exists a non-empty word $x \in A^*$ such that $m^j x = \ell^i$ and $xu' = v$. We thus have

$w = \ell^i u = m^j x u$

Since $|xu| = |xu'| = |v| < |m|$, the pair $(m^j, xu)$ is a division of w, which is a contradiction since m is a proper prefix of $\ell$ and $(\ell^i, u)$ is the main division of w.

Let us now suppose that $|m^j| > |\ell^i|$. Since $w' = m^j v = \ell^i u'$, the word $\ell^i$ is a proper prefix of the word $m^j$. Since m is a proper prefix of $\ell$, there exist an integer k > 0 and a prefix $m'$ of m such that $\ell = m^k m'$. Since $\ell$ is a primitive word, $m'$ is non-empty. As a consequence, $\ell$ admits $m'$ both as a non-empty prefix and as a suffix, which is contradictory since $\ell$ is a Lyndon word.

The final property needed to prove Theorem 1.6.2 is the following.

Proposition 18 Let m be a Lyndon word and n a positive integer. Let N ≥ 1 be the smallest integer such that $|m^N| > n$. Then the word $m^{N+1}$ has a factor in $I_n$.

Proof. Let w be the prefix of length n of $m^N$. Let $(\ell^i, u)$ be the main division of w. If u is the empty word, then, by construction, $w \in I_n$ and the proposition is true. Suppose that u is not empty.

The word $\ell$ is a prefix of m since either $|w| < |m|$ or w admits a division of the form $(m^j, m')$. Let s be the suffix of m having the same length as u. By Proposition 16, the word $\ell^i s$ is a Lyndon word. Thus, by Proposition 17, the main division of $\ell^i s$ is the pair $(\ell^i, s)$. Consequently, the word $s\ell^i$ belongs to $I_n$. But this word is a factor of $m^{N+1}$. Thus $m^{N+1}$ has a factor in $I_n$.

We are now able to prove Theorem 1.6.2. By Proposition 11, it is enough to show that every periodic two-sided infinite word of the form $\cdots uuuuu \cdots$ has at least one factor in $I_n$. We may suppose without loss of generality that u is a Lyndon word. Let N be the least integer such that $N|u| > n$. Then, by Proposition 18, the word $u^{N+1}$ has a factor in $I_n$. Thus $I_n$ is unavoidable.

1.6.3 Conclusion

The proof of J. Mykkeltveit in [36] is based on the following principle, presented in the case of a binary alphabet. Let us associate to a word $w = a_0a_1 \cdots a_{n-1}$ on the alphabet {0,1} the sum $s(w) = \sum_{j=0}^{n-1} a_j\omega^j$ where $\omega = e^{2i\pi/n}$. We denote by $\Im s(w)$ the imaginary part of s(w). It can be shown that for each conjugacy class of words, only two cases occur:

(i) either all words w are such that $\Im s(w) = 0$ (and then, for n > 2, one has actually s(w) = 0 for each of them)

(ii) or there is, in clockwise order, one block of words w such that $\Im s(w) > 0$ followed by one block of words w such that $\Im s(w) < 0$, separated by at most two words w such that $\Im s(w) = 0$.

Consider the set $S_n$ of words of length n formed of

(i) a representative of each conjugacy class of words w of length n such that $\Im s(w) = 0$ for all the conjugates.

(ii) the words $w = a_0a_1 \cdots a_{n-1}$ of length n such that $\Im s(w) > 0$ for the first time clockwise.

It is shown in [36] that this set is unavoidable for all n > 2. The comparison with the previous family of minimal unavoidable sets shows that the families have nothing in common. The sets obtained are indeed different. The sets defined by J. Mykkeltveit have a slight advantage in the sense that the maximal length of words avoiding the set is smaller. For example, for n = 20, there are 256 words of length 2579 that avoid $I_n$, but none of length 563 that avoid all of $S_n$ (and there is a unique way to avoid $S_n$ with length 562). This computation has been performed using D. Knuth's program UNAVOIDABLE2 (see http://www-cs-faculty.stanford.edu/~knuth/programs.html). Our proof has the advantage of using only elementary concepts and in particular no real or complex arithmetic.

Another proof of Theorem 1.6.1, obtained by the first two authors of [11] and presented in [10], is a construction working by stages. To explain these stages, let us consider the case of a binary alphabet A = {a,b}. Given a set X of two-sided infinite words, we say that a set Y of words is unavoidable in X if every word of X has a factor in Y.

For i ≥ 1, let $X_i$ be the set of two-sided infinite words on A which avoid $a^i$. Let $c_i(n,k)$ be the number of conjugacy classes of words x of length n on k symbols such that the words of the form $x^\zeta = \cdots xxx \cdots$ are in $X_i$. It is thus also equal to the number of orbits of period n in $X_i$. Table 1.6.13 below gives the values of $c_i(n,2)$ for 1 ≤ i ≤ 10 and 1 ≤ n ≤ 10. The rows are indexed by i and the columns by n. Thus the second row is the last row of Table 1.3.8 and the third row is the last row of Table 1.3.9. Moreover, subtracting 1 from the first 8 entries of the 3 last rows, we obtain the second row of Table 1.3.7 (we have to subtract 1 because $a^n$ is missing).

The idea of the step by step construction of a minimal unavoidable set of words of length n is to construct a sequence $Y_1 \subset Y_2 \subset \ldots \subset Y_n$ of sets of words of length n such that for 1 ≤ i ≤ n, the set $Y_i$ is unavoidable in $X_i$ with $c_i(n,k)$ elements. This can be stated as the following result.

Theorem 1.6.3 For each k ≥ 1 and n ≥ i ≥ 1, there exists a set of $c_i(n,k)$ words of length n on k symbols which is unavoidable in $X_i$.

 i \ n    1   2   3   4   5   6   7   8   9   10
   1      1   1   1   1   1   1   1   1   1    1
   2      1   2   2   3   3   5   5   8  10   15
   3      1   2   3   4   5   9  11  19  29   48
   4      1   2   3   5   6  11  15  27  43   75
   5      1   2   3   5   7  12  17  31  51   91
   6      1   2   3   5   7  13  18  33  55   99
   7      1   2   3   5   7  13  19  34  57  103
   8      1   2   3   5   7  13  19  35  58  105
   9      1   2   3   5   7  13  19  35  59  106
  10      1   2   3   5   7  13  19  35  59  107

Table 1.6.13 The values of $c_i(n,2)$.
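The entries of the table can be recomputed by exhaustive enumeration. The sketch below is an illustration under our own naming (`c_i`); it enumerates all words of length n, keeps those whose two-sided infinite repetition avoids $a^i$, and counts conjugacy classes by their smallest rotation.

```python
from itertools import product

def c_i(i, n, alphabet="ab"):
    a = alphabet[0]  # the letter whose i-th power is forbidden
    # A factor of length i of ...xxx... starting inside one copy of x
    # is contained in this many consecutive copies of x.
    reps = (i + n - 1) // n + 1
    classes = set()
    for t in product(alphabet, repeat=n):
        x = "".join(t)
        if a * i not in x * reps:
            classes.add(min(x[j:] + x[:j] for j in range(n)))
    return len(classes)

for i in range(1, 5):
    print(i, [c_i(i, n) for n in range(1, 11)])
```

The enumeration is exponential in n and only meant for the small parameters of the table.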

The alternative proof consists in showing directly that the $c_i(n,k)$ last elements of $I_n$ form a set unavoidable in $X_i$.

It is interesting to remark that not all minimal unavoidable sets are built in this way. Indeed, there are sets which are minimal unavoidable in $X_{n+1}$ but do not contain a minimal unavoidable set in $X_n$.

For example, let $Y_1 = \{bbb\}$, $Y_2 = \{bbb, bab\}$, $Y_3 = \{bbb, bab, aab\}$. Then each $Y_i$ for 1 ≤ i ≤ 3 is unavoidable in $X_i$ of size $c_i(3,2)$ and $I_3 = Y_3 \cup \{aaa\}$. In particular, the set $I_3$ contains an unavoidable set in $X_2$ with 2 elements, namely $Y_2$. However, the set $J_3 = \{aaa, aba, bba, bbb\}$ obtained from $I_3$ by exchanging a and b does not contain a two-element set unavoidable in $X_2$.

A set of the form $X_i$ is a particular case of what is called a system of finite type. This is, by definition, the set of all two-sided infinite words avoiding a given finite set of words (see [29]). We do not know in general in which systems of finite type it is true that for each n there exists an unavoidable set having no more elements than the number of orbits of period n.

1.7 The Burrows-Wheeler Transform

The Burrows-Wheeler transform is a popular method used for text compression [9]. It produces a permutation of the characters of an input word w in order to obtain a word easier to compress. The presentation given here is close to that of [12].

Suppose w is a primitive word over a totally ordered alphabet A. Let $w_1, w_2, \ldots, w_n$ be the sequence of conjugates of w in increasing lexicographic order. Let M(w) be the matrix having $w_1, w_2, \ldots, w_n$ as rows. For example, if w = aabacacb, the matrix M(w) is

$$M(aabacacb) = \begin{matrix}
a & a & b & a & c & a & c & b\\
a & b & a & c & a & c & b & a\\
a & c & a & c & b & a & a & b\\
a & c & b & a & a & b & a & c\\
b & a & a & b & a & c & a & c\\
b & a & c & a & c & b & a & a\\
c & a & c & b & a & a & b & a\\
c & b & a & a & b & a & c & a
\end{matrix}$$

The Burrows-Wheeler Transform T(w) of w is the last column of M(w), read from top to bottom. If $b_i$ denotes the last letter of the word $w_i$, for i = 1, 2, ..., n, then $T(w) = b_1b_2 \cdots b_n$. For instance, T(aabacacb) = babccaaa.

It is clear that T(w) depends only on the conjugacy class of w, i.e. T(w) = T(w′) if w and w′ are conjugate. Therefore we may suppose that w is a Lyndon word, i.e. $w = w_1$.
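The definition translates directly into a few lines of Python. This is a naive sketch (it sorts all n conjugates explicitly), with a function name of our choosing; it is not meant as an efficient implementation.

```python
def bwt(w):
    # Rows of M(w): the conjugates of w in increasing lexicographic order.
    rows = sorted(w[i:] + w[:i] for i in range(len(w)))
    # T(w) is the last column, read from top to bottom.
    return "".join(row[-1] for row in rows)

print(bwt("aabacacb"))
print(bwt("acacbaab"))  # a conjugate of the same word
```

Both calls print the same word, in accordance with the invariance of T under conjugacy.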

The matrix M(w) defines a permutation $\sigma_w$ (or simply σ when no confusion arises) of 1, 2, ..., n:

$$\sigma(i) = j \iff w_j = a_i \cdots a_n a_1 \cdots a_{i-1} \qquad (1.7.25)$$

In other terms, σ(i) is the rank in the lexicographic order of the i-th circular shift of the word w. For instance, for w = aabacacb, we have:

$$\sigma = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8\\ 1 & 2 & 6 & 3 & 7 & 4 & 8 & 5 \end{pmatrix}.$$

Let F(w) denote the first column of the matrix M(w). If $c_i$ denotes the first letter of the word $w_i$, for i = 1, 2, ..., n, then $F(w) = c_1c_2 \cdots c_n$ is the nondecreasing rearrangement of w. By definition, we have, for each index i with 1 ≤ i ≤ n,

$$a_i = c_{\sigma(i)} \qquad (1.7.26)$$

The permutation σ transforms the first column of M(w) into its first row, i.e. into the word w. We have also the following formula expressing T(w) using σ:

$$b_i = a_{\sigma^{-1}(i)-1}. \qquad (1.7.27)$$

Indeed, $b_{\sigma(j)}$ is the last letter of $w_{\sigma(j)} = a_j \cdots a_n a_1 \cdots a_{j-1}$, hence $b_{\sigma(j)} = a_{j-1}$, which is equivalent to the above formula.

Given a primitive word $w \in A^*$, let $\pi_w$, or simply π when no confusion arises, be the permutation defined by

$$\pi(i) = \sigma(\sigma^{-1}(i)+1), \qquad (1.7.28)$$

where the addition is to be taken modulo n.

Remark 1.7.1 Observe that π is just the permutation defined by writing σ as a word and interpreting it as an n-cycle. Thus we have also $\sigma(i) = \pi^{i-1}(1)$ and

$$a_i = c_{\pi^{i-1}(1)}.$$

In the previous example we have, written as a cycle,

$$\pi = (1\ 2\ 6\ 3\ 7\ 4\ 8\ 5)$$

and as an array

$$\pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8\\ 2 & 6 & 7 & 8 & 1 & 3 & 4 & 5 \end{pmatrix}.$$

The following proposition is fundamental for defining the inverse transform.

Proposition 19 If $c_1c_2 \cdots c_n$ and $b_1b_2 \cdots b_n$ are the first and the last columns, respectively, of the matrix M(w), then $c_i = b_{\pi(i)}$, for i = 1, 2, ..., n.

Proof. Substituting in formula (1.7.27) the value $a_i$ given by formula (1.7.26), we obtain $b_i = c_{\sigma(\sigma^{-1}(i)-1)}$, which is equivalent to the statement of the proposition.

The previous proposition states that the permutation $\pi_w$ transforms the last column of the matrix M(w) into the first one. Actually, it can be noted that $\pi_w$ transforms any column of the matrix M(w) into the following one.

1.7.1 The inverse transform

We now show how the word w can be recovered from T(w). For this we prove a property of the matrix M(w) stating that, for any letter $a \in A$, its occurrences in F(w) appear in the same order as in T(w), i.e. the k-th instance of a in T(w) corresponds (through π) to its k-th instance in F(w). In order to formalize this property we introduce the following notation.

The rank of the index i in the word $z = z_1z_2 \cdots z_n$, denoted by rank(i,z), is the number of occurrences of the letter $z_i$ in $z_1z_2 \cdots z_i$. For instance, if z = babccaaa, then rank(4,z) = 1 and rank(6,z) = 2.

Proposition 20 Given the words $T(w) = b_1b_2 \cdots b_n$ and $F(w) = c_1c_2 \cdots c_n$, for each index i = 1, 2, ..., n, we have

$$\mathrm{rank}(i, F(w)) = \mathrm{rank}(\pi(i), T(w)).$$

Proof. We first note that, for two words u, v of the same length, and for any letter $a \in A$, one has

$$au < av \iff ua < va.$$

Thus, for all indices i, j, the conditions i < j and $c_i = c_j$ imply π(i) < π(j). Hence, the number of occurrences of $c_i$ in $c_1c_2 \cdots c_i$ is equal to the number of occurrences of $b_{\pi(i)} = c_i$ in $b_1b_2 \cdots b_{\pi(i)}$.

To obtain w from $T(w) = b_1b_2 \cdots b_n$, we first compute $F(w) = c_1c_2 \cdots c_n$ by rearranging the letters $b_i$ in nondecreasing order. Proposition 20 shows that π(i) is the index j such that $c_i = b_j$ and rank(j, T(w)) = rank(i, F(w)). This defines the permutation π, from which σ can be obtained by expressing π as an n-cycle, and then, by using Formula (1.7.26), the word w can be reconstructed.
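The procedure just described can be sketched as follows. We rely on the fact that Python's `sorted` is stable, so that sorting the positions of T(w) by their letters matches the k-th occurrence of a letter in F(w) with its k-th occurrence in T(w), as in Proposition 20. The function name and the 0-based indexing are ours.

```python
def inverse_bwt(t):
    n = len(t)
    # pi[i] = position in t of the letter matched with position i of F
    # (0-based); the stable sort keeps equal letters in their order in t.
    pi = sorted(range(n), key=lambda j: t[j])
    first_column = sorted(t)
    # a_i = c_{pi^(i-1)(1)}: follow pi starting from the first row.
    letters, i = [], 0
    for _ in range(n):
        letters.append(first_column[i])
        i = pi[i]
    return "".join(letters)

print(inverse_bwt("babccaaa"))
```

The result is the first row of M(w), that is, the Lyndon conjugate of w. The output is meaningful only when π is an n-cycle, i.e. when the input is indeed the transform of a primitive word.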

Remark 1.7.2 Proposition 20 further shows that the permutation π is related to the standard permutation of the word T(w). Recall that the standard permutation of a word $v = b_1b_2 \cdots b_n$ on a totally ordered alphabet A is the permutation τ such that, for $i, j \in \{1, 2, \ldots, n\}$, the condition τ(i) < τ(j) is equivalent to $b_i < b_j$ or $b_i = b_j$ and i < j. The permutation τ may be obtained by numbering from left to right the letters of v, starting from the smallest letter, then the second smallest, and so on. For example, for v = babccaaa, we have that, written as a word, τ = 51678234, and as an array:

$$\tau = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8\\ 5 & 1 & 6 & 7 & 8 & 2 & 3 & 4 \end{pmatrix}.$$

It is easy to see that the permutation π corresponds to the inverse of the standard permutation of T(w).

Remark 1.7.3 The Burrows-Wheeler transform T(w) of a primitive word w depends only on the conjugacy class of w. Therefore, T defines an injective mapping from the primitive necklaces over an alphabet A to the words of $A^*$. However, such a mapping is not surjective. Remark 1.7.2 indeed shows that, if we consider a word $u \in A^*$ such that the standard permutation τ of u (and then also the permutation $\pi = \tau^{-1}$) is not a cycle, then there does not exist any word w such that T(w) = u. Let us, for instance, consider the word u = bccaaab. Its standard permutation

$$\tau = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7\\ 4 & 6 & 7 & 1 & 2 & 3 & 5 \end{pmatrix}$$

is the product of two cycles

$$\tau = (1\ 4)(2\ 6\ 3\ 7\ 5).$$

It follows that there does not exist any word w such that T(w) = u.
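The criterion of this remark is easy to test mechanically. The sketch below computes the standard permutation (1-based, as in the text) and its cycle decomposition; the names `standard_permutation`, `cycles` and `has_single_cycle` are ours.

```python
def standard_permutation(v):
    # Number the letters from the smallest to the largest, ties being
    # broken from left to right (sorted() is stable).
    order = sorted(range(len(v)), key=lambda j: v[j])
    tau = [0] * len(v)
    for rank, j in enumerate(order, start=1):
        tau[j] = rank
    return tau  # tau[i-1] is the image of i

def cycles(perm):
    # Cycle decomposition of a permutation given as a 1-based image list.
    seen, out = set(), []
    for start in range(1, len(perm) + 1):
        if start not in seen:
            cyc, j = [], start
            while j not in seen:
                seen.add(j)
                cyc.append(j)
                j = perm[j - 1]
            out.append(tuple(cyc))
    return out

def has_single_cycle(u):
    return len(cycles(standard_permutation(u))) == 1

print(standard_permutation("bccaaab"), cycles(standard_permutation("bccaaab")))
print(has_single_cycle("babccaaa"), has_single_cycle("bccaaab"))
```

For u = bccaaab the decomposition has two cycles, so u is not in the image of T, whereas babccaaa passes the test.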

1.7.2 Descents of a permutation

A descent of a permutation π is an index i such that π(i) > π(i+1). We denote by Des(π) the set of descents of the permutation π. Consider the permutation $\pi_w$ corresponding to a word w. It is clear from Proposition 20 that if i is a descent of $\pi_w$, then $c_i \ne c_{i+1}$. Thus the number of descents of $\pi_w$ is at most equal to k−1, where k is the number of distinct symbols appearing in the word w. For instance, for w = aabacacb, $\pi_w(4) > \pi_w(5)$; moreover 4 is the only descent of $\pi_w$ and so $\mathrm{Des}(\pi_w) = \{4\}$.

Let $A = \{a_1, a_2, \ldots, a_k\}$ be a totally ordered alphabet with $a_1 < a_2 < \ldots < a_k$. If w is a word of $A^*$, denote by P(w) the Parikh vector of w: $P(w) = (|w|_{a_1}, |w|_{a_2}, \ldots, |w|_{a_k})$. It is clear that, if two words are conjugate, then they have the same Parikh vector, and so one can define the Parikh vector of a necklace. We say that a vector $V = (n_1, n_2, \ldots, n_k)$ is positive if $n_i > 0$ for i = 1, 2, ..., k. We denote by ρ(V) the set of integers $\rho(V) = \{n_1, n_1+n_2, \ldots, n_1+\ldots+n_{k-1}\}$. When V is positive, ρ(V) has k−1 elements. Let $\pi_w$ be the permutation corresponding to the word w and let P(w) be the Parikh vector of w. It is clear from Proposition 20 that we have the inclusion $\mathrm{Des}(\pi_w) \subset \rho(P(w))$.

Example 19 The Parikh vector of the word w = aabacacb is V = (4,2,2) and ρ(V) = {4,6}. The permutation $\pi_w$ corresponding to w is

$$\pi_w = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8\\ 2 & 6 & 7 & 8 & 1 & 3 & 4 & 5 \end{pmatrix}.$$

Thus $\mathrm{Des}(\pi_w) = \{4\} \subset \rho(V)$.

The following statement, due to Crochemore, Désarménien and Perrin [12], results from the previous considerations and Remark 1.7.2.

Theorem 1.7.4 For any positive vector $V = (n_1, n_2, \ldots, n_k)$, with $n = n_1 + n_2 + \ldots + n_k$, the map $w \to \pi_w$ is one-to-one from the set of primitive necklaces of length n with Parikh vector V onto the set of cyclic permutations π on {1, 2, ..., n} such that ρ(V) contains Des(π).

This result actually is a particular case of a result stated in [31] and closely related to the Gessel-Reutenauer bijection introduced in the next section. Since each conjugacy class of primitive words can be represented by a Lyndon word, Theorem 1.7.4 establishes a bijection between Lyndon words and cyclic permutations having special descent sets. The extension of this result (Theorem 11.6.1 of [31]) establishes a bijection between words and permutations, relating the Lyndon factorization of words and the cycle structure of the permutations.

1.8 The Gessel-Reutenauer bijection

We have shown that the Burrows-Wheeler transform T(w) of a word w depends only on the conjugacy class of w. Therefore, T defines an injective mapping from the primitive necklaces over an alphabet A to the words of $A^*$. However (cf. Remark 1.7.3) such a mapping is not surjective.

We now extend the Burrows-Wheeler transform by defining a bijective mapping Φ from the multisets of primitive necklaces over a totally ordered alphabet A to the words of $A^*$. This bijection has been introduced by Gessel and Reutenauer in [20]. In order to define the mapping Φ we introduce a new order on $A^*$. For a word $z \in A^*$, $z^\omega$ denotes the infinite word $zzz\cdots$ obtained by infinitely iterating z. Given $u, v \in A^*$, $u \preceq_\omega v$ if and only if $u^\omega \le v^\omega$ in the lexicographic order. Remark that the order $\preceq_\omega$ is different from the usual lexicographic order: for instance, $aba \preceq_\omega ab$.

Following [33] (see also [23]), we give here a presentation of the mapping Φ that emphasizes its relation with the Burrows-Wheeler transform.

Let $S = \{s_1, s_2, \ldots, s_m\}$ be a multiset of necklaces, represented by their Lyndon words, i.e. $s_k$ is the Lyndon word corresponding to the k-th necklace of S. In some cases, it is convenient to denote by $(z_1z_2 \cdots z_t)$ the necklace containing the word $z_1z_2 \cdots z_t$. Denote also by $n = |s_1| + |s_2| + \ldots + |s_m|$ the size of S and by L the least common multiple of the lengths of the words in S.

In order to sort the elements in the necklaces of S according to the order $\preceq_\omega$, consider the collection of all words of the form $u^{L/|u|}$, where u is an element of a necklace. All these words then have a common length L. We order this set of words lexicographically to yield a matrix M(S) with n rows and L columns.

Example 20 Let S = {aab, ab, abb}. Then n = 8 and L = 6 and

$$M(S) = \begin{matrix}
a & a & b & a & a & b\\
a & b & a & a & b & a\\
a & b & a & b & a & b\\
a & b & b & a & b & b\\
b & a & a & b & a & a\\
b & a & b & a & b & a\\
b & a & b & b & a & b\\
b & b & a & b & b & a
\end{matrix}$$

The transform $\Phi(S) = b_1b_2 \cdots b_n$ corresponds to the last column of the matrix M(S), read from top to bottom. In the previous example, Φ(S) = babbaaba.

It is clear that if the multiset S has only one necklace represented by its Lyndon word w, then Φ(S) = T(w), i.e. Φ(S) corresponds to the Burrows-Wheeler transform of w. Note also that, if there are non-trivial multiplicities in the multiset S, then there are repeated rows in the matrix M(S).
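The construction of M(S) and of Φ(S) can be sketched as follows. The multiset S is passed as a list of words, one representative per necklace; the function name `gessel_reutenauer` is ours, and the least common multiple is computed with `math.gcd` to avoid depending on a recent Python version.

```python
from functools import reduce
from math import gcd

def gessel_reutenauer(S):
    """Phi(S) for a multiset S of necklaces given by representative words."""
    # L: least common multiple of the lengths of the words of S.
    L = reduce(lambda x, y: x * y // gcd(x, y), (len(s) for s in S), 1)
    # Rows of M(S): every conjugate u of every word, raised to the power
    # L/|u|, sorted lexicographically (repeated rows are kept).
    rows = sorted((s[i:] + s[:i]) * (L // len(s))
                  for s in S for i in range(len(s)))
    return "".join(row[-1] for row in rows)

print(gessel_reutenauer(["aab", "ab", "abb"]))
print(gessel_reutenauer(["aabacacb"]))
```

The first call reproduces Example 20; the second one shows that a multiset with a single necklace gives back the Burrows-Wheeler transform.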

Several properties of the Burrows-Wheeler matrix M(w) of a word w can be easily extended to the matrix M(S). In particular, if we denote by $F(S) = c_1c_2 \cdots c_n$ the first column of the matrix M(S), by using the same argument as in the proof of Proposition 20, it can be shown that, for any letter $a \in A$, its occurrences in F(S) appear in the same order as in Φ(S). This defines a permutation π that transforms the last column of the matrix M(S) into the first one. Actually (cf. Remark 1.7.2), π is the inverse of the standard permutation of the word Φ(S).

Example 21 The inverse of the standard permutation of the word babbaaba is

$$\pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8\\ 2 & 5 & 6 & 8 & 1 & 3 & 4 & 7 \end{pmatrix}.$$

In order to reverse the Burrows-Wheeler transform, given the word $T(w) = b_1b_2 \cdots b_n$, we considered in Section 1.7.1 the inverse π of its standard permutation, then we expressed it as an n-cycle $(j_1, j_2, \ldots, j_n)$ and we associated to this n-cycle the necklace $(c_{j_1}c_{j_2} \cdots c_{j_n})$.

Recall that the Burrows-Wheeler transform is injective, but not surjective. This is a consequence of the fact that, for some words u, the inverse of the standard permutation of u cannot be expressed by a single n-cycle, but its decomposition contains several cycles (cf. Remark 1.7.3). This remark is at the base of the surjectivity of the Gessel-Reutenauer transform.

Now we show how to reverse the transform Φ, that is, how the multiset of necklaces S can be recovered from the word $\Phi(S) = b_1b_2 \cdots b_n$.

As for the Burrows-Wheeler transform, first compute the first column $F(S) = c_1c_2 \cdots c_n$ of M(S) by rearranging the letters $b_i$ in nondecreasing order.

Then, consider the inverse π of the standard permutation associated to the word $\Phi(S) = b_1b_2 \cdots b_n$. With each cycle $(j_1, j_2, \ldots, j_i)$ of π, associate the necklace $(c_{j_1}c_{j_2} \cdots c_{j_i})$. The multiset S is given by

$$S = \{(c_{j_1}c_{j_2} \cdots c_{j_i}) \mid (j_1, j_2, \ldots, j_i) \text{ is a cycle of } \pi\}.$$

Remark that different cycles of π could give rise to the same necklace, and this explains the use of multisets.

Example 22 Let $\Phi(S) = babbaaba$ (see Example 20). Rearranging the letters in nondecreasing order, one obtains $F(S) = aaaabbbb$. Then the permutation $\pi$ is

\[
\pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 2 & 5 & 6 & 8 & 1 & 3 & 4 & 7 \end{pmatrix}.
\]

By decomposing $\pi$ into cycles

\[
\pi = (1\ 2\ 5)(3\ 6)(4\ 8\ 7),
\]

one obtains the multiset of necklaces $S = \{(aab), (ab), (abb)\}$.

The following theorem, due to Gessel and Reutenauer [20], results from the preceding considerations.

Theorem 1.8.1 The map $\Phi$ defines a bijection between words over a totally ordered alphabet $A$ and multisets of primitive necklaces over $A$.

A similar, but different, bijection has been proved in [19], where the alternate lexicographic order is used instead of the lexicographic order.
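The inversion procedure described above (sort the letters, take the inverse of the standard permutation, read one necklace per cycle) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, words are plain Python strings, and each necklace is returned as the representative read from the smallest index of its cycle.

```python
def inverse_standard_permutation(word):
    """1-based list pi, where pi[i-1] is the position in `word` of the
    i-th letter of the stably sorted word (equal letters keep their order)."""
    order = sorted(range(len(word)), key=lambda j: word[j])
    return [j + 1 for j in order]


def cycle_decomposition(pi):
    """Cycles of a 1-based permutation, each started at its smallest element."""
    seen = set()
    cycles = []
    for start in range(1, len(pi) + 1):
        if start in seen:
            continue
        cycle, j = [], start
        while j not in seen:
            seen.add(j)
            cycle.append(j)
            j = pi[j - 1]
        cycles.append(cycle)
    return cycles


def inverse_gessel_reutenauer(word):
    """Recover the multiset of necklaces S (one representative each) from Phi(S)."""
    first_column = sorted(word)  # F(S) = c_1 c_2 ... c_n
    pi = inverse_standard_permutation(word)
    return ["".join(first_column[j - 1] for j in cycle)
            for cycle in cycle_decomposition(pi)]


print(inverse_standard_permutation("babbaaba"))  # [2, 5, 6, 8, 1, 3, 4, 7]
print(inverse_gessel_reutenauer("babbaaba"))     # ['aab', 'ab', 'abb']
```

The two printed values reproduce the permutation and the multiset of necklaces of Example 22.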

1.8.1 Gessel-Reutenauer bijection and de Bruijn cycles

In this section we present an interesting connection, pointed out in [23], between the Gessel-Reutenauer bijection and de Bruijn cycles.

A multiset $S = \{s_1, s_2, \ldots, s_m\}$ of necklaces is a de Bruijn set of span $n$ over an alphabet $A$ if $|s_1| + |s_2| + \cdots + |s_m| = \mathrm{Card}(A)^n$ and every word $w \in A^n$ is a prefix of some power of some word in a necklace of $S$.

Remark 1.8.2 The number of distinct prefixes of length $n$ of powers of the words in the necklaces of $S$ is at most $\mathrm{Card}(A)^n$. So, given that $S$ is a de Bruijn set of span $n$, every word in $A^n$ can be read exactly once within the necklaces of $S$. It also follows, in particular, that no two necklaces in $S$ are equal, so that $S$ is indeed a set, as opposed to a multiset, of necklaces.

Remark 1.8.3 If $S$ is a de Bruijn set of span $n$, then $S$ contains a necklace of length at least $n$. To show this, consider a Lyndon word $u$ of length $n$ (for instance, $u = ab^{n-1}$, where $a < b$). By definition, $u$ is a prefix of some power of a word in a necklace of $S$. Since $u$, as a Lyndon word, is unbordered, it cannot arise as a prefix of a proper power in a necklace of $S$. It follows that $S$ contains a necklace of length at least $n$.

If $A$ is an alphabet of cardinality $k$, denote by $\Gamma$ the set of all $k!$ products of distinct elements of $A$:

\[
\Gamma = \{ a_1 a_2 \cdots a_k \mid a_i \in A \text{ for } i = 1, \ldots, k \text{ and } a_i \neq a_j \text{ for } i \neq j \}.
\]

For instance, for $A = \{a, b, c\}$,

\[
\Gamma = \{abc, acb, bac, bca, cab, cba\}.
\]

The following result is due to Higgins [23].

Theorem 1.8.4 A set $S$ is a de Bruijn set of span $n$ if and only if $\Phi(S) \in \Gamma^{k^{n-1}}$.

Proof. Let us first suppose that $S$ is a de Bruijn set of span $n$. Consider the matrix $M(S)$. By Remark 1.8.3, the length $L$ of the rows of $M(S)$ is at least $n$. Consider the sub-matrix consisting of the first $n$ columns of $M(S)$. Since $S$ is a de Bruijn set, the rows of this sub-matrix form the set $A^n$. Each word $u \in A^{n-1}$ is a prefix of $k$ successive rows of $M(S)$. We show that these successive rows of $M(S)$ end with distinct letters of $A$. Suppose, by contradiction, that two of these rows $v_1$ and $v_2$ end with the same letter $a$, i.e. $v_1 = uxa$ and $v_2 = uya$ for some $x, y \in A^*$, with $x \neq y$. Since the conjugates $aux$ and $auy$ of $v_1$ and $v_2$, respectively, correspond to distinct rows in $M(S)$, it follows that $au \in A^n$ would be a prefix of a power of distinct words in the necklaces of $S$, contrary to $S$ being a de Bruijn set of span $n$. Hence the final column $\Phi(S)$ of $M(S)$ is a product of $k^{n-1}$ elements (possibly with repetitions) taken from the set $\Gamma$.

In order to prove the converse implication, let $S$ be a multiset of necklaces such that $\Phi(S) = w \in \Gamma^{k^{n-1}}$. We first prove, by induction on the integer $r$, with $1 \le r \le n$, that any word $u \in A^*$ of length $r$ is the prefix of $k^{n-r}$ consecutive rows of the matrix $M(S)$. In particular, we show that there exists an integer $j$ such that $u$ appears as a prefix in the rows of $M(S)$ ranging from the index $jk^{n-r}$ to the index $(j+1)k^{n-r} - 1$. Note that the sequence of the last letters of these rows, read from top to bottom, yields a factor of $w$ which is again a concatenation of elements of $\Gamma$.

The statement is true for $r = 1$. Indeed, since $w \in \Gamma^{k^{n-1}}$, $|w| = k^n$ and, for any letter $a \in A$, $|w|_a = k^{n-1}$. It follows that the first column $F(S)$ of $M(S)$, read from top to bottom, consists of $k^{n-1}$ occurrences of the first (in the order) letter of $A$, followed by $k^{n-1}$ occurrences of the second letter, and so on. Actually, we also have that, if $z$ is the word corresponding to an arbitrary column of $M(S)$, then for each $a \in A$, $|z|_a = |z|/k$.

Let us now suppose that the statement is true for some $r < n$, and consider a word $v \in A^*$ of length $r+1$. If $a$ is the first letter of $v$, we have $v = au$, with $|u| = r$. By the inductive hypothesis, there exists an integer $j$ such that $u$ is the prefix of length $r$ of $k^{n-r}$ consecutive rows of $M(S)$ ranging from the index $jk^{n-r}$ to the index $(j+1)k^{n-r} - 1$. The sequence of the last letters of these rows, read from top to bottom, forms a factor $z_u$ of $w$ (the word corresponding to the last column of $M(S)$), and moreover $z_u$ is a product of elements of $\Gamma$. Thus, for any $a \in A$, $|z_u|_a = k^{n-r-1}$. It follows that, within the $k^{n-r}$ consecutive rows of $M(S)$ having $u$ as prefix, $k^{n-r-1}$ of them end with the letter $a$. By taking into account their conjugates, we have that $k^{n-r-1}$ consecutive rows of $M(S)$ have as prefix the same word $au = v$. If $b$ is the last letter of $v$, i.e. $v = u'b$, since $|u'| = r$, by the inductive hypothesis there exists an integer $i$ such that $u'$ appears as prefix of the rows of $M(S)$ ranging from the index $ik^{n-r}$ to the index $(i+1)k^{n-r} - 1$. The $k$ different letters of $A$ split the interval $[ik^{n-r}, (i+1)k^{n-r} - 1]$ into $k$ sub-intervals of equal length in such a way that each sub-interval contains the rows of $M(S)$ having as prefix of length $r+1$ the word $u'c$, for some $c \in A$. We conclude that there is an integer $t$ such that the $k^{n-r-1}$ consecutive rows of $M(S)$ having as prefix the word $v = u'b$ have indices that range from $tk^{n-r-1}$ to $(t+1)k^{n-r-1} - 1$.

So, we have proved that, if $\Phi(S) = w \in \Gamma^{k^{n-1}}$, then, for any $r$ with $1 \le r \le n$, every word $u \in A^*$ of length $r$ is the prefix of $k^{n-r}$ consecutive rows of $M(S)$. In particular, for $r = n$, every word $u \in A^*$ of length $n$ is the prefix of exactly one row of $M(S)$. This implies that $S$ is a de Bruijn set of span $n$.

By Theorem 1.8.4, one can generate a de Bruijn set $S$ of span $n$, on an alphabet $A$ of cardinality $k$, by taking a word $v \in \Gamma^{k^{n-1}}$ and by computing $\Phi^{-1}(v)$.

Example 23 Consider the alphabet $A = \{a, b\}$ with $a < b$. Then $\Gamma = \{\alpha, \beta\}$, where $\alpha = ab$ and $\beta = ba$. Let $n = 4$, and consider the word $v = \beta\alpha\beta\beta\alpha\alpha\alpha\beta = baabbabaabababba \in \Gamma^8$. Rearranging the letters of $v$ in nondecreasing order, one obtains the first column $F(S)$ of the matrix $M(S)$: $F(S) = aaaaaaaabbbbbbbb$. The inverse $\pi$ of the standard permutation of the word $v$ is

\[
\pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 & 16 \\ 2 & 3 & 6 & 8 & 9 & 11 & 13 & 16 & 1 & 4 & 5 & 7 & 10 & 12 & 14 & 15 \end{pmatrix}.
\]

By decomposing $\pi$ into cycles

\[
\pi = (1\ 2\ 3\ 6\ 11\ 5\ 9)(4\ 8\ 16\ 15\ 14\ 12\ 7\ 13\ 10),
\]

one obtains the set of necklaces

\[
S = \{(baaaaba), (baabbbbab)\}.
\]

One can verify that any word of $A^4$ is a prefix of some word in a necklace of $S$, i.e. $S$ is a de Bruijn set of span 4.
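The verification left to the reader in Example 23, and the generation recipe of Theorem 1.8.4, can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, necklaces are stored as strings (one representative per necklace), and the check enumerates all length-$n$ prefixes of powers of conjugates directly from the definition of a de Bruijn set.

```python
from itertools import permutations, product


def inverse_gessel_reutenauer(word):
    """Necklace representatives of Phi^{-1}(word), one per cycle of pi."""
    first_column = sorted(word)
    pi = [j + 1 for j in sorted(range(len(word)), key=lambda j: word[j])]
    seen, necklaces = set(), []
    for start in range(1, len(pi) + 1):
        cycle, j = [], start
        while j not in seen:
            seen.add(j)
            cycle.append(j)
            j = pi[j - 1]
        if cycle:
            necklaces.append("".join(first_column[j - 1] for j in cycle))
    return necklaces


def is_de_bruijn_set(necklaces, alphabet, n):
    """Check the definition: total length Card(A)^n and every word of A^n
    is a prefix of a power of some conjugate of some necklace."""
    k = len(alphabet)
    if sum(len(s) for s in necklaces) != k ** n:
        return False
    found = []
    for s in necklaces:
        repeated = s * (n // len(s) + 2)  # long enough for every start position
        found.extend(repeated[i:i + n] for i in range(len(s)))
    return set(found) == {"".join(t) for t in product(alphabet, repeat=n)}


def de_bruijn_sets(alphabet, n):
    """Yield Phi^{-1}(v) for every v in Gamma^(k^(n-1))."""
    gamma = ["".join(p) for p in permutations(sorted(alphabet))]
    for blocks in product(gamma, repeat=len(alphabet) ** (n - 1)):
        yield inverse_gessel_reutenauer("".join(blocks))


S = inverse_gessel_reutenauer("baabbabaabababba")
print(S)                             # ['aaaabab', 'aabbbbabb']
print(is_de_bruijn_set(S, "ab", 4))  # True
```

The representatives `aaaabab` and `aabbbbabb` are conjugates of the words `baaaaba` and `baabbbbab` used in Example 23, so they denote the same two necklaces.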

Given a totally ordered alphabet $A = \{a_1, a_2, \ldots, a_k\}$ of cardinality $k$, with $a_1 < a_2 < \cdots < a_k$, denote by $\alpha$ the element $a_1 a_2 \cdots a_k \in \Gamma$. Now we look at the special case of Theorem 1.8.4 where $v$ is a power of $\alpha$. In such a case, by specializing the arguments in the proof of Theorem 1.8.4 (cf. [23]), one can prove the following result.

Theorem 1.8.5 Let $v = \alpha^{k^{n-1}}$, let $S = \Phi^{-1}(v)$ and let $M = M(S)$ be the matrix corresponding to $S$. Then the rows of $M$ are simply the elements of $A^n$. Moreover $S$ is the set of necklaces of the Lyndon words of length dividing $n$.

Example 24 Consider the alphabet $A = \{a, b\}$ with $a < b$, and the word $\alpha^{2^3} = (ab)^8$. The inverse $\pi$ of the standard permutation of the word $(ab)^8$ is

\[
\pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 & 16 \\ 1 & 3 & 5 & 7 & 9 & 11 & 13 & 15 & 2 & 4 & 6 & 8 & 10 & 12 & 14 & 16 \end{pmatrix}.
\]

By decomposing $\pi$ into cycles

\[
\pi = (1)(2\ 3\ 5\ 9)(4\ 7\ 13\ 10)(6\ 11)(8\ 15\ 14\ 12)(16),
\]

one obtains the set of necklaces

\[
S = \{(a), (aaab), (aabb), (ab), (abbb), (b)\},
\]

which is the set of necklaces of the Lyndon words of length dividing 4. If we consider the concatenation of such Lyndon words, we obtain the word

\[
a.aaab.aabb.ab.abbb.b
\]

which is indeed the first de Bruijn word of span 4 in the lexicographic order. That this is always the case is the well-known theorem of Fredricksen and Maiorana (see Theorem 1.5.6).

Actually, as a consequence of Theorem 1.8.5 and of the theorem of Fredricksen and Maiorana, we obtain the following result.

Proposition 21 The concatenation in ascending order of the Lyndon words of the necklaces of $S = \Phi^{-1}(\alpha^{k^{n-1}})$ is the first de Bruijn word of span $n$ in the lexicographic order.
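Proposition 21 translates directly into a small construction. The following sketch is an assumption-laden illustration (hypothetical function names, alphabets given as Python strings whose character order is the alphabet order): it computes $\Phi^{-1}(\alpha^{k^{n-1}})$, replaces each necklace by its Lyndon representative, sorts, and concatenates.

```python
def inverse_gessel_reutenauer(word):
    """Necklace representatives of Phi^{-1}(word), one per cycle of pi."""
    first_column = sorted(word)
    pi = [j + 1 for j in sorted(range(len(word)), key=lambda j: word[j])]
    seen, necklaces = set(), []
    for start in range(1, len(pi) + 1):
        cycle, j = [], start
        while j not in seen:
            seen.add(j)
            cycle.append(j)
            j = pi[j - 1]
        if cycle:
            necklaces.append("".join(first_column[j - 1] for j in cycle))
    return necklaces


def lyndon_representative(word):
    """Lexicographically smallest conjugate of a primitive word."""
    return min(word[i:] + word[:i] for i in range(len(word)))


def first_de_bruijn_word(alphabet, n):
    """Concatenate, in ascending order, the Lyndon words of Phi^{-1}(alpha^(k^(n-1)))."""
    alpha = "".join(sorted(alphabet))
    v = alpha * len(alpha) ** (n - 1)
    lyndon_words = sorted(lyndon_representative(s)
                          for s in inverse_gessel_reutenauer(v))
    return "".join(lyndon_words)


def is_de_bruijn_word(word, k, n):
    """True if the k^n cyclic factors of length n of `word` are pairwise distinct."""
    doubled = word + word[:n - 1]
    factors = {doubled[i:i + n] for i in range(len(word))}
    return len(word) == k ** n and len(factors) == k ** n


print(first_de_bruijn_word("ab", 4))  # aaaabaabbababbbb
```

The printed word is the concatenation $a.aaab.aabb.ab.abbb.b$ of Example 24.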

1.9 Suffix arrays

The suffix array is a widely used data structure in string algorithms (see [32] or [22]). The suffix array of a word $w$ of length $n$ is essentially a permutation of $\{1, 2, \ldots, n\}$ corresponding to the starting positions of all the suffixes of $w$ sorted lexicographically.

Let $A = \{a_1, a_2, \ldots, a_k\}$ be a totally ordered alphabet of size $k$, where $a_1 < a_2 < \cdots < a_k$. Given a word $w = z_1 z_2 \cdots z_n$ of length $n$ on the alphabet $A$, the suffix array of $w$ is the permutation $\vartheta_w$ (or simply $\vartheta$ when no confusion arises) of the set $\{1, 2, \ldots, n\}$ such that $\vartheta(i) = j$ if the suffix $z_j z_{j+1} \cdots z_n$ has rank $i$ in the lexicographic ordering of all the suffixes of $w$. For instance, if $w = baaababa$, then

\[
\vartheta = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 8 & 2 & 3 & 6 & 4 & 7 & 1 & 5 \end{pmatrix}.
\]
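The definition can be turned into code almost verbatim. The sketch below is a deliberately naive illustration (quadratic-size keys, hypothetical function name), not one of the efficient constructions discussed in [32] or [22]; it returns the bottom row of the two-line notation, with 1-based positions.

```python
def suffix_array(word):
    """Return [theta(1), ..., theta(n)]: the 1-based starting positions of the
    suffixes of `word`, listed in lexicographic order of the suffixes."""
    n = len(word)
    return sorted(range(1, n + 1), key=lambda j: word[j - 1:])


print(suffix_array("baaababa"))  # [8, 2, 3, 6, 4, 7, 1, 5]
```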

1.9.1 Suffix arrays and Burrows-Wheeler transform

We show here, following [28], the close connection between the suffix array of a word and its Burrows-Wheeler transform. For this purpose, it is convenient to introduce the Burrows-Wheeler array (or simply the BW-array) of a primitive word of $A^*$.

Given a primitive word $w = z_1 z_2 \cdots z_n$ of length $n$ on the ordered alphabet $A$, the BW-array of $w$ is the permutation $\varphi_w$ (or simply $\varphi$ when no confusion arises) of $\{1, 2, \ldots, n\}$ such that $\varphi(i) = j$ if the conjugate $z_j \cdots z_n z_1 \cdots z_{j-1}$ has rank $i$ in the lexicographic sorting of all the conjugates of $w$. By definition, the BW-array of a word $w$ is just the inverse of the permutation $\sigma$ defined by the relation 1.7.25, i.e. $\varphi = \sigma^{-1}$.

In order to show the connection between the suffix array and the Burrows-Wheeler transform, we first introduce a sentinel symbol at the end of the word. Consider a symbol $\sharp \notin A$, and the ordered alphabet $A' = \{\sharp, a_1, \ldots, a_k\}$ where $\sharp < a_1 < \cdots < a_k$. We will examine the suffix array of the word $w' = w\sharp$. In the sequel, we denote by $S_n$ the set of permutations of $\{1, 2, \ldots, n\}$. Moreover, for $\vartheta \in S_n$, $\hat{\vartheta} \in S_{n+1}$ denotes the permutation

\[
\hat{\vartheta} = \begin{pmatrix} 1 & 2 & \ldots & n+1 \\ n+1 & \vartheta(1) & \ldots & \vartheta(n) \end{pmatrix}.
\]

Remark 1.9.1 There is a one-to-one correspondence between the suffix arrays of the words $w \in A^n$ and the suffix arrays of the words $w' \in A^n\sharp$. In particular, if the permutation $\vartheta_w \in S_n$ is the suffix array of $w \in A^n$, then the permutation $\vartheta_{w'} \in S_{n+1}$ is the suffix array of $w' = w\sharp$ if and only if $\vartheta_{w'} = \hat{\vartheta}_w$.

Remark 1.9.2 It is easy to see that, for words in $A^*\sharp$, conjugate sorting is equivalent to suffix sorting. It follows that the suffix array of the word $w' = w\sharp$ coincides with its BW-array, i.e. $\vartheta_{w'} = \varphi_{w'}$.

The following statement follows from the previous remarks.

Proposition 22 A permutation $\vartheta \in S_n$ is the suffix array of a word $w \in A^n$ if and only if the permutation $\hat{\vartheta} \in S_{n+1}$ is the BW-array of the word $w' = w\sharp$.
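Remarks 1.9.1 and 1.9.2 can be observed on a concrete word. In the sketch below, which is only an illustration with hypothetical names, the character `#` plays the role of the sentinel; this relies on the assumption that `#` sorts before every letter used, which holds for lowercase ASCII letters.

```python
def suffix_array(word):
    """1-based starting positions of the suffixes, in lexicographic order."""
    return sorted(range(1, len(word) + 1), key=lambda j: word[j - 1:])


def bw_array(word):
    """1-based starting positions of the conjugates, in lexicographic order
    (assumes `word` is primitive, so that all conjugates are distinct)."""
    return sorted(range(1, len(word) + 1),
                  key=lambda j: word[j - 1:] + word[:j - 1])


def hat(theta):
    """Bottom row of the permutation n+1, theta(1), ..., theta(n)."""
    return [len(theta) + 1] + list(theta)


w = "baaababa"
w_sentinel = w + "#"  # '#' < 'a' < 'b' in ASCII, so '#' acts as the sentinel
print(suffix_array(w_sentinel))  # [9, 8, 2, 3, 6, 4, 7, 1, 5]
print(bw_array(w_sentinel) == suffix_array(w_sentinel) == hat(suffix_array(w)))  # True
```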

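Consider now the mapping $\Psi : S_n \to S_{n+1}$ defined as follows. If $\vartheta \in S_n$, $\Psi(\vartheta)$ is the permutation $\mu \in S_{n+1}$ defined by $\mu(i) = \hat{\vartheta}^{-1}(\hat{\vartheta}(i) + 1)$, where the addition is taken modulo $n+1$ (so that $n+2$ is identified with $1$). Actually, $\Psi(\vartheta)$ is just the permutation obtained by writing $\hat{\vartheta}^{-1}$ as a word and interpreting it as an $(n+1)$-cycle.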

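Example 25 Consider the permutation

\[
\vartheta = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 8 & 2 & 3 & 6 & 4 & 7 & 1 & 5 \end{pmatrix}.
\]

Then we have that

\[
\hat{\vartheta} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 9 & 8 & 2 & 3 & 6 & 4 & 7 & 1 & 5 \end{pmatrix}
\]

and

\[
\hat{\vartheta}^{-1} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 8 & 3 & 4 & 6 & 9 & 5 & 7 & 2 & 1 \end{pmatrix}.
\]

It follows that

\[
\Psi(\vartheta) = (8\ 3\ 4\ 6\ 9\ 5\ 7\ 2\ 1) = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 8 & 1 & 4 & 6 & 7 & 9 & 2 & 3 & 5 \end{pmatrix}.
\]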
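The computation of Example 25 can be reproduced with a short sketch. The names `psi` and `descents` are hypothetical; permutations are represented by the bottom row of their two-line notation, and the modular addition is implemented so that $n+2$ wraps around to $1$.

```python
def psi(theta):
    """Bottom row of Psi(theta), for theta given as [theta(1), ..., theta(n)]."""
    n = len(theta)
    hat = [n + 1] + list(theta)      # hat[i-1] is the image of i under the hatted theta
    inverse = [0] * (n + 1)
    for i, value in enumerate(hat, start=1):
        inverse[value - 1] = i       # inverse[v-1] is the preimage of v
    # mu(i) = inverse(hat(i) + 1); as a 0-based index this is hat(i) mod (n + 1),
    # which sends the value n + 2 back to 1.
    return [inverse[value % (n + 1)] for value in hat]


def descents(perm):
    """Set of positions i (1-based) with perm(i) > perm(i+1)."""
    return {i for i in range(1, len(perm)) if perm[i - 1] > perm[i]}


mu = psi([8, 2, 3, 6, 4, 7, 1, 5])
print(mu)            # [8, 1, 4, 6, 7, 9, 2, 3, 5]
print(descents(mu))  # {1, 6}
```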

The following theorem, which appears in [28], gives a characterization of suffix arrays.

Theorem 1.9.3 Let $n_1, n_2, \ldots, n_k$ be positive integers such that $n_1 + n_2 + \cdots + n_k = n$. A permutation $\vartheta \in S_n$ is the suffix array of a word $w \in A^n$ with Parikh vector $P(w) = (n_1, n_2, \ldots, n_k)$ if and only if $\mathrm{Des}(\Psi(\vartheta)) \subseteq \{1, 1+n_1, \ldots, 1+n_1+\cdots+n_{k-1}\}$. Moreover, in this case, $\vartheta$ is the suffix array of exactly one such word.

Proof. Let us suppose that $\vartheta \in S_n$ is the suffix array of a word $w \in A^n$. Then, by Proposition 22, $\hat{\vartheta} \in S_{n+1}$ is the BW-array of the word $w' = w\sharp$. By observing that the BW-array of $w'$ corresponds to the inverse of the permutation $\sigma_{w'}$ defined by the relation 1.7.25, we have that $\hat{\vartheta} = (\sigma_{w'})^{-1}$. We show that $\Psi(\vartheta) = \pi_{w'}$, where $\pi_{w'}$ is the permutation defined by formula 1.7.28. Indeed, if $\mu = \Psi(\vartheta)$, we can write

\[
\mu(i) = \hat{\vartheta}^{-1}(\hat{\vartheta}(i) + 1) = \sigma_{w'}(\sigma_{w'}^{-1}(i) + 1) = \pi_{w'}(i).
\]

We now observe that, if $P(w) = (n_1, \ldots, n_k)$ is the Parikh vector of $w$, then the Parikh vector of $w' = w\sharp$ is $P(w') = (1, n_1, \ldots, n_k)$. Therefore, by Theorem 1.7.4,

\[
\mathrm{Des}(\Psi(\vartheta)) = \mathrm{Des}(\pi_{w'}) \subseteq \rho(P(w')) = \{1, 1+n_1, \ldots, 1+n_1+\cdots+n_{k-1}\}.
\]

Conversely, given a Parikh vector $V = (n_1, \ldots, n_k)$ and a permutation $\vartheta \in S_n$ such that $\mathrm{Des}(\Psi(\vartheta)) \subseteq \{1, 1+n_1, \ldots, 1+n_1+\cdots+n_{k-1}\}$, we show that there exists a unique word $w \in A^n$ having Parikh vector $P(w) = V$ and suffix array $\vartheta_w = \vartheta$. Actually, we provide a construction of this word. Since $a_1 < a_2 < \cdots < a_k$, the first $n_1$ suffixes of $w$ in the lexicographic order start with the letter $a_1$, the suffixes of $w$ having rank from $n_1 + 1$ to $n_1 + n_2$ in the lexicographic order start with the letter $a_2$, and so on. Therefore, for $w = z_1 z_2 \cdots z_n$, if $1 \le i \le n_1$ then $z_{\vartheta(i)} = a_1$ and, for $1 < r \le k$, if $n_1 + \cdots + n_{r-1} < i \le n_1 + \cdots + n_r$, then $z_{\vartheta(i)} = a_r$. This concludes the proof.

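Example 26 Given the permutation $\vartheta \in S_8$ in Example 25 and the vector $V = (5, 3)$, we construct a word $w$ on a binary alphabet $A = \{a, b\}$, with $a < b$, having $V$ as Parikh vector and $\vartheta$ as suffix array. From Example 25, we have that

\[
\Psi(\vartheta) = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 8 & 1 & 4 & 6 & 7 & 9 & 2 & 3 & 5 \end{pmatrix}.
\]

$\mathrm{Des}(\Psi(\vartheta)) = \{1, 6\}$ satisfies the condition of Theorem 1.9.3. The word $w = z_1 \cdots z_8$ having Parikh vector $(5, 3)$ and suffix array $\vartheta$ is obtained as follows:

\[
z_{\vartheta(1)} = z_{\vartheta(2)} = z_{\vartheta(3)} = z_{\vartheta(4)} = z_{\vartheta(5)} = a
\]

and

\[
z_{\vartheta(6)} = z_{\vartheta(7)} = z_{\vartheta(8)} = b.
\]

Therefore $w = baaababa$.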
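The construction used in the proof of Theorem 1.9.3 and in Example 26 can be sketched as follows. The function names are hypothetical, the alphabet is passed as a string listing the letters in increasing order, and `None` is returned (an arbitrary convention chosen for this sketch) when the descent condition of the theorem fails.

```python
def psi(theta):
    """Bottom row of Psi(theta)."""
    n = len(theta)
    hat = [n + 1] + list(theta)
    inverse = [0] * (n + 1)
    for i, value in enumerate(hat, start=1):
        inverse[value - 1] = i
    return [inverse[value % (n + 1)] for value in hat]


def descents(perm):
    """Set of positions i (1-based) with perm(i) > perm(i+1)."""
    return {i for i in range(1, len(perm)) if perm[i - 1] > perm[i]}


def word_from_suffix_array(theta, parikh, alphabet):
    """Word with suffix array `theta` and Parikh vector `parikh`, or None
    if the descent condition of Theorem 1.9.3 is not satisfied."""
    n = len(theta)
    if sum(parikh) != n or len(parikh) != len(alphabet):
        raise ValueError("Parikh vector does not match the length or the alphabet")
    allowed, total = {1}, 0
    for count in parikh[:-1]:
        total += count
        allowed.add(1 + total)       # 1 + n_1 + ... + n_r
    if not descents(psi(theta)) <= allowed:
        return None
    letters = [""] * n
    rank = 0
    for letter, count in zip(alphabet, parikh):
        for _ in range(count):       # the next `count` ranks start with `letter`
            letters[theta[rank] - 1] = letter
            rank += 1
    return "".join(letters)


theta = [8, 2, 3, 6, 4, 7, 1, 5]
print(word_from_suffix_array(theta, (5, 3), "ab"))  # baaababa
print(word_from_suffix_array(theta, (4, 4), "ab"))  # None
```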

The following corollary of Theorem 1.9.3 will be useful in the next section.

Proposition 23 A permutation $\vartheta \in S_n$ is the suffix array of some word $w$ of length $n$ on an alphabet of cardinality $k$ if and only if

\[
\mathrm{Card}(\mathrm{Des}(\Psi(\vartheta)) \setminus \{1\}) \le k - 1.
\]

1.9.2 Counting suffix arrays

The results of the previous sections are used here to solve three enumeration problems concerning suffix arrays. The results are essentially due to Schürmann and Stoye [38] (see also [15], [3] and [28]).

The first problem approached here is to count the number $s(n,k)$ of distinct permutations that are suffix arrays of some word of length $n$ over an alphabet of size $k$.

The following table gives the values of $s(n,k)$ for $2 \le k \le n \le 9$.

n \ k       2       3       4       5       6       7       8       9
2           2
3           5       6
4          12      23      24
5          27      93     119     120
6          58     360     662     719     720
7         121    1312    3728    4919    5039    5040
8         248    4541   20160   35779   40072   40319   40320
9         503   15111  103345  259535  347769  362377  362879  362880

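In the next theorem we show that the function $s(n,k)$ is related to the Eulerian numbers $\left\langle {n \atop d} \right\rangle$, i.e. the number of permutations of $\{1, 2, \ldots, n\}$ with exactly $d$ descents. Recall (cf. [21]) that the Eulerian numbers can be defined by the following recurrence relation

\[
\left\langle {n \atop d} \right\rangle = (d+1) \left\langle {n-1 \atop d} \right\rangle + (n-d) \left\langle {n-1 \atop d-1} \right\rangle
\]

with $\left\langle {1 \atop 0} \right\rangle = 1$ and $\left\langle {1 \atop d} \right\rangle = 0$ when $d \ge 1$.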

Theorem 1.9.4 The number $s(n,k)$ of distinct permutations that are suffix arrays of some word of length $n$ over an alphabet of size $k$ is

\[
s(n,k) = \sum_{d=0}^{k-1} \left\langle {n \atop d} \right\rangle .
\]
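The formula of Theorem 1.9.4 is easy to evaluate and to compare with a brute-force count. The sketch below is illustrative only: the function names are hypothetical, words are tuples over $\{0, \ldots, k-1\}$, and the brute force enumerates all $k^n$ words, so it is usable for small parameters only.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d>, computed with the recurrence recalled above."""
    if d < 0 or d >= n:
        return 0
    if n == 1:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def s(n, k):
    """Number of suffix arrays of words of length n over k letters (Theorem 1.9.4)."""
    return sum(eulerian(n, d) for d in range(k))


def s_brute_force(n, k):
    """Same quantity, obtained by listing the suffix arrays of all k^n words."""
    arrays = set()
    for word in product(range(k), repeat=n):
        arrays.add(tuple(sorted(range(n), key=lambda j: word[j:])))
    return len(arrays)


print([s(5, k) for k in range(2, 6)])  # [27, 93, 119, 120]
print(s_brute_force(5, 3))             # 93
```

The first printed list is the row $n = 5$ of the table above.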

In order to prove the theorem we need a preliminary lemma. In the following it is convenient to represent a permutation $\varphi \in S_n$ by the word $\varphi(1)\varphi(2)\cdots\varphi(n)$ on the alphabet $\{1, 2, \ldots, n\}$. Now we define a mapping that, for any $\varphi \in S_n$ and for any $s \in \{2, 3, \ldots, n+1\}$, gives a permutation $\psi \in S_{n+1}$. Such a mapping is described as a transformation on words performed in three steps.

For a permutation $\varphi(1)\varphi(2)\cdots\varphi(n)$ and an integer $s \in \{2, 3, \ldots, n+1\}$, in the first step we obtain the word

\[
E_s(\varphi) = \varphi_s(1)\varphi_s(2)\cdots\varphi_s(n),
\]

where $\varphi_s(i) = \varphi(i)$ for $\varphi(i) < s$, and $\varphi_s(i) = \varphi(i) + 1$ for $\varphi(i) \ge s$. Note that $\varphi_s(1)\varphi_s(2)\cdots\varphi_s(n)$ is a word on the alphabet $\{1, 2, \ldots, n, n+1\}$, but it does not represent a permutation, because the integer $s$ does not appear in the word. For instance, consider the permutation $\varphi \in S_6$ represented by the word $364215$ and $s = 3$. Then $E_3(\varphi) = 475216$.

In the second step $I_s$, which is the most important, we move $\varphi_s(1)$ from the first position in the word to the position $s - 1$. It is called the insertion step and it is formally defined as follows:

\[
I_s(\varphi_s(1)\varphi_s(2)\cdots\varphi_s(n)) = \varphi_s(2)\cdots\varphi_s(s-1)\,\varphi_s(1)\,\varphi_s(s)\cdots\varphi_s(n).
\]

For instance, $I_3(475216) = 745216$.

In the third step $C_s$ we simply insert the symbol $s$ in the first position of the word. For instance, $C_3(745216) = 3745216$.

The composition of the above operations defines the transformation $T(\varphi, s) = C_s(I_s(E_s(\varphi)))$. Note that the word $T(\varphi, s)$ represents a permutation of $\{1, 2, \ldots, n, n+1\}$. For instance, for $\varphi = 364215$ and $s = 3$, we have $T(\varphi, s) = 3745216$. Moreover, it is straightforward to check that, if $\varphi$ is cyclic, then $T(\varphi, s)$ is cyclic too. Therefore, if we denote by $S^c_n$ the set of cyclic permutations of $\{1, 2, \ldots, n\}$, the transformation $T$ defines a mapping

\[
T : S^c_n \times \{2, 3, \ldots, n+1\} \to S^c_{n+1}.
\]
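The three steps $E_s$, $I_s$ and $C_s$ can be written out as list operations. The sketch below uses hypothetical function names and represents a permutation by the list of its values $\varphi(1), \ldots, \varphi(n)$.

```python
def transform(phi, s):
    """T(phi, s) = C_s(I_s(E_s(phi))) for phi in S_n and 2 <= s <= n + 1."""
    n = len(phi)
    if not 2 <= s <= n + 1:
        raise ValueError("s must lie in {2, ..., n+1}")
    # E_s: shift every value >= s up by one, so that s itself is unused.
    shifted = [v if v < s else v + 1 for v in phi]
    # I_s: move the first symbol to position s - 1.
    inserted = shifted[1:s - 1] + [shifted[0]] + shifted[s - 1:]
    # C_s: put s in front.
    return [s] + inserted


def is_cyclic(perm):
    """True if the permutation consists of a single cycle."""
    j, steps = 1, 0
    while True:
        j = perm[j - 1]
        steps += 1
        if j == 1:
            return steps == len(perm)


phi = [3, 6, 4, 2, 1, 5]
print(transform(phi, 3))                            # [3, 7, 4, 5, 2, 1, 6]
print(is_cyclic(phi), is_cyclic(transform(phi, 3)))  # True True
```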

Lemma 4 The mapping $T$ is a bijection from $S^c_n \times \{2, 3, \ldots, n+1\}$ onto $S^c_{n+1}$.

Proof. We first prove that $T$ is injective by showing that, given a permutation $\psi \in S^c_{n+1}$, one can uniquely reconstruct the pair $(\varphi, s)$, with $\varphi \in S^c_n$ and $s \in \{2, \ldots, n+1\}$, such that $T(\varphi, s) = \psi$. Let $\psi = \psi(1)\psi(2)\cdots\psi(n+1)$. Since $\psi$ is a cyclic permutation, $\psi(1) \neq 1$. By the definition of $T$, $s = \psi(1)$. We delete $\psi(1) = s$ from the word $\psi(1)\psi(2)\cdots\psi(n+1)$, and we obtain the word $\psi(2)\cdots\psi(n+1)$. Then we take the element $\psi(s)$ and move it to the first position of the word. We obtain the word $\psi(s)\psi(2)\cdots\psi(s-1)\psi(s+1)\cdots\psi(n+1)$. Now we replace each $\psi(j) > s$ with $\psi(j) - 1$ and we obtain the permutation $\varphi \in S^c_n$ such that $T(\varphi, s) = \psi$. In order to show that the mapping $T$ is surjective, it suffices to verify that $\mathrm{Card}(S^c_n \times \{2, 3, \ldots, n+1\}) = \mathrm{Card}(S^c_{n+1})$. Indeed $\mathrm{Card}(S^c_n \times \{2, 3, \ldots, n+1\}) = (n-1)!\,n = n! = \mathrm{Card}(S^c_{n+1})$.
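The reconstruction used in the proof, and the cardinality argument, can be checked by exhaustive enumeration for small $n$. This is a sketch with hypothetical names; `transform` repeats the three-step definition of $T$ so that the block is self-contained.

```python
from itertools import permutations


def transform(phi, s):
    """T(phi, s): shift values >= s, move the first symbol to position s-1, prepend s."""
    shifted = [v if v < s else v + 1 for v in phi]
    return [s] + shifted[1:s - 1] + [shifted[0]] + shifted[s - 1:]


def inverse_transform(psi):
    """Recover (phi, s) from psi = T(phi, s), following the proof of Lemma 4."""
    s = psi[0]
    rest = list(psi[1:])        # psi(2) ... psi(n+1)
    moved = rest.pop(s - 2)     # psi(s) sits at 0-based index s - 2 of `rest`
    rest.insert(0, moved)       # move it to the first position
    return [v if v < s else v - 1 for v in rest], s


def is_cyclic(perm):
    j, steps = 1, 0
    while True:
        j = perm[j - 1]
        steps += 1
        if j == 1:
            return steps == len(perm)


def cyclic_permutations(n):
    return [list(p) for p in permutations(range(1, n + 1)) if is_cyclic(p)]


print(inverse_transform([3, 7, 4, 5, 2, 1, 6]))  # ([3, 6, 4, 2, 1, 5], 3)

n = 4
images = {tuple(transform(phi, s))
          for phi in cyclic_permutations(n) for s in range(2, n + 2)}
print(images == {tuple(p) for p in cyclic_permutations(n + 1)})  # True
```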

Proof of Theorem 1.9.4. According to Proposition 23, there is a bijection between the suffix arrays of words $w \in A^n$ and the cyclic permutations $\psi \in S^c_{n+1}$ such that $\mathrm{Card}(\mathrm{Des}(\psi) \setminus \{1\}) \le k - 1$. We then have to count the number of such permutations.

Let $P(n,d)$ denote the number of permutations $\psi \in S^c_{n+1}$ such that $\mathrm{Card}(\mathrm{Des}(\psi) \setminus \{1\}) = d$. To prove the theorem, we show that $P(n,d)$ is equal to the Eulerian number $\left\langle {n \atop d} \right\rangle$.

The proof is by induction on $n$. Trivially, $P(1,0) = 1 = \left\langle {1 \atop 0} \right\rangle$, and $P(1,d) = 0 = \left\langle {1 \atop d} \right\rangle$ when $d \ge 1$.

We now show that $P(n,d) = (d+1)P(n-1,d) + (n-d)P(n-1,d-1)$.

By Lemma 4, a permutation $\psi \in S^c_{n+1}$ can be obtained, through the transformation $T$, from a permutation $\varphi \in S^c_n$ with the "insertion" of an element $s \in \{2, \ldots, n+1\}$. We now examine how the transformation $T$ affects the number of descents of $\varphi$. Note that steps 1 and 3 in the definition of the transformation $T$ do not affect the number of descents. This number can be affected only in step 2 (the insertion step $I_s$). If $\varphi$ has $d$ descents in the interval $\{2, \ldots, n+1\}$, then $E_s(\varphi)$, the word obtained after the first step, also has $d$ descents, independently of the choice of $s$. We can thus factorize $E_s(\varphi)$ into $d+1$ monotonic (increasing) runs. The second step in the transformation $T$ (the insertion step $I_s$) may or may not create a new descent, depending on the position in which the first symbol $\varphi_s(1)$ of the word $E_s(\varphi)$ is inserted. In each monotonic run of $E_s(\varphi)$ there is exactly one position where $\varphi_s(1)$ can be placed without creating a new descent. Otherwise one creates exactly one new descent.

How many permutations $\psi = T(\varphi, s)$ can we obtain with $\mathrm{Card}(\mathrm{Des}(\psi) \setminus \{1\}) = d$? For each $\varphi \in S^c_n$ with $\mathrm{Card}(\mathrm{Des}(\varphi) \setminus \{1\}) = d$, we have $d+1$ possibilities to choose $s$ (because in $E_s(\varphi)$ there are $d+1$ monotonic runs). For each $\varphi \in S^c_n$ with $\mathrm{Card}(\mathrm{Des}(\varphi) \setminus \{1\}) = d-1$, we have $n-d$ possibilities to choose $s$. Since $T$ is a bijection, there is no other way to get a permutation $\psi \in S^c_{n+1}$ with $\mathrm{Card}(\mathrm{Des}(\psi) \setminus \{1\}) = d$. It follows that

\[
P(n,d) = (d+1)P(n-1,d) + (n-d)P(n-1,d-1).
\]
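The equality between $P(n,d)$ and the Eulerian numbers can be confirmed numerically for small $n$. The following sketch, with hypothetical function names, counts the cyclic permutations of $\{1, \ldots, n+1\}$ by the number of descents at positions other than 1; it enumerates all $(n+1)!$ permutations and is therefore meant for small values only.

```python
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> via the recurrence of this section."""
    if d < 0 or d >= n:
        return 0
    if n == 1:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def is_cyclic(perm):
    j, steps = 1, 0
    while True:
        j = perm[j - 1]
        steps += 1
        if j == 1:
            return steps == len(perm)


def count_P(n, d):
    """Number of cyclic psi in S_{n+1} with exactly d descents outside position 1."""
    total = 0
    for psi in permutations(range(1, n + 2)):
        if is_cyclic(psi):
            outside = sum(1 for i in range(2, n + 1) if psi[i - 1] > psi[i])
            total += outside == d
    return total


print([count_P(4, d) for d in range(4)])   # [1, 11, 11, 1]
print([eulerian(4, d) for d in range(4)])  # [1, 11, 11, 1]
```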

We now consider the problem of counting the number of words that share the same suffix array.

Theorem 1.9.5 Given a permutation $\vartheta \in S_n$, the number of words of length $n$ over an alphabet of size $k$ having $\vartheta$ as their suffix array is

\[
\binom{n+k-1-d}{k-1-d},
\]

where $d = \mathrm{Card}(\mathrm{Des}(\Psi(\vartheta)) \setminus \{1\})$.

Proof. By Theorem 1.9.3, a word $w \in A^n$, with $|A| = k$, has $\vartheta$ as suffix array if and only if $w$ has a Parikh vector $P(w) = (n_1, n_2, \ldots, n_k)$ such that

\[
\mathrm{Des}(\Psi(\vartheta)) \subseteq \{1, 1+n_1, \ldots, 1+n_1+\cdots+n_{k-1}\}.
\]

Therefore, given the permutation $\vartheta$, and hence the set

\[
D_\vartheta = \mathrm{Des}(\Psi(\vartheta)) \setminus \{1\} = \{m_1, m_2, \ldots, m_d\},
\]

we need to count the number of tuples $(n_1, \ldots, n_k)$, with $n_1 + \cdots + n_k = n$, such that

\[
D_\vartheta \subseteq \{1+n_1, 1+n_1+n_2, \ldots, 1+n_1+\cdots+n_{k-1}\}.
\]

We represent the tuple $(n_1, \ldots, n_k)$ by a word $z$ on the alphabet $\{x, y\}$:

\[
z = x^{n_1} y x^{n_2} y \cdots x^{n_{k-1}} y x^{n_k},
\]

with $n_i \ge 0$ and $n_1 + \cdots + n_k = n$. We have that $|z| = n + k - 1$. The condition $D_\vartheta = \{m_1, \ldots, m_d\} \subseteq \{1+n_1, \ldots, 1+n_1+\cdots+n_{k-1}\}$ defines the positions of $d$ occurrences of the letter $y$ in $z$. The remaining $k-1-d$ occurrences of $y$ can be placed in arbitrary positions. This can be done in

\[
\binom{n+k-1-d}{k-1-d}
\]

ways.
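Theorem 1.9.5 can be compared with an exhaustive count on small instances. The sketch below is illustrative (hypothetical function names, words encoded as tuples over $\{0, \ldots, k-1\}$): it groups all $k^n$ words by suffix array and evaluates the binomial coefficient for a given $\vartheta$.

```python
from collections import Counter
from itertools import product
from math import comb


def suffix_array(word):
    return tuple(sorted(range(1, len(word) + 1), key=lambda j: word[j - 1:]))


def psi(theta):
    n = len(theta)
    hat = [n + 1] + list(theta)
    inverse = [0] * (n + 1)
    for i, value in enumerate(hat, start=1):
        inverse[value - 1] = i
    return [inverse[value % (n + 1)] for value in hat]


def d_of(theta):
    """Card(Des(Psi(theta)) minus {1})."""
    mu = psi(theta)
    return sum(1 for i in range(2, len(mu)) if mu[i - 1] > mu[i])


def predicted_count(theta, k):
    """Number of words over k letters with suffix array theta (Theorem 1.9.5)."""
    n, d = len(theta), d_of(theta)
    return comb(n + k - 1 - d, k - 1 - d) if d <= k - 1 else 0


def observed_counts(n, k):
    """Brute force: how many words of length n over k letters share each suffix array."""
    return Counter(suffix_array(w) for w in product(range(k), repeat=n))


theta = (8, 2, 3, 6, 4, 7, 1, 5)
print(predicted_count(theta, 2), predicted_count(theta, 3))  # 1 9
print(observed_counts(8, 3)[theta])                          # 9
```

For the permutation of Example 25 we have $d = 1$, so a binary alphabet admits exactly one word (the word $baaababa$ of Example 26) and a ternary alphabet admits $\binom{9}{1} = 9$ words.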

Note that if $k - 1 < \mathrm{Card}(\mathrm{Des}(\Psi(\vartheta)) \setminus \{1\})$, there is no word on an alphabet of size $k$ which has $\vartheta$ as its suffix array. This is confirmed by Theorem 1.9.5, since $\binom{m}{n} = 0$ for $m < n$.

In the next theorem, we require that each letter of the alphabet occurs at least once in the words that we count.

Theorem 1.9.6 Given a permutation $\vartheta \in S_n$, the number of words of length $n$ over an alphabet of size $k$ that have at least one occurrence of each of the $k$ letters and have $\vartheta$ as their suffix array is

\[
\binom{n-1-d}{k-1-d},
\]

where $d = \mathrm{Card}(\mathrm{Des}(\Psi(\vartheta)) \setminus \{1\})$.

Proof. The proof of Theorem 1.9.5 is modified in order to ensure that each letter occurs at least once. In the representation of the tuple $(n_1, \ldots, n_k)$ by the word $z = x^{n_1} y x^{n_2} y \cdots x^{n_{k-1}} y x^{n_k}$, we require that the $n_i$ are strictly positive, i.e. $n_i > 0$ for $i = 1, \ldots, k$. Then we have to distribute the occurrences of the letter $y$ among the $n-1$ possible positions. As in the proof of Theorem 1.9.5, the positions of $d$ occurrences of $y$ are determined by the permutation $\vartheta$, and the remaining $k-1-d$ are distributed among the $n-1-d$ remaining positions.
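The variant with all letters present can be checked in the same brute-force manner. As before, this is only a sketch under stated assumptions: hypothetical function names, words as tuples over $\{0, \ldots, k-1\}$, and exhaustive enumeration that is practical for small $n$ and $k$ only.

```python
from collections import Counter
from itertools import product
from math import comb


def suffix_array(word):
    return tuple(sorted(range(1, len(word) + 1), key=lambda j: word[j - 1:]))


def d_of(theta):
    """Card(Des(Psi(theta)) minus {1})."""
    n = len(theta)
    hat = [n + 1] + list(theta)
    inverse = [0] * (n + 1)
    for i, value in enumerate(hat, start=1):
        inverse[value - 1] = i
    mu = [inverse[value % (n + 1)] for value in hat]
    return sum(1 for i in range(2, len(mu)) if mu[i - 1] > mu[i])


def predicted_surjective_count(theta, k):
    """Words over k letters using every letter, with suffix array theta (Theorem 1.9.6)."""
    n, d = len(theta), d_of(theta)
    return comb(n - 1 - d, k - 1 - d) if d <= k - 1 else 0


def observed_surjective_counts(n, k):
    """Brute force over the words of length n in which all k letters occur."""
    letters = set(range(k))
    return Counter(suffix_array(w) for w in product(range(k), repeat=n)
                   if set(w) == letters)


theta = (8, 2, 3, 6, 4, 7, 1, 5)
print(predicted_surjective_count(theta, 3))   # 6
print(observed_surjective_counts(8, 3)[theta])  # 6
```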

From Theorem 1.9.4 and Theorem 1.9.5 we can derive a long-known summation identity for Eulerian numbers. The identity

\[
k^n = \sum_j \left\langle {n \atop j} \right\rangle \binom{k+j}{n},
\]

as given in [21, Eq. 6.37], was proved by J. Worpitzky as early as 1883. In order to prove it, we observe that the number of words of length $n$ over an alphabet of size $k$ can be obtained by summing the number of words for each suffix array. Thus, we have:

\[
k^n = \sum_{d=0}^{k-1} \left\langle {n \atop d} \right\rangle \binom{n+k-d-1}{k-d-1}.
\]

By using the symmetry rule for Eulerian and binomial numbers, from the previous equality we derive

\[
k^n = \sum_{d=0}^{k-1} \left\langle {n \atop n-1-d} \right\rangle \binom{n+k-d-1}{n}.
\]

By setting $j = n-d-1$, we obtain

\[
k^n = \sum_{j=n-k}^{n-1} \left\langle {n \atop j} \right\rangle \binom{k+j}{n} = \sum_j \left\langle {n \atop j} \right\rangle \binom{k+j}{n},
\]

where the last equality is justified by the remark that $\left\langle {n \atop j} \right\rangle = 0$ for all $j \ge n$ and $\binom{k+j}{n} = 0$ for all $j < n-k$.
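Both forms of the identity can be checked numerically. The sketch below (hypothetical function names) evaluates the sum over $d$ coming from Theorems 1.9.4 and 1.9.5 and the classical sum over $j$, and compares each with $k^n$.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> via the recurrence of Section 1.9.2."""
    if d < 0 or d >= n:
        return 0
    if n == 1:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def sum_over_suffix_arrays(n, k):
    """Sum over d of <n, d> * C(n+k-d-1, k-d-1), for d = 0, ..., k-1."""
    return sum(eulerian(n, d) * comb(n + k - d - 1, k - d - 1) for d in range(k))


def worpitzky_sum(n, k):
    """Sum over j of <n, j> * C(k+j, n); only j = 0, ..., n-1 can contribute."""
    return sum(eulerian(n, j) * comb(k + j, n) for j in range(n))


print(sum_over_suffix_arrays(5, 3), worpitzky_sum(5, 3), 3 ** 5)  # 243 243 243
```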

References

[1] Tanja van Aardenne-Ehrenfest and Nicolaas Govert de Bruijn. Circuits and trees in oriented linear graphs. Simon Stevin, 28:203–217, 1951.

[2] Yu Hin Au. Shortest sequences containing primitive words and powers. 2013. arXiv:0904.3997.

[3] Hideo Bannai, Shunsuke Inenaga, Ayumi Shinohara, and Masayuki Takeda. Inferring strings from graphs and arrays. Volume 2747 of Lecture Notes in Computer Science, pages 208–217. Springer Berlin Heidelberg, 2003.

[4] Jean Berstel and Dominique Perrin. The origins of combinatorics on words. European J. Combin., 28(3):996–1022, 2007.

[5] Jean Berstel, Dominique Perrin, and Christophe Reutenauer. Codes and Automata. Cambridge University Press, 2009.

[6] Francine Blanchet-Sadri. Algorithmic combinatorics on partial words. Internat. J. Found. Comput. Sci., 23(6):1189–1206, 2012.

[7] Francine Blanchet-Sadri, N. C. Brownstein, Andy Kalcic, Justin Palumbo, and T. Weyand. Unavoidable sets of partial words. Theory Comput. Syst., 45(2):381–406, 2009.

[8] Carl Wilhelm Borchardt. Ueber eine der Interpolation entsprechende Darstellung der Eliminations-Resultante. J. reine angew. Math., 57:111–121, 1860.

[9] Michael Burrows and David J. Wheeler. A block sorting data compression algorithm. Technical report, DIGITAL System Research Center, 1994.

[10] Jean-Marc Champarnaud and Georges Hansel. Ensembles inévitables et classes de conjugaison. Bull. Belg. Math. Soc. Simon Stevin, 10(suppl.):679–691, 2003.

[11] Jean-Marc Champarnaud, Georges Hansel, and Dominique Perrin. Unavoidable sets of constant length. Internat. J. Algebra Comput., 14(2):241–251, 2004.

[12] Maxime Crochemore, Jacques Désarménien, and Dominique Perrin. A note on the Burrows-Wheeler transformation. Theoret. Comput. Sci., 332(1-3):567–572, 2005.

[13] Jean-Pierre Duval. Factorizing words over an ordered alphabet. J. Algorithms, 4(4):363–381, 1983.

[14] Jean-Pierre Duval. Génération d'une section des classes de conjugaison et arbre des mots de Lyndon de longueur bornée. Theoret. Comput. Sci., 60(3):255–283, 1988.

[15] Jean-Pierre Duval and Arnaud Lefebvre. Words over an ordered alphabet and suffix permutations. RAIRO Theor. Inform. Appl., 36(3):249–259, 2002.

[16] Steven R. Finch. Mathematical constants, volume 94 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 2003.

[17] Harold Fredricksen and James Maiorana. Necklaces of beads in k colors and k-ary de Bruijn sequences. Discrete Math., 23(3):207–210, 1978.

[18] Michael R. Garey and David S. Johnson. Computers and intractability. W. H. Freeman and Co., San Francisco, Calif., 1979. A guide to the theory of NP-completeness, A Series of Books in the Mathematical Sciences.

[19] Ira M. Gessel, Antonio Restivo, and Christophe Reutenauer. A bijection between words and multisets of necklaces. European Journal of Combinatorics, 33(7):1537–1546, 2012.

[20] Ira M. Gessel and Christophe Reutenauer. Counting permutations with given cycle structure and descent set. J. Combin. Theory Ser. A, 64(2):189–215, 1993.

[21] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete mathematics. Addison-Wesley Publishing Company, Reading, MA, second edition, 1994. A foundation for computer science.

[22] Roberto Grossi. A quick tour on suffix arrays and compressed suffix arrays. Theoret. Comput. Sci., 412(27):2964–2973, 2011.

[23] Peter M. Higgins. Burrows-Wheeler transformations and de Bruijn words. Theoret. Comput. Sci., 457(0):128–136, 2012.

[24] Donald E. Knuth. Oriented subtrees of an arc digraph. J. Comb. Theory, 3:309–314, 1967.

[25] Donald E. Knuth. The Art of Computer Programming, Volume 1, Fundamental Algorithms. Addison Wesley, 1968. Second edition, 1973.

[26] Donald E. Knuth. The Art of Computer Programming, Volume 4A, Combinatorial Algorithms: Part 1. Addison Wesley, 2012.

[27] Tomasz Kociumaka, Jakub Radoszewski, and Wojciech Rytter. Computing k-th Lyndon word and decoding lexicographically minimal de Bruijn sequence. In Combinatorial Pattern Matching, volume 8486 of Lecture Notes in Computer Science, pages 202–211, 2014.

[28] Gregory Kucherov, Lilla Tóthmérész, and Stéphane Vialette. On the combinatorics of suffix arrays. Inform. Process. Lett., 113(22-24):915–920, 2013.

[29] Douglas Lind and Brian H. Marcus. An Introduction to Symbolic Dynamics and Coding. Cambridge, 1995.

[30] M. Lothaire. Combinatorics on Words. Cambridge University Press, second edition, 1997. (First edition 1983).

[31] M. Lothaire. Algebraic Combinatorics on Words. Cambridge University Press, 2002.

[32] Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.

[33] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler Transform. Theoret. Comput. Sci., 387(3):298–312, 2007.

[34] Eduardo Moreno. On the theorem of Fredricksen and Maiorana about de Bruijn sequences. Adv. in Appl. Math., 33(2):413–415, 2004.

[35] Eduardo Moreno and Dominique Perrin. Corrigendum to: 'On the theorem of Fredricksen and Maiorana about de Bruijn sequences'. Adv. in Appl. Math., 2014. To appear.

[36] Johannes Mykkeltveit. A proof of Golomb's conjecture for the de Bruijn graph. J. Combinatorial Theory Ser. B, 13:40–45, 1972.

[37] Christophe Reutenauer. Free Lie algebras. The Clarendon Press Oxford University Press, New York, 1993. Oxford Science Publications.

[38] Klaus-Bernd Schürmann and Jens Stoye. Counting suffix arrays and strings. Theor. Comput. Sci., pages 220–234, 2008.

[39] Arseny M. Shur. Growth of power-free languages over large alphabets. Theory Comput. Syst., 54(2):224–243, 2014.

[40] Cedric A. Smith and William T. Tutte. On unicursal paths in a network of degree 4. Amer. Math. Monthly, 48, 1941.

[41] Richard P. Stanley. Enumerative combinatorics. Vol. 1. Cambridge University Press, Cambridge, 1997.