1
Some simple varieties of trees arising in permutation analysis Mathilde Bouvel Marni Mishna Cyril Nicaud LaBRI & CNRS Dept. Mathematics LIGM & CNRS Université Bordeaux Simon Fraser University Université Paris-Est [email protected] [email protected] [email protected] Challenge Understand the convergence of a nested family of trees towards permutations Permutations as a limit of tree classes In this work we define a nested family of parameterized trees P (k) whose limit P is in a straightforward bijection with permutations. The trees are decorated with subclasses of simple permutations and are known as the strong interval trees of permutations. We give a specification for P (k) , and hence have easy access to asymptotic counting and parameter formulas, as well as random generation. The limits in k and n do not commute, and this leads to some interesting behaviour in the asymptotics. lim k→∞ P (k) n = P n = n! Simple Permutations Simple permutation property: no set of consecutive numbers are together in a block. σ = 79 42513 86 =⇒ not simple π = 842713695 =⇒ simple S = set of all simple permutations = n 2413, 3142, 24153, 25314, 35142, 31524,... o S (z) = 2z 4 + 6z 5 + 46z 6 + 338z 7 + 2926z 8 + 28146z 9 + ... The number of simple permutations of size n: s n n!/e 2 . [2] oeis.org/A111111 Open problem: Describe a “nice” generation scheme for simple permutations (i.e. not rejection). P: The class of Strong Interval Trees 1. Size is the number of leaves 2. Three colours for internal vertices: Linear nodes: children are totally ordered. Prime nodes: children are totally ordered, and are decorated with a simple permutation 3. There is only choice between linear nodes at the root, or as a child of a prime node. P = Z + N · SEQ 2 U + N · SEQ 2 U + N · S (P), U = Z + N · SEQ 2 U + N · S (P), U = Z + N · SEQ 2 U + N · S (P). Lemma. (Simplification of system) P SEQ 1 U U Z + Λ(U) with Λ(z) = z 2 1 - z + S z 1 - z · B Λ(z) is not analytic What do the trees look like? σ = 1234567 (identity) σ = 7654321 (reverse identity) σ = 3571426 (simple) 3571426 A random permutation: (recall, 1 9 are simple) σ = 6 7 9 10 11 13 8 12 3 1 5 4 2 6 7 2413 9 10 11 13 8 12 3142 3 1 5 4 2 This tree models instances of perfect sorting by reversals [4] Prime node degree restricted trees: A simple variety of trees P (k) -trees: The sub-class of P where prime nodes have at most k children. This class is a simple variety of trees for each case and hence we have asymptotic enumeration and random generation for “free”. P (k) SEQ 1 U (k) U (k) Z + SEQ 2 (U (k) ) + 4 (P (k) ) 4 +···+ s k (P (k) ) k A random tree from P (7) of size approx 1000 constructed from a Boltzmann generator The real comparisons yield permutations of length 1150, and a Boltzmann sampler allows us to generate restricted trees of the Asymptotic enumeration Use adapted inversion formula [8] on U (k) = Z + Λ k (U (k) ) with Λ k (z) = z 1-z 2 + k j =4 s j ( z 1-z ) j Theorem. U (k) n P (k) n s ρ k 2πΛ 00 k (τ k ) · ρ -n k n 3/2 Here: Λ 0 k (τ k ) = 1 ρ k = Λ k (τ k ) ρ k < τ k < e k Remark lim k→∞ ρ k = 0 Bounds on ρ k and a familiar limit Theorem. ρ k = e k 1 - 5 2 log k k + Θ( 1 k ) · We see how trees become permutations when k = n: s ρ k 2πΛ 00 k (τ k ) · ρ -n k n -3/2 e 41 2 k e n 1 + 5 2 log k k + Θ 1 k ¶¶ n n -3/2 . Trees γρ -n n -3/2 Permutations (n/e) n p 2πn Subtleties in the asymptotics For any fixed k, as n →∞ this asymptotic formula is very accurate. However, P (2n) n = n!, but the tree formula produces an extra exponential factor of 2. Taking additional terms in the expansion does not help. Rather, new contours are required in the intermediate integrals, demonstrating the limits of the inversion formula. Tree Anatomy Theorem. Average parameter values for trees in P (k) n # vertices of arity j λ j τ j k ρ k · n # of internal vertices τ k -ρ k ρ k · n Sum of subtree size q π 2ρ k Λ 00 k (τ k ) · n 3/2 Application to genome reconstruction This permutation data structure models the analysis of perfect sorting by reversals, which is used in genome comparison. Tree parameters such as the number of internal nodes have direct interpretations on the evolutionary scenarios they represent [6]. We aspire to describe better model for the permutations arising in actual genome comparisons, and this is a first step towards that goal. References [1] M.H. Albert and M.D. Atkinson. Simple permutations and pattern restricted permutations. Discrete Math., 300:1–15, 2005. [2] M.H. Albert, M.D. Atkinson, and M. Klazar. The enumeration of simple permutations. J. Integer Seq., 6:03.4.4, 2003. [3] S. Bérard, A. Bergeron, C. Chauve, and C. Paul. Perfect sorting by re- versals is not always difficult. IEEE/ACM Trans. Comput. Biol. Bioinform., 4:4–16, 2007. [4] A. Bergeron, C. Chauve, F. de Montgolfier, and M. Raffinot. Comput- ing common intervals of K permutations, with applications to modular decomposition of graphs. Proc. 13th Annual European Symposium on Algorithms, in Lecture Notes in Comput. Sci., 3669:779–790, 2005. [5] K. S. Booth and G. S. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. J. Comput. System Sci., 13(3):335–379, 1976. [6] M. Bouvel, C. Chauve, M. Mishna, and D. Rossin. Average-case analysis of perfect sorting by reversals. Discrete Math. Algorithms Appl., 3(3):369– 392, 2011. [7] G. Chapuy, A. Pierrot, D. Rossin. On growth rate of wreath-closed permutation classes. Talk at the conference Permutation Patterns, 2011. [8] P. Flajolet and R. Sedgewick. Analytic Combinatorics. Cambridge Univer- sity Press, 2009. Acknowledgements. MM gratefully acknowledges the support of NSERC Dis- covery grant, and both LIGM and LaBRI for hosting during the work. MB + CN were partially supported by ANR project M (2010-BLAN-0204). We are also thankful to Cedric Chauve, Carine Pivoteau, and Rosemary Mc- Closkey for work related to this project.

Common interval trees and applications to comparative genomicsuser.math.uzh.ch/bouvel/presentations/Poster2013_BMN.pdf · tion of combinatorial structures. Combin. Probab. Comput

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Common interval trees and applications to comparative genomicsuser.math.uzh.ch/bouvel/presentations/Poster2013_BMN.pdf · tion of combinatorial structures. Combin. Probab. Comput

Some simple varieties of trees arising in permutation analysisMathilde Bouvel Marni Mishna Cyril NicaudLaBRI & CNRS Dept. Mathematics LIGM & CNRSUniversité Bordeaux Simon Fraser University Université Paris-Est

[email protected] [email protected] [email protected]

Challenge

Understand the convergence of anested family of trees towards

permutations

Permutations as a limit of tree classesIn this work we define a nested family of parameterized trees P(k) whose limit P is in astraightforward bijection with permutations. The trees are decorated with subclasses of simplepermutations and are known as the strong interval trees of permutations. We give a specificationfor P(k), and hence have easy access to asymptotic counting and parameter formulas, as well asrandom generation. The limits in k and n do not commute, and this leads to some interestingbehaviour in the asymptotics.

limk→∞P(k)n =Pn= n!

Simple PermutationsSimple permutation property: no set of consecutive numbersare together in a block.

σ=794251386 =⇒ not simple

π=842713695 =⇒ simple

S= set of all simple permutations=

{2413,3142,24153,25314,35142,31524, . . .

}S(z) =2z4+6z5+46z6+338z7+2926z8+28146z9+ . . .

The number of simple permutations of size n: sn ∼ n!/e2. [2]oeis.org/A111111

Open problem: Describe a “nice” generation scheme for simplepermutations (i.e. not rejection).

P: The class of Strong Interval Trees1. Size is the number of leaves

2. Three colours for internal vertices:

Linear nodes: •• children are totally ordered.Prime nodes: • children are totally ordered, and aredecorated with a simple permutation

3. There is only choice between linear nodes at the root, oras a child of a prime node.

P=Z�+N• ·SEQ≥2U•+N• ·SEQ≥2U•+N• · S(P),U• =Z�+N• ·SEQ≥2U•+N• · S(P),U• =Z�+N• ·SEQ≥2U•+N• · S(P).

Lemma. (Simplification of system)

P≡ SEQ≥1U

U≡Z+Λ(U)with Λ(z) = z2

1−z +S( z1−z

)

B Λ(z) is not analyticWhat do the trees look like?

σ=1234567 (identity)•

� � � � � � �

σ=7654321 (reverse identity)•

� � � � � � �

σ=3571426 (simple)•3571426

� � � � � � �

A random permutation: (recall, ∼ 19 are simple)

Common interval trees and applications to comparative genomicsRosemary McCloskey, Cedric Chauve, and Marni Mishna

[email protected], [email protected], [email protected]

Comparative GenomicsOur problem arises from comparative genomics. Given somegroup of species, we would like to know how each evolved fromtheir common ancestor.

?

?

Genomes as PermutationsA genome is a sequence of genes, each represented by a numberand an orientation.

The genomes of any two mammals contain a similar set of genes;only the order and orientation di!er. Therefore, we can think ofa comparison between two genomes as a signed permutation.

= !"1

!"2

!"3

!"4

!"5

= !"3

#!1

#!2

#!5

!"4

Common Interval TreesA permutation can be uniquely represented by a common inter-val tree. Each internal node represents an interval - a segmentof the permutation which can be rearranged to form the identity.

" =!"1!"2!"3!"4!"5 " !"

3#!1#!2#!5!"4

12345

321 54

3 12

-1 -2

-5 4

These trees have three types of nodes.

Type Combinatorially... Biologically...

leaf atom (counts size) gene

Q-node ± identity conserved segment

P-node simple permutation conserved, but scrambled

Boltzmann SamplersA Boltzmann sampler [1] is an approximate size, uniform random generator. We sacrifice some control over the size, in exchangefor O(n) generation of most objects (including common interval trees).

This type of random generator allows us to

• quickly generate and examine a large set of trees,

• compare real trees to generated ones comparable in size.

First ModelA random common interval tree (that is, one constructed from a random permutation) usually takes the following shape.

When comparing real genomes, a P -node is a rare occurence. When they do show up, P -nodes have very low degree. A naturalfirst step in modelling these scenarios is to restrict the allowed degree of P -nodes, with a Boltzmann sampler.

The real comparisons yield permutations of length $ 1150, and a Boltzmann sampler allows us to generate restricted trees of thesame size.

Tree ParametersThe Boltzmann sampler allows us to generate many trees of large size and examine some of their properties. These can have bothcombinatorial and biological significance.

For example, the pathlength of a tree is defined as the sum of distances from each leaf to the root.

Total pathlength = 2 + 2 + 1 = 5.

The methods of analytic combinatorics [2] lead to an asymptotic estimate for pathlength, which we were able to recover byexamining randomly generated trees.

Biologically, pathlength is a rough indicator of evolutionary distance.

Improving the ModelRestricting P -node degree is not enough to model real com-parisons - a real tree looks nothing like the results of thoserestrictions.

In the Boltzmann model, the degree of a node is chosen with aprobability based on the combinatorial specification. Instead,we can try choosing the degree of a node based on its height inthe tree, and the degrees at that height in the real trees. Thisapproach sacrifices uniformity, but the correlation with realtrees improves dramatically.

Comparison of ModelsWhen compared to real trees, the trees generated with theBoltzmann sampler match fairly well, although there are someparameters which are di"cult to simulate.

In a comparison of human and mouse genomes, Boltzmanngeneration matched real data more closely than an existingmodel called DCJ simulations.

The real human-mouse comparison.

Generated with the Boltzmann sampler.

Constructed via DCJ simulations.

Future WorkMatching parameters of the real trees seems like a promisingapproach. An attempt to match a di!erent parameter mightprove useful, although one must be careful to avoid simplycopying the real trees.

With advances in genome sequencing, additional referencegenomes, when made available, could allow a more accurate fitto the distribution of degrees.

ReferencesReferences

[1] Philippe Duchon, Philippe Flajolet, Guy Louchard, andGilles Schae!er. Boltzmann samplers for the random genera-tion of combinatorial structures. Combin. Probab. Comput.,13(4-5):577–625, 2004.

[2] Philippe Flajolet and Robert Sedgewick. Analytic combina-torics. Cambridge University Press, Cambridge, 2009.

σ=6 7 9 10 11 13 8 12 3 1 5 4 2

�6

�7

• 2413

�9

�10

�11

�13

�8

�12

• 3142

�3

�1

�5

�4

�2

This tree models instances of perfect sorting by reversals [4]

Prime node degree restricted trees: A simple variety of treesP(k)-trees: The sub-class of P where prime nodes have at most k children. This class is a simple variety of trees for each caseand hence we have asymptotic enumeration and random generation for “free”.

P(k) ≡ SEQ≥1U(k)

U(k) ≡Z+SEQ≥2(U(k))+4 (P(k))4+·· ·+ sk (P(k))k

A random tree from P(7) of size approx 1000 constructed from a Boltzmann generator

Common interval trees and applications to comparative genomicsRosemary McCloskey, Cedric Chauve, and Marni Mishna

[email protected], [email protected], [email protected]

Comparative GenomicsOur problem arises from comparative genomics. Given somegroup of species, we would like to know how each evolved fromtheir common ancestor.

?

?

Genomes as PermutationsA genome is a sequence of genes, each represented by a numberand an orientation.

The genomes of any two mammals contain a similar set of genes;only the order and orientation di!er. Therefore, we can think ofa comparison between two genomes as a signed permutation.

= !"1

!"2

!"3

!"4

!"5

= !"3

#!1

#!2

#!5

!"4

Common Interval TreesA permutation can be uniquely represented by a common inter-val tree. Each internal node represents an interval - a segmentof the permutation which can be rearranged to form the identity.

" =!"1!"2!"3!"4!"5 " !"

3#!1#!2#!5!"4

12345

321 54

3 12

-1 -2

-5 4

These trees have three types of nodes.

Type Combinatorially... Biologically...

leaf atom (counts size) gene

Q-node ± identity conserved segment

P-node simple permutation conserved, but scrambled

Boltzmann SamplersA Boltzmann sampler [1] is an approximate size, uniform random generator. We sacrifice some control over the size, in exchangefor O(n) generation of most objects (including common interval trees).

This type of random generator allows us to

• quickly generate and examine a large set of trees,

• compare real trees to generated ones comparable in size.

First ModelA random common interval tree (that is, one constructed from a random permutation) usually takes the following shape.

When comparing real genomes, a P -node is a rare occurence. When they do show up, P -nodes have very low degree. A naturalfirst step in modelling these scenarios is to restrict the allowed degree of P -nodes, with a Boltzmann sampler.

The real comparisons yield permutations of length $ 1150, and a Boltzmann sampler allows us to generate restricted trees of thesame size.

Tree ParametersThe Boltzmann sampler allows us to generate many trees of large size and examine some of their properties. These can have bothcombinatorial and biological significance.

For example, the pathlength of a tree is defined as the sum of distances from each leaf to the root.

Total pathlength = 2 + 2 + 1 = 5.

The methods of analytic combinatorics [2] lead to an asymptotic estimate for pathlength, which we were able to recover byexamining randomly generated trees.

Biologically, pathlength is a rough indicator of evolutionary distance.

Improving the ModelRestricting P -node degree is not enough to model real com-parisons - a real tree looks nothing like the results of thoserestrictions.

In the Boltzmann model, the degree of a node is chosen with aprobability based on the combinatorial specification. Instead,we can try choosing the degree of a node based on its height inthe tree, and the degrees at that height in the real trees. Thisapproach sacrifices uniformity, but the correlation with realtrees improves dramatically.

Comparison of ModelsWhen compared to real trees, the trees generated with theBoltzmann sampler match fairly well, although there are someparameters which are di"cult to simulate.

In a comparison of human and mouse genomes, Boltzmanngeneration matched real data more closely than an existingmodel called DCJ simulations.

The real human-mouse comparison.

Generated with the Boltzmann sampler.

Constructed via DCJ simulations.

Future WorkMatching parameters of the real trees seems like a promisingapproach. An attempt to match a di!erent parameter mightprove useful, although one must be careful to avoid simplycopying the real trees.

With advances in genome sequencing, additional referencegenomes, when made available, could allow a more accurate fitto the distribution of degrees.

ReferencesReferences

[1] Philippe Duchon, Philippe Flajolet, Guy Louchard, andGilles Schae!er. Boltzmann samplers for the random genera-tion of combinatorial structures. Combin. Probab. Comput.,13(4-5):577–625, 2004.

[2] Philippe Flajolet and Robert Sedgewick. Analytic combina-torics. Cambridge University Press, Cambridge, 2009.

Asymptotic enumerationUse adapted inversion formula [8] on U(k) =Z+Λk(U(k)) withΛk(z) = z

1−z2 +∑kj=4 sj(

z1−z )j

Theorem.

U(k)n ∼P(k)

n ∼√

ρk2πΛ′′

k(τk)·ρ−nkn3/2

Here:

Λ′k(τk) =1 ρk =Λk(τk) ρk < τk <

ek

Remark limk→∞ρk =0

Bounds on ρk and a familiar limitTheorem.

ρk= ek

(1− 52

logkk +Θ(1k )

)We see how trees become permutations when k= n:√

ρk2πΛ′′

k(τk)·ρ−nk n−3/2 ≤

( e4kπ

) 12

(ke

)n (1+ 52

logkk +Θ

(1k

))nn−3/2.

Treesγρ−nn−3/2 → Permutations

(n/e)np2πn

Subtleties in the asymptoticsFor any fixed k, as n → ∞ this asymptotic formula is veryaccurate. However, P(2n)

n = n!, but the tree formula produces anextra exponential factor of 2. Taking additional terms in theexpansion does not help. Rather, new contours are requiredin the intermediate integrals, demonstrating the limits of theinversion formula.

Tree AnatomyTheorem. Average parameter values for trees in P(k)

n

# vertices of arity j λjτjkρk ·n

# of internal vertices τk−ρkρk ·n

Sum of subtree size√ π2ρkΛ′′

k(τk)·n3/2

Application to genome reconstructionThis permutation data structure models the analysis of perfectsorting by reversals, which is used in genome comparison. Treeparameters such as the number of internal nodes have directinterpretations on the evolutionary scenarios they represent [6].We aspire to describe better model for the permutations arisingin actual genome comparisons, and this is a first step towardsthat goal.

References[1] M.H. Albert and M.D. Atkinson. Simple permutations and pattern restrictedpermutations. Discrete Math., 300:1–15, 2005.

[2] M.H. Albert, M.D. Atkinson, and M. Klazar. The enumeration of simplepermutations. J. Integer Seq., 6:03.4.4, 2003.

[3] S. Bérard, A. Bergeron, C. Chauve, and C. Paul. Perfect sorting by re-versals is not always difficult. IEEE/ACM Trans. Comput. Biol. Bioinform.,4:4–16, 2007.

[4] A. Bergeron, C. Chauve, F. de Montgolfier, and M. Raffinot. Comput-ing common intervals of K permutations, with applications to modulardecomposition of graphs. Proc. 13th Annual European Symposium onAlgorithms, in Lecture Notes in Comput. Sci., 3669:779–790, 2005.

[5] K. S. Booth and G. S. Lueker. Testing for the consecutive ones property,interval graphs, and graph planarity using PQ-tree algorithms. J. Comput.System Sci., 13(3):335–379, 1976.

[6] M. Bouvel, C. Chauve, M. Mishna, and D. Rossin. Average-case analysis ofperfect sorting by reversals. Discrete Math. Algorithms Appl., 3(3):369–392, 2011.

[7] G. Chapuy, A. Pierrot, D. Rossin. On growth rate of wreath-closedpermutation classes. Talk at the conference Permutation Patterns, 2011.

[8] P. Flajolet and R. Sedgewick. Analytic Combinatorics. Cambridge Univer-sity Press, 2009.

Acknowledgements. MM gratefully acknowledges the support of NSERC Dis-covery grant, and both LIGM and LaBRI for hosting during the work. MB+ CN were partially supported by ANR project M (2010-BLAN-0204).We are also thankful to Cedric Chauve, Carine Pivoteau, and Rosemary Mc-Closkey for work related to this project.