Transcript
Page 1: Common interval trees and applications to comparative genomicsuser.math.uzh.ch/bouvel/presentations/Poster2013_BMN.pdf · tion of combinatorial structures. Combin. Probab. Comput

Some simple varieties of trees arising in permutation analysisMathilde Bouvel Marni Mishna Cyril NicaudLaBRI & CNRS Dept. Mathematics LIGM & CNRSUniversité Bordeaux Simon Fraser University Université Paris-Est

[email protected] [email protected] [email protected]

Challenge

Understand the convergence of anested family of trees towards

permutations

Permutations as a limit of tree classesIn this work we define a nested family of parameterized trees P(k) whose limit P is in astraightforward bijection with permutations. The trees are decorated with subclasses of simplepermutations and are known as the strong interval trees of permutations. We give a specificationfor P(k), and hence have easy access to asymptotic counting and parameter formulas, as well asrandom generation. The limits in k and n do not commute, and this leads to some interestingbehaviour in the asymptotics.

limk→∞P(k)n =Pn= n!

Simple PermutationsSimple permutation property: no set of consecutive numbersare together in a block.

σ=794251386 =⇒ not simple

π=842713695 =⇒ simple

S= set of all simple permutations=

{2413,3142,24153,25314,35142,31524, . . .

}S(z) =2z4+6z5+46z6+338z7+2926z8+28146z9+ . . .

The number of simple permutations of size n: sn ∼ n!/e2. [2]oeis.org/A111111

Open problem: Describe a “nice” generation scheme for simplepermutations (i.e. not rejection).

P: The class of Strong Interval Trees1. Size is the number of leaves

2. Three colours for internal vertices:

Linear nodes: •• children are totally ordered.Prime nodes: • children are totally ordered, and aredecorated with a simple permutation

3. There is only choice between linear nodes at the root, oras a child of a prime node.

P=Z�+N• ·SEQ≥2U•+N• ·SEQ≥2U•+N• · S(P),U• =Z�+N• ·SEQ≥2U•+N• · S(P),U• =Z�+N• ·SEQ≥2U•+N• · S(P).

Lemma. (Simplification of system)

P≡ SEQ≥1U

U≡Z+Λ(U)with Λ(z) = z2

1−z +S( z1−z

)

B Λ(z) is not analyticWhat do the trees look like?

σ=1234567 (identity)•

� � � � � � �

σ=7654321 (reverse identity)•

� � � � � � �

σ=3571426 (simple)•3571426

� � � � � � �

A random permutation: (recall, ∼ 19 are simple)

Common interval trees and applications to comparative genomicsRosemary McCloskey, Cedric Chauve, and Marni Mishna

[email protected], [email protected], [email protected]

Comparative GenomicsOur problem arises from comparative genomics. Given somegroup of species, we would like to know how each evolved fromtheir common ancestor.

?

?

Genomes as PermutationsA genome is a sequence of genes, each represented by a numberand an orientation.

The genomes of any two mammals contain a similar set of genes;only the order and orientation di!er. Therefore, we can think ofa comparison between two genomes as a signed permutation.

= !"1

!"2

!"3

!"4

!"5

= !"3

#!1

#!2

#!5

!"4

Common Interval TreesA permutation can be uniquely represented by a common inter-val tree. Each internal node represents an interval - a segmentof the permutation which can be rearranged to form the identity.

" =!"1!"2!"3!"4!"5 " !"

3#!1#!2#!5!"4

12345

321 54

3 12

-1 -2

-5 4

These trees have three types of nodes.

Type Combinatorially... Biologically...

leaf atom (counts size) gene

Q-node ± identity conserved segment

P-node simple permutation conserved, but scrambled

Boltzmann SamplersA Boltzmann sampler [1] is an approximate size, uniform random generator. We sacrifice some control over the size, in exchangefor O(n) generation of most objects (including common interval trees).

This type of random generator allows us to

• quickly generate and examine a large set of trees,

• compare real trees to generated ones comparable in size.

First ModelA random common interval tree (that is, one constructed from a random permutation) usually takes the following shape.

When comparing real genomes, a P -node is a rare occurence. When they do show up, P -nodes have very low degree. A naturalfirst step in modelling these scenarios is to restrict the allowed degree of P -nodes, with a Boltzmann sampler.

The real comparisons yield permutations of length $ 1150, and a Boltzmann sampler allows us to generate restricted trees of thesame size.

Tree ParametersThe Boltzmann sampler allows us to generate many trees of large size and examine some of their properties. These can have bothcombinatorial and biological significance.

For example, the pathlength of a tree is defined as the sum of distances from each leaf to the root.

Total pathlength = 2 + 2 + 1 = 5.

The methods of analytic combinatorics [2] lead to an asymptotic estimate for pathlength, which we were able to recover byexamining randomly generated trees.

Biologically, pathlength is a rough indicator of evolutionary distance.

Improving the ModelRestricting P -node degree is not enough to model real com-parisons - a real tree looks nothing like the results of thoserestrictions.

In the Boltzmann model, the degree of a node is chosen with aprobability based on the combinatorial specification. Instead,we can try choosing the degree of a node based on its height inthe tree, and the degrees at that height in the real trees. Thisapproach sacrifices uniformity, but the correlation with realtrees improves dramatically.

Comparison of ModelsWhen compared to real trees, the trees generated with theBoltzmann sampler match fairly well, although there are someparameters which are di"cult to simulate.

In a comparison of human and mouse genomes, Boltzmanngeneration matched real data more closely than an existingmodel called DCJ simulations.

The real human-mouse comparison.

Generated with the Boltzmann sampler.

Constructed via DCJ simulations.

Future WorkMatching parameters of the real trees seems like a promisingapproach. An attempt to match a di!erent parameter mightprove useful, although one must be careful to avoid simplycopying the real trees.

With advances in genome sequencing, additional referencegenomes, when made available, could allow a more accurate fitto the distribution of degrees.

ReferencesReferences

[1] Philippe Duchon, Philippe Flajolet, Guy Louchard, andGilles Schae!er. Boltzmann samplers for the random genera-tion of combinatorial structures. Combin. Probab. Comput.,13(4-5):577–625, 2004.

[2] Philippe Flajolet and Robert Sedgewick. Analytic combina-torics. Cambridge University Press, Cambridge, 2009.

σ=6 7 9 10 11 13 8 12 3 1 5 4 2

�6

�7

• 2413

�9

�10

�11

�13

�8

�12

• 3142

�3

�1

�5

�4

�2

This tree models instances of perfect sorting by reversals [4]

Prime node degree restricted trees: A simple variety of treesP(k)-trees: The sub-class of P where prime nodes have at most k children. This class is a simple variety of trees for each caseand hence we have asymptotic enumeration and random generation for “free”.

P(k) ≡ SEQ≥1U(k)

U(k) ≡Z+SEQ≥2(U(k))+4 (P(k))4+·· ·+ sk (P(k))k

A random tree from P(7) of size approx 1000 constructed from a Boltzmann generator

Common interval trees and applications to comparative genomicsRosemary McCloskey, Cedric Chauve, and Marni Mishna

[email protected], [email protected], [email protected]

Comparative GenomicsOur problem arises from comparative genomics. Given somegroup of species, we would like to know how each evolved fromtheir common ancestor.

?

?

Genomes as PermutationsA genome is a sequence of genes, each represented by a numberand an orientation.

The genomes of any two mammals contain a similar set of genes;only the order and orientation di!er. Therefore, we can think ofa comparison between two genomes as a signed permutation.

= !"1

!"2

!"3

!"4

!"5

= !"3

#!1

#!2

#!5

!"4

Common Interval TreesA permutation can be uniquely represented by a common inter-val tree. Each internal node represents an interval - a segmentof the permutation which can be rearranged to form the identity.

" =!"1!"2!"3!"4!"5 " !"

3#!1#!2#!5!"4

12345

321 54

3 12

-1 -2

-5 4

These trees have three types of nodes.

Type Combinatorially... Biologically...

leaf atom (counts size) gene

Q-node ± identity conserved segment

P-node simple permutation conserved, but scrambled

Boltzmann SamplersA Boltzmann sampler [1] is an approximate size, uniform random generator. We sacrifice some control over the size, in exchangefor O(n) generation of most objects (including common interval trees).

This type of random generator allows us to

• quickly generate and examine a large set of trees,

• compare real trees to generated ones comparable in size.

First ModelA random common interval tree (that is, one constructed from a random permutation) usually takes the following shape.

When comparing real genomes, a P -node is a rare occurence. When they do show up, P -nodes have very low degree. A naturalfirst step in modelling these scenarios is to restrict the allowed degree of P -nodes, with a Boltzmann sampler.

The real comparisons yield permutations of length $ 1150, and a Boltzmann sampler allows us to generate restricted trees of thesame size.

Tree ParametersThe Boltzmann sampler allows us to generate many trees of large size and examine some of their properties. These can have bothcombinatorial and biological significance.

For example, the pathlength of a tree is defined as the sum of distances from each leaf to the root.

Total pathlength = 2 + 2 + 1 = 5.

The methods of analytic combinatorics [2] lead to an asymptotic estimate for pathlength, which we were able to recover byexamining randomly generated trees.

Biologically, pathlength is a rough indicator of evolutionary distance.

Improving the ModelRestricting P -node degree is not enough to model real com-parisons - a real tree looks nothing like the results of thoserestrictions.

In the Boltzmann model, the degree of a node is chosen with aprobability based on the combinatorial specification. Instead,we can try choosing the degree of a node based on its height inthe tree, and the degrees at that height in the real trees. Thisapproach sacrifices uniformity, but the correlation with realtrees improves dramatically.

Comparison of ModelsWhen compared to real trees, the trees generated with theBoltzmann sampler match fairly well, although there are someparameters which are di"cult to simulate.

In a comparison of human and mouse genomes, Boltzmanngeneration matched real data more closely than an existingmodel called DCJ simulations.

The real human-mouse comparison.

Generated with the Boltzmann sampler.

Constructed via DCJ simulations.

Future WorkMatching parameters of the real trees seems like a promisingapproach. An attempt to match a di!erent parameter mightprove useful, although one must be careful to avoid simplycopying the real trees.

With advances in genome sequencing, additional referencegenomes, when made available, could allow a more accurate fitto the distribution of degrees.

ReferencesReferences

[1] Philippe Duchon, Philippe Flajolet, Guy Louchard, andGilles Schae!er. Boltzmann samplers for the random genera-tion of combinatorial structures. Combin. Probab. Comput.,13(4-5):577–625, 2004.

[2] Philippe Flajolet and Robert Sedgewick. Analytic combina-torics. Cambridge University Press, Cambridge, 2009.

Asymptotic enumerationUse adapted inversion formula [8] on U(k) =Z+Λk(U(k)) withΛk(z) = z

1−z2 +∑kj=4 sj(

z1−z )j

Theorem.

U(k)n ∼P(k)

n ∼√

ρk2πΛ′′

k(τk)·ρ−nkn3/2

Here:

Λ′k(τk) =1 ρk =Λk(τk) ρk < τk <

ek

Remark limk→∞ρk =0

Bounds on ρk and a familiar limitTheorem.

ρk= ek

(1− 52

logkk +Θ(1k )

)We see how trees become permutations when k= n:√

ρk2πΛ′′

k(τk)·ρ−nk n−3/2 ≤

( e4kπ

) 12

(ke

)n (1+ 52

logkk +Θ

(1k

))nn−3/2.

Treesγρ−nn−3/2 → Permutations

(n/e)np2πn

Subtleties in the asymptoticsFor any fixed k, as n → ∞ this asymptotic formula is veryaccurate. However, P(2n)

n = n!, but the tree formula produces anextra exponential factor of 2. Taking additional terms in theexpansion does not help. Rather, new contours are requiredin the intermediate integrals, demonstrating the limits of theinversion formula.

Tree AnatomyTheorem. Average parameter values for trees in P(k)

n

# vertices of arity j λjτjkρk ·n

# of internal vertices τk−ρkρk ·n

Sum of subtree size√ π2ρkΛ′′

k(τk)·n3/2

Application to genome reconstructionThis permutation data structure models the analysis of perfectsorting by reversals, which is used in genome comparison. Treeparameters such as the number of internal nodes have directinterpretations on the evolutionary scenarios they represent [6].We aspire to describe better model for the permutations arisingin actual genome comparisons, and this is a first step towardsthat goal.

References[1] M.H. Albert and M.D. Atkinson. Simple permutations and pattern restrictedpermutations. Discrete Math., 300:1–15, 2005.

[2] M.H. Albert, M.D. Atkinson, and M. Klazar. The enumeration of simplepermutations. J. Integer Seq., 6:03.4.4, 2003.

[3] S. Bérard, A. Bergeron, C. Chauve, and C. Paul. Perfect sorting by re-versals is not always difficult. IEEE/ACM Trans. Comput. Biol. Bioinform.,4:4–16, 2007.

[4] A. Bergeron, C. Chauve, F. de Montgolfier, and M. Raffinot. Comput-ing common intervals of K permutations, with applications to modulardecomposition of graphs. Proc. 13th Annual European Symposium onAlgorithms, in Lecture Notes in Comput. Sci., 3669:779–790, 2005.

[5] K. S. Booth and G. S. Lueker. Testing for the consecutive ones property,interval graphs, and graph planarity using PQ-tree algorithms. J. Comput.System Sci., 13(3):335–379, 1976.

[6] M. Bouvel, C. Chauve, M. Mishna, and D. Rossin. Average-case analysis ofperfect sorting by reversals. Discrete Math. Algorithms Appl., 3(3):369–392, 2011.

[7] G. Chapuy, A. Pierrot, D. Rossin. On growth rate of wreath-closedpermutation classes. Talk at the conference Permutation Patterns, 2011.

[8] P. Flajolet and R. Sedgewick. Analytic Combinatorics. Cambridge Univer-sity Press, 2009.

Acknowledgements. MM gratefully acknowledges the support of NSERC Dis-covery grant, and both LIGM and LaBRI for hosting during the work. MB+ CN were partially supported by ANR project M (2010-BLAN-0204).We are also thankful to Cedric Chauve, Carine Pivoteau, and Rosemary Mc-Closkey for work related to this project.

Recommended