Page 1: Genome Rearrangements

5. Lecture WS 2003/04

Bioinformatics III 1

Genome Rearrangements

Compared to other areas of bioinformatics, we still know very little about the rearrangement events that produced the existing varieties of genomic architectures ...

Some material of this lecture is borrowed from:

Nipun Mehra, www.stanford.edu/class/cs374/Notes/lec17.ppt

www.sna.csie.ndhu.edu.tw/~lung/seminar/20020502.ppt

Bafna V. and P.A. Pevzner. "Sorting by reversals: genome rearrangements in plant organelles and evolutionary history of X chromosome."

Hannenhalli S. and P.A. Pevzner. "Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals."

P.A. Pevzner. "Computational Molecular Biology", MIT Press, chapter 10.

Page 2: Genome Rearrangements


Processes of Evolution

- Substitution

- Insertion

- Deletion

- Translocation

- Inversion/ Reversal

- Duplication

Page 3: Genome Rearrangements


What is a reversal = inversion ?

Break and Invert

A T G C C T G T A C T A

T A C G G A C A T G A T

A T G T A C A G G C T A

T A C A T G T C C G A T

• Purines (A, G) and pyrimidines (C, T) switch strands.

• Many organisms have highly similar genes but very different gene orders.

• Very prominent in prokaryotes, mitochondrial DNA and the mammalian X chromosome.
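The break-and-invert picture above can be reproduced in a few lines. This is only a minimal sketch: the function name `invert_segment` and the 0-based, end-exclusive coordinates are illustrative assumptions, not part of the lecture.

```python
# Sketch: an inversion replaces a segment of the top strand by its
# reverse complement (the segment is cut out, turned around, and re-inserted).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def invert_segment(top_strand: str, start: int, end: int) -> str:
    """Invert top_strand[start:end] (0-based, end exclusive; assumed convention)."""
    segment = top_strand[start:end]
    # Complement every base, then reverse the order of the bases.
    flipped = segment.translate(COMPLEMENT)[::-1]
    return top_strand[:start] + flipped + top_strand[end:]

before = "ATGCCTGTACTA"
after = invert_segment(before, 3, 9)
print(after)                        # ATGTACAGGCTA
print(after.translate(COMPLEMENT))  # TACATGTCCGAT (bottom strand, read left to right)
```

The two printed strands match the lower pair of sequences on the slide.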

Page 4: Genome Rearrangements


Types of Genome Rearrangements

Two genomes may have many genes in common, but the genes may be

arranged in a different sequence or be moved between chromosomes. Such

differences in gene orders are the results of rearrangement events that are

common in molecular evolution.

For example, in unichromosomal genomes, the most common rearrangement

events are reversals, in which a contiguous interval of genes is put into the

reverse order.

For multichromosomal genomes, the most common rearrangement events are

reversals, translocations, fissions, and fusions.

The pairwise genome rearrangement problem is to find an optimal scenario

transforming one genome to another via these rearrangement events.

Page 5: Genome Rearrangements


Representation of a genome

We consider a unichromosomal genome to be a sequence of n genes. The

genes are represented by numbers 1, 2, ..., n.

The two orientations of gene i are represented by i and -i.

A genome is represented as a signed permutation of the numbers 1, 2, ..., n.

For example, a unichromosomal genome with n = 5 genes is 5 -3 4 2 -1

Page 6: Genome Rearrangements


Multichromosomal Genome

A multichromosomal genome consists of n genes spread over m

chromosomes. We represent it as a signed permutation of 1, 2, ..., n, with

delimiters "$" or ";" inserted between the chromosomes. For example, a genome

with 12 genes spread over 3 chromosomes is

7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $

The order of the chromosomes and the direction of the chromosomes do not

matter in the multichromosomal algorithms. Thus, we could represent this same

genome by flipping the first chromosome (reverse the order of its entries and

negate them) and then moving the last chromosome to the beginning:

11 4 10 $ -3 -8 2 -7 $ 5 9 -6 -1 12 $
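One way to check that two such strings describe the same multichromosomal genome is to bring both into a canonical form. The sketch below is an illustration only; the helper names `parse_genome`, `flip` and `canonical` are assumptions introduced here, and the canonical form chosen (smaller of each chromosome and its flip, then sorted) is just one convenient option.

```python
def parse_genome(text):
    """Split a '$'-delimited genome string into chromosomes (tuples of signed ints)."""
    chromosomes = []
    for part in text.split("$"):
        genes = tuple(int(tok) for tok in part.split())
        if genes:  # skip the empty piece after the final '$'
            chromosomes.append(genes)
    return chromosomes

def flip(chromosome):
    """Reverse the order of the entries and negate them."""
    return tuple(-g for g in reversed(chromosome))

def canonical(chromosomes):
    """Order- and orientation-independent form of a genome."""
    return sorted(min(c, flip(c)) for c in chromosomes)

a = parse_genome("7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $")
b = parse_genome("11 4 10 $ -3 -8 2 -7 $ 5 9 -6 -1 12 $")
print(canonical(a) == canonical(b))  # True
```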

Page 7: Genome Rearrangements


Unichromosomal genomes: sorting by reversal

A reversal in a signed permutation is an operation that takes an interval in a

permutation, reverses the order of the numbers, and changes all their signs. For

example,

5 1 3 2 -9 7 -4 6 8

5 1 -7 9 -2 -3 -4 6 8

The reversal distance between two genomes is the minimum number of

reversals it takes to get from one genome to the other.

For a given pair of genomes, the reversal distance is unique, but there are

usually many possible reversal scenarios with this distance.

However, it is possible that this mathematical notion of reversal distance can

underestimate the actual number of steps that occurred biologically.
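The reversal operation on a signed permutation is short enough to write out. In this sketch the name `reverse_interval` and the 0-based inclusive indices are assumptions made for illustration; the data is the example from this slide.

```python
def reverse_interval(genome, i, j):
    """Return a copy of genome with positions i..j (0-based, inclusive)
    put in reverse order and with all signs in that interval changed."""
    block = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + block + genome[j + 1:]

before = [5, 1, 3, 2, -9, 7, -4, 6, 8]
after = reverse_interval(before, 2, 5)
print(after)  # [5, 1, -7, 9, -2, -3, -4, 6, 8]
```

Applying the same reversal twice restores the original genome, which is why the reversal distance is symmetric in the two genomes.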

Page 8: Genome Rearrangements


Multichromosomal genomes: rearrangement operations

We treat four elementary rearrangement events in multichromosomal genomes:

reversals, translocations, fusions, and fissions.

Reversal: An interval within a single chromosome may be reversed in the

same fashion as a reversal acts in the unichromosomal case:

Before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $

After: 7 -2 8 3 $ 5 9 -12 1 6 $ 11 4 10 $

Note: When the programs are run in unichromosomal mode, the genomes 3 1 2 and -2 -1 -3 are considered different (one reversal apart, distance = 1), while in multichromosomal mode those same genomes are considered equivalent (distance = 0), because flipping an entire chromosome gives an equivalent genome in multichromosomal mode.

Page 9: Genome Rearrangements


Translocation

Translocation: Two chromosomes "A B" and "C D" may be rearranged into "A D" and "C B". (The letters A, B, C, D stand for sequences of genes.)

Because flipping chromosomes does not alter a genome (only its representation is altered), "A -C" and "-B D" is another possible translocation. (-B means to reverse the order of the genes in sequence B and negate each one.)

For example, a translocation on chromosomes 1 and 3 is

Before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $

After: 7 -2 8 -4 -11 $ 5 9 -6 -1 12 $ -3 10 $
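Both translocation variants can be sketched as list operations. The function `translocate`, its cut-point arguments `i` and `j`, and the `flipped` switch are hypothetical names chosen for this illustration.

```python
def flip(chromosome):
    """Reverse the order of the genes and negate each one."""
    return [-g for g in reversed(chromosome)]

def translocate(x, y, i, j, flipped=False):
    """Cut x into A = x[:i], B = x[i:] and y into C = y[:j], D = y[j:].

    flipped=False returns (A D, C B); flipped=True returns (A -C, -B D).
    """
    a, b = x[:i], x[i:]
    c, d = y[:j], y[j:]
    if flipped:
        return a + flip(c), flip(b) + d
    return a + d, c + b

chr1, chr3 = [7, -2, 8, 3], [11, 4, 10]
print(translocate(chr1, chr3, 3, 2, flipped=True))
# ([7, -2, 8, -4, -11], [-3, 10])
```

With A = 7 -2 8, B = 3, C = 11 4 and D = 10 this reproduces the example translocation on chromosomes 1 and 3.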

Page 10: Genome Rearrangements


Fusion & Fission

Fusion: Two chromosomes may be fused together into a single chromosome.

Due to chromosome flippings, there are four distinct fusions between each pair of

chromosomes. Here is one of the fusions between chromosomes 1 and 3:

Before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $

After: 7 -2 8 3 -10 -4 -11 $ 5 9 -6 -1 12 $

Fission: A chromosome may be broken into two chromosomes between any pair

of genes:

Before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $

After: 7 -2 8 3 $ 5 9 $ -6 -1 12 $ 11 4 10 $
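Fusion and fission are the simplest of the four operations. The sketch below uses hypothetical helper names (`fuse`, `fission`) and reproduces the two examples on this slide; the two boolean flags give the four distinct fusions mentioned above.

```python
def flip(chromosome):
    """Reverse the order of the genes and negate each one."""
    return [-g for g in reversed(chromosome)]

def fuse(x, y, flip_x=False, flip_y=False):
    """Fuse two chromosomes; the flags select one of the four possible fusions."""
    left = flip(x) if flip_x else list(x)
    right = flip(y) if flip_y else list(y)
    return left + right

def fission(chromosome, k):
    """Break a chromosome into two between positions k-1 and k (0-based)."""
    return chromosome[:k], chromosome[k:]

print(fuse([7, -2, 8, 3], [11, 4, 10], flip_y=True))  # [7, -2, 8, 3, -10, -4, -11]
print(fission([5, 9, -6, -1, 12], 2))                 # ([5, 9], [-6, -1, 12])
```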

Page 11: Genome Rearrangements


Signed and unsigned genomes

Most comparative mapping techniques determine the physical locations and

relative order of genes in each chromosome, but do not determine which of two

orientations each gene has.

Current sequencing methods do provide the orientations. It turns out that the

genome rearrangement problem (uni- and multichromosomal) for unsigned

permutations is NP-hard, but the same problems for signed data can be done in

polynomial time.

Fortunately, with many genomes currently being sequenced, it is likely that

many comparative maps (corresponding to unsigned permutations) will soon be

replaced by sequencing data (corresponding to signed permutations).

Page 12: Genome Rearrangements


Multichromosomal genomes: rearrangement operations

For example, to turn the unsigned genome

1 2 3 4 5

into the unsigned genome

1 4 3 2 5

requires one unsigned reversal. Signs can be assigned in the source and destination genomes so that the resulting signed reversal scenario requires this same number of steps. Here, we get

1 2 3 4 5

1 -4 -3 -2 5

which also takes one step. Note that there may be other sign assignments

taking this minimum number of steps.

Page 13: Genome Rearrangements


Multichromosomal genomes: rearrangement operations

It is possible that correctly signed data would have increased the number of

steps:

1 2 3 4 5

1 -4 -3 -2 5

1 -4 3 -2 5

If the data collection method did not determine signs, it is impossible to know

mathematically whether the one step or two step scenario is more biologically

accurate; the mathematical problem the genome rearrangement programs solve

is to find the signs giving the minimum possible distance.
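For very small genomes the search for the best sign assignment can be done by brute force. The sketch below is not how the rearrangement programs work; it simply runs a breadth-first search over all signed permutations (feasible only for a handful of genes) and then tries every signing of the target. All function names are assumptions made for this illustration.

```python
from collections import deque
from itertools import product

def neighbours(perm):
    """All signed permutations one reversal away from perm (a tuple)."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            block = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + block + perm[j + 1:]

def distances_from(source):
    """Breadth-first search: reversal distance from source to every signed permutation."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in neighbours(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist

def best_signing(source, target_unsigned):
    """Signing of the target that minimises the signed reversal distance."""
    dist = distances_from(tuple(source))
    candidates = (tuple(s * g for s, g in zip(signs, target_unsigned))
                  for signs in product((1, -1), repeat=len(target_unsigned)))
    best = min(candidates, key=lambda t: dist[t])
    return best, dist[best]

print(best_signing((1, 2, 3, 4, 5), (1, 4, 3, 2, 5)))
# ((1, -4, -3, -2, 5), 1)
print(distances_from((1, 2, 3, 4, 5))[(1, -4, 3, -2, 5)])  # 2
```

The first result is the one-step signing from the previous slide; the second confirms that the alternative signing 1 -4 3 -2 5 needs two reversals.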

Page 14: Genome Rearrangements


X-Alignments

• The "X" factor was discovered by Eisen et al.

• Alignment of whole genomes of prokaryotes such as bacteria revealed X-like patterns in dot plots, called X-alignments.

• Implication: the reversals took place equidistant from the center of the chromosome.

• Those along the diagonal are orthologs between species.

• Those along the anti-diagonal are duplicates separated by inversion, within species.

Page 15: Genome Rearrangements


A biological model case

cabbage: 8 7 6 5 4 3 2 1 11 10 9

turnip: 4 3 2 8 7 1 5 6 11 10 9

Palmer and Herbon found that the mitochondrial genomes in cabbage and turnip had very similar gene sequences, but with fairly different gene orders.

How to design a "transformation" of cabbage into turnip?

Mitochondrial DNA of cabbage and turnip are composed of five conserved

blocks of genes that are shuffled in cabbage as compared to turnip. Every

conserved block has a direction that is shown by a + or – sign.

Page 16: Genome Rearrangements


Inversion, Transposition and inverted Transposition

inversion

transposition

inverted transposition

Page 17: Genome Rearrangements


Oriented/Unoriented Blocks

Remember that the unoriented case results in an NP-Hard problem, whereas

the oriented case can be solved in polynomial time.

2 1 3 7 5 4 8 6

1 2 3 4 5 6 7 8

8 7 6 5 4 3 2 1 11 10 9

4 3 2 8 7 1 5 6 11 10 9

Unoriented blocks: NP-hard

Oriented blocks: polynomial time

Page 18: Genome Rearrangements


Sorting by Reversals

Cabbage: 8 7 6 5 4 3 2 1 11 10 9

8 2 3 4 5 6 7 1 11 10 9

8 2 3 4 5 1 7 6 11 10 9

4 3 2 8 5 1 7 6 11 10 9

Turnip: 4 3 2 8 7 1 5 6 11 10 9
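A scenario like the one on this slide can be checked mechanically: each row must differ from the previous one by exactly one reversed block. The sketch below treats the rows as unsigned sequences, which is an assumption (any orientation marks of the original figure are not available here), and the row order cabbage, three intermediates, turnip is the reading adopted for this check. The helper name `one_reversal_apart` is hypothetical.

```python
def one_reversal_apart(a, b):
    """True if the unsigned sequence b arises from a by reversing one contiguous block."""
    if len(a) != len(b) or a == b:
        return False
    i = 0
    while a[i] == b[i]:          # first position where the rows differ
        i += 1
    j = len(a) - 1
    while a[j] == b[j]:          # last position where the rows differ
        j -= 1
    return a[i:j + 1] == b[i:j + 1][::-1]

scenario = [
    [8, 7, 6, 5, 4, 3, 2, 1, 11, 10, 9],   # cabbage
    [8, 2, 3, 4, 5, 6, 7, 1, 11, 10, 9],
    [8, 2, 3, 4, 5, 1, 7, 6, 11, 10, 9],
    [4, 3, 2, 8, 5, 1, 7, 6, 11, 10, 9],
    [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9],   # turnip
]
print(all(one_reversal_apart(a, b) for a, b in zip(scenario, scenario[1:])))  # True
```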

Page 19: Genome Rearrangements


Permutation ($\pi$): an ordered arrangement of the set {1, 2, …, n}

Reversal ($\rho$): a rearrangement that inverts a block in $\pi$

$\pi$ = {3 4 7 6 1 5 2}, $\pi \cdot \rho(3,6)$ = {3 4 5 1 6 7 2}

Signed permutation: a permutation where the elements are oriented; a reversal switches element orientation

$\pi$ = {+3 -4 +7 -6 +1 -5 +2}, $\pi \cdot \rho(3,6)$ = {+3 -4 +5 -1 +6 -7 +2}

Page 20: Genome Rearrangements


easy to do by eye ...

$\pi$ = 8 7 6 5 4 3 2 1 11 10 9

$\pi \cdot \rho_1$ = 8 2 3 4 5 6 7 1 11 10 9

$\pi \cdot \rho_1 \rho_2$ = 8 2 3 4 5 1 7 6 11 10 9

$\pi \cdot \rho_1 \rho_2 \rho_3$ = 4 3 2 8 5 1 7 6 11 10 9

$\pi \cdot \rho_1 \rho_2 \cdots \rho_t = \sigma$ = 4 3 2 8 7 1 5 6 11 10 9

$\pi = \sigma \cdot \rho_t \cdots \rho_2 \rho_1$

Page 21: Genome Rearrangements


Formal Approach: Sorting by Reversals

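The order of genes in 2 organisms is represented by permutations $\pi = \pi_1 \pi_2 \dots \pi_n$ and $\sigma = \sigma_1 \sigma_2 \dots \sigma_n$. A reversal $\rho(i,j)$ of an interval $[i,j]$ is the permutation

$$\rho(i,j) = \begin{pmatrix} 1 & 2 & \dots & i-1 & i & i+1 & \dots & j-1 & j & j+1 & \dots & n \\ 1 & 2 & \dots & i-1 & j & j-1 & \dots & i+1 & i & j+1 & \dots & n \end{pmatrix}$$

$\rho(i,j)$ has the effect of reversing the order of $\pi_i \pi_{i+1} \dots \pi_j$ and transforming $\pi_1 \dots \pi_{i-1} \pi_i \dots \pi_j \pi_{j+1} \dots \pi_n$ into $\pi \cdot \rho(i,j) = \pi_1 \dots \pi_{i-1} \pi_j \dots \pi_i \pi_{j+1} \dots \pi_n$.

Given permutations $\pi$ and $\sigma$, the reversal distance problem is to find a series of reversals $\rho_1, \rho_2, \dots, \rho_t$ such that $\pi \cdot \rho_1 \cdot \rho_2 \cdots \rho_t = \sigma$ and $t$ is minimal.

$t$ is called the reversal distance between $\pi$ and $\sigma$.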
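Reading a reversal as a permutation of positions makes the product notation concrete: composing a genome with the reversal means reading the genome at the positions the reversal lists. The sketch below uses the unsigned seven-element example from the earlier slide; the names `rho` and `compose` and the 1-based interface are assumptions made here.

```python
def rho(n, i, j):
    """The reversal of the interval [i, j] as a permutation of positions 1..n (1-based)."""
    return [k if k < i or k > j else i + j - k for k in range(1, n + 1)]

def compose(pi, r):
    """Product pi * r: entry k of the result is pi at position r[k]."""
    return [pi[pos - 1] for pos in r]

pi = [3, 4, 7, 6, 1, 5, 2]
print(rho(7, 3, 6))               # [1, 2, 6, 5, 4, 3, 7]
print(compose(pi, rho(7, 3, 6)))  # [3, 4, 5, 1, 6, 7, 2]
```

Because a reversal is its own inverse, composing twice with the same reversal returns the original permutation.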

Page 22: Genome Rearrangements


Breakpoint Graph

Sorting a permutation by reversals is a hard problem.

Breakpoints were introduced by Watterson et al. (1982) and by Nadeau and Taylor (1984), and correlations were noticed between the reversal distance and the number of breakpoints.

Let $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = \pi_1 \pi_2 \dots \pi_n$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n + 1$. We call a pair of elements $(\pi_i, \pi_{i+1})$, $0 \le i \le n$, of $\pi$ an adjacency if $\pi_i \sim \pi_{i+1}$, and a breakpoint if $\pi_i \not\sim \pi_{i+1}$.

Example: $\pi$ = 2 3 1 4 6 5 7 is extended to 0 2 3 1 4 6 5 7 8.

Adjacencies: (2,3), (6,5), (7,8)

Breakpoints: (0,2), (3,1), (1,4), (4,6), (5,7)

As the identity permutation has no breakpoints, sorting by reversals corresponds to eliminating breakpoints.

The observation that every reversal can eliminate at most 2 breakpoints implies that the reversal distance satisfies $d(\pi) \ge b(\pi) / 2$, where $b(\pi)$ is the number of breakpoints in $\pi$.

However, this lower bound is usually far from tight.
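Counting breakpoints takes only a few lines. The following sketch applies the definition above to the example permutation 2 3 1 4 6 5 7; the function name `breakpoints` is an assumption for illustration.

```python
from math import ceil

def breakpoints(pi):
    """Pairs of neighbours in the extended permutation 0, pi, n+1 that do not differ by 1."""
    ext = [0] + list(pi) + [len(pi) + 1]
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) != 1]

pi = [2, 3, 1, 4, 6, 5, 7]
bps = breakpoints(pi)
print(bps)                 # [(0, 2), (3, 1), (1, 4), (4, 6), (5, 7)]
print(ceil(len(bps) / 2))  # 3, the breakpoint lower bound on the reversal distance
```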

Page 23: Genome Rearrangements


Breakpoint Graph

The breakpoint graph of a permutation $\pi$ is an edge-colored graph $G(\pi)$ with $n + 2$ vertices $\{\pi_0, \pi_1, \dots, \pi_n, \pi_{n+1}\} = \{0, 1, \dots, n, n+1\}$. We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge for $0 \le i \le n$. We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$.

[Figure: black path and gray path drawn on the vertices 0 2 3 1 4 6 5 7.]

Superposition of black and gray paths forms the breakpoint graph:

A breakpoint graph is obtained by a superposition of a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation $\pi$ and a gray path traversing the vertices in the order given by the identity permutation.
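The two edge sets follow directly from the definition: black edges join neighbours in the extended permutation, gray edges join values that differ by one. The sketch below is a plain edge-list construction; the name `breakpoint_graph` is hypothetical.

```python
from collections import Counter

def breakpoint_graph(pi):
    """Return (black_edges, gray_edges) of the breakpoint graph of an unsigned permutation."""
    ext = [0] + list(pi) + [len(pi) + 1]
    black = list(zip(ext, ext[1:]))                    # neighbours in the permutation
    gray = [(v, v + 1) for v in range(len(ext) - 1)]   # neighbours in the identity
    return black, gray

black, gray = breakpoint_graph([2, 3, 1, 4, 6, 5, 7])
print(black)  # [(0, 2), (2, 3), (3, 1), (1, 4), (4, 6), (6, 5), (5, 7), (7, 8)]
print(gray)   # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]

# Every vertex touches as many black edges as gray edges.
black_deg = Counter(v for e in black for v in e)
gray_deg = Counter(v for e in gray for v in e)
print(black_deg == gray_deg)  # True
```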

Page 24: Genome Rearrangements


Cycle decomposition

A cycle in an edge-colored graph G is called alternating if the colors of every two

consecutive edges of this cycle are distinct. In the following, cycles will mean

alternating cycles.

[Figure: cycle decomposition of the breakpoint graph, four panels drawn on the vertices 0 2 3 1 4 6 5 7.]

A vertex v in a graph G is called balanced if the number of black edges incident to v equals the number of gray edges incident to v.

A balanced graph is a graph in which every vertex is balanced. $G(\pi)$ is a balanced graph. Therefore, there exists a cycle decomposition of $G(\pi)$ into edge-disjoint alternating cycles (every edge in the graph belongs to exactly one cycle in the decomposition). Cycles in an edge decomposition may be self-intersecting. The previous breakpoint graph can be decomposed into 4 cycles, one of which is self-intersecting.

Page 25: Genome Rearrangements


Cycle decomposition

What is the decomposition of the breakpoint graph into a maximum number $c(\pi)$ of edge-disjoint alternating cycles? Here, $c(\pi) = 4$.

Cycle decompositions play an important role in estimating reversal distances. When a reversal is applied to a permutation, the number of cycles in a maximum decomposition can change by at most one (while the number of breakpoints can change by two).

Bafna & Pevzner (1996) proved the bound

$d(\pi) \ge n + 1 - c(\pi)$,

which is much tighter than the bound in terms of breakpoints, $d(\pi) \ge b(\pi) / 2$.

For many biological problems, $d(\pi) = n + 1 - c(\pi)$.

Therefore, the reversal distance problem reduces to the problem of finding the maximal cycle decomposition.
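As a worked comparison of the two bounds, take the example permutation 2 3 1 4 6 5 7 with n = 7. It has five breakpoints by the definition given earlier, and the slide above states a maximum decomposition into four cycles:

```latex
\begin{align*}
d(\pi) &\ge \left\lceil \frac{b(\pi)}{2} \right\rceil = \left\lceil \frac{5}{2} \right\rceil = 3, \\
d(\pi) &\ge n + 1 - c(\pi) = 7 + 1 - 4 = 4.
\end{align*}
```

For this permutation the cycle bound already exceeds the breakpoint bound by one.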

Page 26: Genome Rearrangements


Effects of reversals on cycles

(A) For reversals acting on two cycles, $\Delta(b - c) = 1$.

(B) For reversals acting on an unoriented cycle, $\Delta(b - c) = 0$.

(C) For reversals acting on an oriented cycle, $\Delta(b - c) = -1$.

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Page 27: Genome Rearrangements


Effect of reversals on gray edges

(a) A proper reversal on an oriented gray edge.

(b) A nonproper reversal on an unoriented gray edge.

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Page 28: Genome Rearrangements


Transform signed into unsigned permutation

(a) Optimal sorting of a permutation (3 5 8 6 4 7 9 2 1 10 11) by 5 reversals.

(b) Breakpoint graph of this permutation: black edges connect adjacent vertices that are not consecutive, gray edges connect consecutive vertices that are not adjacent.

(c) Transformation of a signed permutation into an unsigned permutation and the breakpoint graph $G(\pi)$.

(d) Interleaving graph $H_\pi$ with two oriented components and one unoriented component.

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Page 29: Genome Rearrangements


The Problems

Minimum Sorting by Reversals (MinSortRv):

Given a permutation $\pi$, what is the shortest sequence $(\rho_1 \rho_2 \dots \rho_t)$ of reversals that sorts $\pi$? (Distance: $d(\pi)$)

The problem is NP-hard.

Minimum Signed Sorting by Reversals (SignedSortRv):

Given a signed permutation $\pi$, what is the shortest sequence $(\rho_1 \rho_2 \dots \rho_t)$ of reversals that sorts $\pi$?

Solvable in polynomial time.

Page 30: Genome Rearrangements


Important developments

KS93 - Kececioglu and Sankoff, "Exact and approximation algorithms for the inversion distance between two chromosomes", 4th CPM
- studied MinSortRv
- introduced notion of breakpoints
- 2-approximation algorithm

BP93 - Bafna and Pevzner, "Genome Rearrangements and Sorting by Reversals", 34th FOCS
- breakpoint graph and cycle decomposition
- introduced signed sorting SignedSortRv
- 3/2-approximation algorithm for SignedSortRv
- 7/4-approximation algorithm for MinSortRv

HP95 - Hannenhalli and Pevzner, "Transforming Cabbage into Turnip", 27th STOC
- SignedSortRv resolved
- $O(n^4)$ algorithm
- introduced hurdles and fortresses
- $d(\pi) = b(\pi) - c(\pi) + h(\pi) + f(\pi)$

Page 31: Genome Rearrangements


KS93- Breakpoints

Extend $\pi$ to include element 0 (L) on the left and element n+1 (R) on the right.

A breakpoint occurs between two adjacent elements that do not differ by 1.

Example: $\pi$ = {3 5 6 7 2 1 4 8} has 5 breakpoints ($b(\pi) = 5$).

L 3 5 6 7 2 1 4 8 R

Breakpoints partition the sequence into strips that are increasing or decreasing.

Reversals add or remove breakpoints. The sorted permutation has 0 breakpoints.

i-reversal (i = 0, 1, 2): a reversal that decreases the number of breakpoints by i.

Theorem (KS): Let $\pi$ contain a decreasing strip. Then $\pi$ has a 1- or 2-reversal. If every reversal that removes a breakpoint of $\pi$ results in a permutation with no decreasing strips, then $\pi$ has a 2-reversal.
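The strips of the example can be listed with a short helper. The sketch below makes one assumption that the slide does not spell out: a single-element strip is counted as decreasing unless it contains 0 or n+1, which is a common convention for this theorem. The function names are hypothetical.

```python
def strips(pi):
    """Split the extended permutation 0, pi, n+1 at its breakpoints."""
    ext = [0] + list(pi) + [len(pi) + 1]
    out = [[ext[0]]]
    for a, b in zip(ext, ext[1:]):
        if abs(a - b) == 1:
            out[-1].append(b)   # adjacency: the current strip continues
        else:
            out.append([b])     # breakpoint: a new strip starts
    return out

def is_decreasing(strip, n):
    """Assumed convention: singletons are decreasing, strips with 0 or n+1 are increasing."""
    if 0 in strip or n + 1 in strip:
        return False
    return len(strip) == 1 or strip[0] > strip[1]

pi = [3, 5, 6, 7, 2, 1, 4, 8]
s = strips(pi)
print(s)                                            # [[0], [3], [5, 6, 7], [2, 1], [4], [8, 9]]
print(len(s) - 1)                                   # 5 breakpoints
print([t for t in s if is_decreasing(t, len(pi))])  # [[3], [2, 1], [4]]
```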

Page 32: Genome Rearrangements


Algorithm KS($\pi$)

    i ← 0
    while $\pi$ contains a breakpoint
        i ← i + 1
        $\rho_i$ ← the reversal that removes the most breakpoints, resolving ties in favor of reversals that leave a decreasing strip
        $\pi$ ← $\pi \cdot \rho_i$
    return ($\rho_1$, ..., $\rho_i$)

The optimal reversal distance is at least $b(\pi)/2$.

KS returns a solution that uses at most $b(\pi)$ reversals, i.e. at most 2 × optimal.
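A direct, unoptimised reading of the greedy rule can be sketched as follows. It tries every interval, scores it by the number of breakpoints removed, and breaks ties in favour of results that still contain a decreasing strip. The function names, the 1-based reversal endpoints in the output, and the convention that single-element strips count as decreasing are assumptions of this sketch, not details taken from the slide.

```python
def count_breakpoints(pi):
    ext = [0] + list(pi) + [len(pi) + 1]
    return sum(abs(a - b) != 1 for a, b in zip(ext, ext[1:]))

def has_decreasing_strip(pi):
    """Assumed convention: a singleton strip (other than 0 or n+1) counts as decreasing."""
    ext = [0] + list(pi) + [len(pi) + 1]
    for k in range(1, len(pi) + 1):
        prev_, cur, next_ = ext[k - 1], ext[k], ext[k + 1]
        descending = prev_ == cur + 1 or next_ == cur - 1
        singleton = abs(prev_ - cur) != 1 and abs(next_ - cur) != 1
        if descending or singleton:
            return True
    return False

def ks_greedy(pi):
    """Greedy sorting of an unsigned permutation; returns reversals as 1-based (i, j) pairs."""
    pi = list(pi)
    reversals = []
    while count_breakpoints(pi) > 0:
        b = count_breakpoints(pi)
        best = None
        for i in range(len(pi)):
            for j in range(i + 1, len(pi)):
                cand = pi[:i] + pi[i:j + 1][::-1] + pi[j + 1:]
                # Primary key: breakpoints removed; tie-break: a decreasing strip remains.
                key = (b - count_breakpoints(cand), has_decreasing_strip(cand))
                if best is None or key > best[0]:
                    best = (key, (i + 1, j + 1), cand)
        reversals.append(best[1])
        pi = best[2]
    return reversals

scenario = ks_greedy([3, 5, 6, 7, 2, 1, 4, 8])
print(len(scenario), scenario)
```

The exhaustive scan over all intervals keeps the sketch short; an efficient implementation would only consider reversals whose endpoints lie at breakpoints.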

Page 33: Genome Rearrangements


BP93 – Breakpoint Graph

Vertices: elements of $\pi$ (plus 0 (L) and n+1 (R))

[Figure: "The diagram of reality and desire" for L 2 3 1 6 5 4 R; each element, e.g. 6, is drawn as a pair of vertices -6 and +6.]

Page 34: Genome Rearrangements


Construction of a diagram of reality and desire

Reality: L 3 2 1 4 5 R

Desire: L 1 2 3 4 5 R

Vertices: L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R

[Figure: two panels on these vertices, one showing the reality edges and one showing the desire edges.]

Page 35: Genome Rearrangements


[Figure: the diagram of reality and desire on the vertices L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R.]

Page 36: Genome Rearrangements


Effect of reversals on $c(\pi)$

$c(\pi)$ = number of cycles in a maximum cycle decomposition

Observation: reversals affect $c(\pi)$.

Example: the reversal marked by brackets in {L [+1 -1] -2 +2 +3 -3 R} removes 2 breakpoints and 1 cycle.

Before: L +1 -1 -2 +2 +3 -3 R

After: L -1 +1 -2 +2 +3 -3 R

Page 37: Genome Rearrangements


$d(\pi) \ge b(\pi) - c(\pi)$

Cycles of length 4 are eliminated by 2-reversals.

Let $c_4(\pi)$ = number of 4-cycles.

$c(\pi) - c_4(\pi)$: cycles of length > 4 include at least three breakpoints

$d(\pi) \ge b(\pi) - c_4(\pi) - (c(\pi) - c_4(\pi)) / 3$

Page 38: Genome Rearrangements


Algorithm BP($\pi$)

    while $\pi$ contains a breakpoint
        if $\pi$ has no decreasing strips
            if a 4-cycle C remains
                find cycle C' that crosses C
                0-reversal on C', 2-reversal on C
            else
                regular 0-reversal
        else
            regular greedy choice

Algorithm BP produces a solution that is at most (3 × optimal) / 2.

Page 39: Genome Rearrangements


HP95 – Hurdles and Fortress

[Figure: interleaving graph with components A, B, C, D, E and F.]

Page 40: Genome Rearrangements


Hurdles

A hurdle is a bad component that does not separate any other two bad components. Separation is an important concept: a reversal through reality edges in different components A and C results in every component B that separates A and C being twisted. A bad component becomes good when twisted.

Bad components are either non-hurdles or hurdles; hurdles are either simple hurdles or super hurdles.

[Figure: components A, B, C, D, E and F.]

Page 41: Genome Rearrangements


Fortress

A permutation $\pi$ is called a fortress ($f(\pi) = 1$) when its reality and desire diagram contains an odd number of hurdles and all of them are super hurdles.

Fortresses are permutations that require one extra reversal to sort, due to their special structure.

[Figure: a smallest possible fortress.]

Page 42: Genome Rearrangements


Algorithm HP($\pi$)

    if there is a good component in RD($\pi$) then
        pick two divergent edges e, f in this component, making sure the corresponding reversal does not create any bad components
        return the reversal characterized by e and f
    else
        if $h(\pi)$ is even then
            return merging of two opposite hurdles
        else
            if $h(\pi)$ is odd and there is a simple hurdle
                return a reversal cutting this hurdle
            else // fortress
                return merging of any two hurdles

$d(\pi) = b(\pi) - c(\pi) + h(\pi) + f(\pi)$

$h(\pi)$: number of hurdles

$f(\pi)$: 0/1, according to $\pi$ being a fortress or not
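The first two terms of the distance formula are easy to compute for a signed permutation, because its reality and desire diagram decomposes into cycles in only one way. The sketch below counts those cycles and returns n + 1 minus the number of cycles, which equals b - c when every adjacency is counted as a trivial cycle. It ignores hurdles and fortresses, so it yields only a lower bound on the distance; the function names and the encoding of gene x as the pair (2x-1, 2x) are assumptions of this sketch.

```python
def unsign(signed):
    """Encode +x as (2x-1, 2x) and -x as (2x, 2x-1), framed by 0 and 2n+1."""
    out = [0]
    for g in signed:
        out += [2 * g - 1, 2 * g] if g > 0 else [-2 * g, -2 * g - 1]
    out.append(2 * len(signed) + 1)
    return out

def count_cycles(signed):
    """Number of alternating cycles in the reality and desire diagram."""
    seq = unsign(signed)
    black, gray = {}, {}
    for k in range(0, len(seq), 2):      # reality edges join positions (0,1), (2,3), ...
        a, b = seq[k], seq[k + 1]
        black[a], black[b] = b, a
    for v in range(0, len(seq), 2):      # desire edges join values (0,1), (2,3), ...
        gray[v], gray[v + 1] = v + 1, v
    seen, cycles = set(), 0
    for start in seq:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:             # walk: reality edge, then desire edge
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray[w]
    return cycles

def cycle_lower_bound(signed):
    """n + 1 - c, a lower bound on the signed reversal distance."""
    return len(signed) + 1 - count_cycles(signed)

print(cycle_lower_bound([1, 2, 3]))          # 0
print(cycle_lower_bound([1, -2, 3]))         # 1
print(cycle_lower_bound([1, -4, 3, -2, 5]))  # 2
```

The last value agrees with the two-step scenario for 1 -4 3 -2 5 shown on the earlier slide about signed and unsigned data.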

Page 43: Genome Rearrangements


Hurdles

(a) Unoriented component U separates U' and U'' by virtue of the edge (0, 1).

(b) Hurdle U does not separate U' and U''.

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Page 44: Genome Rearrangements


Effects of reversals on cycles

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Reversal on a cycle C (i) deletes vertex C from the interleaving graph; (ii) changes

the orientation of vertices in V(C); (iii) complements the subgraph induced by V(C).

Page 45: Genome Rearrangements


Merging hurdles

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Page 46: Genome Rearrangements


Hannenhalli-Pevzner algorithm

Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Page 47: Genome Rearrangements


Improvements of Hannenhalli-Pevzner algorithm

Several websites offer programs to sort permutations by reversals. At their roots

is the Hannenhalli-Pevzner algorithm for sorting signed permutations by

reversals. Successive authors improved the algorithm.

• With the Hannenhalli & Pevzner algorithm, the distance computation is performed in time $O(n^4)$.

• Improvements to the algorithm developed by Haim Kaplan, Ron Shamir and Robert E. Tarjan bring the time to compute the distance down to $O(n^2)$.

• GRAPPA is written by a multitude of authors. It reduces the distance computation time to $O(n)$ using improvements by David A. Bader, Bernard M.E. Moret and Mi Yan.

The main purpose of GRAPPA is to construct phylogenetic trees for multiple

signed unichromosomal genomes; the distance computation on which we are

focused here is but a mere subroutine in that context.

Page 48: Genome Rearrangements


Algorithm by Kaplan, Shamir, Tarjan

The algorithm has three main stages:

1. Pre-process the permutation. This pre-processing contains 3 sub stages:

    1a. Unsign the permutation, e.g., p will be unsigned to the permutation 0,

(7,8), (4,3), (1,2), (5,6), (12,11), (9,10), 13.

    1b. Define the Overlap graph of the permutation

    1c. Find the connected components of the overlap graph

2. Clear the hurdles. A hurdle is a problematic connected component of the

overlap graph. In this stage each reversal merges two hurdles in distinct

connected components into one non-hurdle component.

3. Generate a sequence of safe reversals. A safe reversal is defined as a

reversal that reduces  b-c (the number of breakpoints minus the number of

cycles) without creating new hurdles.
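Stage 1a can be illustrated with a few lines of code. The slide does not show the signed permutation p itself; the value used below, (+4, -2, +1, +3, -6, +5), is inferred here by reading the listed pairs backwards and should be treated as an assumption, as should the function name `unsign`.

```python
def unsign(signed):
    """Replace +x by the pair (2x-1, 2x) and -x by (2x, 2x-1); frame with 0 and 2n+1."""
    out = [0]
    for g in signed:
        out += [2 * g - 1, 2 * g] if g > 0 else [-2 * g, -2 * g - 1]
    out.append(2 * len(signed) + 1)
    return out

p = [4, -2, 1, 3, -6, 5]   # assumed: reconstructed from the pairs listed in stage 1a
print(unsign(p))
# [0, 7, 8, 4, 3, 1, 2, 5, 6, 12, 11, 9, 10, 13]
```

The overlap graph of stage 1b and its connected components in stage 1c are then defined on this unsigned sequence.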

Page 49: Genome Rearrangements


Multichromosomal genomes: more tricky

Word problems and insertions/deletions

So far we did not consider "word problems" in which some genes are repeated,

1 2 -1 3 4

nor did we allow gaps in the numbering (as may arise from insertion/deletion),

1 3 -9 -7 5

Distinguish between microrearrangements (e.g. intrachromosomal

rearrangements with a span < 1 Mb) and macrorearrangements (e.g.

intrachromosomal rearrangements of larger span as well as interchromosomal

rearrangements). The existing rearrangement algorithms do not distinguish

between these two types of rearrangements.

First identify conserved synteny blocks (segments that can be converted into

conserved segments by microrearrangements).

Page 50: Genome Rearrangements


Genome Rearrangements: Synteny

(a) Human and mouse synteny blocks of

conserved gene order. Every block

corresponds to a rectangle, with a diagonal

showing whether the arrangements of

anchors in human and mouse (within the

synteny block) are the same or reversed.

(b) Combining anchors into clusters by the

GRIMM-Synteny algorithm at G = 100 kb. The

edges in the anchor graph connect the closest

ends of the anchors. The anchors are color-

coded by the resulting clusters. At G = 1 Mb,

this forms a single cluster, which in turn forms

a synteny block (the lower right block in the

human 18/mouse 17 rectangle in a).

Pevzner, Tesler, Genome Res 13, 37 (2003)

Page 51: Genome Rearrangements


From Anchors to Breakpoint Graphs

X-chromosome: from local similarities, to

synteny blocks, to breakpoint graph, to

rearrangement scenario. (a) Dot-plot of

anchors. Anchors are enlarged for

visibility. (b) Clusters of anchors. (c)

Rectified clusters. (d) Synteny blocks. (e)

Synteny blocks (symbolic representation

as genome rearrangement units). (f) 2D

breakpoint graph superimposed on

synteny blocks. The projections of the 2D

graph onto the human and mouse axes

form the conventional breakpoint graphs.

(g) 2D breakpoint graph. The four cycles

in the breakpoint graph are shown by

different colors. (h) A most parsimonious

rearrangement scenario for human and

mouse X-chromosomes. Pevzner, Tesler, Genome Res 13, 37 (2003)

Page 52: Genome Rearrangements


Genome Rearrangements

Construction of the breakpoint

graph from synteny blocks.

(a) Solid path through human.

(b) Dotted path through mouse.

(c) Superposition of paths.

(d) Remove blocks to obtain

cycles.

Pevzner, Tesler, Genome Res 13, 37 (2003)

Page 53: Genome Rearrangements


Multichromosomal breakpoint graph

Multichromosomal breakpoint

graph of the whole human and

mouse genomes. The

conventional chromosome

order and orientation are not

suitable for such graphs; an

optimal chromosome order and

orientation were determined by

the algorithm in Tesler (2002).

Three "null chromosomes," N1,

N2, N3, were added to mouse

to equalize the number of

chromosomes in the two

genomes.

Pevzner, Tesler, Genome Res 13, 37 (2003)

Page 54: Genome Rearrangements


Summary

Computational studies on genome rearrangements assume that evolution took

the path of shortest reversal distance.

Due to algorithmic improvements, the computational costs could be

significantly reduced.

It may be very advantageous to analyze more than two genomes simultaneously!