1 Computer Science Department Technion – Israel Institute of Technology Genomic Sorting with Length-Weighted Reversals Ron Y. Pinter Technion Steve Skiena SUNY Stony Brook


Computer Science Department, Technion – Israel Institute of Technology

Genomic Sorting with Length-Weighted Reversals

Ron Y. Pinter, Technion

Steve Skiena, SUNY Stony Brook


Genome Rearrangement

• events
  – duplication
  – translocation
  – reversal (inversion)

• occur primarily during reproduction

• allow large-scale genomic comparisons


Sorting by Reversals

• genome represented as a permutation on 1, 2, …, n
  – n = # homologous genes among species

• assumptions
  – can identify genes
  – genes are distinct

• operation: reversal of a subsequence (of genes)
  – models inversion (occurs during crossover)

• one of the permutations can be 1, 2, …, n
  – appropriately relabel others


Example

• 6 reversals

• in our model (for f(l) = l): cost = 18

4 3 2 8 7 1 5 6 11 10 9

4 3 2 1 7 8 5 6 9 10 11

1 2 3 4 8 7 6 5 9 10 11

1 2 3 4 5 6 7 8 9 10 11
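
The four rows above can be replayed as a short Python sketch. The slide shows only the intermediate permutations, so the particular decomposition into six reversals below is an assumption (one reading of the rows that matches the stated count and cost); the helper names `reverse` and `apply_reversals` are hypothetical.

```python
def reverse(perm, i, j):
    """Return a copy of perm with the slice perm[i:j] reversed (0-based, j exclusive)."""
    return perm[:i] + perm[i:j][::-1] + perm[j:]


def apply_reversals(perm, reversals, f=lambda length: length):
    """Apply (i, j) reversals in order; return the final permutation and the total cost."""
    cost = 0
    for i, j in reversals:
        perm = reverse(perm, i, j)
        cost += f(j - i)  # a reversal of length l costs f(l)
    return perm, cost


start = [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9]

# Assumed decomposition of the three row-to-row transitions.
reversals = [
    (3, 6),   # row 1 -> 2: 8 7 1    -> 1 7 8
    (8, 11),  # row 1 -> 2: 11 10 9  -> 9 10 11
    (0, 4),   # row 2 -> 3: 4 3 2 1  -> 1 2 3 4
    (4, 6),   # row 2 -> 3: 7 8      -> 8 7
    (6, 8),   # row 2 -> 3: 5 6      -> 6 5
    (4, 8),   # row 3 -> 4: 8 7 6 5  -> 5 6 7 8
]

final, cost = apply_reversals(start, reversals)
print(final, cost)
```

With f(l) = l the lengths 3 + 3 + 4 + 2 + 2 + 4 add up to the cost of 18 quoted on the slide; passing a different `f` shows how the same reversal sequence is priced under another cost function.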


Our Model

• unsigned

• cost of reversal of subsequence of length l is f(l)

• total sorting cost (or distance) is $\sum_j f(\mathrm{length}(s_j))$, where the $s_j$ are the reversed subsequences


Cost Functions

• additive: f(x+y) = f(x) + f(y)

• subadditive: f(x+y) < f(x) + f(y)

• superadditive: f(x+y) > f(x) + f(y)

• other
  – e.g. bitonic

[Figure: two plots of f(l)]
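
As a rough illustration of the classes listed above, the following Python sketch compares f(x+y) with f(x) + f(y) over a finite range of lengths. It is an empirical check under assumptions, not a proof: the function name `classify`, the range bound `max_len`, and the sample cost functions are all chosen here for illustration.

```python
def classify(f, max_len=50):
    """Compare f(x+y) with f(x)+f(y) for all 1 <= x, y <= max_len (empirical only)."""
    signs = set()
    for x in range(1, max_len + 1):
        for y in range(1, max_len + 1):
            lhs, rhs = f(x + y), f(x) + f(y)
            signs.add((lhs > rhs) - (lhs < rhs))  # -1, 0 or +1
    if signs == {0}:
        return "additive"
    if signs == {-1}:
        return "subadditive"
    if signs == {1}:
        return "superadditive"
    return "other"


print(classify(lambda l: l))            # linear cost
print(classify(lambda l: 1))            # unit cost
print(classify(lambda l: l * l))        # quadratic cost
print(classify(lambda l: abs(l - 10)))  # a non-monotone sample function
```

The strict inequalities follow the slide's definitions; a function that mixes the three cases over the tested range is reported as "other".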


Problems

• algorithm to sort any permutation
  – worst-case min cost

• approximate min cost for a given permutation


Extremal Costs

• highly subadditive: e.g. unit cost, f(l) = 1
  – NP-complete [Caprara, ’97]
  – series of approximation ratios: 2, 1.75, 1.375

• highly superadditive: $f(l) > l^2$
  – essentially bubblesort
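
To make the "essentially bubblesort" remark concrete, here is a minimal Python sketch that sorts using only reversals of length 2 (adjacent swaps) and charges f(2) for each. The slide does not spell out the procedure, so the function name `bubble_sort_by_reversals` and the default cost f(l) = l² are assumptions made for illustration.

```python
def bubble_sort_by_reversals(perm, f=lambda length: length ** 2):
    """Sort with length-2 reversals only; return (sorted list, #reversals, total cost)."""
    perm = list(perm)
    reversals = 0
    cost = 0
    for end in range(len(perm) - 1, 0, -1):
        for i in range(end):
            if perm[i] > perm[i + 1]:
                # Reversing the 2-element window [i, i+1] is an adjacent swap.
                perm[i], perm[i + 1] = perm[i + 1], perm[i]
                reversals += 1
                cost += f(2)
    return perm, reversals, cost


print(bubble_sort_by_reversals([3, 1, 2]))
```

Each adjacent swap removes exactly one inversion, so the number of reversals equals the number of inversions in the input. The intuition behind the slide's remark is that when long reversals are very expensive, many short ones become the cheaper option.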


Our Results

• additive cost function
  – specifically f(l) = l

• QuickSort-like algorithm for worst-case
  – complexity: $O(n \lg^2 n)$

• min cost approximation ratio of $O(\lg^2 n)$


MedianEject(a,b)

• find r maximal blocks of wrong-sided elements with respect to median

• for lg r rounds: flip every other pair made up of a wrong-sided block and its adjacent block

• move wrong-sided blocks to median boundary

• reverse left and right blocks
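
The first step of MedianEject can be sketched in Python as follows. This covers only the block-finding step, and it rests on assumptions the slide leaves implicit: an element is treated as wrong-sided if it sits left of the midpoint but belongs to the larger half of the values (or the reverse), indices are 0-based with `b` exclusive, and a run is split at the median boundary. The name `wrong_sided_blocks` is hypothetical.

```python
def wrong_sided_blocks(perm, a, b):
    """Return maximal runs (start, end), end exclusive, of wrong-sided positions in perm[a:b]."""
    mid = a + (b - a) // 2
    # Values that belong left of the median boundary: the (mid - a) smallest ones.
    left_values = set(sorted(perm[a:b])[: mid - a])

    blocks, start = [], None
    for pos in range(a, b + 1):
        # pos == b acts as a sentinel that closes a run reaching the end of the range.
        wrong = pos < b and ((perm[pos] in left_values) != (pos < mid))
        if start is not None and (not wrong or pos == mid):
            blocks.append((start, pos))  # close the current run
            start = None
        if wrong and start is None:
            start = pos  # open a new run
    return blocks


print(wrong_sided_blocks([4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9], 0, 11))
```

The length of the returned list plays the role of r in the bullets above; the later steps (pairwise flips, moving blocks to the boundary, final reversals) are not implemented here.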


Sample Run

• complexity: O((b-a) lg r)


ReversalSort(a,b)

MedianEject(a, b);

ReversalSort(a, (a+b)/2);

ReversalSort((a+b)/2, b);

Complexity

$T(n) = 2\,T(n/2) + O(f(n) \lg n) = O(f(n) \lg^2 n) = O(n \lg^2 n)$ for $f(n) \sim n$
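
For the case f(n) = n, the recurrence can be unrolled level by level. The derivation below is a sketch under the usual simplifying assumptions (n a power of two, T(1) = O(1), and a constant c, introduced here, absorbing the hidden factor in the O(n lg n) term):

```latex
\begin{align*}
T(n) &\le 2\,T(n/2) + c\,n \lg n \\
     &\le \sum_{i=0}^{\lg n - 1} 2^i \cdot c\,\frac{n}{2^i} \lg\frac{n}{2^i} + n\,T(1) \\
     &= c\,n \sum_{i=0}^{\lg n - 1} (\lg n - i) + O(n) \\
     &= c\,n \cdot \frac{\lg n\,(\lg n + 1)}{2} + O(n) \\
     &= O(n \lg^2 n).
\end{align*}
```

Level i of the recursion has 2^i subproblems of size n/2^i, so each level contributes at most c n lg n, and there are lg n levels.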


Algorithmic Improvements

I. simplify “short” phases

II. merge the last 2 steps of MedianEject when possible (2p+q vs. 3p+q)

III. apply II recursively

[Figure: three adjacent blocks of lengths p, q, p]


Approximation Ratio

• M(p) is the maximal total distance between pairs of out-of-order elements

Lemma 4: min cost is $\Omega(M(p))$

but

Lemma 6: # of out-of-order elements < 3 M(p)

+

Lemma 7: MedianEject touches only elements within linear range from out-of-order elements

yields:

• each round of MedianEject takes $O(M(p) \lg^2 n)$

• ReversalSort costs $O(M(p) \lg^2 n)$

• ReversalSort is at most $O(\lg^2 n)$ times optimal


Bioinformatic “Validation”

• use our cost (= distance) to build phylogenetic trees

• 4 plants (chloroplastic genes)

• consistent with [Martin et al., PNAS Sept ’02]

• work in progress [M. Shoham]

[Figure: phylogenetic tree over Cyanophora, Cyanidium, Guillardia, Porphyra]


Open Problems: Algorithmic

• weighted genes

• tighter approximation ratio
  – close to O(lg n)
  – can get to constant?

• other cost functions (incl. bitonic)

• the signed case


Open Problems: Modeling

• chromosomal ordering

• what is the right cost function?
  – consider $\mathrm{cost}(l) = l^d$

• combine with constant-based models
  – restricted regions
  – “undesired” reversal sequences

• deal with duplication and translocation events