Algorithms for Generalized Comparison of Minisatellites Behshad Behzadi & Jean-Marc Steyaert LIX, Ecole Polytechnique France


Algorithms for Generalized Comparison of Minisatellites

Behshad Behzadi & Jean-Marc Steyaert

LIX, Ecole Polytechnique, France


Outline

• Biology
• Evolutionary Model
• Problem description
• Previous works
• Algorithms
• Results


Biology…

• Minisatellites consist of tandem arrays of short repeat units found in the genomes of most higher eukaryotes.

• The high degree of polymorphism at minisatellites has applications ranging from forensic studies to investigating the origin of modern humans.


…Biology…

• These repeats are called variants.

• MVR-PCR is designed to find the variants.

• As an example, MSY1 is a minisatellite on the human Y chromosome. There are five different repeats (variants) in MSY1.


Different Repeat Types (Variants) of MSY1


Graphical representations of Minisatellite Maps of 13 males


Summary

• Biologists are able to compute minisatellite maps: sequences in which each repeat is replaced by its symbol.

• Studying the evolution of minisatellites is an important problem in human genetics.


Computer Science Model

• Each variant is a symbol of an alphabet.

• A minisatellite is a string over this alphabet.

• We need to compare these strings.


Evolutionary Operations

• Insertion
• Deletion
• Mutation
• Amplification (p-plication)
• Contraction (p-contraction)


Examples of operations (1)

• Insertion of d: abbc → abbdc

• Deletion of c: abbcb → abbb

• Mutation of c into d: caab → daab


Examples of operations (2)

• 4-plication of c: abcb → abccccb

• 2-contraction of b: abbc → abc

No subword replication or contraction: only single letters are amplified or contracted.
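
The five operations can be sketched as small string functions. This is only an illustration of the definitions above; the function names and the index-based interface are assumptions, not part of the original slides.

```python
def insert(s: str, i: int, x: str) -> str:
    """Insert letter x so that it ends up at index i of the result."""
    return s[:i] + x + s[i:]


def delete(s: str, i: int) -> str:
    """Delete the letter at index i."""
    return s[:i] + s[i + 1:]


def mutate(s: str, i: int, y: str) -> str:
    """Mutate the letter at index i into y."""
    return s[:i] + y + s[i + 1:]


def p_plication(s: str, i: int, p: int) -> str:
    """Replace the single letter s[i] by p copies of itself (p >= 2)."""
    if p < 2:
        raise ValueError("p must be at least 2")
    return s[:i] + s[i] * p + s[i + 1:]


def p_contraction(s: str, i: int, p: int) -> str:
    """Replace p consecutive copies of s[i], starting at index i, by one copy."""
    if p < 2 or s[i:i + p] != s[i] * p:
        raise ValueError("no run of p identical letters starts at index i")
    return s[:i + 1] + s[i + p:]


# The examples from the two slides above.
print(insert("abbc", 3, "d"))
print(delete("abbcb", 3))
print(mutate("caab", 0, "d"))
print(p_plication("abcb", 2, 4))
print(p_contraction("abbc", 1, 2))
```

Only a single letter is ever copied or collapsed, which matches the restriction that subwords are not replicated or contracted.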


Cost Functions

• I(x): insertion of letter x
• D(x): deletion of letter x
• M(x,y): mutation of x to y
• Ap(x): p-plication of letter x
• Cp(x): p-contraction of letter x


Hypotheses

• All the costs are positive.
• The cost of duplications (and contractions) is less than all other operations.
• Distance: M(x,x) = 0
• Triangle inequality holds: M(x,y) + M(y,z) ≥ M(x,z)
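
A cost model can be checked against these hypotheses mechanically. The sketch below assumes costs are stored in plain dictionaries (I and D keyed by letter, M by letter pair, A and C by (p, letter)); it also reads "less than all other operations" as "every amplification or contraction cost is below every insertion, deletion and non-trivial mutation cost". Both the layout and that reading are assumptions made for illustration.

```python
from itertools import product


def violated_hypotheses(alphabet, I, D, M, A, C):
    """Return the names of the hypotheses that the given cost tables violate."""
    problems = []
    dup_costs = list(A.values()) + list(C.values())
    other_costs = (
        [I[x] for x in alphabet]
        + [D[x] for x in alphabet]
        + [M[x, y] for x, y in product(alphabet, repeat=2) if x != y]
    )
    # All costs (other than M(x,x)) must be strictly positive.
    if any(c <= 0 for c in dup_costs + other_costs):
        problems.append("positivity")
    # Duplications and contractions must be cheaper than anything else.
    if dup_costs and other_costs and max(dup_costs) >= min(other_costs):
        problems.append("cheap duplications")
    if any(M[x, x] != 0 for x in alphabet):
        problems.append("M(x,x)=0")
    if any(M[x, y] + M[y, z] < M[x, z]
           for x, y, z in product(alphabet, repeat=3)):
        problems.append("triangle inequality")
    return problems


# Hypothetical cost tables over a three-letter alphabet.
alphabet = "abc"
I = {x: 3.0 for x in alphabet}
D = {x: 3.0 for x in alphabet}
M = {(x, y): (0.0 if x == y else 2.0) for x, y in product(alphabet, repeat=2)}
A = {(2, x): 1.0 for x in alphabet}
C = {(2, x): 1.0 for x in alphabet}

print(violated_hypotheses(alphabet, I, D, M, A, C))
```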


Transformation of s into t

• Applying a sequence of operations on s transforming it into t.

• An example: transforming xyy into xbcbxzc
  xyy → xy → xxy → xxz → xbxz → xbbxz → xbbbxz → xbcbxz → xbcbxzc

• The cost of a transformation is the sum of costs of its operations.
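
As a worked illustration, the example transformation of xyy into xbcbxzc can be written as a list of steps and its cost summed. The cost values below are hypothetical and depend only on the kind of operation; the slides allow symbol-dependent costs and give no numbers.

```python
# Hypothetical costs per kind of operation (not from the slides).
COST = {
    "contraction": 1,
    "amplification": 1,
    "mutation": 2,
    "insertion": 3,
    "deletion": 3,
}

# Each step: (kind of operation, string before, string after).
STEPS = [
    ("contraction",   "xyy",    "xy"),       # 2-contraction of y
    ("amplification", "xy",     "xxy"),      # 2-plication of x
    ("mutation",      "xxy",    "xxz"),      # y into z
    ("insertion",     "xxz",    "xbxz"),     # insertion of b
    ("amplification", "xbxz",   "xbbxz"),    # 2-plication of b
    ("amplification", "xbbxz",  "xbbbxz"),   # 2-plication of b
    ("mutation",      "xbbbxz", "xbcbxz"),   # b into c
    ("insertion",     "xbcbxz", "xbcbxzc"),  # insertion of c
]


def transformation_cost(steps, cost):
    """Sum the operation costs after checking that consecutive steps chain."""
    for (_, _, after), (_, before, _) in zip(steps, steps[1:]):
        if after != before:
            raise ValueError("steps do not chain")
    return sum(cost[kind] for kind, _, _ in steps)


print(transformation_cost(STEPS, COST))  # 14 with the hypothetical costs above
```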


Transformation distance between s and t

• TD = minimum cost over all possible transformations of s into t.

• A transformation that achieves this minimum is called an optimal transformation.
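
For very short strings the definition can be explored by exhaustive search. The sketch below is not the dynamic programming algorithm of this talk: it runs Dijkstra's algorithm over the graph whose vertices are strings and whose edges are single operations. It assumes hypothetical costs that depend only on the kind of operation, and it bounds the string length by `max_len` and the order by `max_p`, so in general the value it returns is only an upper bound on the transformation distance.

```python
import heapq


def neighbours(s, alphabet, cost, max_len, max_p):
    """Yield (string, cost) for every single operation applicable to s."""
    n = len(s)
    if n < max_len:
        for i in range(n + 1):
            for x in alphabet:
                yield s[:i] + x + s[i:], cost["insertion"]
    for i in range(n):
        yield s[:i] + s[i + 1:], cost["deletion"]
        for y in alphabet:
            if y != s[i]:
                yield s[:i] + y + s[i + 1:], cost["mutation"]
        for p in range(2, max_p + 1):
            if n + p - 1 <= max_len:
                yield s[:i] + s[i] * p + s[i + 1:], cost["amplification"]
            if s[i:i + p] == s[i] * p:
                yield s[:i + 1] + s[i + p:], cost["contraction"]


def brute_force_td(s, t, alphabet, cost, max_len=None, max_p=2):
    """Cheapest operation sequence from s to t within the length bound."""
    if max_len is None:
        max_len = max(len(s), len(t)) + 1
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > dist[u]:
            continue  # stale heap entry
        for v, c in neighbours(u, alphabet, cost, max_len, max_p):
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None  # t is not reachable within the bounds


# Hypothetical costs, chosen to satisfy the hypotheses of the model.
cost = {"insertion": 3, "deletion": 3, "mutation": 2,
        "amplification": 1, "contraction": 1}
print(brute_force_td("ab", "abb", "ab", cost))
```

The search space grows exponentially with `max_len`, which is why an efficient dynamic programming formulation is needed for real minisatellite maps.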


Previous Works

• Jobling et al.
• Bérard & Rivals (RECOMB’02)
• B.B. & J.M.S. (CPM2003, WABI2004): this work


Optimal Transformation between s and t

• In any transformation of s into t, each symbol of s is of one of two types.

• Generative vs. vanishing letters of s:
  - generative letters create a substring in t (generation);
  - vanishing letters disappear (reduction).


Basic lemmas

• The optimal generation of a non-empty string s from a symbol x can be achieved by a non-decreasing generation.

• In an optimal transformation of a string s to a string t any contraction operation can be done before any generation.


The schema of the proof

• Sequence u is eliminated sometime during the process.
• The right-hand side transformation is equivalent and less expensive w.r.t. evolution.


Optimal Transformation

• Generative and vanishing symbols can be transformed in two distinct optimized phases.


The Algorithm

• Preprocessing (substring generation costs) by dynamic programming

• Main part (transformation distance) by dynamic programming


Substring generation costs

• G[i,j,x] is the minimum generation cost of t[i..j] from symbol x among all generations which do not start with a mutation.

• T[i,j,x] is the minimum generation cost of substring t[i..j] from symbol x.

• mc[i,j,p,x] is the minimum generation cost for generating t[i..j] from symbol x among all possible generations starting with a p-plication.


mc[i, j, p, x]


Substring generation costs


Substring Reduction cost

• S[i,j] is the minimum cost of reduction of the substring s[i..j] into s[i].

• S[i,j] is determined in the same way.


Complexity

• The time complexity is

• The space complexity is

• The maximum possible p for a p-plication is noted by


Transformation Distance

– TD[i,j] is the transformation distance between s[1..i] and t[1..j].


Complexity

• The main algorithm complexity is O(n³) in time and O(n²) in space.

• The total time complexity is

• The total space complexity is


Further improvements

• Improving the complexity using the Run Length Encoded (RLE) string representation.

• The RLE of aaaabbbbcccabbbbcc is a⁴b⁴c³a¹b⁴c², also written a⁴b⁴c³ab⁴c².

• The lengths of the encoded strings with original lengths m and n are denoted by m' and n'.
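
Run-length encoding itself is straightforward. A minimal sketch using Python's `itertools.groupby` follows; the function names are illustrative, and the encoded length m' is simply the number of runs.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def format_rle(runs):
    """Write the runs as symbol followed by run length, e.g. a4b4c3."""
    return "".join(f"{symbol}{length}" for symbol, length in runs)


runs = run_length_encode("aaaabbbbcccabbbbcc")
print(format_rle(runs))  # a4b4c3a1b4c2
print(len(runs))         # encoded length: 6 runs for 18 symbols
```

The more repetitive a minisatellite map is, the smaller m' and n' are compared with m and n, which is where the RLE-based algorithm gains.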


Generation of Runs

• There exists an optimal generation of a non-empty string t from a single symbol x in which, for every run of size k > 1 in t, the k-1 rightmost symbols of the run are generated by duplications of the leftmost symbol of the run.


New configurations in the transformation

• Generations could split runs into several parts...
• Similarly for reductions...
• See the different configurations on examples.

[Figures: example configurations of generations and reductions, with symbols x, y and positions i, j, i', j']

Preprocessing: Generation Costs

• Compute the generation cost of all substrings of the target string t from any symbol x of the alphabet.

• Fill a table Gt[x,i,j] by recurrence.


Core Algorithm

• The transformation distances TD between s[1..i] and t[1..j] are computed by recurrence, using lemmas derived for the different configurations.

• Generalized dynamic programming is used again.

• Complexity: O(n'³ + m'³ + mn'² + nm'² + mn)


BS (Behzadi & Steyaert) algorithms vs. BR (Bérard & Rivals) algorithms

• Complexity improvement: from O(n⁴) to O(n³), and more with RLE (O(n²) experimentally)

• Generalization 1: amplifications and contractions of order > 2

• Generalization 2: symbol-dependent cost functions

• The triangle-inequality hypothesis on the cost functions is not restrictive and can be relaxed with some preprocessing.


Dataset

• Provided by Prof. M. Jobling

• Minisatellite maps of 690 Y chromosomes from worldwide populations.

• The length of the sequences is between 48 and 118.

• Distances were computed for all 690 × 690 pairs.


Running times

Algorithm                    Minisatellites Data             Random Sequences
                             Core          Pre-Processing    Core          Pre-Processing
This Work                    32.54 sec     0.03 sec          810.93 sec    2.14 sec
Behzadi & Steyaert (2003)    1012.38 sec   2.49 sec          1014.32 sec   2.51 sec
Bérard and Rivals (2002)     1062.23 sec   16.37 sec         1058.44 sec   15.90 sec


Conclusion

• A more efficient algorithm that computes the distances, and thus the phylogenetic trees, faster.

• A more general framework which can be used for modelling more complicated biological evolutions.


Thank you