Estimating seed sensitivity on homogeneous alignments


BIBE 2004 Taichung - May 20th, 2004

Gregory Kucherov1, Laurent Noé1, Yann Ponty2

1LORIA, Nancy 2LRI, Paris, France

Seed paradigm (FASTA, BLAST, PatternHunter, YASS, …)

Start with small, conserved and easily detected fragments (seeds).

Then extend the seeds and build possible alignments.

[Figure: dot plot of two DNA sequences showing the detected seeds and the detected alignment]

Spaced Seed Model [Ma & al. 02]

Seed Pattern : ###--#-##

‘#’ : obligatory match position
‘-’ : joker position (“don’t care” position)

Weight : 6 [number of #]
Span : 9 [number of all symbols]

Example :

###--#-##

ATCAGTGCAATGCTCAAGA
|||||:||:||||:|||||
ATCAGCGCGATGCGCAAGA
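The seed definition above can be illustrated with a minimal sketch that slides the pattern along the example alignment and reports where every '#' position falls on a match. The function name `seed_hits` and the 0-based start positions are assumptions made for this illustration; they do not come from the slides.

```python
def seed_hits(seed, top, bottom):
    """Return the 0-based start positions where the spaced seed hits
    the ungapped alignment of `top` against `bottom`."""
    if len(top) != len(bottom):
        raise ValueError("an ungapped alignment needs two sequences of equal length")
    # True where the two aligned letters are identical.
    matches = [a == b for a, b in zip(top, bottom)]
    # Offsets of the obligatory match positions ('#') inside the seed.
    required = [j for j, c in enumerate(seed) if c == "#"]
    return [
        start
        for start in range(len(matches) - len(seed) + 1)
        if all(matches[start + j] for j in required)
    ]


hits = seed_hits("###--#-##", "ATCAGTGCAATGCTCAAGA", "ATCAGCGCGATGCGCAAGA")
print(hits)  # [2, 9, 10]
```

On this alignment the seed hits at three positions; at each of them the mismatches fall only on joker positions.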






How to describe Selectivity and Sensitivity

Selectivity

seed weight: the number of random occurrences is ~ 4^{-weight}.

Sensitivity

probability for the seed to detect an interesting similarity.

To be specified:

• What set of similarities do we want to detect?

• What is the probability of each similarity?

What is a good seed? Sensitivity/Selectivity balance

Seed of relatively large weight:
– Few random seed matches (high selectivity)
– Possible loss of similarities (low sensitivity)

Seed of relatively small weight:
– Detect almost all possible similarities (high sensitivity)
– Many random seed matches (low selectivity)
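To make the weight side of this trade-off concrete, here is a small sketch of the 4^{-weight} estimate. It assumes independent, uniformly distributed nucleotides, which is a simplification, and the function name `random_hit_probability` is hypothetical.

```python
def random_hit_probability(weight, alphabet_size=4):
    """Probability that a seed of the given weight matches at one fixed
    position of two random sequences (uniform, independent letters)."""
    return alphabet_size ** -weight


for weight in (6, 9, 11):
    print(weight, random_hit_probability(weight))
```

Each additional obligatory match position divides the rate of random hits by 4, which is why heavier seeds are more selective.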

Similarity: notation

Ungapped similarities only (no indels)

CTACGATGAGCTGCT
|||:||:||||:|||
CTATGACGAGCGGCT

All matches are equiprobable, all mismatches are equiprobable (simplification)

A similarity is therefore represented as a binary word (1 = match, 0 = mismatch): 111011011110111

Similarities to be detected [Buhler et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …]

The set:
– all strings in {0,1}^n (given n)

The probabilities:
– Bernoulli model;
– Markov models;

Similarities to be detected [Buhler et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …]

Advantages:
– natural probability model;
– DP algorithms to compute sensitivity;

Disadvantage:
– Uninteresting similarities are included in the set.

Similarities to be detected and their probabilities: our approach

Only “true” similarities to be considered

Scoring Scheme

CTACGATGAGCTGCT
|||:||:||||:|||
CTATGACGAGCGGCT

+r+r+r-p+r+r-p+r+r+r+r-p+r+r+r

Score = 12r – 3p
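The scoring scheme can be sketched as follows, treating r as the reward for a match and p as the (positive) penalty for a mismatch. The function name `alignment_score` and the values r = 1, p = 3 in the usage lines are choices made for this illustration.

```python
def alignment_score(top, bottom, r, p):
    """Score an ungapped alignment: +r per match, -p per mismatch.

    Returns the per-position contributions and their sum."""
    if len(top) != len(bottom):
        raise ValueError("an ungapped alignment needs two sequences of equal length")
    per_position = [r if a == b else -p for a, b in zip(top, bottom)]
    return per_position, sum(per_position)


contributions, total = alignment_score("CTACGATGAGCTGCT", "CTATGACGAGCGGCT", r=1, p=3)
print(contributions)
print(total)  # 12 * 1 - 3 * 3 = 3
```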

Similarities to be detected and their probabilities: our approach

Homogeneous similarities:

– do not contain a sub-alignment of higher score (cf. Maximum Scoring Pairs);

– equivalently, all prefixes and suffixes of the similarity have non-negative score.
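The prefix/suffix characterization gives a direct test. A suffix score is the total score minus a prefix score, so it is enough to check that every prefix score stays between 0 and the total. The sketch below assumes that a similarity is given as a string of '1' (match) and '0' (mismatch); the function name `is_homogeneous` is hypothetical.

```python
from itertools import accumulate


def is_homogeneous(word, r, p):
    """Check that every prefix and every suffix of the binary word has a
    non-negative score (+r per '1', -p per '0')."""
    steps = [r if symbol == "1" else -p for symbol in word]
    prefix_scores = list(accumulate(steps))
    total = prefix_scores[-1] if prefix_scores else 0
    # Suffix score after position k is total - prefix_scores[k],
    # so "suffix >= 0" is the same as "prefix <= total".
    return all(0 <= score <= total for score in prefix_scores)


print(is_homogeneous("1101", r=1, p=1))  # True
print(is_homogeneous("1110", r=1, p=1))  # False: the last suffix "0" is negative
```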

Homogeneous similarities

[Figure: (prefix) score plotted along the alignment for a homogeneous alignment]

Homogeneous similarities occur entirely inside the shaded area.

Non-homogeneous similarities

[Figure: two plots of the score along the alignment. In the first, a negative suffix means that a prefix has a higher score. In the second, a negative prefix means that a suffix has a higher score.]

Our Model

The set: homogeneous similarities of given length n and given score S.

The probabilities: all similarities of the set have the same probability.

Problem Statement

Given:

1) a seed of weight w and span l,
2) an integer scoring scheme {r, p},
3) a similarity length n,
4) a score S.

Compute:

the probability for the seed to match a random homogeneous similarity of length n and score S.

Computation of Seed Sensitivity (homogeneous case)

To be computed:

probability for a seed to detect a homogeneous similarity, i.e. the fraction of homogeneous similarities of length n and score S that are detected by the seed.

Two steps of computation:

– Preprocessing: counting the number Nhom of all homogeneous similarities of given length n and score S.
• DP algorithm. Space and time complexity: [formula missing in the source]

– Seed sensitivity measure: counting the number of homogeneous similarities detected by the seed.
• DP algorithm (similar to Keich03).

Number of Homogeneous Similarities: Reduction to Graph Path problem

Vertices: {(k, y)}, where k is the length of a similarity and y is its score.

Vertex (k, y) corresponds to the set of all similarities of score y and length k.

2 edges from each (k, y):
– (k+1, y+r) for a match at position (k+1);
– (k+1, y-p) for a mismatch at position (k+1).

Homogeneous similarities correspond to paths from (0,0) to (n,S) inside the n x S grid.

[Figure: grid with the alignment position on the horizontal axis and the score on the vertical axis, showing a path from (0,0) through (k,y) to (n,S)]

Number of Homogeneous Similarities: Recursive equation

Score S fixed: count the possible paths from (0,0) to (n,S).

D(y, k) = D(y+r, k+1) + D(y-p, k+1)

where D(y, k) is the number of admissible paths from (k, y) to (n, S).

Taking into account border effects: a term whose score leaves the grid contributes 0.

[Figure: path from (0,0) through (k,y) to (n,S)]
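The recursion can be sketched as a backward dynamic program over k. This is a minimal illustration and not the authors' implementation: it assumes that D(y, k) counts paths from (k, y) to (n, S), that the border condition is 0 ≤ y ≤ S, and that p is given as a positive penalty. The function name `count_homogeneous` is hypothetical.

```python
def count_homogeneous(n, S, r, p):
    """Count binary words of length n and score S whose prefix scores
    all stay within 0..S (+r per match, -p per mismatch)."""
    if S < 0:
        return 0
    # D[y] holds D(y, k) for the current k; start at k = n,
    # where only the target vertex (n, S) has one (empty) path.
    D = [0] * (S + 1)
    D[S] = 1
    for _ in range(n):  # k = n-1, n-2, ..., 0
        previous = [0] * (S + 1)
        for y in range(S + 1):
            if y + r <= S:      # match edge stays inside the grid
                previous[y] += D[y + r]
            if y - p >= 0:      # mismatch edge stays inside the grid
                previous[y] += D[y - p]
        D = previous
    return D[0]


print(count_homogeneous(4, 2, r=1, p=1))  # 2 (the words 1011 and 1101)
```

Only one column of the grid is kept at a time, so this sketch uses S + 1 counters and performs about n·S additions.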

Time and space complexity

Space Complexity

Time Complexity

Computer experiments: Homogeneous vs. All similarities

Compare the sensitivity of seeds on both models:
– Fixed score S according to the scoring scheme (match +1, mismatch -3, i.e. r = 1, p = 3)
– Similarity length varies from 20 to 120
– Two sets of similarities:
(1) all similarities of given length and score;
(2) only homogeneous similarities of given length and score.

Comparison plots:
– x axis : similarity length
– y axis : sensitivity (probability that the seed matches a similarity)
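For very short similarities the two sets can be compared by plain enumeration. The sketch below is a brute-force reference rather than the DP used for the experiments: it visits all 2^n binary words, so it is only usable for small n, far below the lengths 20 to 120 of the plots. The names `seed_matches` and `sensitivity` are hypothetical, and p is taken as a positive penalty.

```python
from itertools import product


def seed_matches(seed, word):
    """True if the spaced seed hits the binary word (1 = match) somewhere."""
    required = [j for j, c in enumerate(seed) if c == "#"]
    return any(
        all(word[start + j] for j in required)
        for start in range(len(word) - len(seed) + 1)
    )


def sensitivity(seed, n, S, r, p, homogeneous):
    """Fraction of similarities of length n and score S hit by the seed.

    With homogeneous=True only words whose prefix scores stay in 0..S
    are counted; otherwise all words of score S are counted."""
    total = detected = 0
    for word in product((0, 1), repeat=n):
        y, inside = 0, True
        for bit in word:
            y += r if bit else -p
            inside = inside and 0 <= y <= S
        if y != S or (homogeneous and not inside):
            continue
        total += 1
        detected += seed_matches(seed, word)
    return detected / total if total else float("nan")


for flag in (False, True):
    print(flag, sensitivity("##-#", n=12, S=4, r=1, p=3, homogeneous=flag))
```

Both sets are assumed to be uniformly distributed here, which matches the model for homogeneous similarities; for the set of all similarities this is simply the counting analogue.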

Experiments (score 16)

[Plot: sensitivity vs. similarity length for the contiguous seed (weight 11) ###########]

[Plot: sensitivity vs. similarity length for the spaced seed (weight 11) ###-#--#-#--##-###]


Optimal seeds

[Figure: comparison of the optimal seeds on the two models, with cases where the optimal seeds are different and cases where they are the same]

Summary

We have proposed

– a new definition of seed sensitivity based on the notion of homogeneous similarity;

– a DP algorithm to compute the sensitivity of a given seed.

The sensitivity of a seed on homogeneous similarities is usually substantially larger than on all similarities.

The optimal seed on homogeneous similarities may not be optimal on all similarities, and vice versa.

Extensions

Combining homogeneity constraint with properties of DNA sequences

Distinguishing different mismatches (transitions/transversions): YASS http://www.loria.fr/projects/YASS/

Estimating seed sensitivity on homogeneous alignments

Gregory Kucherov1, Laurent Noé1, Yann Ponty2

1LORIA (Laboratoire lorrain de recherche en informatique et ses applications), Nancy, France

2LRI (Laboratoire de recherche en informatique), Paris, France

Collaborators

Thanks !!

Mikhail Roytberg (Institute of Mathematical Problems of Biology, Russia) for his comments and helpful discussions during the preparation of this work.

Alain Denise (Laboratoire de Recherche en Informatique, France) for his help on culminating paths.

Thank you for your attention!

??

Number of Detected Homogeneous Similarities: Reduction to Graph Path problem (after Keich03)

Vertices: {(k, y, t, u)}, where

– k is the length of a similarity, y is its score;

– 1 ≤ t ≤ l ; u is a binary word of length l-w.

Vertex (k, y, t, u) corresponds to the set H(k, y, t, u) of homogeneous similarities of length k and score y such that:

– t is the length of the maximal prefix of the seed that matches the end of a word v from H(k, y, t, u);

– u is the word formed by the symbols of v at the joker positions within this match;

– t and u are the same for all similarities in H(k, y, t, u).

Number of Detected Homogeneous Similarities: example

Seed: ###--#-##   Scoring scheme: match +1, mismatch -2 (r = 1 ; p = 2)

Current vertex (k,y): k = 17, y = 8, t = 8, u = [word shown in the figure]

Match at position k+1: k’ = k + 1, y’ = y + r, t’ = t + 1, u’ = u, which gives k’ = 18, y’ = 9, t’ = 9.

Mismatch at position k+1: k’ = k + 1, y’ = y - p, (t’, u’) = F(t, u), which gives k’ = 18, y’ = 6, t’ = 5, u’ = [word shown in the figure]

[Figure: score path reaching (k,y), with the seed ###--#-## aligned against the end of the similarity and the matched prefix length t and joker word u marked]
Number of Detected Homogeneous Similarities: Reduction to Graph Path problem (after Keich03)

Edges: 2 edges from “each” (k, y, t, u):

– (k+1, y+r, t+1, u) for a match at the (k+1)-th position;

– (k+1, y-p, t’, u’) for a mismatch at the (k+1)-th position;

t’, u’ can be pre-computed.
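The counting of detected similarities can be sketched with a simpler state than (t, u). This is a substitute technique, named plainly: instead of the compact seed-prefix state described on the slides, the sketch remembers the last l-1 symbols of the similarity as a bit window, which is easier to write but needs up to 2^(l-1) windows per score. It counts the homogeneous similarities that are never hit and subtracts them from the total. The function name `homogeneous_sensitivity` is hypothetical, and p is taken as a positive penalty.

```python
from collections import defaultdict


def homogeneous_sensitivity(seed, n, S, r, p):
    """Return (N_hom, N_detected) for homogeneous similarities of
    length n and score S (+r per match, -p per mismatch)."""
    span = len(seed)
    # Bit mask of the '#' positions; the newest symbol is the lowest bit.
    mask = sum(1 << (span - 1 - j) for j, c in enumerate(seed) if c == "#")
    keep = (1 << (span - 1)) - 1          # keeps the last span-1 symbols
    total = {0: 1}                         # score -> number of prefixes
    missed = {(0, 0): 1}                   # (score, window) -> prefixes not hit yet
    for k in range(1, n + 1):
        next_total, next_missed = defaultdict(int), defaultdict(int)
        for y, count in total.items():
            for dy in (r, -p):
                if 0 <= y + dy <= S:
                    next_total[y + dy] += count
        for (y, window), count in missed.items():
            for bit, dy in ((1, r), (0, -p)):
                if not 0 <= y + dy <= S:
                    continue               # leaves the grid: not homogeneous
                full = (window << 1) | bit
                if k >= span and (full & mask) == mask:
                    continue               # seed hit: no longer "missed"
                next_missed[(y + dy, full & keep)] += count
        total, missed = next_total, next_missed
    n_hom = total.get(S, 0)
    n_missed = sum(c for (y, _), c in missed.items() if y == S)
    return n_hom, n_hom - n_missed


n_hom, n_detected = homogeneous_sensitivity("###--#-##", n=32, S=8, r=1, p=3)
print(n_hom, n_detected, n_detected / n_hom if n_hom else float("nan"))
```

The window state is adequate for short seeds like the one in the usage line; for long spans the (t, u) state of the slides is the more economical choice.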

[Plot: contiguous seed (weight 11) ###########]

[Plot: spaced seed (weight 11) ###-#--#-#--##-###]


Example 1

[Figure: score along the alignment, seed ###-###, detected alignment]

Example 2

[Figure: score along the alignment, seed ###-###, detected alignment]

Example 3

[Figure: score along the alignment, seed ###-###, detected alignment]
