Estimating seed sensitivity on homogeneous alignments. BIBE 2004, Taichung, May 20th, 2004. Gregory Kucherov (1), Laurent Noé (1), Yann Ponty (2). (1) LORIA, Nancy; (2) LRI, Paris, France.


Page 1: Estimating seed sensitivity on homogeneous alignments

Estimating seed sensitivity on homogeneous alignments

BIBE 2004 Taichung - May 20th, 2004

Gregory Kucherov (1), Laurent Noé (1), Yann Ponty (2)

(1) LORIA, Nancy; (2) LRI, Paris, France

Page 2: Estimating seed sensitivity on homogeneous alignments

Seed paradigm (FASTA, BLAST, PatternHunter, YASS, …)

Start with small, conserved, easily detected fragments (seeds).

Then extend the seeds and build possible alignments.

[Figure: dot plot of two DNA sequences, marking the detected seeds and the detected alignment]

Page 3: Estimating seed sensitivity on homogeneous alignments

Spaced Seed Model [Ma & al. 02]

Seed pattern: ###--#-##

‘#’ : obligatory match position
‘-’ : joker position (“don’t care” position)

Weight: 6 [number of #]
Span: 9 [number of all symbols]

Example (the pattern is slid along the alignment):

###--#-##
ATCAGTGCAATGCTCAAGA
|||||:||:||||:|||||
ATCAGCGCGATGCGCAAGA
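The sliding of the pattern can be sketched in a few lines of Python. This is only an illustration under the definitions on this slide; the function name `seed_hits` and the 0-based offsets are assumptions of the sketch, not part of the talk.

```python
def seed_hits(seed: str, top: str, bottom: str) -> list[int]:
    """Return the 0-based offsets at which the spaced seed matches an ungapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("ungapped alignment: both rows must have the same length")
    # Only '#' positions must face a match; '-' (joker) positions are ignored.
    required = [i for i, c in enumerate(seed) if c == "#"]
    hits = []
    for offset in range(len(top) - len(seed) + 1):
        if all(top[offset + i] == bottom[offset + i] for i in required):
            hits.append(offset)
    return hits


seed = "###--#-##"
weight, span = seed.count("#"), len(seed)
hits = seed_hits(seed, "ATCAGTGCAATGCTCAAGA", "ATCAGCGCGATGCGCAAGA")
print(weight, span, hits)
```

On the example alignment the pattern fits at three of the eleven possible offsets; at every other offset at least one '#' faces one of the three mismatches.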

Page 4: Estimating seed sensitivity on homogeneous alignments

How to describe Selectivity and Sensitivity

Selectivity
seed weight: number of random occurrences ~ 4^(-weight).

Sensitivity
probability for the seed to detect an interesting similarity.

To be specified:

• What set of similarities do we want to detect?

• What is the probability of each similarity?
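As a rough numerical illustration of the selectivity estimate, the sketch below assumes independent, uniformly distributed nucleotides, which the slide does not state explicitly. The function names and the sequence lengths are hypothetical.

```python
def random_hit_probability(weight: int, alphabet_size: int = 4) -> float:
    """Probability that all '#' positions match at one random position pair."""
    return alphabet_size ** -weight


def expected_random_hits(weight: int, span: int, len_a: int, len_b: int) -> float:
    """Expected number of chance seed hits between two random sequences."""
    # Number of position pairs where a seed of this span fits in both sequences.
    positions = max(len_a - span + 1, 0) * max(len_b - span + 1, 0)
    return positions * random_hit_probability(weight)


# Weight 6, span 9 (the pattern ###--#-##) on two sequences of length 1008.
print(random_hit_probability(6))
print(expected_random_hits(6, 9, 1008, 1008))
```

Each extra '#' divides the number of chance hits by four, which is why heavier seeds are more selective.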

Page 5: Estimating seed sensitivity on homogeneous alignments

What is a good seed? Sensitivity/Selectivity balance

Seed of relatively large weight:
– Few random seed matches (high selectivity)
– Possible loss of similarities (low sensitivity)

Seed of relatively small weight:
– Detect almost all possible similarities (high sensitivity)
– Many random seed matches (low selectivity)

Page 6: Estimating seed sensitivity on homogeneous alignments

Similarity: notation

Ungapped similarities only (no indels)

CTACGATGAGCTGCT
|||:||:||||:|||
CTATGACGAGCGGCT

All matches are equiprobable, all mismatches are equiprobable (simplification)

The similarity is then represented as a binary word.
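A minimal sketch of the reduction to a binary word, assuming the encoding 1 = match and 0 = mismatch (the slide only says "binary word"; the encoding and the name `to_binary_word` are choices made for this illustration):

```python
def to_binary_word(top: str, bottom: str) -> str:
    """Encode an ungapped alignment as a binary word: '1' for a match, '0' for a mismatch."""
    if len(top) != len(bottom):
        raise ValueError("ungapped alignment: both rows must have the same length")
    return "".join("1" if a == b else "0" for a, b in zip(top, bottom))


word = to_binary_word("CTACGATGAGCTGCT", "CTATGACGAGCGGCT")
print(word)
```

Once the alignment is a binary word, the nucleotides themselves no longer matter: a seed hit depends only on where the 1s are.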

Page 7: Estimating seed sensitivity on homogeneous alignments

Similarities to be detected [Buhler et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …]

The set:
– all strings in {0,1}^n (given n)

The probabilities:
– Bernoulli model;
– Markov models;
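For small n, the sensitivity under the Bernoulli model can be checked by plain enumeration of {0,1}^n. The sketch below is such a brute-force check, not the DP algorithms of the cited papers; `p_match` (the probability of a match at each position) and the function names are assumptions.

```python
from itertools import product


def seed_matches(seed: str, word: tuple) -> bool:
    """True if the seed fits somewhere in the binary word (1 = match)."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(word[offset + i] for i in required)
        for offset in range(len(word) - len(seed) + 1)
    )


def bernoulli_sensitivity(seed: str, n: int, p_match: float) -> float:
    """Probability that the seed hits a random word of {0,1}^n with i.i.d. positions."""
    total = 0.0
    for word in product((0, 1), repeat=n):
        if seed_matches(seed, word):
            matches = sum(word)
            total += p_match ** matches * (1 - p_match) ** (n - matches)
    return total


print(bernoulli_sensitivity("##", 3, 0.5))
```

The enumeration visits 2^n words, so it is only usable for short similarities.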

Page 8: Estimating seed sensitivity on homogeneous alignments

Similarities to be detected [Buhler et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …]

Advantage:
– natural probability model;
– DP algorithms to compute sensitivity;

Disadvantage:
– Uninteresting similarities are included in the set.

Page 9: Estimating seed sensitivity on homogeneous alignments

Similarities to be detected and their probabilities: our approach

Only “true” similarities to be considered

Scoring Scheme

CTACGATGAGCTGCT
|||:||:||||:|||
CTATGACGAGCGGCT

+r+r+r-p+r+r-p+r+r+r+r-p+r+r+r

Score = 12r – 3p
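The scoring of the example can be reproduced as follows. The sketch assumes the alignment is given as a binary word (1 = match) and that p is a positive penalty, as in "Score = 12r – 3p"; the values r = 1, p = 3 are only illustrative.

```python
from itertools import accumulate


def column_scores(word: str, r: int, p: int) -> list[int]:
    """Score of each column: +r for a match ('1'), -p for a mismatch ('0')."""
    return [r if c == "1" else -p for c in word]


def prefix_scores(word: str, r: int, p: int) -> list[int]:
    """Running score after each column of the alignment."""
    return list(accumulate(column_scores(word, r, p)))


word = "111011011110111"  # the alignment of this slide: 12 matches, 3 mismatches
r, p = 1, 3               # illustrative values
total = sum(column_scores(word, r, p))
print(total, prefix_scores(word, r, p))
```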

Page 10: Estimating seed sensitivity on homogeneous alignments

Similarities to be detected and their probabilities: our approach

Homogeneous similarities:

do not contain a sub-alignment of higher score (cf. Maximum Scoring Pairs);

equivalently, all prefixes and suffixes of the similarity have non-negative score.
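The definition can be tested directly on a binary word. A suffix score equals the total score minus the score of the complementary prefix, so "all suffixes non-negative" is the same as "no prefix score above the total". The sketch below uses that observation; the function name and the non-strict inequalities (as worded on the slide) are assumptions.

```python
from itertools import accumulate


def is_homogeneous(word: str, r: int, p: int) -> bool:
    """True if every prefix and every suffix of the binary word has non-negative score."""
    steps = [r if c == "1" else -p for c in word]
    total = sum(steps)
    # prefix >= 0 checks the prefixes; prefix <= total checks the suffixes.
    return all(0 <= prefix <= total for prefix in accumulate(steps))


print(is_homogeneous("111011011110111", 1, 1))
print(is_homogeneous("111011011110111", 1, 3))
```

With p = 3 the second mismatch of the example drives the prefix score below zero, so the same alignment is homogeneous under one scoring scheme and not under the other.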

Page 11: Estimating seed sensitivity on homogeneous alignments

Homogeneous similarities

[Figure: prefix score plotted along the alignment for a homogeneous alignment; homogeneous similarities occur entirely inside the shaded area]

Page 12: Estimating seed sensitivity on homogeneous alignments

Non homogeneous similarities

[Figure: two plots of score along the alignment. In one, a negative suffix leaves a prefix of higher score; in the other, a negative prefix leaves a suffix of higher score.]

Page 13: Estimating seed sensitivity on homogeneous alignments

Our Model

The set:
Homogeneous similarities of given length n and given score S

The probabilities:
all similarities of the set have the same probability.

Page 14: Estimating seed sensitivity on homogeneous alignments

Problem Statement

Given:

1) a seed of weight w and span l,
2) integer scoring scheme {r, p},
3) similarity length n,
4) score S.

Compute:

the probability for the seed to match a random homogeneous similarity of length n and score S
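For very small n the problem can be solved by exhaustive enumeration, which is useful as a reference when testing a faster algorithm. The sketch below is such a brute-force baseline, not the method of the talk. It assumes p is a positive penalty, non-strict inequalities in the homogeneity test, and hypothetical function names.

```python
from itertools import accumulate, product


def seed_matches(seed: str, word: tuple) -> bool:
    """True if the seed fits somewhere in the binary word (1 = match)."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(word[offset + i] for i in required)
        for offset in range(len(word) - len(seed) + 1)
    )


def homogeneous_words(n: int, S: int, r: int, p: int):
    """Yield all homogeneous binary words of length n >= 1 and score S."""
    for word in product((0, 1), repeat=n):
        prefixes = list(accumulate(r if bit else -p for bit in word))
        if prefixes[-1] == S and all(0 <= y <= S for y in prefixes):
            yield word


def homogeneous_sensitivity(seed: str, n: int, S: int, r: int, p: int) -> float:
    """Fraction of homogeneous similarities of length n and score S hit by the seed."""
    if n < 1:
        raise ValueError("n must be at least 1")
    words = list(homogeneous_words(n, S, r, p))
    if not words:
        raise ValueError("no homogeneous similarity with these parameters")
    return sum(seed_matches(seed, w) for w in words) / len(words)


print(homogeneous_sensitivity("##-#", 6, 2, 1, 1))
```

With n = 6, S = 2, r = p = 1 there are four homogeneous words (101011, 101101, 110011, 110101), and the seed ##-# hits two of them.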

Page 15: Estimating seed sensitivity on homogeneous alignments

Computation of Seed Sensitivity (homogeneous case)

To be computed:

probability for a seed to detect a homogeneous similarity, i.e. the number of homogeneous similarities detected by the seed divided by Nhom.

Two steps of computation
– Preprocessing: counting the number Nhom of all homogeneous similarities of given length n and score S.
  • DP algorithm; space and time complexity: [formula not preserved]
– Seed sensitivity measure: counting the number of homogeneous similarities detected by the seed.
  • DP algorithm: [formula not preserved] (similar to Keich03).

Page 16: Estimating seed sensitivity on homogeneous alignments

Number of Homogeneous Similarities: Reduction to Graph Path problem

Vertices: {(k, y)}, where k is the length of a similarity and y is its score.

Vertex (k, y) corresponds to the set of all similarities of score y and length k.

2 edges from each (k, y):
– (k+1, y+r) for a match at position (k+1);
– (k+1, y-p) for a mismatch at position (k+1).

Homogeneous similarities correspond to paths from (0,0) to (n,S) inside the n x S grid.

[Figure: grid with the alignment position on one axis and the score on the other, showing a path from (0,0) through (k,y) to (n,S)]

Page 17: Estimating seed sensitivity on homogeneous alignments

Number of Homogeneous Similarities: Recursive equation

Score S fixed:

number of possible paths from (0,0) to (n,S), where D(k, y) is the number of paths from (k, y) to (n,S):

D(k, y) = D(k+1, y+r) + D(k+1, y-p)

Taking into account border effects (paths must stay inside the grid).

[Figure: grid showing a path from (0,0) through (k,y) to (n,S)]
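The recursion can be sketched as a small dynamic program. The sketch assumes positive integers r and p, treats the border effects as "the score must stay between 0 and S", and keeps only one column of the table at a time; the function name is hypothetical and this is not necessarily the authors' implementation.

```python
def count_homogeneous(n: int, S: int, r: int, p: int) -> int:
    """Count paths from (0,0) to (n,S) with steps +r / -p that stay within 0 <= y <= S."""
    if S < 0:
        return 0
    # column[y] holds D(k, y); start with k = n, where only (n, S) counts.
    column = [0] * (S + 1)
    column[S] = 1
    for _ in range(n):
        previous = [0] * (S + 1)
        for y in range(S + 1):
            if y + r <= S:          # match edge stays inside the grid
                previous[y] += column[y + r]
            if y - p >= 0:          # mismatch edge stays inside the grid
                previous[y] += column[y - p]
        column = previous
    return column[0]                # D(0, 0)


print(count_homogeneous(20, 16, 1, 3))
```

For n = 20, S = 16, r = 1, p = 3 a similarity has exactly one mismatch, which must fall between positions 4 and 17, so the count is 14. This sketch runs in time proportional to n·S and uses memory proportional to S.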

Page 18: Estimating seed sensitivity on homogeneous alignments

Time and space complexity

Space Complexity: [formula not preserved]

Time Complexity: [formula not preserved]

Page 19: Estimating seed sensitivity on homogeneous alignments

Computer experiments: Homogeneous vs. All similarities

Compare the sensitivity of seeds on both models
– Fixed score S according to the scoring scheme (r = 1, p = 3: +1 per match, -3 per mismatch)
– Similarity length varies from 20 to 120
– Two sets of similarities:
  (1) all similarities of given length and score;
  (2) only homogeneous similarities of given length and score;

Comparison plots
– x axis: similarity length
– y axis: sensitivity (probability that the seed matches a similarity)

Page 20: Estimating seed sensitivity on homogeneous alignments

Experiments (score 16)

contiguous seed (weight 11): ###########

[Figure: sensitivity plotted against similarity length for this seed]

Page 21: Estimating seed sensitivity on homogeneous alignments

Experiments (score 16)

spaced seed (weight 11): ###-#--#-#--##-###

[Figure: sensitivity plotted against similarity length for this seed]

Page 22: Estimating seed sensitivity on homogeneous alignments

Optimal seeds

Optimal seeds are different

Optimal seeds are the same

Page 23: Estimating seed sensitivity on homogeneous alignments

Summary

We have proposed

– a new definition of seed sensitivity based on the notion of homogeneous similarity;

– a DP algorithm to compute the sensitivity of a given seed.

Sensitivity of a seed on homogeneous similarities is usually substantially larger than on all similarities.

An optimal seed on homogeneous similarities may not be optimal on all similarities, and vice versa.

Page 24: Estimating seed sensitivity on homogeneous alignments

Extensions

Combining homogeneity constraint with properties of DNA sequences

Distinguishing different mismatches (transitions/transversions): YASS http://www.loria.fr/projects/YASS/

Page 25: Estimating seed sensitivity on homogeneous alignments

Estimating seed sensitivity on homogeneous alignments

Gregory Kucherov (1), Laurent Noé (1), Yann Ponty (2)

(1) LORIA (Laboratoire lorrain de recherche en informatique et ses applications), Nancy, France

(2) LRI (Laboratoire de recherche en informatique), Paris, France

Page 26: Estimating seed sensitivity on homogeneous alignments

Collaborators

Thanks!

Mikhail Roytberg (Institute of Mathematical Problems of Biology, Russia) for his comments and helpful discussion during the preparation of this work.

Alain Denise (Laboratoire de Recherche en Informatique, France) for his help on culminating paths.

Page 27: Estimating seed sensitivity on homogeneous alignments

Thank you for your attention!

????


Page 29: Estimating seed sensitivity on homogeneous alignments

Number of Detected Homogeneous Similarities: Reduction to Graph Path problem (after Keich03)

Vertices: {(k, y, t, u)}, where

k is the length of a similarity, y is its score;

1 ≤ t ≤ l; u is a binary word of length l-w.

Vertex (k, y, t, u) corresponds to the set H(k, y, t, u) of homogeneous similarities of length k and score y such that:

– t is the length of the maximal prefix of the seed matching the end of a word v from H(k, y, t, u);

– u is the word consisting of the symbols at the joker positions within this match;

– t and u are the same for all similarities from H(k, y, t, u).

Page 30: Estimating seed sensitivity on homogeneous alignments

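Number of Detected Homogeneous Similarities

Seed: ###--#-##    Scoring scheme: r = 1, p = 2 (+1 per match, -2 per mismatch)

Current vertex (k, y): k = 17, y = 8, t = 8, u = [word shown in the figure]

Match at position k+1: k’ = k + 1, y’ = y + r, t’ = t + 1, u’ = u

[Figure: score path reaching (k, y), with the seed prefix of length t aligned to the end of the similarity and the joker symbols u marked]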

Page 31: Estimating seed sensitivity on homogeneous alignments

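Number of Detected Homogeneous Similarities

Seed: ###--#-##    Scoring scheme: r = 1, p = 2 (+1 per match, -2 per mismatch)

After a match: k’ = 17 + 1, y’ = 8 + r, t’ = t + 1, u’ = u

Resulting vertex (k’, y’): k’ = 18, y’ = 9, t’ = 9, u’ = [word shown in the figure]

[Figure: score path extended to (k’, y’), with the seed prefix of length t+1 aligned to the end of the similarity]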

Page 32: Estimating seed sensitivity on homogeneous alignments

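Number of Detected Homogeneous Similarities

Seed: ###--#-##    Scoring scheme: r = 1, p = 2 (+1 per match, -2 per mismatch)

Current vertex (k, y): k = 17, y = 8, t = 8, u = [word shown in the figure]

Mismatch at position k+1: k’ = 17 + 1, y’ = 8 - p, t’ = ?, u’ = ?

[Figure: score path reaching (k, y), with the seed prefix of length t aligned to the end of the similarity and the joker symbols u marked]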

Page 33: Estimating seed sensitivity on homogeneous alignments

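Number of Detected Homogeneous Similarities

Seed: ###--#-##    Scoring scheme: r = 1, p = 2 (+1 per match, -2 per mismatch)

After a mismatch: k’ = 17 + 1, y’ = 8 - p, t’ = ?, u’ = ?

Resulting vertex (k’, y’): k’ = 18, y’ = 6, t’ = 5, u’ = [word shown in the figure]

(t’, u’) = F(t, u);

[Figure: score path extended to (k’, y’), with the shorter seed prefix of length t’ aligned to the end of the similarity and the joker symbols u’ marked]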

Page 34: Estimating seed sensitivity on homogeneous alignments

Number of Detected Homogeneous Similarities: Reduction to Graph Path problem (after Keich03)

Edges: 2 edges from “each” (k, y, t, u):

(k+1, y+r, t+1, u) - for a match at the (k+1)-th position;

(k+1, y-p, t’, u’) - for a mismatch at the (k+1)-th position;

t’, u’ can be pre-computed.
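The counting of detected similarities can be sketched with a simplified state space. Instead of the compact (t, u) states described above, the sketch below remembers the last l-1 symbols of the prefix and a flag telling whether the seed has already hit. This is easier to write but exponential in the span, so it is a swapped-in simplification rather than the algorithm of the talk. It assumes positive integers r and p; the function name is hypothetical.

```python
from collections import defaultdict


def count_detected(seed: str, n: int, S: int, r: int, p: int) -> tuple[int, int]:
    """Return (Nhom, number of homogeneous similarities of length n and score S hit by the seed)."""
    span = len(seed)
    required = [i for i, c in enumerate(seed) if c == "#"]
    # State: (score y, last span-1 symbols, seed already hit) -> number of prefixes.
    states = {(0, (), False): 1}
    for _ in range(n):
        following = defaultdict(int)
        for (y, tail, hit), count in states.items():
            for bit, delta in ((1, r), (0, -p)):
                y2 = y + delta
                if not 0 <= y2 <= S:
                    continue  # a prefix or a suffix would have negative score
                window = tail + (bit,)
                hit2 = hit or (len(window) == span and all(window[i] for i in required))
                tail2 = window[-(span - 1):] if span > 1 else ()
                following[(y2, tail2, hit2)] += count
        states = following
    n_hom = sum(c for (y, _, _), c in states.items() if y == S)
    n_detected = sum(c for (y, _, hit), c in states.items() if y == S and hit)
    return n_hom, n_detected


n_hom, n_detected = count_detected("##-#", 6, 2, 1, 1)
print(n_hom, n_detected, n_detected / n_hom)
```

The sensitivity is the ratio of the two counts; on this toy case the seed ##-# detects two of the four homogeneous similarities.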

Page 35: Estimating seed sensitivity on homogeneous alignments

Experiments (score 32)

contiguous seed (weight 11): ###########

[Figure: sensitivity plotted against similarity length for this seed]

Page 36: Estimating seed sensitivity on homogeneous alignments

Experiments (score 32)

spaced seed (weight 11): ###-#--#-#--##-###

[Figure: sensitivity plotted against similarity length for this seed]

Page 37: Estimating seed sensitivity on homogeneous alignments

Example 1

Seed ###-###

[Figure: score plotted along the alignment, with the detected alignment marked]

Page 38: Estimating seed sensitivity on homogeneous alignments

Example 2

Seed ###-###

[Figure: score plotted along the alignment, with the detected alignment marked]

Page 39: Estimating seed sensitivity on homogeneous alignments

Example 3

Seed ###-###

[Figure: score plotted along the alignment, with the detected alignment marked]