Multi-seed lossless filtration. Gregory Kucherov, Laurent Noé (LORIA/INRIA, Nancy, France), Mikhail Roytberg (Institute of Mathematical Problems in Biology, Puschino, Russia). CPM (Istanbul), July 5-7, 2004


Page 1: Multi-seed lossless filtration

Multi-seed lossless filtration

Gregory Kucherov, Laurent Noé
LORIA/INRIA, Nancy, France

Mikhail Roytberg
Institute of Mathematical Problems in Biology, Puschino, Russia

CPM (Istanbul), July 5-7, 2004

Page 2: Multi-seed lossless filtration

Text filtration: general principle

potential matches

Page 4: Multi-seed lossless filtration

Text filtration: general principle

lossless and lossy filters

true match

Page 5: Multi-seed lossless filtration

Filtration applied to sequence comparison

potential similarities

Page 6: Multi-seed lossless filtration

Filtration applied to sequence alignment

potential similarities

Page 7: Multi-seed lossless filtration

Filtration applied to sequence alignment

true similarities

Page 8: Multi-seed lossless filtration

Gapless similarities. Hamming distance.

Similarities are defined through Hamming distance

GCTACGACTTCGAGCTGC

...CTCAGCTATGACCTCGAGCGGCCTATCTA...
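The example on this slide can be checked directly. The following is a minimal sketch that slides the 18-letter pattern over the text fragment shown above and reports every window within a given Hamming distance; the function names `hamming` and `similar_windows` are illustrative and do not come from the slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def similar_windows(pattern, text, k):
    """Return (offset, distance) for every window of text within distance k."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        d = hamming(pattern, text[start:start + m])
        if d <= k:
            hits.append((start, d))
    return hits


pattern = "GCTACGACTTCGAGCTGC"             # m = 18
text = "CTCAGCTATGACCTCGAGCGGCCTATCTA"    # text fragment from the slide
print(similar_windows(pattern, text, k=3))
```

The window starting at offset 4 of the fragment differs from the pattern in exactly 3 positions, so the pair is a similarity of length 18 with 3 mismatches.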

Page 9: Multi-seed lossless filtration

Gapless similarities. Hamming distance.

Similarities are defined through Hamming distance

Page 10: Multi-seed lossless filtration

Gapless similarities. Hamming distance.

Similarities are defined through Hamming distance

(m,k)-problem, (m,k)-instances: similarities of length m with at most k mismatches

Page 11: Multi-seed lossless filtration

Gapless similarities. Hamming distance.

Similarities are defined through Hamming distance

(m,k)-problem, (m,k)-instances: similarities of length m with at most k mismatches

This work: lossless filtering

Page 12: Multi-seed lossless filtration

Filtering by contiguous fragment

PEX (Navarro & Raffinot 2002)
– Searching for a contiguous pattern

PEX with errors
– Searching for a contiguous pattern with l possible errors
• requires retrieval of all l-variants in the index. Efficient for
  – small alphabets (DNA, RNA)
  – relatively small l (<= 2)

Figure: an (m,k)-instance with m=18, k=3 contains a conserved fragment #### (PEX) or a fragment #########(1) with at most one error (PEX with errors)
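The pigeonhole argument behind PEX can be sketched as follows: if the pattern is cut into k+1 pieces, at most k of them can contain a mismatch, so at least one piece occurs exactly. This is an illustrative sketch only, not the Navarro & Raffinot implementation; `pex_candidates` and its piece boundaries are assumptions made for the example, and every candidate still has to be verified by a full Hamming comparison.

```python
def pex_candidates(pattern, text, k):
    """Candidate start offsets obtained by exact search of k+1 pattern pieces."""
    m = len(pattern)
    pieces = k + 1
    if m < pieces:
        raise ValueError("pattern too short to be cut into k+1 non-empty pieces")
    # Piece boundaries: piece sizes differ by at most one.
    bounds = [i * m // pieces for i in range(pieces + 1)]
    candidates = set()
    for lo, hi in zip(bounds, bounds[1:]):
        piece = pattern[lo:hi]
        pos = text.find(piece)
        while pos != -1:
            start = pos - lo            # where the whole pattern would begin
            if 0 <= start <= len(text) - m:
                candidates.add(start)
            pos = text.find(piece, pos + 1)
    return sorted(candidates)


pattern = "GCTACGACTTCGAGCTGC"
text = "CTCAGCTATGACCTCGAGCGGCCTATCTA"
print(pex_candidates(pattern, text, k=3))
```

For m=18 and k=3 the pieces have length 4 or 5, which matches the contiguous seed #### shown on the slide.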

Page 13: Multi-seed lossless filtration

Superposition of two filters

Pevzner & Waterman 1995

Idea: combine PEX with another filter based on a regularly-spaced seed

PEX: ####

Spaced PEX (matches occurring at every (k+1)-th position): #---#---#---#

Page 14: Multi-seed lossless filtration

Spaced seeds

Spaced seeds (spaced Q-grams)
– proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-problems

Principle
– Searching for spaced rather than contiguous patterns
– Selectivity
  • defined by the weight of the seed (number of #’s)

###-##
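To make the principle concrete, here is a small sketch of how a spaced seed is matched between two aligned sequences: the seed hits at a position if the sequences agree on every '#' position, while '-' positions (jokers) are ignored. The helper names are hypothetical; the two sequences are the pattern and the matching text window from the Hamming distance example earlier in the talk.

```python
def seed_weight(seed):
    """Weight of a seed: its number of '#' characters."""
    return seed.count("#")


def seed_hits(seed, a, b):
    """Positions p where a and b agree on every '#' position of the seed placed at p."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    n = min(len(a), len(b))
    return [
        p for p in range(n - len(seed) + 1)
        if all(a[p + o] == b[p + o] for o in offsets)
    ]


a = "GCTACGACTTCGAGCTGC"
b = "GCTATGACCTCGAGCGGC"   # same length, 3 mismatches (positions 4, 8, 15)
print(seed_weight("###-##"), seed_hits("###-##", a, b))
print(seed_weight("####"), seed_hits("####", a, b))
```

On this pair both the spaced seed ###-## and the contiguous seed #### hit four times, although the spaced seed has the larger weight and is therefore more selective.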

Page 15: Multi-seed lossless filtration

Example: (18,3)-problem

###-##

###-##

###-##

###-## ###-## ###-##

Page 16: Multi-seed lossless filtration

Spaced seeds for sequence comparison

Ma, Tromp, Li 2002 (PatternHunter)

Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004, ...

Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004, ...

Page 17: Multi-seed lossless filtration

Families of spaced seeds

This work: lossless filtration using spaced seed families (extension of Burkhardt & Kärkkäinen 2001)
– single filter based on several distinct seeds
– each seed detects a part of (m,k)-instances, but together they must detect all (m,k)-instances

Independent work (lossy seed families for sequence alignment):
– Li, Ma, Kisman, Tromp 2004 (PatternHunter II)
– Xu, Brown, Li, Ma, this conference
– Sun, Buhler, RECOMB 2004 (Mandala)

Page 18: Multi-seed lossless filtration

Example: (18,3)-problem (cont)

Family F solves the (18,3)-problem

F = { ##-#-#### , ###---#--##-# }

– every (18,3)-instance contains an occurrence of a seed of F

– all seeds of the family have the same weight 7
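For small problems, the claim that a family solves an (m,k)-problem can be checked by brute force: enumerate every placement of k mismatches in a window of length m and verify that at least one seed fits without touching a mismatch. The sketch below is such a check, not the dynamic programming algorithm of the paper; the function names are illustrative, and the family F is read from the slide as the two weight-7 seeds ##-#-#### and ###---#--##-#. Checking exactly k mismatches is enough, since removing a mismatch can only add seed occurrences.

```python
from itertools import combinations


def seed_fits(seed, m, mismatches):
    """True if the seed can be placed in a length-m instance so that
    none of its '#' positions falls on a mismatch."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + o not in mismatches for o in offsets)
        for start in range(m - len(seed) + 1)
    )


def solves(family, m, k):
    """Brute-force check that the family detects every instance with k mismatches."""
    for positions in combinations(range(m), k):
        mismatches = set(positions)
        if not any(seed_fits(seed, m, mismatches) for seed in family):
            return False  # this instance escapes every seed of the family
    return True


print(solves(["####"], 18, 3))
print(solves(["###-##"], 18, 3))
print(solves(["#####"], 18, 3))
F = ["##-#-####", "###---#--##-#"]
print(solves(F, 18, 3))
```

The contiguous seed of weight 5 fails (mismatches at positions 4, 9 and 14 leave no run of five matches), whereas the spaced seed ###-## of the same weight succeeds. The last call checks the slide's claim for F. The enumeration grows as C(m,k), so this approach is only practical for small m and k.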

Page 19: Multi-seed lossless filtration

Example: (18,3)-problem (cont)

w=7: ##-#-#### , ###---#--##-#

w=9: ##-##-##### , ###-####--## , ###-##---#-### , ##----####-### , ###---#-#-##-## , ###-#-#-#-----###

Page 20: Multi-seed lossless filtration

Comparative selectivity

w=4  ~39 · 10^-4    ####
w=5  ~9.8 · 10^-4   ###-##
w=7  ~1.2 · 10^-4   ##-#-#### , ###---#--##-#
w=9  ~0.23 · 10^-4  ##-##-##### , ###-####--## , ###-##---#-### , ##----####-### , ###---#-#-##-## , ###-#-#-#-----###

Selectivity of families on Bernoulli similarities (p(match) = 1/4), estimated as the probability for one of the seeds to occur at a given position
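The four selectivity values are consistent with a simple estimate: each seed of weight w occurs at a given position with probability (1/4)^w, and the probabilities of the seeds of a family are added. The sketch below reproduces the numbers under that assumption; the summation (a union bound) is our reading of the slide, not a formula stated on it, and the seed lists are the families as split by weight from the fused strings on the slides.

```python
def selectivity(family, p_match=0.25):
    """Estimated probability that some seed of the family occurs at a given
    position of a random (Bernoulli) alignment: sum of p_match ** weight."""
    return sum(p_match ** seed.count("#") for seed in family)


families = {
    4: ["####"],
    5: ["###-##"],
    7: ["##-#-####", "###---#--##-#"],
    9: ["##-##-#####", "###-####--##", "###-##---#-###",
        "##----####-###", "###---#-#-##-##", "###-#-#-#-----###"],
}
for w, family in families.items():
    print(f"w={w}: {selectivity(family) * 1e4:.2f} e-4")
```

Under this estimate a family of six seeds of weight 9 is still several times more selective than two seeds of weight 7.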

Page 21: Multi-seed lossless filtration

How far should we go

A trivial extreme solution ...
– would be to pick all seeds of weight m - k (~ C(m,k) seeds)
– prohibitive cost except for very small problems

We are interested in intermediate solutions:
– relatively small number of seeds (< 10) to keep the hash table of a reasonable size,
– the seed weight sufficiently large to obtain a good selectivity

Page 22: Multi-seed lossless filtration

Results

Computing properties of seed families

Seed design
– Seed expansion/contraction
– Periodic seeds
– Seed optimality
– Heuristic seed design

Experiments
– Examples of designed seed families
– Application to computing specific oligonucleotides

Conclusions

Page 23: Multi-seed lossless filtration

Measuring the efficiency of a family

Optimal threshold (Burkhardt & Kärkkäinen): minimal number of seed occurrences over all (m,k)-instances

A seed family F is lossless iff the optimal threshold T_F(m,k) ≥ 1

T_F(m,k) can be computed by a dynamic programming algorithm in time O(m·k·2^(S+1)) and space O(k·2^(S+1)), where S is the maximal length of a seed from F

– optimizations are possible (see the paper)
– the resulting space complexity is the same as in the Burkhardt & Kärkkäinen algorithm
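The following is a minimal sketch of a dynamic program of this kind, written from the description above rather than taken from the paper. It scans the instance position by position and keeps, for each pair (number of mismatches used, last S match/mismatch bits), the minimal number of seed occurrences seen so far, so the number of states is bounded by (k+1)·2^S. The function name and the bit encoding are assumptions of this sketch, and none of the optimizations mentioned on the slide are included.

```python
def optimal_threshold(family, m, k):
    """Minimal number of seed occurrences over all instances of length m
    with at most k mismatches (plain DP, no optimizations)."""
    S = max(len(seed) for seed in family)
    window_mask = (1 << S) - 1
    # Bit 0 of a window is the most recent position, so the last character
    # of a seed is aligned with bit 0.
    seeds = [
        (len(seed),
         sum(1 << (len(seed) - 1 - i) for i, c in enumerate(seed) if c == "#"))
        for seed in family
    ]
    # state: (mismatches used, last S match bits) -> minimal hit count so far
    states = {(0, 0): 0}
    for pos in range(m):
        nxt = {}
        for (used, window), hits in states.items():
            for bit in (1, 0):              # 1 = match, 0 = mismatch
                new_used = used + (1 - bit)
                if new_used > k:
                    continue
                new_window = ((window << 1) | bit) & window_mask
                new_hits = hits + sum(
                    1 for span, mask in seeds
                    if pos + 1 >= span and (new_window & mask) == mask
                )
                key = (new_used, new_window)
                if key not in nxt or new_hits < nxt[key]:
                    nxt[key] = new_hits
        states = nxt
    return min(states.values())


print(optimal_threshold(["####"], 18, 3))
print(optimal_threshold(["###-##"], 18, 3))
print(optimal_threshold(["#####"], 18, 3))
```

A family is lossless exactly when the returned value is at least 1. For the contiguous seed #### on the (18,3)-problem the threshold is 3, while the contiguous seed of weight 5 has threshold 0 and is therefore lossy.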

Page 24: Multi-seed lossless filtration

Measuring the efficiency of a family (cont)

Using a similar DP technique we can compute, within the same time complexity bound:

the number UF(m,k) of undetected (m,k)-similarities for a (lossy) family F

the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed

[see the paper for details]
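Both quantities can be illustrated by brute force on tiny problems. The sketch below enumerates all placements of exactly k mismatches (an assumption of this example; the paper's DP is not reproduced here) and counts the instances missed by the whole family as well as those detected by exactly one seed. All names are illustrative.

```python
from itertools import combinations


def seed_detects(seed, m, mismatches):
    """True if the seed occurs somewhere in the instance without hitting a mismatch."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + o not in mismatches for o in offsets)
        for start in range(m - len(seed) + 1)
    )


def undetected_and_contributions(family, m, k):
    """Return (number of undetected instances, {seed: instances detected only by it})."""
    undetected = 0
    exclusive = {seed: 0 for seed in family}
    for positions in combinations(range(m), k):
        mismatches = set(positions)
        detecting = [s for s in family if seed_detects(s, m, mismatches)]
        if not detecting:
            undetected += 1
        elif len(detecting) == 1:
            exclusive[detecting[0]] += 1
    return undetected, exclusive


print(undetected_and_contributions(["##"], 3, 1))
print(undetected_and_contributions(["##", "#-#"], 3, 1))
```

In the toy (3,1)-problem the seed ## alone misses the instance with the mismatch in the middle; adding #-# makes the family lossless, and #-# contributes exactly that one instance.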

Page 25: Multi-seed lossless filtration

Design of seed families

Pruning exhaustive search tree (Burkhardt & Kärkkäinen)

– Construct all solutions of weight w from solutions of weight w – 1

– Example: if ##--#--# and ##-#---# are solutions of weight w-1, consider their «union» ##-##--# of weight w.

– Prohibitive cost:
  • more than a week for computing all single-seed solutions of the (50,5)-problem
  • the search space blows up for multi-seed families
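The «union» step of the example can be written in a few lines. This sketch assumes both seeds have the same span and are aligned at their first position, as in the example on the slide; `seed_union` is a hypothetical helper name.

```python
def seed_union(a, b):
    """Position-wise union of two seeds of equal span: '#' wherever either has '#'."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles seeds of equal span")
    return "".join("#" if "#" in (x, y) else "-" for x, y in zip(a, b))


candidate = seed_union("##--#--#", "##-#---#")
print(candidate, candidate.count("#"))
```

Each candidate obtained this way still has to be tested against the (m,k)-problem, which is what makes the exhaustive search expensive.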

Page 26: Multi-seed lossless filtration

Seed expansion/contraction

Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:

###-#--###-#--###-#

#-#-#---#-----#-#-#---#-----#-#-#---#

Page 27: Multi-seed lossless filtration

Seed expansion/contraction

Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:

###-#--###-#--###-#

#-#-#---#-----#-#-#---#-----#-#-#---#

###-#--###-#--###-# is the only solution of weight 12 of the (25,2)-problem

Page 28: Multi-seed lossless filtration

Seed expansion/contraction

Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:

###-#--###-#--###-#

#-#-#---#-----#-#-#---#-----#-#-#---#

– Let F_i be the i-regular expansion of F, obtained by inserting i-1 jokers between successive positions of each seed of F

– Example: if F = { ###-# , ##-## } then

F_2 = { #-#-#---# , #-#---#-# }

F_3 = { #--#--#-----# , #--#-----#--# }

###-#--###-#--###-# is the only solution of weight 12 of the (25,2)-problem
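The i-regular expansion and its inverse are easy to express on seed strings. The sketch below uses hypothetical helper names `expand` and `contract`; it reproduces the example family from the slide and shows that the second weight-12 seed is the 2-regular expansion of the first.

```python
def expand(seed, i):
    """i-regular expansion: insert i-1 jokers between successive positions."""
    return ("-" * (i - 1)).join(seed)


def contract(seed, i):
    """Inverse of expand: keep every i-th position, all others must be jokers."""
    if any(c != "-" for pos, c in enumerate(seed) if pos % i):
        raise ValueError("seed is not an i-regular expansion")
    return seed[::i]


F = ["###-#", "##-##"]
print([expand(s, 2) for s in F])
print([expand(s, 3) for s in F])
print(expand("###-#--###-#--###-#", 2))
```

Expansion keeps the weight of a seed unchanged and multiplies its span by roughly i.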

Page 29: Multi-seed lossless filtration

Seed expansion/contraction (cont)

Lemma:

– If a family F solves an (m,k)-problem, then both F and F_i solve the (i·m, (i+1)·k-1)-problem

– If a family F_i solves the (i·m,k)-problem, then its i-contraction F solves the (m, ⌊k/i⌋)-problem

(18,3): ##-#-#### , ###---#--##-#

(36,7): ##-#-#### , ###---#--##-# and #-#---#---#-#-#-# , #-#-#-------#-----#-#---#

Page 30: Multi-seed lossless filtration

Periodic seeds

Iterating short seeds with good properties into longer seeds

###-#--

###-#--###-#--###-#

Page 31: Multi-seed lossless filtration

Cyclic problem

Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed Q_i = [Q, -^(m-s(Q))]^i solves the linear (m·(i+1)+s(Q)-1, k)-problem.

Cyclic (11,3)-problem

Linear (30,3)-problem

###-#--#---

###-#--#---###-#--#
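The construction in the lemma pads Q with jokers up to the cyclic length m and repeats the padded block. The sketch below reproduces the seed strings shown on the slides; it assumes that trailing jokers are dropped from the result and simply takes the number of repeated blocks as a parameter (`copies`), since how the exponent i of the lemma maps to that number is not spelled out in this transcript.

```python
def iterate_seed(q, m, copies):
    """Repeat the block [q followed by m - len(q) jokers] and drop trailing jokers."""
    if len(q) > m:
        raise ValueError("the seed span must not exceed the cyclic length m")
    block = q + "-" * (m - len(q))
    return (block * copies).rstrip("-")


print(iterate_seed("###-#--#", 11, 1))
print(iterate_seed("###-#--#", 11, 2))
print(iterate_seed("###-#", 7, 3))
```

The second call gives the seed shown for the cyclic (11,3)-problem, and the third call gives the periodic weight-12 seed ###-#--###-#--###-# from the previous slides.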

Page 32: Multi-seed lossless filtration

Extension to multi-seed case

Cyclic (11,3)-problem

Linear (25,3)-problem

###-#--#---

###-#--#---###-#--# #--#---###-#--#---###

Page 33: Multi-seed lossless filtration

Extension to multi-seed case

Cyclic (11,3)-problem

Linear (25,3)-problem

###-#--#---

###-#--#---###-#--# #--#---###-#--#---###

Page 34: Multi-seed lossless filtration

Asymptotic optimality

Theorem: Fix a number of errors k. Let w(m) be the maximal weight of a seed solving the linear (m,k)-problem. Then (formula not preserved in this transcript)

– the fraction of the number of jokers tends to 0, but the convergence speed depends on k

– seed expansion cannot provide an asymptotically optimal solution

Page 35: Multi-seed lossless filtration

Non-asymptotic optimality

Fix a number of errors k. For each seed (seed family) Q there exists m_Q s.t. for all m ≥ m_Q, Q solves the (m,k)-problem

For a class of seeds C, Q is an optimal seed in C iff Q realizes the minimal m_Q over all seeds of C

Lemma: Let n be an integer and r = n/3. For every k ≥ 2, the seed #^(n-r)-#^r is optimal among seeds of weight n with one joker.
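For small seeds, m_Q can be found by brute force: increase m until every placement of k mismatches leaves room for the seed. The sketch below does this and also builds the one-joker seed of the lemma. Two assumptions are made here that the slide does not state: r is taken as n/3 rounded to the nearest integer, and the search for m_Q stops at an arbitrary bound `m_max`. All function names are illustrative.

```python
from itertools import combinations


def solves(seed, m, k):
    """Brute-force check that the seed detects every instance with k mismatches."""
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    starts = range(m - len(seed) + 1)
    for positions in combinations(range(m), k):
        bad = set(positions)
        if not any(all(s + o not in bad for o in offsets) for s in starts):
            return False
    return True


def minimal_m(seed, k, m_max=30):
    """Smallest m (up to m_max) for which the seed solves the (m,k)-problem."""
    for m in range(max(len(seed), k), m_max + 1):
        if solves(seed, m, k):
            return m
    return None


def one_joker_seed(n):
    """Seed of weight n with one joker: n-r matches, a joker, then r matches."""
    r = round(n / 3)           # assumption: n/3 rounded to the nearest integer
    return "#" * (n - r) + "-" + "#" * r


print(one_joker_seed(5))
print(minimal_m("####", 3))
print(minimal_m(one_joker_seed(5), 3))
```

For n=5 the construction gives ###-##, the seed used for the (18,3)-problem earlier in the talk. The contiguous seed #### first becomes lossless for k=3 at m=16.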

Page 36: Multi-seed lossless filtration

Heuristic seed design: genetic algorithm

a population of seed families evolves by mutation and crossover

seed families are screened against sets of difficult (m,k)-instances

for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than currently known families

compute the contribution of each seed of the family. Mutate the less “valuable” seeds.

Figure labels: difficult (m,k)-instances; seed families; select and reorder; select

Page 37: Multi-seed lossless filtration

Example: (25,2)-problem

Page 38: Multi-seed lossless filtration

Example: (25,3)-problem

Page 39: Multi-seed lossless filtration

Application of lossless filtering: oligo design

Specific oligonucleotides: small DNA molecules (10-50bp) that hybridize with a given target sequence and do not hybridize with the other background sequences (e.g. the rest of the genome)

Formalization: given a sequence, find all windows of length m which do not occur elsewhere within k substitution errors
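The formalization can be written down as a naive quadratic baseline, which is what a lossless filter is meant to speed up. This sketch compares every window with every other window of the same sequence; treating overlapping windows as "elsewhere" is an assumption of the example, and `specific_windows` is a hypothetical name.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def specific_windows(seq, m, k):
    """Start offsets of length-m windows that have no other window within k substitutions."""
    n = len(seq) - m + 1
    windows = [seq[i:i + m] for i in range(n)]
    return [
        i for i in range(n)
        if all(j == i or hamming(windows[i], windows[j]) > k for j in range(n))
    ]


print(specific_windows("ACGTACGT", 4, 0))
print(specific_windows("AAAATTTT", 4, 1))
```

The cost is quadratic in the sequence length, which is prohibitive at the scale of a whole database; a lossless seed family restricts the comparisons to window pairs that share a seed occurrence without missing any pair within k errors.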

Page 40: Multi-seed lossless filtration

Seed design: (32,5)-problem

Page 41: Multi-seed lossless filtration

Experiment

This filter has been applied to the rice EST database (100015 sequences of total size ~42 Mbp)

All 32-windows occurring elsewhere within 5 errors have been computed

The computation took slightly more than 1 hour on a P4 3GHz computer

87% of the database has been “filtered out”

Page 42: Multi-seed lossless filtration

Further questions

Combinatorial structure of optimal seed families

Efficient design algorithm

Page 43: Multi-seed lossless filtration

Questions
