Finding approximate occurrences of a pattern that contains gaps
Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park
Contents
The exact/approximate gapped pattern matching problem
Previous approaches
Our contributions
Exact gapped pattern matching problem
Definition: find the occurrences in the text of a pattern that contains gaps.
Pattern P = AA *(2,3) GC *(1,3) TT
Subpatterns: P1 = AA, P2 = GC, P3 = TT
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3
Example – Exact matching
Pattern P = AA *(2,3) GC *(1,3) TT
Text T = GCAATTGCACTTC
[Figure: the pattern aligned under the text. AA matches, *(2,3) covers TT, GC matches, *(1,3) covers AC, and TT matches.]
Approximate gapped pattern matching problem
Definition: find all substrings of the text that match the pattern when each subpattern Pi may contain up to ki insertions, deletions, and substitutions.
Pattern P = AA *(2,3) GC *(1,3) TT
P1 = AA with k1 = 0, P2 = GC with k2 = 1, P3 = TT with k3 = 0
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3
Example – Approximate matching
Pattern P = AA *(2,3) GC *(1,3) TT, k1 = k3 = 0, k2 = 1
Text T = GCAATTGTACTTC
[Figure: the pattern aligned under the text. P2 = GC matches GT with 1 substitution.]
Class of characters
Allow two or more different characters at a position of the pattern.
Pattern P = AA *(2,3) G[CT] *(1,3) TT
P1 = AA with k1 = 0, P2 = G[CT] with k2 = 1, P3 = TT with k3 = 0
[CT]: C or T
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3
Example – Class of characters
Pattern P = AA *(2,3) G[CT] *(1,3) TT
Text T = GCAATTGTACTTC
[Figure: the pattern aligned under the text. G[CT] matches GT, because T belongs to the class [CT].]
Applications of gapped pattern matching
Information retrieval
Data mining
Computational biology, especially finding motifs in a sequence
Motifs
Motifs: biologically important common regions
[Figure: a motif shared by Sequence 1, Sequence 2, Sequence 3, and Sequence 4]
Sometimes overall sequence alignment doesn’t show the relation between biologically related sequences.
PROSITE database
Database of protein families, domains and motifs
http://www.expasy.ch/prosite
Motifs are represented as gapped patterns over the alphabet of 20 amino acids.
Prion protein (Creutzfeldt-Jakob disease):
E*(1,1)[ED]*(1,1)K[LIVM][LIVM]*(1,1)[KR][LIVM][LIVM]*(1,1)[QE]MC*(2,2)QY
Ribosomal protein L1:
[IM]*(2,2)[LIVA]*(2,3)[LIVM][GA]*(2,2)[LMS][GSNH][PTKR][KRAV]G*(1,1)[LIMF]P[DENSTKQ]
Finding hidden motifs
Given a set of sequences, how to find unknown motifs?
Finding motifs in a sequence (our topic)
Given a known motif, find it in a new sequence.
As biological sequences may contain errors, we should consider approximate occurrences.
Previous approaches
Regular expression approaches: exact matching
Navarro and Raffinot’s approach [RECOMB 2002]: exact and approximate matching
Akutsu’s approach [IEICE Trans. Information and Systems 1996]: approximate matching
Regular expression approach
Pattern P = AA *(2,3) GC *(1,3) TT
Regular expression: AA**(*|ε)GC*(*|ε)(*|ε)TT, where * matches any single character and ε is the empty string
[Figure: the NFA for this regular expression]
Run a nondeterministic finite automaton (NFA) or its equivalent deterministic finite automaton (DFA).
Too general!
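As an illustration of the regular expression route, the sketch below rewrites a gapped pattern into the syntax of Python's `re` module, where a gap *(a,b) becomes `.{a,b}`. The function name `gapped_to_regex` and the whitespace-separated token format are assumptions made for this example, not part of the slides.

```python
import re

def gapped_to_regex(pattern):
    """Translate 'AA *(2,3) GC *(1,3) TT' into a Python regular expression."""
    parts = []
    for token in pattern.split():
        gap = re.fullmatch(r"\*\((\d+),(\d+)\)", token)
        if gap:
            lo, hi = gap.groups()
            parts.append(".{%s,%s}" % (lo, hi))  # bounded wildcard gap
        else:
            # Letters and classes such as G[CT] are already valid regex syntax.
            parts.append(token)
    return "".join(parts)

regex = gapped_to_regex("AA *(2,3) GC *(1,3) TT")
match = re.search(regex, "GCAATTGCACTTC")
print(regex)         # AA.{2,3}GC.{1,3}TT
print(match.span())  # (2, 12)
```

A general regex engine handles this pattern, but it does not exploit the fixed subpattern-gap-subpattern shape, which is the sense in which the approach is too general.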
Navarro and Raffinot’s approach
An NFA is not easy to run and a DFA can be large.
Simulate the NFA with the bit-parallelism technique (all bits of a machine word can be read and written simultaneously).
[Figure: the NFA states stored as the bit vector 0 1 0 1 0 1 0 0 1 0 0 0 0]
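To show what bit-parallel NFA simulation means, here is a minimal sketch of the classical Shift-And algorithm for one ungapped subpattern. It is not Navarro and Raffinot's gapped automaton, only the basic technique it builds on; the function name `shift_and` is chosen for this example.

```python
def shift_and(pattern, text):
    """Return start indices of exact occurrences of pattern in text (Shift-And)."""
    m = len(pattern)
    # masks[c] has bit i set when pattern[i] == c
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state = 0              # bit i set: pattern[:i+1] matches a suffix of the text read so far
    accept = 1 << (m - 1)  # bit of the final NFA state
    hits = []
    for j, c in enumerate(text):
        # One shift and one AND advance every NFA state at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits

print(shift_and("GC", "GCAATTGCACTTC"))  # [0, 6]
```

Python integers are unbounded, so the word size w does not appear here; in a language with fixed-width words the pattern must fit into one word or be split across several.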
Navarro and Raffinot’s approach
Allow k errors over the whole pattern.
[Figure: the NFA extended with one row of states per error count (0 errors, 1 error)]
Works for small patterns and a small number of errors.
O(km'n / w) time algorithm (m' is the total length of the pattern, n is the length of the text, w is the word size)
Akutsu’s approach
Combination of dynamic programming and a balanced search tree.
O(mn log n) time
[Figure: dynamic programming table of the text against P1, *(a1, b1), P2, *(a2, b2), P3; the tree is used to compute the smallest values in the table.]
Drawbacks of the previous approaches
[Figure: alignments marked with matches (O) and errors (X). With k = 3 for the whole pattern, all three errors may fall in one subpattern. With k1 = 1, k2 = 1, k3 = 1, each subpattern has at most one error, which is more sensitive and desirable.]
Our contributions
O(ln + m) time algorithm for the exact gapped pattern matching problem.
l: number of subpatterns
n: length of the text
m: length of the pattern
O(mn) time algorithm for the approximate gapped pattern matching problem.
Graph modeling
1. Create a node where a subpattern appears (exactly or approximately) in the text.
2. Link two nodes with an edge if they represent two consecutive subpatterns and satisfy the gap condition.
3. If there is a path P1 - P2 - … - Pl in the graph, there is an occurrence of the pattern in the text.
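The three steps can be sketched for the exact case as follows. The names `find_all`, `build_graph`, and `paths`, and the representation of a node as a (subpattern index, start position) pair, are assumptions made for this illustration.

```python
def find_all(sub, text):
    """Start indices of every exact occurrence of sub in text."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]

def build_graph(subs, gaps, text):
    """Step 1: one node per occurrence. Step 2: edges between consecutive
    subpatterns whose distance satisfies the gap condition."""
    nodes = [[(i, s) for s in find_all(p, text)] for i, p in enumerate(subs)]
    edges = {}
    for i, (lo, hi) in enumerate(gaps):
        for u in nodes[i]:
            end = u[1] + len(subs[i])  # first position after this occurrence
            edges[u] = [v for v in nodes[i + 1] if lo <= v[1] - end <= hi]
    return nodes, edges

def paths(subs, gaps, text):
    """Step 3: depth-first search for paths from a P1 node to a last-subpattern node."""
    nodes, edges = build_graph(subs, gaps, text)
    found = []
    def dfs(node, path):
        path = path + [node[1]]
        if node[0] == len(subs) - 1:
            found.append(path)
            return
        for nxt in edges.get(node, []):
            dfs(nxt, path)
    for start in nodes[0]:
        dfs(start, [])
    return found

print(paths(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))  # [[2, 6, 10]]
```

Each path lists the start positions of P1, P2, and P3. This sketch enumerates every path, which is simple but can be slow when many occurrences chain together.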
Exact matching
P = AA *(2,3) GC *(1,3) TT
P1 = AA, P2 = GC, P3 = TT
Text T = GCAATTGCACTTC
Step 1. Create nodes: one node for P1 (AA), two for P2 (GC), two for P3 (TT).
Step 2. Connect the nodes with edges.
Step 3. Find the path by depth-first search.
A better idea
P = AA *(2,3) GC *(1,3) TT
Text T = GCAATTGCACTTC
No need to build the graph explicitly.
Step 1. Find P1 = AA and compute the candidate range for P2.
Step 2. Find P2 = GC within the candidate range and compute the candidate range for P3.
Step 3. After finding P3 = TT within the candidate range, we have found an occurrence of P.
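A minimal sketch of the candidate-range idea, assuming the range is kept as a boolean array over text positions (the slides do not say how it is stored). The function name `gapped_match_ends` is hypothetical, and the naive `startswith` scan stands in for a faster exact matcher, so this sketch does not reach the O(ln + m) bound.

```python
def gapped_match_ends(subs, gaps, text):
    """Return the end indices (exclusive) of occurrences of the whole gapped pattern."""
    n = len(text)
    allowed = [True] * (n + 1)  # P1 may start anywhere
    ends = []
    for i, p in enumerate(subs):
        # Occurrences of the current subpattern that start inside the candidate range.
        ends = [s + len(p) for s in range(n - len(p) + 1)
                if allowed[s] and text.startswith(p, s)]
        if i == len(gaps):      # last subpattern: nothing follows it
            break
        lo, hi = gaps[i]
        # Mark the range [e + lo, e + hi] for every end e, using a difference array
        # so that overlapping ranges cost O(n) in total.
        diff = [0] * (n + 2)
        for e in ends:
            if e + lo <= n:
                diff[e + lo] += 1
                diff[min(e + hi, n) + 1] -= 1
        allowed, running = [], 0
        for s in range(n + 1):
            running += diff[s]
            allowed.append(running > 0)
    return ends

print(gapped_match_ends(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))  # [12]
```

Only the candidate range is carried from one subpattern to the next, so no nodes or edges are stored.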
Approximate matching
Almost the same idea as the exact matching case: find the approximate occurrences of the subpatterns instead of the exact ones.
P1 = AA, k1 = 0, followed by *(2,3)
      G C A A T T G C A C T T C
    0 0 0 0 0 0 0 0 0 0 0 0 0 0
A   1 1 1 0 0 1 1 1 1 0 1 1 1 1
A   2 2 2 1 0 1 2 2 2 1 1 2 2 2
The 0 in the last row ends an occurrence of P1; the gap *(2,3) after it gives the candidate range for P2 (k2 = 1).
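Approximate matching
P2 = GC, k2 = 1, followed by *(1,3)
      G C A C T T C
    0 0 ∞ ∞ ∞
G   1 0 1 2 3
C   2 1 0 1 2
∞ (infinity): no alignment can start from here, because the position is outside the candidate range.
Entries of at most k2 = 1 in the last row end occurrences of P2 and give the candidate range for P3 (k3 = 0).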
Approximate matching
P3 = TT, k3 = 0
      T T C
    0 0 0 ∞
T   1 0 0 1
T   2 1 0 1
The 0 in the last row ends an approximate occurrence of the whole pattern.
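The tables above can be reproduced with a short sketch: one edit-distance table per subpattern, whose first row is 0 inside the candidate range and infinity outside it. The names `approx_ends` and `approx_gapped` and the boolean array `start_ok` are assumptions of this example; the slides do not give an implementation.

```python
INF = float("inf")

def approx_ends(sub, k, text, start_ok):
    """End indices j (exclusive) such that sub matches text[s:j] with at most
    k edits for some start s inside the candidate range (start_ok[s] is True)."""
    n = len(text)
    prev = [0 if start_ok[j] else INF for j in range(n + 1)]  # row 0 of the table
    for r in range(1, len(sub) + 1):
        cur = [prev[0] + 1] + [INF] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j - 1] + (sub[r - 1] != text[j - 1]),  # match or substitution
                         prev[j] + 1,      # pattern character left unmatched
                         cur[j - 1] + 1)   # extra text character
        prev = cur
    return [j for j in range(n + 1) if prev[j] <= k]

def approx_gapped(subs, ks, gaps, text):
    """End indices of approximate occurrences of the whole gapped pattern."""
    n = len(text)
    start_ok = [True] * (n + 1)  # P1 may start anywhere
    ends = []
    for i, (p, k) in enumerate(zip(subs, ks)):
        ends = approx_ends(p, k, text, start_ok)
        if i == len(gaps):       # last subpattern
            break
        lo, hi = gaps[i]
        start_ok = [False] * (n + 1)
        for e in ends:           # candidate range for the next subpattern
            for s in range(e + lo, min(e + hi, n) + 1):
                start_ok[s] = True
    return ends

print(approx_gapped(["AA", "GC", "TT"], [0, 1, 0], [(2, 3), (1, 3)], "GCAATTGTACTTC"))  # [12]
```

With k2 = 1 the subpattern GC is allowed to match GT by one substitution; setting every ki to 0 on the same text yields no occurrence.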
Handling class of characters
Represent characters as bit masks.
Bit order: A G T C, so G = 0100, T = 0010, and [GC] = 0101.
Text G, pattern [GC]: 0100 & 0101 = 0100 (nonzero: match)
Text T, pattern [GC]: 0010 & 0101 = 0000 (zero: mismatch)
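The bit-mask comparison can be written in a few lines. This is a sketch using the bit order A G T C from the slide; the names `BIT`, `class_mask`, and `matches` are chosen for this example.

```python
# One bit per character, in the order A G T C used on the slide.
BIT = {"A": 0b1000, "G": 0b0100, "T": 0b0010, "C": 0b0001}

def class_mask(chars):
    """Bit mask of a character class, e.g. 'GC' for [GC]."""
    mask = 0
    for c in chars:
        mask |= BIT[c]
    return mask

def matches(text_char, chars):
    """A text character matches the class when the AND of the two masks is nonzero."""
    return (BIT[text_char] & class_mask(chars)) != 0

print(format(class_mask("GC"), "04b"))             # 0101
print(format(BIT["G"] & class_mask("GC"), "04b"))  # 0100 (nonzero: match)
print(format(BIT["T"] & class_mask("GC"), "04b"))  # 0000 (zero: mismatch)
```

A single character is a class of size one, so the same AND test replaces the equality test in the matching routines.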
Time Complexity
O (mn) (m is the length of the pattern, n is the length of the text), but faster in practice
[Figure: the text with the tables for P1, P2, and P3]
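One way to read the O(mn) bound, assuming that m is the sum of the subpattern lengths and that each subpattern Pi fills one dynamic programming table with m_i rows and n columns (the symbol m_i is introduced here and does not appear in the slides):

```latex
\sum_{i=1}^{l} O(m_i \, n) \;=\; O\!\Bigl(n \sum_{i=1}^{l} m_i\Bigr) \;=\; O(mn)
```

In practice only the columns inside each candidate range need to be filled, which is consistent with the remark that the algorithm is faster in practice.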
Conclusion
O(ln + m) time algorithm for the exact gapped pattern matching problem.
O(mn) time algorithm for the approximate gapped pattern matching problem.
Open problem: time complexity in the average case?