A New Approach for Motif Discovery Based on the de Bruijn Graph
Hong Zhou 1,2, Zheng Zhao 1, Hongpo Wang 2
1. School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China
2. Department of Computer Science and Information Engineering, Tianjin Agricultural University, Tianjin, 300384, China
[email protected]
Abstract
This paper presents a new approach for discovering conserved regions, such as motifs, in either DNA or protein sequences. We follow a graph-based approach built on the idea of de Bruijn graphs. The de Bruijn graph has been successfully adopted to solve problems such as local alignment and DNA fragment assembly. Our method harnesses the de Bruijn graph to discover the conserved regions in DNA or protein sequences. We found that the algorithm mined signals from a larger number of sequences, and at a faster rate, than some popular motif searching tools.
1. Introduction
A motif is a repeating pattern in a biological sequence that is conserved during evolution. Motif discovery is an important problem in biology, with applications in DNA and protein sequence analysis and in understanding disease susceptibility and cure. Numerous motif discovery algorithms have been proposed to date.
Graph theory plays an important role in computational biology [1]. Graph-based algorithms provide simpler and quicker solutions to computationally intensive problems such as DNA fragment assembly [2] and motif discovery. However, the literature on motif discovery using graph algorithms is small relative to the potential of graph-based algorithms.
This paper focuses on developing a time-efficient algorithm for motif discovery using a de Bruijn graph. We found that the algorithm mined signals from a larger number of sequences, and at a faster rate, than popular motif searching tools such as MEME [3]. The next few sections describe the importance of the motif discovery problem and the advances in algorithmic approaches to motif discovery.
2. Method
Our method starts by constructing the de Bruijn graph. Because the constructed graph may contain cycles, the second step removes them and transforms the graph into a directed acyclic graph. The third step computes a path on the directed graph and then extracts the conserved regions. This is the main idea of our method, which we describe in detail below.
The algorithm can be formally described as follows.
Input: S = {s_1, s_2, ..., s_n}, each s_i has length l_i
Output: S^r = {s^r_1, s^r_2, ..., s^r_n}, each s^r_i has length m
1. use S to construct de Bruijn graph G = (V, E)
2. eliminate cycles of G
3. get a high weighted path from G, then construct consensus sequence s_c from the path
4. FOR i ← 1 TO n
5.     DO s^r_i ← conserveextract(s_c, s_i)
6. construct output result S^r = {s^r_1, s^r_2, ..., s^r_n}
2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery
978-0-7695-3735-1/09 $25.00 © 2009 IEEE
DOI 10.1109/FSKD.2009.542
2.1. Construct the de Bruijn Graph
A de Bruijn graph is a graph whose vertices are subsequences of a given sequence and whose edges indicate overlapping subsequences [4]. Consider the sequence ACCGTCT. It can be resolved into the following fragments of length 4: ACCG, CCGT, CGTC, and GTCT. Each fragment is called an l-tuple. The (l-1)-tuples are obtained by further fragmenting each l-tuple. For example, ACC and CCG are the (l-1)-tuples of ACCG. The (l-1)-tuples form the nodes of the de Bruijn graph. An edge exists between two (l-1)-tuples when they are the prefix and the suffix of the same l-tuple, so each edge represents an l-tuple. The multiplicity of each edge is represented by a weight on the edge. Initially every edge gets weight 1. When two consecutive vertices (vertices with the specified overlap) repeat, the edge multiplicity is increased, that is, the weight of the edge is incremented by 1. The conserved regions are most likely to reside on the most repeated edges, i.e. the edges with the greatest multiplicity (and hence the greatest weight).
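The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C implementation: the function name build_de_bruijn, the dictionary layout, and the second example sequence are assumptions made for the sketch, the tuple length is called k as in the procedure that follows, and offsets are 0-based rather than 1-based.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Build a weighted de Bruijn graph from the k-tuples of the sequences.

    Nodes are (k-1)-tuples. Each k-tuple contributes one edge from its
    prefix to its suffix. Returns (weights, occurrences), both keyed by
    the edge (left_node, right_node).
    """
    weights = defaultdict(int)        # edge -> multiplicity
    occurrences = defaultdict(list)   # edge -> [(sequence index, offset)]
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - k + 1):
            kmer = seq[j:j + k]
            edge = (kmer[:-1], kmer[1:])      # prefix node -> suffix node
            weights[edge] += 1                # repeated k-tuples add weight
            occurrences[edge].append((i, j))  # remember where it was seen
    return dict(weights), dict(occurrences)


# ACCGTCT is the example from the text; TTCCGTA is a made-up second sequence.
weights, occurrences = build_de_bruijn(["ACCGTCT", "TTCCGTA"], k=4)
for (left, right), w in sorted(weights.items()):
    print(left, "->", right, "weight", w)
```

Because both example sequences contain CCGT, the edge from CCG to CGT is the only one whose weight rises above 1, which is the kind of signal the method looks for.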
Procedure ConstructG
Input: S = {s_1, s_2, ..., s_n}, each s_i has length l_i
Output: G = (V, E)
1. FOR i ← 1 TO n
2.     DO FOR j ← 1 TO (l_i - k + 1)
3.         DO T = s_i(j, ..., j+k-1)
4.            T_L = s_i(j, ..., j+k-2)
5.            T_R = s_i(j+1, ..., j+k-1)
6.            v_L ← -1
7.            v_R ← -1
8.            IF (hashtable(T_L))
9.                THEN v_L = hashtable(T_L).node
10.           IF (hashtable(T_R))
11.               THEN v_R = hashtable(T_R).node
12.           IF v_L ≥ 0 AND v_R ≥ 0 AND e ∈ E in (v_L, v_R)
13.               THEN add sequence info {i, j} to e
14.           ELSE IF v_L
Figure 1 shows the cycle elimination process. In Figure 1(a), a cycle exists that includes vertices {v_1, v_2, v_3} and edges {e_1, e_2, e_3}. The vertex v_2 has the least information, so the original cycle is eliminated by making a copy v^N_2 of vertex v_2. The result is Figure 1(b). In Figure 1(b), a cycle still exists. Now vertex v_3 has the least information, so the cycle is eliminated by making a copy v^N_3 of vertex v_3. The result is Figure 1(c), and so on. The final result is Figure 1(d).
Note that every loop in the original graph has been removed, so the graph is now a directed acyclic graph.
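One way to break cycles by copying a vertex is sketched below. The text does not define how "information" is measured or how edges are attached to the copy, so this sketch makes two explicit assumptions: information is taken to be the total weight of a vertex's incident edges, and the cycle edge entering the chosen vertex is redirected to a fresh copy that has no outgoing edges. The names find_cycle and eliminate_cycles, the copy labels, and the example weights are hypothetical.

```python
def find_cycle(edges):
    """Return the vertices of one directed cycle as a list, or None."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, [])
    state = {}   # vertex -> 1 while on the DFS stack, 2 when finished
    stack = []

    def visit(u):
        state[u] = 1
        stack.append(u)
        for v in adjacency[u]:
            if state.get(v) == 1:               # back edge closes a cycle
                return stack[stack.index(v):]
            if v not in state:
                cycle = visit(v)
                if cycle:
                    return cycle
        state[u] = 2
        stack.pop()
        return None

    for start in list(adjacency):
        if start not in state:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


def eliminate_cycles(weights):
    """Break every cycle by redirecting one cycle edge to a vertex copy."""
    weights = dict(weights)
    copies = 0
    while True:
        cycle = find_cycle(weights)
        if cycle is None:
            return weights

        def info(x):
            # Assumed measure: total weight of the edges touching x.
            return sum(w for (a, b), w in weights.items() if x in (a, b))

        target = min(cycle, key=info)
        pred = cycle[cycle.index(target) - 1]   # predecessor on the cycle
        copies += 1
        copy = (target, copies)                 # hypothetical label for the copy
        weights[(pred, copy)] = weights.pop((pred, target))


# Hypothetical weights on a three-vertex cycle like the one in Figure 1(a).
dag = eliminate_cycles({("v1", "v2"): 2, ("v2", "v3"): 1, ("v3", "v1"): 3})
print(find_cycle(dag))
```

Each pass moves one edge onto a new vertex that has no outgoing edges, so that edge can never be part of a cycle again and the loop terminates. The total edge weight is preserved.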
2.3. Extract the conserved regions
After performing the above transformations, we apply a Depth First Search (DFS) algorithm to find a heaviest path in linear time. The weight of each edge is proportional to its multiplicity and length.
After obtaining the high-weight edges, the algorithm tries to locate the entire repeated region, called the active region. An active region is a high-weight region where many sequences coincide on a particular set of consecutive edges. In other words, an active region is a region that contains consecutive edges that have repeated themselves. Not all high-weight edges represent motifs, because there is a fair chance that they are merely repeats. Hence the search is for more prominent edges: we identify the edges that have more weight within an active region than the other edges.
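The heaviest-path step can be illustrated with a memoized depth-first search over a directed acyclic graph, which visits every edge once. This is a sketch under assumptions: the names heaviest_path and spell are hypothetical, edge weight is taken to be multiplicity only (the text also mentions length), nodes are assumed to be plain strings, and the example graph is made up from the sequences ACCGTCT and TTCCGTA with 4-tuples.

```python
def heaviest_path(weights):
    """Return (total_weight, path) for a maximum-weight path in a DAG."""
    adjacency = {}
    for (u, v), w in weights.items():
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, [])
    best = {}   # vertex -> (weight of best path starting there, next vertex)

    def dfs(u):
        if u not in best:
            score, successor = 0, None
            for v, w in adjacency[u]:
                candidate = w + dfs(v)
                if candidate > score:
                    score, successor = candidate, v
            best[u] = (score, successor)   # memoize so each edge is seen once
        return best[u][0]

    start = max(adjacency, key=dfs)
    path = [start]
    while best[path[-1]][1] is not None:
        path.append(best[path[-1]][1])
    return best[start][0], path


def spell(path):
    """Spell the sequence along a path of overlapping (k-1)-tuples."""
    return path[0] + "".join(node[-1] for node in path[1:])


example = {
    ("ACC", "CCG"): 1, ("CCG", "CGT"): 2, ("CGT", "GTC"): 1,
    ("GTC", "TCT"): 1, ("TTC", "TCC"): 1, ("TCC", "CCG"): 1,
    ("CGT", "GTA"): 1,
}
total, path = heaviest_path(example)
print(total, spell(path))
```

The recursion is kept for brevity. For graphs with very long paths an explicit stack or a topological ordering would avoid Python's recursion limit.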
Figure 1. Eliminating cycles
3. Results and Analysis
The algorithm was implemented in C on a 2.66 GHz Pentium 4 running the Linux operating system. Initially the algorithm was tested on DNA sequences that belong to the prokaryotic family. The advantage of these sequences is that they have a very low or negligible number of repeats. In addition, the common repeating patterns in prokaryotic sequences (the TATA patterns) are already known. After successful initial testing, we tested the algorithm on protein sequences. We chose to start with Cytochrome, the protein responsible for redox (oxidation and reduction) reactions [5, 6]. The initial testing began on just twenty sequences, with the longest sequence being 577 nucleotides. In this section, we compare our algorithm with two popular motif searching tools, MEME and the Gibbs sampler [7].
Figure 2 compares the three algorithms. Both MEME and the Gibbs sampler have a character limit, so we could not test them on more than 30 protein sequences, each averaging 200 bp. Our algorithm has a clear speed advantage over the other motif discovery tools, although this comparison gives only an approximate picture of the relative speeds (see Figure 2). Our algorithm ran successfully on up to 1500 sequences but failed beyond 1500 (see Figure 3).
[Plot "Process Time Versus": time in milliseconds (logarithmic axis, 1 to 100000) against number of sequences in sample data (5, 10, 20, 30); series: MEME, Gibbs, Our Algorithm]
[Plot "Performance Curve": time in milliseconds (0 to 80000) against number of sequences (0 to 1500); series: Our Algorithm]
4. References
[1] J. C. Setubal, J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, Boston, USA, 1997.
[2] P. A. Pevzner, H. Tang, and M. S. Waterman, An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748-9753, August 2001.
[3] T. L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, In The Second International Conference on Intelligent Systems for Molecular Biology, pages 28-36, Stanford, CA, USA, 1994. AAAI Press.
[4] D. Z. Du, F. K. Hwang, Generalized de Bruijn digraphs, Networks, Vol. 18, pp. 27-38, 1988.
[5] Cytochrome P450 cysteine heme-iron ligand signature. http://www.expasy.org/cgi-bin/nicedoc.pl?PDOC00081.
[6] Cytochrome. http://en.wikipedia.org/wiki/Cytochrome, June 2006. Cytochrome is a protein family.
[7] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, Readings in uncertain reasoning, pages 452-472, 1990.
Figure 2. Process Time Versus
Figure 3. Performance Curve