A New Approach for Motif Discovery Based on the de Bruijn Graph

Hong Zhou 1,2, Zheng Zhao 1, Hongpo Wang 2

1. School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China
2. Department of Computer Science and Information Engineering, Tianjin Agricultural University, Tianjin, 300384, China

[email protected]

Abstract

This paper presents a new approach to discovering conserved regions, such as motifs, in DNA or protein sequences. We follow a graph-based approach built on the de Bruijn graph, which has been successfully adopted to solve problems such as local alignment and DNA fragment assembly. Our method harnesses the de Bruijn graph to discover the conserved regions in DNA or protein sequences. We found that the algorithm successfully mined signals from a larger number of sequences, and at a faster rate, than some popular motif-searching tools.

1. Introduction

A motif is a recurring pattern in a biological sequence that is conserved during evolution. Motif discovery is an important problem in biology, with applications in DNA and protein sequence analysis and in understanding disease susceptibility and disease cures. Numerous motif discovery algorithms have been proposed to date.

Graph theory plays an important role in computational biology [1]. Graph-based algorithms provide simpler and quicker solutions to computationally intensive problems such as DNA fragment assembly [2] and motif discovery. However, the literature on motif discovery using graph algorithms is small relative to the potential of graph-based algorithms.

This paper focuses on developing a time-efficient algorithm for motif discovery using a de Bruijn graph. We found that the algorithm successfully mined signals from a larger number of sequences, and at a faster rate, than popular motif-searching tools such as MEME [3]. The following sections describe the method and compare its running time with that of existing tools.

2. Method

Our method starts by constructing the de Bruijn graph. Because the constructed graph may contain cycles, the second step removes them, transforming the graph into a directed acyclic graph. The third step computes a heavy path on the directed acyclic graph and extracts the conserved regions from it. This is the main idea of our method; each step is described in detail below.

The algorithm can be formally described as follows.

Input: S = {s_1, s_2, ..., s_n}, each s_i has length l_i
Output: S^r = {s^r_1, s^r_2, ..., s^r_n}, each s^r_i has length m
1. use S to construct the de Bruijn graph G = (V, E)
2. eliminate the cycles of G
3. get a high-weight path from G, then construct the consensus sequence s_c from the path
4. FOR i ← 1 TO n
5.   DO s^r_i ← conserveextract(s_c, s_i)
6. construct the output result S^r = {s^r_1, s^r_2, ..., s^r_n}

2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery

978-0-7695-3735-1/09 $25.00 2009 IEEE, DOI 10.1109/FSKD.2009.542

2.1. Construct the de Bruijn Graph

A de Bruijn graph is a graph whose vertices are subsequences of a given sequence and whose edges indicate overlaps between those subsequences [4]. Consider the sequence ACCGTCT. It can be resolved into the following fragments of length 4: ACCG, CCGT, CGTC, and GTCT. Each fragment is called an l-tuple. Each l-tuple contains two (l-1)-tuples, its prefix and its suffix; for example, ACC and CCG are the (l-1)-tuples of ACCG. The (l-1)-tuples form the nodes of the de Bruijn graph. A directed edge joins the prefix (l-1)-tuple of each l-tuple to its suffix (l-1)-tuple, so each edge represents an l-tuple. The multiplicity of an edge is recorded as its weight. Initially every edge has weight 1. Whenever the same pair of consecutive vertices (vertices with the specified overlap) occurs again, the weight of that edge is incremented by 1. The conserved regions are most likely to reside on the most frequently repeated edges, i.e. the edges with the greatest multiplicity and therefore the greatest weight.
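The edge weighting described above can be illustrated with a minimal Python sketch. The function name de_bruijn_edges and the choice of a Counter keyed by (prefix, suffix) pairs are assumptions made for illustration only; they are not taken from the authors' C implementation.

```python
from collections import Counter


def de_bruijn_edges(sequence, l):
    """Count the l-tuples of `sequence` as weighted de Bruijn edges.

    Each key is a (prefix, suffix) pair of (l-1)-tuples, i.e. one edge;
    each value is the multiplicity (weight) of that edge.
    """
    edges = Counter()
    for j in range(len(sequence) - l + 1):
        l_tuple = sequence[j:j + l]
        edges[(l_tuple[:-1], l_tuple[1:])] += 1
    return edges


# The example sequence from the text: four distinct 4-tuples, each seen once.
for (left, right), weight in de_bruijn_edges("ACCGTCT", 4).items():
    print(left, "->", right, "weight", weight)

# A sequence with repeats: the edges AC->CG and CG->GA each occur twice.
print(de_bruijn_edges("ACGACGA", 3))
```

In the first call every edge has weight 1; in the second, the repeated 3-tuples ACG and CGA raise the weight of their edges to 2, which is the signal the method looks for.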

Procedure ConstructG
Input: S = {s_1, s_2, ..., s_n}, each s_i has length l_i
Output: G = (V, E)
1. FOR i ← 1 TO n
2.   DO FOR j ← 1 TO (l_i - k + 1)
3.     DO T = s_i(j, ..., j+k-1)
4.        T_L = s_i(j, ..., j+k-2)
5.        T_R = s_i(j+1, ..., j+k-1)
6.        v_L ← -1
7.        v_R ← -1
8.        IF (hashtable(T_L))
9.          THEN v_L = hashtable(T_L).node
10.       IF (hashtable(T_R))
11.         THEN v_R = hashtable(T_R).node
12.       IF v_L ≥ 0 AND v_R ≥ 0 AND e ∈ E in (v_L, v_R)
13.         THEN add sequence info {i, j} to e
14.       ELSE IF v_L
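The available steps of ConstructG can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' procedure: the listing above breaks off at step 14, so the handling of tuples that are not yet in the hash table (creating the missing vertex and edge) is assumed here, and the names construct_graph, node_of, labels and edges are hypothetical.

```python
def construct_graph(sequences, k):
    """Build a de Bruijn graph over several sequences.

    Returns (labels, edges): labels[v] is the (k-1)-tuple of vertex v, and
    edges[(v_left, v_right)] is the list of (i, j) occurrences, where i is the
    1-based sequence index and j the 1-based start position of the k-tuple.
    """
    node_of = {}   # hash table: (k-1)-tuple -> vertex id
    labels = []    # vertex id -> (k-1)-tuple
    edges = {}     # (v_left, v_right) -> sequence info [(i, j), ...]

    def vertex(t):
        # Look the tuple up; create the vertex when it is unknown.
        # ASSUMPTION: this stands in for the part of the listing that is cut off.
        if t not in node_of:
            node_of[t] = len(labels)
            labels.append(t)
        return node_of[t]

    for i, s in enumerate(sequences, start=1):
        for j in range(1, len(s) - k + 2):
            t = s[j - 1:j + k - 1]      # k-tuple s_i(j, ..., j+k-1)
            v_left = vertex(t[:-1])     # T_L = s_i(j, ..., j+k-2)
            v_right = vertex(t[1:])     # T_R = s_i(j+1, ..., j+k-1)
            edges.setdefault((v_left, v_right), []).append((i, j))
    return labels, edges


labels, edges = construct_graph(["ACCGT", "CCGTA"], 4)
for (a, b), info in edges.items():
    print(labels[a], "->", labels[b], "weight", len(info), "occurrences", info)
```

Storing the (i, j) pairs on each edge keeps the edge weight available as the length of the list, and keeps the positions available for locating the conserved regions in the input sequences later.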



Figure 1 illustrates the cycle elimination process. In Figure 1(a), a cycle exists that includes the vertices {v_1, v_2, v_3} and the edges {e_1, e_2, e_3}. The vertex v_2 has the least information, so the cycle is eliminated by making a copy v_N2 of vertex v_2. The result is Figure 1(b), in which a cycle still exists. Now vertex v_3 has the least information, so that cycle is eliminated by making a copy v_N3 of vertex v_3. The result is Figure 1(c). The process continues in this way until the final result, Figure 1(d), is reached.

Note that every loop in the original graph has been removed, so the graph is now a directed acyclic graph.
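One possible reading of this step is sketched below in Python. The text does not define how the "information" of a vertex is measured or how the copy is wired into the graph, so both are assumptions here: information is taken to be the total weight of the edges incident to a vertex, and the copy receives the cycle edge that entered the original vertex and has no outgoing edges. The function names and the "_copy" suffix are hypothetical.

```python
def find_cycle(edges):
    """Return the vertices of one directed cycle in order, or None if acyclic."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
    state = {}  # vertex -> "open" (on the current DFS path) or "done"
    for root in succ:
        if root in state:
            continue
        state[root] = "open"
        path, stack = [root], [iter(succ[root])]
        while stack:
            nxt = next(stack[-1], None)
            if nxt is None:                 # all successors explored
                state[path.pop()] = "done"
                stack.pop()
            elif nxt not in state:          # tree edge: go deeper
                state[nxt] = "open"
                path.append(nxt)
                stack.append(iter(succ[nxt]))
            elif state[nxt] == "open":      # back edge: cycle found
                return path[path.index(nxt):]
    return None


def eliminate_cycles(edges):
    """Break every cycle by copying its least-informative vertex.

    `edges` maps (u, v) to a weight. ASSUMPTIONS: information = total weight of
    incident edges; the copy takes over the cycle edge entering the vertex.
    """
    edges = dict(edges)
    copies = 0
    while (cycle := find_cycle(edges)) is not None:
        def info(v):
            return sum(w for (a, b), w in edges.items() if v in (a, b))
        victim = min(cycle, key=info)
        pred = cycle[cycle.index(victim) - 1]   # predecessor on the cycle
        copies += 1
        edges[(pred, f"{victim}_copy{copies}")] = edges.pop((pred, victim))
    return edges


cyclic = {("v1", "v2"): 1, ("v2", "v3"): 1, ("v3", "v1"): 3}
print(eliminate_cycles(cyclic))
```

Under these assumptions each iteration moves one edge onto a vertex that has no outgoing edges, so that edge can never lie on a cycle again and the loop terminates after at most |E| iterations. Edge weights are preserved, which keeps the multiplicity information intact for the path search that follows.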

2.3. Extract the conserved regions

After performing the above transformations, we apply a depth-first search (DFS) to find a heaviest path in linear time. The weight of each edge is proportional to its multiplicity and length. After obtaining the high-weight edges, the algorithm tries to locate the entire repeated region, called the active region. An active region is a high-weight region where many sequences coincide on a particular set of consecutive edges; in other words, it is a region that contains consecutive edges that have repeated themselves. Not every high-weight edge represents a motif, because there is a fair chance that it is merely a repeat. Hence we search for the more prominent edges: those that have more weight than the other edges in an active region.
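The heaviest-path search can be illustrated with a memoised depth-first search, which visits each vertex and each edge of the acyclic graph once. The sketch below is an illustration under assumptions, not the authors' code: edge weights are taken to be plain multiplicities (the text also mentions length), the graph is assumed to be non-empty and acyclic, and the names heaviest_path and consensus are hypothetical. The consensus step relies only on the de Bruijn property that consecutive vertices overlap in all but one character.

```python
def heaviest_path(edges):
    """Return (total_weight, path) for a heaviest path in a weighted DAG.

    `edges` maps (u, v) to a positive weight; the graph must be acyclic.
    """
    succ = {}
    for (u, v), w in edges.items():
        succ.setdefault(u, []).append((v, w))
        succ.setdefault(v, [])
    best = {}  # vertex -> (weight of the best path starting there, next vertex)

    def visit(u):
        # Depth-first search with memoisation: each vertex is solved once.
        # Recursion depth grows with path length, so very long paths would
        # need an iterative version.
        if u not in best:
            choice = (0, None)
            for v, w in succ[u]:
                candidate = w + visit(v)[0]
                if candidate > choice[0]:
                    choice = (candidate, v)
            best[u] = choice
        return best[u]

    start = max(succ, key=lambda u: visit(u)[0])
    path = [start]
    while best[path[-1]][1] is not None:
        path.append(best[path[-1]][1])
    return best[start][0], path


def consensus(path):
    """Spell the sequence of a path of overlapping (l-1)-tuples."""
    return path[0] + "".join(node[-1] for node in path[1:])


dag = {("ACC", "CCG"): 3, ("CCG", "CGT"): 2, ("CCG", "CGA"): 1, ("CGT", "GTC"): 2}
weight, path = heaviest_path(dag)
print(weight, path, consensus(path))
```

For this small graph the search follows ACC, CCG, CGT, GTC, with total weight 7, and the path spells the consensus sequence ACCGTC.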

Figure 1. Eliminating cycles

3. Results and Analysis

The algorithm was implemented in C and run on a 2.66 GHz Pentium 4 under the Linux operating system. Initially the algorithm was tested on DNA sequences from prokaryotes. The advantage of these sequences is that they have a very low or negligible number of repeats. In addition, the common repeating patterns in prokaryotic sequences (the TATA patterns) are already known. After successful initial testing, we tested the algorithm on protein sequences. We chose to start with the cytochromes, proteins responsible for redox (oxidation and reduction) reactions [5, 6]. The initial testing began on just twenty sequences, with the longest sequence being 577 nucleotides. In this section, we compare our algorithm with two popular motif-searching tools, MEME and the Gibbs sampler [7].

Figure 2 shows the comparison among the three algorithms. Both MEME and the Gibbs sampler have a character limit, so we could not test them on more than 30 protein sequences, each averaging 200 bp. Within that range our algorithm has a clear speed advantage over the other motif discovery tools, although the comparison gives only an approximate picture of the relative speeds (see Figure 2). Our algorithm ran successfully on up to 1500 sequences but failed beyond that number (see Figure 3).

[Figure 2 plot, "Process Time Versus": x-axis, Number of Sequences in Sample Data (5, 10, 20, 30); y-axis, Time in Milliseconds (logarithmic scale, 1 to 100000); series: MEME, Gibbs, Our Algorithm]

[Figure 3 plot, "Performance Curve": x-axis, Number of Sequences (0 to 1500); y-axis, Time in Milliseconds (0 to 80000); series: Our Algorithm]

4. References

[1] J. C. Setubal, J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, Boston, USA, 1997.

[2] P. A. Pevzner, H. Tang, and M. S. Waterman, An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748-9753, August 2001.

[3] T. L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, In The Second International Conference on Intelligent Systems for Molecular Biology, pages 28-36, Stanford, CA, USA, 1994. AAAI Press.

[4] D. Z. Du, F. K. Hwang, Generalized de Bruijn digraphs, Networks, Vol. 18, pp. 27-38, 1988.

[5] Cytochrome P450 cysteine heme-iron ligand signature. http://www.expasy.org/cgi-bin/nicedoc.pl?PDOC00081.

[6] Cytochrome. http://en.wikipedia.org/wiki/Cytochrome, June 2006. Cytochrome is a protein family.

[7] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, Readings in Uncertain Reasoning, pages 452-472, 1990.

Figure 2. Process Time Versus

Figure 3. Performance Curve
