A New Approach for Motif Discovery Based on the de Bruijn Graph


    Hong Zhou 1,2, Zheng Zhao 1, Hongpo Wang 2

    1. School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China
    2. Department of Computer Science and Information Engineering, Tianjin Agricultural University, Tianjin, 300384, China

    Abstract

    This paper presents a new approach to discovering conserved regions, such as motifs, in DNA or protein sequences. We follow a graph-based approach built on de Bruijn graphs, which have been successfully applied to problems such as local alignment and DNA fragment assembly. Our method uses the de Bruijn graph to discover the conserved regions in DNA or protein sequences. We found that the algorithm mined signals from a larger number of sequences, and at a faster rate, than some popular motif searching tools.

    1. Introduction

    A motif is a repeating pattern in a biological sequence that is conserved during evolution. Motif discovery is an important problem in biology, with applications in DNA and protein sequence analysis and in understanding disease susceptibility and cures. Numerous motif discovery algorithms have been proposed to date.

    Graph theory plays an important role in computational biology [1]. Graph-based algorithms provide simpler and quicker solutions to computationally intensive problems such as DNA fragment assembly [2] and motif discovery. However, the literature on motif discovery using graph algorithms is small relative to the potential of graph-based algorithms.

    This paper focuses on developing a time-efficient algorithm for motif discovery using a de Bruijn graph. We found that the algorithm mined signals from a larger number of sequences, and at a faster rate, than some popular motif searching tools such as MEME [3]. The next few sections describe the importance of the motif discovery problem and the advancements in algorithmic approaches to motif discovery.

    2. Method

    Our method starts by constructing the de Bruijn graph. Because the constructed graph may contain cycles, the second step removes them, transforming the graph into a directed acyclic graph. The third step computes a high-weight path on the directed acyclic graph and then extracts the conserved regions. This is the main idea of our method, which we describe in detail below.

    The algorithm can be formally described as follows.

    Input: S = {s_1, s_2, ..., s_n}, each s_i has length l_i
    Output: S^r = {s^r_1, s^r_2, ..., s^r_n}, each s^r_i has length m
    1. use S to construct the de Bruijn graph G = (V, E)
    2. eliminate cycles of G
    3. get a high-weight path from G, then construct the consensus sequence s_c from the path
    4. FOR i ← 1 TO n
    5.     DO s^r_i ← conserveextract(s_c, s_i)
    6. construct the output result S^r = {s^r_1, s^r_2, ..., s^r_n}
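
Steps 4 to 6 rely on conserveextract, which the text does not define. As a minimal sketch, assuming that it returns the window of each sequence that is closest to the consensus in Hamming distance (an assumption, not the authors' stated rule), the extraction loop might look like this in Python. The names hamming and conserve_extract, the sample sequences, and the consensus string are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def conserve_extract(consensus: str, seq: str) -> str:
    """Return the window of seq, of length len(consensus), closest to consensus.

    Assumption: "closest" means minimum Hamming distance, with ties going to
    the leftmost window. The paper does not specify this rule.
    """
    m = len(consensus)
    if len(seq) < m:
        raise ValueError("sequence is shorter than the consensus")
    windows = (seq[j:j + m] for j in range(len(seq) - m + 1))
    return min(windows, key=lambda w: hamming(consensus, w))


# Hypothetical input; in the method the consensus comes from the heaviest path.
sequences = ["TTACCGTA", "GACCGTCC", "ACCTTCGA"]
consensus = "ACCGT"
extracted = [conserve_extract(consensus, s) for s in sequences]
```

Every extracted region has the same length m as the consensus, which matches the stated output of the algorithm.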

    2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery

    978-0-7695-3735-1/09 $25.00 © 2009 IEEE
    DOI 10.1109/FSKD.2009.542


    2.1. Construct the de Bruijn Graph

    A de Bruijn graph is a graph whose vertices are subsequences of a given sequence and whose edges indicate overlapping subsequences [4]. Consider the sequence ACCGTCT. It can be resolved into the following fragments of length 4: ACCG, CCGT, CGTC, and GTCT. Each fragment is called an l-tuple. The (l-1)-tuples are obtained by further fragmenting each l-tuple; for example, ACC and CCG are the (l-1)-tuples of ACCG. The (l-1)-tuples form the nodes of the de Bruijn graph. An edge exists between two (l-1)-tuples when they are the prefix and suffix of the same l-tuple, and the edge represents that l-tuple. The multiplicity of each edge is recorded as an edge weight. Initially every edge has weight 1. Whenever the same pair of consecutive vertices (vertices with the specified overlap) occurs again, the weight of the edge is incremented by 1. The conserved regions are most likely to reside on the most repeated edges, i.e. the edges with the greatest multiplicity (and hence the greatest weight).
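
The construction described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' C implementation: a dictionary keyed by the pair of (k-1)-tuples stands in for the hash table, positions are 0-based, and the second input sequence is a hypothetical addition so that some edges repeat.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Build a weighted de Bruijn graph from the k-tuples of the sequences.

    Nodes are (k-1)-tuples; each k-tuple contributes one edge from its
    prefix to its suffix. edges[(left, right)] lists the (i, j) positions
    (sequence index, offset) where that k-tuple occurs, so the weight
    (multiplicity) of an edge is the length of its list.
    """
    edges = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - k + 1):
            tup = seq[j:j + k]
            edges[(tup[:-1], tup[1:])].append((i, j))
    return edges


# ACCGTCT is the example from the text; TACCGTA is a hypothetical second input.
edges = build_de_bruijn(["ACCGTCT", "TACCGTA"], k=4)
weights = {edge: len(occurrences) for edge, occurrences in edges.items()}
```

Here the edges ACC -> CCG and CCG -> CGT occur in both sequences and so get weight 2, while every other edge has weight 1.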

    Procedure ConstructG
    Input: S = {s_1, s_2, ..., s_n}, each s_i has length l_i
    Output: G = (V, E)
    FOR i ← 1 TO n
        DO FOR j ← 1 TO (l_i - k + 1)
            DO T = s_i(j, ..., j+k-1)
               T_L = s_i(j, ..., j+k-2)
               T_R = s_i(j+1, ..., j+k-1)
               v_L ← -1
               v_R ← -1
               IF (hashtable(T_L))
                   THEN v_L = hashtable(T_L).node
               IF (hashtable(T_R))
                   THEN v_R = hashtable(T_R).node
               IF v_L ≥ 0 AND v_R ≥ 0 AND ∃ e ∈ E in (v_L, v_R)
                   THEN add sequence info {i, j} to e
               ELSE IF v_L


    Figure 1 illustrates the cycle elimination process. In Figure 1(a), there is a cycle made up of the vertices {v_1, v_2, v_3} and the edges {e_1, e_2, e_3}. The vertex v_2 carries the least information, so the cycle is eliminated by making a copy v_N2 of vertex v_2. The result is Figure 1(b). A cycle still exists in Figure 1(b), and now vertex v_3 carries the least information, so this cycle is eliminated by making a copy v_N3 of vertex v_3. The result is Figure 1(c). The process continues in this way until the final result, Figure 1(d), is reached.

    Note that the loop in the original graph has been removed, so the graph is now a directed acyclic graph.
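
One way to confirm that the transformed graph really is acyclic is a topological-order check. The sketch below uses Kahn's algorithm, which is a standard technique and not something the paper describes. The function name is_acyclic is hypothetical, and so is the rewiring in the example: the text says only that a copy of the vertex is made, not which edges are moved to it.

```python
from collections import deque


def is_acyclic(edges):
    """Return True if the directed graph given as (u, v) pairs has no cycle.

    Kahn's algorithm: repeatedly remove vertices with in-degree zero. The
    graph is acyclic exactly when every vertex can be removed this way.
    """
    successors = {}
    indegree = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        successors.setdefault(v, [])
        indegree[v] = indegree.get(v, 0) + 1
        indegree.setdefault(u, 0)
    queue = deque(v for v, d in indegree.items() if d == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(indegree)


# The cycle v1 -> v2 -> v3 -> v1, and the same graph after the edge into v2
# is redirected to a copy vN2 (an assumed rewiring, for illustration only).
cyclic = [("v1", "v2"), ("v2", "v3"), ("v3", "v1")]
split = [("v1", "vN2"), ("v2", "v3"), ("v3", "v1")]
```

With these inputs, is_acyclic(cyclic) is False and is_acyclic(split) is True.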

    2.3. Extract the conserved regions

    After performing the above transformations, we apply a depth-first search (DFS) to find a heaviest path in linear time. The weight of each edge is proportional to its multiplicity and length. After obtaining the high-weight edges, the algorithm tries to locate the entire repeated region, called the active region. An active region is a high-weight region in which many sequences coincide on a particular set of consecutive edges. In other words, an active region contains consecutive edges that have repeated themselves. Not every high-weight edge represents a motif, because there is a fair chance that it is simply a repeat. Hence the search is for more prominent edges: we try to identify edges that have more weight than the other edges in an active region.
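
The heaviest-path step can be illustrated with a memoised depth-first search, which visits each vertex and edge of a directed acyclic graph once. This is a sketch under simplifying assumptions: the edge weight is taken to be the multiplicity alone (the paper also factors in length), the active-region analysis is omitted, and the names heaviest_path and spell_path and the example weights are hypothetical.

```python
from functools import lru_cache


def heaviest_path(weights):
    """Return (total_weight, path) for a maximum-weight path in a DAG.

    weights maps (u, v) edges to positive numbers. The graph must already
    be acyclic; otherwise the recursion would not terminate.
    """
    successors = {}
    for (u, v), w in weights.items():
        successors.setdefault(u, []).append((v, w))
        successors.setdefault(v, [])

    @lru_cache(maxsize=None)
    def best_from(u):
        # Best (weight, path) over all paths that start at vertex u.
        best = (0, (u,))
        for v, w in successors[u]:
            sub_weight, sub_path = best_from(v)
            if w + sub_weight > best[0]:
                best = (w + sub_weight, (u,) + sub_path)
        return best

    return max((best_from(u) for u in successors), key=lambda r: r[0])


def spell_path(path):
    """Spell a de Bruijn path: the first node plus the last letter of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical edge multiplicities for a small acyclic de Bruijn graph.
weights = {
    ("TAC", "ACC"): 1, ("ACC", "CCG"): 2, ("CCG", "CGT"): 2,
    ("CGT", "GTC"): 1, ("GTC", "TCT"): 1, ("CGT", "GTA"): 1,
}
total, path = heaviest_path(weights)
consensus = spell_path(path)
```

For this input the heaviest path has total weight 7 and spells TACCGTCT. On very long paths the recursion could exceed Python's default recursion limit, in which case an iterative pass over a topological order would be the safer choice.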

    Figure 1. Eliminating cycles

    3. Results and Analysis

    The algorithm was implemented in C on a 2.66 GHz Pentium 4 running Linux. Initially the algorithm was tested on DNA sequences belonging to prokaryotes. The advantage of these sequences is that they have very few or negligible repeats. In addition, the common repeating patterns in prokaryotic sequences (the TATA patterns) are already known. After successful initial testing, we tested the algorithm on protein sequences. We chose to start with cytochrome, a protein responsible for redox (oxidation and reduction) reactions [5, 6]. The initial testing began with just twenty sequences, the longest being 577 nucleotides. In this section, we compare our algorithm with two popular motif searching tools, MEME and the Gibbs sampler [7].

    Figure 2 compares the three algorithms. Both MEME and the Gibbs sampler have a character limit, so we could not test them on more than 30 protein sequences, each averaging 200 bp. Our algorithm shows a clear speed advantage over the other motif discovery tools, although this comparison gives only an approximate picture of the relative speeds (see Figure 2). Our algorithm ran successfully on up to 1500 sequences but failed beyond 1500 (see Figure 3).

    [Plot "Process Time Versus": x-axis, Number of Sequences in Sample Data (5, 10, 20, 30); y-axis, Time in Milliseconds (logarithmic, 1 to 100000); series: MEME, Gibbs, Our Algorithm.]

    [Plot "Performance Curve": x-axis, Number of Sequences (0 to 1500); y-axis, Time in Milliseconds (0 to 80000); series: Our Algorithm.]

    4. References

    [1] J. C. Setubal, J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, Boston, USA, 1997.

    [2] P. A. Pevzner, H. Tang, M. S. Waterman, An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748-9753, August 2001.

    [3] T. L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in The Second International Conference on Intelligent Systems for Molecular Biology, pages 28-36, Stanford, CA, USA, 1994. AAAI Press.

    [4] D. Z. Du, F. K. Hwang, Generalized de Bruijn digraphs, Networks, Vol. 18, pp. 27-38, 1988.

    [5] Cytochrome P450 cysteine heme-iron ligand signature. http://www.expasy.org/cgi-bin/nicedoc.pl?PDOC00081.

    [6] Cytochrome. http://en.wikipedia.org/wiki/Cytochrome, June 2006. Cytochrome is a protein family.

    [7] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, Readings in Uncertain Reasoning, pages 452-472, 1990.

    Figure 2. Process Time Versus

    Figure 3. Performance Curve