
[IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA (2006.10.16-2006.10.18)] Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06)



Mining the Database of Transcription Binding Sites

Wei Peng, Tao Li and Giri Narasimhan

School of Computer Science, Florida International University

Miami, FL 33199

{wpeng002,taoli,[email protected]}

Abstract

In this paper, we study the problems of motif discovery and gene regulation. First, although the sliding window technique based on profiles or consensus sequences is a standard method for discovering motifs in the genomes with prior knowledge of transcription binding sites in orthologous genes from related organisms, it usually has high computational costs. In this paper, we propose an efficient approximation method employing randomized algorithms to identify motifs. The approximation method can be easily combined with the sliding-window technique for efficient and accurate motif discovery. Second, we mine frequent motif combinations and sequential motif patterns to investigate the regulatory relationships between motifs and provide a better understanding of gene expression, regulation, and transcription.

1. Introduction

Determining the location of transcription factor binding sites is an important initial step toward unraveling the complexity of gene networks. Numerous computational methods have been proposed to assist biologists in finding putative transcription factor binding site locations. These binding sites are often described as patterns or motifs. Given a motif or pattern, it is useful to know all the sites where this pattern can be found. Given motifs describing the binding sites in one genome, it is also useful to know all the locations where they occur in a related genome. We consider efficient ways to solve this problem. In particular, using the transcription factor binding sites in the parasite Plasmodium falciparum, as predicted in the database PlasmoTFBM [1], we found all occurrences of those patterns in the genomes of two other related species, Plasmodium berghei and Plasmodium yoelii.

The sliding window technique is a common method to search for motifs. It takes O(NM) time to find a given motif in one nucleotide sequence, where N is the length of the sequence and M is the length of the motif. Since genomic sequences are very long and searching them is a time-consuming process, we propose a randomized approximation algorithm to find all occurrences of the given motif [4]. The time complexity of the approximation approach is only O(N log M). Thus the algorithm for finding all the matches involves a two-stage approach: (1) use the approximation algorithm to efficiently look for approximate motif positions, and (2) use the sliding-window technique to determine the exact matches in the neighborhood of the approximate matches (see footnote 1).

It is also known that a collection of transcription factors is together responsible for regulating the transcription of a gene. This can be investigated by identifying combinations of transcription factor binding sites. Once again, using the putative transcription factor binding sites listed in the database PlasmoTFBM [1], we sought to identify frequent combinations of binding sites.

In this paper, we consider the problem of efficiently finding motif hits and frequent combinations of motif hits. Section 2 introduces the fast approximation algorithm to identify motif hits; Section 3 applies the Apriori algorithm [5] to mine for frequent combinations of motifs; Section 4 describes the application of the GSP algorithm [6] to mine frequent sequential patterns of motifs; Section 5 presents our experimental results.

2. Discovering Motifs Based on a Randomized Algorithm

2.1. Introduction

The sliding window technique is one of the most frequently used methods for discovering conserved patterns (motifs) with the given profiles (frequency matrix) or consensus sequences. When searching the nucleotide sequence for motifs, the sliding window, which has the same size as the target motif, moves along the sequence, base by base.

1 The size of the neighborhood is usually decided based on experiments or prior experience.

Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06), 0-7695-2727-2/06 $20.00 © 2006


The HMM is constructed according to the profile of the motif. Suppose the sequence is $G = g_0 g_1 \cdots g_{N-1}$, and the given motif is $F = f_0 f_1 \cdots f_{M-1}$. The score vector $S$ contains $N$ components, of which the $i$th component is the number of matches between $G$ and $F$ when the first letter of $F$ is aligned with the $i$th letter of $G$ [4]. A component of the score vector is calculated each time the sequence in the sliding window is checked by the HMM to decide whether it is the target motif. If its score is greater than or equal to the pre-defined threshold, the window has a high probability of corresponding to a transcription factor binding site.
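The score vector for a consensus-sequence motif can be sketched directly from this definition. The following is a minimal illustration, not the authors' implementation: the function names and the toy sequence are hypothetical, and only full-length windows are scored, so the result has N - M + 1 entries rather than N.

```python
def sliding_window_scores(sequence: str, motif: str) -> list[int]:
    """For each window start i, count the bases at which motif agrees with
    sequence[i:i+M]. Only full-length windows are scored (an assumption of
    this sketch), so the result has N - M + 1 entries."""
    m = len(motif)
    return [
        sum(a == b for a, b in zip(sequence[i:i + m], motif))
        for i in range(len(sequence) - m + 1)
    ]


def motif_hits(sequence: str, motif: str, threshold: int) -> list[int]:
    """Positions whose score is greater than or equal to the threshold."""
    scores = sliding_window_scores(sequence, motif)
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy sequence (hypothetical), searched for the consensus "ATAATTAT".
toy = "GGATAATTATCCATTATTATGG"
print(motif_hits(toy, "ATAATTAT", threshold=7))  # [2, 12]
```

Each window costs M comparisons, which is where the O(NM) running time of the sliding window comes from.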

We propose an approximation algorithm to discover motif hits efficiently. It estimates the score vector of approximate matches between a nucleotide sequence of $N$ bases and a motif pattern string of length $M$ in time $O(N \log M)$. Rather than computing the exact score vector, the algorithm calculates it approximately. The score vector is calculated by a convolution, which can be efficiently computed by the Fast Fourier Transform (FFT) [2]. The approximation algorithm improves the ability to handle large datasets because the convolution of two vectors of the same length $n$ can be performed in time $O(n \log n)$ [4].

2.2. Algorithm Description

The approximation algorithm approximates the score vector by computing the mean, over all the possible mappings, of the convolution of two randomized finite sequences. Assume that we attempt to estimate the score vector $S = s_0 s_1 \cdots s_{N-1}$ of the nucleotide sequence $G = g_0 g_1 \cdots g_{N-1}$ and the given motif consensus sequence $F = f_0 f_1 \cdots f_{M-1}$. Since sequence $G$ and motif pattern $F$ are over a finite alphabet $\Sigma = \{A, G, C, T\}$, each letter can be mapped to an integer in $\gamma = \{0, 1, 2, 3\}$ by using a function $\Psi$. After mapping, sequence $G$ can be represented by $T = t_0 t_1 \cdots t_{N-1}$, and motif pattern $F$ by $P = p_0 p_1 \cdots p_{M-1}$. If there is a match between $T$ and $P$ at position $i$, that is $t_i = p_i$, the corresponding term of the Hermitian inner product contributes 1:

$$\sum_{i=0}^{M-1} \omega^{t_i}\,\overline{\omega^{p_i}} = \sum_{i=0}^{M-1} \omega^{t_i - p_i},$$

where $\omega$ indicates a primitive 4th root of unity (see footnote 2). $\omega$ can be any $e^{\frac{k}{2}\pi i}$, where $k$ is an integer taken from $\gamma$; for $\omega$ to be primitive, $k$ must be 1 or 3. Then $T$ can be transformed to the complex sequence $T = \omega^{t_0}\omega^{t_1}\cdots\omega^{t_{N-1}}$, and $P$ to $P = \omega^{-p_0}\omega^{-p_1}\cdots\omega^{-p_{M-1}}$.
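The claim that the mean over all mappings recovers the match count can be checked on a single window. The sketch below is an illustration under stated assumptions: it fixes $\omega = i$, enumerates all $4^4$ maps from the alphabet to {0, 1, 2, 3} (including non-injective ones), and uses hypothetical function names.

```python
from itertools import product

ALPHABET = "AGCT"
# Powers of omega = i, a primitive 4th root of unity: omega**d == POWERS[d % 4].
POWERS = (1, 1j, -1, -1j)


def inner_product(window: str, motif: str, mapping: dict[str, int]) -> complex:
    """Sum of omega**(t_i - p_i) over aligned positions, for one letter mapping."""
    return sum(POWERS[(mapping[a] - mapping[b]) % 4] for a, b in zip(window, motif))


def mean_over_all_mappings(window: str, motif: str) -> complex:
    """Average the inner product over all 4**4 maps from the alphabet to {0,1,2,3}."""
    mappings = [dict(zip(ALPHABET, values)) for values in product(range(4), repeat=4)]
    return sum(inner_product(window, motif, m) for m in mappings) / len(mappings)


# "ATTATTAT" agrees with "ATAATTAT" at 7 of 8 positions, so the mean has
# real part 7 and imaginary part 0.
print(mean_over_all_mappings("ATTATTAT", "ATAATTAT"))
```

Matching positions contribute 1 under every mapping, while the contributions of a mismatching pair cancel when summed over all mappings, which is why a few random mappings can serve as an estimate.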

A convolution is constructed to represent the amount of overlap between $F$ and the reversed sequence of $G$. Suppose $X = [x_0, x_1, \cdots, x_{n-1}]$ and $Y = [y_0, y_1, \cdots, y_{n-1}]$ are two vectors of the same length $n$. The convolution of $X$ and $Y$ can be denoted as in [3, 7] by $X \otimes Y = [c_0, c_1, \cdots, c_{n-1}]$, where $c_j = \sum_{i=0}^{j} x_i y_{j-i}$. If we reverse the sequence $X$ to $X' = x'_0 x'_1 \cdots x'_{n-1}$, then $x'_i = x_{n-1-i}$. We can easily obtain that

$$(X' \otimes Y)'_i = \sum_{j=0}^{n-1-i} x'_j\, y_{n-1-i-j} = \sum_{j=0}^{n-1-i} x_{n-1-j}\, y_{n-1-i-j}.$$

2 A complex number $\mu$ is called an $n$th root of unity if $\mu^n = 1$.

It can then be seen that the convolution can be used to compute the score vector, if $x_i y_j$ satisfies two conditions: (1) it contributes a positive score when $x_i$ and $y_j$ overlap or match; (2) it contributes a perturbing score when they do not overlap or match. The root of unity can be used to transform nucleotide sequences and motif sequences to satisfy these two conditions. If the two vectors are of lengths $n$ and $m$, with $n \ge m$, then the shorter vector is padded with $n - m$ zeroes. The convolution of $X$ and $Y$ can then be calculated by FFT as follows [7]: $X \otimes Y = \mathrm{FFT}^{-1}(\mathrm{FFT}(X) \cdot \mathrm{FFT}(Y))$, where $\cdot$ denotes the component-wise product. The convolution of two vectors of length $n$ runs in time $O(n \log n)$, since the FFT is able to calculate all the components of the score vector simultaneously. Algorithm 1 describes the steps of the approximation algorithm, which follows the presentation in [4].
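The FFT identity for convolution can be illustrated with a small, dependency-free sketch. The recursive radix-2 FFT below is a textbook construction rather than the code used in the experiments, and the padding rule (zero-padding both inputs to a power of two at least len(x) + len(y) - 1, which yields a linear rather than cyclic convolution) is an assumption of this sketch.

```python
import cmath


def fft(values: list[complex], inverse: bool = False) -> list[complex]:
    """Recursive radix-2 FFT; len(values) must be a power of two.
    The inverse transform is returned unscaled (caller divides by n)."""
    n = len(values)
    if n == 1:
        return list(values)
    sign = 1 if inverse else -1
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def convolve(x: list[complex], y: list[complex]) -> list[complex]:
    """Linear convolution via FFT^-1(FFT(x) * FFT(y)) with zero padding."""
    length = len(x) + len(y) - 1
    size = 1
    while size < length:
        size *= 2
    fx = fft(list(x) + [0] * (size - len(x)))
    fy = fft(list(y) + [0] * (size - len(y)))
    spectrum = [a * b for a, b in zip(fx, fy)]
    return [v / size for v in fft(spectrum, inverse=True)][:length]


result = convolve([1, 2, 3], [4, 5, 6])
print([round(v.real) for v in result])  # [4, 13, 28, 27, 18]
```

Applied to the root-of-unity sequences of the text, with one of the two inputs reversed, the same routine yields all components of one randomized score vector in a single pass.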

Algorithm 1 The Approximation Motif Detecting Algorithm

Input: A nucleotide sequence $G$, a motif pattern $F$, and a threshold $d$.
Output: Positions and names of motif hits in $G$.
Algorithm steps:
1. For $J = 1$ to $k$
  (a) from the sequence $G'$, obtain a complex sequence $T^J$ of size $(1 + \beta)M$;
  (b) from the given motif pattern $F$, obtain a complex sequence $P^J$ of size $|T^J|$ by padding with $\beta M$ zeroes;
  (c) compute the vector $S^J$ as the convolution of $T^J$ with the reverse of $P^J$.
2. Compute the vector $S' = \sum_{J=1}^{k} S^J / k$ and consider its reverse as an estimate of $S$.
3. For $q = 0$ to $N - 1$, if $s_q \ge d$, write down the position $q$ and the name of motif $F$.
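The averaging in steps 1 to 3 can be sketched as follows. This is a simplified illustration, not Algorithm 1 as published: it processes the whole sequence at once instead of chunk by chunk, it evaluates each correlation directly rather than through an FFT-based convolution, it fixes $\omega = i$, and the function names, the default k, and the seed are hypothetical.

```python
import random

ALPHABET = "AGCT"
POWERS = (1, 1j, -1, -1j)  # powers of omega = i


def estimate_scores(sequence: str, motif: str, k: int, rng: random.Random) -> list[float]:
    """Average, over k random letter mappings, the correlation of the
    omega-encoded sequence with the conjugate-encoded motif."""
    n, m = len(sequence), len(motif)
    totals = [0j] * (n - m + 1)
    for _ in range(k):
        mapping = {letter: rng.randrange(4) for letter in ALPHABET}
        t = [POWERS[mapping[c]] for c in sequence]       # omega**t_i
        p = [POWERS[-mapping[c] % 4] for c in motif]     # omega**(-p_i)
        for i in range(n - m + 1):
            # Direct correlation; an FFT-based convolution would replace this loop.
            totals[i] += sum(t[i + j] * p[j] for j in range(m))
    return [(total / k).real for total in totals]


def approximate_hits(sequence: str, motif: str, threshold: float,
                     k: int = 20, seed: int = 0) -> list[int]:
    """Positions whose estimated score reaches the threshold."""
    scores = estimate_scores(sequence, motif, k, random.Random(seed))
    return [q for q, s in enumerate(scores) if s >= threshold]


toy = "GGATAATTATCCATTATTATGG"  # hypothetical sequence
print(approximate_hits(toy, "ATAATTAT", threshold=7))
```

An exact occurrence always receives the full score M, because every term equals $\omega^0 = 1$ regardless of the mapping; positions with mismatches receive a noisy estimate whose spread shrinks as k grows.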

The efficiency of this algorithm lies in step (c), as the convolution can be efficiently calculated using FFT. In Algorithm 1, the sequence $G$ is first partitioned into overlapping chunks of size $(1 + \beta)M$. The chunks are processed sequentially. Two adjacent chunks overlap each other by $M$ letters, and only the first $\beta M$ components of the score vector calculated from each chunk are useful. There are at most $N/(\beta M)$ chunks. The time complexity is

$$k \cdot \frac{N}{\beta M} \cdot O\big((1 + \beta)M \log((1 + \beta)M)\big) = O(kN \log M),$$

where $k$ is the number of running repetitions [4]. In our experiments, the score vector does not vary sharply with different $k$ values. We set $k = 1$, and $\beta$ is assigned a value around $M$.
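The chunking scheme can be made concrete with a short sketch. It is an illustration only: it uses exact match counts per chunk instead of the randomized convolution, takes the quantity $\beta M$ as a single hypothetical parameter `beta_m`, and assumes $\beta = 8$ for the worked count.

```python
from typing import Iterator


def chunks(sequence: str, m: int, beta_m: int) -> Iterator[tuple[int, str]]:
    """Yield (start, chunk) pairs: chunks of size beta_m + m = (1 + beta) * M,
    started every beta_m letters, so adjacent chunks overlap by M letters."""
    for start in range(0, len(sequence), beta_m):
        yield start, sequence[start:start + beta_m + m]


def chunked_scores(sequence: str, motif: str, beta_m: int) -> list[int]:
    """Exact match counts computed chunk by chunk, keeping only the first
    beta_m scores of each chunk; later windows belong to the next chunk."""
    m = len(motif)
    scores = []
    for _, chunk in chunks(sequence, m, beta_m):
        usable = min(beta_m, len(chunk) - m + 1)
        for offset in range(max(usable, 0)):
            window = chunk[offset:offset + m]
            scores.append(sum(a == b for a, b in zip(window, motif)))
    return scores


# With N = 1752 and M = 8, and beta = 8 (assumed), chunks have
# (1 + 8) * 8 = 72 letters and start every 64 letters.
n_chunks = sum(1 for _ in chunks("A" * 1752, m=8, beta_m=64))
print(n_chunks)  # 28
```

Stitching the kept scores of consecutive chunks reproduces the score vector of the whole sequence, which is why each chunk can be processed independently with a short transform.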

3. Discovering Frequent Motif Combinations

The availability of transcription factor binding sites in the promoter region of a gene and the combinatorial regulation by a set of transcription factors (TFs) are critical to the expression of the gene. Different combinations of motifs involve different binding mechanisms. To focus on statistically significant combinations of motifs, we only take frequent motif combinations into account. We use Apriori, an algorithm for discovering frequent itemsets [5], to identify the frequent combinatorial patterns of motifs in gene promoter regions. The pseudo-code of the Apriori algorithm is omitted due to space limitations. Note that the Apriori algorithm determines frequent motif combinations from candidates using a minimum support: candidate motif combinations with support greater than or equal to the minimum value are considered frequent motif combinations.
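As a rough illustration of the level-wise idea behind Apriori, the following minimal sketch treats each promoter region as a set of motif names. It is not the software used in the experiments; the function name and the four promoter regions are hypothetical, and only the motif names are taken from the text.

```python
from itertools import combinations


def apriori(transactions: list[set[str]], min_support: float) -> dict[frozenset, float]:
    """Return every itemset whose support (fraction of transactions that
    contain it) is at least min_support, grown one item at a time."""
    baskets = [frozenset(t) for t in transactions]

    def support(itemset: frozenset) -> float:
        return sum(itemset <= b for b in baskets) / len(baskets)

    items = sorted({item for b in baskets for item in b})
    current = [s for s in (frozenset([i]) for i in items) if support(s) >= min_support]
    frequent: dict[frozenset, float] = {}
    size = 1
    while current:
        frequent.update((s, support(s)) for s in current)
        size += 1
        previous = set(current)
        # Join step: unions of frequent sets that have exactly `size` items.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, size - 1))
            and support(c) >= min_support
        ]
    return frequent


# Hypothetical promoter regions, each listed as the set of motifs it contains.
promoters = [
    {"Motif.N38.8.21", "Motif.P7.8.8", "Motif.P2.8.14"},
    {"Motif.N38.8.21", "Motif.P7.8.8"},
    {"Motif.N38.8.21", "Motif.P2.8.14"},
    {"Motif.P7.8.8"},
]
result = apriori(promoters, min_support=0.5)
print(result[frozenset({"Motif.N38.8.21", "Motif.P7.8.8"})])  # 0.5
```

The prune step is what keeps the search tractable: a 3-motif-set is only counted if all of its 2-motif subsets already met the minimum support.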

4. Discovering Frequent Sequential Patterns of Motifs

Besides combinations of motifs, additional sequence features are increasingly used in various studies. For example, a DNA sequence may have a structural motif associated with the production of an enzyme and a regulator motif that regulates the former. The regulator motif is always related to induction and suppression. Therefore, the order in which motifs occur is essential for understanding gene expression. We apply the sequential pattern mining technique to identify and analyze frequent sequences of motifs. Given a set of sequences, we try to discover the frequent combinations of motifs with the same relative order. We use the Generalized Sequential Pattern (GSP) algorithm [6] in our experiments. Since we do not consider overlaps of motifs, every element in a sequence contains only one motif.
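The notion of support for an ordered motif pattern can be sketched as below. This covers only the support-counting step, not the candidate generation of GSP itself; the function names, the shortened motif labels, and the four motif sequences are hypothetical.

```python
def occurs_in_order(sequence: list[str], pattern: list[str]) -> bool:
    """True if the motifs of pattern appear in sequence in the same relative
    order (not necessarily adjacent)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched motif,
    # so later motifs must be found after earlier ones.
    return all(motif in remaining for motif in pattern)


def sequence_support(sequences: list[list[str]], pattern: list[str]) -> float:
    """Fraction of sequences that contain the pattern in order."""
    return sum(occurs_in_order(s, pattern) for s in sequences) / len(sequences)


# Hypothetical upstream regions, each written as its motifs in positional order.
upstream = [
    ["N38", "P7"],
    ["N38", "P2", "P7"],
    ["P7", "N38"],
    ["P2"],
]
print(sequence_support(upstream, ["N38", "P7"]))  # 0.5
print(sequence_support(upstream, ["P7", "N38"]))  # 0.25
```

Unlike the itemset case, a pattern and its reverse are counted separately, so the two supports can differ.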

5. Experiment

Dataset Description: We have sequences from the upstream regions of all the genes from three recently sequenced genomes: Plasmodium falciparum, Plasmodium berghei, and Plasmodium yoelii. We obtained putative transcription binding sites from PlasmoTFBM [1]. A motif in PlasmoTFBM is provided as a consensus sequence. The motifs could also have been provided in the form of profiles. A whole-genome synteny map (showing orthologous genes) of the three species of Plasmodium is also available.

Discovering Motifs in Ortholog Genes using the Approximation Algorithm: The approximation algorithm estimates the score vector of approximate matches between a given nucleotide sequence of N bases and a motif of length M in time O(N log M). Figure 1 compares the motif discovery results obtained by the sliding window with those obtained by the approximation approach on a sample nucleotide sequence of length 1752, taken from the upstream region of gene PY02536 of Plasmodium yoelii.

The sample motif used for the example is named Motif.N38.8.21 and has the consensus sequence "ATAATTAT". The upper panel of Figure 1 shows the score vector calculated using the sliding window technique. The score is based on the number of matching bases, and the positions with 7 or more matches are marked. The lower panel shows the score vector obtained using the approximation algorithm. The threshold is set to 0.875, which is 7 matching bases divided by the motif length of 8. The places marked with "∗" in these two subfigures designate qualified positions where the score is no less than the threshold. In Figure 1, the approximation algorithm identified the same number of positions as the sliding window technique.

[Figure 1: two panels plotting Score (Y) against Position (X), with positions from 0 to 1800. Upper panel: "Discover Motifs in a Gene Sequence by Sliding-window". Lower panel: "Discover Motifs in a Gene Sequence by the Approximation Algorithm".]

Figure 1. Comparisons of motif discovery results. The X axis indicates positions in the upstream region of gene PY02536. Y is the corresponding score of the motif Motif.N38.8.21 at various positions in the sequence.

Our experiment shows that the sliding window technique identifies a certain motif in 4000 genes in about 48.7 minutes, whereas the randomized algorithm takes just 7 minutes when parameter k is set to 1 and β has the same value as the length of the motif. The gaps between the exact positions found by the sliding window technique and the approximate ones found by the approximation algorithm are small compared with the length of the gene sequence. The average deviation of the positions found by the approximation algorithm from those of the sliding window is about 0.015 (gap/|gene|) when parameter k is set to 1. We tested different k values (e.g., 1, 2, ..., 10), and the average deviation stays almost the same. A detailed theoretical analysis of the variance of approximate score vectors can be found in [4]. In our experiment, we found that the approximation algorithm has a useful property that makes it applicable in practice: it has over 90% probability of discovering the motif when a hit occurs, and it does not report a hit when one does not occur. Due to this property, the motif combinatorial patterns will not be affected by using the approximation algorithm. However, the approximate motif positions by themselves can lead to incorrect answers. This problem can be easily solved by combining the approximation algorithm and the sliding window technique in a two-stage approach: the approximation algorithm is first used to find all approximate positions of the motif, and the sliding window technique is then applied to identify the exact positions in the vicinity of the approximate positions.
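The second stage of the two-stage approach can be sketched as a local sliding-window search around each approximate position. This is a minimal illustration: the function name, the `radius` parameter (the neighborhood size, which footnote 1 says is chosen empirically), and the toy inputs are hypothetical.

```python
def refine_hits(sequence: str, motif: str, approx_positions: list[int],
                radius: int, threshold: int) -> list[int]:
    """Re-score, with an exact match count, only the windows that start
    within `radius` bases of an approximate position."""
    m = len(motif)
    exact = set()
    for pos in approx_positions:
        lo = max(0, pos - radius)
        hi = min(len(sequence) - m, pos + radius)
        for i in range(lo, hi + 1):
            score = sum(a == b for a, b in zip(sequence[i:i + m], motif))
            if score >= threshold:
                exact.add(i)
    return sorted(exact)


toy = "GGATAATTATCCATTATTATGG"  # hypothetical sequence
# Suppose the first stage reported positions 3 and 11, each off by one base.
print(refine_hits(toy, "ATAATTAT", [3, 11], radius=2, threshold=7))  # [2, 12]
```

Because only a few windows around each candidate are re-scored, the exact check adds little to the cost of the approximate pass.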

Discovering Frequent Motif Combinations and Sequential Patterns: Due to the large number of discovered motifs, we only consider conserved motifs that are present in all three Plasmodium species with high scores. By first identifying the motif hits as specified above, we obtained 214 conserved motifs. There were 1581 genes whose upstream regions contained these conserved motifs in all three organisms. We used the Apriori software (see footnote 3) to mine frequent motif combinations. Figure 2 illustrates the number of frequent combination sets of motifs under various minimum support thresholds. The higher the number of upstream regions in which the motifs appear, the more likely they are to correspond to genuine transcription binding sites. Moreover, the motif combinations with higher support values are more likely to regulate transcription and affect gene expression. For example, Motif.N38.8.21 and Motif.P7.8.8 (consensus sequence: "ATTATTAT") appear together most frequently among the 2-motif-sets, with support 3.1%. Motif.N38.8.21, Motif.P7.8.8, and Motif.P2.8.14 (consensus sequence: "AATAAATT") are among the most frequent 3-motif-sets, with support 0.6%. Motifs that occur together more often are more likely to regulate each other and thereby affect gene expression and transcription.

[Figure 2: plot titled "Frequent Motif Sets to Various Min-support". X axis: Min-support (%), from 0 to 10. Y axis: Number of Frequent Motif Sets, from 0 to 200.]

Figure 2. The number of frequent motif combinations under different minimum supports.

The relative order of the motifs is worth investigating in order to maximize the degree to which the sequence features and their complex inter-dependencies explain the expression patterns. In our experiment, we found many frequent 2-motif-sequences, but very few frequent 3-motif-sequences, in P. falciparum. For instance, the most frequent sequence in P. falciparum is Motif.N38.8.21 (consensus sequence: ATAATTAT) → Motif.P7.8.8 (consensus sequence: ATTATTAT) with 2.34% support. Its reverse sequence Motif.P7.8.8 → Motif.N38.8.21 has 0.76% support. This means that Motif.N38.8.21 usually occurs before Motif.P7.8.8. The different orders in which motifs occur are essential for learning more about the different functions they have and how they regulate each other.

3 Downloaded from http://fuzzy.cs.uni-magdeburg.de/~borgelt/doc/apriori/apriori.html

6. Conclusion

In this paper, we study the problems of discovering motif hits. We propose an efficient randomized approximation algorithm to identify motifs. The approximation method can be easily combined with the sliding-window technique for efficient and accurate discovery of motif hits. We also apply data mining techniques to mine frequent motif combinations and ordered combinations of motifs. Experiments were conducted to evaluate and validate our methods.

References

[1] C. Yang, E. Zeng, K. Mathee, and G. Narasimhan. Mining regulatory elements in the Plasmodium falciparum genome using gene expression data. In CAMDA’04, pages 16–20, 2004.

[2] D. Knuth. The Art of Computer Programming, volume 2 of Series in Computer Science and Information Processing. Addison-Wesley, Reading, MA, second edition, 1981.

[3] Mohamed G. Elfeky, Walid G. Aref, and Ahmed K. Elmagarmid. Periodicity detection in time series databases. IEEE Transactions on Knowledge and Data Engineering, 17(7):875–887, July 2005.

[4] Mikhail J. Atallah, Frederic Chyzak, and Philippe Dumas. A randomized algorithm for approximate string matching. Algorithmica, 29:468–486, 2001.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB’94, 1994.

[6] R. Agrawal and R. Srikant. Mining sequential patterns. In ICDE’95, pages 3–14, 1995.

[7] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
