
FAST SEARCH IN DNA SEQUENCE DATABASES USING PUNCTUATION AND INDEXING

Yi Lu¹, Shiyong Lu¹, Jeffrey L. Ram²

¹ Department of Computer Science, Wayne State University, Detroit, MI 48202, {luyi, shiyong}@wayne.edu
² Department of Physiology, Wayne State University, Detroit, MI 48201, U.S.A, [email protected]

ABSTRACT

Exact pattern searching in DNA sequence databases has applications in the identification of highly conserved regulatory sequences, the design of hybridization probes, and improving the performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, Compressed-Punctuated-Boyer-Moore (cp-BM), to enhance exact pattern match searches of DNA sequences. cp-BM uses two bits to represent each A, T, C, G character (4-character, 8-bit (4C8B) compression), plus punctuator characters that indicate unambiguously the encoding frame of the compressed target sequence, thereby solving the misalignment problem in searching patterns with ordinary 4C8B compression. cp-BM searches DNA patterns at least 6 times faster than AGREP for pattern lengths ≥ 128 and between 2-fold and 5-fold faster than d-BM for all pattern lengths. cp-BM’s performance is enhanced by punctuator indexing and multiple punctuators, especially for short sequences, yielding greater than 10-fold enhancements compared to d-BM and AGREP. In addition, cp-BM outperformed BLAT for sequences 64 or more bases in length, and was more than three-fold faster for 256 base sequences.

KEY WORDS

Algorithm, DNA Sequence, Search, Punctuator, Indexing

1. Introduction

String pattern searching is a fundamental problem in many applications. The basic problem is to find all occurrences of a pattern string P in a text string T. An interesting application of string searching is to find patterns in DNA sequence databases, whose sequences consist of four characters: A, C, G and T. Although approximate pattern matching is frequently used to find related sequences in diverse databases, exact pattern matching also has many bioinformatics applications, such as intraspecies and close taxonomic comparisons (e.g., among primates, at the taxonomic order level) and the search for large mobile genetic elements. In non-coding regions, long stretches of conserved bases may be indicative of important regulatory elements or other unknown functions. Another application is in the design of hybridization probes for large genomes, where rapid searches for exact patterns can assist in avoiding multiply occurring probe targets. Another potential application is to enhance the performance of approximate homology searching tools such as BLAST [1] and BLAT [2], as these tools use exact matches to find hits prior to a complete sequence alignment.

In the exact pattern matching literature, various good solutions have been presented. One of the most efficient methods is the Boyer-Moore algorithm (BM) [3], developed by R. S. Boyer and J. S. Moore. Amir and Benson [4] further tried to speed the searching process in the compression domain. Compression of sequences having simple alphabets into shorter sequences with larger alphabets is expected to have advantages not only in BM search time but also in storage and input-output efficiency, owing to the shorter lengths of P and T in the compression domain. Various compression methods have been extensively studied [4-8]. Among statistical compression methods, searches of Huffman encoded files [5] run faster than Aho-Corasick [6] by the same factor as the compression ratio. For dictionary-based compression, Shibata [7] used byte-pair encoding (BPE) [9]. With a combined BM and BPE algorithm, DNA string matching runs about three times faster than AGREP [10]. DeMoura et al. [11] use Huffman coding on words, which runs twice as fast as AGREP; however, the compression method is not applicable to DNA sequences.

d-BM is a recent BM application in which DNA sequences encoded by 4-character, 8-bit (4C8B) compression are searched by BM while still compressed [12]. In d-BM’s compression, A, C, G, and T are each encoded with 2 bits, “00” for A, “01” for C, “10” for G, and “11” for T, and codes for 4 contiguous characters are combined in a single byte [12]. 4C8B compression increases the alphabet size of encoded DNA sequences from 4 to 256, and this increase of alphabet size underlies the speed enhancement of d-BM [12]. However, as illustrated in Figure 1, the frame boundaries of each set of 4 characters in the target sequence, T, might not be aligned with the input sequence, P. Since P can be compressed in any of 4 alignments, the d-BM algorithm searches for P in T in each of the 4 alignments. The enhanced performance of d-BM relies on the positive effect of increased alphabet size overcoming the negative influence of searching 4 times. By introducing the new concept of “punctuator” in this paper, we are able to reduce the number of searches up to 4-fold compared to d-BM.

Sequence T:          C  A  C  G  T  A  C  T  G  T  G  C  A  T  T
Bit code:            01 00 01 10 11 00 01 11 10 11 10 01 00 11 11
Compressed T values (without punctuation):
Byte code, Frame 1:  70 199 185 (00 11 11)
Byte code, Frame 2:  (01) 27 30 228 (11 11)
Byte code, Frame 3:  (01 00) 108 123 147 (11)
Byte code, Frame 4:  (01 00 01) 177 238 79

Figure 2 An example of different encoding of T with different alignments of compression frame
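As a rough illustration of 4C8B compression and of the frame problem shown in Figure 2, the following Python sketch packs a sequence at each of the four possible frame offsets. Only the 2-bit codes are taken from the text; the function names `pack4` and `encode_frame` are illustrative and are not part of d-BM [12]. Bases that fall outside a complete byte are returned unpacked, which is an assumption of this sketch.

```python
# 2-bit codes given in the text: A=00, C=01, G=10, T=11
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack4(chars):
    """Pack exactly four bases into one byte value, first base in the high bits."""
    value = 0
    for ch in chars:
        value = (value << 2) | CODE[ch]
    return value


def encode_frame(seq, offset):
    """Return (leading bases, byte values, trailing bases) for a frame offset 0..3.

    Hypothetical helper: bases outside a complete 4-base group are left unpacked.
    """
    end = offset + (len(seq) - offset) // 4 * 4
    body = [pack4(seq[i:i + 4]) for i in range(offset, end, 4)]
    return seq[:offset], body, seq[end:]
```

Calling `encode_frame("CACGTACTGTGCATT", k)` for k = 0 to 3 gives the byte values listed for Frames 1 to 4 in Figure 2, which makes it easy to see that the same bases yield entirely different bytes in each frame.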

2. Method

The problem of multiple searches of misaligned compressed patterns can be resolved if an unambiguous “start” character (punctuator) can be recognized in T so that all characters following it will be in a known alignment. If compression always begins wherever a particular sequence of 4 characters (a punctuator) is encountered, then the compression frame of all sequences after the punctuator will be completely determined. For example, in Figure 2, sequence T can be coded in 4 frames. If ACGT is selected as the punctuator, then only frame 2 can uniquely indicate the sequence T. Any pattern P that contains ACGT can be unambiguously aligned in T at the ACGT first and then be matched for the remaining characters in P.

Since punctuator sequences are likely to occur in longer sequences more than one time and will not always be aligned in the same frame, the use of a punctuator necessitates special encoding of the “out of frame” nucleotides immediately preceding each punctuator. In the above example, “ACGT” is always converted to the byte 27. Since the preceding partial frame can have either 0, 1, 2, or 3 characters in it, a special “pre-punctuator” byte is formed that is always inserted just prior to the punctuator and identifies the number of characters (first two bits), and the identity of the characters (xx), using 2, 4, or 6 of the remaining bits, depending on whether 1, 2, or 3 characters are present. Thus, the “pre-punctuator” byte can have only the values illustrated in Figure 3. Besides defining the byte preceding a punctuator as a special byte, the last byte of T can also be encoded as a special byte following the same rule.

▪ 00 00 00 00: 0 character before the punctuator.
▪ 01 xx 00 00: 1 character before the punctuator.
▪ 10 xx xx 00: 2 characters before the punctuator.
▪ 11 xx xx xx: 3 characters before the punctuator.

Figure 3 Values for pre-punctuator bytes
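The layout in Figure 3 can be written down in a few lines of Python. This is a minimal sketch that assumes the count occupies the two high bits, the bases follow in order, and unused low bits are zero, as the figure shows; the names `pre_punctuator_byte` and `decode_pre_punctuator` are made up for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pre_punctuator_byte(chars):
    """Encode 0 to 3 out-of-frame bases as count (2 bits) + bases + zero padding."""
    n = len(chars)
    if n > 3:
        raise ValueError("a pre-punctuator byte holds at most 3 bases")
    value = n
    for ch in chars:
        value = (value << 2) | CODE[ch]
    return value << (2 * (3 - n))  # pad unused low bits with zeros


def decode_pre_punctuator(byte):
    """Recover the 0 to 3 bases stored in a pre-punctuator byte."""
    n = byte >> 6                            # count is in the two high bits
    bits = (byte & 0x3F) >> (2 * (3 - n))    # drop the count and the padding
    return "".join(BASES[(bits >> (2 * (n - 1 - k))) & 3] for k in range(n))
```

For instance, `pre_punctuator_byte("TT")` yields 188 (binary 10111100) and `pre_punctuator_byte("C")` yields 80, the two special bytes that appear in the punctuated byte code of Figure 4.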

In unpunctuated compression, the translation of 4 adjacent characters in DNA into 2 bits each guarantees the uniqueness of each translated byte. However, with pre-punctuators, some translated bytes are no longer unique: e.g., without punctuation “10111100” always represents the characters “GTTA”, but if this byte occurs before a punctuator in punctuated compression, it indicates that there are 2 characters (“TT”) prior to the punctuator, as shown in Figure 4. With punctuated compression, every byte beginning with “11” can have two meanings, representing either 4 characters or 3 characters preceding a punctuator, but as shown in Figure 3, many other bytes still retain their unique meanings. For example, all bytes beginning with “00” have a unique meaning except for “00000000” when it occurs before a punctuator.

2.1 Punctuator Selection

Sequences that could also be used for pre-punctuator bytes are NOT unique and therefore cannot be used as punctuators. This criterion can be translated to Rule 1:

Rule 1: Exclude from consideration as punctuators any string of four characters that are encoded as 00 00 00 00, 01 xx 00 00, 10 xx xx 00 and 11 xx xx xx.

Rule 1 leads to a subset (81 out of 256) of 4-length sequences that can be selected as punctuators. However, not all 81 4-length sequences can be used as punctuators.

Sequence T:            C  A  C  G  T  A  C  T  G  T  G  C  A  T  T  A  C  G  T  G  T  C  G  C  C  T  A
Bit code:              01 00 01 10 11 00 01 11 10 11 10 01 00 11 11 00 01 10 11 10 11 01 10 01 01 11 00
Punctuated byte code:  80 27 30 228 188 27 182 92
Pre-punctuator byte:   10111100 (= 188)

Figure 4 Punctuated compression of a sequence in which the punctuator ACGT appears more than once

DNA sequence T:   G C T A C T T T G G A T G C T X
Input pattern P:      T A C T T T G G A X X X

Figure 1 Exact pattern search: Input pattern P can be searched for in the target sequence, T. This illustrates how groups of 4 bases in P may not align with groups of the identical sequence in T.

A second type of sequence that should not be used as a punctuator is one whose alignment might be uncertain because of imprecise framing when considered together with neighboring characters. For example, “GGGG” can be selected as a punctuator according to Rule 1. However, its position would be uncertain if it were selected as the punctuator and a “G” were adjacent to these characters. In that case, it would be ambiguous whether to begin the alignment at the first “G” or the second “G”. Similarly, a sequence with the same character in the first and fourth positions can be duplicated with a partial sequence of neighboring characters, leading to ambiguous positioning (e.g., where should “GxyG” begin in the sequence “GxyGxyG”?). Thus, we have Rule 2 as follows:

Rule 2: Exclude from consideration as punctuators any string S of four characters in which $S_{i..j} = S_{i+k..j+k}$ ($1 \le i \le j \le 3$, $k = 4 - j$).

Simply put, punctuator strings in which any prefix of the string coincides with any suffix of the string are excluded.
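Both rules can be checked mechanically. The sketch below is one possible reading, not the authors' implementation: Rule 1 is tested by asking whether the 4C8B byte of a candidate matches one of the pre-punctuator layouts of Figure 3, and Rule 2 follows the plain-language statement above (some prefix equals the suffix of the same length). All function names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack4(chars):
    """Pack four bases into one byte value (A=00, C=01, G=10, T=11)."""
    value = 0
    for ch in chars:
        value = (value << 2) | CODE[ch]
    return value


def is_pre_punctuator_value(byte):
    """Rule 1 test: does the byte fit a pre-punctuator layout from Figure 3?"""
    n = byte >> 6                  # number of bases the byte would claim to hold
    pad_bits = 2 * (3 - n)         # low bits that must be zero in that layout
    return (byte & ((1 << pad_bits) - 1)) == 0


def has_border(s):
    """Rule 2 test (as read here): a proper prefix equals the same-length suffix."""
    return any(s[:k] == s[-k:] for k in range(1, len(s)))


def is_valid_punctuator(s):
    """True if the 4-mer passes both rules under the assumptions above."""
    return not is_pre_punctuator_value(pack4(s)) and not has_border(s)
```

With these definitions “ACGT” is accepted, “GGGG” passes Rule 1 but is rejected by Rule 2, and any 4-mer starting with T is rejected by Rule 1 because its byte begins with “11”.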

2.2 Encoding DNA sequences

To obtain the punctuated compressed target sequence T’, the original sequence T is first divided into three sub-sequences: the prefix Tp, the sequence before the first punctuator; the core Tc, the sequence beginning with the first punctuator; and the suffix Ts, the final 0 to 3 characters that do not make a complete 4-mer at the end of T. Tc is compressed to the sub-pattern Tc’, in which the first byte is the punctuator and is followed by 4C8B compression until the next punctuator is encountered, looking ahead up to 7 characters to ascertain the position of the next punctuator string. Upon encountering each punctuator string, the preceding 0 to 3 characters, as determined by the frame position of the previous punctuator, are encoded in a pre-punctuator byte. The prefix Tp is compressed from its first character with 4C8B compression until the first punctuator string is encountered. Any “out of frame” characters immediately preceding the first punctuator are then encoded with the previously described “pre-punctuator” code, or with “00000000” if there are no out of frame characters. Finally, the compressed sequence ends with a single suffix byte, Ts’, encoding the last 0 to 3 characters according to the same algorithm as the pre-punctuator sequences.
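The encoding procedure for T can be sketched as a single left-to-right scan. The Python below is an illustration under stated assumptions, not the authors' code: it handles one punctuator, always emits a pre-punctuator byte before each punctuator (00000000 when no bases are out of frame), and returns the suffix byte separately from the preceding bytes. The names `special_byte` and `punctuated_compress` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack4(chars):
    """Ordinary 4C8B byte for four bases."""
    value = 0
    for ch in chars:
        value = (value << 2) | CODE[ch]
    return value


def special_byte(chars):
    """Pre-punctuator / suffix byte: 2-bit count, then 0 to 3 bases, zero padded."""
    n = len(chars)
    value = n
    for ch in chars:
        value = (value << 2) | CODE[ch]
    return value << (2 * (3 - n))


def punctuated_compress(t, punct="ACGT"):
    """Return (bytes before the suffix, suffix byte) for sequence t."""
    body = []
    pos = 0
    while True:
        # Look ahead up to 7 bases: does a punctuator start 0..3 bases from here?
        k = next((k for k in range(4) if t[pos + k:pos + k + 4] == punct), None)
        if k is not None:
            body.append(special_byte(t[pos:pos + k]))  # out-of-frame bases
            body.append(pack4(punct))                  # the punctuator itself
            pos += k + 4
        elif len(t) - pos >= 4:
            body.append(pack4(t[pos:pos + 4]))         # ordinary in-frame byte
            pos += 4
        else:
            return body, special_byte(t[pos:])         # last 0 to 3 bases
```

Applied to the sequence of Figure 4, `punctuated_compress("CACGTACTGTGCATTACGTGTCGCCTA")` returns the bytes 80, 27, 30, 228, 188, 27, 182, 92 listed in the figure, together with a suffix byte of 0 because no bases are left over.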

The search pattern, P, is similarly compressed into P’ prior to searching in T’, except that additional encoding is needed for Pp, the part of the sequence that precedes the first punctuator. Thus, P is scanned to find the punctuator string, if present, dividing the resultant compressed sequence into three segments: Pp, the prefix sequence before the first punctuator; Pc, the core sequence whose compression frame is fixed by the presence of the punctuator; and Ps, a final byte encoding the terminal 0 to 3 characters of P. If no punctuator is present in P, the entire sequence is encoded as a prefix sequence, Pp.

Since the alignment of Pp in T cannot be known before the search, Pp is processed into 4 compressed sub-patterns, Ppi’ (0 ≤ i ≤ 3), reflecting the 4 possible alignments. Each compressed Ppi’ has a leading byte with 0 to 3 characters and information about the number of characters represented in the same format as a pre-punctuator byte; a set of 4C8B core bytes; and a final byte encoding the terminal i characters and the number i in pre-punctuator format, indicating the number of characters prior to the punctuator in Pc. For example, with “ACGT” as the punctuator in the pattern “TGCATTACGTGTCGC”, Pp would be “TGCATT”, and the four compressed prefix patterns Ppi’ would be:

• Pp0’ = 10111000 01001111 00000000,
• Pp1’ = 01110000 10010011 01110000,
• Pp2’ = 00000000 11100100 10111100, and
• Pp3’ = 11111001 11001111.

2.3 Amount of compression of the punctuated compressed sequence

Without punctuation, 4C8B compression of a DNA sequence T of length m has m/4 bytes. With punctuation, the core sequence, Tc, needs only a small number of additional bytes. For each punctuator, a pre-punctuator byte that codes 0 to 3 characters (on average, 1.5 characters) is needed. Since any particular 4-mer begins in T approximately 1 in every 256 characters, we expect to need on average m/4 + m/(256 × 1.5) bytes for punctuated compression. Tp is fully compressed with 4C8B compression throughout its length, except for one byte encoding 0 to 3 characters just before the first punctuator. For a long T sequence, the one extra byte in Tp’ represents a vanishingly small proportion of the total sequence. Similarly, the suffix Ts adds only one byte to the length of T’.

3. Compressed-Punctuated Boyer-Moore (cp-BM) Search Algorithm

3.1 Simple cp-BM (one punctuator)

cp-BM is designed to search for P in punctuated compressed sequences. If no punctuator is present in P, the search defaults to d-BM [12]. Since checking whether a punctuator is present in P can be done in O(1) time, preprocessing in cp-BM takes no more time than in d-BM. If a punctuator is present, cp-BM searches for Pc’ in T’ with BM, followed by further checking of prefixes and suffixes. The search process is carried out in three steps:

(1) Search for the occurrence of Pc’ in T’ using the BM algorithm.
(2) If Pc’ is found, compare the decoded Ps’ characters to the same number of characters to the right of Pc’ in T’.
(3) If Ps’ is matched, compare the compressed prefix sub-pattern Ppi’ (0≤i≤3) with the target sequence characters to the left of Pc’ in T’.

The BM algorithm used in the first step employs 3 clever ideas [13]: right-to-left scan, the bad character shift rule, and the good suffix shift rule, enabling BM to produce large jumps through T and dramatically reduce the number of local sequence comparisons of P to T. As a result, BM applications typically run in “sublinear” time for sufficiently large alphabets and sufficiently long patterns. We give details here on our implementation of BM for cp-BM with single punctuators, followed by additional enhancements, such as indexing and multiple punctuators, that punctuated compression makes possible.
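To make step (1) concrete, here is a simplified byte-level search in Python. It is the Horspool variant, which keeps only a bad-character shift and omits the good suffix rule, so it is a stand-in for the full BM search described above rather than the implementation used in cp-BM. The function name `horspool_search` is illustrative.

```python
def horspool_search(text, pattern):
    """Return all start offsets of pattern in text (both are bytes objects).

    Simplified Boyer-Moore-Horspool: the shift depends only on the text byte
    aligned with the last pattern byte (a bad-character style rule).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each byte (excluding the final
    # pattern position) to the end of the pattern.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Bytes absent from the pattern allow a jump of the full pattern length.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

With a 256-symbol alphabet most text bytes do not occur in a short pattern, so the default jump of the full pattern length is taken often, which is the effect of the larger alphabet that the compression is meant to exploit.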

The rate-limiting process in cp-BM is step 1. Step 2 compares only a single byte after a simple decoding step. Step 3 is straightforward in that the bytes of Ppi’ must match the corresponding bytes of T’ one by one. The trick here is selecting which Ppi’ should be compared to T’. It is not necessary to compare all four Ppi’ with T’, because the pre-punctuator byte before the punctuator in T’ indicates the i of the corresponding Ppi’ to match.

For example, consider searching for the pattern P = “TGCATTACGTGTCGC” in T = “CACGTACTGTGCATTACGTGTCGCCTA”, as shown in Figure 5. Pc is encoded by the bytes 27, 182, which are readily found in T’. Ps’ decodes to the single character C, which matches the character encoded by the first two bits of the following byte in T’. Finally, the byte 188 just to the left of Pc’ in T’ indicates that the Ppi’ to be compared is Pp2’, which has a leading byte of 0 followed by the bytes 228 and 188, identifying an exact match for the entire sequence P.
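The prefix check in this example can be traced with a short Python sketch. It builds the four alignments Ppi’ of a prefix and uses the count stored in the pre-punctuator byte of T’ to pick one of them. This is an illustration only: it assumes a prefix of at least 3 bases, and it compares only the bytes after the leading byte, since a leading byte holding 1 to 3 bases would need a partial-byte comparison that is not shown. All function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack4(chars):
    """Ordinary 4C8B byte for four bases."""
    value = 0
    for ch in chars:
        value = (value << 2) | CODE[ch]
    return value


def special_byte(chars):
    """Pre-punctuator format: 2-bit count, then 0 to 3 bases, zero padded."""
    n = len(chars)
    value = n
    for ch in chars:
        value = (value << 2) | CODE[ch]
    return value << (2 * (3 - n))


def prefix_alignments(pp):
    """Return [Pp0', Pp1', Pp2', Pp3'] as lists of byte values.

    Index i is the number of bases held in the final byte of the alignment.
    """
    if len(pp) < 3:
        raise ValueError("this sketch assumes a prefix of at least 3 bases")
    out = []
    for i in range(4):
        core_end = len(pp) - i          # bases after this go in the final byte
        lead = core_end % 4             # bases that go in the leading byte
        core = [pack4(pp[j:j + 4]) for j in range(lead, core_end, 4)]
        out.append([special_byte(pp[:lead])] + core + [special_byte(pp[core_end:])])
    return out


def select_prefix(t_bytes, pc_index, pp):
    """Pick the alignment named by the pre-punctuator byte left of Pc' in T'."""
    i = t_bytes[pc_index - 1] >> 6      # count is stored in the two high bits
    return i, prefix_alignments(pp)[i]


def prefix_tail_matches(t_bytes, pc_index, pp):
    """Compare the core and final bytes of the selected alignment with T'."""
    _, candidate = select_prefix(t_bytes, pc_index, pp)
    tail = candidate[1:]                # leading byte is not compared here
    start = pc_index - len(tail)
    return start >= 0 and t_bytes[start:pc_index] == tail
```

For the T’ bytes of Figure 5, with Pc’ found at byte offset 5, `select_prefix` returns i = 2 and the bytes 0, 228, 188, matching the Pp2’ chosen in the text; since the leading byte is 0, the tail comparison is the whole check in this case.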

3.2 Indexing and multiple punctuators to enhance performance of cp-BM

3.2.1 Increasing the number of punctuators

The probability of a punctuator occurring at each point in the sequence is around 1/256. Therefore, the punctuator may not occur in a short P (e.g., length <256), and cp-BM will default to the slower d-BM processing. For longer patterns, even if a punctuator is present in P, it may not occur until near the end of the pattern. By introducing more punctuators, the probability that a punctuator occurs in P and/or at the beginning of P is increased. With 4 punctuators, one punctuator is likely to occur in P within 64 nucleotides of the beginning. Multiple punctuators can be defined with some limitations: (1) Each punctuator causes a further decrease in the overall compression, and (2) the suffix of one punctuator cannot be identical to the prefix of another punctuator, as it would then be more difficult to identify each punctuator unambiguously. Thus, “ACGT” and “GTAC” could not be used together as punctuators; whereas, “ACGT” and “ACTG” could be. One possible set of 16 punctuators is: {ACAT, CAAT, ACCT, CACT, ACGT, CAGT, ACTT, CATT, ACAG, CAAG, ACCG, CACG, ACGG, CAGG, ACTG, CATG}.
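Limitation (2) is easy to test for a candidate set. The following Python sketch checks every ordered pair of distinct punctuators for a suffix of one that equals a prefix of the other, considering overlaps of 1 to 3 characters; the function names are illustrative and the check covers only this one limitation.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b):
    """True if a suffix of a (1 to 3 characters) equals a prefix of b."""
    return any(a[-k:] == b[:k] for k in range(1, 4))


def compatible(punctuators):
    """True if no punctuator's suffix equals another punctuator's prefix."""
    return not any(suffix_prefix_overlap(a, b)
                   for a, b in permutations(punctuators, 2))
```

For example, `compatible(["ACGT", "GTAC"])` is False because “GT” ends the first and begins the second, while `compatible(["ACGT", "ACTG"])` is True, in agreement with the two cases given in the text.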

3.2.2 Indexing punctuators

With the BM algorithm, the largest shift that the bad character and good suffix rules can generate is the length of the compressed pattern. As a result, BM performance decreases drastically for short sequences. For example, the d-BM algorithm lost its advantage over AGREP for sequences less than 50 characters in length [12]. However, the average distance between two locations of a punctuator is about 256 characters in T or 64 bytes in T’. If the locations of punctuators in T’ are indexed, the distance to the next punctuator can be compared to the shifts produced by the bad character and good suffix rules, and the search can be shifted to the next punctuator if the punctuator distance is larger. Especially for short sequences, where BM performance degrades, shifting according to punctuator index positions should speed the search.

Search pattern P:            T  G  C  A  T  T  A  C  G  T  G  T  C  G  C
Bit code for P:              11 10 01 00 11 11 00 01 10 11 10 11 01 10 01
Punctuated byte code for P:  (see text for Pp choice) 27 182 01010000 = 80
Punctuated byte code for T:  80 27 30 228 188 27 182 92
Bit code for T:              01 00 01 10 11 00 01 11 10 11 10 01 00 11 11 00 01 10 11 10 11 01 10 01 01 11 00
Sequence T:                  C  A  C  G  T  A  C  T  G  T  G  C  A  T  T  A  C  G  T  G  T  C  G  C  C  T  A

Figure 5 Example of how a search for P in T would be aligned in the punctuated compressed domain

4. Results

Experiments were run on an Intel Pentium IV 2.24 GHz machine with 512 megabytes of memory running Linux RedHat 9.0. The DNA files tested included 400 megabases randomly generated by an in-house program, as well as the 285 megabase DNA sequence of human chromosome 1 downloaded from GenBank.

4.1 Comparison of cp-BM to d-BM and AGREP

Test patterns were generated by a program that randomly selects a substring of a given length from a source string. cp-BM was tested with 30 patterns of length M, with M ranging from 16 to 2048, and user times were analyzed. As shown in Figure 6, with 4 punctuators (ACGT, ACTG, CAGT, CATG), cp-BM runs at least 2 times faster than d-BM. At some intermediate pattern lengths (e.g., 64 bases), the advantage of cp-BM over d-BM was greater than 4-fold. Furthermore, if punctuators are guaranteed to be included in the pattern, the running time of cp-BM exhibits almost no change as the pattern length is varied from 16 to 2048. Thus, the degradation of performance for shorter sequences in cp-BM is almost exclusively due to sequences lacking punctuators that default to d-BM processing. With guaranteed punctuators present, the running time of cp-BM is at least an order of magnitude better than d-BM for pattern lengths less than 64. For random patterns, cp-BM is better than AGREP at lengths >32, and it exceeds the performance of AGREP at all lengths when punctuators are guaranteed to be present.

Figure 6 Comparison experiment of cp-BM to d-BM and AGREP on random data (x-axis: pattern length, 16 to 2048; y-axis: user time (ms); series: d-BM (random pattern), cp-BM (random pattern), cp-BM (guaranteed punctuator), AGREP)

Figure 7 Comparison of cp-BM to BLAT on DNA sequence of Chromosome 1 downloaded from GenBank (x-axis: pattern length, 32 to 256; y-axis: user time (ms); series: BLAT, cp-BM)

4.2 Comparison of cp-BM to BLAT

cp-BM was compared to BLAT (http://www.soe.ucsc.edu/~kent/src/) using biological data. We searched human chromosome 1 sequences for 32 to 256 base fragments of 30 ESTs (ftp://ftp.ncbi.nih.gov/repository/dbEST/) known to have chromosome 1 homologs. As illustrated in Figure 7, cp-BM outperformed BLAT for sequences 64 or more bases in length, and was more than 3-fold faster for 256 base sequences. Since ESTs may include mutations or exon boundaries present in the target genome, cp-BM did not locate all of the patterns in chromosome 1, nor did BLAT with parameters set for exact match.

Table 1 Comparison of compression of 400 megabases with different numbers of punctuators.

Punctuators       Time for compression   Compressed size
1 punctuator      37.4                   25.2%
2 punctuators     37.6                   25.5%
4 punctuators     46.6                   26.0%
8 punctuators     66.0                   26.8%
16 punctuators    100.2                  28.3%

4.3 Effect of increasing the number of punctuators

Although increasing the number of punctuators is expected to increase the preprocessing time and decrease the amount of compression, the increase in compression user time is sub-linear with respect to the number of punctuators, and the compressed size is increased by only 3.2%, to 28.3%, by use of 16 punctuators (see Table 1). When the number of punctuators increases, the probability that a punctuator exists in the search pattern also increases, which leads to cp-BM being called more often than d-BM, thereby enhancing the performance (Figure 8). This has the greatest effect on searches for short patterns (e.g., 32 and 64 characters), for which the probability of finding at least one punctuator is especially increased by larger numbers of punctuators.

Figure 8 Comparison of search times with different number of punctuators (x-axis: length of search pattern, 32 to 2048; y-axis: user time (ms); series: 1, 2, 4, 8, and 16 punctuators)

4.4 Effect of position of punctuator on cp-BM algorithm speed

To investigate the effect of indexing, we generated a special set of patterns. Randomly selected patterns of length 128 were separated into four categories according to the location of the first punctuator in the pattern: less than 32, between 32 and 64, between 64 and 96, and larger than 96. As illustrated in Figure 9, both the position of the first punctuator in the pattern and the use of punctuator indexing affect search performance, demonstrating that the punctuator indexing technique does help to speed up the search process.
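The index-assisted shift described in Section 3.2.2 can be pictured with the Python sketch below. It is a hedged illustration, not the authors' implementation: it assumes a single punctuator whose byte value (27 for ACGT) marks a punctuator wherever it appears in T’, that the search window starts at the first byte of Pc’, and that the BM shift has already been computed elsewhere. The names `punctuator_index` and `next_shift` are hypothetical.

```python
from bisect import bisect_right


def punctuator_index(t_bytes, punct_byte=27):
    """Sorted byte offsets in T' at which the punctuator byte occurs."""
    return [i for i, b in enumerate(t_bytes) if b == punct_byte]


def next_shift(pos, bm_shift, index):
    """Shift to apply to a window starting at pos.

    Takes the larger of the BM shift and the distance to the next indexed
    punctuator; falls back to the BM shift when no punctuator lies ahead.
    """
    j = bisect_right(index, pos)        # first indexed offset greater than pos
    if j == len(index):
        return bm_shift
    return max(bm_shift, index[j] - pos)
```

Because Pc’ begins with the punctuator byte, a match can start only at an indexed offset, so jumping straight to the next one skips alignments that could never match. For a short pattern whose BM shift is at most a few bytes, the average gap of about 64 bytes between punctuators in T’ is the larger of the two shifts most of the time.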

Figure 9 Comparison experiment of indexed and non-indexed cp-BM

5. Conclusion

In this paper, we propose cp-BM to speed up exact match searches in DNA sequences. cp-BM enhances performance over d-BM by incorporating punctuated compression and indexing into the BM search algorithm. In contrast to d-BM, whose performance decreases for short sequences, cp-BM speeds up search processing on both short and long patterns. Building on the idea of using punctuators in compressing sequences to enable further speed enhancements of cp-BM, the following variants are expected to be useful:

Compressing T with different sets of punctuators: Performance can be improved by increasing the probability that a punctuator will be found in P. If storage of different T’ sequences is not a limitation, multiple T’ sequences utilizing different punctuators can be created. P can be rapidly examined for available punctuators to identify the punctuator that comes earliest in P and then, after compression to P’, search only in the relevant T’.

Varying the pre-punctuator code to allow more 4-mers to serve as punctuators: The number of different possible punctuators is limited by the fact that the pre-punctuator sequences are unable to serve as unambiguous punctuators. However, for a particular punctuator, a different pre-punctuator code can be used. For example, “TAGC” is excluded as a punctuator by the pre-punctuator code in Section 2.1; however, sequences compressed with this potential punctuator could instead use “xx xx xx yy” as the pre-punctuator code, where yy are the bits specifying the number of characters. In this case, if “TAGC” is found early in P, then P would be compressed using this variant punctuator and pre-punctuator code, and searched in a corresponding variant T’.

Translating A, G, C, T with different bit patterns: A further variation on varying the compression dependent upon which punctuator is present early in P would be to compress with a different bit translation of A, G, C, T. For example, “TAGC” could also be a punctuator if both P and T were compressed using “T” = “00” and “A” = “11”.

With the above variations on the basic cp-BM search scheme described in this paper, it is likely that search performance of this algorithm can approach the superior performance demonstrated here for patterns with guaranteed punctuators present.

The ideas in this paper may be used to speed up the performance of approximate homology searching tools in which exact pattern matching is typically used to find “hits” before a complete sequence alignment is performed. Conversely, in preliminary studies we have found that indexing methods used by BLAT can be incorporated into cp-BM to provide a further enhancement of exact match searches.

References

1. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10.

2. Kent, W.J., BLAT--the BLAST-like alignment tool. Genome Res, 2002. 12(4): p. 656-64.

3. Boyer, R.S. and J.S. Moore, A fast string searching algorithm. Communications of the ACM, 1977. 20: p. 762-772.

4. Amir, A. and G. Benson. Efficient two-dimensional compressed matching. in Data Compression Conference. 1992.

5. Miyazaki, M., et al., Speeding up the pattern matching machine for compressed texts. Transactions of Information Processing, 1998. 39: p. 2638-2648.

6. Chen, X., S. Kwong, and M. Li, A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison. Genome Inform Ser Workshop Genome Inform, 1999. 10: p. 51-61.

7. Shibata, Y., et al. A Boyer-Moore type algorithm for compressed pattern matching. in 11th Ann. Symp. on Combinatorial Pattern Matching. 2000: Springer-Verlag.

8. Shibata, Y., et al. Speeding up pattern matching by text compression. in 4th Italian Conference on Algorithms and Complexity. 2000: Springer-Verlag.

9. Gage, P., A new algorithm for data compression. The C Users Journal, 1994. 12(193).

10. Wu, S. and U. Manber. Agrep : a fast approximate pattern-matching tool. in Usenix Winter 1992 Technical Conference. 1992.

11. Moura, E.S.d., et al. Direct pattern matching on compressed text. in 5th International Symp. on String Processing and Information Retrieval. 1998.

12. Chen, L., S. Lu, and J. Ram. Compressed Pattern Matching in DNA Sequences. in IEEE Computational Systems Bioinformatics Conference. 2004. Stanford, CA, USA.

13. Gusfield, D., Algorithms on Strings, Trees, and Sequences. 1997, New York: Cambridge University Press.

Figure 9 plot: x-axis, position in pattern (<32, 32-64, 64-96, 96-128); y-axis, user time (ms); series, cp-BM without index and cp-BM with index.