
2012 Ninth International Conference on Information Technology: New Generations (ITNG), Las Vegas, NV, USA, April 16-18, 2012


New Hashing-Based Multiple String Pattern Matching Algorithms

Chouvalit Khancome1 1Department of Computer Science

Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang (KMITL), Ladkrabang, Bangkok, THAILAND

10520 e-mail: [email protected],

Veera Boonjing1,2 1Department of Computer Science

Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang (KMITL), Ladkrabang, Bangkok, THAILAND

10520 e-mail: [email protected]

2National Centre of Excellence in Mathematics PERDO, Bangkok, THAILAND 10400

Abstract—This paper presents three new algorithms for multiple string pattern matching using hashing tables: a suffix search (SS), a suffix-prefix search (SPS), and a suffix-middle-prefix search (SMPS). Preprocessing takes O(|P|) time and space, where |P| is the sum of all pattern lengths. Searching for patterns of fixed length m in a text of length |t| takes O(|t||P|) in the worst case, O(|t|) in the average case, and O(|t|/m) in the best case. Furthermore, the new algorithms require fewer matching attempts than traditional algorithms.

Keywords-Multiple String Pattern Matching, Static Dictionary Matching, Matching Algorithm, String Matching.

I. INTRODUCTION Multiple string pattern matching is a classic problem in computer science. It has been adapted to solve several practical problems, such as operating-system commands, DNA sequencing, network intrusion detection systems (NIDS), bibliography search systems, etc. Fundamentally, this matching simultaneously searches for all occurrences of the patterns P={p1, p2, p3,...,pr} appearing in a given text T=t1t2t3...tn over a finite alphabet Σ. A straightforward search reads p1 through pr and compares each pattern with the given text character by character. A powerful search is one with a minimal number of attempts. Such a search can be obtained via a fast-access data structure [9] and a minimal-attempt search such as a suffix search (SS), a suffix-prefix search (SPS), and a suffix-middle-prefix search (SMPS) [9]. Since a hashing table is the fastest-access data structure, this paper proposes to use it as the data structure for this purpose.
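To make the baseline concrete, the straightforward character-by-character search described above can be sketched in a few lines of Python; the function name and the (position, pattern) return format are our own illustration, not part of the paper.

```python
# Naive multiple-pattern search: compare every pattern against every
# text position character by character. This is the O(|t||P|) baseline
# that the proposed hashing-based algorithms improve on.
def naive_multi_search(text, patterns):
    matches = []
    for i in range(len(text)):
        for p in patterns:
            if text[i:i + len(p)] == p:   # character-by-character comparison
                matches.append((i, p))
    return matches
```

For example, `naive_multi_search("abcab", ["ab", "ca"])` reports every occurrence, including overlapping ones.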

The remaining sections are organized as follows. Section 2 reviews related work. Section 3 gives basic definitions. Section 4 explains how to create the hashing tables and presents their construction algorithms. Section 5 presents the proposed search algorithms. Section 6 analyzes the complexities of all proposed algorithms. Section 7 reports experimental results and gives a discussion. Finally, section 8 concludes.

II. RELATED WORKS Multiple string pattern matching is a very popular research subject, with many publications in the literature surveyed in [9]. The Aho-Corasick algorithm [1] is the first linear-time solution; it is inherited from KMP [7]. Commentz-Walter [3] and SetHorspool [9] are sub-linear-time solutions extended from single string matching [4]. The Aho-Corasick algorithm [1] uses the Trie structure to accommodate P in O(|P|) time. This brings the searching time to O(|t|+nocc), where nocc is the number of pattern occurrences (plus additional time to access the Trie). Commentz-Walter [3] and SetHorspool [9] are sub-linear on average, with O(|t||lmax|) worst-case searching time, where lmax is the longest pattern in P. However, implementing the Trie in applications takes a large amount of memory. Other Trie-based solutions [18], [19], MultiBDM [20], SBOM [5], SDBM [9], [15], and [11] improved the Trie to decrease the searching time, but they are more complex both in the searching phase and compared with the simplest way to search (shown in [9]).

Bit-parallel-based algorithms represent the patterns as bit masks. Navarro [9] showed how to extend the single-pattern Shift-Or and Shift-And to the Multiple Shift-And [8], the Multiple-BNDM [9], and [10]. The bit-parallel principle is restricted by the computer word size; moreover, the derived algorithms must deal with bit conversion, which takes more time.

The hashing approach was introduced by Karp and Rabin [14] for single string matching. Their algorithm takes O(mn) time in the worst case (an exhaustive search), and extending it directly to multiple string pattern matching takes O(|t||P|) time. An efficient solution, presented by Wu and Manber [23], creates a shift table and uses a hashing table to store blocks of patterns. Solution [25] improved the Wu-Manber algorithm [23] to save searching time, but the worst-case searching time did not improve (i.e., it remains O(|t||P|)).

Current solutions, based on other excellent ideas [16], [17], and [24], combine several structures to improve the time complexity, such as the q-gram and the partitioning

2012 Ninth International Conference on Information Technology- New Generations

978-0-7695-4654-4/12 $26.00 © 2012 IEEE

DOI 10.1109/ITNG.2012.34

195


technique. Good literature reviews are given in [9], [16], and [24].

III. BASIC DEFINITIONS As previously mentioned, the new algorithms employ hashing tables to accommodate the patterns; likewise, the searching phase hashes into these tables for matching inspection. Definition 1 introduces the main shifting table, definition 2 describes the hashing table of pattern prefixes, and definition 3 covers the table that stores only the two suffix-prefix characters of each pattern. Definition 4 covers the middle of a pattern string left after taking its suffix-prefix. Definition 5 defines the suffix-middle-prefix of the patterns, and definition 6 the remainder of the patterns after taking the suffix-middle-prefix. The following notation is used throughout.

Definition 1. The hashing table, which consists of two columns, Σ and the shifting values, is called the shifting table and denoted by ST.

Example 1. The ST of P={aaba, abcb, aadc, zmnd, qope, jmqf}, where * stands for any character that does not appear in Σ (the shifting values are calculated by algorithm 1 in the next section).

TABLE I. SHIFTING VALUE TABLE (ST)

Σ of pi[m]    Shifting Value
a             2
b             1
c             1
d             1
e             1
f             1
m             2
n             4
o             3
p             1
q             3
j             3
z             3
*             4

Definition 2. The hashing table, which stores the prefix of each pattern obtained by removing its last character, is called the prefix of pattern table and denoted by PPT.

Example 2. The PPT of example 1, where P={aaba, abcb, aadc, zmnd, qope, jmqf}.

TABLE II. PREFIX STRING HASHING (PPT)

pi[1...m-1]: aab, abc, aad, zmn, qop, jmq

Definition 3. The hashing table, which stores the suffix-prefix of all patterns, is called the suffix-prefix of pattern table and denoted by S-PPT.

Example 3. The S-PPT of P={aaba, abcb, aadc, zmnd, qope, jmqf}.

TABLE III. SUFFIX-PREFIX HASHING (S-PPT)

pi[m]pi[1]: aa, ba, ca, dz, eq, fj

Definition 4. The hashing table, which stores the part of each pattern remaining after creating the S-PPT, is called the middle of pattern table and denoted by MPT.

Example 4. The MPT of example 1, where P={aaba, abcb, aadc, zmnd, qope, jmqf}.

TABLE IV. MIDDLE PREFIX HASHING (MPT)

pi[2...m-1]: ab, bc, ad, mn, op, mq

Definition 5. The hashing table, which stores the suffix-

middle-prefix of all patterns, is called the suffix-middle-prefix table and denoted by S-M-PPT.

Example 5. If P={aaaxbbc, xabzcba, maakdce, zmndkki, aqojpek, bjmaqfb}, then the S-M-PPT is shown in table V.


TABLE V. SUFFIX-MIDDLE-PREFIX HASHING (S-M-PPT)

pi[m]pi[m/2]pi[1]: cxa, azx, ekm, idz, kja, bab

Definition 6. The hashing table, which stores the parts of each pattern remaining after creating the S-M-PPT, is called the before-middle and after-middle table, denoted by BM-AMPT.

Example 6. The BM-AMPT of example 5, where P={aaaxbbc, xabzcba, maakdce, zmndkki, aqojpek, bjmaqfb}.

TABLE VI. BEFORE-MIDDLE AND AFTER-MIDDLE HASHING (BM-AMPT)

pi[2...(m/2)-1]:     aa, ab, aa, mn, qo, jm
pi[(m/2)+1...m-1]:   bb, cb, dc, kk, pe, qf

IV. PREPROCESSING PHASE The following algorithms show how to create ST, PPT, S-PPT, MPT, S-M-PPT, and BM-AMPT, respectively. Let m be the fixed length of the patterns (all pattern lengths are equal). ShiftingValue is the shifting value of an individual character in Σ, initialized to m.

Algorithm 1: Create ST
Input: P={p1, p2, p3,...,pr}
Output: Shifting Table (ST)
1. Initiate the empty ST
2. For i=1 to r Do
3.   ShiftingValue = m
4.   For j=m-1 down to 1 Do
5.     If pi[j] exists in ST Then
6.       ShiftingValue = the shifting value at pi[j] in ST
7.       If pi[j] = pi[m] Then
8.         ShiftingValue = ShiftingValue - 1
9.       End If
10.      If the shifting value at pi[j] > ShiftingValue Then
11.        ST[pi[j]] = ShiftingValue
12.      End If
13.    Else
14.      add pi[j] to the Σ column of ST and m to the ShiftingValue column
15.    End If
16.  End For
17. End For
18. Return ST
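As an illustration, the shift-table construction can be sketched in Python. This is a minimal Horspool-style interpretation of ST (the shift for a character is its distance from the pattern end, taken over its occurrences in pi[1..m-1] and minimized over all patterns, with default m); it omits Algorithm 1's extra adjustment for characters equal to pi[m], so its values can differ from Table I for characters that appear only as a last character.

```python
# Horspool-style shift table: one interpretation of ST.
# Characters absent from every pattern implicitly shift by m ('*' row).
def create_st(patterns):
    m = len(patterns[0])           # all patterns share the same length m
    st = {}
    for p in patterns:
        for j, c in enumerate(p[:-1]):      # positions 1..m-1 (0-based)
            shift = m - 1 - j               # distance to the pattern end
            st[c] = min(st.get(c, m), shift)
    return st, m
```

For the patterns of Example 1, this yields shift 2 for 'a' and 1 for 'b', matching the corresponding rows of Table I.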

Algorithm 2: Create PPT
Input: P={p1, p2, p3,...,pr}
Output: PPT
1. Initiate the empty PPT
2. For i=1 to r Do
3.   PPT ← pi[1...m-1]
4. End For
5. Return PPT

Algorithm 3: Create S-PPT
Input: P={p1, p2, p3,...,pr}
Output: S-PPT
1. Initiate the empty S-PPT
2. For i=1 to r Do
3.   S-PPT ← pi[m]pi[1]
4. End For
5. Return S-PPT

Algorithm 4: Create MPT
Input: P={p1, p2, p3,...,pr}
Output: MPT
1. Initiate the empty MPT
2. For i=1 to r Do
3.   MPT ← pi[2...m-1]
4. End For
5. Return MPT

Algorithm 5: Create S-M-PPT
Input: P={p1, p2, p3,...,pr}
Output: S-M-PPT
1. Initiate the empty S-M-PPT
2. For i=1 to r Do
3.   S-M-PPT ← pi[m]pi[m/2]pi[1]
4. End For
5. Return S-M-PPT

Algorithm 6: Create BM-AMPT
Input: P={p1, p2, p3,...,pr}
Output: BM-AMPT
1. Initiate the empty BM-AMPT
2. For i=1 to r Do
3.   BM-AMPT ← pi[2...(m/2)-1]
4.   BM-AMPT ← pi[(m/2)+1...(m-1)]
5. End For
6. Return BM-AMPT
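The hashing tables of Algorithms 2-6 can be sketched with Python sets, whose membership tests are O(1) on average. The helper name is ours, and we take the middle index as ⌈m/2⌉, which reproduces Tables V and VI for the length-7 patterns of Example 5; this choice is our assumption for even lengths.

```python
# Build PPT, S-PPT, MPT, S-M-PPT, and BM-AMPT as hash sets (sketch).
def create_tables(patterns):
    m = len(patterns[0])
    h = (m + 1) // 2                                   # 1-based middle index
    ppt     = {p[:m - 1] for p in patterns}            # pi[1...m-1]
    s_ppt   = {p[m - 1] + p[0] for p in patterns}      # pi[m]pi[1]
    mpt     = {p[1:m - 1] for p in patterns}           # pi[2...m-1]
    s_m_ppt = {p[m - 1] + p[h - 1] + p[0]
               for p in patterns}                      # pi[m]pi[m/2]pi[1]
    bm_ampt = ({p[1:h - 1] for p in patterns} |        # pi[2...(m/2)-1]
               {p[h:m - 1] for p in patterns})         # pi[(m/2)+1...m-1]
    return ppt, s_ppt, mpt, s_m_ppt, bm_ampt
```

Running this on the patterns of Example 5 yields exactly the entries of Tables V and VI.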

V. SEARCHING ALGORITHMS The following sub-sections present the searching steps, figures, and algorithm details.


A. Suffix Search The preprocessing phase creates the ST of algorithm 1 and the PPT of algorithm 2. The search begins by placing the first window at the first character of text T. The next step takes the last character of the window and hashes it into ST. If that lookup succeeds, the method takes the string from the first character to the (m-1)th character of the search window and hashes it into PPT to check for a match. Finally, the shifting value is retrieved from ST and the search window is shifted. This idea is shown in figure 1.

Figure 1. Illustration of the idea of the first solution.

The following algorithm shows the idea above.

Algorithm 7: Suffix Search (SS)
Input: T, PPT, ST, P
Output: all matched positions are reported
1. Create ST(P)
2. Create PPT(P)
3. Cur = m
4. While Cur <= n Do
5.   If T[Cur] exists in ST Then
6.     If Hash[T[Cur-m+1]...T[Cur-1]] into PPT = true Then
7.       Report the matched position at Cur
8.     End If
9.   End If
10.  NewCur = ShiftingValue at ST[T[Cur]]
11.  Cur = Cur + NewCur
12. End While
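A self-contained Python sketch of the SS idea: the window's last character is checked first, and only on a hit is the window's prefix hashed into PPT. Names are ours, and ST follows the Horspool-style interpretation given earlier; as in Algorithm 7, shifting by the ST value after a match can skip overlapping occurrences.

```python
# Suffix Search (SS) sketch: check the last character, then hash the
# remaining m-1 window characters into PPT in a single set lookup.
def suffix_search(text, patterns):
    m = len(patterns[0])
    st = {}                                    # Horspool-style shift table
    for p in patterns:
        for j, c in enumerate(p[:-1]):
            st[c] = min(st.get(c, m), m - 1 - j)
    last = {p[-1] for p in patterns}           # last characters of patterns
    ppt = {p[:-1] for p in patterns}           # prefixes pi[1...m-1]
    matches = []
    cur = m - 1                                # 0-based index of window end
    while cur < len(text):
        if text[cur] in last and text[cur - m + 1:cur] in ppt:
            matches.append(cur - m + 1)        # report the match start
        cur += st.get(text[cur], m)            # '*' characters shift by m
    return matches
```

For example, `suffix_search("xxabcdyyzbcd", ["abcd", "zbcd"])` reports the starting positions 2 and 8.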

B. Suffix Prefix Search First of all, the tables ST, S-PPT, and MPT are created by running algorithms 1, 3, and 4. The searching steps are as follows. The first step places the search window at the first character of text T. The next step takes the last and first characters of the search window and hashes them into S-PPT. If this matches, the string from the second character of the search window to its (m-1)th character is hashed into MPT to verify the match. In the last step, the shifting

value is retrieved from ST and the search window is shifted. The following figure shows this idea.

Figure 2. Illustration of the idea of the second solution.

Algorithm 8 shows these steps.

Algorithm 8: Suffix Prefix Search (SPS)
Input: T, S-PPT, ST, MPT, P
Output: all matched positions are reported
1. Create ST(P)
2. Create S-PPT(P)
3. Create MPT(P)
4. Cur = m
5. While Cur <= n Do
6.   If T[Cur]T[Cur-m+1] exists in S-PPT Then
7.     If Hash[T[Cur-m+2]...T[Cur-1]] into MPT = true Then
8.       Report the matched position at Cur
9.     End If
10.  End If
11.  NewCur = ShiftingValue at ST[T[Cur]]
12.  Cur = Cur + NewCur
13. End While
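A self-contained Python sketch of the SPS idea follows; names are ours. Note that because S-PPT and MPT are consulted independently, a suffix-prefix pair from one pattern combined with the middle of another could in principle pass both tests; the sketch mirrors the verification as the algorithm states it.

```python
# Suffix Prefix Search (SPS) sketch: hash the (last char, first char)
# pair into S-PPT, then verify the window's middle against MPT.
def suffix_prefix_search(text, patterns):
    m = len(patterns[0])
    st = {}                                    # Horspool-style shift table
    for p in patterns:
        for j, c in enumerate(p[:-1]):
            st[c] = min(st.get(c, m), m - 1 - j)
    s_ppt = {p[-1] + p[0] for p in patterns}   # pi[m]pi[1]
    mpt = {p[1:-1] for p in patterns}          # pi[2...m-1]
    matches = []
    cur = m - 1                                # 0-based index of window end
    while cur < len(text):
        start = cur - m + 1
        if text[cur] + text[start] in s_ppt and text[start + 1:cur] in mpt:
            matches.append(start)
        cur += st.get(text[cur], m)
    return matches
```

For example, `suffix_prefix_search("xabcdx", ["abcd"])` reports the single occurrence at position 1.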

C. Suffix Middle Prefix Search In the preprocessing phase, the hashing tables ST, S-M-PPT, and BM-AMPT are created by algorithms 1, 5, and 6. The searching steps begin by placing the first search window at the first character of text T. The next step takes the last, middle, and first characters of the search window and hashes them into S-M-PPT. If this matches, the string from the second character of the search window to the ((m/2)-1)th character is hashed into BM-AMPT. If that comparison also succeeds, the string from the ((m/2)+1)th to the (m-1)th character is hashed for the final matching inspection, and the matching position is reported. The last step retrieves the shifting value from ST and shifts the search window. The idea is shown in figure 3.


Figure 3. Illustration of the idea of the third solution.

The following algorithm shows the idea above.

Algorithm 9: Suffix Middle Prefix Search (SMPS)
Input: T, ST, S-M-PPT, BM-AMPT, P
Output: all matched positions are reported
1. Create ST(P)
2. Create S-M-PPT(P)
3. Create BM-AMPT(P)
4. Cur = m
5. While Cur <= n Do
6.   If T[Cur]T[Cur-m+m/2]T[Cur-m+1] exists in S-M-PPT Then
7.     If Hash[T[Cur-m+2]...T[Cur-m+(m/2)-1]] into BM-AMPT = true Then
8.       If Hash[T[Cur-m+(m/2)+1]...T[Cur-1]] into BM-AMPT = true Then
9.         Report the matched position at Cur
10.      End If
11.    End If
12.  End If
13.  NewCur = ShiftingValue at ST[T[Cur]]
14.  Cur = Cur + NewCur
15. End While
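A self-contained Python sketch of the SMPS idea, again taking the middle index as ⌈m/2⌉ (our assumption, consistent with Tables V and VI); names are ours.

```python
# Suffix Middle Prefix Search (SMPS) sketch: hash last, middle, and
# first window characters into S-M-PPT, then verify the two remaining
# fragments against BM-AMPT.
def suffix_middle_prefix_search(text, patterns):
    m = len(patterns[0])
    h = (m + 1) // 2                            # 1-based middle index m/2
    st = {}                                     # Horspool-style shift table
    for p in patterns:
        for j, c in enumerate(p[:-1]):
            st[c] = min(st.get(c, m), m - 1 - j)
    s_m_ppt = {p[-1] + p[h - 1] + p[0] for p in patterns}
    bm_ampt = ({p[1:h - 1] for p in patterns} |
               {p[h:m - 1] for p in patterns})
    matches = []
    cur = m - 1                                 # 0-based index of window end
    while cur < len(text):
        s = cur - m + 1                         # window start
        key = text[cur] + text[s + h - 1] + text[s]
        if (key in s_m_ppt and text[s + 1:s + h - 1] in bm_ampt
                and text[s + h:cur] in bm_ampt):
            matches.append(s)
        cur += st.get(text[cur], m)
    return matches
```

For example, searching "zzaaaxbbczz" for the Example 5 pattern "aaaxbbc" reports the occurrence at position 2.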

VI. COMPLEXITY ANALYSIS

A. Preprocessing Time and Space Complexity Algorithm 1 runs once for each of the r patterns, scanning each pattern once, i.e., O(|P|) time; meanwhile, its basic space complexity is 2|Σ| (two columns: Σ and ShiftingValue). Algorithm 2 (creating PPT) also runs in |P| time, and its space complexity is |P|-r. Considering the data structures of the first searching solution, the time complexity is |P|+|P|, which is O(|P|). The space complexity equals the space of ST plus the space of PPT, i.e., 2|Σ|+(|P|-r) = O(|P|). In the same way, MPT, S-M-PPT, and BM-AMPT are analyzed as PPT is, because their overall space is at most |P|.

Algorithms 3 through 6 likewise run in |P| steps, taking O(|P|) time and requiring at most O(|P|) space, as mentioned above.

B. Searching Time Complexity Algorithms 7, 8, and 9 are similar, because each captures the suffix, the suffix-prefix, or the suffix-middle-prefix of a pattern in one lookup and then captures the remaining part of the pattern in one more lookup. If the window's last character matches in ST, each window costs at most a constant number of lookups, taking O(|t|) time overall; if there is no match at all, every window shifts by m, taking O(|t|/m) (the best case). In the average case, each comparison examines at least one character and captures the remaining part of the search window, taking at most O(|t|) time when the shifting value advances more than one position. In the worst case, the basic comparison is the same as in the average case, but the shifting value advances only one position at a time, which leads to O(|t||P|) time.

VII. EMPIRICAL RESULTS AND DISCUSSION For the empirical results, we assumed a given text of 1,000,000 characters and a fixed pattern length of 100 characters. The number of matches was set to 10,000, with no overlapping. Moreover, we also measured the attempt counts when mismatches occurred at the first character and at the last character of each search window.

The chosen algorithms are the classic Aho-Corasick [1], the simple variant of Commentz-Walter [3] called SetHorspool [9], and the current excellent q-gram technique of [16] (built from the classic Aho-Corasick [1] and SetHorspool). We assumed q=5, larger than in the original experiments reported in [16]. Table VII shows the empirical attempt counts for matching, where SS is our first solution, SPS the second, and SMPS the third.

TABLE VII. EMPIRICAL RESULTS OF ATTEMPT COUNTS FOR MATCHING (TIMES)

Algorithms          Complete matching   Mismatch at pi[1]   Mismatch at pi[m]
Aho-Corasick [1]    1,000,000           10,000              990,000
SetHorspool [9]     1,000,000           990,000             10,000
5-q grams of [1]    200,000             10,000              200,000
5-q grams of [9]    200,000             200,000             10,000
SS                  20,000              20,000              10,000
SPS                 20,000              10,000              10,000
SMPS                30,000              10,000              10,000

The empirical results show that the attempt counts of the new algorithms are lower than those of the traditional algorithms, especially for long patterns.


As mentioned in sections 5 and 6, this section discusses and offers suggestions for an effective implementation. The data structure employs hashing tables to accommodate the patterns, so each comparison requires only a single lookup. If the implementation employs a data structure that can capture a part of the search window, or a part of a pattern, in one operation, then the expected search is very fast.

Moreover, very long patterns are especially suitable for these algorithms, because each window is compared against the hashing tables only once, or at most three times. Meanwhile, traditional solutions need more comparisons per match (e.g., Aho-Corasick [1], Commentz-Walter [3], or the current solutions [16], [17], and [24]).

VIII. CONCLUSION This paper presents three new algorithms for multiple string pattern matching. These solutions employ hashing tables in several forms for fast pattern access. All approaches take O(|P|) preprocessing time and space, where |P| is the sum of the pattern lengths in P. Their searches take O(|t||P|) in the worst case, O(|t|) in the average case, and O(|t|/m) in the best case, where |t| is the length of text T and m is the fixed pattern length. As the empirical results show, the proposed algorithms are especially suitable for long patterns.

REFERENCES [1] A. V. Aho, and M. J. Corasick, “Efficient string matching: An aid to

bibliographic search”, Comm. ACM, 1975, pp.333-340. [2] A. Moffat, and J. Zobel, “Self-Indexing Inverted Files for Fast Text

Retrieval”, ACM Transactions on Information Systems, Vol. 14, No. 4, 1996, pp.349-379.

[3] B. Commentz-Walter, “A string matching algorithm fast on the average”, In Proceedings of the Sixth International Colloquium on Automata, Languages and Programming, 1979, pp.118-132.

[4] R.S. Boyer, and J.S. Moore, “A fast string searching algorithm”, Communications of the ACM, 20(10), 1977, pp.762-772.

[5] C. Allauzen, and M. Raffinot, “Factor oracle of a set of words”, Technical report 99-11, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.

[6] C. Monz, and M. de Rijke, (11, 02, 2002). Inverted Index Construction. [Online]. Available: http://staff.science.uva.nl/ ~christof/courses/ir/transparencies/clean-w-05.pdf.

[7] D.E. Knuth, J.H. Morris, and V.R. Pratt, “Fast pattern matching in strings”, SIAM Journal on Computing 6(2), 1977, pp.323-350.

[8] G. Navarro, “Improved approximate pattern matching on hypertext”, Theoretical Computer Science, 2000, 237:pp.455-463.

[9] G. Navarro, and M. Raffinot, “Flexible Pattern Matching in Strings”, The Press Syndicate of the University of Cambridge, 2002.

[10] H. Hyyrö, K. Fredriksson, and G. Navarro, “Increased Bit-Parallelism for Approximate and Multiple String Matching”, ACM Journal of Experimental Algorithmics, Vol.10, Article No. 2.6, 2005, pp.1-27.

[11] J. J. Fan, and K. Y. Su, “An efficient algorithm for matching multiple patterns”, IEEE Trans. on Knowledge and Data Engineering, 1993, Vol.5, No.2, pp.339-351.

[12] J. Zobel, and A. Moffat, “Inverted Files Versus Signature Files for Text Indexing”, ACM Transaction on Database Systems, Vol. 23, No. 4, 1998, pp.453-490.

[13] J. Zobel, and A. Moffat, “Inverted Files for Text Search Engines”, ACM Computing Surveys, Vol. 38, No. 2, 2006, pp.1-56.

[14] R. M. Karp, and M.O. Rabin, “Efficient randomized pattern-matching algorithms”, IBM Journal of Research and Development, 31(2), 1987, pp.249-260.

[15] L. Gongshen, L. Jianhua, and L. Shenghong, “New multi-pattern matching algorithm”, Journal of Systems Engineering and Electronics, Vol. 17, No. 2, 2006, pp.437-442.

[16] L. Salmela, J. Tarhio, and J. Kytöjoki, “Multipattern string matching with q-grams”, ACM Journal of Experimental Algorithmics (JEA), Vol. 11, Article No. 1.1, 2006, pp.1-19.

[17] L. Ping, T. Jian-Long, and L. Yan-Bing, “A partition-based efficient algorithm for large scale multiple-string matching”, In Proceeding of 12th Symposium on String Processing and Information Retrieval (SPIRE’05). Lecture Notes in Computer Science, vol. 3772, Springer-Verlag, Berlin, 2005.

[18] M. Crochemore, A. Czumaj, L. Gąsieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter, “Fast practical multi-pattern matching”, Report 93-3, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1993.

[19] M. Crochemore, A. Czumaj, L. Gąsieniec, T. Lecroq, W. Plandowski, and W. Rytter, “Fast practical multi-pattern matching”, Information Processing Letters, 71(3/4), 1999, pp.107-113.

[20] M. Raffinot, “On the multi backward dawg matching algorithm (MultiBDM)”, In R. Baeza-Yates, editor, Proceedings of the 4th South American Workshop on String Processing, Valparaìso, Chile. Carleton University Press, 1997, pp.149-165.

[21] O. R. Zaïane, “CMPUT 391: Inverted Index for Information Retrieval”, University of Alberta. 2001.

[22] R. B. Yates, and B. R. Neto, “Modern Information Retrieval”, The ACM Press, a division of the Association for Computing Machinery, Inc., 1999, pp.191-227.

[23] S. Wu, and U. Manber, “A fast algorithm for multi-pattern searching”, Report tr-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.

[24] S. Klein, T. R. Shalom, and Y. Kaufman, “Searching for a set of correlated patterns”, Journal of Discrete Algorithm, Elsevier, 2006, pp.1-13.

[25] Y. Hong, D. X. Ke, and C. Yong, “An improved Wu-Manber multiple patterns matching algorithm”, Performance, Computing, and Communications Conference, 2006. IPCCC 2006. 25th IEEE International 10-12, 2006, pp.675-680.

[26] Z. A.A. Alqadi, M. Aqel, and I. M.M. El Emary, “Multiple skip Multiple pattern matching algorithm (MSMPMA)”, IAENG International Journal of Computer Science, 34:2, IJCS_34_2_03, 2007.

[27] F. C. Botelho, “Near-Optimal Space Perfect Hashing Algorithms”, The thesis of PhD. in Computer Science of the Federal University of Minas Gerais, 2008.

[28] R. Pagh, (11, 08, 2009) “Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions”. [Online]. Available: www.it.-c.dk/people/pagh/papers/hash.pdf.

[29] Wikipedia (10,07,2009). “Hash function”. [Online]. Available: en.wikipedia.org/wiki/Hash_fuction.

[30] Simon, I. “String matching and automata”, in Results and Trends in Theoretical Computer Science, Graz, Austria, J. Karhumaki, H. Maurer and G. Rozenberg ed., Lecture Notes in Computer Science 814, Springer-Verlag, Berlin, 1994, pp. 386-395.
