
[IEEE 2005 International Conference on Neural Networks and Brain - Beijing, China (13-15 Oct. 2005)] 2005 International Conference on Neural Networks and Brain - Mining Maximal Sequential




Mining Maximal Sequential Patterns

En-Zheng Guan1, Xiao-Yu Chang2, Zhe Wang3, Chun-Guang Zhou4

College of Computer Science, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education

Changchun, China, 130012
E-mail: [email protected], chang_xiao_yu@163.com, [email protected], cgzhou@jlu.edu.cn

Abstract—When patterns are long, frequent sequential pattern mining may generate an exponential number of results, and the mass of useless, repeated information often perplexes decision-makers. To solve this problem, a novel algorithm, MFSPAN (Maximal Frequent Sequential Pattern mining algorithm), is proposed to mine the complete set of maximal frequent sequential patterns in sequence databases. MFSPAN takes full advantage of the property that two different sequences may share a common prefix to reduce the number of itemset comparisons. Experiments on standard test data show that MFSPAN is very effective.

I. INTRODUCTION

Sequential pattern mining [1] is a very important field in data mining. However, when frequent patterns are long, such mining will generate an exponential number of frequent subsequences, which makes the set of all frequent sequential patterns too large [2]. This limits the effectiveness of these algorithms, since useful frequent patterns are flooded with a mountain of useless results. Some researchers proposed to mine the set of all closed patterns; a frequent sequence is closed if it has no superset with the same frequency [3, 4]. But for some dense databases, even such a set can grow too large. The set of maximal frequent sequences (MFS), however, is much smaller than both of the sets above. Given an MFS set, it is much easier to obtain useful information (e.g., the longest frequent sequence, the overlap of MFS, etc.), and a decision-maker is not perplexed by repeated, redundant information. Moreover, all the frequent sequences can be built up from an MFS set, and if we are interested in a certain sequence, we can get its support by searching the database in a single scan. All of this makes mining MFS very meaningful. Yet as far as we know, all recent algorithms for mining maximal patterns, such as MAFIA [5], MaxMiner [6], and DepthProject [7], mine maximal frequent itemsets in transaction databases; no work has been done on mining maximal frequent sequences in sequence databases.

II. PRELIMINARIES AND RELATED WORK

Given a sequence s = {t1, t2, ..., tm} and an item a, s ◇ a means that s is concatenated with a. The new sequence s' generated by s ◇ a falls into two classes: (1) a is an SES (a sequence-extended step [3]): s ◇_s a = {t1, t2, ..., tm, (a)}; (2) a is an IES (an itemset-extended step): s ◇_i a = {t1, t2, ..., tm ∪ (a)}, where ∀ k ∈ tm, k < a. Similarly, given two sequences s = {t1, t2, ..., tm} and p = {t1', t2', ..., tm'}, s ◇ p means that s is concatenated with p, again in two classes: p is an IES, i.e., s ◇_i p = {t1, t2, ..., tm ∪ t1', t2', ..., tm'}, or p is an SES, i.e., s ◇_s p = {t1, t2, ..., tm, t1', t2', ..., tm'}. If s = p ◇ s', then p is a prefix of s and s' is a suffix of s with respect to (w.r.t.) p. We can get the suffix s' of s by removing the prefix p of s, denoted s' = s − p.
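The two extension steps above can be sketched in a few lines of Python (my own model, not the paper's code): a sequence is a list of itemsets, each itemset a frozenset of items, and the suffix operation s − p is shown only for itemset-aligned prefixes.

```python
# Sketch of SES/IES extension; names and representation are the editor's own.

def ses(s, a):
    """Sequence-extended step s <>_s a: append the new itemset (a)."""
    return s + [frozenset([a])]

def ies(s, a):
    """Itemset-extended step s <>_i a: add item a to the last itemset.
    Item a must be greater than every item already in that itemset."""
    assert all(k < a for k in s[-1]), "IES requires a to exceed existing items"
    return s[:-1] + [s[-1] | {a}]

def remove_prefix(s, p):
    """Suffix s - p for an itemset-aligned prefix p of s."""
    assert s[:len(p)] == p
    return s[len(p):]

s = [frozenset("ad")]        # the sequence {(a,d)}
t = ies(ses(s, "b"), "c")    # {(a,d)} <>_s b <>_i c gives {(a,d)(b,c)}
print(t)                     # two itemsets: {a,d} then {b,c}
```

The assertion in `ies` encodes the side condition ∀ k ∈ tm, k < a from the definition.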

TABLE I

SAMPLE DATABASE

Customer ID    Sequence
1              {(a,d)(b,c,d)(a,d)}
2              {(a,c,d)(d)(a,b,c)}
3              {(a,d)(b,c,d)}

Given a lexicographic order for all items, we can define a lexicographic order for all sequences: if a < b (a and b are two different items), then 1) {(a)} < {(b)}, and 2) s ◇_s a ◇ γ1 < s ◇_s b ◇ γ2 < s ◇_i a ◇ γ3 < s ◇_i b ◇ γ4, where s, γ1, γ2, γ3, γ4 are sequences.


Fig. 1. Part of the FS tree for Table I.

A lexicographic frequent sequence tree (FS tree) [3, 8] can be constructed in the following way: 1) each node of the tree corresponds to a sequence, and the root of the tree corresponds to the empty sequence; 2) if the father node corresponds to

0-7803-9422-4/05/$20.00 ©2005 IEEE



sequence s, its son nodes are generated by either an IES or an SES; 3) sL < sR if a node's left son corresponds to sL and its right son to sR; i.e., according to the lexicographic order for all sequences, we always first perform all SESs of a node, then all its IESs.

Figure 1 demonstrates a part of the FS tree for the sample database in Table I with min_sup = 2. For a finite database, all the frequent sequences can be listed in such a tree (in Figure 1, subscript "i" denotes an IES and "s" denotes an SES).

III. THE MFSPAN ALGORITHM

A. Maximal Frequent Sequence and MFS Tree


Fig. 2. The MFS tree for Table I.    Fig. 3. A sample.

Definition 1. In the complete set of frequent sequential patterns for a database D, if there does not exist any sequence s' (s' ≠ s) that is a super-sequence of s, then s is a maximal frequent sequence in D.

For example, in the sample database, {(a,d)(d)(b,c)} is a maximal frequent sequence, but {(a)(d)(b,c)} is not.
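The containment test behind Definition 1 can be sketched as follows (my own helper names, not the paper's code; a greedy left-to-right scan suffices for itemset-sequence containment, and filtering out contained sequences yields the maximal ones):

```python
def is_subseq(s, t):
    """True if sequence s is contained in sequence t: the itemsets of s map,
    in order, to supersets among the itemsets of t (greedy scan is safe)."""
    i = 0
    for itemset in t:
        if i < len(s) and s[i] <= itemset:
            i += 1
    return i == len(s)

def maximal_only(seqs):
    """Keep only sequences with no proper super-sequence in the collection."""
    return [s for s in seqs
            if not any(s != t and is_subseq(s, t) for t in seqs)]

big   = [frozenset("ad"), frozenset("d"), frozenset("bc")]  # {(a,d)(d)(b,c)}
small = [frozenset("a"), frozenset("d"), frozenset("bc")]   # {(a)(d)(b,c)}
print(is_subseq(small, big))   # True: {(a)(d)(b,c)} is not maximal here
```

Greedy matching is correct here because matching an itemset of s to the earliest possible superset in t never rules out a later match.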

Definition 2. After pruning nodes of the FS tree of a database D, if each leaf node of the new tree corresponds to a maximal frequent sequence in D, and every maximal frequent sequence in D corresponds to a leaf node in the new tree, the new tree is called the maximal lexicographic frequent sequence tree (MFS tree) of D.

For example, Figure 2 shows the MFS tree of the sample database in Table I.

Definition 3. In an FS tree, a branch in a subtree of node n (suppose the root of this subtree is node m, where m is a son of n) consists of the nodes (not including node n) on the path from m down to one of m's reachable leaf nodes, denoted Branch n_i (i denotes the leaf node). If Branch n_1 and Branch n_2 do not share nodes, they are called brother branches.

For example, in Figure 3, both Branch n_1 and Branch n_2 are brother branches of Branch n_3, but Branch n_1 and Branch n_2 are not brother branches, for they share node m.

B. Using a Divide and Conquer Strategy to Generate an MFS Tree

If the FS tree of a dataset has been generated, we can prune it to an MFS tree. This idea is discussed first; then we consider mining the MFS tree directly instead of generating the FS tree in advance. Let us consider two leaf nodes of an FS tree, denoted as

node 1 and node 2, their nearest common ancestor denoted as node n, and the root node of the FS tree denoted as r. Branch r_1 corresponds to sequence s1, Branch r_2 to s2, Branch n_1 to s_n1, and Branch n_2 to s_n2. Let s_n1 = {β1, β2, ..., βj} and s_n2 = {γ1, γ2, ..., γk}; then s1 = {α1, ..., αi} ◇ s_n1 and s2 = {α1, ..., αi} ◇ s_n2, with i, j, k ≥ 1. The function comp(s1, s2) judges whether s1 ⊆ s2: if yes it returns TRUE, else FALSE; its pseudocode is easy to write. The most obvious method to prune an FS tree to an MFS tree is to compare the sequence corresponding to branch r_i w.r.t. every leaf node i with the sequence corresponding to branch r_j w.r.t. every other leaf node j, and delete those sequences which are sub-sequences of others. We name this method "whole-comparison". But it is not effective, because it ignores the property that two different sequences may share the same prefix. According to Definition 3, it is obvious that Branch n_1 and Branch n_2 are brother branches.

Theorem 1. (Brother Branches Comparison)

At any node n in an FS tree:
(1) if s_n1 is an SES and s_n2 is an SES, then comp(s1, s2) ⇔ comp(s_n1, s_n2);
(2) if s_n1 is an SES and s_n2 is an IES, then comp(s1, s2) ⇔ comp(s_n1, s_n2 − γ1);
(3) if s_n1 is an IES and s_n2 is an SES, then comp(s1, s2) ⇔ comp((αi) ◇_i s_n1, s_n2);
(4) if s_n1 is an IES and s_n2 is an IES, then comp(s1, s2) ⇔ comp((αi) ◇_i s_n1, (αi) ◇_i s_n2).

Proof. Let αi = (x1, ..., xm), β1 = (y1, ..., yp), γ1 = (z1, ..., zq), with m, p, q ≥ 1.
(1) {α1, α2, ..., αi, β1, β2, ..., βj} ⊆ {α1, ..., αi, γ1, γ2, ..., γk} ⇔ {β1, β2, ..., βj} ⊆ {γ1, γ2, ..., γk}; therefore comp(s1, s2) ⇔ comp(s_n1, s_n2).
(2) If comp(s_n1, s_n2 − γ1) = TRUE, i.e., {β1, β2, ..., βj} ⊆ {γ2, ..., γk}, then it is obvious that {(x1, ..., xm), β1, β2, ..., βj} ⊆ {(x1, ..., xm, z1, ..., zq), γ2, ..., γk}; else if comp(s_n1, s_n2 − γ1) = FALSE, i.e., {β1, β2, ..., βj} ⊄ {γ2, ..., γk}, it is obvious that {(x1, ..., xm), β1, β2, ..., βj} ⊄ {(x1, ..., xm, z1, ..., zq), γ2, ..., γk}; therefore comp(s_n1, s_n2 − γ1) ⇔ comp((αi) ◇_s s_n1, (αi) ◇_i s_n2). Because at the ancestor node of node n both (αi) ◇_s s_n1 and (αi) ◇_i s_n2 are SESs, by (1) we can conclude that comp(s1, s2) ⇔ comp((αi) ◇_s s_n1, (αi) ◇_i s_n2). So comp(s1, s2) ⇔ comp(s_n1, s_n2 − γ1). Therefore, (2) is correct.

(3) and (4) can be proved in the same way as (2).

According to Theorem 1, only parts of the two sequences, instead of the whole sequences, need to be compared to determine their containment relationship. According to Definition 3, every two branches in different subtrees of a node are brother branches. Therefore, based on Theorem 1, the following branch-pruning process can be carried out.
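As a quick sanity check, case (1) of Theorem 1 can be exercised with a toy example (my own Python model of itemset sequences, not the paper's code): comparing the full sequences and comparing only the suffixes below node n give the same answer, so the shared prefix never needs to be re-examined.

```python
def is_subseq(s, t):
    """Greedy test: sequence s (list of frozensets) is contained in t."""
    i = 0
    for itemset in t:
        if i < len(s) and s[i] <= itemset:
            i += 1
    return i == len(s)

alpha = [frozenset("ad")]                    # shared prefix {(a,d)}
sn1   = [frozenset("b")]                     # suffix of branch 1: {(b)}
sn2   = [frozenset("bc"), frozenset("d")]    # suffix of branch 2: {(b,c)(d)}

full  = is_subseq(alpha + sn1, alpha + sn2)  # compare whole sequences
tails = is_subseq(sn1, sn2)                  # compare only the suffixes
print(full, tails)                           # prints: True True
```

For longer shared prefixes the saving grows accordingly, which is exactly the effect measured in Section IV.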

Lemma 1. (Using a Divide and Conquer Strategy to Generate an MFS Tree)

To prune an FS tree to an MFS tree, the process can go as follows: starting bottom-up from the subtrees that consist of only a leaf node in the FS tree, for any node n we only need to compare the sequences corresponding to every two branches in different subtrees of n, and delete those branches whose corresponding sequences are found to be sub-sequences in the above comparing process.

Based on Lemma 1, we can take full advantage of the common prefix between two sequences to avoid redundant comparisons. The idea of Lemma 1 is realized in Algorithm 1. For the FS tree of the sample database in Table I, the "whole-comparison" method performs 190 itemset comparisons, while MFSPAN performs only 88.

Algorithm 1: MFSPAN(root)
Input: the root node of the current tree T
Output: the MFS tree T

01. if root is a leaf node
02.     return;
03. for (i = 1; i ≤ n; i++)     // suppose the root node has n children
04.     MFSPAN(child[i]);
05. for (j = 1; j ≤ n; j++)     // each child of the current root corresponds to a subtree
06.     for (k = j + 1; k ≤ n; k++)
07.         CompareSubtree(subtree[j], subtree[k]);

It is easy to see that the recursion process of MFSPAN is like that of SPAM [8], one of the best-known algorithms for mining frequent sequential patterns. MFSPAN can therefore be performed by the depth-first search method and integrated into SPAM, and the new algorithm is easy to obtain.

IV. EXPERIMENTAL RESULTS

The experiments were run on a 2.7 GHz Intel PC with 512 MB main memory, running Microsoft Windows XP. All the code was written in C++ and compiled with Microsoft Visual C++ 6.0. The test datasets were generated by the IBM Quest Synthetic Data Generator [1], whose code can be downloaded at http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#assocSynData. Table II shows the major parameters of this generator and their meanings.

TABLE II

PARAMETERS FOR IBM DATA GENERATOR

Symbol   Option          Meaning
D        -ncust          Number of customers, in 000s
C        -slen           Average itemsets per sequence
T        -tlen           Average items per itemset
N        -nitems         Number of different items, in 000s
S        -seq.patlen     Average length of maximal sequences
I        -lit.patlen     Average length of itemsets within the maximal sequences

The data structure of all algorithms is as follows: an itemset is represented by a bitset, and a sequence is a linked list whose nodes are itemsets. We also integrated the "whole-comparison" method into SPAM, so the performance differences between MFSPAN and "whole-comparison" are caused only by the itemset comparing times. Figures 4-6 show the comparisons.
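The bitset idea can be illustrated in a few lines (hypothetical helper names; the paper's C++ bitset code is not shown): with an itemset encoded as an integer bitmask, the subset test used in every itemset comparison collapses to a single bitwise operation.

```python
def itemset_bits(items, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Encode an itemset as an integer bitmask, one bit per item."""
    mask = 0
    for it in items:
        mask |= 1 << alphabet.index(it)
    return mask

def is_subset(x, y):
    """x is a subset of y iff x has no bit set outside y."""
    return x & ~y == 0

bc, bcd = itemset_bits("bc"), itemset_bits("bcd")
print(is_subset(bc, bcd), is_subset(bcd, bc))   # prints: True False
```

This is why the itemset comparison count is the dominant cost that Figures 4-6 measure: each comparison is constant-time, so total time scales with how many comparisons the pruning strategy avoids.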

[Figure 4 plot: itemset comparing times of MFSPAN vs. whole-comparison on the small dataset at varying minimum supports]

Subtree[j] and subtree[k] are two different subtrees of the current temporary tree root. Every branch in a subtree corresponds to a postfix sequence, so a subtree is a set of postfix sequences. CompareSubtree() compares every pair of sequences drawn from the two different sequence sets, and removes from its own set any sequence that is a sub-sequence of another.

In practice, the comparisons need not wait until the whole FS tree has been constructed; they can be performed as soon as each subtree is constructed.


Fig. 4. Varying support for the small dataset.

1 The sequence corresponding to a branch should be modified according toTheorem 1.



[Figure 5 plot: itemset comparing times of MFSPAN vs. whole-comparison on dense dataset #1 at varying minimum supports]

Fig. 5. Varying support for dense dataset #1.

[Figure 6 plot: itemset comparing times of MFSPAN vs. whole-comparison on dense dataset #2 at varying minimum supports]

Fig. 6. Varying support for dense dataset #2.

In Figures 4-6, the differences are significant when the minimum support is lower, because at a lower support the average depth of the nodes in the FS tree is greater and there are more common prefixes in the tree, so MFSPAN avoids more redundant comparisons. Since MFSPAN exploits the property that two different sequences may share a common prefix to obtain the MFS set, it is clear that the more common prefixes there are between different sequences, the more effective MFSPAN is.

From the definition of maximal frequent sequences (MFS) and the analysis above, it can be concluded that the MFS set contains the necessary information while being much smaller than the set of all frequent sequences. Since in many applications one may encounter databases with a large number of sequential patterns and long sequences, as in DNA analysis or stock sequence analysis, the idea of mining the MFS set is very meaningful.

V. CONCLUSIONS AND DISCUSSIONS

A novel algorithm, MFSPAN, has been proposed to mine the set of all maximal frequent patterns in sequence databases. It addresses the problem that when frequent patterns are long, mining may generate an exponential number of frequent subsequences, which usually makes it difficult for decision-makers to obtain useful information. MFSPAN takes full advantage of the property that two different sequences may share a common prefix to obtain the MFS set. Both theory and experiments confirm the effectiveness of MFSPAN. Mining the MFS set is meaningful and full of challenges; more studies on it are expected.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grants No. 60175024 and 60433020, and by the Key Science-Technology Project of the National Education Ministry of China under Grant No. 02090.

REFERENCES

[1] R. Agrawal and R. Srikant, "Mining sequential patterns," in Proc. 11th Int'l Conf. on Data Engineering (ICDE 1995), Taipei, Taiwan, pp. 3-14, IEEE Computer Society Press, 1995.
[2] J. Han, J. Pei, and X. F. Yan, "From sequential pattern mining to structured pattern mining: a pattern-growth approach," Journal of Computer Science and Technology, vol. 19, pp. 257-279, May 2004.
[3] X. Yan, J. Han, and R. Afshar, "CloSpan: mining closed sequential patterns in large databases," in Proc. 3rd SIAM Int'l Conf. on Data Mining (SDM 2003), San Francisco, CA, USA, 2003.
[4] J. Wang and J. Han, "BIDE: efficient mining of frequent closed sequences," in Proc. 20th Int'l Conf. on Data Engineering (ICDE 2004), Boston, MA, pp. 79-90, IEEE Computer Society, 2004.
[5] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: a maximal frequent itemset algorithm for transactional databases," in Proc. 17th Int'l Conf. on Data Engineering (ICDE 2001), Heidelberg, Germany, pp. 443-452, IEEE Press, 2001.
[6] R. J. Bayardo, Jr., "Efficiently mining long patterns from databases," in Proc. ACM SIGMOD, 1998.
[7] R. Agarwal, C. Aggarwal, and V. V. V. Prasad, "A tree projection algorithm for generation of frequent itemsets," Journal of Parallel and Distributed Computing, pp. 350-371, 2000.
[8] J. Ayres, J. Flannick, J. Gehrke, et al., "Sequential pattern mining using a bitmap representation," in Proc. ACM SIGKDD 2002, Edmonton, Alberta, Canada, pp. 429-435, 2002.
