
Data Science and Pattern Recognition © 2019 ISSN 2520-4165

Ubiquitous International Volume 3, Number 2, July 2019

Finding Periodic-Frequent Patterns in Temporal Databases using Periodic Summaries

Rage Uday Kiran

National Institute of Information and Communications Technology
The University of Tokyo, Tokyo, Japan

uday [email protected]

Alampally Anirudh

International Institute of Information and Communications Technology-Hyderabad
Hyderabad, Telangana, India

[email protected]

Chennupati Saideep

International Institute of Information and Communications Technology-Hyderabad
Hyderabad, Telangana, India

[email protected]

Masashi Toyoda

The University of Tokyo, Tokyo, Japan
[email protected]

P. Krishna Reddy

International Institute of Information and Communications Technology-Hyderabad
Hyderabad, Telangana, India

[email protected]

Masaru Kitsuregawa

National Institute of Informatics
The University of Tokyo, Tokyo, Japan

[email protected]

Abstract. Periodic-frequent pattern mining is an important model in data mining. The popular adoption and successful industrial application of this model have been hindered by the following two limitations: (i) The periodic-frequent pattern model implicitly assumes that all transactions within the data occur at a uniform time interval. This assumption limits the model's applicability, as the transactions in many real-world databases occur at irregular time intervals. (ii) Finding periodic-frequent patterns in very large databases is a memory-intensive process because the mining algorithm has to maintain a list structure to record all timestamps at which an itemset has appeared in the whole data. This paper makes an effort to address these two limitations. A flexible model of periodic-frequent patterns in temporal databases is described to address the former issue. To address the latter issue, a novel concept known as period summary is introduced to effectively capture the temporal occurrence information of an itemset in a database. A new tree structure, called the Periodic Summary-tree (PS-tree), has been introduced to record the temporal occurrence information of an itemset in a temporal database. A pattern-growth algorithm has also been described to find all periodic-frequent patterns from the PS-tree. Experimental results demonstrate that the proposed algorithm is efficient.
Keywords: Data mining, knowledge discovery in databases, pattern mining, periodic patterns, temporal databases


1. Introduction. Frequent patterns are an important class of regularities that exist within the data. The problem of finding these patterns has been actively studied in knowledge discovery techniques such as association rule mining [1, 2], sequential pattern mining [3], classification [4], and clustering [5]. Since the rationale behind mining support-based frequent patterns is to find all patterns that appear frequently in a database, a huge number of patterns is normally generated, most of which might be found uninteresting depending on the application or user requirements. Moreover, the computational cost of finding such a huge number of patterns may not be trivial. As a result, finding user interest-based patterns using measures such as closed [6], maximal [7], K-most [8], utility [9], occupancy [10] and periodicity [11] has been proposed to reduce the desired result set effectively. The interesting frequent patterns generated using the periodicity measure are known as periodic-frequent patterns. In this paper, we focus on the efficient discovery of periodic-frequent patterns in very large databases.

Tanbeer et al. [11] introduced Periodic-Frequent Pattern Mining (PFPM) to discover all frequent patterns that occur at regular intervals in a transactional database. Since then, the problem of finding periodic-frequent patterns has received a great deal of attention [12–14]. A classic application of PFPM is market-basket analytics, where it analyzes how regularly the itemsets are purchased by the customers. An example of a periodic-frequent pattern is as follows:

{Bat, Ball} [support = 5%, periodicity = 1 hour] (1)

The above pattern says that 5% of the customers have purchased the items 'Bat' and 'Ball', and that the maximum duration between any two consecutive purchases containing both of these items is no more than an hour. This predictive behavior of the customers' purchases may facilitate the user in product recommendation and inventory management. Other real-world applications of periodic-frequent pattern mining include accident data analytics [15] and body sensor data analytics [16]. Mining periodic-frequent patterns has inspired other data mining tasks such as high-utility periodic pattern mining [17], recurring pattern mining [13] and regular pattern mining [16].

The popular adoption and successful industrial application of PFPM suffer from the following two key problems:

1. The model of periodic-frequent patterns implicitly assumes that all transactions within the data occur at a fixed time interval. This assumption limits the applicability of the model because transactions in many real-world applications occur irregularly within the data. A naïve solution to this issue involves transforming the irregular data into regular data (using some form of interpolation). Unfortunately, such a solution is inefficient because it can introduce a number of significant and hard-to-quantify biases [18], especially if there is high irregularity in the data. Recently, there has been a growing need for models which can find useful information in irregular data [15, 19–21].

2. The state-of-the-art periodic-frequent pattern mining algorithms maintain a list structure to record the occurrence information of every item in each transaction. The size of this list typically equals the database size (i.e., the total number of transactions within the database). Thus, finding periodic-frequent patterns in very large databases is a memory-intensive process.

This paper makes an effort to address the above-mentioned two problems. We describe a flexible model for finding periodic-frequent itemsets in temporal databases to address the first issue in PFPM. To address the second issue in PFPM, we introduce a novel concept, called period summary, to effectively record the temporal occurrence information of an itemset in a portion of a temporal database. Arithmetic operations on the period summaries determine whether a pattern is periodic or aperiodic within the data. A novel tree structure, called the Period Summary tree (PS-tree), has been introduced to record the temporal occurrence information of an itemset. A pattern-growth algorithm, called Periodic Summary growth (PS-growth), has been described to find all periodic-frequent patterns from the PS-tree. Using the PS-tree, improved performance is possible because the memory required to store period summaries is significantly less than the memory required to store the list of transaction identifiers. As a result, the proposed algorithm can be extended to mine periodic-frequent patterns from very large datasets. Experimental results on both synthetic and real-world databases demonstrate that the proposed approach is significantly more memory efficient and considerably more runtime efficient than the existing approaches.

In [22], we introduced the concept of period summary and the PS-growth algorithm to efficiently discover periodic-frequent patterns in a transactional database. In this paper, we first extend this study to temporal databases and provide theoretical correctness for the PS-growth algorithm. We also evaluate the performance of PS-growth by conducting extensive experiments on both synthetic and real-world databases.

The rest of the paper is organized as follows. Section 2 describes the related work on frequent pattern mining, periodic pattern mining and periodic-frequent pattern mining. Section 3 describes the model of periodic-frequent patterns in a temporal database. Section 4 briefly explains the periodic-frequent pattern-growth algorithm (PFP-growth) and its limitations. Section 5 introduces the concept of period summaries and the PS-growth algorithm. Section 6 reports the experimental results. Finally, Section 7 concludes the paper with future research directions.

2. Related Work.

2.1. Frequent pattern mining. Agrawal et al. [1] described a model to find frequent patterns in a transactional database. Since then, the problem of finding these patterns has received a great deal of attention [23–26]. The basic model used in most of these studies remains the same. It involves discovering all frequent patterns in a transactional database that satisfy the user-specified minimum support (minSup) constraint. The usage of a single minSup for the entire database leads to the following two problems:

• If minSup is set too high, we will miss the frequent patterns involving rare items.
• To find the frequent patterns involving both frequent and rare items, we have to set a low minSup. However, this may result in a combinatorial explosion, producing too many patterns, because frequent items can combine with one another in all possible ways and many of those combinations may be meaningless.

This dilemma is known as the "rare item problem" [27]. When confronted with this problem in real-world applications, researchers have tried to find frequent patterns using the concept of multiple minSups [27–29], where the minSup of a pattern is represented with the minimum item support (minIS) of its items. A major limitation of this extended model of frequent patterns is that it suffers from the open problem of determining the items' minIS values.

Brin et al. [30] introduced correlated pattern mining to address the rare item problem. The statistical measure χ² was used to discover correlated patterns. Since then, several interestingness measures have been discussed based on theories in probability, statistics, or information theory. Examples of these measures include all-confidence, any-confidence, bond [31] and kulc [32]. Each measure has its own selection bias that justifies the rationale for preferring one pattern over another. As a result, there exists no universally acceptable best measure to discover correlated patterns in any given database. Researchers are making efforts to suggest an appropriate measure based on user and/or application requirements [31, 33–35].

Recently, finding user interest-based patterns using measures such as closed [6], maximal [7], K-most [8], utility [9], occupancy [10] and periodicity [11] is gaining popularity as a means to reduce the desired result set of patterns. In the next subsection, we describe the studies which focused on finding periodic patterns using the periodicity measure.

2.2. Periodic pattern mining. Han et al. [36] introduced the periodic pattern¹ model to find temporal regularities in time series data. The model involves the following two steps: (i) segment the given time series into multiple period-segments such that the length of each period-segment is equal to the user-specified period (per), and (ii) discover all patterns that satisfy the user-specified minSup.

Example 2.1. Let I = {a, b, c, d, e} be the set of items and S = a{bc}baebacea{ed}d be a time series generated from I. If the user-defined period is 3, S is segmented into four period-segments such that each period-segment contains only 3 events (or itemsets). That is, PS1 = a{bc}b, PS2 = aeb, PS3 = ace and PS4 = a{ed}d. Let a∗b be a pattern, where '∗' denotes a wild-card (or don't care) character that can represent any itemset. This pattern appears in the period-segments PS1 and PS2. Therefore, its support is 2. If the user-specified minSup is 2, then a∗b represents a periodic pattern as its support is no less than minSup. In this example, braces for singleton itemsets have been omitted for brevity.

Han et al. [25] have discussed the Max-sub-pattern hitset algorithm to find periodic patterns. Chen et al. [37] developed a pattern-growth algorithm and showed that it outperforms the Max-sub-pattern hitset algorithm. Aref et al. [38] extended Han's model for the incremental mining of partial periodic patterns. Yang et al. [39] studied the change in the periodic behavior of a pattern due to noise and enhanced the basic model to discover a class of periodic patterns known as asynchronous periodic patterns. Zhang et al. [40] enhanced the basic model of partial periodic patterns to discover periodic patterns in character sequences such as protein data. Cao et al. [41] discussed a methodology to determine the period using auto-correlation. The popular adoption and successful industrial application of the partial periodic pattern model suffer from the following two issues:

• The usage of a single minSup for the entire time series leads to the rare item problem.
• The basic model of periodic patterns implicitly considers the data as an evenly spaced time series (i.e., all events within a series occur at a fixed time interval). This assumption limits the applicability of the model as events in many real-world time series datasets occur at irregular time intervals.

Yang et al. [39] used "information gain" as an alternative interestingness measure to address this problem. Chen et al. [37] extended Liu's model [27] to find periodic patterns in time series using multiple minSups. It has to be noted that these studies have focused on finding periodically occurring sets of itemsets in time series data, whereas the proposed study focuses on finding periodically occurring frequent itemsets in temporal databases.

¹The term 'pattern' in a time series represents a set of itemsets (or sets of items).

2.3. Periodic-frequent pattern mining. Ozden et al. [42] enhanced the transactional database with a time attribute that describes the time at which a transaction appeared, and investigated the periodic behavior of the patterns to discover cyclic association rules. In this study, a database is fragmented into non-overlapping subsets with respect to time. The association rules that appear in at least a certain number of subsets are discovered as cyclic association rules. Fragmenting the data and counting the number of subsets in which a pattern occurs greatly simplifies the design of the mining algorithm. However, the drawback is that patterns (or association rules) that span multiple windows cannot be discovered.

Tanbeer et al. [43] discussed a model to find periodic-frequent patterns in a transactional database. This model eliminates the need for data fragmentation and discovers all patterns in a transactional database that satisfy the user-specified minSup and maxPer constraints. A pattern-growth algorithm, called Periodic-Frequent Pattern-growth (PFP-growth), was also discussed to find these patterns. To improve the runtime of mining periodic-frequent patterns, Kiran and Kitsuregawa [44] suggested a greedy-search technique to determine the periodic interestingness of a pattern. Amphawan et al. [45] introduced approximate periodicity to reduce the memory requirements of mining periodic-frequent patterns. Amphawan's solution splits the transactional timeline into intervals of size maxPer, and interval information is stored only when a pattern occurs in that interval. The proposed solution, which efficiently discovers periodic-frequent patterns using period summaries, is different from Amphawan's solution in that the size of an interval is not restricted to maxPer and can expand as long as the transaction-ids merge.

Uday et al. [46] and Venkat et al. [47] have employed the concept of item-specific support and periodicity thresholds to address the rare item problem in periodic-frequent pattern mining. Rashid et al. [48] introduced the standard deviation as an alternative measure to maxPer. Nofong [49] employed the mean as an alternative measure to determine the periodic interestingness of a pattern. Philippe et al. [14] extended the basic model of periodic-frequent patterns to discover periodic patterns in multiple sequences. Since all the above-mentioned studies are extensions of the basic model of periodic-frequent patterns [43], they also employ a variant of the PFP-growth algorithm to find the interesting patterns. As a result, all of these studies suffer from the same basic issues as periodic-frequent pattern mining. Since this paper tries to address the fundamental problems in the basic model of periodic-frequent patterns, the proposed solutions can be extended to address the same problems in the above-mentioned related works.

Lin et al. [50] proposed a model to discover a class of user interest-based frequent patterns, called "up-to-date high utility patterns." An up-to-date pattern is a frequent pattern associated with a time interval in which it has appeared frequently within the data. It has to be noted that up-to-date patterns are conceptually different from periodic-frequent patterns, because the former need not appear at regular intervals within their corresponding time interval.

Recently, Uday et al. [20, 21] have studied the problem of finding periodic patterns in temporal databases. Though this model addresses the first problem of periodic-frequent pattern mining, it still suffers from the second problem, because its mining algorithm also maintains a list structure to record the temporal occurrence information of an itemset within the data. The proposed concept of period summary can be extended to discover periodic patterns in temporal databases; however, we confine this paper to the basic model of periodic-frequent patterns.

In the next section, we describe the proposed model of periodic-frequent patterns in temporal databases.

3. Model of Periodic-Frequent Patterns. Let I be the set of items. Let X ⊆ I be a pattern (or an itemset). A pattern containing β items is called a β-pattern. A transaction t_k = (ts, Y) is a tuple, where ts ∈ R represents the timestamp at which the pattern Y has occurred. A temporal database TDB over I is a set of transactions, TDB = {t_1, ..., t_m}, m = |TDB|, where |TDB| denotes the number of transactions in TDB.


Table 1. A running example of a temporal database

TS  Items            TS  Items            TS  Items            TS  Items
1   a, c, d, g       6   a, b, c, d       11  a, c, d, e       16  b, e, f, g
2   c, e, f          7   a, c, d, f       12  b, e, g          17  a, b, c, d, e
3   a, c, d          8   a, b, c, d       13  a, c, d, g       18  b, e, g
4   a, b, c, d, e    9   a, c, d, e       14  b, e, f          19  a, c, d, e, g
5   b, f             10  b, e, f, g       15  a, c, d          20  b, g

Let ts_min and ts_max denote the minimum and maximum timestamps in TDB, respectively. For a transaction t_k = (ts, Y), k ≥ 1, such that X ⊆ Y, it is said that X occurs in t_k (or t_k contains X), and such a timestamp is denoted as ts^X. Let TS^X = {ts^X_j, ..., ts^X_k}, j, k ∈ [1, m] and j ≤ k, be the ordered set of timestamps at which X has occurred in TDB. In this paper, we call this list of timestamps of X the ts-list of X. The number of transactions containing X in TDB is defined as the support of X and denoted as sup(X). That is, sup(X) = |TS^X|. Let ts^X_q and ts^X_r, j ≤ q < r ≤ k, be two consecutive timestamps in TS^X. The time difference (or inter-arrival time) between ts^X_r and ts^X_q is defined as a period of X, say p^X_a. That is, p^X_a = ts^X_r − ts^X_q. Let P^X = (p^X_1, p^X_2, ..., p^X_r) be the set of all periods of pattern X. The periodicity of X is defined as per(X) = max(p^X_1, p^X_2, ..., p^X_r). The pattern X is a frequent pattern if sup(X) ≥ minSup, where minSup refers to the user-specified minimum support constraint. The frequent pattern X is said to be periodic-frequent if per(X) ≤ maxPer, where maxPer refers to the user-specified maximum periodicity constraint. The redefined problem of periodic-frequent pattern mining involves discovering all patterns in TDB that satisfy the user-specified minSup and maxPer constraints. The support of a pattern can be expressed as a percentage of |TDB|. Similarly, the period and periodicity of a pattern can be expressed as a percentage of (ts_max − ts_min).

Example 3.1. Consider the temporal database shown in Table 1. The set of all items in this database is I = {a, b, c, d, e, f, g}. The set of items containing 'a', 'c' and 'd', i.e., '{a, c, d}' (or acd, in short), is a pattern. This pattern contains 3 items; therefore, it is a 3-pattern. In Table 1, the pattern acd appears in the transactions whose timestamps are 1, 3, 4, 6, 7, 8, 9, 11, 13, 15, 17 and 19. Therefore, TS^acd = {1, 3, 4, 6, 7, 8, 9, 11, 13, 15, 17, 19}. The support of this pattern is sup(acd) = |TS^acd| = 12. The periods for this pattern are 1 (= 1 − ts_i), 2 (= 3 − 1), 1 (= 4 − 3), 2 (= 6 − 4), 1 (= 7 − 6), 1 (= 8 − 7), 1 (= 9 − 8), 2 (= 11 − 9), 2 (= 13 − 11), 2 (= 15 − 13), 2 (= 17 − 15), 2 (= 19 − 17) and 1 (= ts_l − 19), where ts_i = 0 represents the timestamp of the initial transaction and ts_l = 20 represents the timestamp of the last transaction in the temporal database. The periodicity of acd is per(acd) = max(1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1) = 2. If the user-specified minSup = 10 and maxPer = 4, the pattern acd is a periodic-frequent pattern because sup(acd) ≥ minSup and per(acd) ≤ maxPer. (Note: the key difference between the transaction identifier (tid) of a transaction and its timestamp is as follows: each transaction must have a unique tid, whereas multiple transactions can share a common timestamp. For ease of describing PFP-growth in the next section, we have assigned unique timestamps to every transaction in Table 1.)
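To make these definitions concrete, the following self-contained C++ sketch (illustrative code with our own names, not the authors' implementation) computes the support and periodicity of a pattern from its ts-list, including the boundary periods from ts_i to the first occurrence and from the last occurrence to ts_l, and tests the pattern against minSup and maxPer.

#include <algorithm>
#include <iostream>
#include <vector>

struct PFResult {
    int support;
    int periodicity;
    bool periodicFrequent;
};

// Compute sup(X) and per(X) from the sorted ts-list of X.
PFResult evaluate(const std::vector<int>& tsList, int tsMin, int tsMax,
                  int minSup, int maxPer) {
    PFResult r{static_cast<int>(tsList.size()), 0, false};
    int prev = tsMin;                                        // timestamp of the initial transaction
    for (int ts : tsList) {                                  // inter-arrival times p = ts_r - ts_q
        r.periodicity = std::max(r.periodicity, ts - prev);
        prev = ts;
    }
    r.periodicity = std::max(r.periodicity, tsMax - prev);   // closing boundary period
    r.periodicFrequent = (r.support >= minSup) && (r.periodicity <= maxPer);
    return r;
}

int main() {
    // TS^acd from Table 1, with ts_i = 0 and ts_l = 20.
    std::vector<int> acd = {1, 3, 4, 6, 7, 8, 9, 11, 13, 15, 17, 19};
    PFResult r = evaluate(acd, 0, 20, /*minSup=*/10, /*maxPer=*/4);
    std::cout << "sup=" << r.support << " per=" << r.periodicity
              << " periodic-frequent=" << std::boolalpha << r.periodicFrequent << "\n";
}

Running this on TS^acd reproduces Example 3.1: sup = 12, per = 2, and acd is reported as periodic-frequent.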

4. PFP-growth and Its Issues. Tanbeer et al. [11] described the PFP-growth algorithm to find all periodic-frequent patterns in a transactional database.


Figure 1. Construction of PF-list. (a) After scanning the first transaction. (b) After scanning the second transaction. (c) After scanning all transactions. (d) Final sorted list of the periodic-frequent items of size 1: c (s = 13, p = 2, idl = 19), a (s = 12, p = 2, idl = 19), d (s = 12, p = 2, idl = 19), b (s = 11, p = 4, idl = 20).

Figure 2. Construction of PF-tree. (a) PF-list. (b) After scanning the first transaction. (c) After scanning the second transaction. (d) After scanning all transactions.

Briefly, the algorithm involves the following two steps: (i) construction of the Periodic Frequent-tree (PF-tree) and (ii) mining of the PF-tree. Before we discuss these two steps, we describe the structure of the PF-tree. As an example, we consider the entries in Table 1 as a transactional database with timestamps serving as transaction identifiers. Let us assume that the user-specified minSup = 10 (= 50%) and maxPer = 4 (= 20%).

4.1. Structure of PF-tree. The structure of the PF-tree resembles that of the Frequent Pattern tree (FP-tree) [51]. It contains a PF-list and a prefix-tree. A PF-list consists of three fields: item (i), support (f) and periodicity (p). The items in the PF-tree are sorted in descending order of support to facilitate high compactness. Two types of nodes are maintained in the PF-tree: ordinary node and tail-node. The former is similar to the node used in an FP-tree, whereas the latter, which is the node of the last item of a sorted transaction, explicitly maintains the transaction-ids (tids) of the transactions it represents; these tids are stored only at the tail-node of every transaction.

4.2. Construction of PF-tree. The periodic-frequent patterns satisfy the downward closure property [11]. That is, all non-empty subsets of a periodic-frequent pattern are also periodic-frequent patterns. Hence, the periodic-frequent items (or 1-patterns) play a key role in the efficient discovery of periodic-frequent patterns. These items are generated by scanning the database and populating the PF-list as per the steps given in Algorithm 1. Figures 1(a), 1(b) and 1(c) show the PF-list generated after scanning the first transaction, the second transaction and all transactions in the database, respectively. Figure 1(d) shows the final PF-list generated after pruning all items that fail to satisfy either the minSup or the maxPer constraint.


Algorithm 1 PF-list(TDB, maxPer, minSup)

1: for each transaction t_cur ∈ TDB do
2:   for each item i in t_cur do
3:     if t_cur is i's first occurrence then
4:       Set sup_i = 1, p_i = ts_cur and id^i_l = ts_cur
5:     else
6:       Set sup_i += 1, p^i_cur = ts_cur − id^i_l and id^i_l = ts_cur
7:       if p^i_cur > p_i then
8:         p_i = p^i_cur
9: After scanning the entire database, the periodicity of all items in the list is recalculated using the timestamp of the last transaction to reflect the correctness.
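For illustration, a compact C++ rendering of the same single-scan procedure is given below (a sketch under our own naming conventions, not the authors' code). Per item it records the support, the last seen timestamp and the largest inter-arrival time; the final pass closes the last period against the last timestamp and prunes items that violate minSup or maxPer.

#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct PFEntry {
    int support = 0;
    int periodicity = 0;
    int lastTs = 0;   // id_l: timestamp of the item's latest occurrence (0 = initial transaction)
};

using Transaction = std::pair<int, std::vector<std::string>>;  // (timestamp, items)

// Single scan of the database followed by the final periodicity update and pruning.
std::map<std::string, PFEntry> buildPFList(const std::vector<Transaction>& tdb,
                                           int tsMax, int minSup, int maxPer) {
    std::map<std::string, PFEntry> pfList;
    for (const Transaction& t : tdb) {
        for (const std::string& item : t.second) {
            PFEntry& e = pfList[item];                 // zero-initialized on first occurrence
            e.periodicity = std::max(e.periodicity, t.first - e.lastTs);
            e.lastTs = t.first;
            ++e.support;
        }
    }
    for (auto it = pfList.begin(); it != pfList.end();) {
        it->second.periodicity = std::max(it->second.periodicity, tsMax - it->second.lastTs);
        if (it->second.support < minSup || it->second.periodicity > maxPer)
            it = pfList.erase(it);                     // prune items failing minSup or maxPer
        else
            ++it;
    }
    return pfList;
}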

In the second database scan, the items in the PF-list take part in the construction of the PF-tree. The tree construction starts by inserting the first transaction, (1, acdg), in PF-list order, as shown in Figure 2(b). Items that are not present in the PF-list are ignored while inserting a transaction into the prefix tree. The tail-node 'd' carries the transaction-id of the first transaction, 'd : [1]'. The second transaction, (2, cef), is inserted into the tree with node 'c : [2]' as the tail-node (Figure 2(c)). The process is repeated for the remaining transactions in the database. The final PF-tree generated after scanning the entire database is shown in Figure 2(d).

4.3. Mining of PF-tree. PFP-growth employs the following steps to discover periodic-frequent patterns from the PF-tree:

– Choosing the last item i in the PF-list (Figure 2(a)) as an initial suffix item, its prefix-tree (denoted as PT_i) is constructed. It consists of the prefix sub-paths of the nodes labeled i.

Figure 3. Mining using the PFP-growth algorithm. (a) PF-tree after removing 'b'. (b) Prefix tree of 'b'. (c) Conditional tree of 'b'.

Figure 4. Mining using the PFP-growth algorithm. (a) PF-tree after removing 'd'. (b) Prefix tree of 'd'. (c) Conditional tree of 'd'.


Figure 5. Occurrence timeline for pattern 'cad' (timestamps 1, 3, 7, 9, 11, 13, 15 and 19 marked on a 0–20 timeline).

Figure 6. Occurrence timeline for pattern 'cad' using the notion of period summary (the same occurrences summarized as ⟨1, 19, 4⟩).

– For each item j in PT_i, we aggregate all of its nodes' transaction-id lists to derive the transaction-id list of the pattern ij, i.e., TID^ij. Then, we determine whether ij is a periodic-frequent pattern by comparing its support and periodicity against minSup and maxPer, respectively. If ij is a periodic-frequent pattern, then we consider j to be periodic-frequent in PT_i.

– Choosing every periodic-frequent item j in PT_i, we construct its conditional tree, CT_ij, and mine it recursively to discover the patterns.

– After finding all periodic-frequent patterns for a suffix item i, we prune it from the original PF-tree and push the corresponding nodes' transaction-id lists to their parent nodes. We repeat the above steps until the PF-list becomes empty. Figure 3 and Figure 4 show the mining process for item 'b' and item 'd'.

4.4. Performance issues of PFP-growth. The space complexity of constructing a PF-tree is O(n + |TDB|), where n represents the total number of nodes generated in the PF-tree and |TDB| represents the total number of transactions in TDB. The databases of many real-world applications (e.g., eCommerce and Twitter) contain millions of transactions. In other words, |TDB| can be a very large number, and the size of the PF-tree depends on the number of transactions. So, the applicability of the existing algorithm for mining periodic-frequent patterns is constrained by the size of the transactional database. Moreover, since we construct a conditional tree for every pattern we extract, the amount of main memory consumed is directly proportional to the number of patterns. So, the investigation of efficient approaches is a research issue.

In the next section, we describe our solution to address the above issue of PFP-growth.

5. Proposed Approach.

5.1. Basic idea. It can be observed that, during the construction of the PF-tree, the entire ts-list (or tid-list) of a pattern gets fragmented into multiple distinct sub-lists at different tail-nodes of the PF-tree.

Example 5.1. For the pattern b in Table 1, TS^b = {4, 5, 6, 8, 10, 12, 14, 16, 17, 18, 20}. When we compress the database into the PF-tree, this list of timestamps of 'b' gets divided between the two branches 'cadb' and 'b' (Figure 2(d)), and the corresponding sub-lists at the tail-nodes are {4, 6, 8, 17} and {5, 10, 12, 14, 16, 18, 20}, respectively.


Based on this observation, we have come up with the following idea: "instead of storing the list of timestamps (or tids), it is sufficient to store the crucial information pertaining to the intervals in which an itemset has appeared periodically in a subset of the data." Let us say a pattern X is periodic in the range [a, b], i.e., the difference between any two consecutive occurrences of X in that range is no more than maxPer. Then it is enough to store the interval extremes [a, b], denoting that the pattern is periodic in that range.

Based on this idea, we propose two concepts: the concept of period summary, which is employed to build the tree, and the process of merging period summaries, which is used to extract the periodic-frequent patterns.

The concept of period summary is defined as follows.

Definition 5.1. A period summary of a pattern X, ps^X_i, captures an interval in which the pattern has appeared periodically in the data, together with the periodicity of the pattern within that interval. That is, ps^X_i = ⟨ts^X_j, ts^X_k, per^X_i⟩, where ts^X_j and ts^X_k, 1 ≤ j ≤ k ≤ |TDB|, represent the first and last timestamps, respectively, of the range in which the pattern has appeared periodically in a subset of the database, and per^X_i is the periodicity of the pattern within that interval. We refer to ts^X_j as first and ts^X_k as last of the respective interval. Let PS^X = {ps^X_1, ps^X_2, ..., ps^X_k}, 1 ≤ k ≤ |TDB|, denote the complete set of period summaries of X at any tail-node.

Example 5.2. Continuing with the previous example, the sub-lists of 'b' result in the following period summaries at their respective tail-nodes: PS^cadb = {⟨4, 8, 2⟩, ⟨17, 17, 0⟩} (b along with c, a and d) and PS^b = {⟨5, 5, 0⟩, ⟨10, 20, 2⟩} (b in isolation). The first element of PS^cadb says that 'cadb' has occurred with a periodicity of 2 in the sub-database whose timestamps range from 4 to 8. The second element of PS^cadb says that 'cadb' has occurred with a periodicity of 0 in the sub-database whose timestamps range from 17 to 17.

Figure 5 and Figure 6 show how the occurrences of the pattern 'cad' are stored in the existing methods and in the proposed method, respectively.

In the mining phase, the period summaries at different tail-nodes have to be merged to generate the final period summary of a pattern, which determines whether the pattern X is periodic or not. We encounter the following cases while merging two intervals:

1. One interval is a subset of the other interval.
2. The two intervals overlap.
3. The two intervals are non-overlapping, and the gap between them is no more than maxPer.
4. The two intervals are non-overlapping, and the gap between them is greater than maxPer.

In the first three cases, we merge the two intervals and store the extended interval in the final period summary. In the fourth case, we store both intervals separately in the final period summary. The periodicity of the merged element is calculated as follows:

per^X_k = max(per^X_i, per^X_j, ts^X_j(f) − ts^X_i(l))

In the above equation, per^X_i and per^X_j are the periodicity values of the i-th and j-th period summaries being merged, ts^X_i(l) is the last timestamp of the (earlier) i-th period summary, and ts^X_j(f) is the first timestamp of the (later) j-th period summary.
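The merging rule can be sketched in C++ as follows (illustrative code with assumed names, not the authors' implementation); it folds a new summary into the running result, fusing intervals whose gap is at most maxPer and keeping them separate otherwise.

#include <algorithm>
#include <vector>

struct PeriodSummary {
    int first;   // ts_j^X: first timestamp of the periodic interval
    int last;    // ts_k^X: last timestamp of the periodic interval
    int per;     // periodicity of X inside [first, last]
};

// Fold summary 'b' into 'out'; summaries are assumed to arrive in increasing
// order of their first timestamps.
void mergeSummary(std::vector<PeriodSummary>& out, const PeriodSummary& b, int maxPer) {
    if (out.empty()) { out.push_back(b); return; }
    PeriodSummary& a = out.back();
    if (b.last <= a.last) {
        a.per = std::max(a.per, b.per);                    // case 1: b lies inside a
    } else if (b.first - a.last <= maxPer) {               // cases 2 and 3: overlap or small gap
        a.per = std::max({a.per, b.per, b.first - a.last});
        a.last = b.last;
    } else {
        out.push_back(b);                                  // case 4: gap exceeds maxPer
    }
}

Applied to the summaries of Example 5.2, processed in increasing order of their first timestamps, this sketch yields the single summary ⟨4, 20, 2⟩, matching Example 5.3 below.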

Example 5.3. The merging starts with pointers P1 and P2 pointing to the start of PS^cadb and PS^b, respectively. Since ⟨5, 5, 0⟩ of P2 is a subset of ⟨4, 8, 2⟩ (4 < 5 < 8), we increment P2 and point it to ⟨10, 20, 2⟩. Now, the intervals ⟨4, 8, 2⟩ and ⟨10, 20, 2⟩ are non-overlapping and the difference between the leftmost timestamp of P2 and the rightmost timestamp of P1 is no more than maxPer (case 3), i.e., 10 − 8 = 2 < maxPer (= 4). Therefore, we merge the intervals to form ⟨4, 20, 2⟩ and add this element to the final period summary. The periodicity of the added element is max(2, 2, 2) = 2. Now, we move pointer P1 to ⟨17, 17, 0⟩. P1 is a subset of ⟨4, 20, 2⟩. Hence, the final period summary of 'b' merges into the single element {⟨4, 20, 2⟩}.

Let PS be the final period summary. To check whether a pattern is periodic or not, we check three conditions on PS:

• PS.size() = 1
• PS[0].first ≤ maxPer
• (ts_max − PS[0].last) ≤ maxPer

Example 5.4. Continuing with the previous example, the final period summary of 'b' is {⟨4, 20, 2⟩}. Since it satisfies all three of the above conditions (i.e., size = 1, 4 ≤ 4 and (20 − 20) = 0 ≤ 4), we say that the pattern 'b' is periodic in the database.

This optimization helps in reducing the memory required for constructing the tree. The advantage is even greater if the original list of timestamps of X, i.e., TS^X, does not split into multiple branches, as we then store just one element in the tail-node, implying that the pattern is periodic over the entire range of the database. In dense databases, this optimization can significantly decrease the memory needed to store the tree. We now discuss the proposed PS-growth algorithm based on this optimization.

The PS-growth algorithm involves the following two steps: (i) compress the temporal database into a PS-tree and (ii) recursively mine the PS-tree to enumerate all periodic-frequent patterns. Before we discuss these two steps, we describe the structure of the PS-tree.

5.2. Structure of PS-tree. The Period Summary tree contains a PF-list and a summarized prefix-tree. The structure of the PS-list is similar to that of the PF-list in the PF-tree (see Section 4). However, the structure of the prefix-tree in the PS-tree is different from that of the prefix-tree in the PF-tree. Two types of nodes are maintained in the PS-tree: ordinary node and tail-node. The former is similar to the node used in an FP-tree, whereas the latter represents the last item of a sorted transaction. In the PS-tree, we maintain the period summaries of the occurrences of that branch's items only at the tail-node of every transaction. The tail-node structure maintains (i) the ps-list (i.e., the timestamp list summarized using the period summaries explained in Definition 5.1) and (ii) the support of that branch's items. In other words, the structure of a tail-node in the PS-tree is: i : ⟨ts^i_a, ts^i_b, per^i_k⟩; ⟨ts^i_c, ts^i_d, per^i_l⟩; ...; ⟨ts^i_e, ts^i_f, per^i_m⟩, sup(i), where a, b, c, d, e, f, k, l, m ∈ R+.
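The node layout described above can be pictured with the following C++ sketch (field names are ours, chosen for illustration, not the authors' data structures); ordinary nodes carry only the prefix structure, while a tail-node additionally carries the branch's ps-list and support.

#include <map>
#include <string>
#include <vector>

struct PeriodSummary { int first, last, per; };            // one ps entry: (ts_j, ts_k, per)

struct PSNode {
    std::string item;
    PSNode* parent = nullptr;
    std::map<std::string, PSNode*> children;
    PSNode* nodeLink = nullptr;                            // next node carrying the same item

    // Tail-node payload; both members stay empty/zero for ordinary nodes.
    std::vector<PeriodSummary> psList;                     // summarized timestamp list
    int support = 0;                                       // support of the branch's items
};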

5.3. Construction of PS-tree. In the first database scan, PS-growth scans the database contents and constructs the PF-list to discover the periodic-frequent items of size 1. The construction of the PF-list is shown in Algorithm 1. The final PF-list generated after scanning the entire database of Table 1 is shown in Figure 1(d).

In the second database scan, the items in the PF-list take part in the construction of the PS-tree. The tree construction starts by inserting the first transaction, (1, acdg), in PF-list order, as shown in Figure 7(b). All the items in the transaction are inserted in the same order as in the PS-list, except 'g', which is not in the list. The tail-node 'd : [⟨1, 1, 0⟩], 1' carries the summarized timestamp list and the support (see Figure 7(b)). In a similar fashion, the second transaction, (2, cef), is also inserted into the tree.


Figure 7. Construction of PS-tree. (a) PS-list. (b) After scanning the first transaction. (c) After scanning the second transaction. (d) After scanning all transactions.

Figure 8. Mining using the PS-growth algorithm. (a) PS-tree after removing 'b'. (b) Prefix tree of 'b'. (c) Conditional tree of 'b'.

Figure 9. Mining using the PS-growth algorithm. (a) PS-tree after removing 'd'. (b) Prefix tree of 'd'. (c) Conditional tree of 'd'.

The tail-node structure for the second transaction would be 'c : [⟨2, 2, 0⟩], 1' (see Figure 7(c)). The final PS-tree generated after scanning the entire database is shown in Figure 7(d).

Example 5.5. Consider the tail-node structure of 'd' from the branch 'cad'. The occurrences of 'cad' along this branch in TDB are {1, 3, 7, 9, 11, 13, 15, 19}, and maxPer = 4. For the first transaction, the list is {⟨1, 1, 0⟩}, 1. For the third transaction, we check the condition 1 + 4 (= maxPer) ≥ 3. Since it is satisfied, the list's last element is updated to {⟨1, 3, 2⟩}, 2. In a similar fashion, the list gets updated for each occurrence of the pattern 'cad'. After scanning all the transactions, the tail-node structure of 'd' in 'cad' is {⟨1, 19, 4⟩}, 8.
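A minimal C++ sketch of this incremental update (illustrative, with assumed names, not the authors' code) is shown below: a new timestamp either extends the last period summary, updating its periodicity, or opens a fresh summary when the gap exceeds maxPer.

#include <algorithm>
#include <vector>

struct PeriodSummary { int first, last, per; };

struct TailNodePayload {
    std::vector<PeriodSummary> psList;
    int support = 0;

    // Called once per occurrence of the branch's itemset, in timestamp order.
    void addTimestamp(int ts, int maxPer) {
        ++support;
        if (!psList.empty() && ts - psList.back().last <= maxPer) {
            PeriodSummary& s = psList.back();
            s.per = std::max(s.per, ts - s.last);          // extend the current interval
            s.last = ts;
        } else {
            psList.push_back({ts, ts, 0});                 // start a new interval
        }
    }
};

Feeding the timestamps 1, 3, 7, 9, 11, 13, 15 and 19 with maxPer = 4 produces {⟨1, 19, 4⟩} with support 8, as in Example 5.5.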

The PS-tree maintains the complete information of all periodic-frequent patterns in a database. The correctness is based on Property 5.1 and shown in Lemmas 5.1 and 5.2. For each transaction t ∈ TDB, PFP(t) is the set of all periodic-frequent items in t, i.e., PFP(t) = item(t) ∩ PFP, and is called the candidate item projection of t.

Property 5.1. A PS-tree maintains a complete set of candidate item projections for eachtransaction in a database only once.

Lemma 5.1. Given a temporal database (TDB) and the user-defined minimum support (minSup) and maximum periodicity (maxPer) constraints, the complete set of all periodic-frequent item projections of all transactions in the TDB can be derived from the PS-tree.

Proof. Based on Property 5.1, each transaction t ∈ TDB is mapped to only one path in the tree, and any path from the root up to a tail node maintains the complete projection for exactly n transactions (where n is the total number of entries (or support) in the ps-list of the tail node).

Lemma 5.2. The size of a PS-tree (without the root node) on a TDB for user-defined minSup and maxPer thresholds is bounded by ∑_{t∈TDB} |PFP(t)|.

Proof. According to the PS-tree construction process and Lemma 5.1, each transaction t contributes at most one path of size |PFP(t)| to the PS-tree. Therefore, the total size contribution of all transactions is at most ∑_{t∈TDB} |PFP(t)|. However, since there are usually many common prefix patterns among the transactions, the size of the PS-tree is normally much smaller than ∑_{t∈TDB} |PFP(t)|.

Before we discuss the mining of the PS-tree, we explore the following important property and lemma of a PS-tree.

Property 5.2. A tail node in a PS-tree maintains the occurrence information for all the nodes in the path (from the tail node to the root) at least in the transactions in its ps-list (period summary list).

Lemma 5.3. Let Z = {a_1, a_2, ..., a_n} be a path in a PS-tree where node a_n is the tail node carrying the ps-list of the path. If the ps-list is pushed up to node a_{n−1}, then a_{n−1} maintains the occurrence information of the path Z' = {a_1, a_2, ..., a_{n−1}} for the same set of transactions in the ps-list without any loss.

Proof. Based on Property 5.2, a_n maintains the occurrence information of path Z' at least in the transactions in its ps-list. Therefore, the same ps-list at node a_{n−1} maintains the same transaction information for Z' without any loss.

5.4. Mining of PS-tree. The mining process starts by considering the last item i in the PF-list (the item with the least support) and constructing its prefix tree (PT_i), which consists of the prefix sub-paths of the nodes labeled i. We update the support and periodicity values by merging the ps-lists (Algorithm 3) of each item in the PF-list while traversing the item node pointers. All items in the PF-list whose support is no less than minSup and whose periodicity is no more than maxPer are used to build the conditional tree CT_i (Figure 8(c)).

For each item j in PT_i, we aggregate all of its nodes' ps-lists to derive a summarized ps-list for the pattern ij, i.e., the summarized TS^ij. Next, we check whether the final ps-list is periodic or aperiodic using Algorithm 4. If j satisfies Algorithm 4 and its support is no less than minSup, we consider j to be periodic-frequent in PT_i. Choosing every periodic-frequent item j in PT_i, we construct its conditional tree, CT_ij, and mine it recursively to discover the patterns.

After finding all periodic-frequent patterns for a suffix item i, we prune i from the original PS-tree and push the corresponding nodes' ps-lists to their parent nodes. The above steps are repeated until the main PF-list in the original PS-tree becomes NULL.


Algorithm 2 PS-growth(PS-tree, α)

1: Select the last element in the PS-list.
2: for each a_i in the header of Tree do
3:   Generate pattern β = a_i ∪ α.
4:   Aggregate all of a_i's summarized tid-lists into PS^β, using Algorithm 3.
5:   if PS^β.support ≥ minSup and Check(PS^β) then
6:     Construct β's conditional PS-tree, Tree_β.
7:     if Tree_β ≠ ∅ then
8:       Call PS-growth(Tree_β, β)
9:   Remove a_i from the tree.
10:  Push a_i's summarized tid-list to its parent nodes using Algorithm 3.

Algorithm 3 Aggregating Intervals(I1, I2)

1: Initialize interval vector I3
2: if I1.first > I2.first then
3:   swap(I1, I2)
4: if I1.last > I2.last then
5:   I3.append(I1.first, I1.last)
6: else if I1.last + maxPer ≥ I2.first then
7:   I3.append(I1.first, I2.last)
8: else
9:   I3.append(I1.first, I1.last)
10:  I3.append(I2.first, I2.last)
11: return I3

PFP-growth and PS-growth generate the same set of periodic-frequent patterns. However, it has to be noted that PS-growth may fail to determine the actual periodicity value of a pattern because the period summaries do not preserve that information.

Algorithm 4 Check(I)

1: if I.size() = 1 then
2:   if I[0].first ≤ maxPer then
3:     if |TDB| − I[0].last ≤ maxPer then
4:       return TRUE
5: return FALSE
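For reference, Algorithm 4 amounts to the following constant-time C++ check (a sketch with assumed names, not the authors' code); here tsMax plays the role of |TDB|, since the running example assigns unique, consecutive timestamps.

#include <vector>

struct PeriodSummary { int first, last, per; };

// A pattern is periodic iff its merged summaries collapsed into one interval
// that starts within maxPer of the beginning of the timeline and ends within
// maxPer of its end.
bool isPeriodic(const std::vector<PeriodSummary>& ps, int tsMax, int maxPer) {
    return ps.size() == 1 &&
           ps[0].first <= maxPer &&
           (tsMax - ps[0].last) <= maxPer;
}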

6. Experimental Results. In this section, we compare the performance of the proposed approach (PS-growth) with the existing approaches (PFP-growth [11], PFP-growth++ [44] and ITL-tree [45]). The structure of PF-tree++ is the same as the structure of the PF-tree. Since PFP-growth++ focuses only on improving the runtime of extracting periodic-frequent patterns, only its runtime is compared with the proposed approach. All the algorithms were written in GNU C++ and run with CentOS 7.1 on a 3.00 GHz machine with 8 GB of memory. The runtime specifies the total execution time, i.e., CPU and I/O. Details from /proc/<pid>/stat are used to compute the process's memory usage (RAM).


Figure 10. Comparative analysis by varying maxPer. Panels (1)–(3): number of patterns; (4)–(6): tree memory (MB); (7)–(9): total memory (MB); (10)–(12): runtime (secs) of PS-growth, PFP-growth, PFP-growth++ and ITL-growth on the Mushroom (minSup = 5%), Twitter (minSup = 70%) and Retail (minSup = 0.5%) datasets.


Figure 11. Comparative analysis by varying minSup. Panels (1)–(3): number of patterns; (4)–(6): tree memory (MB); (7)–(9): total memory (MB); (10)–(12): runtime (secs) of PS-growth, PFP-growth, PFP-growth++ and ITL-growth on the Mushroom (maxPer = 10%), Twitter (maxPer = 0.5%) and Retail (maxPer = 10%) datasets.


6.1. Datasets description. The real-world datasets mushroom, twitter, retail and kosarak have been used for conducting the experiments. The mushroom dataset is obtained from the UCI Machine Learning Repository [52]. It is a dense dataset containing 8,124 transactions and 119 distinct items. The twitter dataset is provided by Kiran et al. [13] and contains hashtags that appeared on May 1st. It covers 12 hours of data with 43,200 transactions and 44,201 items. The retail dataset is a large-scale sparse dataset containing 88,162 transactions and 16,470 items. It is provided by Brijs et al. [53] and contains market-basket data of an anonymous Belgian retail supermarket store. Kosarak is a very large real-world sparse dataset with 990,002 transactions and 41,270 distinct items. In this paper, we use this dataset to study the scalability of the PS-growth and PFP-growth/PFP-growth++ algorithms.

6.2. Performance Evaluation. The performance evaluation was done on three factors: tree memory, total memory and time consumed. The total memory consumed is usually much higher than the tree memory because we construct a conditional tree for every pattern we mine.

Figures 10(1)–10(3) show the number of periodic-frequent patterns generated in the various datasets at different maxPer values. It can be observed that an increase in maxPer results in an increase in the number of periodic-frequent patterns because more patterns satisfy the periodicity condition.

Figures 10(4)–10(6) show the memory requirements of PS-tree, ITL-tree and PF-tree (or PF-tree++) for the different datasets at different maxPer values. Figures 10(7)–10(9) show the total memory consumed by the PS-growth, ITL-growth and PFP-growth (or PFP-growth++) algorithms for the various datasets at different maxPer values. Figures 10(10)–10(12) show the total runtime consumed to discover periodic-frequent patterns by the PS-growth, ITL-growth, PFP-growth and PFP-growth++ algorithms at different maxPer values.

As maxPer increases, two factors decide the memory/time consumption: the number of period summaries stored and the number of patterns generated. As maxPer increases, the number of period summaries stored by the proposed approach decreases, resulting in lower memory consumption. At the same time, the number of patterns increases, resulting in higher memory consumption. Considering these two factors, the memory/time consumption of the proposed approach may either increase or decrease, depending on the dataset.

Overall, it can be observed from Figure 10 that PS-growth is memory and runtime efficient as compared to the existing algorithms. On the dense datasets, mushroom and twitter, PS-growth outperforms PFP-growth (or PFP-growth++) by a wide margin. However, on sparse datasets like retail, PS-growth outperforms PFP-growth (or PFP-growth++) by a narrow margin.

Figure 11 shows how the memory/time consumed change at different minSup values. It can be observed that an increase in minSup results in a decrease in the number of periodic-frequent patterns generated. Thus, the tree memory, total memory and runtime decrease as minSup increases. It can be observed in Figures 11(5), 11(8) and 11(11) that the PFP-growth algorithm encountered a "memory out of bounds" exception and therefore failed to extract periodic-frequent patterns for minSup < 65%.

6.3. Scalability test. We divided the Kosarak database into five portions of 0.2 million transactions each. We then investigated the performance of PF-tree and PS-tree by accumulating each portion with the previous parts and mining periodic-frequent patterns each time. The experimental results are shown in Figure 12. It is clear from Figure 12 that, as the database size increases, the memory requirement increases in both approaches.


Figure 12. Memory requirements of PF-tree and PS-tree at different database sizes (x-axis: dataset size in transactions (100k); y-axis: tree memory (MB); minSup = 0.2%, maxPer = 1%)

However, the proposed PS-tree consumes less main memory than the existing PF-tree.
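For illustration, the following Python sketch shows how such an incremental scalability protocol can be organized: the database is split into equal-sized portions and the miner is re-run on the accumulated prefix after each portion is appended. The helper names and the toy data are hypothetical placeholders, not the authors' experimental harness.

```python
import time

def scalability_test(transactions, mine, portion_size=200_000):
    """Re-run a miner on accumulated prefixes of a temporal database.

    `transactions` is a list of (timestamp, itemset) pairs and `mine` is any
    callable that returns the discovered patterns; both stand in for the
    kosarak portions and the PS-growth/PFP-growth implementations compared
    in this section.
    """
    results = []
    for end in range(portion_size, len(transactions) + 1, portion_size):
        prefix = transactions[:end]                  # accumulate one more portion
        t0 = time.time()
        patterns = mine(prefix)
        results.append((len(prefix), len(patterns), time.time() - t0))
    return results

# Toy usage with a dummy miner; a real run would also record tree/total memory.
toy_db = [(i, frozenset({"a"})) for i in range(1_000_000)]
print(scalability_test(toy_db, mine=lambda db: []))
```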

6.4. Discussion. For patterns containing millions of transaction-ids, the proposed compact representation of transaction-ids using the notion of period summaries results in a huge reduction in memory and runtime requirements. The reduction in memory arises because, instead of storing thousands of transaction-ids in the tail-node, we store a summary of those transactions as a small number of intervals. The reduction in the runtime needed to extract periodic-frequent patterns arises because the existing approaches must scan the entire transaction-id list of a pattern to determine its periodicity, whereas the proposed approach only has to check three conditions (Algorithm 4), which can be done in O(1) time.
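To make the idea concrete, the sketch below illustrates the general notion of a period summary and a constant-time periodicity check built on it. It is only a simplified illustration under stated assumptions (a single sorted list of timestamps, and summaries split whenever a gap exceeds maxPer); the exact three conditions of Algorithm 4 and the summary-merging logic used during PS-growth mining are those described earlier in the paper, not this sketch.

```python
# A minimal sketch (not the authors' exact Algorithm 4): runs of timestamps whose
# consecutive gaps are at most maxPer are collapsed into (start, end) intervals,
# so periodicity can be decided from interval boundaries instead of rescanning
# every transaction-id of the pattern.

def build_period_summaries(timestamps, max_per):
    """Collapse a sorted list of timestamps into period summaries.

    A new summary is opened whenever the gap to the previous timestamp
    exceeds max_per.
    """
    summaries = []
    start = prev = timestamps[0]
    for ts in timestamps[1:]:
        if ts - prev > max_per:          # gap too large: close the current summary
            summaries.append((start, prev))
            start = ts
        prev = ts
    summaries.append((start, prev))
    return summaries


def is_periodic(summaries, db_start, db_end, max_per):
    """O(1) periodicity check on the summaries produced above.

    Under this sketch, a pattern is periodic iff (i) it first appears within
    max_per of the start of the database, (ii) it was never split into more
    than one summary (no internal gap exceeds max_per), and (iii) it last
    appears within max_per of the end of the database.
    """
    return (summaries[0][0] - db_start <= max_per and    # condition (i)
            len(summaries) == 1 and                      # condition (ii)
            db_end - summaries[-1][1] <= max_per)        # condition (iii)


# Example: a pattern observed at the timestamps below in a database spanning [0, 20]
ts = [1, 3, 5, 8, 10, 13, 15, 18, 20]
print(build_period_summaries(ts, max_per=3))                   # [(1, 20)]
print(is_periodic(build_period_summaries(ts, 3), 0, 20, 3))    # True
```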

7. Conclusions and Future Work. In this paper, we have proposed an efficient approach to extract periodic-frequent patterns from large transactional databases by using the notion of period summaries. Compared to the existing approaches, the proposed approach reduces the computational cost without missing any knowledge pertaining to periodic-frequent patterns. As part of future work, we plan to explore efficient approaches to extract periodic-frequent patterns from incremental datasets.

8. Acknowledgment. This work was supported by the Research and Development on Real World Big Data Integration and Analysis program of the Ministry of Education, Culture, Sports, Science and Technology, Japan.

References

[1] R. Agrawal, R. Srikant et al., "Fast algorithms for mining association rules," in The International Conference on Very Large Data Bases, vol. 1215, 1994, pp. 487–499.
[2] C. C. Aggarwal, Applications of frequent pattern mining, 2014, pp. 443–467.
[3] P. Fournier-Viger, J. C. W. Lin, R. U. Kiran, and Y. S. Koh, "A survey of sequential pattern mining," Data Science and Pattern Recognition, vol. 1, no. 1, pp. 54–77, 2017.
[4] F. Thabtah, "A review of associative classification mining," The Knowledge Engineering Review, vol. 22, no. 1, pp. 37–65, 2007.
[5] N. Abdelhamid and F. Thabtah, "Associative classification approaches: Review and comparison," Journal of Information and Knowledge Management, vol. 13, no. 03, pp. 1–30, 2014.
[6] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," in The International Conference on Database Theory, 1999, pp. 398–416.


[7] K. Gouda and M. J. Zaki, "Efficiently mining maximal frequent itemsets," in IEEE International Conference on Data Mining, 2001, pp. 163–170.
[8] A. Salam and M. S. H. Khayal, "Mining top-k frequent patterns without minimum support threshold," Knowledge and Information Systems, vol. 30, no. 1, pp. 57–86, 2012.
[9] R. U. Kiran, T. Y. Reddy, P. Fournier-Viger, M. Toyoda, P. K. Reddy, and M. Kitsuregawa, "Efficiently finding high utility-frequent itemsets using cutoff and suffix utility," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2019, pp. 191–203.
[10] L. Tang, L. Zhang, P. Luo, and M. Wang, "Incorporating occupancy into frequent pattern mining for high quality pattern recommendation," in ACM International Conference on Information and Knowledge Management, 2012, pp. 75–84.
[11] S. K. Tanbeer, C. F. Ahmed, B. S. Jeong, and Y. K. Lee, "Discovering periodic-frequent patterns in transactional databases," in Advances in Knowledge Discovery and Data Mining, 2009, pp. 242–253.
[12] R. U. Kiran and M. Kitsuregawa, "Finding periodic patterns in big data," in The International Conference on Big Data Analytics, 2015, pp. 121–133.
[13] R. U. Kiran, H. Shang, M. Toyoda, and M. Kitsuregawa, "Discovering recurring patterns in time series," in The International Conference on Extending Database Technology, 2015, pp. 97–108.
[14] P. Fournier-Viger, Z. Li, J. C. W. Lin, R. U. Kiran, and H. Fujita, "Efficient algorithms to identify periodic patterns in multiple sequences," Information Sciences, vol. 489, pp. 205–226, 2019.
[15] J. N. Venkatesh, R. U. Kiran, P. K. Reddy, and M. Kitsuregawa, "Discovering periodic-correlated patterns in temporal databases," Transactions on Large-Scale Data- and Knowledge-Centered Systems, vol. 38, pp. 146–172, 2018.
[16] S. K. Tanbeer, M. M. Hassan, A. Almogren, M. Zuair, and B. Jeong, "Scalable regular pattern mining in evolving body sensor data," Future Generation Computer Systems, vol. 75, pp. 172–186, 2017.
[17] D. Dinh, B. Le, P. Fournier-Viger, and V. Huynh, "An efficient algorithm for mining periodic high-utility sequential patterns," Applied Intelligence, vol. 48, no. 12, pp. 4694–4714, 2018.
[18] A. Eckner, "A framework for the analysis of unevenly-spaced time series data," 2011.
[19] P. Yadav, M. Steinbach, V. Kumar, and G. Simon, "Mining electronic health records (EHRs): A survey," ACM Computing Surveys, vol. 50, no. 6, pp. 85:1–85:40, 2018.
[20] R. U. Kiran, H. Shang, M. Toyoda, and M. Kitsuregawa, "Discovering partial periodic itemsets in temporal databases," in The International Conference on Scientific and Statistical Database Management, 2017, pp. 30:1–30:6.
[21] R. U. Kiran, J. N. Venkatesh, P. Fournier-Viger, M. Toyoda, P. K. Reddy, and M. Kitsuregawa, "Discovering periodic patterns in non-uniform temporal databases," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2017, pp. 604–617.
[22] A. Anirudh, R. U. Kiran, P. K. Reddy, and M. Kitsuregawa, "Memory efficient mining of periodic-frequent patterns in transactional databases," in IEEE Symposium Series on Computational Intelligence, 2016, pp. 1–8.
[23] Z. H. Deng and S. Lv, "Prepost+: An efficient n-lists-based algorithm for mining frequent itemsets via children–parent equivalence pruning," Expert Systems with Applications, vol. 42, no. 13, pp. 5424–5432, 2015.
[24] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Data Mining and Knowledge Discovery, vol. 15, no. 1, 2007.
[25] J. Han, G. Dong, and Y. Yin, "Efficient mining of partial periodic patterns in time series database," in International Conference on Data Engineering, 1999, pp. 106–115.
[26] G. Pyun, U. Yun, and K. Ho Ryu, "Efficient frequent pattern mining based on linear prefix tree," Knowledge-Based Systems, vol. 55, pp. 125–139, 2014.
[27] B. Liu, W. Hsu, and Y. Ma, "Mining association rules with multiple minimum supports," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 337–341.
[28] Y. H. Hu and Y. L. Chen, "Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism," Decision Support Systems, vol. 42, no. 1, pp. 1–24, 2006.
[29] R. U. Kiran and P. K. Reddy, "Novel techniques to reduce search space in multiple minimum supports-based frequent pattern mining algorithms," in Extending Database Technology, 2011, pp. 11–20.
[30] S. Brin, R. Motwani, and C. Silverstein, "Beyond market baskets: generalizing association rules to correlations," in Special Interest Group on Management of Data, 1997, pp. 265–276.


[31] E. R. Omiecinski, "Alternative interest measures for mining associations in databases," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 57–69, 2003.
[32] J. Bradshaw, "Yams - yet another measure of similarity," in EuroMUG, 2001. [Online]. Available: http://www.daylight.com/meetings/emug01/Bradshaw/Similarity/YAMS.html
[33] A. Surana, R. U. Kiran, and P. K. Reddy, "Selecting a right interestingness measure for rare association rules," in International Conference on Management of Data, 2010, pp. 105–115.
[34] P. N. Tan, V. Kumar, and J. Srivastava, "Selecting the right interestingness measure for association patterns," in Knowledge Discovery and Data Mining, 2002, pp. 32–41.
[35] T. Wu, Y. Chen, and J. Han, "Re-examination of interestingness measures in pattern mining: a unified framework," Data Mining and Knowledge Discovery, vol. 21, no. 3, pp. 371–397, 2010.
[36] J. Han, W. Gong, and Y. Yin, "Mining segment-wise periodic patterns in time-related databases," in Knowledge Discovery in Databases, 1998, pp. 214–218.
[37] S. S. Chen, T. C. K. Huang, and Z. M. Lin, "New and efficient knowledge discovery of partial periodic patterns with multiple minimum supports," Journal of Systems and Software, vol. 84, no. 10, pp. 1638–1651, 2011.
[38] W. G. Aref, M. G. Elfeky, and A. K. Elmagarmid, "Incremental, online, and merge mining of partial periodic patterns in time-series databases," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 3, pp. 332–342, 2004.
[39] J. Yang, W. Wang, and P. S. Yu, "Mining asynchronous periodic patterns in time series data," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 613–628, 2003.
[40] M. Zhang, B. Kao, D. W. Cheung, and K. Y. Yip, "Mining periodic patterns with gap requirement from sequences," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 2, 2007.
[41] H. Cao, D. Cheung, and N. Mamoulis, "Discovering partial periodic patterns in discrete data sequences," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, vol. 3056, 2004, pp. 653–658.
[42] B. Ozden, S. Ramaswamy, and A. Silberschatz, "Cyclic association rules," in International Conference on Data Engineering, 1998, pp. 412–421.
[43] S. K. Tanbeer, C. F. Ahmed, B. S. Jeong, and Y. K. Lee, "Discovering periodic-frequent patterns in transactional databases," in Advances in Knowledge Discovery and Data Mining, 2009, pp. 242–253.
[44] R. U. Kiran, M. Kitsuregawa, and P. K. Reddy, "Efficient discovery of periodic-frequent patterns in very large databases," Journal of Systems and Software, vol. 112, pp. 110–121, 2016.
[45] K. Amphawan, P. Lenca, and A. Surarerks, "Mining top-k periodic-frequent pattern from transactional databases without support threshold," in Advances in Information Technology, 2009, pp. 18–29.
[46] R. U. Kiran and P. K. Reddy, "Towards efficient mining of periodic-frequent patterns in transactional databases," in The International Conference on Database and Expert Systems Applications, 2010, pp. 194–208.
[47] J. N. Venkatesh, R. U. Kiran, P. K. Reddy, and M. Kitsuregawa, "Discovering periodic-correlated patterns in temporal databases," Transactions on Large-Scale Data- and Knowledge-Centered Systems, vol. 38, pp. 146–172, 2018.
[48] M. M. Rashid, M. R. Karim, B. S. Jeong, and H. J. Choi, "Efficient mining regularly frequent patterns in transactional databases," in International Conference on Database Systems for Advanced Applications, 2012, pp. 258–271.
[49] V. M. Nofong, "Discovering productive periodic frequent patterns in transactional databases," Annals of Data Science, vol. 3, no. 3, pp. 235–249, 2016.
[50] J. C. W. Lin, W. Gan, T. P. Hong, and V. S. Tseng, "Efficient algorithms for mining up-to-date high-utility patterns," Advanced Engineering Informatics, vol. 29, no. 3, pp. 648–661, 2015.
[51] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53–87, Jan. 2004.
[52] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[53] T. Brijs, B. Goethals, G. Swinnen, K. Vanhoof, and G. Wets, "A data mining framework for optimal product selection in retail supermarket data: The generalized PROFSET model," Computing Research Repository, pp. 300–304, 2001.


Rage Uday Kiran is working as a Specially Appointed Research Assistant Professor at the University of Tokyo, Tokyo, Japan. He received his PhD degree in computer science from the International Institute of Information Technology, Hyderabad, Telangana, India. His current research interests include data mining, ICTs for agriculture, and recommender systems. He has published papers in the International Conference on Extending Database Technology (EDBT), the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Database Systems for Advanced Applications (DASFAA), the International Conference on Database and Expert Systems Applications (DEXA), the International Conference on Scientific and Statistical Database Management (SSDBM), the International Journal of Computational Science and Engineering (IJCSE), the Journal of Intelligent Information Systems (JIIS), and the Journal of Systems and Software (JSS).

Chennupati Saideep received his Bachelor of Technology degree from the Indian Institute of Information Technology, Kurnool. His research interests are data mining and artificial intelligence.

Masashi Toyoda is a professor at the Institute of Industrial Science, jointly affiliated with the Graduate School of Information Science and Technology at the University of Tokyo, Japan. He received his BS, MS, and PhD degrees in computer science from the Tokyo Institute of Technology, Japan, in 1994, 1996, and 1999, respectively. In 1999, he joined the Institute of Industrial Science, the University of Tokyo, as a research fellow, and worked as a specially appointed associate professor from 2004 to 2006 and as an associate professor from 2006 to 2018. His research interests include archiving and analysis of Web, social media, and IoT data, information visualization, visual analytics, and user interfaces.


P. Krishna Reddy is a faculty member at IIIT Hyderabad. He is the head of the Agricultural Research Center and a member of the Data Sciences and Analytics Center research team at IIIT Hyderabad, India. From 2013 to 2015, he served as a Program Director, ITRA-Agriculture and Food, Information Technology Research Academy (ITRA), Government of India. From 1997 to 2002, he was a research associate at the Center for Conceptual Information Processing Research, Institute of Industrial Science, University of Tokyo. From 1994 to 1996, he was a faculty member at the Division of Computer Engineering, Netaji Subhas Institute of Technology, Delhi. During the summer of 2003, he was a visiting researcher at the Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA. He received his M.Tech and Ph.D degrees in computer science from Jawaharlal Nehru University, New Delhi, in 1991 and 1994, respectively. His research areas include data mining, database systems, and IT for agriculture. He has published about 157 refereed research papers, including 22 journal papers, three book chapters, and six edited books. He is a steering committee member of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) conference series and the Database Systems for Advanced Applications (DASFAA) conference series, and has been the steering committee chair of the Big Data Analytics (BDA) conference series since 2017. He was a proceedings chair of COMAD 2008, a workshop chair of KDRS 2010, media and publicity chair of KDD 2015, and general chair of BDA 2017. He organized the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2010), the Third National Conference on Agro-Informatics and Precision Agriculture 2012 (AIPA 2012), and the Fifth International Conference on Big Data Analytics (BDA 2017). He has delivered several invited and panel talks at reputed conferences and workshops in India and abroad, and has received several awards and recognitions. He has executed research projects by raising research funding of about 80 million Indian rupees. Since 2004, he has been investigating the building of efficient agricultural knowledge transfer systems by extending developments in IT. He developed the eSagu system, an IT-based farm-specific agro-advisory system that has been field-tested in hundreds of villages on about 50 field and horticultural crops. He also built the eAgromet system, an IT-based agro-meteorological advisory system that provides risk mitigation information to farmers. He conceptualized the notion of Virtual Crop Labs to improve the applied skills of extension professionals. Currently, he is investigating the building of the Crop Darpan system, a crop diagnostic tool for farmers, with funding support from the India-Japan Joint Research Laboratory Program. He has received two best paper awards. The eSagu system has received several recognitions, including the CSI-Nihilent e-Governance Project Award in 2006, the Manthan Award in 2008, and being a finalist in the Stockholm Challenge Award 2008.


Masaru Kitsuregawa is the Director General of the National Institute of Informatics and a Professor at the Institute of Industrial Science, the University of Tokyo. He received his Ph.D. degree from the University of Tokyo in 1983. He has served in various positions, such as President of the Information Processing Society of Japan (2013–2015) and Chairman of the Committee for Informatics, Science Council of Japan (2014–2016). He has wide research interests, especially in database engineering. He has received many awards, including the ACM SIGMOD E. F. Codd Innovations Award, the IEICE Achievement Award, the IPSJ Contribution Award, the 21st Century Invention Award of the National Commendation for Invention, Japan, and the C&C Prize. In 2013, he was awarded the Medal with Purple Ribbon and, in 2016, the Chevalier de la Légion d'Honneur. He is a fellow of the ACM, IEEE, IEICE, and IPSJ.