Differentially Private Frequent Sequence Mining

Shengzhi Xu, Xiang Cheng, Sen Su, Ke Xiao, and Li Xiong

Abstract—In this paper, we study the problem of mining frequent sequences under the rigorous differential privacy model. We explore the possibility of designing a differentially private frequent sequence mining (FSM) algorithm which can achieve both high data utility and a high degree of privacy. We found that, in differentially private FSM, the amount of required noise is proportional to the number of candidate sequences. If we could effectively prune those unpromising candidate sequences, the utility-privacy tradeoff could be significantly improved. To this end, by leveraging a sampling-based candidate pruning technique, we propose PFS², a novel differentially private FSM algorithm. It is the first algorithm that supports general gap-constrained FSM in the context of differential privacy. The gap constraints in FSM can be used to limit the mining results to a controlled set of frequent sequences. The core of our PFS² algorithm is to utilize sample databases to prune the candidate sequences generated based on the downward closure property. In particular, we use the noisy local support of candidate sequences in the sample databases to estimate which candidate sequences are potentially frequent. To improve the accuracy of such private estimations, a gap-aware sequence shrinking method is proposed to enforce the length constraint on the sample databases. Moreover, to calibrate the amount of noise required by differential privacy, a gap-aware sensitivity computation method is proposed to obtain the sensitivity of the local support computations with different gap constraints. Furthermore, to decrease the probability of misestimating frequent sequences as infrequent, a threshold relaxation method is proposed to relax the user-specified threshold for the sample databases. Through formal privacy analysis, we show that our PFS² algorithm is ε-differentially private. Extensive experiments on real datasets illustrate that our PFS² algorithm can privately find frequent sequences with high accuracy.

Index Terms—Frequent sequence mining, differential privacy, candidate pruning


1 INTRODUCTION

FREQUENT sequence mining (FSM) is a fundamental component in a number of important data mining tasks with broad applications, ranging from Web usage analysis to location-based recommendation systems. Despite the valuable insights the discovery of frequent sequences could provide, if the data is sensitive (e.g., DNA sequences, web browsing histories and mobility traces), releasing frequent sequences might pose considerable threats to individuals' privacy.

Differential privacy [1], [2] has been proposed as a way to address this problem. Unlike anonymization-based privacy models (e.g., k-anonymity [3] and l-diversity [4]), differential privacy provides strong theoretical guarantees on the privacy of released data against adversaries with prior knowledge. By adding carefully calibrated noise, differential privacy assures that the output of a computation is insensitive to the change in any individual's record. Therefore, it restricts privacy leaks through the results.

Given a collection of input sequences, the goal of FSM is to discover all subsequences that occur in the input sequences more frequently than a user-specified threshold. In FSM, users often want to specify a gap constraint between adjacent items in a frequent sequence [5]. For example, in session analysis, each input sequence corresponds to a user session and each item to a user action in the session. Depending on the application, users may be interested in sequences of consecutive actions, or of non-consecutive actions that frequently appear in close proximity. As another example, in text mining, input sequences correspond to sentences and items to words. The goal of n-gram mining is to discover frequent consecutive word sequences. By contrast, in word association mining, users want to discover combinations of words that frequently appear in sentences but not necessarily consecutively. In FSM, the above requirements can be achieved by parameterizing a maximum gap g. Given an input sequence, we consider only its subsequences which can be generated by skipping no more than g consecutive items. Gap-constrained FSM is equivalent to n-gram mining when g = 0 and to unconstrained FSM when g = ∞.

In this paper, we focus on the design of differentially private FSM algorithms which can support general gap constraints. Although private algorithms exist for mining special cases of frequent sequences (e.g., unconstrained frequent sequences [6] and frequent consecutive item sequences [7]), we are not aware of any existing differentially private algorithm that can find frequent sequences under user-specified gap constraints.

To make FSM satisfy differential privacy, a straightforward approach is to leverage the downward closure property [8] to generate the candidate sequences (i.e., all possible frequent sequences), and add perturbation noise to the support of each candidate sequence. However, this approach suffers from poor performance. The root cause is the large number of generated candidate sequences, which leads to a large amount of perturbation noise. We observe that, in practice, the number of real frequent sequences is much smaller than the number of candidate sequences generated based on the downward closure property. In addition, in frequent pattern mining, for most patterns, their frequencies in a small part of the database are approximately equal to their frequencies in the original database [9], [10]. This is because the frequency of a pattern can be considered as the probability of a record containing this pattern. Thus, it is sufficient for us to estimate which patterns are potentially frequent based on a small part of the database. These observations motivate us to design a differentially private FSM algorithm which utilizes small parts of the database to further prune candidate sequences generated based on the downward closure property, such that the amount of noise required by differential privacy can be significantly reduced and the utility of the results can be considerably improved.

• S. Xu, X. Cheng, S. Su, and K. Xiao are with the Beijing University of Posts and Telecommunications, Beijing 100876, China. E-mail: {dear_shengzhi, chengxiang, susen, iejr}@bupt.edu.cn.
• L. Xiong is with Emory University, Atlanta, GA 30322. E-mail: [email protected].

Manuscript received 22 May 2015; revised 30 May 2016; accepted 30 July 2016. Date of publication 17 Aug. 2016; date of current version 3 Oct. 2016. Recommended for acceptance by E. Terzi. Digital Object Identifier no. 10.1109/TKDE.2016.2601106

To this end, we propose a novel differentially private FSM algorithm called PFS² (differentially Private Frequent Sequence mining via Sampling-based candidate Pruning). In PFS², we privately discover frequent sequences in order of increasing length. In particular, we first generate multiple disjoint sample databases. Then, to discover the frequent k-sequences (i.e., the frequent sequences containing k items), we utilize the frequent (k−1)-sequences to generate the candidate k-sequences based on the downward closure property. Next, the kth sample database is utilized to further prune these generated candidate k-sequences. Finally, we compute the noisy support of the remaining candidate k-sequences in the original database, and output the candidate k-sequences whose noisy supports exceed the user-specified threshold as frequent k-sequences.
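To make this pipeline concrete, the following is a minimal sketch (not from the paper) of the level-wise loop just described. The helpers generate_candidates, prune_with_sample and noisy_support are hypothetical stand-ins for the components detailed in Sections 5 and 6; this illustrates the control flow only.

```python
def pfs2_sketch(db, sample_dbs, threshold, g, max_len):
    # Level-wise mining: candidates come from the downward closure property,
    # are pruned on the k-th sample database, and are finalized by their
    # noisy support on the original database. All helpers are assumed.
    frequent, output = None, {}
    for k in range(1, max_len + 1):
        cands = generate_candidates(frequent, k)              # downward closure
        cands = prune_with_sample(cands, sample_dbs[k - 1], threshold, g)
        frequent = []
        for cs in cands:
            s = noisy_support(db, cs, g)                      # perturbed count
            if s >= threshold:
                frequent.append(cs)
                output[cs] = s
        if not frequent:
            break
    return output
```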

The core of our PFS² algorithm is to effectively prune the candidate sequences. The local supports of candidate sequences in the sample database are used to estimate which candidate sequences are potentially frequent. To avoid a privacy breach, we have to add noise to the local support of each candidate sequence. To estimate which sequences are potentially frequent accurately, it is necessary to reduce the amount of added noise. In differentially private frequent itemset mining (FIM), enforcing a length constraint on transactions has been proven to be an effective method for reducing the amount of added noise [11]. However, due to the inherent sequentiality of sequential data, this method cannot be used in differentially private FSM. In addition, after enforcing the length constraint on the sample database, the amount of noise required by differential privacy relies on the maximal number of subsequences contained in a length-limited sequence. However, unlike unconstrained FSM, as there is a gap constraint between adjacent items in gap-constrained FSM, the maximal number of subsequences in a length-limited sequence depends on the value of the gap. It is hard to obtain the maximal number of subsequences for a specific gap. If we simply use the number of all combinations of items as the upper bound on the maximal number of subsequences, it will add more noise than required. Furthermore, the sequences in the sample database are randomly drawn from the original database. This causes many frequent sequences to become infrequent in the sample database. If we simply use the user-specified threshold to estimate which candidate sequences are potentially frequent, many frequent sequences will be misestimated as infrequent. Therefore, how to accurately estimate which candidate sequences are potentially frequent while satisfying differential privacy is a challenging task. To this end, in this paper, we propose three effective methods, namely, gap-aware sequence shrinking, gap-aware sensitivity computation and threshold relaxation.

For the gap-aware sequence shrinking method, we extend our recent work [6] such that it can effectively preserve frequency information under different gap constraints. It consists of three schemes: irrelevant item deletion, consecutive patterns compression and sequence reconstruction. In particular, the irrelevant item deletion and consecutive patterns compression schemes can effectively shorten sequences without losing any frequency information. After sequences are shortened by these two schemes, if some of them still violate the length constraint, we use the sequence reconstruction scheme to construct new sequences which satisfy the length constraint and preserve frequency information as much as possible. By using the gap-aware sequence shrinking method, we can significantly reduce the amount of noise added to the local support of candidate sequences in the sample database.

In our gap-aware sensitivity computation method, for a specific gap, we leverage a tree structure to derive the maximal number of subsequences contained in a length-limited sequence. After that, we can efficiently calculate the sensitivity of the local support computations of candidate sequences in the sample database, and calibrate the amount of noise required by differential privacy.

In the threshold relaxation method, we theoretically quantify the relationship between the decrement of the threshold and the probability of misestimating frequent sequences as infrequent. Based on the theoretical results, we relax the threshold for the sample database to estimate which candidate sequences are potentially frequent. In doing so, the probability of misestimating frequent sequences as infrequent can be effectively decreased.

Through formal privacy analysis, we prove that our PFS² algorithm guarantees ε-differential privacy. Extensive experimental results on real datasets show that our PFS² algorithm achieves high data utility. Besides, to demonstrate the generality of our sampling-based candidate pruning technique, we extend this technique to frequent itemset mining. Preliminary experimental results show that the performance of an existing algorithm [11] is significantly improved by adopting this technique. In summary, we extend our recent work [6] and present several contributions:

1) We introduce the sampling-based candidate pruning technique as an effective means of achieving high data utility in the context of differentially private FSM. This technique is not only suitable for mining frequent sequences, but can also be extended to design other differentially private frequent pattern mining algorithms.

2) We develop a novel differentially private FSM algorithm, PFS², by leveraging our sampling-based candidate pruning technique. It can handle general gap-constrained FSM under differential privacy. To privately and accurately estimate which candidate sequences are potentially frequent, we propose three methods, namely, gap-aware sequence shrinking, gap-aware sensitivity computation and threshold relaxation.


3) Through formal privacy analysis, we prove that our PFS² algorithm satisfies ε-differential privacy. We conduct an extensive experimental study over real datasets. The experimental results demonstrate that our PFS² algorithm can privately find frequent sequences with high accuracy.

The rest of our paper is organized as follows. We provide a literature review in Section 2. Section 3 presents the necessary background on differential privacy and FSM. In Section 4, we identify the technical challenges of designing a differentially private FSM algorithm. In Section 5, we describe our sampling-based candidate pruning technique. In Section 6, we discuss our PFS² algorithm in detail and give the privacy analysis. Comprehensive experimental results are reported in Section 7. Finally, we conclude the paper in Section 8.

2 RELATED WORK

We coarsely categorize previous differentially private frequent pattern mining studies into three groups according to the type of pattern being mined.

Sequence. Bonomi and Xiong [7] propose a two-phase differentially private algorithm for mining frequent consecutive item sequences (i.e., maximum gap g = 0). They first utilize a prefix tree to find candidate sequences, and then leverage a database transformation technique to refine the support of candidate sequences. Compared with their work, we focus on general gap-constrained FSM in the context of differential privacy. Moreover, in [7], the support of a sequence is taken to be how often it occurs in the input sequences, which corresponds to collection frequency in text mining. By contrast, in our work, we consider the support of a sequence to be the number of input sequences containing it, which corresponds to document frequency in text mining. For example, given a database D = {abd, ababc} and gap g = 0, the collection frequency of sequence ab is 3 while its document frequency is 2. Recently, based on the idea of candidate pruning, Xiang et al. [12] propose a differentially private algorithm, DP-MFSM, for finding maximal frequent sequences. Different from their work, in this paper, we address a more practical and challenging problem, i.e., finding frequent sequences with general gap constraints in the context of differential privacy. Moreover, DP-MFSM uses the kth sample database to prune candidate k-sequences, and directly generates candidate (k+1)-sequences based on the remaining candidate k-sequences. After that, it uses a candidate verification technique to find maximal frequent sequences from all the remaining candidate sequences. By contrast, in our PFS² algorithm, after pruning candidate k-sequences based on the kth sample database, we compute the noisy support of each remaining candidate k-sequence on the original database to determine the frequent k-sequences, and use the frequent k-sequences to generate the candidate (k+1)-sequences.

Two studies have been proposed to address the issue of publishing sequence databases under differential privacy [13], [14]. In particular, Chen et al. [13] propose a differentially private sequence database publishing algorithm based on a prefix tree. In [14], the authors employ a variable-length n-gram model to extract the necessary information from the sequence databases. These two studies focus on the publication of sequence databases, while our work aims at the release of frequent sequences.

Itemset. Several algorithms have been proposed to tackle the problem of differentially private frequent itemset mining [11], [15], [16], [17], [18]. In particular, Bhaskar et al. [15] first study the problem of mining frequent itemsets with differential privacy. They propose two differentially private FIM algorithms, which adapt the Laplace mechanism [2] and the exponential mechanism [19], respectively. To meet the challenge of the high dimensionality of transaction databases, Li et al. [16] introduce the PrivBasis algorithm, which projects the input high-dimensional database onto lower dimensions. In [11], Zeng et al. find that the utility-privacy tradeoff in differentially private FIM can be improved by limiting the length of transactions. They propose a differentially private FIM algorithm based on transaction truncating. To limit the length of transactions, Cheng et al. [17] present a transaction splitting method. Based on the FP-growth algorithm [20], Su et al. [18] introduce an efficient differentially private FIM algorithm, PFP-growth. Different from [11], [17], [18], we prune candidate sequences based on sample databases to improve the utility-privacy tradeoff. Moreover, we shrink sequences in the sample databases, so that we can prune the unpromising candidate sequences accurately. All these algorithms [11], [15], [16], [17], [18] are shown to be effective in some scenarios. However, the differences between itemsets and sequences make these algorithms inapplicable to differentially private FSM.

Graph. Shen et al. [21] integrate the process of frequent subgraph mining and the privacy protection into a Markov Chain Monte Carlo framework. In [22], Xu et al. propose a two-phase algorithm for finding frequent subgraphs under ε-differential privacy.

In this paper, we extend our differentially private FSM algorithm PFS² proposed in [6] to support general gap constraints. In particular, we propose a new gap-aware sequence shrinking method, which can effectively preserve the frequency information under different gap constraints. Moreover, to efficiently calculate the sensitivity of the local support computations in sample databases with different gap constraints, we add a new gap-aware sensitivity computation method. At last, we conduct more experiments to evaluate the performance of our PFS² algorithm, and also measure the effectiveness of the gap-aware sequence shrinking and gap-aware sensitivity computation methods.

3 PRELIMINARIES

3.1 Differential Privacy

Differential privacy [1], [2] has become a de-facto standard privacy notion in private data analysis. Differential privacy requires the outputs of an algorithm to be approximately the same even if any individual's record in the database is arbitrarily changed. Thus, the presence or absence of any individual's record has a statistically negligible effect on the outputs. Formally, differential privacy is defined as follows.

Definition 1 (ε-differential privacy). An algorithm A satisfies ε-differential privacy iff for any databases D and D′ which differ by at most one record, and for any subset of outputs S,

$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(D') \in S].$

A fundamental concept for guaranteeing differential privacy is the sensitivity [2]. It measures the maximum change in the outputs of a function when any individual's record in the database is changed.

Definition 2 (Sensitivity). For any function f : D → Rⁿ, and for any databases D₁, D₂ which differ by at most one record, the sensitivity of function f is

$\Delta f = \max_{D_1, D_2} \| f(D_1) - f(D_2) \|.$

Dwork et al. [2] propose the Laplace mechanism. For a function whose outputs are real values, they prove that differential privacy can be achieved by adding noise drawn randomly from the Laplace distribution.

Theorem 1. For any function f : D → Rⁿ with sensitivity Δf, the algorithm

$\mathcal{A}(D) = f(D) + \langle \mathrm{Lap}_1(\lambda), \ldots, \mathrm{Lap}_n(\lambda) \rangle$

satisfies ε-differential privacy, where each Lap_i(λ) is drawn i.i.d. from a Laplace distribution with scale λ = Δf/ε.

For a function whose outputs are integer values, Ghosh et al. [23] propose the geometric mechanism.

Theorem 2. Let f : D → Rⁿ be a function which outputs integer values, and let its sensitivity be Δf. The algorithm

$\mathcal{A}(D) = f(D) + \langle G_1(\varepsilon), \ldots, G_n(\varepsilon) \rangle$

guarantees ε-differential privacy, where each G_i(ε) is sampled i.i.d. from a geometric distribution G(ε/Δf).

To support multiple differentially private computations, the sequential composition and parallel composition properties of differential privacy are extensively used.

Theorem 3 (Sequential Composition [2]). Let A₁, ..., A_k be k algorithms, where each A_i provides ε_i-differential privacy. A sequence of algorithms A_i(D) over database D provides (Σ_i ε_i)-differential privacy.

Theorem 4 (Parallel Composition [2]). Let A₁, ..., A_k be k algorithms, where each A_i provides ε_i-differential privacy. A sequence of algorithms A_i(D_i) over disjoint databases D₁, ..., D_k provides max_i(ε_i)-differential privacy.
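As an illustration of the two mechanisms above (not from the paper), the following sketch adds Laplace and geometric noise to a vector of counts; the two-sided geometric noise is generated with the standard construction as the difference of two one-sided geometric variables with success probability p = 1 − e^(−ε/Δf).

```python
import numpy as np

def laplace_mechanism(counts, sensitivity, epsilon):
    # Theorem 1: add Laplace noise with scale sensitivity / epsilon.
    scale = sensitivity / epsilon
    return np.asarray(counts, dtype=float) + np.random.laplace(0.0, scale, size=len(counts))

def geometric_mechanism(counts, sensitivity, epsilon):
    # Theorem 2: add two-sided geometric noise G(epsilon / sensitivity),
    # sampled as the difference of two i.i.d. one-sided geometric variables.
    p = 1.0 - np.exp(-epsilon / sensitivity)
    noise = (np.random.geometric(p, size=len(counts))
             - np.random.geometric(p, size=len(counts)))
    return np.asarray(counts) + noise
```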

3.2 Frequent Sequence Mining

Suppose the alphabet is I = {i₁, ..., i_|I|}. A sequence S is an ordered list of items. We use S = i₁i₂...i_|S| to denote a sequence of length |S|. A sequence S is called a k-sequence if |S| = k. Given a maximum gap g, we say sequence S = i₁i₂...i_|S| is contained in another sequence T = t₁t₂...t_|T| if there exist integers w₁ < ··· < w_|S| such that (i) i_k = t_{w_k} for 1 ≤ k ≤ |S| and (ii) w_{k+1} − w_k − 1 ≤ g for 1 ≤ k < |S|. If S is contained in T, we say that S is a g-subsequence of T and T is a g-supersequence of S, denoted by S ⊑_g T. For example, if S₁ = acd, S₂ = bcd and T = abcdeg, then S₁ ⊑₁ T (but S₁ ⋢₀ T) and S₂ ⊑₀ T. In the rest of this paper, we will use the symbol g to refer to the maximum gap.

A sequence database is a multiset of input sequences, where each input sequence represents an individual's record. The support (frequency) of a sequence S is the number (percentage) of input sequences containing S. Our measure of support corresponds to the concept of document frequency in text mining. The minimum support (relative) threshold is a number (percentage) of input sequences. Given the user-specified minimum support (relative) threshold, a sequence is called frequent if its support (frequency) is no less than this threshold. In the rest of this paper, we use "threshold" as shorthand for "minimum support threshold". The gap-constrained FSM problem considered in this paper is as follows: given a sequence database, a threshold and a maximum gap, find all the frequent sequences in the sequence database and also compute the support of each frequent sequence.
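As a minimal illustration of the containment definition above (not part of the original paper), the following sketch checks whether S is a g-subsequence of T by searching for positions w₁ < ··· < w_|S| that satisfy the gap constraint; note that a simple greedy scan is not sufficient, since an early match can block a later one.

```python
from functools import lru_cache

def is_g_subsequence(S, T, g):
    # True iff positions w_1 < ... < w_|S| exist in T with T[w_k] == S[k]
    # and w_{k+1} - w_k - 1 <= g for every pair of consecutive positions.
    n, m = len(S), len(T)

    @lru_cache(maxsize=None)
    def match(k, prev):
        if k == n:                       # all items of S placed
            return True
        lo = 0 if k == 0 else prev + 1   # first item may start anywhere
        hi = m if k == 0 else min(m, prev + g + 2)  # w <= prev + g + 1
        return any(T[w] == S[k] and match(k + 1, w) for w in range(lo, hi))

    return match(0, -1)

# The example from the text: S1 = acd is a 1-subsequence of T = abcdeg
# but not a 0-subsequence, while S2 = bcd is a 0-subsequence of T.
assert is_g_subsequence("acd", "abcdeg", 1)
assert not is_g_subsequence("acd", "abcdeg", 0)
assert is_g_subsequence("bcd", "abcdeg", 0)
```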

4 A STRAIGHTFORWARD APPROACH

In this section, we present a straightforward approach to finding frequent sequences under differential privacy. The main idea of this approach is to add noise to the support of all possible frequent sequences and determine which sequences are frequent based on their noisy supports. In particular, we find frequent sequences in order of increasing length. For the mining of frequent k-sequences, based on the downward closure property [8], we first generate all possible frequent k-sequences by using the frequent (k−1)-sequences. The generated k-sequences are called candidate k-sequences. Then, we compute the noisy support of each candidate k-sequence according to the maximum gap g. The candidate k-sequences whose noisy supports exceed the threshold are output as frequent k-sequences.

Privacy Analysis. Let Q_k denote the support computations of the candidate k-sequences. The amount of noise added to the support of candidate k-sequences depends on the allocated privacy budget and the sensitivity of Q_k. Suppose the maximal length of frequent sequences is L_f. We uniformly assign Q_k a privacy budget ε/L_f. Moreover, since adding (removing) an input sequence can, in the worst case, affect the support of every candidate k-sequence by one, the sensitivity of Q_k (i.e., Δ_k) is the number of candidate k-sequences. Thus, adding Laplace noise Lap(L_f · Δ_k / ε) in Q_k satisfies (ε/L_f)-differential privacy. The process of mining frequent sequences can be considered as a series of computations Q = ⟨Q₁, ..., Q_{L_f}⟩. Based on the sequential composition property of differential privacy, this approach as a whole maintains ε-differential privacy.

Limitations. The approach discussed above, however, produces poor results. The root cause of the poor performance is the large number of candidate sequences, which makes the sensitivity of the support computations prohibitively high. For example, if there are 10³ candidate two-sequences, the sensitivity of computing the supports of these candidate two-sequences is Δ = 10³. This means that, even if the privacy level is low (e.g., ε = 5), the support of each candidate two-sequence has to be perturbed by a very large amount of noise, which drastically reduces the utility of the results.
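A minimal sketch of this straightforward approach (not the paper's code) makes the budget split and the candidate-count sensitivity explicit. It reuses is_g_subsequence from Section 3.2, models sequences as strings, and implements the downward-closure join in the simplest way.

```python
import numpy as np

def generate_candidates(frequent):
    # Downward-closure join: a frequent (k-1)-sequence a is extended by the
    # last item of a frequent (k-1)-sequence b whenever a[1:] == b[:-1].
    return sorted({a + b[-1] for a in frequent for b in frequent
                   if a[1:] == b[:-1]})

def support(db, cs, g):
    # Gap-constrained support: number of input sequences containing cs.
    return sum(is_g_subsequence(cs, t, g) for t in db)

def naive_private_fsm(db, items, threshold, g, L_f, epsilon):
    # Each of the L_f passes gets budget epsilon / L_f, and the sensitivity
    # of pass k is the number of candidate k-sequences, so the Laplace
    # noise scale is L_f * len(candidates) / epsilon.
    frequent, results = sorted(items), {}
    for k in range(1, L_f + 1):
        candidates = frequent if k == 1 else generate_candidates(frequent)
        if not candidates:
            break
        scale = L_f * len(candidates) / epsilon
        frequent = []
        for cs in candidates:
            noisy = support(db, cs, g) + np.random.laplace(0.0, scale)
            if noisy >= threshold:
                frequent.append(cs)
                results[cs] = noisy
    return results
```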


5 SAMPLING-BASED CANDIDATE PRUNING

As discussed in Section 4, the amount of noise required by differential privacy in FSM is proportional to the number of candidate sequences. In practice, we found that the number of real frequent sequences is much smaller than the number of candidate sequences generated based on the downward closure property. For example, in database MSNBC, for the relative threshold θ = 0.01 and maximum gap g = 0, the number of frequent two-sequences is 36 while the number of candidate two-sequences generated based on the downward closure property is 256. As another example, in database BIBLE, for θ = 0.15 and g = 1, the number of frequent two-sequences is 21 while the number of generated candidate two-sequences is 254. Therefore, if we could further prune the candidate sequences generated based on the downward closure property, the amount of noise required by differential privacy could be significantly reduced, which would considerably improve the utility of the results.

In FSM, for most sequences, their frequencies in a small part of the database are approximately equal to their frequencies in the original database [9], [10]. This is because the frequency of a sequence can be considered as the probability of an input sequence containing it. Therefore, it is sufficient for us to estimate which candidate sequences are potentially frequent based on a small part of the database.

Algorithm 1. Sampling-Based Candidate Pruning

Input: Candidate k-sequences C_k; sample database db_k; privacy budget ε₄; threshold θ; maximum gap g; maximal length constraint l_max.
Output: Potentially frequent k-sequences PF.

1: db′_k ← transform sample database db_k; // see Section 5.1
2: Δ′_k ← obtain the sensitivity of computing C_k in db′_k; // see Section 5.2
3: θ′ ← relax threshold θ; // see Section 5.3
4: PF ← discover_potentially_frequent_sequences(C_k, db′_k, ε₄, Δ′_k, θ′, g);
5: return PF;

To this end, we propose a sampling-based candidate pruning approach. Given the candidate k-sequences and a small sample database randomly drawn from the original database, we use the local support of the candidate k-sequences in the sample database to estimate which candidate k-sequences are potentially frequent. Due to the privacy requirement, we have to add noise to the local support of each candidate k-sequence. To estimate which candidate sequences are potentially frequent accurately, we propose a gap-aware sequence shrinking method to effectively reduce the amount of added noise. The gap-aware sequence shrinking method consists of three novel schemes, which can enforce the length constraint on sequences while effectively preserving frequency information under gap constraints (see Section 5.1). Moreover, to calibrate the amount of noise required by differential privacy, we propose a gap-aware sensitivity computation method. For a specific gap, we utilize a tree structure to obtain the maximal number of subsequences in a length-limited sequence, and efficiently calculate the sensitivity of the local support computations in the sample database (see Section 5.2). Furthermore, to decrease the probability of misestimating frequent sequences as infrequent, we propose a threshold relaxation method. We theoretically quantify the relationship between the decrement of the threshold and the probability of misestimating frequent sequences as infrequent. By using the threshold relaxation method, we relax the user-specified threshold for the sample database to estimate which candidate sequences are potentially frequent (see Section 5.3).

Algorithm 1 shows the main steps of our sampling-based candidate pruning approach. Specifically, given a sample database, we first transform the sample database to enforce the length constraint (line 1). Sequences whose lengths violate the length constraint are shrunk using our gap-aware sequence shrinking method. Then, we calculate the sensitivity of the local support computations of candidate sequences in the transformed sample database (line 2). Next, we relax the threshold for the sample database to estimate which candidate sequences are potentially frequent (line 3). At last, we compute the noisy support of each candidate sequence in the transformed sample database and regard the candidate sequences whose noisy supports exceed the new threshold as potentially frequent sequences (line 4).

5.1 Gap-Aware Sequence Shrinking

In the sample database, we use the local support of candidate sequences to estimate which candidate sequences are potentially frequent. Due to the privacy requirement, we have to add noise to the local support of each candidate sequence. To make the estimation more accurate, the added noise should be as small as possible.

In differentially private frequent itemset mining, enforcing a length constraint on transactions has been proposed as an effective way to reduce the amount of added noise. By limiting the length of transactions, even if a transaction is arbitrarily changed, the number of affected itemsets is restricted, and thus the sensitivity of computing the support of itemsets can be reduced. However, the existing method for limiting the length of transactions is designed for itemsets [11]. It cannot be applied to limit the length of sequences.

To this end, we introduce our gap-aware sequence shrinking method. When we estimate the potentially frequent k-sequences, the candidate k-sequences contained in each sequence S can be considered as the frequency information in S. Ideally, when we shrink S, we want to get a new sequence which meets the length constraint and preserves as much frequency information as possible. In the following, we use Contain_k(S) to denote the set of candidate k-sequences contained in a sequence S.

In our gap-aware sequence shrinking method, we propose three effective schemes, namely, irrelevant item deletion, consecutive patterns compression and sequence reconstruction. The irrelevant item deletion and consecutive patterns compression schemes can effectively shorten sequences without causing any frequency information loss.

5.1.1 Irrelevant Item Deletion

Given a sequence, it is clear that the items not contained in any candidate k-sequence do not contribute to the frequencies of the candidate k-sequences. We call such items irrelevant items. To get some insight into potential approaches for shrinking sequences, consider the following example.

Example 5.1. Suppose the candidate two-sequences are C₂ = {ab, bc, bd}. Consider sequences T = abed and T′ = abd. Here, T′ is obtained from T by removing the irrelevant item e. When the maximum gap g = 1, T′ contains the same candidate two-sequences as T. However, if g = 0, the candidate two-sequence bd is not contained in the original sequence T but is contained in the new sequence T′.

Based on the observation of Example 5.1, perhaps contrary to expectation, we cannot simply delete irrelevant items from a sequence. Whether an irrelevant item can be deleted or not relies on the value of the maximum gap g. Since the irrelevant items in a sequence do not contribute to the frequencies of candidate sequences, we replace them by a special symbol ⊥, called the irrelevant symbol. Continuing from Example 5.1, sequence T = abed can be written as ab⊥d. Moreover, we use a compact form to represent consecutive irrelevant symbols. For instance, sequence ab⊥⊥⊥c can be written as ab⊥³c.

In our irrelevant item deletion scheme, we propose a number of approaches that shrink sequences by deleting irrelevant symbols without losing frequency information. Our first approach, termed Prefix/Suffix Symbol Deletion, removes prefixes and suffixes of irrelevant symbols. Clearly, the irrelevant symbols at the front (end) of a sequence do not carry any useful information, as no candidate sequence starts (ends) with irrelevant items. Thus, we can safely delete the prefixes (suffixes) of irrelevant symbols.

Lemma 1 (Prefix/Suffix Symbol Deletion). For sequence ⊥^{x₁}T⊥^{x₂}, we have

$\mathrm{Contain}_k(\bot^{x_1} T \bot^{x_2}) = \mathrm{Contain}_k(T).$

For a given sequence, irrelevant symbols might appear consecutively in it. In gap-constrained FSM, we only need to consider the subsequences which can be generated by skipping no more than g consecutive irrelevant symbols. Thus, if j > g + 1, we can replace a run of j consecutive irrelevant symbols by g + 1 irrelevant symbols.

Lemma 2 (Consecutive Symbol Deletion). For sequence T₁⊥^{j}T₂, where j > g + 1, we have

$\mathrm{Contain}_k(T_1 \bot^{j} T_2) = \mathrm{Contain}_k(T_1 \bot^{g+1} T_2).$

For a sequence T, after applying the prefix/suffix symbol deletion and consecutive symbol deletion, it can be represented in the form of T₁⊥^{g+1}T₂⊥^{g+1}···⊥^{g+1}T_m. Since T₁, T₂, ..., T_m are separated by sufficiently many irrelevant symbols, every candidate sequence can be contained in one or more of the T_i but cannot span multiple T_i's. If subsequences T_{i₁}, ..., T_{i_j} contain the same candidate k-sequences, it is sufficient to keep only one of these subsequences.

Lemma 3 (Subsequence Deletion). Suppose sequences T = T₁⊥^{g+1}···⊥^{g+1}T_m and S = S₁⊥^{g+1}···⊥^{g+1}S_u. For each subsequence T_i, there exists a subsequence S_j such that Contain_k(T_i) = Contain_k(S_j). Moreover, for any two subsequences S_p and S_q (p ≠ q), Contain_k(S_p) ≠ Contain_k(S_q). Then, we have

$\mathrm{Contain}_k(T) = \mathrm{Contain}_k(S).$

Reconsider sequence T₁⊥^{g+1}···⊥^{g+1}T_m. For each subsequence T_i, the distance between its first and last items is |T_i| − 1. If |T_i| − 1 ≤ g + 1, every two items in T_i satisfy the gap constraint. In this case, we can safely delete all the irrelevant symbols in T_i.

Lemma 4 (Inside Symbol Deletion). Let I be the set of items not contained in any candidate k-sequence. Consider sequences T = T₁⊥^{g+1}···⊥^{g+1}T_m and T′ = T′₁⊥^{g+1}···⊥^{g+1}T′_m. For i from 1 to m, if |T_i| ≤ g + 2, T′_i is obtained by removing the items in I from T_i; otherwise, T′_i = T_i. Then, we have

$\mathrm{Contain}_k(T) = \mathrm{Contain}_k(T').$
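The first two deletion rules are easy to state in code. The following is a minimal sketch (not from the paper) of Lemmas 1 and 2, using an ASCII placeholder for the irrelevant symbol ⊥ and assuming the set of items appearing in the candidate k-sequences has been precomputed.

```python
IRR = "_"  # ASCII stand-in for the irrelevant symbol

def delete_irrelevant_symbols(seq, relevant_items, g):
    # Mark items occurring in no candidate k-sequence as irrelevant.
    marked = [x if x in relevant_items else IRR for x in seq]
    # Lemma 1: drop prefixes and suffixes of irrelevant symbols.
    while marked and marked[0] == IRR:
        marked.pop(0)
    while marked and marked[-1] == IRR:
        marked.pop()
    # Lemma 2: cap every run of irrelevant symbols at g + 1.
    out, run = [], 0
    for x in marked:
        run = run + 1 if x == IRR else 0
        if x != IRR or run <= g + 1:
            out.append(x)
    return out

# With candidate two-sequences {ab, bc, bd}, item e is irrelevant:
# "abed" becomes ['a', 'b', '_', 'd'], matching the ab⊥d form in the text.
print(delete_irrelevant_symbols("abed", {"a", "b", "c", "d"}, 0))
```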

In the irrelevant item deletion scheme, based on the downward closure property, if an item is irrelevant for candidate k-sequences, it is also irrelevant for candidate sequences of length larger than k. This means a sequence might contain more irrelevant items as the length of candidate sequences increases. Thus, the irrelevant item deletion scheme is more effective in shrinking sequences when we estimate long frequent sequences.

In gap-constrained FSM, based on consecutive symbol deletion, we can replace a group of irrelevant symbols by a small number of symbols if the maximum gap g is small. Thus, consecutive symbol deletion is particularly effective when g is small. In addition, inside symbol deletion can be applied to a sequence when g is sufficiently large. Therefore, inside symbol deletion is more effective in shrinking sequences when g is large.

5.1.2 Consecutive Patterns Compression

A sequence might contain a certain pattern which appears consecutively. For example, pattern ab appears consecutively in sequence ababab. Clearly, the k items in a candidate k-sequence can come from at most k occurrences of the pattern. Therefore, for the estimation of potentially frequent k-sequences, we can compress j consecutive occurrences of a pattern into k occurrences without causing any frequency information loss, where j > k.

Lemma 5 (Consecutive Patterns Compression). Let p^k denote pattern p appearing consecutively k times, and let T₁|T₂|T₃ be the sequence which is the concatenation of sequences T₁, T₂ and T₃. Then, for sequence T₁|p^j|T₂, where j > k, we have

$\mathrm{Contain}_k(T_1 | p^{j} | T_2) = \mathrm{Contain}_k(T_1 | p^{k} | T_2).$

In the consecutive patterns compression scheme, when we estimate short frequent sequences, we can use a small number of pattern occurrences to replace the original consecutive patterns. Thus, the consecutive patterns compression scheme is particularly effective in shrinking sequences when we estimate short frequent sequences. In addition, as it is computationally expensive to discover consecutive patterns of all possible lengths, we only compress consecutive patterns where each pattern contains no more than three items. In our experiments, we found that this setting already leads to a good compression of sequences.

In gap-constrained FSM, for a candidate k-sequence cs, a sequence T and a maximum gap g, if cs ⊑_g T, the distance between the first and last items selected from T is at most (g+1)·(k−1). Thus, we can further compress k consecutive patterns into m consecutive patterns, where m is the smallest integer such that (m−1)·|p| + 1 > (g+1)·(k−1) + 1.

Lemma 6 (Consecutive Patterns Compression with Gap). Given a maximum gap g, for sequence T₁|p^k|T₂, we have

$\mathrm{Contain}_k(T_1 | p^{k} | T_2) = \mathrm{Contain}_k(T_1 | p^{m} | T_2),$

where $m = 1 + \lceil \frac{(g+1)(k-1)}{|p|} \rceil$ and |p| is the length of pattern p.

We can see that, when the maximum gap is small, the distance between the first and last items selected from a sequence to generate a candidate sequence is short. Thus, consecutive patterns compression with gap is more effective when the gap is small. Notice that, since we only consider patterns containing no more than three items in practice, consecutive patterns compression with gap is only used when g < 3 in our experiments.
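A minimal sketch of Lemmas 5 and 6 (not from the paper): it caps a run of j consecutive occurrences of a pattern p at min(k, m) occurrences, combining the two caps, with m the bound from Lemma 6; sequences and patterns are modeled as strings.

```python
import math

def compress_consecutive_pattern(seq, p, k, g):
    # Lemma 5 caps a run of pattern p at k occurrences; Lemma 6 tightens
    # the cap to m = 1 + ceil((g+1)(k-1) / |p|) when m is smaller.
    m = 1 + math.ceil((g + 1) * (k - 1) / len(p))
    cap, plen = min(k, m), len(p)
    out, i = "", 0
    while i < len(seq):
        j = 0                                     # count repetitions of p at i
        while seq[i + j * plen : i + (j + 1) * plen] == p:
            j += 1
        if j == 0:
            out += seq[i]
            i += 1
        else:
            out += p * min(j, cap)                # keep at most `cap` copies
            i += j * plen
    return out

# ababab with p = ab, k = 2, g = 0: m = 1 + ceil(1/2) = 2, so the run
# of three ab's is compressed to abab.
print(compress_consecutive_pattern("ababab", "ab", 2, 0))
```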

5.1.3 Sequence Reconstruction

The irrelevant item deletion and consecutive patterns compression schemes can effectively shrink sequences without incurring any frequency information loss. However, after applying these two schemes, some sequences might still violate the length constraint. We have to delete some items from these sequences until they meet the constraint. Obviously, if we randomly delete items, much of the frequency information in the original sequences will be lost, which leads to inaccurate estimations of potentially frequent sequences.

Intuitively, when we shrink a sequence, we want to preserve as many candidate sequences as possible. Formally, given the maximal length constraint l_max, the set of candidate k-sequences C_k and a sequence S (|S| > l_max), we aim to construct a new sequence S̄ (|S̄| ≤ l_max), such that the number of common candidate k-sequences contained in both S and S̄ is maximized. A naive approach is to enumerate all possible sequences of length l_max and find the sequence which contains the maximum number of common candidate sequences with S. However, this approach suffers from exponential time complexity. If there are |I| items, for example, we have to enumerate |I|^{l_max} sequences, which is computationally infeasible in practice.

To this end, we put forward a sequence reconstruction scheme. The main idea of this scheme is to construct a new sequence by iteratively appending items which add the maximum number of candidate sequences contained in the original sequence into the new sequence, until the newly constructed sequence reaches the length constraint.

In our sequence reconstruction scheme, to efficiently find the candidate sequences contained in a given sequence, we design a candidate sequence tree (referred to as a CS-tree). In particular, given the candidate k-sequences, the CS-tree groups the candidate k-sequences with identical prefixes into the same branch. The height of the CS-tree is equal to k. Each node in the CS-tree is labeled by an item and associated with the sequence of items from the root to it. For example, Fig. 1 shows the CS-tree of the candidate three-sequence set {abc, bcd, bda, bdb}.

For the CS-tree constructed based on the candidate k-sequences, the nodes at level k are associated with candidate k-sequences. We call the nodes at level k c-nodes. In addition, every node at level k − 1 is associated with a (k−1)-sequence which can generate a candidate k-sequence by appending an item to its end. We call the nodes at level k − 1 g-nodes, and call their associated sequences generating sequences.
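Structurally, the CS-tree is a trie over the candidate k-sequences. The following is a minimal sketch (not the paper's implementation) using nested dictionaries, where nodes at depth k are the c-nodes and nodes at depth k − 1 are the g-nodes.

```python
def build_cs_tree(candidates):
    # Build a trie grouping candidate k-sequences by common prefix.
    root = {}
    for cs in candidates:
        node = root
        for item in cs:
            node = node.setdefault(item, {})
    return root

# The candidate three-sequences of Fig. 1: abc shares no prefix with the
# b-branch, while bcd, bda and bdb share the prefix b.
tree = build_cs_tree(["abc", "bcd", "bda", "bdb"])
# {'a': {'b': {'c': {}}}, 'b': {'c': {'d': {}}, 'd': {'a': {}, 'b': {}}}}
```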

Our sequence reconstruction scheme is shown in Algorithm 2. It has a time complexity of O(l_max · k · |C_k|²), where |C_k| is the number of candidate k-sequences and l_max is the maximal length constraint. In this scheme, given the CS-tree CT constructed based on C_k and a sequence S (|S| > l_max), we first find the candidate k-sequences C̃_k which are k-g-subsequences of S. Then, we use C̃_k to build a new CS-tree subCT (line 1). In the following, we show how to construct a new sequence S̄ by using tree subCT.

Specifically, our first step is to append a candidate k-sequence in C̃_k to S̄. We show how to select such a candidate k-sequence as follows. For a sequence T = i₁i₂...i_|T| and maximum gap g, the (k−1)-g-subsequences of T might be the generating sequences of candidate k-sequences. For example, in Fig. 1, for sequence bcd, if g = 1, its two-g-subsequence bd is the generating sequence of candidate sequences bda and bdb. For sequence T, based on subCT, we can efficiently find its (k−1)-g-subsequences which are the generating sequences of candidate k-sequences in C̃_k. Due to the gap constraint, only the generating sequences whose last items have indexes larger than |T| − g can generate a k-g-subsequence with the newly appended item of index |T| + 1. We refer to such generating sequences as valid generating sequences of sequence T.

Fig. 1. A simple candidate sequence tree.

For each candidate sequence cs, its valid generating sequences correspond to a set of g-nodes in subCT. For each child node of these g-nodes, its associated candidate sequence can be added into cs by appending its labeled item to cs. If multiple child nodes are labeled by the same item, then by appending this item, their associated candidate sequences can be added into cs simultaneously. The more child nodes labeled by the same item, the more candidate sequences can be added into cs by appending one item. Thus, for each candidate sequence cs in C̃_k, we first utilize subCT to find the g-nodes whose associated generating sequences are valid generating sequences of cs (line 3). Then, among the child nodes of these g-nodes, we find the item that labels the most child nodes and use that count as cs's score. We refer to such an item as the a-item and such a score as the a-score. We select the candidate sequence with the highest a-score and append this candidate sequence and its a-item to S̄ (line 6). If multiple candidate sequences have the same a-score, the candidate sequence and a-item which contain more distinct items will be appended. This is because such a sequence can produce more different subsequences, and thus has more chances to generate further candidate sequences. Besides, to keep track of the candidate sequences still not contained in S̄, we remove the c-nodes corresponding to the candidate sequences newly added into S̄ from subCT (line 7).

After appending a candidate sequence and its a-item to S̄, we use subCT to find all the valid generating sequences of S̄. Let GN denote the set of g-nodes corresponding to these valid generating sequences. Depending on whether the g-nodes in GN have child nodes, we append items to S̄ in the following two manners.

Algorithm 2. Sequence Reconstruction

Input: Sequence S; CS-tree CT constructed based on candidate k-sequences C_k; maximum gap g; maximal length constraint l_max.
Output: Sequence S̄.

1: subCT ← build_CSTree(CT, S);
2: for each candidate sequence cs in S do
3:   cs.GN ← find_GNodes(cs, g, subCT);
4:   cs.a-score ← max_item_in_childNodes(cs.GN);
5: end for
6: s ← select a candidate sequence;
7: insert(S̄, s + s.a-item); update_Tree(subCT, s, s.a-item);
8: while |S̄| < l_max do
9:   GN ← find_GNodes(S̄, g, subCT);
10:  if GN.childNodeNum() ≠ 0 then
11:    for each g-node gn in GN do
12:      Item_GNode_Map ← map_item_g-node(gn);
13:    end for
14:    item i ← find_item(Item_GNode_Map);
15:    insert(S̄, i); update_Tree(subCT, i, Item_GNode_Map);
16:  else
17:    for each c-node cn in subCT do
18:      cs ← cn's associated sequence;
19:      cs.GN ← find_GNodes(cs, g, subCT);
20:      cs.a-score ← max_item_in_childNodes(cs.GN);
21:    end for
22:    s ← select a candidate sequence;
23:    insert(S̄, s + s.a-item); update_Tree(subCT, s, s.a-item);
24:  end if
25: end while
26: return S̄;

On the one hand, if the g-nodes in GN have child nodes, we will append an item to S̄. As before, for each child node of the g-nodes in GN, its associated candidate sequence can be added into S̄ by appending its labeled item. Therefore, to add the maximum number of candidate sequences into S̄, we append the item which appears in the most child nodes of the g-nodes in GN (line 14). After that, we also remove the c-nodes corresponding to these newly added candidate sequences from subCT (line 15).

On the other hand, if the g-nodes in GN do not have any child node, no candidate sequence can be added into S̄ by appending only one item. Instead, we will select and append a candidate k-sequence which is in C̃_k but not contained in S̄. Recall that, when a candidate sequence is added into S̄, its corresponding c-node is removed from subCT. Hence, the candidate sequences in C̃_k but not in S̄ are the associated sequences of the c-nodes remaining in subCT. For each of these sequences, we compute its a-score (line 20). We select the candidate sequence with the highest a-score and append this candidate sequence and its a-item to S̄ (line 22). If multiple candidate sequences have the same a-score, we append the candidate sequence and a-item which contain more distinct items. After that, we remove the c-nodes corresponding to the newly added candidate sequences from subCT (line 23).

Gap-Aware Sequence Shrinking Method. In summary, given the candidate k-sequences and a sequence S whose length exceeds the length constraint, we shrink S by the following steps. We first shrink S by using the irrelevant item deletion scheme. Then, we iteratively compress consecutive patterns in S in order of increasing pattern length. At last, if S still violates the length constraint, we use the sequence reconstruction scheme to construct a new sequence.

In our gap-aware sequence shrinking method, the irrelevant item deletion and consecutive patterns compression schemes do not change the support of candidate sequences. The sequence reconstruction scheme constructs a new sequence based on the candidate sequences contained in the original sequence. On the one hand, due to the length constraint, not all the candidate sequences contained in the original sequence can be added into the newly constructed sequence. Thus, the support of some candidate sequences might decrease. On the other hand, as we append items into the new sequence, some candidate sequences not contained in the original sequence might be added into the new sequence. Thus, the sequence reconstruction scheme might also cause the support of some candidate sequences to increase.

In our gap-aware sequence shrinking method, the irrel-evant item deletion and consecutive patterns compressionschemes can effectively shrink sequences. After shortenby these two schemes, only a small number of inputsequences still violate the length constraint and need tobe transformed by our sequence reconstruction scheme.Let ID, CC and SR denote the irrelevant item deletion,consecutive patterns compression and sequence recon-struction, respectively. The percentages of input sequen-ces shorten by these three schemes and the timeconsumed in these three schemes are shown in Table 1.We can see our gap-aware sequence shrinking methoddoes not consume too much time. In Section 7, we alsoshow that our PFS2 algorithm gains comparable efficiencyto the non-private FSM algorithm GSP [5].


5.2 Gap-Aware Sensitivity Computation

In the sample database, given the candidate k-sequences Ck, we enforce the length constraint lmax to reduce the amount of noise added to the local support of candidate k-sequences. The sensitivity of the local support computations is the maximum change in the local supports of candidate k-sequences when a sequence of length lmax is arbitrarily changed. That is, the sensitivity is equal to the maximal number of candidate k-sequences contained in a sequence of length lmax. To compute this sensitivity, a simple method is to enumerate all possible sequences of length lmax and find the sequence which contains the maximal number of candidate k-sequences. However, this method suffers from exponential time complexity, which is computationally infeasible.

In this paper, we utilize a safe approximation method which computes an upper bound on the sensitivity of the local support computations, and uses this upper bound to perturb the support of each candidate k-sequence. Suppose the maximum gap is g and the set of candidate k-sequences is Ck. Let vk be the maximal number of k-g subsequences contained in a sequence of length lmax. Since a single sequence of length lmax can affect the support of at most min{vk, |Ck|} candidate k-sequences by one, the sensitivity of the local support computations is no greater than min{vk, |Ck|}.

In gap-constrained FSM, as there is a gap constraint between adjacent items selected from a sequence, the maximal number of k-g subsequences contained in a length-limited sequence (i.e., vk) depends on the value of the gap g. It is hard to directly obtain vk for a specific gap g, and simply using the number of all combinations of items as an upper bound on vk would add much more noise than required.

To this end, we introduce our gap-aware sensitivity computation method. In this method, given a specific gap g and a sequence S = i1, ..., i_lmax, we can efficiently compute the maximal number of k-g subsequences contained in S. In particular, when the gap g = 0, it is clear that the maximal number of k-g subsequences contained in S is vk = lmax − k + 1. In addition, as the distance between the first and last items in S is dis = lmax − 1, if g is no smaller than dis − 1, every two items in S satisfy the gap constraint. Thus, when g ≥ lmax − 2, the maximal number of k-g subsequences contained in S is vk = C(lmax, k), the number of ways to choose k of the lmax positions.

When 0 < g < lmax − 2, to efficiently calculate the maximal number of k-g subsequences contained in sequence S, we propose a subsequence expanding tree, called SE-tree. In particular, each node in the SE-tree is labeled by an item and associated with the sequence of items on the path from the root to it. In the first level of the SE-tree, there are lmax nodes, and the jth node nj represents item ij in S. Due to the gap constraint, only items i_{j+1}, i_{j+2}, ..., i_{min{j+1+g, lmax}} can be the next item of ij in a subsequence of S. We insert these min{g + 1, lmax − j} items as the child nodes of nj. Each node in the SE-tree is expanded in a similar way. Then, the maximal number of k-g subsequences contained in S is the number of nodes at the kth level. Fig. 2 shows the SE-tree of a sequence of length l.

For a node nc at level k (k ≥ 2), suppose its labeled item is ic, its parent node is np, and the labeled item of np is ip. Due to the gap constraint, the distance between ic and ip is at most g + 1. Thus, for each node labeled by ic, its parent node is labeled by an item whose distance from ic is no greater than g + 1. Then, the number of nodes labeled by ic at level k is the sum of the numbers of nodes labeled by items i_{max{1, c−g−1}}, ..., i_{c−1} at level k − 1.

Based on the above analysis, we can derive the relationship between the number of nodes at level k − 1 and the number of nodes at level k. In particular, there are clearly no nodes labeled by i1 at level k (k ≥ 2). Moreover, for 1 < j ≤ g + 1, the number of nodes labeled by ij at level k is the sum of the numbers of nodes labeled by items i1, ..., i_{j−1} at level k − 1. Furthermore, for j > g + 1, the number of nodes labeled by ij at level k is the sum of the numbers of nodes labeled by items i_{j−g−1}, ..., i_{j−1} at level k − 1. Let $a_k^m$ be the number of nodes labeled by im at level k. Then, we have

$$a_k^j = \begin{cases} 0 & \text{if } j = 1 \\ \sum_{m=1}^{j-1} a_{k-1}^m & \text{if } 1 < j \le g+1 \\ \sum_{m=j-g-1}^{j-1} a_{k-1}^m & \text{if } j > g+1 \end{cases} \qquad (1)$$

Based on Equation (1), the number of nodes at level k (i.e., the maximal number of k-g subsequences in a sequence of length lmax) is

$$v_k = \sum_{j=1}^{l_{max}} a_k^j. \qquad (2)$$

Actually, in our PFS2 algorithm, we do not maintain the whole SE-tree during the mining process. We expand the nodes in the SE-tree level by level, and only keep the number of nodes labeled by each item at the current level. Then, based on Equations (1) and (2), we can efficiently compute the maximal number of subsequences of different lengths contained in a length-limited sequence in an incremental manner.
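To make the recurrence concrete, the following Java sketch (ours; it assumes levels are expanded exactly as Equations (1) and (2) prescribe) computes vk level by level while storing only the per-item node counts of the current level:

```java
import java.util.Arrays;

public class GapAwareCount {

    /** Returns v_k, the maximal number of k-g subsequences contained in a
     *  sequence of length lmax under maximum gap g. */
    static long maxSubsequences(int lmax, int k, int g) {
        long[] a = new long[lmax + 1];            // a[j]: #nodes labeled i_j
        Arrays.fill(a, 1, lmax + 1, 1L);          // level 1: one node per item
        for (int level = 2; level <= k; level++) {
            long[] next = new long[lmax + 1];     // next[1] stays 0, Eq. (1)
            for (int j = 2; j <= lmax; j++) {
                // Parents of a node labeled i_j sit at positions
                // max(1, j - g - 1) .. j - 1 of the previous level (Eq. (1)).
                for (int m = Math.max(1, j - g - 1); m <= j - 1; m++) {
                    next[j] += a[m];
                }
            }
            a = next;
        }
        long vk = 0;                              // Eq. (2): sum over all j
        for (int j = 1; j <= lmax; j++) vk += a[j];
        return vk;
    }

    public static void main(String[] args) {
        // Sanity checks against the closed forms stated earlier:
        System.out.println(maxSubsequences(10, 3, 0)); // 8   = lmax - k + 1
        System.out.println(maxSubsequences(10, 3, 8)); // 120 = C(10, 3)
    }
}
```

The sensitivity used for perturbation is then min{vk, |Ck|}, as described above.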

Gap-Aware Sensitivity Computation Method. In this method, when we compute the local support of the candidate k-sequences Ck in the transformed sample database, we first calculate the maximal number vk of k-g subsequences contained in a sequence of length lmax. Then, we select the smaller of |Ck| and vk as the sensitivity of the local support computations and use it to perturb the local support of each candidate k-sequence.

TABLE 1
Statistical Information

Dataset        Threshold   ID+CC (%)   ID+CC (s)   SR (%)    SR (s)
MSNBC          0.015       14.32%      0.59 s      4.88%     3.0 s
BIBLE          0.14        14.6%       0.1 s       0.2%      0.23 s
House_Power    0.34        100%        0.4 s       28.99%    3.9 s

Fig. 2. A simple subsequence expanding tree.


5.3 Threshold Relaxation

In the sample database, the local supports of candidate sequences are used to estimate which candidate sequences are potentially frequent. However, the sample database is drawn randomly from the original database, which causes many frequent sequences to become infrequent in the sample database. Moreover, the local supports of candidate sequences are also affected by the noise added to guarantee differential privacy and by the information loss caused by limiting the length of sequences. Thus, simply using the user-specified threshold in the sample database leads to many misestimations.

Due to the downward closure property, if a frequent sequence is misestimated as infrequent, its supersequences are all regarded as infrequent without their supports even being computed, which drastically reduces the utility of the results. Thus, the threshold should be relaxed to avoid this problem. However, it is hard to quantify how much the threshold should be decreased. In particular, if the threshold is decreased too little, many frequent sequences might still be misestimated as infrequent; if the threshold is decreased too much, only a few candidate sequences can be pruned.

To this end, we theoretically quantify the relationship between the decrement of the threshold and the probability of misestimating frequent sequences as infrequent. To quantify this relationship, we start from the following problem: given a candidate k-sequence t, what is the relationship between its noisy support in the sample database db and its actual support in the original database D?

Let sup_t and sup′_t be the actual supports of t in D and db, respectively. Suppose D contains |D| input sequences. If we randomly choose an input sequence S from D, the probability that S contains t can be regarded as f_t = sup_t/|D|. As the sequences in db are drawn randomly and independently from D, we can approximately treat the probability of each sequence in db containing t as constant and equal to f_t. Then, sup′_t follows a binomial distribution:

$$sup'_t \sim B(|db|, f_t).$$

It is known that, if the number of trials is large, the binomial distribution is approximately normal. We assume the number of sequences in db is large enough, so that sup′_t can be approximated by a normal distribution:

$$sup'_t \sim \mathcal{N}\big(f_t \cdot |db|,\; f_t \cdot (1 - f_t) \cdot |db|\big).$$

Due to the privacy requirement, in the sample database, we add noise to the local support of candidate sequences. As discussed in Section 5.1, we enforce the length constraint on the sample database to reduce such noise, which might change the support of some candidate sequences. In our experiments, however, we found that our irrelevant item deletion and consecutive patterns compression schemes can effectively shorten the length of sequences, and only a few sequences need to be transformed by our sequence reconstruction scheme. That is, our gap-aware sequence shrinking method does not introduce many errors. Thus, in the analysis of the relationship between t's actual support in D and t's noisy support in db, we assume our gap-aware sequence shrinking method does not incur frequency information loss.

In our PFS2 algorithm, we employ the Laplace mechanism to perturb the supports of sequences. Suppose the sensitivity of computing the local support of candidate k-sequences in db is Δ and the privacy budget assigned to these computations is ε. Then, the noise added to the local support of each candidate k-sequence in db is drawn independently at random from a Laplace distribution with scale parameter v = Δ/ε. Let η denote the amount of noise added to sup′_t. Then, η follows a Laplace distribution:

$$\eta \sim Laplace(0, v).$$

The noisy support of sequence t in db, n-sup_t, can be

regarded as the linear combination sup′_t + η. As the support of a sequence is unrelated to the amount of noise, sup′_t and η are two independently distributed random variables. It is proved in [24] that, if X and Y are independent random variables having Normal(μ, σ²) and Laplace(λ, φ) distributions, respectively, the cumulative distribution function (i.e., cdf) of Z = X + Y can be expressed as

$$\begin{aligned} F(z) = {} & \frac{1}{2}\operatorname{erfc}\!\left(\frac{\lambda+\mu-z}{\sqrt{2}\,\sigma}\right) \\ & - \frac{1}{4}\exp\!\left(\frac{\lambda+\mu-z}{\varphi}+\frac{\sigma^2}{2\varphi^2}\right)\operatorname{erfc}\!\left(\frac{\lambda+\mu-z}{\sqrt{2}\,\sigma}+\frac{\sigma}{\sqrt{2}\,\varphi}\right) \\ & + \frac{1}{4}\exp\!\left(\frac{z-\lambda-\mu}{\varphi}+\frac{\sigma^2}{2\varphi^2}\right)\operatorname{erfc}\!\left(\frac{z-\lambda-\mu}{\sqrt{2}\,\sigma}+\frac{\sigma}{\sqrt{2}\,\varphi}\right), \end{aligned}$$

where erfc is the complementary error function

$$\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}}\int_x^{\infty} \exp(-y^2)\,dy.$$

Therefore, given the actual support of the candidate sequence t in D, by setting the corresponding parameters μ, σ, λ and φ, we can obtain the cdf of t's noisy support in db.
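For illustration, the following Java sketch (ours, not code from the paper) evaluates this cdf using the erfc implementation in Apache Commons Math; the numeric values in main are hypothetical placeholders:

```java
import org.apache.commons.math3.special.Erf;

/** Cdf of Z = X + Y, X ~ Normal(mu, sigma^2), Y ~ Laplace(lambda, phi),
 *  following the formula reconstructed above. */
public class NormalPlusLaplaceCdf {

    static double cdf(double z, double mu, double sigma, double lambda, double phi) {
        double x = lambda + mu - z;
        double c = sigma * sigma / (2 * phi * phi);   // sigma^2 / (2 phi^2)
        double s2 = Math.sqrt(2) * sigma;
        double sp = sigma / (Math.sqrt(2) * phi);
        // The exp(...) * erfc(...) products can overflow far out in the
        // tails; this naive form is adequate near the threshold region.
        return 0.5 * Erf.erfc(x / s2)
             - 0.25 * Math.exp(x / phi + c) * Erf.erfc(x / s2 + sp)
             + 0.25 * Math.exp(-x / phi + c) * Erf.erfc(-x / s2 + sp);
    }

    public static void main(String[] args) {
        // Hypothetical setting of Section 5.3's parameters: mu = f * |db|,
        // sigma^2 = f * (1 - f) * |db|, lambda = 0, phi = Delta / epsilon.
        double f = 0.015, db = 60000, phi = 40 / 0.1;
        double mu = f * db, sigma = Math.sqrt(f * (1 - f) * db);
        System.out.println(cdf(mu, mu, sigma, 0.0, phi)); // 0.5 by symmetry
    }
}
```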

After deriving the relationship between the noisy support of a sequence in the sample database and its actual support in the original database, we compare the probabilities of estimating sequences with different supports as infrequent.

Theorem 5. Let t1 and t2 be two candidate k-sequences. Suppose, in the original database D, t1's actual support sup1 is smaller than t2's actual support sup2. Then, in the sample database db, the probability of estimating t1 as infrequent is larger than the probability of estimating t2 as infrequent.

Proof. See Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2016.2601106. □

Based on the above analysis, we can derive the relationship between the decrement of the threshold in the sample database and the probability of misestimating frequent sequences as infrequent. Let t_u be a k-sequence whose actual support in the original database is equal to the user-specified threshold u. We first compute the cdf F_{t_u} of the noisy support of t_u in the sample database db. Suppose, after relaxation, the threshold used in db is u′. Then, the probability of estimating t_u as infrequent (i.e., the probability that t_u's noisy support in db is smaller than u′) is F_{t_u}(u′).


Since the actual support of any frequent k-sequence is not smaller than u, based on Theorem 5, the probability of misestimating a frequent k-sequence as infrequent is not larger than F_{t_u}(u′).

Threshold Relaxation Method. In this method, given the sample database db_k and the candidate k-sequences Ck, we first assume there is a k-sequence t_k whose actual support in the original database is exactly the user-specified threshold. Then, we compute the cdf F_{t_k} of the noisy support of t_k in db_k. We set F_{t_k}(u′) = z and compute the corresponding u′ (as sketched below). In the sample database db_k, we use u′ to estimate which candidate sequences are potentially frequent. We refer to z as the relaxation parameter. In our experiments, we find that setting z to 0.3 typically obtains good results (see Section 7.5).

6 PFS2 ALGORITHM

6.1 PFS2 Algorithm Description

Our PFS2 algorithm is shown in Algorithm 3. It consists of two phases. In the pre-mining phase, we extract some statistical information from the database. Then, in the mining phase, we privately find frequent sequences in order of increasing length. Our sampling-based candidate pruning technique is used in the second phase. Besides, we divide the total privacy budget ε into five portions ε1, ..., ε5. Specifically, ε1, ε2 and ε3 are used in the pre-mining phase, ε4 is used in the sample databases to prune candidate sequences, and ε5 is used for the support computations in the original database. In the following, we

present the details of our PFS2 algorithm.

Pre-Mining Phase. In the pre-mining phase, we first compute the maximal length constraint lmax for the sample databases. Like [7], [11], we determine lmax in a heuristic way by setting lmax = min{l1, l2}. In particular, parameter l1 represents an upper bound on the length of input sequences. It determines a maximum value of the error introduced by the noise in support computations. Parameter l2, instead, is computed from the database. Let |D| be the total number of input sequences and a_i be the number of input sequences of length i. Starting from l2 = 1, we incrementally compute the percentage p = (Σ_{j=1}^{l2} a_j)/|D| until p is at least h (lines 3-6). Due to the privacy requirement, we add geometric noise G(ε1) to |D| and geometric noise G(ε2) to each a_i (1 ≤ i ≤ l2).
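As a sketch of this perturbation (ours; we assume G(ε) denotes the two-sided geometric distribution of [23], with Pr[δ] proportional to e^{-ε|δ|}), such noise can be drawn as the difference of two geometric variables:

```java
import java.util.Random;

public class GeometricNoise {
    private static final Random RNG = new Random();

    /** Draws two-sided geometric noise for a count of sensitivity 1. */
    static long sample(double epsilon) {
        double alpha = Math.exp(-epsilon);
        // The difference of two i.i.d. geometric variables with success
        // probability 1 - alpha is two-sided geometric with ratio alpha.
        return geometric(1 - alpha) - geometric(1 - alpha);
    }

    /** Number of failures before the first success, via inverse cdf. */
    private static long geometric(double p) {
        double u = 1.0 - RNG.nextDouble();        // u in (0, 1]
        return (long) Math.floor(Math.log(u) / Math.log(1 - p));
    }
}
```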

To better utilize the privacy budget, we also estimate the maximal length of frequent sequences Lf. Since we enforce the length constraint lmax on the sample databases, Lf should not be larger than lmax. To estimate Lf, we first compute b = {b1, ..., b_lmax}, where b_i is the maximal support of i-sequences. We add noise G(ε3/⌈log lmax⌉) to each b_i (line 8). Then, we set Lf to be the integer y such that b_y is the smallest value exceeding threshold u (line 9).

Mining Phase. In the mining phase, we first randomly

partition the original database D into Lf disjoint databases dbSet, each of which contains approximately |D|/Lf sequences (line 12). We use these disjoint databases as the sample databases to prune candidate sequences.

In the following, we privately find frequent sequences in order of increasing length. In particular, for the mining of frequent k-sequences, we first generate the candidate k-sequences Ck. If k = 1, the candidate sequences are all items. If k > 1, the candidate sequences are generated from the frequent (k−1)-sequences by using the downward closure property. Then, given the kth sample database dbk, we utilize algorithm Sampling-based_Candidate_Pruning (i.e., Algorithm 1) to prune unpromising candidate k-sequences (see Section 5). After that, we compute the noisy support of the remaining candidate k-sequences C′k on the original database D. Notice that we do not limit the length of input sequences in the original database. For each candidate k-sequence in C′k, we add Laplace noise Lap(|C′k|/ε′) to its support in D, where ε′ = ε5/Lf. The candidate k-sequences whose noisy supports on D exceed threshold u are output as frequent k-sequences. This process continues until all frequent sequences of lengths from 1 to Lf are found.
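This final perturbation step admits a similarly small sketch (ours; identifiers are illustrative), using standard inverse-cdf sampling for the Laplace distribution:

```java
import java.util.Random;

public class LaplaceSupport {
    private static final Random RNG = new Random();

    /** Samples Laplace(0, scale) by the inverse-cdf method. */
    static double laplace(double scale) {
        double u = RNG.nextDouble() - 0.5;        // u in [-0.5, 0.5)
        return -scale * Math.signum(u) * Math.log(1.0 - 2.0 * Math.abs(u));
    }

    /** Noisy support of one candidate in C'_k: sensitivity |C'_k|,
     *  per-level budget epsPrime = eps5 / Lf. */
    static double noisySupport(long support, int remainingCandidates, double epsPrime) {
        return support + laplace(remainingCandidates / epsPrime);
    }
}
```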

Algorithm 3. PFS2 Algorithm

Input: Original database D; threshold u; percentage h; privacy budgets ε1, ..., ε5; maximum gap g; maximal length constraint upper bound l1.
Output: Frequent sequences FS.

 1: /**** Pre-Mining Phase ****/
 2: |D| ← get the noisy number of total input sequences using ε1;
 3: for l2 = 1; p < h; l2++ do
 4:     a_{l2} ← get the noisy number of input sequences with length l2 using ε2;
 5:     p ← (Σ_{j=1}^{l2} a_j) / |D|;
 6: end for
 7: lmax ← min{l1, l2};
 8: b ← get the noisy maximal supports of sequences of lengths 1, ..., lmax using ε3;
 9: Lf ← estimate_max_frequent_sequence_length(u, b);
10: /**** Mining Phase ****/
11: FS ← ∅;
12: dbSet ← randomly_partition_database(D, Lf);
13: ε′ ← ε5 / Lf;
14: for k from 1 to Lf do
15:     if k == 1 then
16:         candidate set Ck ← all items in the alphabet;
17:     else
18:         candidate set Ck ← generate_candidates(FS_{k−1});
19:     end if
20:     C′k ← Sampling-based_Candidate_Pruning(Ck, dbk, ε4, u, g, lmax); /* see Algorithm 1 */
21:     FSk ← discover_frequent_sequences(C′k, D, ε′, u, g);
22:     FS ← FS ∪ FSk;
23: end for
24: return FS;

6.2 Privacy Analysis of PFS2 Algorithm

In this section, we give the privacy analysis of our PFS2 algorithm. Since the key step in our algorithm is pruning the candidate sequences based on the sample databases, we first prove that our pruning technique satisfies ε4-differential privacy.

Theorem 6. The sampling-based candidate pruning technique (i.e., Algorithm 1) guarantees ε4-differential privacy.

Proof. Given the candidate k-sequences Ck and the kth sample database dbk, we use the noisy local support of Ck in dbk to estimate which candidate k-sequences are


potentially frequent. To reduce the amount of added noise, we use our gap-aware sequence shrinking method to enforce the length constraint lmax on dbk. Then, based on the gap-aware sensitivity computation method, we obtain the sensitivity Δk of computing the local support of Ck in the transformed sample database db′k. Our gap-aware sensitivity computation method needs only the maximal length constraint lmax (see Section 5.2). As shown in Theorem 7, lmax is differentially private. Thus, our gap-aware sensitivity computation method does not breach the privacy. Overall, adding Laplace noise Lap(Δk/ε4) to the local support of Ck satisfies ε4-differential privacy in db′k.

In our gap-aware sequence shrinking method, the three schemes used to shrink a sequence do not use any other sequences; they rely only on Ck and lmax (see Section 5.1). Ck does not breach the privacy because it is generated based on the frequent (k−1)-sequences, and the frequent (k−1)-sequences are determined by their noisy supports. Thus, our gap-aware sequence shrinking method is a local transformation (see Def. 7 in [11]). Based on Theorem 7 in [11] and Theorem 5 in [7], as long as the transformation is local, applying an ε4-differentially private algorithm on db′k also guarantees ε4-differential privacy for dbk.

For our threshold relaxation method, only the number of sequences in the sample database has privacy implications (see Section 5.3). We approximately treat each sample database as containing |D|/Lf sequences, where |D| is the total number of input sequences and Lf is the number of sample databases (i.e., the maximal length of frequent sequences). As shown in Theorem 7, |D| and Lf are differentially private. Thus, our threshold relaxation method does not breach the privacy. In summary, our sampling-based candidate pruning technique (i.e., Algorithm 1) satisfies ε4-differential privacy. □

In the following, we prove that the PFS2 algorithm overall guarantees ε-differential privacy.

Theorem 7. The PFS2 algorithm (i.e., Algorithm 3) satisfies ε-differential privacy.

Proof. In the pre-mining phase of our algorithm, for the computation of |D| (i.e., the number of total input sequences), as adding (removing) an input sequence only affects |D| by 1, the sensitivity of this computation is 1. Thus, adding geometric noise G(ε1) in computing |D| satisfies ε1-differential privacy. Similarly, the sensitivity of computing a_i (i.e., the number of i-sequences) is 1. We add noise G(ε2) in computing a_i, which satisfies ε2-differential privacy. In the pre-mining phase, we compute a_1, ..., a_{l2}, such that (Σ_{j=1}^{l2} a_j)/|D| is larger than the input parameter h. Since a single input sequence can affect only one element among a_1, ..., a_{l2} (i.e., the sets of input sequences with different lengths are disjoint), based on the parallel composition property [2], the privacy budgets used in computing a_1, ..., a_{l2} do not need to accumulate. In addition, as the maximal length constraint lmax is estimated based on |D| and a_1, ..., a_{l2}, we can safely use lmax.

Besides, in the pre-mining phase, we also compute b = {b_1, ..., b_lmax}, where b_i is the maximal support of i-sequences. Since the elements in b are non-increasing, as shown in [11], adding noise G(ε3/⌈log lmax⌉) in computing b is ε3-differentially private. As the maximal length of frequent sequences Lf is estimated based on b, we can safely use Lf.

In the mining phase, we first randomly partition the original database into Lf disjoint sample databases. Then, given the candidate k-sequences, we use the kth sample database to prune them. Theorem 6 proves that our pruning technique is ε4-differentially private. Next, we compute the noisy support of the remaining candidate k-sequences C′k in the original database D. Notice that we do not limit the length of input sequences in the original database. Since adding (removing) one sequence without a length constraint can affect the support of every sequence in C′k by at most 1, the sensitivity of computing the support of C′k in D is |C′k|. In addition, we uniformly divide ε5 into Lf portions and allocate ε′ = ε5/Lf to the support computations of C′k in D. Thus, we add Laplace noise Lap(|C′k|/ε′) to the support of C′k in D, which satisfies ε′-differential privacy.

Since all sample databases are disjoint, due to the parallel composition property [2], the privacy budgets used in the sample databases do not need to be summed. In summary, based on the sequential composition property [2], we can conclude that our PFS2 algorithm satisfies (ε1 + ε2 + ε3 + ε4 + Lf · ε′) = ε-differential privacy. □

7 EXPERIMENTS

7.1 Experiment Setup

As the PFS2 algorithm is the first algorithm that supports the general FSM under differential privacy, we compare the PFS2 algorithm with two differentially private sequence database publishing algorithms. In particular, the first algorithm [14] utilizes variable-length n-grams (referred to as n-gram), and the second algorithm [13] utilizes a prefix tree structure (referred to as Prefix). For algorithms n-gram and Prefix, to privately find frequent sequences, we first run them on the original datasets to generate anonymized datasets, and then run the non-private FSM algorithm GSP [5] over the anonymized datasets.

We implement all algorithms in Java. All the experiments are conducted on a PC with an Intel Core2 Duo E8400 CPU (3.0 GHz) and 4 GB RAM. Because the algorithms involve randomization, we run each algorithm ten times and report its average results. In the experiments, we use the relative threshold. As n-gram and Prefix generate anonymized sequence datasets, the number of sequences in the original dataset might differ from the number of sequences in the anonymized datasets. For n-gram and Prefix, we use the relative threshold with respect to the original dataset.

In PFS2, we allocate the privacy budget ε as follows: ε1 = 0.025ε, ε2 = 0.025ε, ε3 = 0.05ε, ε4 = 0.45ε and ε5 = 0.45ε. We assign more privacy budget to ε4 and ε5 because the sensitivity of the support computations in PFS2 is much larger than that of the other computations. Like [7], [11], the parameter h used in the pre-mining phase of PFS2 is set to 0.85, and the upper bound on the length of input sequences l1 is set to 30. We set the privacy budget ε to 1.0 and the relaxation parameter z to 0.3. We also show the experiment


results under varying ε in Section 7.3 and the experiment results under varying z in Section 7.5. The default value of the maximum gap g in our experiments is ∞. In Section 7.2, we also compare the performance of the algorithms for mining consecutive item sequences (i.e., g = 0) and illustrate the experiment results under varying g.
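Plugging this allocation into the composition bound from Theorem 7 confirms that the portions sum exactly to the total budget:

$$\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4 + L_f \cdot \frac{\epsilon_5}{L_f} = (0.025 + 0.025 + 0.05 + 0.45 + 0.45)\,\epsilon = \epsilon.$$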

Datasets. In the experiments, we use three publicly available real sequence datasets. Since the original data in the House_Power dataset appears as time series, like [7], we discretize these values and successively construct a sequence from every 50 samples. A summary of the characteristics of the three datasets is shown in Table 2.

Utility Measures. To evaluate the performance of the algorithms, we follow two widely used standard metrics: F-score [7], [11] and Relative Error (RE) [16]. In particular, we employ the F-score to measure the utility of generated frequent sequences, and use the RE to measure the error with respect to the actual supports of sequences.

7.2 Frequent Sequence Mining

We first study the performance of PFS2, n-gram and Prefix on three datasets as we vary the relative threshold u and the maximum gap g. We do not show the performance of Prefix on BIBLE, as Prefix does not scale to datasets with a large number of items: BIBLE contains 13,905 items, which makes it hard for Prefix to construct a prefix tree.

7.2.1 Mining Unconstrained Frequent Sequences

We first compare the performance of these algorithms when the gap g = ∞. From Figs. 3a, 3b, 3c, 3d, 3e, and 3f, we can see that PFS2 substantially outperforms n-gram and Prefix. Algorithm Prefix uses the prefixes of input sequences, usually fewer than 20 items, to construct a prefix tree and directly deletes the remaining items. However, if the average length of input sequences is long (e.g., in House_Power), the prefix tree cannot preserve enough frequency information, which inevitably leads to poor performance. Algorithm n-gram utilizes the n-gram model to extract the essential information of a sequence database. However, only consecutive item sequences are considered in the construction of this model, so n-gram does not produce good results when g = ∞. In contrast, PFS2 utilizes sample databases to prune candidate sequences. In the sample databases, we enforce the length constraint by using our gap-aware sequence shrinking method, which can effectively preserve the frequency information for different values of the gap g and significantly improve the utility of the results.

From Fig. 3, we can also see that PFS2 obtains good performance in terms of RE (usually less than 5 percent). The reasons are as follows. The RE metric focuses on the supports of released sequences. In PFS2, thanks to the sampling-based candidate pruning technique, the number of candidate sequences computed in the original database is significantly reduced, which means the amount of noise added to the support of each released frequent sequence is small. Moreover, we do not enforce the length constraint on the original database, so the supports of released frequent sequences are not affected by a transformation of the database. Hence, PFS2 is able to achieve surprisingly good performance in terms of RE.

We also show the precision and recall on House_Power in Figs. 3g and 3h. We observe that the precision of PFS2 is very good but the recall decreases a little. This is mainly because, in the sample databases, although our irrelevant item deletion and consecutive patterns compression schemes can effectively shorten sequences, some long sequences still need to be transformed by the sequence reconstruction scheme, which leads to frequency information loss. As a result, some frequent sequences are not identified from the sample databases and the recall decreases.

7.2.2 Mining Consecutive Item Sequences

Second, we study the performance of the three algorithms for mining consecutive item sequences (i.e., g = 0). From Figs. 3i, 3j, 3k, 3l, 3m, and 3n, we can see that PFS2 achieves better performance. We can also see that algorithm n-gram attains performance comparable to PFS2. This is because n-gram makes use of the n-gram model which, as discussed above, preserves much of the frequency information in consecutive item sequences; thus, n-gram can achieve good performance when g = 0.

7.2.3 Mining Gap-Constrained Frequent Sequences

Finally, we compare the performance of these algorithms under different values of the gap g. From Figs. 3o, 3p, 3q, 3r, 3s, and 3t, we can see that PFS2 achieves better performance. We also observe that the performance of PFS2 decreases a little as g increases. This is because, as g increases, the maximal number of subsequences contained in a length-limited sequence grows, which causes the sensitivity of the local support computations in the transformed sample databases to increase (see Section 5.2). Thus, a larger magnitude of noise is added in the sample databases.

7.3 Effect of Privacy Budget

Fig. 4 shows the performance of PFS2, n-gram and Prefix under varying privacy budget ε on MSNBC (relative threshold u = 0.015) and on House_Power (u = 0.34). PFS2 consistently achieves better performance at the same level of privacy. We observe that all these algorithms behave in a similar way: the quality of the results improves as ε increases. This is because, as ε increases, a lower degree of privacy is guaranteed and a lower magnitude of noise is added. We also observe that the quality of the results is more stable on MSNBC. This can be explained by the high supports of the sequences in MSNBC, which are more resistant to the noise.

TABLE 2
Sequence Dataset Characteristics

Dataset        #Sequences   #Items   Max. length   Avg. length
MSNBC          989,818      17       14,795        4.75
BIBLE          36,369       13,905   100           21.64
House_Power    40,986       21       50            50


7.4 Effect of Key Methods

In Fig. 5, we show how the gap-aware sequence shrinking, gap-aware sensitivity computation and threshold relaxation methods affect the performance of PFS2 on BIBLE and House_Power. In this experiment, we set the maximum gap g to 4. Let Base be the algorithm which randomly deletes items in sample databases, uses the number of all combinations of items as the maximal number of subsequences contained in a sequence, and does not relax the user-specified threshold in sample databases.

Fig. 3. Frequent sequence mining.

Fig. 4. Effect of privacy budget.


Moreover, let SC be the algorithm which uses our gap-aware sensitivity computation method to compute the sensitivity of the local support computations in sample databases, and SC+SS be the algorithm which uses the gap-aware sensitivity computation method and also uses our gap-aware sequence shrinking method to limit the length of sequences in sample databases.

From Fig. 5, we can see that, without these three methods, Base produces poor results. For the F-score metric, the improvement from the gap-aware sequence shrinking and threshold relaxation methods is significant. The improvement from the gap-aware sensitivity computation method is considerable on BIBLE but not obvious on House_Power. This is because, in House_Power, the maximal length constraint in sample databases is 30, and the number of generated candidate sequences is usually smaller than the maximal number of subsequences contained in a sequence of that length. Thus, the gap-aware sensitivity computation method does not significantly reduce the sensitivity of the local support computations on House_Power. For the RE metric, all these algorithms achieve good performance, although it slightly decreases after we apply the gap-aware sequence shrinking and threshold relaxation methods. This is because, with these two methods, we identify more real frequent sequences from the sample databases; the increase in the number of remaining candidate sequences slightly raises the amount of noise added to the supports of the released frequent sequences.

We also examine the effectiveness of the three schemes proposed in our gap-aware sequence shrinking method. In particular, let RD denote the algorithm which enforces the length constraint on sample databases by randomly deleting items. Let IC denote the algorithm which first uses our irrelevant item deletion and consecutive patterns compression schemes to shrink sequences in sample databases; after that, if some sequences still violate the length constraint, IC deletes from them the items which appear in the fewest candidate sequences. From Fig. 6, we can see that, without our gap-aware sequence shrinking method, RD does not produce reasonable results. We can also see that the improvement from all three schemes is significant.

7.5 Effect of Relaxation Parameter

Fig. 7 shows the performance of PFS2 when varying the relaxation parameter z (used in the threshold relaxation method). We set the relative threshold u = 0.01 on MSNBC, u = 0.14 on BIBLE and u = 0.32 on House_Power. We can see that, when z is set to 0.3, PFS2 typically obtains good results.

7.6 Benefit of Sampling-Based Candidate Pruning

We also study the benefit of our sampling-based candidate pruning technique. We compare PFS2 to a differentially private FSM algorithm, denoted PL, which enforces the length constraint on the original database. In particular, given the candidate k-sequences, PL uses our gap-aware sequence shrinking method to transform the original database and determines which candidate sequences are frequent based on their noisy supports on the transformed database.

From Fig. 8, we can see that PFS2 achieves better performance. This is because, compared with PL, PFS2 uses sample databases to prune candidate sequences, which effectively reduces the sensitivity of computing the support of candidate sequences. In contrast, PL limits the length of sequences in the original database to reduce this sensitivity. However, it introduces a new source of error by discarding items from sequences, and due to the privacy requirement it is hard to precisely compensate for this information loss. Thus, PL introduces more errors.

To better understand the benefit of the sampling-based candidate pruning technique, we apply it to frequent itemset mining by extending the algorithm proposed in [11] (referred to as TT). Specifically, for the mining of frequent k-itemsets, given a sample database, we use TT's smart truncating method to transform it and use our threshold relaxation method to relax the user-specified threshold.

Fig. 5. Improvement of key methods.

Fig. 6. Effect of gap-aware sequence shrinking method.

Fig. 7. Effect of relaxation parameter.


Then, we utilize the transformed sample database to prune candidate k-itemsets generated based on the downward closure property. For the remaining candidate k-itemsets, we compute their noisy supports on the original database. The candidate k-itemsets whose noisy supports on the original database exceed the user-specified threshold are output as frequent. We use two transaction datasets in this experiment: POS and Pumsb-star. Their characteristics are shown in Table 3. The experiment results are shown in Fig. 9. It is not surprising that the performance of TT is significantly improved by utilizing our sampling-based candidate pruning technique.

7.7 Running Time

Fig. 10 shows the running times of PFS2 and the non-private FSM algorithm GSP [5] for mining frequent sequences under different values of the threshold. We can see that PFS2 achieves performance comparable to GSP, which indicates that our sampling-based candidate pruning technique does not introduce much overhead during the mining process.

8 CONCLUSION

In this paper, we investigate the problem of designing a differentially private FSM algorithm. We observe that the amount of noise required in differentially private FSM is proportionate to the number of candidate sequences. We introduce a sampling-based candidate pruning technique as an effective means of reducing the number of candidate sequences, which can significantly improve the utility-privacy tradeoff. By leveraging this technique, we design our PFS2 algorithm, which can handle the general gap-constrained FSM in the context of differential privacy. In particular, given the candidate sequences, we use their noisy local supports in a sample database to estimate which candidate sequences are potentially frequent. To improve the accuracy of such private estimations, we propose a gap-aware sequence shrinking method which can enforce the length constraint on the sample database while effectively preserving the frequency information. Moreover, we propose a gap-aware sensitivity computation method, which can efficiently calculate the sensitivity of the local support computations in the sample databases under different gap constraints. Furthermore, we propose a threshold relaxation method, which relaxes the user-specified threshold for the sample database to estimate which candidate sequences are potentially frequent. Formal privacy analysis and the results of extensive experiments on real datasets show that our PFS2 algorithm can achieve both a high degree of privacy and high data utility.

ACKNOWLEDGMENTS

The work was supported by the National Natural Science Foundation of China under grant 61502047, the National Institutes of Health (NIH) under award #R01GM114612, and the Patient-Centered Outcomes Research Institute (PCORI) under award #ME-1310-07058. The preliminary version of this paper was published in the Proceedings of the 2015 IEEE International Conference on Data Engineering (ICDE 2015), Seoul, Korea, April 13-17, 2015. Sen Su is the corresponding author.

Fig. 8. Effect of sampling on frequent sequence mining.

TABLE 3
Transaction Dataset Characteristics

Dataset      #Transactions   #Items   Max. length   Avg. length
POS          515,597         1,657    164           6.5
Pumsb-star   49,046          2,088    63            50.5

Fig. 9. Effect of sampling on frequent itemset mining.

Fig. 10. Running time evaluation.



REFERENCES

[1] C. Dwork, "Differential privacy," in Proc. 33rd Int. Colloquium Automata, Languages and Programming, 2006, pp. 1-12.

[2] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Proc. 3rd Theory Cryptography Conf., 2006, pp. 265-284.

[3] L. Sweeney, "k-anonymity: A model for protecting privacy," Int. J. Uncertainty Fuzziness Knowl.-Based Syst., 2002, pp. 557-570.

[4] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," in Proc. IEEE 22nd Int. Conf. Data Eng., 2006, pp. 24-24.

[5] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," in Proc. 5th Int. Conf. Extending Database Technol.: Advances Database Technol., 1996, pp. 3-17.

[6] S. Xu, S. Su, X. Cheng, Z. Li, and L. Xiong, "Differentially private frequent sequence mining via sampling-based candidate pruning," in Proc. IEEE 31st Int. Conf. Data Eng., 2015, pp. 1035-1046.

[7] L. Bonomi and L. Xiong, "A two-phase algorithm for mining sequential patterns with differential privacy," in Proc. 22nd ACM Int. Conf. Inf. Knowl. Manage., 2013, pp. 269-278.

[8] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487-499.

[9] H. Toivonen, "Sampling large databases for association rules," in Proc. 22nd Int. Conf. Very Large Data Bases, 1996, pp. 134-145.

[10] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara, "Evaluation of sampling for data mining of association rules," in Proc. 7th Int. Workshop Res. Issues Data Eng., 1997, pp. 42-50.

[11] C. Zeng, J. F. Naughton, and J.-Y. Cai, "On differentially private frequent itemset mining," Proc. VLDB Endowment, vol. 6, pp. 25-36, 2012.

[12] X. Cheng, S. Su, S. Xu, P. Tang, and Z. Li, "Differentially private maximal frequent sequence mining," Comput. Secur., vol. 55, pp. 175-192, 2015.

[13] R. Chen, B. C. M. Fung, and B. C. Desai, "Differentially private transit data publication: A case study on the Montreal transportation system," in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 213-221.

[14] R. Chen, G. Acs, and C. Castelluccia, "Differentially private sequential data publication via variable-length n-grams," in Proc. 2012 ACM Conf. Comput. Commun. Secur., 2012, pp. 638-649.

[15] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta, "Discovering frequent patterns in sensitive data," in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 503-512.

[16] N. Li, W. Qardaji, D. Su, and J. Cao, "PrivBasis: Frequent itemset mining with differential privacy," Proc. VLDB Endowment, vol. 5, pp. 1340-1351, 2012.

[17] X. Cheng, S. Su, S. Xu, and Z. Li, "DP-Apriori: A differentially private frequent itemset mining algorithm based on transaction splitting," Comput. Secur., vol. 50, pp. 74-90, 2015.

[18] S. Su, S. Xu, X. Cheng, Z. Li, and F. Yang, "Differentially private frequent itemset mining via transaction splitting," IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1875-1891, Jul. 2015.

[19] F. McSherry and K. Talwar, "Mechanism design via differential privacy," in Proc. 48th Annu. IEEE Symp. Found. Comput. Sci., 2007, pp. 94-103.

[20] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2000, pp. 1-12.

[21] E. Shen and T. Yu, "Mining frequent graph patterns with differential privacy," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 545-553.

[22] S. Xu, S. Su, L. Xiong, X. Cheng, and K. Xiao, "Differentially private frequent subgraph mining," in Proc. IEEE 32nd Int. Conf. Data Eng., 2016, pp. 229-240.

[23] A. Ghosh, T. Roughgarden, and M. Sundararajan, "Universally utility-maximizing privacy mechanisms," SIAM J. Comput., vol. 41, pp. 1673-1693, 2012.

[24] E. Díaz-Francés and J. Montoya, "Correction to 'On the linear combination of normal and Laplace random variables,' by Nadarajah, S., Computational Statistics, 2006, 21, 63-71," Comput. Statistics, vol. 23, pp. 661-666, 2008.

Shengzhi Xu received the PhD degree in computer science from the Beijing University of Posts and Telecommunications, China, in 2016. His research interests include data mining and data privacy.

Xiang Cheng received the PhD degree in computer science from the Beijing University of Posts and Telecommunications, China, in 2013. He is currently an assistant professor with the Beijing University of Posts and Telecommunications. His research interests include information retrieval, data mining, machine learning, and data privacy.

Sen Su received the PhD degree from the University of Electronic Science and Technology, China, in 1998. He is currently a professor with the Beijing University of Posts and Telecommunications. His research interests include data privacy, cloud computing, and internet services.

Ke Xiao is working toward the graduate degree at the Beijing University of Posts and Telecommunications, China. His major is computer science. His research interests include data mining and data privacy.

Li Xiong is a Winship Distinguished Research Professor of Computer Science (and Biomedical Informatics) at Emory University. She holds a PhD from the Georgia Institute of Technology, an MS from Johns Hopkins University, and a BS from the University of Science and Technology of China, all in computer science. She and her research group, Assured Information Management and Sharing (AIMS), conduct research that addresses both fundamental and applied questions at the interface of data privacy and security, spatiotemporal data management, and health informatics.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

2926 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 11, NOVEMBER 2016