Outline
• Introduction
• Problem Definition
• Algorithms
• Evaluation
• Conclusion
Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases
• Authors: Zhou Zhao, Da Yan and Wilfred Ng
• Source: IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 5, May 2014, pp. 1171-1184
• Reporter: Pei Wun Chen
• Keywords: Frequent patterns, Uncertain databases, Approximate algorithm, Possible world semantics
Data Mining – Frequent Pattern Mining
• Helps reveal collections of popular merchandise items
  - Co-occurring objects
  - Co-located events
• Sequence data: market transaction records ordered by time
  - Discover that customers who frequently buy item A will later buy item B (A -> B)
• Sequential pattern mining
Sequential Pattern Mining
What is sequential pattern mining? Given a set of sequences, find the complete set of frequent sequential patterns.
A market-basket sequence database:

SID  sequence
1    <a(abc)(ac)d(cf)>
2    <(ad)c(bc)(ae)>
3    <(ef)(ab)(df)cb>
4    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given a support threshold min_sup = 2, <(ab)c> is a frequent sequential pattern.
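As a minimal sketch (not the paper's implementation), the containment test and support count can be written in Python; `db` encodes the four example sequences above, with each itemset as a set:

```python
def is_subsequence(pattern, sequence):
    """True if every itemset of `pattern` is contained, in order,
    in some itemset of `sequence` (standard SPM containment)."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:  # subset test
            i += 1
    return i == len(pattern)

def support(pattern, db):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db)

# The four example sequences from the table above
db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],  # <a(abc)(ac)d(cf)>
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],              # <(ad)c(bc)(ae)>
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],       # <(ef)(ab)(df)cb>
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],          # <eg(af)cbc>
]
```

With min_sup = 2, `support([{'a', 'b'}, {'c'}], db)` confirms that <(ab)c> is frequent: sequences 1 and 3 contain it.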
Uncertain Data
• Many real-life applications are riddled with uncertainty
  - RFID sensor networks
  - GPS trajectory databases
• Reasons for uncertainty
  - Sampling and duration errors
  - Privacy preserving
• Uncertain data are represented by probabilities (uncertain data modeling)
Uncertain Data Models - Sequence-Level Model (sensor network dataset)
• Pr{sup(AB) = 2} = Pr(pw1) = 0.9
• Pr{sup(AB) = 1} = Pr(pw2) = 0.1
• Pr{sup(AB) = 0} = 0
Uncertain Data Models - Element-Level Model (GPS trajectory dataset)
• Pr{sup(AB) = 2} = Pr(pw3) = 0.9025
• Pr{sup(AB) = 1} = Pr(pw1) + Pr(pw2) + Pr(pw4) = 0.0975
• Pr{sup(AB) = 0} = 0
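Both models follow possible-world semantics: the support distribution sums the probabilities of the worlds in which the pattern reaches each support count. A minimal sketch of this enumeration for the sequence-level model, using a hypothetical two-sequence database chosen so its numbers reproduce the sequence-level slide's Pr{sup(AB) = 2} = 0.9 and Pr{sup(AB) = 1} = 0.1 (the original dataset figures are not reproduced here):

```python
from itertools import product
from collections import Counter

def support_distribution(prob_db, pattern_holds):
    """Distribution of sup(pattern) over the possible worlds of a
    sequence-level probabilistic database.

    prob_db: list of probabilistic sequences; each is a list of
    (instance, probability) alternatives whose probabilities sum to 1.
    pattern_holds: predicate telling whether an instance supports the pattern.
    """
    dist = Counter()
    for world in product(*prob_db):      # one instance chosen per sequence
        p = 1.0
        sup = 0
        for instance, pr in world:
            p *= pr
            sup += pattern_holds(instance)
        dist[sup] += p                   # accumulate this world's probability
    return dist

# Hypothetical toy database: sequence 1 surely contains AB; sequence 2
# contains AB with probability 0.9.
toy_db = [
    [("AB", 1.0)],
    [("AB", 0.9), ("A", 0.1)],
]
dist = support_distribution(toy_db, lambda s: "AB" in s)
```

Enumerating worlds this way is exponential in the number of sequences, which is exactly the "possible world explosion" the U-PrefixSpan algorithms avoid.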
Probabilistic Frequent Patterns
• Probabilistic frequentness: a pattern is (τsup, τprob)-frequent if Pr{sup(pattern) ≥ τsup} ≥ τprob
• Goal: find all probabilistically frequent patterns that satisfy this condition
• Example (sequence-level model above): Pr{sup(AB) ≥ 1} = 1 and Pr{sup(AB) ≥ 2} = 0.9
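When the sequences support a pattern independently, Pr{sup ≥ τsup} can be evaluated with a simple dynamic program over the per-sequence support probabilities instead of enumerating possible worlds. A sketch (the paper's algorithms use more refined machinery, e.g. pruning rules and approximation):

```python
def prob_freq(probs, tau_sup):
    """Pr{sup >= tau_sup}, where probs[i] = Pr(sequence i supports the
    pattern) and sequences are independent (a Poisson-binomial tail)."""
    dist = [1.0]  # dist[k] = Pr(exactly k processed sequences support it)
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this sequence does not support it
            nxt[k + 1] += q * p       # this sequence supports it
        dist = nxt
    return sum(dist[tau_sup:])
```

For the running example, probs = [1.0, 0.9] gives prob_freq(probs, 1) = 1.0 and prob_freq(probs, 2) = 0.9, matching the sequence-level model above.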
U-PrefixSpan - Sequence-Level Model
Sequence projection in the sequence-level model
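The projection step extends the deterministic PrefixSpan operation, sketched below: each sequence containing the prefix item is reduced to the suffix after that item's first occurrence. This simplified sketch ignores partially matched itemsets; SeqU-PrefixSpan applies the same idea per sequence instance and additionally carries instance probabilities into the projected database:

```python
def project(db, item):
    """Deterministic PrefixSpan-style projection: for each sequence that
    contains `item`, keep only the suffix after its first occurrence."""
    projected = []
    for seq in db:
        for i, itemset in enumerate(seq):
            if item in itemset:
                projected.append(seq[i + 1:])  # suffix after the match
                break                          # sequences without `item` drop out
    return projected

# Example: projecting by prefix 'a'
demo_db = [[{'a'}, {'b'}, {'c'}], [{'b'}, {'a'}, {'c'}], [{'b'}, {'c'}]]
proj = project(demo_db, 'a')
```

Patterns are then grown recursively inside each projected database, which is what keeps PrefixSpan-style algorithms from generating candidates blindly.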
Element-Level U-PrefixSpan
Experiments-Environment
• Windows 7 PC, Intel® Core(TM) i5 CPU, 4GB memory
• Coded in C++ and run in Eclipse
Experiments-SeqU-PrefixSpan (n/m)
Experiments-SeqU-PrefixSpan (l/d)
(The length of a sequence instance is randomly chosen from the range [1, ℓ].)
(Each element in the sequence instance is randomly picked from an element table with d elements.)
Threshold efficiency
Real dataset - RFID datasets
Precision of approximation - ElemU
Towards Efficient Sequential Pattern Mining in Temporal Uncertain Databases
• Authors: Jiaqi Ge, Yuni Xia and Jian Wang
• Source: Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 268-279, 2015
• Keywords: Temporal uncertainty, Sequential pattern mining
Temporal Uncertainty
• Timestamps of events in real applications are inaccurate or imprecise
• Reasons for temporal uncertainty
  - The exact time of an event is unavailable
  - Aggregation operations on temporal scales
  - Protecting privacy and confidentiality
Motivation
• A time series model T = {t, (t + 1), . . . , (t + n)} in probabilistic temporal databases
• Possible world semantics for probabilistic databases
• Efficiency and scalability challenges in uncertain SPM
• Goal: propose an efficient SPM algorithm for temporal uncertain sequence databases
Problem Definition
An uncertain event e = <sid, eid, T, I>:
• the sequence-id and event-id are denoted <sid, eid>; the event with sid = i and eid = j is written eij
• T is an uncertain timestamp modeled by a uniform distribution, T ~ U(t−, t+)
• I is an itemset
An example of an uncertain database:
e11 = <1, 1, [100, 103], {A, C}>
e12 = <1, 2, [102, 105], {B}>
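A minimal Python encoding of this event model (field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UncertainEvent:
    sid: int          # sequence-id
    eid: int          # event-id within the sequence
    t_lo: float       # uncertain timestamp T ~ U(t_lo, t_hi)
    t_hi: float
    items: frozenset  # itemset I

# The two example events from the uncertain database above
e11 = UncertainEvent(1, 1, 100, 103, frozenset({'A', 'C'}))
e12 = UncertainEvent(1, 2, 102, 105, frozenset({'B'}))
```

Note that the two timestamp intervals overlap ([100, 103] and [102, 105]), which is exactly why the order of e11 and e12 is uncertain and possible-world reasoning is needed.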
Temporal Possible Worlds
The time point of an event is randomly drawn from the corresponding uncertain timestamp.
Uncertain SPM Problem
The SPM problem in temporal uncertain databases is defined as follows. Given a minimal support ts, a minimal frequentness probability threshold tp, a minimal time gap l and a maximal time gap h, find every probabilistic frequent sequential pattern s in a temporal uncertain database such that P(sup(s) ≥ ts) ≥ tp, where:
• sup(s) is the total number of sequences that support s
• ts is the user-defined minimal support threshold
• P(sup(s) ≥ ts) = Σw f(w), summing over every possible world w in which s is frequent, with f(w) the pdf of w
Probability of Satisfying Time Constraints
• os = {ek1, . . . , ekn} is a minimal possible occurrence of the sequential pattern s = <s1, . . . , sn>
• Ti is the uncertain time of the event eki
• The probability that os matches s, denoted P(<T1 · · · Tn>), is the probability that l ≤ Ti+1 − Ti ≤ h for all i ∈ [1, n)
• Example: os = {e21, e22}, s = {A, B}; then P({A, B} matches {e21, e22}) = P(<T1, T2>) = P(l ≤ T2 − T1 ≤ h), but T1 and T2 are uncertain
Consider two uncertain timestamps X ~ U(x−, x+) and Y ~ U(y−, y+) with time constraints mingap = l and maxgap = h. The computation of P(<XY>) is decomposed into p cases by the endpoints.

An example of computing P(<XY>):
• l = 0 and h = 5, X ~ U(60, 63), Y ~ U(62, 68)
• Divide [62, 68] by the four endpoints {63 + 0, 60 + 0, 63 + 5, 60 + 5} = {63, 60, 68, 65}
• Inserting the points 63 and 65 into [62, 68] gives [62, 63, 65, 68]
• This yields 3 subintervals: [62, 68] = [62, 63] ∪ [63, 65] ∪ [65, 68]
Summing the contributions of the three subintervals gives P(<XY>) = 13/18 ≈ 0.72. Geometrically, P(<XY>) is the probability mass lying between the two lines y = x + l and y = x + h.
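Since, for a fixed x, the admissible y-interval length is piecewise linear in x, the endpoint decomposition can be evaluated exactly with trapezoids. A sketch for two independent uniform timestamps (function name and interface are illustrative):

```python
def prob_gap(x_lo, x_hi, y_lo, y_hi, gap_l, gap_h):
    """P(gap_l <= Y - X <= gap_h) for independent X ~ U(x_lo, x_hi)
    and Y ~ U(y_lo, y_hi).

    For a fixed x the admissible y-interval has length
    g(x) = max(0, min(y_hi, x + gap_h) - max(y_lo, x + gap_l)),
    which is piecewise linear in x, so trapezoidal integration over
    the breakpoints is exact."""
    def g(x):
        return max(0.0, min(y_hi, x + gap_h) - max(y_lo, x + gap_l))

    # breakpoints where a min/max argument switches, clipped to [x_lo, x_hi]
    pts = {x_lo, x_hi}
    for b in (y_lo - gap_l, y_hi - gap_l, y_lo - gap_h, y_hi - gap_h):
        if x_lo < b < x_hi:
            pts.add(b)
    pts = sorted(pts)
    area = sum((b - a) * (g(a) + g(b)) / 2.0 for a, b in zip(pts, pts[1:]))
    return area / ((x_hi - x_lo) * (y_hi - y_lo))
```

For the example above, prob_gap(60, 63, 62, 68, 0, 5) returns 13/18 ≈ 0.722, consistent with the 0.72 in the slides.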
Synthetic Data Generation
• IBM market-basket data generator
• Intel(R) Core(TM) Duo CPU @ 2.33GHz and 4GB memory
• Parameters:
  - C: number of sequences
  - T: average number of transactions/itemsets per data-sequence
  - L: average number of items per transaction/itemset per data-sequence
  - I: number of different items
Add temporal uncertainty
• Replace a timestamp t by a uniform distribution [(1 − r) ∗ t, (1 + r) ∗ t], where r is randomly drawn from the uniform distribution U(0, 1)
• Datasets are named by their parameters: T4L10I1C10 indicates T = 4, L = 10, I = 1 ∗ 1000 and C = 10 ∗ 1000
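A sketch of this perturbation step (the helper name is illustrative; a seeded random.Random keeps the example reproducible):

```python
import random

def make_uncertain(timestamps, rng=None):
    """Replace each exact timestamp t by a uniform interval
    [(1 - r) * t, (1 + r) * t], with r drawn from U(0, 1) per timestamp,
    mirroring the synthetic-data setup described above."""
    rng = rng or random.Random(0)
    out = []
    for t in timestamps:
        r = rng.random()
        out.append(((1 - r) * t, (1 + r) * t))
    return out
```

Each resulting interval is centered on the original timestamp t, so the exact time always lies inside its uncertain interval.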
Scalability (1/2)
minsup = 0.5%, minprob = 0.7, mingap = 1, and maxgap = 10.
C = 10 000, T = 4, I = 10 000 and L = 2.
Scalability (2/2)
minsup = 0.5%, minprob = 0.7, mingap = 1, and maxgap = 10.
C = 10 000, T = 4, I = 10 000 and L = 2.
Efficiency
minsup = 0.2%, minprob = 0.7, mingap = 1, and maxgap = 10.
C = 10 000, T = 4, I = 10 000 and L = 2.
Real world stock market dataset
Extract the prices of 882 stocks over 16 weeks; each stock corresponds to a sequence. There are three events: price going up (+), going down (−) and no change (0). For example, if the price goes up at times 1, 2 and 3, these points are aggregated into one uncertain event ([1, 3], +).
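This aggregation of consecutive identical moves can be sketched as:

```python
def aggregate(moves):
    """Collapse runs of identical price moves ('+', '-', '0') at
    consecutive time points 1, 2, ... into uncertain events
    ((t_start, t_end), move), mirroring the stock preprocessing above."""
    events = []
    for t, m in enumerate(moves, start=1):
        if events and events[-1][1] == m:
            (lo, _), _ = events[-1]
            events[-1] = ((lo, t), m)   # extend the current run to time t
        else:
            events.append(((t, t), m))  # start a new run at time t
    return events
```

For moves ['+', '+', '+', '-'] this yields [((1, 3), '+'), ((4, 4), '-')], i.e. the uncertain event ([1, 3], +) from the example.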
Performance of uSPM in real stock dataset
minprob = 0.7, mingap = 1 and maxgap = 5
Conclusion – U-PrefixSpan
• Studies the problem of mining p-FSPs in uncertain databases
• Two new U-PrefixSpan algorithms avoid the problem of "possible world explosion"
• Three pruning rules and one early validating method decrease the execution time
Conclusion – uSPM
• Studies the problem of mining probabilistic frequent sequential patterns in databases with temporal uncertainty
• Designs an incremental approach to manage temporal uncertainty efficiently and integrates it into a classic pattern-growth SPM algorithm
• The experimental results show that the algorithm is efficient and scalable
U-PrefixSpan vs. uSPM
• Uncertainty: U-PrefixSpan handles uncertainty in events; uSPM handles uncertainty in time
• U-PrefixSpan is more practical; uSPM is the first work that studies the problem of temporal uncertainty
Thanks for listening
Q&A