Outline
• Introduction
• Problem Definition
• Algorithms
• Evaluation
• Conclusion
Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases
• Authors: Zhou Zhao, Da Yan and Wilfred Ng
• Source: IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 5, May 2014, pp. 1171-1184
• Reporter: Pei Wun Chen
• Keywords: Frequent patterns, Uncertain databases, Approximate algorithm, Possible world semantics
Data Mining – Frequent Pattern Mining
• Helps reveal collections of popular merchandise items
  - Co-occurring objects
  - Co-located events
• Sequence data: market transaction records ordered by time
  - Discover that customers who frequently buy item A will later buy item B (A -> B)
• Sequential pattern mining
Sequential Pattern Mining
What is sequential pattern mining? Given a set of sequences, find the complete set of frequent sequential patterns.
A market-basket sequence database:

SID  sequence
1    <a(abc)(ac)d(cf)>
2    <(ad)c(bc)(ae)>
3    <(ef)(ab)(df)cb>
4    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given a support threshold min_sup = 2, <(ab)c> is a frequent sequential pattern.
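As a minimal sketch (not the paper's implementation), the containment test and support count can be written in Python; `db` encodes the four example sequences above, with each itemset as a set:

```python
def is_subsequence(pattern, sequence):
    """True if every itemset of `pattern` is contained, in order,
    in some itemset of `sequence` (standard SPM containment)."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:  # subset test
            i += 1
    return i == len(pattern)

def support(pattern, db):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db)

# The four example sequences from the table above
db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],  # <a(abc)(ac)d(cf)>
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],              # <(ad)c(bc)(ae)>
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],       # <(ef)(ab)(df)cb>
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],          # <eg(af)cbc>
]
```

With min_sup = 2, `support([{'a', 'b'}, {'c'}], db)` confirms that <(ab)c> is frequent: sequences 1 and 3 contain it.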
Uncertain Data
• Many real-life applications are riddled with uncertainty
  - RFID sensor networks
  - GPS trajectory databases
• Reasons for uncertainty
  - Sampling and duration errors
  - Privacy preserving
• Uncertain data are represented by probabilities (uncertain data modeling)
Uncertain Data Models - Sequence-Level Model (sensor network dataset)
• Pr{sup(AB) = 2} = Pr(pw1) = 0.9
• Pr{sup(AB) = 1} = Pr(pw2) = 0.1
• Pr{sup(AB) = 0} = 0
Uncertain Data Models - Element-Level Model (GPS trajectory dataset)
• Pr{sup(AB) = 2} = Pr(pw3) = 0.9025
• Pr{sup(AB) = 1} = Pr(pw1) + Pr(pw2) + Pr(pw4) = 0.0975
• Pr{sup(AB) = 0} = 0
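Both models follow possible-world semantics: the support distribution sums the probabilities of the worlds in which the pattern reaches each support count. A minimal sketch of this enumeration for the sequence-level model, using a hypothetical two-sequence database chosen so its numbers reproduce the sequence-level slide's Pr{sup(AB) = 2} = 0.9 and Pr{sup(AB) = 1} = 0.1 (the original dataset figures are not reproduced here):

```python
from itertools import product
from collections import Counter

def support_distribution(prob_db, pattern_holds):
    """Distribution of sup(pattern) over the possible worlds of a
    sequence-level probabilistic database.

    prob_db: list of probabilistic sequences; each is a list of
    (instance, probability) alternatives whose probabilities sum to 1.
    pattern_holds: predicate telling whether an instance supports the pattern.
    """
    dist = Counter()
    for world in product(*prob_db):      # one instance chosen per sequence
        p = 1.0
        sup = 0
        for instance, pr in world:
            p *= pr
            sup += pattern_holds(instance)
        dist[sup] += p                   # accumulate this world's probability
    return dist

# Hypothetical toy database: sequence 1 surely contains AB; sequence 2
# contains AB with probability 0.9.
toy_db = [
    [("AB", 1.0)],
    [("AB", 0.9), ("A", 0.1)],
]
dist = support_distribution(toy_db, lambda s: "AB" in s)
```

Enumerating worlds this way is exponential in the number of sequences, which is exactly the "possible world explosion" the U-PrefixSpan algorithms avoid.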
Probabilistic Frequent Patterns
• Probabilistic frequentness: a pattern is (τsup, τprob)-frequent if Pr{sup(pattern) ≥ τsup} ≥ τprob
• Goal: find all probabilistically frequent patterns that satisfy this condition
• Example (sequence-level model above): Pr{sup(AB) ≥ 1} = 1 and Pr{sup(AB) ≥ 2} = 0.9
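When the sequences support a pattern independently, Pr{sup ≥ τsup} can be evaluated with a simple dynamic program over the per-sequence support probabilities instead of enumerating possible worlds. A sketch (the paper's algorithms use more refined machinery, e.g. pruning rules and approximation):

```python
def prob_freq(probs, tau_sup):
    """Pr{sup >= tau_sup}, where probs[i] = Pr(sequence i supports the
    pattern) and sequences are independent (a Poisson-binomial tail)."""
    dist = [1.0]  # dist[k] = Pr(exactly k processed sequences support it)
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this sequence does not support it
            nxt[k + 1] += q * p       # this sequence supports it
        dist = nxt
    return sum(dist[tau_sup:])
```

For the running example, probs = [1.0, 0.9] gives prob_freq(probs, 1) = 1.0 and prob_freq(probs, 2) = 0.9, matching the sequence-level model above.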
U-PrefixSpan - Sequence-Level Model
Sequence projection in the sequence-level model
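The projection step extends the deterministic PrefixSpan operation, sketched below: each sequence containing the prefix item is reduced to the suffix after that item's first occurrence. This simplified sketch ignores partially matched itemsets; SeqU-PrefixSpan applies the same idea per sequence instance and additionally carries instance probabilities into the projected database:

```python
def project(db, item):
    """Deterministic PrefixSpan-style projection: for each sequence that
    contains `item`, keep only the suffix after its first occurrence."""
    projected = []
    for seq in db:
        for i, itemset in enumerate(seq):
            if item in itemset:
                projected.append(seq[i + 1:])  # suffix after the match
                break                          # sequences without `item` drop out
    return projected

# Example: projecting by prefix 'a'
demo_db = [[{'a'}, {'b'}, {'c'}], [{'b'}, {'a'}, {'c'}], [{'b'}, {'c'}]]
proj = project(demo_db, 'a')
```

Patterns are then grown recursively inside each projected database, which is what keeps PrefixSpan-style algorithms from generating candidates blindly.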
Element-Level U-PrefixSpan
Experiments-Environment
• Windows 7 PC, Intel® Core(TM) i5 CPU, 4GB memory
• Coded in C++ and run in Eclipse
Experiments-SeqU-PrefixSpan (n/m)
Experiments-SeqU-PrefixSpan (l/d)
(The length of a sequence instance is randomly chosen from the range [1, ℓ].)
(Each element in the sequence instance is randomly picked from an element table with d elements.)
Threshold efficiency
Real dataset - RFID datasets
Precision of approximation - ElemU
Towards Efficient Sequential Pattern Mining in Temporal Uncertain Databases
• Authors: Jiaqi Ge, Yuni Xia and Jian Wang
• Source: Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 268-279, 2015
• Keywords: Temporal uncertainty, Sequential pattern mining
Temporal Uncertainty
• Timestamps of events in real applications are inaccurate or imprecise
• Reasons for temporal uncertainty
  - The exact time of an event is unavailable
  - Aggregation operations on temporal scales
  - Protecting privacy and confidentiality
Motivation
• A time series model T = {t, (t + 1), . . . , (t + n)} in probabilistic temporal databases
• Possible world semantics for probabilistic databases
• Efficiency and scalability challenges in uncertain SPM
• Goal: propose an efficient SPM algorithm for temporal uncertain sequence databases
Problem Definition
An uncertain event e = <sid, eid, T, I>:
• the sequence-id and event-id are denoted <sid, eid>; the event with sid = i and eid = j is written eij
• T is an uncertain timestamp modeled by a uniform distribution, T ~ U(t−, t+)
• I is an itemset
An example of an uncertain database:
e11 = <1, 1, [100, 103], {A, C}>
e12 = <1, 2, [102, 105], {B}>
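A minimal Python encoding of this event model (field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UncertainEvent:
    sid: int          # sequence-id
    eid: int          # event-id within the sequence
    t_lo: float       # uncertain timestamp T ~ U(t_lo, t_hi)
    t_hi: float
    items: frozenset  # itemset I

# The two example events from the uncertain database above
e11 = UncertainEvent(1, 1, 100, 103, frozenset({'A', 'C'}))
e12 = UncertainEvent(1, 2, 102, 105, frozenset({'B'}))
```

Note that the two timestamp intervals overlap ([100, 103] and [102, 105]), which is exactly why the order of e11 and e12 is uncertain and possible-world reasoning is needed.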
Temporal Possible Worlds
The time point of an event is randomly drawn from the corresponding uncertain timestamp.
Uncertain SPM Problem
The SPM problem in temporal uncertain databases is defined as follows. Given a minimal support ts, a minimal frequentness probability threshold tp, a minimal time gap l and a maximal time gap h, find every probabilistic frequent sequential pattern s in a temporal uncertain database such that P(sup(s) ≥ ts) ≥ tp, where:
• sup(s) is the total number of sequences that support s
• ts is the user-defined minimal support threshold
• P(sup(s) ≥ ts) = Σw f(w), summing over every possible world w in which s is frequent, with f(w) the pdf of w
Probability of Satisfying Time Constraints
• os = {ek1, . . . , ekn} is a minimal possible occurrence of the sequential pattern s = <s1, . . . , sn>
• Ti is the uncertain time of the event eki
• The probability that os matches s, denoted P(<T1 · · · Tn>), is the probability that l ≤ Ti+1 − Ti ≤ h for all i ∈ [1, n)
• Example: os = {e21, e22}, s = {A, B}; then P({A, B} matches {e21, e22}) = P(<T1, T2>) = P(l ≤ T2 − T1 ≤ h), but T1 and T2 are uncertain
Consider two uncertain timestamps X ~ U(x−, x+) and Y ~ U(y−, y+) with time constraints mingap = l and maxgap = h. The computation of P(<XY>) is decomposed into p cases by the endpoints.

An example of computing P(<XY>):
• l = 0 and h = 5, X ~ U(60, 63), Y ~ U(62, 68)
• Divide [62, 68] by the four endpoints {63 + 0, 60 + 0, 63 + 5, 60 + 5} = {63, 60, 68, 65}
• Inserting the points 63 and 65 into [62, 68] gives [62, 63, 65, 68]
• This yields 3 subintervals: [62, 68] = [62, 63] ∪ [63, 65] ∪ [65, 68]
Summing the contributions of the three subintervals gives P(<XY>) = 13/18 ≈ 0.72. Geometrically, P(<XY>) is the probability mass lying between the two lines y = x + l and y = x + h.
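Since, for a fixed x, the admissible y-interval length is piecewise linear in x, the endpoint decomposition can be evaluated exactly with trapezoids. A sketch for two independent uniform timestamps (function name and interface are illustrative):

```python
def prob_gap(x_lo, x_hi, y_lo, y_hi, gap_l, gap_h):
    """P(gap_l <= Y - X <= gap_h) for independent X ~ U(x_lo, x_hi)
    and Y ~ U(y_lo, y_hi).

    For a fixed x the admissible y-interval has length
    g(x) = max(0, min(y_hi, x + gap_h) - max(y_lo, x + gap_l)),
    which is piecewise linear in x, so trapezoidal integration over
    the breakpoints is exact."""
    def g(x):
        return max(0.0, min(y_hi, x + gap_h) - max(y_lo, x + gap_l))

    # breakpoints where a min/max argument switches, clipped to [x_lo, x_hi]
    pts = {x_lo, x_hi}
    for b in (y_lo - gap_l, y_hi - gap_l, y_lo - gap_h, y_hi - gap_h):
        if x_lo < b < x_hi:
            pts.add(b)
    pts = sorted(pts)
    area = sum((b - a) * (g(a) + g(b)) / 2.0 for a, b in zip(pts, pts[1:]))
    return area / ((x_hi - x_lo) * (y_hi - y_lo))
```

For the example above, prob_gap(60, 63, 62, 68, 0, 5) returns 13/18 ≈ 0.722, consistent with the 0.72 in the slides.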
Synthetic Data Generation
• IBM market-basket data generator
• Intel(R) Core(TM) Duo CPU @ 2.33GHz and 4GB memory
• Parameters:
  - C: number of sequences
  - T: average number of transactions/itemsets per data-sequence
  - L: average number of items per transaction/itemset per data-sequence
  - I: number of different items
Add temporal uncertainty
• Replace a timestamp t by a uniform distribution [(1 − r) ∗ t, (1 + r) ∗ t], where r is randomly drawn from the uniform distribution U(0, 1)
• Datasets are named by their parameters: T4L10I1C10 indicates T = 4, L = 10, I = 1 ∗ 1000 and C = 10 ∗ 1000
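A sketch of this perturbation step (the helper name is illustrative; a seeded random.Random keeps the example reproducible):

```python
import random

def make_uncertain(timestamps, rng=None):
    """Replace each exact timestamp t by a uniform interval
    [(1 - r) * t, (1 + r) * t], with r drawn from U(0, 1) per timestamp,
    mirroring the synthetic-data setup described above."""
    rng = rng or random.Random(0)
    out = []
    for t in timestamps:
        r = rng.random()
        out.append(((1 - r) * t, (1 + r) * t))
    return out
```

Each resulting interval is centered on the original timestamp t, so the exact time always lies inside its uncertain interval.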
Scalability (1/2)
minsup = 0.5%, minprob = 0.7, mingap = 1, and maxgap = 10.
C = 10 000, T = 4, I = 10 000 and L = 2.
Scalability (2/2)
minsup = 0.5%, minprob = 0.7, mingap = 1, and maxgap = 10.
C = 10 000, T = 4, I = 10 000 and L = 2.
Efficiency
minsup = 0.2%, minprob = 0.7, mingap = 1, and maxgap = 10.
C = 10 000, T = 4, I = 10 000 and L = 2.
Real world stock market dataset
Extract the prices of 882 stocks over 16 weeks; each stock corresponds to a sequence. There are three events: price going up (+), going down (−) and no change (0). For example, if the price goes up at times 1, 2 and 3, these points are aggregated into one uncertain event ([1, 3], +).
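This aggregation of consecutive identical moves can be sketched as:

```python
def aggregate(moves):
    """Collapse runs of identical price moves ('+', '-', '0') at
    consecutive time points 1, 2, ... into uncertain events
    ((t_start, t_end), move), mirroring the stock preprocessing above."""
    events = []
    for t, m in enumerate(moves, start=1):
        if events and events[-1][1] == m:
            (lo, _), _ = events[-1]
            events[-1] = ((lo, t), m)   # extend the current run to time t
        else:
            events.append(((t, t), m))  # start a new run at time t
    return events
```

For moves ['+', '+', '+', '-'] this yields [((1, 3), '+'), ((4, 4), '-')], i.e. the uncertain event ([1, 3], +) from the example.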
Performance of uSPM in real stock dataset
minprob = 0.7, mingap = 1 and maxgap = 5
Conclusion – U-PrefixSpan
• Studies the problem of mining p-FSPs in uncertain databases
• Two new U-PrefixSpan algorithms avoid the problem of "possible world explosion"
• Three pruning rules and one early validating method decrease the execution time
Conclusion – uSPM
• Studies the problem of mining probabilistic frequent sequential patterns in databases with temporal uncertainty
• Designs an incremental approach to manage temporal uncertainty efficiently and integrates it into a classic pattern-growth SPM algorithm
• The experimental results show that the algorithm is efficient and scalable
U-PrefixSpan vs. uSPM
• Uncertainty: U-PrefixSpan handles uncertainty in events; uSPM handles uncertainty in time
• U-PrefixSpan is more practical; uSPM is the first work that studies the problem of temporal uncertainty
Thanks for listening
Q&A