IRDM '15/16
Jilles Vreeken
Chapter 7-2: Discrete Sequential Data
26 Nov 2015
IRDM Chapter 7, overview
Time Series: 1. Basic Ideas, 2. Prediction, 3. Motif Discovery
Discrete Sequences: 4. Basic Ideas, 5. Pattern Discovery, 6. Hidden Markov Models

You'll find this covered in Aggarwal Ch. 3.4, 14, 15
IRDM Chapter 7, today

Motif Discovery (ctd.), then Discrete Sequences: Basic Ideas, Pattern Discovery, and Hidden Markov Models (Aggarwal Ch. 3.4, 14, 15)
Chapter 7.3, ctd: Motif Discovery
Aggarwal Ch. 14.4, 3.4
Dynamic Time Warping

DTW stretches the time axis of one series to enable better matches.
(Aggarwal Ch. 3.4)
DTW, formally

Let $DTW(i, j)$ be the optimal distance between the first $i$ and first $j$ elements of time series $X$ of length $m$ and $Y$ of length $n$:

$$DTW(i, j) = dist(x_i, y_j) + \min \begin{cases} DTW(i, j-1) & \text{repeat } x_i \\ DTW(i-1, j) & \text{repeat } y_j \\ DTW(i-1, j-1) & \text{repeat neither} \end{cases}$$

We initialise as follows: $DTW(0, 0) = 0$, $DTW(0, j) = \infty$ for all $j \in \{1, \dots, n\}$, and $DTW(i, 0) = \infty$ for all $i \in \{1, \dots, m\}$.

We can then simply iterate by increasing $i$ and $j$.
(Aggarwal Ch. 3.4)
Computing DTW (1)

Let $DTW(i, j)$ be the optimal distance between the first $i$ and first $j$ elements of time series $X$ of length $m$ and $Y$ of length $n$:

$$DTW(i, j) = dist(x_i, y_j) + \min \begin{cases} DTW(i, j-1) & \text{repeat } x_i \\ DTW(i-1, j) & \text{repeat } y_j \\ DTW(i-1, j-1) & \text{repeat neither} \end{cases}$$

From the initialised values, we can simply iterate by increasing $i$ and $j$:

for $i = 1$ to $m$
  for $j = 1$ to $n$
    compute $DTW(i, j)$

We can also compute it recursively, by dynamic programming. Both naïve strategies cost $O(mn)$, however.
(Aggarwal Ch. 3.4)
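As an illustration, here is a minimal Python sketch of the tabular computation, under the assumption of real-valued series and squared point-wise distance (any point distance works; the names are ours, not from the slides):

```python
import numpy as np

def dtw(x, y):
    """Tabular DTW, directly following the recurrence above."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0                          # DTW(0,0) = 0; the borders stay infinite
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i, j - 1],      # repeat x_i
                                 D[i - 1, j],      # repeat y_j
                                 D[i - 1, j - 1])  # repeat neither
    return D[m, n]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: warping absorbs the repeated 1
```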
Computing DTW (2)

Let $DTW(i, j)$ be the optimal distance between the first $i$ elements of time series $X$ of length $m$ and the first $j$ elements of time series $Y$ of length $n$:

$$DTW(i, j) = dist(x_i, y_j) + \min \begin{cases} DTW(i, j-1) & \text{repeat } x_i \\ DTW(i-1, j) & \text{repeat } y_j \\ DTW(i-1, j-1) & \text{repeat neither} \end{cases}$$

We can speed up the computation by imposing constraints, e.g. a window constraint that computes $DTW(i, j)$ only when $|i - j| \le w$; we then only need the inner-loop range $\max(0, i - w)$ to $\min(n, i + w)$.
(Aggarwal Ch. 3.4)
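A sketch of the windowed variant, assuming $|m - n| \le w$ so that a warping path exists (we widen $w$ accordingly):

```python
import numpy as np

def dtw_window(x, y, w):
    """DTW restricted to |i - j| <= w; O(mw) instead of O(mn)."""
    m, n = len(x), len(y)
    w = max(w, abs(m - n))                 # a path to (m, n) must exist
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i, j - 1], D[i - 1, j], D[i - 1, j - 1])
    return D[m, n]
```

The constraint trades exactness for speed: paths that warp further than $w$ are simply never considered.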
Lower bounds on DTW

Even smarter is to speed up DTW using a lower bound:

$$LB\_Keogh(X, Y) = \sum_{i=1}^{n} \begin{cases} (x_i - U_i)^2 & \text{if } x_i > U_i \\ (x_i - L_i)^2 & \text{if } x_i < L_i \\ 0 & \text{otherwise} \end{cases}$$

where $U_i = \max\{y_{i-r}, \dots, y_{i+r}\}$ and $L_i = \min\{y_{i-r}, \dots, y_{i+r}\}$, and $r$ is the reach, the allowed range of warping.
[Figure: a series X plotted against the upper envelope U and lower envelope L constructed around series Y]
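A sketch of the bound in Python, with the envelope computed on the fly; in a nearest-neighbour search one would evaluate this cheap bound first and only run full DTW when it beats the best distance found so far:

```python
import numpy as np

def lb_keogh(x, y, r):
    """LB_Keogh: sums the squared amount by which x escapes the
    envelope [L, U] built around y with reach r."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    total = 0.0
    for i in range(len(x)):
        lo, hi = max(0, i - r), min(len(y), i + r + 1)
        U, L = y[lo:hi].max(), y[lo:hi].min()   # envelope around y at position i
        if x[i] > U:
            total += (x[i] - U) ** 2
        elif x[i] < L:
            total += (x[i] - L) ** 2
    return total
```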
Discrete Sequences
Chapter 7.4: Basic Ideas
Aggarwal Ch. 14.1-14.2
Trouble in Time Series Paradise

Continuous, real-valued time series have their downsides: mining results rely on either a distance function or on assumptions, so indexing, pattern mining, summarisation, clustering, classification, and outlier detection results hence rely on arbitrary choices.

Discrete sequences are often easier to deal with: mining results rely mostly on counting. How to transform a time series into an event sequence? Discretisation.
Approximating a Time Series
(Lin et al. 2002, 2007)
SAX

Symbolic Aggregate Approximation (SAX) is the most well-known approach to discretise a time series; it is a type of piecewise aggregate approximation (PAA).

How to do SAX:
- divide the data into $w$ frames
- compute the mean per frame
- perform equal-height binning over the means, to obtain an alphabet of $a$ characters
(Lin et al. 2002, 2007)
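A bare-bones sketch of the pipeline. Note that the published SAX uses breakpoints from a Gaussian assumption on z-normalised data; this follows the slide's equal-height binning of the frame means, and the function name is ours:

```python
import numpy as np

def sax(series, w, a):
    """PAA into w frames, then equal-height binning of the frame
    means into an alphabet of a characters."""
    frames = np.array_split(np.asarray(series, float), w)
    means = np.array([f.mean() for f in frames])            # PAA representation
    # equal-height (equal-frequency) bin edges over the means
    edges = np.quantile(means, np.linspace(0, 1, a + 1)[1:-1])
    return ''.join(chr(ord('a') + s) for s in np.searchsorted(edges, means))

print(sax(np.sin(np.linspace(0, 4 * np.pi, 200)), w=8, a=4))  # an 8-char string over {a,..,d}
```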
Definitions

A discrete sequence $X_1 \dots X_n$ of length $n$ and dimensionality $d$ contains $d$ discrete feature values at each of $n$ different timestamps $t_1 \dots t_n$. Each of the $n$ components $X_i$ contains $d$ discrete behavioural attributes $(x_{i1} \dots x_{id})$ collected at the $i$-th timestamp. The actual timestamps are usually ignored; they only induce an order on the components, or events.
Types of discrete sequences

In many applications the dimensionality is 1, e.g. strings such as text or genomes: for AATCGTAC over an alphabet $\Sigma = \{A, C, G, T\}$, each $X_i \in \Sigma$.

In some applications each $X_i$ is not a vector but a set, e.g. a supermarket transaction, $X_i \subseteq \Sigma$; there is no order within $X_i$. We will consider the set setting, as it is the most general.
Chapter 7.5: Frequent Patterns
Aggarwal Ch. 15.2
Sequential patterns

A sequential pattern is a sequence; to occur in the data, it has to be a subsequence of the data.

Definition: Given two sequences $\mathcal{X} = X_1 \dots X_n$ and $\mathcal{Y} = Y_1 \dots Y_k$, where all elements $X_i$ and $Y_i$ are sets, the sequence $\mathcal{Y}$ is a subsequence of $\mathcal{X}$ if $k$ elements $X_{i_1} \dots X_{i_k}$ can be found in $\mathcal{X}$ such that $i_1 < i_2 < \dots < i_k$ and $Y_j \subseteq X_{i_j}$ for each $j \in \{1, \dots, k\}$.
$\mathcal{X}$ = a a a b d c d b a b c a a b a b a b
$\mathcal{Y}$ = a b
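A small sketch of this containment test, using a greedy left-to-right scan (which is sound here: if any embedding exists, the greedy earliest one does too); elements are Python sets, and the function name is ours:

```python
def is_subsequence(pattern, sequence):
    """True iff pattern Y occurs in X: each element of Y matches a
    superset in X, in order, with gaps allowed."""
    it = iter(sequence)
    return all(any(y <= x for x in it) for y in pattern)

X = [{'a'}, {'a'}, {'a', 'b'}, {'d'}, {'c'}, {'b'}]
Y = [{'a'}, {'b'}]
print(is_subsequence(Y, X))  # True: 'a' at position 1, 'b' inside {a, b} at position 3
```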
Support

Depending on whether we have a database $\mathcal{D}$ of sequences, or a single long sequence, we have to define the support of a sequential pattern differently.

Standard, or 'per sequence', support counting: given a database $\mathcal{D} = \{\mathcal{X}_1, \dots, \mathcal{X}_N\}$, the support of a subsequence $\mathcal{Y}$ is the number of sequences in $\mathcal{D}$ that contain $\mathcal{Y}$.

Window-based support counting: given a single sequence $\mathcal{X}$, the support of a subsequence $\mathcal{Y}$ is the number of windows over $\mathcal{X}$ that contain $\mathcal{Y}$.

(we can define frequency analogously, as relative support)
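Both notions then reduce to counting containment. A sketch, reusing is_subsequence from the example above, with w a chosen window length:

```python
def per_sequence_support(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

def window_support(pattern, sequence, w):
    """Number of length-w windows over the sequence that contain the pattern."""
    return sum(is_subsequence(pattern, sequence[i:i + w])
               for i in range(len(sequence) - w + 1))
```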
Windows

A window $\mathcal{X}[i; j]$ is a strict subsequence of sequence $\mathcal{X}$: $\mathcal{X}[i; j] = \langle X_k \in \mathcal{X} \mid i \le k \le j \rangle$.

Window-based support counting: we can choose a window length $w$ and sweep over the data.

$\mathcal{X}$ = a a a b d c d b a b c a b d a a b c
$\mathcal{Y}$ = a b
As we sweep the window over $\mathcal{X}$, the count of windows that contain $\mathcal{Y}$ increases step by step (1, 2, 3, 4, ...). The support is now dependent on $w$: what happens with longer $w$?
Minimal windows

Fixed window lengths lead to double counting: if $\mathcal{X}[i; j]$ supports sequence $\mathcal{Y}$, so do $\mathcal{X}[i; j+1]$ and $\mathcal{X}[i-1; j]$.

We can avoid this by counting only minimal windows: $w = \mathcal{X}[i; j]$ is a minimal window of pattern $\mathcal{Y}$ if $w$ contains $\mathcal{Y}$ but no proper sub-window of $w$ contains $\mathcal{Y}$. For efficiency (or fun) we may want to set a maximal window size.

$\mathcal{X}$ = a a a b d c d b a b c a b d a a b c
$\mathcal{Y}$ = a b
Sweeping over the example and counting only the minimal windows of $\mathcal{Y}$ gives a support of 5.
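A sketch of minimal-window counting for the single-item case (names and approach are ours): for each position where an occurrence can end, match the pattern right-to-left to find the latest start, then keep only candidate windows with no other candidate inside.

```python
def minimal_windows(pattern, sequence):
    """Count minimal windows containing `pattern` as a subsequence."""
    cands = []
    for e, sym in enumerate(sequence):
        if sym != pattern[-1]:
            continue                        # an occurrence must end here
        i, j = len(pattern) - 2, e - 1
        while i >= 0 and j >= 0:            # match the rest right-to-left
            if sequence[j] == pattern[i]:
                i -= 1
            j -= 1
        if i < 0:
            cands.append((j + 1, e))        # latest-start window ending at e
    # minimal: no other candidate window lies inside this one
    return sum(1 for (s, e) in cands
               if not any((s2, e2) != (s, e) and s <= s2 and e2 <= e
                          for (s2, e2) in cands))

print(minimal_windows("ab", "abcab"))  # 2: one minimal 'a..b' window at each end
```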
Mining Frequent Sequential Patterns

Like for itemsets, the per-sequence and per-window definitions of support are monotone (downward closed): we can employ level-wise search!

We can modify
- APRIORI to get GSP (Agrawal & Srikant, 1995; Mannila, Toivonen & Verkamo, 1995)
- ECLAT to get SPADE (Zaki, 2000)
- FP-GROWTH to get PREFIXSPAN (Pei et al., 2001)
Generalised Sequential Pattern Mining

Algorithm GSP(sequence database $\mathcal{D}$, minimal support $\sigma$)
begin
  $k \leftarrow 1$; $\mathcal{F}_k \leftarrow$ {all frequent 1-item elements}
  while $\mathcal{F}_k$ is not empty do
    Generate $\mathcal{C}_{k+1}$ by joining pairs of sequences in $\mathcal{F}_k$, such that removing an item from the first element of one sequence matches the sequence obtained by removing an item from the last element of the other
    Prune sequences from $\mathcal{C}_{k+1}$ that violate downward closure
    Determine $\mathcal{F}_{k+1}$ by support counting on $(\mathcal{C}_{k+1}, \mathcal{D})$, retaining sequences from $\mathcal{C}_{k+1}$ with support at least $\sigma$
    $k \leftarrow k + 1$
  end
  return $\bigcup_{i=1}^{k} \mathcal{F}_i$
end

(Agrawal & Srikant, 1995; Mannila, Toivonen & Verkamo, 1995)

[Figure: join example, two sequences from $\mathcal{F}_k$ (among them ⟨a c b⟩) combine into a candidate such as ⟨a b c b⟩ in $\mathcal{C}_{k+1}$]
Episodes

There are many types of sequential patterns. The most well-known are
- $n$-grams, $n$-mers, or strict subsequences, where we do not allow gaps
- serial episodes, or subsequences, where we do allow gaps

$\mathcal{X}$ = a c b a c b a a a b c c d b d b e c b d a a b c
Each element of an episode can contain one or more items.

[Figure: multi-item episode patterns over items a, b, c, d highlighted as gapped occurrences in a longer sequence $\mathcal{X}$]
Parallel episodes

Serial episodes are still restrictive: not everything always happens exactly in sequential order. Parallel episodes acknowledge this: a parallel episode defines a partial order; for a match it requires all parallel events to happen, but does not specify their exact order, e.g. first one event, then two others in any order, and then a final one. We can also combine the two into generalised episodes.

[Figure: partial-order diagrams over the events a, b, c, and d]
Chapter 7.6: Hidden Markov Models
Aggarwal Ch. 15.5
Informal definition

Hidden Markov Models are probabilistic, generative models for discrete sequences. An HMM is a graphical model in which nodes correspond to system states, and edges to state changes. In an HMM the states of the system are hidden, i.e. not directly visible to the user; we only observe a sequence over symbols $\Sigma$ that the system generates as it switches between states.
Example HMM

This HMM can generate sequences such as
- VVVVVVVMVV: Veggie (common)
- MVMVVMMVM: Omni (common)
- MMVMVVVVVV: Omni-turned-Veggie (not very common)
- MMMMMMMM: Carnivore (rare)

[Figure: two states, Vegetarian and Omnivore; self-transitions 0.99 and 0.90, cross-transitions 0.01 (Vegetarian to Omnivore) and 0.10 (Omnivore to Vegetarian); meal distribution V = 99%, M = 1% for Vegetarian and V = 50%, M = 50% for Omnivore]
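A sketch that generates meal sequences from this two-state model; the dictionaries mirror the figure, while the function names and the 'Veg' start state are our choices:

```python
import random

TRANS = {'Veg':  {'Veg': 0.99, 'Omni': 0.01},
         'Omni': {'Veg': 0.10, 'Omni': 0.90}}
EMIT  = {'Veg':  {'V': 0.99, 'M': 0.01},
         'Omni': {'V': 0.50, 'M': 0.50}}

def draw(dist):
    """Sample a key from an {outcome: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(start, length):
    """Emit `length` meals; the state path itself stays hidden."""
    state, out = start, []
    for _ in range(length):
        out.append(draw(EMIT[state]))      # emit a symbol from the current state
        state = draw(TRANS[state])         # then take a transition
    return ''.join(out)

print(generate('Veg', 10))                 # e.g. 'VVVVVVVVVV'
```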
Example HMM (2)

[Figure: four states, Vegetarian, Flexitarian, Omnivore, and Carnivore, with transition probabilities 0.99, 0.60, 0.90, 0.60, 0.40, 0.20, 0.20, 0.08, 0.02, 0.01 on its edges; meal distributions V = 99%/M = 1% (Vegetarian), V = 80%/M = 20% (Flexitarian), V = 50%/M = 50% (Omnivore), and V = 1%/M = 99% (Carnivore)]
Formal definition

A Hidden Markov Model over alphabet $\Sigma = \{\sigma_1, \dots, \sigma_{|\Sigma|}\}$ is a directed graph $G(S, A)$ consisting of $n$ states $S = \{s_1, \dots, s_n\}$. The initial state probabilities are $\pi_1, \dots, \pi_n$. The (directed) edges correspond to state transitions; the probability of a transition from state $s_i$ to state $s_j$ is denoted by $p_{ij}$. For every visit to a state $s_j$, a symbol from $\Sigma$ is generated with probability $P(\sigma_i \mid s_j)$.
What to do with an HMM

There are three main things to do with an HMM:
1. Training. Given topology $G$ and database $\mathcal{D}$, learn the initial state probabilities, transition probabilities, and symbol emission probabilities.
2. Explanation. Given an HMM, determine the most likely state sequence that generated a test sequence $\mathcal{Y}$.
3. Evaluation. Given an HMM, determine the probability of a test sequence $\mathcal{Y}$.
Using an HMM for Evaluation

We want to know the fit probability that a sequence $\mathcal{X} = x_1 \dots x_m$ was generated by the given HMM.

Naïve approach:
- compute all $n^m$ possible paths over $G$
- for each, determine the probability of generating $\mathcal{X}$
- sum these probabilities; this is the fit probability of $\mathcal{X}$
Recursive Evaluation

The fit probability of the first $i$ symbols (and a fixed value of the $i$-th state) can be computed recursively from the fit probability of the first $(i-1)$ symbols (and a fixed value of the $(i-1)$-th state).

Let $\alpha_i(\mathcal{X}, s_j)$ be the probability that the first $i$ symbols of $\mathcal{X}$ are generated by the model, and the last state is $s_j$:

$$\alpha_i(\mathcal{X}, s_j) = \sum_{k=1}^{n} \alpha_{i-1}(\mathcal{X}, s_k) \cdot p_{kj} \cdot P(x_i \mid s_j)$$

That is, we sum over all paths up to different final nodes.
Forward Algorithm

We initialise with $\alpha_1(\mathcal{X}, s_j) = \pi_j \cdot P(x_1 \mid s_j)$ and then iteratively compute $\alpha_i$ for each $i = 2, \dots, m$. The fit probability of $\mathcal{X}$ is the sum over all end states:

$$F(\mathcal{X}) = \sum_{j=1}^{n} \alpha_m(\mathcal{X}, s_j)$$

The complexity of the Forward Algorithm is $O(n^2 m)$.
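A vectorised sketch of the Forward algorithm; the array conventions and the uniform initial distribution in the example are our assumptions:

```python
import numpy as np

def forward(pi, P, E, obs):
    """Fit probability of `obs` (symbol indices) under an HMM with
    initial probs pi (n,), transitions P (n, n), emissions E (n, |Sigma|)."""
    alpha = pi * E[:, obs[0]]              # alpha_1(X, s_j) = pi_j * P(x_1 | s_j)
    for o in obs[1:]:
        alpha = (alpha @ P) * E[:, o]      # sum over predecessors, O(n^2) per step
    return alpha.sum()                     # F(X): sum over all end states

# two-state veggie/omnivore model; symbols: 0 = V, 1 = M
pi = np.array([0.5, 0.5])
P  = np.array([[0.99, 0.01], [0.10, 0.90]])
E  = np.array([[0.99, 0.01], [0.50, 0.50]])
print(forward(pi, P, E, [0, 0, 1, 0]))     # fit probability of 'VVMV'
```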
But, why?

Good question to ask: why compute the fit probability? For classification, clustering, and anomaly detection.

For the first two, we can now create group-specific HMMs, and assign each sequence to the HMM under which it is most likely. For the third, we have an HMM for our training data, and can now report poorly fitting sequences.
Using an HMM for Explanation

We want to know why a sequence $\mathcal{X}$ fits our data; the most likely state sequence gives an intuitive explanation.

Naïve approach:
- compute all $n^m$ possible paths over the HMM
- for each, determine the probability of generating $\mathcal{X}$
- report the path with maximum probability

Instead of doing this naïvely, can we re-use the recursive approach?
Viterbi Algorithm

Any subpath of an optimal state path must also be optimal for generating the corresponding subsequence. Let $\delta_i(\mathcal{X}, s_j)$ be the probability of the best state sequence generating the first $i$ symbols of $\mathcal{X}$ and ending at state $s_j$, with

$$\delta_i(\mathcal{X}, s_j) = \max_{k \in [1, n]} \delta_{i-1}(\mathcal{X}, s_k) \cdot p_{kj} \cdot P(x_i \mid s_j)$$

That is, we recursively compute the maximum-probability path over all $n$ different paths for different final nodes. We initialise the recursion with $\delta_1(\mathcal{X}, s_j) = \pi_j \cdot P(x_1 \mid s_j)$, and then iteratively compute for $i = 2, \dots, m$.
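The same table-filling idea with max in place of sum gives a sketch of Viterbi; a backpointer array (our addition) recovers the actual path:

```python
import numpy as np

def viterbi(pi, P, E, obs):
    """Most likely state path (as state indices) for `obs` under the HMM."""
    n, m = len(pi), len(obs)
    delta = pi * E[:, obs[0]]                  # delta_1(X, s_j)
    back = np.zeros((m, n), dtype=int)         # best predecessor per (step, state)
    for t in range(1, m):
        scores = delta[:, None] * P            # scores[k, j] = delta_k * p_kj
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * E[:, obs[t]]
    path = [int(delta.argmax())]               # best end state
    for t in range(m - 1, 0, -1):              # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
P  = np.array([[0.99, 0.01], [0.10, 0.90]])
E  = np.array([[0.99, 0.01], [0.50, 0.50]])
print(viterbi(pi, P, E, [0, 1, 1, 0]))  # [1, 1, 1, 1]: staying Omnivore explains VMMV best
```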
Training an HMM

So far, we assumed the given HMM was trained. How do we train an HMM in practice?

Learning the parameters of an HMM is difficult: no known algorithm is guaranteed to find the global optimum. There do exist methods that give reasonably effective solutions, e.g. the Forward-Backward (Baum-Welch) algorithm.
Backward

We already know how to calculate the forward probability $\alpha_i(\mathcal{X}, s_j)$ for the first $i$ symbols of a sequence $\mathcal{X}$, ending at $s_j$. Now, let $\beta_i(\mathcal{X}, s_j)$ be the backward probability for the part of the sequence after, and not including, the $i$-th symbol, conditioned on the $i$-th state being $s_j$. We initialise $\beta_m(\mathcal{X}, s_j) = 1$, and compute $\beta_i(\mathcal{X}, s_j)$ just as $\alpha_i(\mathcal{X}, s_j)$, but from back to front.

For the Baum-Welch algorithm, we will also need $\gamma_i(\mathcal{X}, s_j)$ for the probability that the $i$-th state is $s_j$, and $\psi_i(\mathcal{X}, s_j, s_k)$ for the probability that the $i$-th state is $s_j$ and the $(i+1)$-th state is $s_k$.
Baum-Welch

We initialise the model parameters randomly. We then iteratively
- (E-step) estimate $\alpha(\cdot)$, $\beta(\cdot)$, $\psi(\cdot)$, and $\gamma(\cdot)$ from the current model parameters
- (M-step) estimate the model parameters $\pi(\cdot)$, $p(\cdot,\cdot)$, and $P(\cdot \mid \cdot)$ from the current $\alpha(\cdot)$, $\beta(\cdot)$, $\psi(\cdot)$, and $\gamma(\cdot)$
until the parameters converge. This is simply the EM strategy!
Estimating parameters

$\alpha(\cdot)$: easy, we estimate these using the Forward algorithm.
$\beta(\cdot)$: easy, we estimate these using the Backward algorithm.
Estimating parameters (2)

$\psi(\cdot)$: we can split this value into the part up to the $i$-th state, the $i$-th to $(i+1)$-th transition, and the $(i+1)$-th state until the end:

$$\psi_i(\mathcal{X}, s_j, s_k) = \alpha_i(\mathcal{X}, s_j) \cdot p_{jk} \cdot P(x_{i+1} \mid s_k) \cdot \beta_{i+1}(\mathcal{X}, s_k)$$

and normalise to probabilities over all pairs $(j, k)$. So, easy after all.

$\gamma(\cdot)$: easy, for $\gamma_i(\mathcal{X}, s_j)$ we just fix $s_j$ and sum $\psi_i(\mathcal{X}, s_j, s_k)$ over all $s_k$.
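Given the forward and backward tables, one E-step is only a few lines. A sketch, assuming alpha and beta are (m, n) arrays as in the earlier snippets, with gamma taken for $i < m$:

```python
import numpy as np

def e_step(alpha, beta, P, E, obs):
    """psi[i, j, k] and gamma[i, j] from forward/backward tables."""
    m, n = alpha.shape
    psi = np.zeros((m - 1, n, n))
    for i in range(m - 1):
        # alpha_i(s_j) * p_jk * P(x_{i+1} | s_k) * beta_{i+1}(s_k)
        psi[i] = alpha[i][:, None] * P * E[:, obs[i + 1]] * beta[i + 1]
        psi[i] /= psi[i].sum()             # normalise over all pairs (j, k)
    gamma = psi.sum(axis=2)                # fix s_j, sum over s_k
    return psi, gamma
```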
Conclusions

- Discrete sequences are a fun aspect of time series: many interesting problems
- Mining sequential patterns: more expressive than itemsets, more difficult to define support
- Hidden Markov Models can be used to predict, explain, and evaluate discrete sequences