Modeling and Mining Sequential Data
Machine Learning and Data Mining
Philipp Singer
CC image courtesy of user puliarfanita on Flickr
What is sequential data?
Stock share price (Bitcoin)
Screenshot from bitcoinwisdom.com
Daily degrees in Cologne
Screenshot from google.com (data from weather.com)
Human mobility
Screenshot from maps.google.com
Web navigation
[Diagram: example navigation between the pages Austria, Germany, and C.F. Gauss]
Song listening sequences
Screenshots from youtube.com
Let us distinguish two types of sequence data
Continuous time series
  Stock share price
  Daily degrees in Cologne
Categorical (discrete) sequences (focus)
  Sunny/Rainy weather sequence
  Human mobility
  Web navigation
  Song listening sequences
This lecture is about...
Modeling
Predicting
Pattern Mining
Markov Chains
[Diagram: states S1, S2, S3 with transition probabilities 1/2, 1/2, 1/3, 2/3, 1]
Markov Chain Model
Stochastic Model
Transitions between states
[Diagram: states S1, S2, S3 with transition probabilities 1/2, 1/2, 1/3, 2/3, 1]
Markovian property: the next state in a sequence depends only on the current state, not on the sequence of states that preceded it
Classic weather example
[Diagram: states Sunny and Rainy; Sunny→Sunny 0.9, Sunny→Rainy 0.1, Rainy→Sunny 0.5, Rainy→Rainy 0.5]
Formal definition
State space S = {s_1, ..., s_n}
A sequence amounts to a sequence of random variables X_1, X_2, ..., X_t
Markovian (memoryless) property: P(X_{t+1} = s_j | X_1, ..., X_t) = P(X_{t+1} = s_j | X_t)
Transition matrix P with single transition probabilities p_ij = P(X_{t+1} = s_j | X_t = s_i); rows sum to 1
Example
Transition matrix:
         Sunny  Rainy
Sunny     0.9    0.1
Rainy     0.5    0.5
Likelihood
Transition probabilities p_ij are the model parameters
Likelihood: L = prod_{i,j} p_ij^{n_ij}, where n_ij is the count of transitions from s_i to s_j
Maximum Likelihood (MLE)
Given some sequence data, how can we determine parameters?
MLE estimation: maximize the likelihood
This yields p_ij = n_ij / sum_k n_ik
See ref [1]
[1] http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102070
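A minimal sketch of MLE estimation for a first-order Markov chain; the states and the sequence below are illustrative, not taken from the slides:

```python
from collections import defaultdict

def mle_transition_matrix(sequence):
    """Estimate first-order transition probabilities by MLE:
    p_ij = n_ij / sum_k n_ik (transition counts, row-normalized)."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    probs = {}
    for state, row in counts.items():
        total = sum(row.values())  # total transitions out of this state
        probs[state] = {nxt: c / total for nxt, c in row.items()}
    return probs

# Illustrative sequence of Sunny (S) / Rainy (R) observations
seq = ["S", "S", "S", "S", "R", "R", "S", "S", "S", "R", "S"]
P = mle_transition_matrix(seq)
print(P["S"]["S"])  # 5 of the 7 transitions out of S go to S: 5/7
```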
Prediction
The prediction of the next state is simply derived from the transition probabilities of the current state
One option: take the state with maximum probability
Prediction
What about t+3? Multi-step predictions follow from powers of the transition matrix
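Multi-step predictions can be read off powers of the transition matrix; a sketch using the values from the classic weather example above:

```python
import numpy as np

# Transition matrix of the Sunny/Rainy example: rows sum to 1
P = np.array([[0.9, 0.1],   # Sunny -> Sunny, Sunny -> Rainy
              [0.5, 0.5]])  # Rainy -> Sunny, Rainy -> Rainy

# Distribution over states three steps ahead, starting from Sunny
start = np.array([1.0, 0.0])
dist_t3 = start @ np.linalg.matrix_power(P, 3)
print(dist_t3)  # [0.844 0.156]: still most likely Sunny at t+3
```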
Pattern mining
Simply derived from the (non-normalized) transition count matrix
[Table: example transition counts 90, 2, 2, 1]
The most common transition is a sequential pattern
Full example
Training sequence: [figure: a sequence of Sunny/Rainy days]
Transition counts:
         Sunny  Rainy
Sunny      5      2
Rainy      2      1

Transition matrix (MLE):
         Sunny  Rainy
Sunny     5/7    2/7
Rainy     2/3    1/3
Likelihood of given sequence
We calculate the probability of the sequence, assuming that we start with Sunny.
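The likelihood computation can be sketched as follows; the training sequence itself is not legible in the extracted slides, so a short hypothetical sequence starting with Sunny is used:

```python
# MLE transition matrix from the full example
P = {"S": {"S": 5/7, "R": 2/7},
     "R": {"S": 2/3, "R": 1/3}}

def sequence_likelihood(seq, P):
    """Product of transition probabilities, conditioning on the first state."""
    prob = 1.0
    for cur, nxt in zip(seq, seq[1:]):
        prob *= P[cur][nxt]
    return prob

# Hypothetical sequence: Sunny, Sunny, Rainy, Sunny
print(sequence_likelihood(["S", "S", "R", "S"], P))  # (5/7)*(2/7)*(2/3)
```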
Prediction: from the current state, predict the next state using the corresponding row of the MLE transition matrix
Higher order Markov Chain models
Drop the memoryless assumption?
Models of increasing order: 2nd order MC model
3rd order MC model
...
2nd order example
The next state depends on the two preceding states: P(X_{t+1} | X_t, X_{t-1})
Higher order to first order transformation
Transform state space
2nd order example: new compound states (pairs of original states)
2nd order example
[Table: transition counts between compound states
SS: →SS 3, →SR 1
SR: →RS 1, →RR 1
RS: →SS 1, →SR 0
RR: →RS 1, →RR 1
and the resulting MLE probabilities 3/4, 1/4, 1/2, 1/2, 1/1, 0, 1/2, 1/2]
Reset states
[Diagram: sequences padded with a reset state R at the start and end]
Marking start and end of sequences
Makes the transformation easier (same number of transitions)
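The transformation of a padded sequence into compound (pair) states might look like this sketch; the reset symbol is written as "*" here to avoid clashing with Rainy "R":

```python
def to_second_order(seq, reset="*"):
    """Pad with a reset state and turn a sequence into compound (pair) states,
    so a first-order chain over pairs captures second-order dependencies.
    The padded sequence and the compound sequence have the same number of
    transitions."""
    padded = [reset] + list(seq) + [reset]
    return [(a, b) for a, b in zip(padded, padded[1:])]

states = to_second_order(["S", "S", "R"])
print(states)  # [('*', 'S'), ('S', 'S'), ('S', 'R'), ('R', '*')]
```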
Comparing models
1st vs. 2nd order
Statistical model comparison necessary
Nested models: the higher-order model always fits at least as well
Account for potential overfitting
Model comparison
Likelihood ratio test: ratio between likelihoods for order m and k
Follows a Chi² distribution; the degrees of freedom equal the difference in the number of free parameters
Only for nested models
Akaike Information Criterion (AIC): AIC = -2 ln(L) + 2k for k free parameters
The lower the better
Bayesian Information Criterion (BIC): BIC = -2 ln(L) + k ln(n)
Bayes factors: ratio of evidences (marginal likelihoods)
Cross validation
See http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102070
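A sketch of comparing model orders by AIC; the sequence is illustrative (not the slide's exact data), and parameters are counted as (number of states - 1) free probabilities per context:

```python
import math
from collections import Counter

def log_likelihood_and_params(seq, order):
    """Fit a Markov chain of the given order by MLE and return
    (log-likelihood, number of free parameters)."""
    ctx_counts = Counter()
    trans_counts = Counter()
    for i in range(order, len(seq)):
        ctx = tuple(seq[i - order:i])
        ctx_counts[ctx] += 1
        trans_counts[(ctx, seq[i])] += 1
    ll = sum(c * math.log(c / ctx_counts[ctx])
             for (ctx, nxt), c in trans_counts.items())
    n_states = len(set(seq))
    k = (n_states ** order) * (n_states - 1)  # free parameters
    return ll, k

seq = ["S", "S", "S", "S", "R", "R", "S", "S", "S", "R", "S"]
for order in (1, 2):
    ll, k = log_likelihood_and_params(seq, order)
    print(order, round(-2 * ll + 2 * k, 3))  # lower AIC is preferred
```

On this short sequence the extra parameters of the 2nd order model are not worth the better fit, so the 1st order model wins.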
AIC example
[Tables: first-order transition matrix with reset state R (entries such as 5/8, 2/8, 1/8 and 2/3, 1/3, 0/3) and the corresponding, mostly sparse second-order matrix over compound states; the number of 1st order parameters is compared with the number of 2nd order parameters]
Example on blackboard
Markov Chain applications
Google's PageRank
DNA sequence modeling
Web navigation
Mobility
Hidden Markov Models
Hidden Markov Models
Extends Markov chain model
Hidden state sequence
Observed emissions
What is the weather like?
Forward-Backward algorithm
Given emission sequence
Probability of emission sequence?
Probable sequence of hidden states?
[Diagram: hidden state sequence aligned with the observed emission sequence]
Check out YouTube tutorial: https://www.youtube.com/watch?v=7zDARfKVm7s
Further material: cs229.stanford.edu/section/cs229-hmm.pdf
Setup
[Diagram: hidden-state transition probabilities 0.7/0.3 and 0.6/0.4; emission probabilities 0.9/0.1 and 0.2/0.8; reset state R with probabilities 0.5/0.5]
Note: Literature usually uses a start probability and uniform end probability for the forward-backward algorithm.
Forward
What is the probability of going to each possible state at t2 given t1?
[Diagrams: forward variables computed step by step:
t1: 0.4, 0.1
t2: 0.034, 0.144
t3: 0.011, 0.061
t4: 0.035, 0.006
with a final reset transition back to R (0.5/0.5)]
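A sketch of the forward recursion: the transition matrix is taken from the setup slide, while the emission layout, observation coding, and uniform start probabilities are assumptions for illustration:

```python
import numpy as np

# Assumed two-state model (e.g. Sunny, Rainy)
A = np.array([[0.7, 0.3],   # hidden-state transitions (from the setup slide)
              [0.6, 0.4]])
B = np.array([[0.9, 0.1],   # emission probabilities P(obs | state);
              [0.2, 0.8]])  # columns: observation symbols 0 and 1 (assumed)
pi = np.array([0.5, 0.5])   # uniform start probabilities (assumed)

def forward(obs):
    """Forward algorithm: alpha[i] = P(o_1..o_t, X_t = i) after each step."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha  # summing over states gives P(observation sequence)

alpha = forward([0, 1])
print(alpha.sum())  # probability of the emission sequence: 0.1775
```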
Backwards
What is the probability of arriving at t4 given each possible state at t3?
[Diagrams: backward variables computed step by step:
t3: 0.31, 0.28
t2: 0.097, 0.12
t1: 0.039, 0.049
using the reset transition and the emission at the end of the sequence]
Forward-Backward
Most likely state at t2
[Diagram: forward variables (0.4, 0.1), (0.034, 0.144), (0.011, 0.061), (0.035, 0.006) combined with backward variables (0.039, 0.049), (0.097, 0.12), (0.31, 0.28); the most likely state at t2 maximizes forward × backward]
Forward-Backward
Posterior decoding
Most likely state at each t
For most likely sequence: Viterbi algorithm
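For the most likely hidden sequence, a minimal Viterbi sketch using the same assumed two-state model as in the forward example (parameters illustrative, not verified against the slide):

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.6, 0.4]])   # transitions (setup slide)
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emissions (assumed layout)
pi = np.array([0.5, 0.5])                # start probabilities (assumed)

def viterbi(obs):
    """Most likely hidden state sequence via dynamic programming."""
    delta = pi * B[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] * A        # scores[i, j]: come from i, go to j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]
    # Trace back from the best final state
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1]))  # [0, 0, 1]: state 0 twice, then state 1
```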
Learning parameters
Train parameters of HMM
No closed-form solution for the MLE is known
Baum-Welch algorithm: a special case of the EM algorithm
Uses Forward-Backward in its E-step
HMM applications
Speech recognition
POS tagging
Translation
Gene prediction
Other related methods
Sequential Pattern Mining
PrefixSpan
Apriori Algorithm
GSP Algorithm
SPADE
Reference: rakesh.agrawal-family.com/papers/icde95seq.pdf
Graphical models
Bayesian networks: random variables, conditional dependence, directed acyclic graph
Markov random fields: random variables, Markov property, undirected graph
Questions?
Philipp Singer, philipp.singer@gesis.org