Fast State Discovery for HMM Model Selection and Learning. Sajid M. Siddiqi, Geoffrey J. Gordon, Andrew W. Moore (CMU)

Slides 2-6: Consider a sequence of real-valued observations O_t over time t (speech, sensor readings, stock prices, ...). We can model it purely based on contextual properties; however, we would miss important temporal structure.

Slide 7: Current efficient approaches learn the wrong model.

Slide 8: Our method successfully discovers the overlapping states.

Slide 9: Our goal: efficiently discover states in sequential data while learning a Hidden Markov Model.

Slide 10: Motion capture example.

Slide 11: Definitions and notation. An HMM is λ = {A, B, π}, where
  A: N x N transition matrix
  B: observation model, {μ_s, Σ_s} for each of the N states
  π: N x 1 prior probability vector
  T: length of the observation sequence O_1, ..., O_T
  q_t: the state the HMM is in at time t, with q_t ∈ {s_1, ..., s_N}

Slides 12-13: Operations on HMMs.
  Likelihood evaluation, L(λ; O) = P(O | λ):  Forward-Backward,  O(TN^2)
  Path inference, Q* = argmax_Q P(O, Q | λ):  Viterbi,  O(TN^2)
  Parameter learning (for fixed N), λ* = argmax_{λ,Q} P(O, Q | λ) or λ* = argmax_λ P(O | λ):  Viterbi Training or Baum-Welch (EM),  O(TN^2)
  Model selection, λ* = argmax_{λ,Q,N} P(O, Q | λ) or λ* = argmax_{λ,N} P(O | λ):  ??, and we
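The notation on Slides 11-13 maps onto code fairly directly. Below is a minimal sketch, not from the paper, of a Gaussian-observation HMM and its O(TN^2) likelihood evaluation via the scaled forward algorithm; the class and method names are illustrative assumptions.

```python
# Minimal sketch of the notation above: an HMM lambda = {A, B, pi} with Gaussian
# observation models, plus O(T N^2) log-likelihood evaluation by the scaled
# forward algorithm. Names (GaussianHMM, log_likelihood) are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

class GaussianHMM:
    def __init__(self, A, means, covs, pi):
        self.A = np.asarray(A)        # N x N transition matrix
        self.means = means            # list of N mean vectors (mu_s)
        self.covs = covs              # list of N covariance matrices (Sigma_s)
        self.pi = np.asarray(pi)      # N x 1 prior probability vector

    def _emission_probs(self, O):
        # b[t, s] = P(O_t | q_t = s), evaluated for every timestep and state
        T, N = len(O), len(self.pi)
        b = np.empty((T, N))
        for s in range(N):
            b[:, s] = multivariate_normal.pdf(O, self.means[s], self.covs[s])
        return b

    def log_likelihood(self, O):
        # Scaled forward recursion: returns log P(O | lambda)
        b = self._emission_probs(O)
        alpha = self.pi * b[0]
        log_lik = np.log(alpha.sum())
        alpha /= alpha.sum()
        for t in range(1, len(O)):
            alpha = (alpha @ self.A) * b[t]      # O(N^2) per timestep
            log_lik += np.log(alpha.sum())
            alpha /= alpha.sum()
        return log_lik
```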
want O(TN^2).

Slides 14-17: Previous approaches.
  Multi-restart Baum-Welch for each candidate N is inefficient and highly prone to local minima.
  Bottom-up: state merging [Stolcke & Omohundro 1994], entropic state pruning [Brand 1999].
    Advantage: more robust to local minima.
    Problems: they require a loose upper bound on N, which adds complexity, and it is difficult to decide which states to prune or merge.
  Top-down: ML Successive State Splitting [Ostendorf & Singer 1997], heuristic split-merge [Li & Biswas 1999].
    Advantage: more robust to local minima, and more scalable.
    Problems: previous methods are not effective at state discovery, and are still slow for large N.
  We propose Simultaneous Temporal and Contextual Splitting (STACS), a top-down approach that is much better at state discovery while being at least as efficient, and a variant V-STACS that is much faster.

Slides 18-20: Bayesian Information Criterion (BIC) for model selection.
  We would like to compute the posterior probability over model sizes:
    P(model size | data) ∝ P(data | model size) · P(model size)
    log P(model size | data) = log P(data | model size) + log P(model size) + const.
  BIC assumes a prior that penalizes complexity (favors smaller models):
    log P(model size | data) ≈ log P(data | model size, λ_MLE) - (#FP / 2) log T
  where #FP is the number of free parameters, T is the length of the data sequence, and λ_MLE is the ML parameter estimate.
  BIC is an asymptotic approximation to the true posterior.
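The BIC score above can be written down directly. This is a hedged sketch assuming a full-covariance Gaussian HMM; the slides only define #FP as the number of free parameters, so the exact count used here is my assumption, and the function names are illustrative.

```python
# Sketch of the BIC score on Slides 18-20: log-likelihood at the ML estimate
# minus (#free parameters / 2) * log T. The parameter count below (assumed:
# full-covariance Gaussian HMM) is one plausible way to tally #FP.
import numpy as np

def num_free_params(N, d):
    """Free parameters of an N-state HMM with d-dim full-covariance Gaussians."""
    transitions = N * (N - 1)           # each row of A sums to 1
    prior = N - 1                       # pi sums to 1
    means = N * d
    covariances = N * d * (d + 1) // 2  # symmetric covariance per state
    return transitions + prior + means + covariances

def bic_score(log_likelihood, N, d, T):
    """BIC approximation to log P(model size | data)."""
    return log_likelihood - 0.5 * num_free_params(N, d) * np.log(T)
```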
Slide 21: Algorithm summary (STACS / V-STACS).
  Initialize an n_0-state HMM randomly.
  for n = n_0 ... N_max:
    Learn the model parameters.
    for i = 1 ... n:
      Split state i, optimize by constrained EM (STACS) or constrained Viterbi Training (V-STACS).
      Calculate the approximate BIC score of the split model.
    Choose the best split based on approximate BIC.
    Compare it to the original model with exact BIC (STACS) or approximate BIC (V-STACS).
    If the larger model is not chosen, stop.

Slide 22: STACS pseudocode (repeated as a sidebar on Slides 23-37).
  input: n_0, data sequence O = {O_1, ..., O_T}
  output: HMM of appropriate size
  λ ← n_0-state initial HMM
  repeat
    optimize λ over sequence O
    choose a subset of states; for each such state s, design a candidate model λ_s:
      choose a relevant subset of sequence O
      split state s, optimize λ_s over the subset
      score λ_s
    if max_s score(λ_s) > score(λ):
      λ ← best-scoring candidate from {λ_s}
    else:
      terminate, return the current λ
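Read as code, the Slide 21-22 loop looks roughly as follows. This is a schematic sketch, not the authors' implementation; the helpers (random_hmm, learn_em, viterbi_path, build_split_candidate, viterbi_score, bic) are placeholders for the steps detailed on the next slides.

```python
# Schematic sketch of the STACS outer loop from Slides 21-22. Helper functions
# are placeholders for steps described on later slides, not the authors' code.
def stacs(O, n0, n_max):
    hmm = random_hmm(n_states=n0, data=O)           # n_0-state initial HMM
    while hmm.n_states <= n_max:
        hmm = learn_em(hmm, O)                      # optimize lambda over O
        q_star = viterbi_path(hmm, O)               # current Viterbi state sequence
        # One candidate per state: split s and refine it on its own timesteps.
        candidates = [build_split_candidate(hmm, s, O, q_star)
                      for s in range(hmm.n_states)]
        # Rank candidates by the likelihood of O with their own Viterbi paths.
        best = max(candidates, key=lambda c: viterbi_score(c, O))
        # Accept the split only if it beats the un-split model under BIC.
        if bic(best, O) > bic(hmm, O):
            hmm = best
        else:
            break
    return hmm
```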
Slides 23-26: Candidate generation. Learn the parameters λ using EM and compute the Viterbi path Q*. Consider splits on all states; e.g., for state s_2, choose the subset D = {O_t : Q*(t) = s_2}. Note that |D| = O(T/N).

Slides 27-30: Optimizing a candidate. Split the state (the accompanying figure shows states s_1, s_2 before the split and s_1, s_2, s_3 after). Constrain λ_s to equal λ except for the offspring states' observation densities and all of their transition probabilities, both in and out. Learn these free parameters using two-state EM over D; this optimizes the partially observed likelihood P(O, Q*\D | λ_s). Finally, update Q* over D to obtain R*.
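A rough sketch of how such a candidate might be constructed, under my own assumptions about the model API (copy, add_state) and the offspring initialization; constrained_two_state_em stands in for the constrained two-state EM (or Viterbi training) step described above.

```python
# Illustrative sketch of the candidate construction on Slides 23-30 (not the
# authors' code). Given the current model and its Viterbi path Q*, pick the
# timesteps assigned to state s, then build a split candidate whose only free
# parameters are the two offspring densities and their transitions.
import numpy as np

def split_candidate(hmm, s, O, q_star, jitter=1e-2):
    # D = {O_t : Q*(t) = s}; on average |D| = O(T/N)
    D_idx = np.where(q_star == s)[0]
    D = O[D_idx]

    cand = hmm.copy()
    s2 = cand.add_state()            # offspring: old index s and new index s2

    # Initialize offspring densities by perturbing the parent's Gaussian.
    cand.means[s2] = cand.means[s] + jitter * np.random.randn(*cand.means[s].shape)
    cand.covs[s2] = cand.covs[s].copy()

    # Initialize the new state's transitions by copying the parent's row and
    # splitting its incoming mass, then renormalizing rows of A.
    cand.A[s2, :] = cand.A[s, :]
    cand.A[:, s2] = cand.A[:, s] / 2.0
    cand.A[:, s] /= 2.0
    cand.A /= cand.A.sum(axis=1, keepdims=True)

    # Constrained optimization: two-state EM (STACS) or two-state Viterbi
    # training (V-STACS) over D only, holding all other parameters fixed.
    constrained_two_state_em(cand, offspring=(s, s2), D=D, D_idx=D_idx)
    return cand
```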
Slides 31-33: Scoring is of two types. First, the candidates are compared to each other according to their Viterbi path likelihoods. Second, the best candidate in this ranking is compared to the un-split model using BIC, i.e. log P(model | data) ≈ log P(data | model) - complexity penalty.
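A minimal sketch of the first scoring level, ranking candidates by the joint likelihood of the data with their updated Viterbi paths, log P(O, R* | λ_s); the function and attribute names are illustrative assumptions.

```python
# Sketch of the candidate ranking on Slides 31-33 (illustrative). Candidates are
# ranked by log P(O, R* | lambda_s); only the winner is then compared to the
# un-split model with BIC.
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_path_loglik(hmm, O, path):
    """log P(O, Q | lambda) for a fixed state sequence Q."""
    ll = np.log(hmm.pi[path[0]])
    ll += multivariate_normal.logpdf(O[0], hmm.means[path[0]], hmm.covs[path[0]])
    for t in range(1, len(O)):
        ll += np.log(hmm.A[path[t - 1], path[t]])
        ll += multivariate_normal.logpdf(O[t], hmm.means[path[t]], hmm.covs[path[t]])
    return ll

def pick_best_candidate(candidates, paths, O):
    scores = [viterbi_path_loglik(c, O, p) for c, p in zip(candidates, paths)]
    return candidates[int(np.argmax(scores))]
```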
Slides 34-37: Viterbi STACS (V-STACS). Recall that STACS learns the free parameters using two-state EM over D; EM also has winner-take-all variants. V-STACS instead uses two-state Viterbi training over D, i.e. hard updates rather than STACS' soft updates. In V-STACS the Viterbi path likelihood is also used to approximate the BIC comparison against the un-split model.
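To make the soft-versus-hard distinction concrete, here is a simplified winner-take-all update for the two offspring densities. It ignores transition parameters, so it only illustrates hard assignment; it is not the actual constrained two-state Viterbi training used by V-STACS.

```python
# Hedged sketch of the winner-take-all ("hard") updates contrasted on Slides
# 35-37 with STACS' soft EM updates. Each observation in D is hard-assigned to
# the more likely offspring by emission density, then the offspring Gaussians
# are re-estimated from those assignments.
import numpy as np
from scipy.stats import multivariate_normal

def hard_update_offspring(D, means, covs):
    """One winner-take-all re-estimation step for two offspring densities."""
    # Hard assignment: offspring 0 or 1 for each point in D.
    logp = np.stack([multivariate_normal.logpdf(D, means[k], covs[k])
                     for k in range(2)], axis=1)
    assign = np.argmax(logp, axis=1)
    # Re-estimate each offspring's Gaussian from its assigned points.
    for k in range(2):
        Dk = D[assign == k]
        if len(Dk) > 1:
            means[k] = Dk.mean(axis=0)
            covs[k] = np.cov(Dk, rowvar=False) + 1e-6 * np.eye(D.shape[1])
    return means, covs, assign
```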
Slide 38: Time complexity. Optimizing the N candidates takes N · O(T) time for STACS and N · O(T/N) time for V-STACS. Scoring the N candidates takes N · O(T) time, so candidate search and scoring is O(TN) overall. The best-candidate evaluation is O(TN^2) for exact BIC in STACS and O(TN) for the approximate BIC in V-STACS.

Slides 39-44: Other methods.
  Li-Biswas generates two candidates: it splits the state with the highest variance and merges the closest pair of states (the merge is rarely chosen). It optimizes all candidate parameters over the entire sequence.
  ML-SSS generates 2N candidates, splitting each state in two ways. A contextual split optimizes the offspring states' observation densities with 2-Gaussian-mixture EM and assumes the offspring are connected "in parallel"; a temporal split optimizes the offspring states' observation densities, self-transitions and mutual transitions with EM and assumes the offspring are "in series". It optimizes the split of state s over all timesteps with nonzero posterior probability of being in state s, i.e. O(T) data points.

Slide 45: Results.

Slide 46: Data sets. Australian Sign-Language data collected from two 5DT instrumented gloves and an Ascension Flock-of-Birds tracker [Kadous 2002] (available in the UCI KDD Archive). Other data sets were obtained from the literature: Robot, MoCap, MLog, Vowel.

Slide 47: Learning HMMs of predetermined size: scalability, on Robot data (others similar).

Slide 48: Learning HMMs of predetermined size: log-likelihood while learning a 40-state HMM on Robot data (others similar).

Slide 49: Learning HMMs of predetermined size: results for 40-state HMMs.

Slides 50-52: Model selection on synthetic data, generalizing from (4 states, T = 1,000) to (10 states, T = 10,000). Both STACS and V-STACS discovered 10 states and the correct underlying transition structure. Li-Biswas and ML-SSS failed to find the 10-state model, and 10-state Baum-Welch also failed to find the correct observation and transition models, even with 50 restarts.

Slide 53: Model selection: BIC score on MoCap data (others similar).

Slide 54: Model selection results.

Slide 55: Sign-language recognition. Initial results on sign-language word recognition: 95 distinct words, 27 instances each, divided 8:1. Average classification accuracies and final HMM sizes N are shown in a table.

Slides 56-60: Modeling motion capture data: 35-dimensional data (thanks to Adrien Treuille). Original data vs. a STACS simulation (STACS found 235 states) vs. Baum-Welch (run with 235 states). [Video]

Slide 61: Discovering underlying structure. The sparse dynamics are difficult to learn using regular EM; STACS smoothly tiles the low-dimensional manifold of observations along with the correct dynamic structure.

Slide 62: Conclusion. A better method for HMM model selection and learning: it discovers hidden states, avoids local minima, and is faster than Baum-Welch. Even when learning HMMs of known size, it is better to discover states with STACS up to the desired N. It has widespread applicability to classification, recognition and prediction for real-valued sequential data problems.

Slides 63-69 (appendix): A Viterbi trellis figure with columns δ_t(1), δ_t(2), δ_t(3), ..., δ_t(N) over timesteps t = 1, ..., 9, with the Viterbi path highlighted. Suppose we split state N into s_1 and s_2: its column is replaced by δ_t(s_1) and δ_t(s_2), whose entries are initially unknown (shown as "??") and must be recomputed.
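For reference, the δ_t(i) quantities in this figure are those of the standard Viterbi recursion, δ_t(i) = max_j [δ_{t-1}(j) · a_{ji}] · b_i(O_t). Below is a minimal log-space sketch with illustrative names; it shows the full recursion, not the partial update over D that STACS performs after a split.

```python
# Minimal log-space Viterbi sketch defining the delta_t(i) table tracked in the
# appendix figure, plus backpointers that recover Q*. Illustrative, not the
# authors' code.
import numpy as np
from scipy.stats import multivariate_normal

def viterbi(hmm, O):
    T, N = len(O), len(hmm.pi)
    log_b = np.stack([multivariate_normal.logpdf(O, hmm.means[s], hmm.covs[s])
                      for s in range(N)], axis=1)            # T x N emission log-probs
    log_A = np.log(hmm.A)
    delta = np.empty((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = np.log(hmm.pi) + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A               # scores[j, i] = delta_{t-1}(j) + log a_{ji}
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + log_b[t]
    # Backtrace the most likely state sequence Q*.
    q_star = np.empty(T, dtype=int)
    q_star[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        q_star[t] = back[t + 1, q_star[t + 1]]
    return q_star, delta
```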