


Inertial Hidden Markov Models: Modeling Change in Multivariate Time Series

George D. Montañez
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA, USA
[email protected]

Saeed Amizadeh
Yahoo Labs
Sunnyvale, CA, USA
[email protected]

Nikolay Laptev
Yahoo Labs
Sunnyvale, CA, USA
[email protected]

Abstract

Faced with the problem of characterizing systematic changes in multivariate time series in an unsupervised manner, we derive and test two methods of regularizing hidden Markov models for this task. Regularization on state transitions provides smooth transitioning among states, so that sequences are split into broad, contiguous segments. Our methods are compared with a recent hierarchical Dirichlet process hidden Markov model (HDP-HMM) and a baseline standard hidden Markov model, of which the former suffers from poor performance on moderate-dimensional data and sensitivity to parameter settings, while the latter suffers from rapid state transitioning, over-segmentation, and poor performance on a segmentation task involving human activity accelerometer data from the UCI Repository. The regularized methods developed here are able to perfectly characterize change of behavior in the human activity data for roughly half of the real-data test cases, with an accuracy of 94% and low variation of information. In contrast to the HDP-HMM, our methods provide simple, drop-in replacements for standard hidden Markov model update rules, allowing standard expectation maximization (EM) algorithms to be used for learning.

Introduction

"Some seek complex solutions to simple problems; it is better to find simple solutions to complex problems." - Soramichi Akiyama

Time series data arise in many areas of science and technology, describing the behavior of both natural and man-made systems. These behaviors are often complex and uncertain, which in turn requires sophisticated dynamics and stochastic models to capture them. Furthermore, these complex behaviors can change over time due to some external event and/or some internal systematic change of dynamics or distribution. For example, consider the case of monitoring one's physical activity via an array of accelerometer body sensors over time. A certain pattern emerges in the time series of the sensors' readings while the person is walking; however, this pattern quickly changes to a new one as she begins running. From the data analysis perspective, it is important to first detect these change points, as they are quite often indicative of an "interesting" event or an anomaly in the system. We are also interested in characterizing the new state of the system (e.g., running vs. walking), which reflects its mode of operation. Change point detection methods (Kawahara, Yairi, and Machida 2007; Xie, Huang, and Willett 2013; Liu et al. 2013; Ray and Tsay 2002) have been proposed to answer the first question, while hidden Markov models (HMMs) can answer both.

One crucial observation in many real-world systems, natural and man-made, is that behavior changes are typically infrequent; that is, the system takes some (unknown) time before it changes its behavior to a new modus operandi. For instance, in our earlier example, it is unlikely for a person to rapidly fluctuate between walking and running, making the durations of different activities relatively long and highly variable. We refer to this as the inertial property, alluding to the physical property of matter that ensures it will continue along a fixed course unless acted upon by an external force. Unfortunately, classical HMMs are not equipped with sufficient mechanisms to capture this property and often produce a high rate of state transitioning, and consequently false positives when detecting change points.

Few solutions exist in the literature to address this problem. In the context of Markov models, Fox et al. (Fox et al. 2011; Willsky et al. 2009) have recently proposed the sticky hierarchical Dirichlet process hidden Markov model (HDP-HMM), which uses a Bayesian non-parametric approach with appropriate priors to promote self-transitioning (or stickiness) in HMMs. Despite its elegant theoretical foundation, the sticky HDP-HMM is not a practical solution in many real-world situations. In particular, its performance tends to degrade as the dimensionality of the problem increases beyond ten dimensions. Moreover, because it is learned by iterative Gibbs sampling, the sticky HDP-HMM can become computationally prohibitive. In practice, the most significant drawback of the sticky HDP-HMM originates from its non-parametric Bayesian nature: due to the existence of many hyperparameters, the search space for initial tuning is exponentially large and significantly affects the learning quality for a given task.

In this paper, we propose a regularization-based framework for HMMs, called inertial hidden Markov models (inertial HMMs), which are biased towards the inertial state-transition property. Like the sticky HDP-HMM, our framework rests on theoretically sound foundations, yet it is much simpler and more intuitive. In particular, our framework has only two initial parameters, for which we have developed intuitive initialization techniques that significantly reduce the effort needed for parameter tuning. Furthermore, as we show later, our proposed methods in practice boil down to modified update rules for standard HMMs. This allows one to easily upgrade existing HMM libraries to take advantage of our methods, while preserving the computational efficiency of the standard HMM approach. Through rigorous experiments on both synthetic and moderate-dimensional real datasets, we show that inertial HMMs are not only much faster than the sticky HDP-HMM, but also produce significantly better detection, suggesting inertial HMMs as a more practical choice compared to the current state-of-the-art.

Problem Statement

Let $X = \{x_1, \ldots, x_T\}$ denote a d-dimensional multivariate time series, where $x_t \in \mathbb{R}^d$. Given such a time series, we seek to segment X along the time axis, where each segment corresponds to a subsequence $X_{i \ldots i+m} = \{x_i, \ldots, x_{i+m}\}$ and maps to a predictive (latent) state z, represented as a one-of-K vector with $|z| = K$ and $\sum_{i=1}^{K} z_{t,i} = 1$. For simplicity of notation, let $z_t = k$ denote $z_{t,k} = 1$, and let $Z = \{z_1, \ldots, z_T\}$ denote the sequence of latent states. Then for all $x_t$ mapping to state k, we require that

$$\begin{aligned}
\Pr(x_{t+1} \mid X_{1 \ldots t}, z_t = k) &= \Pr(x_{t+1} \mid z_t = k) \\
&= \Pr(x_{t'+1} \mid z_{t'} = k) \\
&= \Pr(x_{t'+1} \mid X_{1 \ldots t'}, z_{t'} = k).
\end{aligned}$$

Thus, the conditional distribution over futures at time t, conditioned on being in state k, is equal to the distribution over futures at time t', conditioned on being in the same state. In other words, we assume conditional independence given the state, and stationarity of the generative process.

We impose two additional criteria on our models. First, we seek models with a small number of latent states, $K \ll T$, and second, we desire state transition sequences with the inertial property, as defined previously, where transitions between states do not occur too rapidly.

The above desiderata must be externally imposed on our model, since simply maximizing the likelihood of the data will result in $K = T$ (i.e., each sample corresponds to a unique state/distribution), and in general we may have rapid transitions among states. For the first desideratum, we choose the number of states in advance, as is typically done for hidden Markov models (Rabiner 1989). For the second, we directly alter the probabilistic form of our model to include a parameterized regularization that reduces the likelihood of transitioning between different latent states.

Inertial Hidden Markov Models

Hidden Markov models (HMMs) are a long-studied class of probabilistic models well-suited for sequential data (Rabiner 1989). As a starting point for developing our inertial HMMs, we begin with a standard K-state HMM with Gaussian emission densities. HMMs trained by expectation maximization (locally) maximize the likelihood of the data, but typically do not guarantee slow, inertial transitioning among states. The number of states must be specified in advance, but no other parameters need to be given, as the remaining parameters are all estimated directly from the data.

To accommodate the inertial transition requirement, we derive two different methods for enforcing state persistence in HMMs. Both methods alter the probabilistic form of the complete-data joint likelihood, which results in altered transition matrix update equations. The resulting update equations share a related mathematical structure and, as shown in the Experiments section, have similar performance in practice. We next describe both methods and outline their derivations.

Maximum A Posteriori (MAP) Regularized HMM

Following (Gauvain and Lee 1994), we alter the standard HMM to include a Dirichlet prior on the transition probability matrix, such that out-of-state transitions are penalized by some regularization factor. A Dirichlet prior on the jth row of the transition matrix A has the form

$$p(A_j; \eta) \propto \prod_{k=1}^{K} A_{jk}^{\eta_{jk} - 1}$$

where the $\eta_{jk}$ are free parameters and $A_{jk}$ is the transition probability from state j to state k. The posterior joint density over X and Z becomes

$$P(X, Z; \theta, \eta) \propto \left[ \prod_{j=1}^{K} \prod_{k=1}^{K} A_{jk}^{\eta_{jk} - 1} \right] P(X, Z \mid A; \theta)$$

and the log-likelihood is

$$\ell(X, Z; \theta, \eta) \propto \sum_{j=1}^{K} \sum_{k=1}^{K} (\eta_{jk} - 1) \log A_{jk} + \log P(z_1; \theta) + \sum_{t=1}^{T} \log P(x_t \mid z_t; \theta) + \sum_{t=2}^{T} \log P(z_t \mid z_{t-1}; \theta).$$

MAP estimation is then used in the M-step of the expectation maximization (EM) algorithm to update the transition probability matrix. Maximizing, with appropriate Lagrange multiplier constraints, we obtain the update equation for the transition matrix,

$$A_{jk} = \frac{(\eta_{jk} - 1) + \sum_{t=2}^{T} \xi(z_{(t-1)j}, z_{tk})}{\sum_{i=1}^{K} (\eta_{ji} - 1) + \sum_{i=1}^{K} \sum_{t=2}^{T} \xi(z_{(t-1)j}, z_{ti})} \qquad (1)$$

where $\xi(z_{(t-1)j}, z_{tk}) = \mathbb{E}[z_{(t-1)j} z_{tk}]$.

Given our prior, we can control the probability of self-transitions among states, but this method requires choosing a set of $K^2$ parameters for the Dirichlet prior. However, since we are solely concerned with increasing the probability of self-transitions, we can reduce these parameters to a single parameter λ governing the amplification of self-transitions. We therefore define $\eta_{jk} = 1$ when $j \neq k$ and $\eta_{kk} = \lambda \geq 1$, and the transition update equation becomes

$$A_{jk} = \frac{(\lambda - 1)\,\mathbb{1}(j = k) + \sum_{t=2}^{T} \xi(z_{(t-1)j}, z_{tk})}{(\lambda - 1) + \sum_{i=1}^{K} \sum_{t=2}^{T} \xi(z_{(t-1)j}, z_{ti})} \qquad (2)$$

where $\mathbb{1}(\cdot)$ denotes the indicator function.
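To make the drop-in nature of this change concrete, the following minimal sketch implements the M-step of Equation (2), assuming the expected transition counts ξ have already been accumulated by the standard forward-backward pass; the function and array names are ours.

```python
import numpy as np

def map_transition_update(xi_sum, lam):
    """Transition matrix M-step under the sticky Dirichlet prior (Eq. (2)).

    xi_sum : (K, K) array, xi_sum[j, k] = sum_{t=2}^T E[z_{t-1,j} z_{t,k}],
             accumulated by the forward-backward pass of a standard HMM.
    lam    : self-transition prior strength, lambda >= 1 (lam = 1 recovers
             the unregularized maximum likelihood update).
    """
    K = xi_sum.shape[0]
    numer = xi_sum + (lam - 1.0) * np.eye(K)                 # extra mass on the diagonal only
    denom = xi_sum.sum(axis=1, keepdims=True) + (lam - 1.0)  # one extra (lambda - 1) per row
    return numer / denom
```

Only the diagonal of the numerator and a single constant in each row's normalizer differ from the standard update, which is why existing HMM code needs only this one function swapped.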


Inertial Regularization via Pseudo-observations

Alternatively, we can alter the HMM likelihood function to include a latent binary random variable, V, indicating that a self-transition was chosen at random from among all transitions, according to some distribution. That is, we view the transitions as being partitioned into two sets, self-transitions and non-self-transitions, and we draw a member of the self-transition set according to a Bernoulli distribution governed by parameter p. Given a latent state sequence Z, with transitions chosen according to transition matrix A, we define p as a function of both Z and A. We would like p to have two properties: 1) it should increase with increasing $\sum_k A_{kk}$ (the total probability of self-transitions), and 2) it should increase as the number of self-transitions in Z increases. This will allow us to encourage self-transitions as a simple consequence of maximizing the likelihood of our observations.

We begin with a version of p based on a penalization constant $0 < \varepsilon < 1$ that scales appropriately with the number of self-transitions. If we raise ε to a large positive power, the resulting p will decrease. Thus, we define p as ε raised to the number of non-self-transitions, M, in the state transition sequence, so that the probability of selecting a self-transition increases as M decreases. Using the fact that $M = (T - 1) - \sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t-1)k} z_{tk}$, we obtain

$$\begin{aligned}
p = \varepsilon^{M} &= \varepsilon^{\sum_{t=2}^{T} 1 \,-\, \sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t-1)k} z_{tk}} \\
&= \varepsilon^{\sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t-1)k} \,-\, \sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t-1)k} z_{tk}} \\
&= \prod_{t=2}^{T} \prod_{k=1}^{K} \varepsilon^{z_{(t-1)k} - z_{(t-1)k} z_{tk}}. \qquad (3)
\end{aligned}$$

Since ε is arbitrary, we choose $\varepsilon = A_{kk}$ to allow p to scale appropriately with increasing probability of self-transition. We therefore arrive at

$$p = \prod_{t=2}^{T} \prod_{k=1}^{K} A_{kk}^{z_{(t-1)k} - z_{(t-1)k} z_{tk}}.$$

Thus, we define p as a computable function of Z and A. Defining p in this deterministic manner is equivalent to choosing the parameter value from a degenerate probability distribution that places a single point mass at the computed value, allowing us to easily obtain a posterior distribution on V. Furthermore, the function increases as the number of self-transitions increases, since $A_{kk} \leq 1$ for all k, and p will generally increase as $\sum_k A_{kk}$ increases. Thus, we obtain a parameter $p \in (0, 1]$ that satisfies all our desiderata. With p in hand, we say that V is drawn according to the Bernoulli distribution Bern(p), and we observe V = 1 (i.e., a member of the self-transition set was chosen).

To gain greater control over the strength of regularization, let λ be a positive integer and $\mathbf{V}$ be a λ-length sequence of pseudo-observations, drawn i.i.d. according to Bern(p). Since $P(V = 1 \mid Z; A) = p$ for each pseudo-observation, we have

$$P(\mathbf{V} = \mathbf{1} \mid Z; A) = p^{\lambda} = \left[ \prod_{t=2}^{T} \prod_{k=1}^{K} A_{kk}^{z_{(t-1)k} - z_{(t-1)k} z_{tk}} \right]^{\lambda}$$

where $\mathbf{1}$ denotes the all-ones sequence of length λ.

Noting that $\mathbf{V}$ is conditionally independent of X given the latent state sequence Z, we maximize (with respect to $A_{jk}$) the expected (with respect to Z) joint log-density over X, $\mathbf{V}$, and Z, parameterized by $\theta = \{\pi, A, \phi\}$: the start-state probabilities, state transition matrix, and emission parameters, respectively. Using appropriate Lagrange multipliers, we obtain the regularized maximum likelihood estimate for $A_{jk}$:

$$A_{jk} = \frac{B_{j,k,T} + \mathbb{1}(j = k)\, C_{j,k,T}}{\sum_{i=1}^{K} B_{j,i,T} + C_{j,j,T}} \qquad (4)$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, $\gamma(z_{tk}) = \mathbb{E}[z_{tk}]$, and

$$B_{j,k,T} = \sum_{t=2}^{T} \xi(z_{(t-1)j}, z_{tk}), \qquad C_{j,k,T} = \lambda \left[ \sum_{t=2}^{T} \left[ \gamma(z_{(t-1)k}) - \xi(z_{(t-1)j}, z_{tk}) \right] \right]. \qquad (5)$$

The forward-backward algorithm can then be used for efficient computation of the γ and ξ values, as in standard HMMs (Bishop 2007).

Ignoring normalization, we see that

$$A_{jk} \propto \begin{cases} B_{j,k,T} + C_{j,j,T} & \text{if } j = k \\ B_{j,k,T} & \text{otherwise.} \end{cases}$$

Examining the $C_{j,j,T}$ term (i.e., Equation (5)), we see that λ is a multiplier of additional mass contributions for self-transitions, where the contributions are the difference between $\gamma(z_{(t-1)j})$ and $\xi(z_{(t-1)j}, z_{tj})$. These two quantities represent, respectively, the expectation of being in state j at time t − 1 and the expectation of remaining there in the next time step. The larger λ, or the larger the difference between arriving at a state and remaining there, the greater the additional mass given to self-transition.
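The corresponding M-step can be sketched in the same drop-in fashion as before, again assuming the expectations γ and ξ are already available from the forward-backward pass; the helper name and argument layout are ours.

```python
import numpy as np

def pseudo_obs_transition_update(xi_sum, gamma_prev_sum, lam):
    """Transition matrix M-step for the pseudo-observation model (Eq. (4)).

    xi_sum         : (K, K) array, B[j, k] = sum_{t=2}^T xi(z_{t-1,j}, z_{t,k})
    gamma_prev_sum : (K,) array, sum_{t=2}^T gamma(z_{t-1,k})
    lam            : regularization strength lambda
    """
    # Diagonal of C from Eq. (5): lambda * sum_t [gamma(z_{t-1,j}) - xi(z_{t-1,j}, z_{t,j})]
    C_diag = lam * (gamma_prev_sum - np.diag(xi_sum))
    numer = xi_sum + np.diag(C_diag)      # extra mass added only where j = k
    denom = xi_sum.sum(axis=1) + C_diag   # row sums of B plus C_{j,j}
    return numer / denom[:, np.newaxis]
```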

Parameter Modifications

Scale-Free Regularization. In Equation (2), the strength of the regularization diminishes with growing T, so that asymptotically the regularized and unregularized estimates become equivalent. While this is desirable in many contexts, maintaining a consistent strength of inertial regularization becomes important for time series of increasing length, as is the case with online learning methods. Figure 1 shows a regularized segmentation of human accelerometer data (discussed later in the Experiments section), where the regularization is strong enough to provide a good segmentation. If we then increase the number of data points in each section by a factor of ten while keeping the same regularization parameter setting, the regularization is no longer strong enough, as shown in Figure 2. Thus, the λ parameter is sensitive to the size of the time series.

We desire models where the regularization strength is scale-free, having roughly the same strength regardless of how the time series grows.

Page 4: Inertial Hidden Markov Models: Modeling Change in


Figure 1: Human activities accelerometer data, short sequence. Vertical partitions correspond to changes of state.

Figure 2: The long-sequence human activities accelerometer data, using the regularization parameter from the short sequence.

To achieve this, we define the λ parameter to scale with the number of transitions, namely $\lambda = (T - 1)\zeta$, and our scale-free update equation becomes

$$A_{jk} = \frac{((T - 1)\zeta - 1)\,\mathbb{1}(j = k) + \sum_{t=2}^{T} \xi(z_{(t-1)j}, z_{tk})}{((T - 1)\zeta - 1) + \sum_{i=1}^{K} \sum_{t=2}^{T} \xi(z_{(t-1)j}, z_{ti})}. \qquad (6)$$

This preserves the effect of the regularization as T increases, and ζ becomes our new regularization parameter, controlling the strength of the regularization. For consistency, we also re-parameterize Equation (5) using $\lambda = (T - 1)\zeta$.
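In code, the reparameterization is a one-line change in front of the earlier update sketch; the variable names, and the reuse of map_transition_update from the MAP sketch above, are ours.

```python
# Scale-free regularization: the effective lambda grows linearly with the
# number of transitions, so the prior keeps a fixed relative strength no
# matter how long the series becomes (Equation (6)).
T = X.shape[0]                          # number of time points in the series
lam = (T - 1) * zeta                    # zeta: scale-free strength parameter
A = map_transition_update(xi_sum, lam)  # same M-step as before
```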

Towards Parameter-Free Regularization. Although our methods require specifying the strength of regularization in advance, one can avoid this requirement by making an assumption about the distribution of segment lengths. Namely, if we assume that most segment lengths are of roughly the same order of magnitude, we can automatically tune the regularization parameter, as follows.

We first define a range of possible regularization parameter values (such as λ ∈ [0, 75]) and search this interval for a value that gives sufficient regularization. Sufficient regularization is defined with respect to the Gini ratio (Gini 1936; Wikipedia 2014), a measure of statistical dispersion often used to quantify income inequality. For a collection of observed segment lengths $L = \{l_1, \ldots, l_m\}$, given in ascending order, the Gini ratio is estimated by

$$G(L) = 1 - \frac{2}{m - 1} \left( m - \frac{\sum_{i=1}^{m} i\, l_i}{\sum_{i=1}^{m} l_i} \right).$$

We assume that the true segmentation has a Gini ratio of less than one-half, which corresponds to having more equality among segment lengths than not. One can then perform a binary search on the interval to find the smallest ζ for which the Gini ratio of the resulting segmentation falls below one-half. This increases the time complexity by a factor of $O(\log_2(R/\epsilon))$, where R is the range of the parameter space and ε is the search tolerance.
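A sketch of this tuning loop follows, where gini_ratio implements the estimator above and fit_segments is a hypothetical callback that fits the inertial HMM at a given strength and returns the resulting segment lengths; the names and default range are ours.

```python
import numpy as np

def gini_ratio(lengths):
    """Estimated Gini ratio of a collection of segment lengths."""
    l = np.sort(np.asarray(lengths, dtype=float))  # ascending order
    m = len(l)
    i = np.arange(1, m + 1)
    return 1.0 - (2.0 / (m - 1)) * (m - (i * l).sum() / l.sum())

def auto_tune(fit_segments, lo=0.0, hi=75.0, tol=0.5):
    """Binary-search the smallest regularization strength that yields a
    segmentation whose Gini ratio falls below one-half."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gini_ratio(fit_segments(mid)) < 0.5:
            hi = mid    # sufficiently regularized; try a smaller strength
        else:
            lo = mid    # segment lengths still too unequal; strengthen
    return hi
```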

Experiments

We perform two segmentation tasks on synthetic and real multivariate time series data, using our scale-free and parameter-free regularized inertial HMMs. For comparison, we present the results of applying a standard K-state hidden Markov model as well as the sticky HDP-HMM of (Fox et al. 2011). All tasks were performed in an unsupervised manner, with state labels used only for evaluation.

Datasets

The first (synthetic) multivariate dataset was generated using a two-state HMM with 3D Gaussian emissions, with transition matrix

$$A = \begin{pmatrix} 0.9995 & 0.0005 \\ 0.0005 & 0.9995 \end{pmatrix},$$

equal start probabilities, and emission parameters $\mu_1 = (-1, -1, -1)^\top$, $\mu_2 = (1, 1, 1)^\top$, $\Sigma_1 = \Sigma_2 = \mathrm{diag}(3)$. Using this model, we generated one hundred time series consisting of ten thousand time points each. Figure 3 shows an example time series from this synthetic dataset.
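For concreteness, a minimal sketch of this generator using the stated parameters (we read diag(3) as 3·I₃; the function name and seed handling are ours):

```python
import numpy as np

def sample_synthetic(T=10_000, seed=0):
    """Sample one series from the two-state HMM with 3D Gaussian emissions."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.9995, 0.0005],
                  [0.0005, 0.9995]])               # strong self-transitions
    mu = np.array([[-1.0, -1.0, -1.0],
                   [ 1.0,  1.0,  1.0]])
    cov = 3.0 * np.eye(3)                          # Sigma_1 = Sigma_2 = diag(3)
    z = np.empty(T, dtype=int)
    z[0] = rng.integers(2)                         # equal start probabilities
    for t in range(1, T):
        z[t] = rng.choice(2, p=A[z[t - 1]])
    # Shared-covariance emissions: zero-mean noise shifted by the state mean
    X = rng.multivariate_normal(np.zeros(3), cov, size=T) + mu[z]
    return X, z
```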

The second dataset was generated from real-world forty-five-dimensional human accelerometer data, recorded for users performing five different activities, namely playing basketball, rowing, jumping, ascending stairs, and walking in a parking lot (Altun, Barshan, and Tuncel 2010). The data were recorded from a single subject using five Xsens MTx™ units attached to the torso, arms, and legs. Each unit had nine sensors, recording accelerometer (X, Y, Z), gyroscope (X, Y, Z), and magnetometer (X, Y, Z) data, for a total of forty-five signals at each time point.

We generated one hundred multivariate time series from the underlying dataset, with varying activities (latent states) and varying numbers of segments. To generate these sets, we first chose the number of segments uniformly between two and twenty. Then, for each segment, we chose an activity uniformly at random from among the five possible and selected a uniformly random segment length proportion. The corresponding number of time points was extracted from the activity, rescaled to zero mean and unit variance, and appended to the output sequence. The final output sequence was truncated to ten thousand time points, or discarded if it contained fewer than ten thousand points or fewer than two distinct activities. Additionally, prospective time series were rejected if they caused numerical instability issues for the algorithms tested. The process was repeated to generate one hundred such multivariate time series of ten thousand time ticks each, with varying numbers of segments, activities, and segment lengths. An example data sequence is shown in Figure 4, and the distribution of the time series according to number of activities and segments is shown in Figure 5.

Experimental Methodology

We compared the performance of four methods on the two datasets described in the previous section: a standard K-state hidden Markov model, the sticky HDP-HMM, and both inertial HMM variants.



Figure 3: Synthetic data example, generated from a two-state HMM with 3D Gaussian emissions and strong self-transitions.


Figure 4: Human activities accelerometer data. Three states, 45 dimensions.


Figure 5: Distribution of Accelerometer Time Series Data.

The task was treated as a multi-class classification problem, measuring the minimum zero-one loss under all possible permutations of output labels, to accommodate permuted mappings of the true labels. We measured the normalized variation of information (Meila 2003) between the predicted and true state sequences, an information metric capturing the distance between two partitionings of a sequence. We also considered the ratio of the predicted number of segments to the true number of segments (SNR), which gives a sense of whether a method over- or under-segments data, and the absolute segment number ratio, defined as $\mathrm{ASNR} = \max(S_t, S_p) / \min(S_t, S_p)$, where $S_t$ is the true number of segments in the sequence and $S_p$ is the predicted number, which quantifies how far a segmentation method diverges from the ground truth in terms of the relative factor of segments. Lastly, we tracked the difference between the predicted and true numbers of segments (SND), and how many segmentations were done perfectly, giving the correct states at all correct positions.
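As an illustration, the permutation-matched accuracy can be computed with a Hungarian assignment over the confusion matrix rather than brute-force enumeration of permutations; a sketch (the helper name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def permuted_accuracy(true_z, pred_z, K):
    """Accuracy under the best one-to-one relabeling of predicted states,
    i.e. one minus the minimum zero-one loss over label permutations."""
    confusion = np.zeros((K, K), dtype=int)
    for t, p in zip(true_z, pred_z):
        confusion[t, p] += 1                        # (true, predicted) counts
    rows, cols = linear_sum_assignment(-confusion)  # maximize matched mass
    return confusion[rows, cols].sum() / len(true_z)
```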

Parameter selection for the inertial HMM methods was done using the automated procedure described in the Parameter Modifications section. For faster evaluation, we ran the automated parameter selection process on ten randomly drawn examples, averaged the final ζ parameter values, and used the fixed value for all trials. The final ζ parameters are shown in Tables 1 and 2.

To evaluate the sticky HDP-HMM, we used the publicly available HDP-HMM toolbox for MATLAB, with default settings for the priors (Fox and Sudderth 2009). The Gaussian emission model with normal inverse Wishart (NIW) prior was used, and the truncation level L for each example was set to the true number of states, for fair comparison with the HMM methods developed here, which are also given the true number of states. The "stickiness" parameter κ was chosen in a data-driven manner by testing the values κ = 0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 250, 500, 750, and 1000 for best performance over ten randomly selected examples each. The mean performance of the 500th Gibbs sample over ten trials was then taken for each parameter setting, and the best κ was chosen empirically. For the synthetic dataset, a final value of κ = 10 was chosen by this method. For the real human accelerometer data, a value of κ = 100 provided the best accuracy and relatively strong variation of information performance. These values were used for evaluation on each entire dataset, respectively.

To evaluate the HDP-HMM, we performed five trials on each example in the test dataset, measuring the performance of the 1000th Gibbs sample for each trial. The mean performance was then computed over the trials, and the average over all one hundred test examples was recorded.

Synthetic Data Results

As seen in Table 1, the MAP-regularized HMM had the strongest performance, with top scores on all metrics. The inertial pseudo-observation HMM also performed strongly, with extremely high accuracy and low variation of information.



Figure 6: Example segmentation of human activities accelerometer data using the inertial (MAP) HMM.

Table 1: Results from quantitative evaluation on 3D synthetic data. Statistical significance is computed with respect to the MAP results.

Method               Acc.   SNR      ASNR     SND      VOI    Per.
HDP-HMM (κ = 10)     0.85*  0.59*    3.50*    2.79*    0.56*  0/100
Standard HMM         0.87*  172.20*  172.20*  765.91*  0.62*  0/100
MAP HMM (ζ = 2.3)    0.99   0.96     1.13     0.51     0.07   2/100
PsO HMM (ζ = 8.2)    0.99   0.87‡    1.43‡    1.15*    0.14†  1/100

Acc. = Average Accuracy (value of 1.0 is best)
SNR = Average Segment Number Ratio (value of 1.0 is best)
ASNR = Average Absolute Segment Number Ratio (value of 1.0 is best)
SND = Average Segment Number Difference (value of 0.0 is best)
VOI = Average Normalized Variation of Information (value of 0.0 is best)
Per. = Total number of perfect/correct segmentations

Paired t-test: † significant at α = .05, ‡ at α = .01, * at α = .001

The standard HMM suffered from over-segmentation of the data (as reflected in the high SNR, ASNR, and SND scores), while the sticky HDP-HMM tended to under-segment the data. All methods were able to achieve fairly high accuracy.

Human Activities Accelerometer Data Results

Table 2: Results from real 45D human accelerometer data.

Method                Acc.   SNR      ASNR     SND      VOI    Per.
HDP-HMM (κ = 100)     0.60*  0.75‡    4.68*    5.03‡    0.95*  0/100*
Standard HMM          0.79*  134.59*  134.59*  584.16*  0.38*  9/100*
MAP HMM (ζ = 33.5)    0.94   1.28     1.43     2.62     0.14   48/100
PsO HMM (ζ = 49.0)    0.94   1.03†    1.29     1.29     0.15   48/100

Paired t-test: † significant at α = .05, ‡ at α = .01, * at α = .001

Results for the human accelerometer dataset are shown in Table 2. Both the MAP HMM and the inertial pseudo-observation HMM achieved large gains in performance over the standard HMM, with an average accuracy of 94%. Furthermore, the number of segments was close to correct on average, with values near one for both the simple and absolute ratios. The average normalized variation of information was low for both the MAP and pseudo-observation methods. Figure 6 shows an example segmentation for the MAP HMM, displaying a single dimension of the multivariate time series for clarity.

In comparison, the standard hidden Markov model performed poorly, strongly over-segmenting the sequences in many cases. Even more striking was the improvement over the sticky HDP-HMM, which had an average normalized variation of information near 1 (i.e., no correlation between the predicted and true segment labels). The method tended to under-segment the data, often collapsing to a single uniform output state, as reflected in the SNR having a value below one, and may struggle with moderate-dimensional data, as related by Fox and Sudderth through private correspondence. The sticky HDP-HMM suffers from slow mixing rates as the dimensionality increases, and computation time explodes, being roughly cubic in the dimension. As a result, the one hundred test examples took several days of computation time to complete, whereas the inertial HMM methods took a few hours.

Discussion

Our results demonstrate the effectiveness of inertial regularization of HMMs for modeling behavior change in multivariate time series. Although derived in two independent ways, the MAP-regularized and pseudo-observation inertial regularized HMMs converge on similar maximum likelihood update equations, and thus had similar performance.

The human activity task highlighted an issue with using standard HMMs for segmentation of time series with infrequent state changes, namely over-segmentation. Incorporating regularization on state transitions provides a simple solution to this problem. Since our methods rely on changing a single update equation of a standard HMM learning method, they can be incorporated into HMM learning libraries with minimal effort. This ease of implementation gives a strong advantage over existing persistent-state HMM methods, such as the sticky HDP-HMM framework.

While the sticky HDP-HMM performed moderately well on the low-dimensional synthetic dataset, the default parameters produced poor performance on the real-world accelerometer data. It remains possible that different hyperparameter settings would improve performance, but the cost of a combinatorial search through hyperparameter space, combined with lengthy computation time, prohibits an exhaustive exploration. At minimum, the results show a strong dependence on hyperparameter settings for acceptable performance. In contrast, the inertial HMM methods make use of a simple heuristic for automatically selecting the strength parameter ζ, which resulted in excellent performance on both datasets without hand-tuning several hyperparameters. Although the sticky HDP-HMM performed poorly on these two segmentation tasks, there exist tasks for which it may be a better choice (e.g., when the correct number of states is unknown).

Related Work

Hidden Markov models for sequential data have enjoyed a long history, gaining popularity as a result of the widely influential tutorial by Rabiner (Rabiner 1989). Specific to the work presented here, regularization of HMM parameters received a general treatment in (Gauvain and Lee 1994), for both transition and emission parameters. Our work details a more specific version of this regularization, useful for state persistence. Neukirchen and Rigoll (Neukirchen and Rigoll 1999) studied the use of regularization in HMMs for reducing parameter overfitting of emission distributions due to insufficient training data, but without an emphasis on inertial transitioning between states. Similarly, Johnson (Johnson 2007) proposed using Dirichlet priors on multinomial hidden Markov models as a means of enforcing sparse emission distributions.

Fox et al. (Fox et al. 2011) specifically developed a Bayesian sticky HMM to provide inertial state persistence. They presented a method capable of learning a hidden Markov model without specifying the number of states or the regularization strength beforehand, using a hierarchical Dirichlet process and truncated Gibbs sampling. As discussed, their method requires a more complex approach to learning the model and the specification of several hyperparameters for the Bayesian priors, along with a truncation limit. In contrast, our models only require the specification of two parameters, K and ζ, whereas the sticky HDP-HMM requires the analogous truncation level L and κ parameters to be chosen, in addition to the hyperparameters on the model priors.

Conclusions

For modeling changes in multivariate time series data, we derived two modified forms of hidden Markov model that effectively enforce state persistence. Although the derived methods are simple, they perform well and are computationally tractable. We have shown that inertial models are easily implemented, run efficiently, add almost no additional computational cost, and work well on data of moderate dimension. Their simplicity is thus a feature and not a bug.

Furthermore, a simple method was developed for automated selection of each regularization parameter. Our experiments on synthetic and real-world data show the effectiveness of inertial HMMs, giving large improvements in performance over standard HMMs and the sticky HDP-HMM.

The simplicity of our models paves the way for natural extensions, such as incremental parameter learning and changing the form of the class-conditional emission distributions to incorporate internal dynamics. Such extensions are the focus of future work.

References

Altun, K.; Barshan, B.; and Tuncel, O. 2010. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition 43(10):3605–3620.

Bishop, C. M. 2007. Pattern Recognition and Machine Learning. Springer. 616–625.

Fox, E. B., and Sudderth, E. B. 2009. HDP-HMM Toolbox. https://www.stat.washington.edu/~ebfox/software.html. [Online; accessed 20-July-2014].

Fox, E. B.; Sudderth, E. B.; Jordan, M. I.; and Willsky, A. S. 2011. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics 5(2A):1020–1056.

Gauvain, J.-L., and Lee, C.-H. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing 2:291–298.

Gini, C. 1936. On the measure of concentration with special reference to income and statistics. In Colorado College Publication, number 208 in General Series, 73–79.

Johnson, M. 2007. Why doesn't EM find good HMM POS-taggers? In EMNLP, 296–305.

Kawahara, Y.; Yairi, T.; and Machida, K. 2007. Change-point detection in time-series data based on subspace identification. In ICDM 2007, Seventh IEEE International Conference on Data Mining, 559–564. IEEE.

Liu, S.; Yamada, M.; Collier, N.; and Sugiyama, M. 2013. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks 43:72–83.

Meila, M. 2003. Comparing clusterings by the variation of information. In Scholkopf, B., and Warmuth, M., eds., Learning Theory and Kernel Machines, volume 2777 of Lecture Notes in Computer Science. Springer Berlin Heidelberg. 173–187.

Neukirchen, C., and Rigoll, G. 1999. Controlling the complexity of HMM systems by regularization. In Advances in Neural Information Processing Systems, 737–743.

Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286.

Ray, B. K., and Tsay, R. S. 2002. Bayesian methods for change-point detection in long-range dependent processes. Journal of Time Series Analysis 23(6):687–705.

Wikipedia. 2014. Gini coefficient — Wikipedia, the free encyclopedia. [Online; accessed 8-June-2014].

Willsky, A. S.; Sudderth, E. B.; Jordan, M. I.; and Fox, E. B. 2009. Nonparametric Bayesian learning of switching linear dynamical systems. In Advances in Neural Information Processing Systems, 457–464.

Xie, Y.; Huang, J.; and Willett, R. 2013. Change-point detection for high-dimensional time series with missing data. IEEE Journal of Selected Topics in Signal Processing 7(1):12–27.