

On particle-based online smoothing and parameter inference in

general hidden Markov models

Johan Westerborn


TRITA-MAT-A 2015:06

ISRN KTH/MAT/A-15/06-SE

ISBN 978-91-7595-554-4

Department of Mathematics, Royal Institute of Technology

SE-100 44, Stockholm

Sweden

Academic thesis which, with due permission of KTH Royal Institute of Technology, is presented for public review for the degree of Licentiate of Engineering on Friday 5 June 2015, at 14:00, in room 3721, Department of Mathematics, Lindstedtsvägen 25, KTH Royal Institute of Technology, Stockholm.

© Johan Westerborn, 2015

Printed in Stockholm by Universitetsservice US-AB


Abstract

This thesis consists of two papers studying online inference in general hidden Markov models using sequential Monte Carlo methods.

The first paper presents a novel algorithm, the particle-based, rapid incremental smoother (PaRIS), aimed at efficiently performing online approximation of smoothed expectations of additive state functionals in general hidden Markov models. The algorithm has, under weak assumptions, linear computational complexity and very limited memory requirements. It is also furnished with a number of convergence results, including a central limit theorem.

The second paper focuses on the problem of online estimation of parameters in a general hidden Markov model. The proposed algorithm is based on a forward implementation of the classical expectation-maximization algorithm and relies on the PaRIS algorithm for efficiency.


Sammanfattning

This thesis consists of two papers treating inference in hidden Markov models with general state space using sequential Monte Carlo methods.

The first paper presents a new algorithm, PaRIS, aimed at efficiently computing particle-based online estimates of smoothed expectations of additive state functionals. Under weak assumptions, the computational complexity of the algorithm grows only linearly with the number of particles, and its memory requirements are very limited. In addition, a number of convergence results, such as a central limit theorem, are derived for the algorithm.

The second paper focuses on online estimation of model parameters in general hidden Markov models. The presented algorithm can be viewed as a combination of PaRIS and a recently proposed online implementation of the classical EM algorithm.


List of papers

This thesis consists of two papers, referred to in the text by the capital letters A and B.

Paper A. Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm, Jimmy Olsson and Johan Westerborn

Paper B. An efficient particle-based online EM algorithm for general state-space models, Jimmy Olsson and Johan Westerborn


Acknowledgments

I would like to thank my supervisor, Associate Professor Jimmy Olsson, for all his constant support and encouragement.

I would also like to thank all of my colleagues at the Division of Mathematical Statistics, especially my fellow PhD students, for all their support.

Lastly I wish to thank Therese for her endless love and patience.


1. Overview

In this thesis we consider the problem of inference and parameter estimation in general hidden Markov models (HMMs), alternatively termed general state-space models (SSMs), using sequential Monte Carlo (SMC) methods, also known as particle filters.

An HMM consists of two processes: a hidden Markov chain, referred to as the state process, which is only partially observed through an observation process in such a way that, conditionally on the states, the observations are statistically independent, with the conditional distribution of each observation depending on the corresponding state only. The generality of HMMs allows these models to be applied in many different contexts and research fields, from economics [8] to speech recognition [29] and computational biology [27] (the over 300 references in [4] give an idea of the applicability of these models). When working with HMMs we are typically given a fixed record, or data stream, of observations from the observation process and are interested in calculating conditional distributions of one or several states given these observations. As we will see, the estimation of such conditional distributions is particularly critical for parameter calibration.

In an HMM, the state process may, e.g., be the location and velocity of a moving target and the observation process the bearing of this target with respect to an observation point. We are then interested in estimating the most likely position and velocity of the target given all the observations up to a given time (see [20]). In this case the distribution of interest is the filter distribution, i.e., the conditional law of the current state given the observations up to the current time point. However, if we wish to know the most likely trajectory of this target, focus is set on the smoothing distribution, i.e., the conditional joint law of all latent states up to the current time point given the corresponding observations.

There are only two classes of HMMs that allow for exact computation of the filter and smoothing distributions, namely when the state space is finite (in which case the solution is provided by the Baum-Welch algorithm [30]) and when the model is linear and Gaussian (using disturbance smoothing [3, 26], which is closely related to the Kalman filter [24]). For nonlinear models it is possible to obtain approximate solutions using the extended Kalman filter [1], where the model is linearized around an estimate of the mean and covariance using a first-order Taylor expansion. Nevertheless, if the model is highly nonlinear, the extended Kalman filter generally performs poorly. To some extent, this can be counteracted by the unscented Kalman filter [23], which recovers the filter mean and variance by propagating, using a deterministic sampling technique, a set of so-called "sigma points" through the nonlinear functions.

In the light of the previous discussion, we are, in the presence of nonlinear and/or non-Gaussian model components, generally referred to finding approximations of the filter and smoothing distributions. In this thesis we use SMC methods for this purpose. These methods, also known as particle methods, propagate, through a series of sampling and resampling steps, particles and weights such that the weighted empirical measure associated with the sample provides a discrete approximation of the distributions of interest. We refer to [16, 7, 17, 2, 6] for introductions to the topic. In a practical application, we require the algorithm under consideration to be numerically efficient in the sense that its computational time should grow only slowly with the number of particles; moreover, the algorithm should be numerically stable in the sense that the Monte Carlo variance should be controlled in the long term. This is feasible in the case of filtering (see [15, 12, 9] for theoretical results on the time-uniform convergence of particle filters), while estimation of the smoothing distribution flow generally poses more of a problem. Standard methods for estimation of smoothed expectations typically require the data to be processed in two passes, in the forward as well as the backward direction [19, 17, 18], resulting in batch algorithms that have to be re-executed whenever a new observation becomes available. Another problem with current algorithms (see, e.g., [11]) is that the computational complexity typically grows quadratically with the number of particles, resulting in slow algorithms.

In Paper A, a novel algorithm, the particle-based, rapid incremental smoother (PaRIS), for efficient online approximation of smoothed expectations of additive state functionals in general HMMs is presented. As we will see in the next section, the computation of smoothed expectations of additive form is crucial for parameter inference in HMMs. The algorithm, which processes the data in a single forward sweep, has a computational complexity that grows only linearly with the number of particles used. The PaRIS algorithm is furnished with several convergence results, including a Hoeffding-type inequality and a central limit theorem.

In Paper B, the ideas of the PaRIS algorithm are used for designing an efficient online expectation-maximization algorithm for parameter estimation in HMMs. Like PaRIS, this algorithm has a computational complexity that grows only linearly with the number of particles.


Figure 1. Dependence structure of a hidden Markov model: hidden states $X_{k-1}, X_k, X_{k+1}$ and observations $Y_{k-1}, Y_k, Y_{k+1}$.

2. Hidden Markov models

As mentioned previously, an HMM comprises two processes: a state process $\{X_t\}_{t \in \mathbb{N}}$ and an observation process $\{Y_t\}_{t \in \mathbb{N}}$. These two processes can be described by the following dynamical system:

$$X_{t+1} \mid X_t = x_t \sim q_\theta(x_t, \cdot), \qquad Y_t \mid X_t = x_t \sim g_\theta(x_t, \cdot),$$

where $q_\theta$ and $g_\theta$ are Markov transition densities, the latter referred to as the emission density, and $\theta$ is a vector containing the model parameters. The Markov chain $\{X_t\}_{t \in \mathbb{N}}$, which is unobserved, is initialized according to some initial distribution $\chi$. The dependence structure of an HMM is described by the graphical model in Figure 1. For a rigorous treatment of HMMs, we refer to [7, Section 2].

Example 1 (Stochastic volatility). As an example of an HMM, consider the following nonlinear stochastic volatility model [21]:

$$X_{t+1} = \phi X_t + \sigma \varepsilon_{t+1}, \qquad Y_t = \beta \exp(X_t/2)\, \zeta_t \qquad (t \in \mathbb{N}), \tag{2.1}$$

where $\{\varepsilon_t\}_{t \in \mathbb{N}}$ and $\{\zeta_t\}_{t \in \mathbb{N}}$ are sequences of mutually independent standard normally distributed random variables. The strength of this model is that it allows volatility clustering of the asset log-returns $Y_t$, and the parameter $\phi$ can be interpreted as the persistence of the volatility shocks. Figure 2 displays a trajectory generated through simulation under the parameter vector $\theta = (\phi, \beta^2, \sigma^2) = (0.9, 0.3^2, 0.9^2)$.

Figure 2. Simulated values of $Y_{0:250}$ from the stochastic volatility model using the parameters $\phi = 0.9$, $\beta^2 = 0.3^2$, and $\sigma^2 = 0.9^2$.
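To fix ideas, the following Python sketch simulates a trajectory from the model (2.1) under the parameters above. Since the initial distribution $\chi$ is not specified here, the stationary distribution of the AR(1) state process is assumed for illustration; the function name is ours.

```python
import numpy as np

def simulate_sv(T, phi=0.9, sigma=0.9, beta=0.3, seed=0):
    """Simulate the stochastic volatility model (2.1):
    X_{t+1} = phi * X_t + sigma * eps_{t+1},  Y_t = beta * exp(X_t / 2) * zeta_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T + 1)
    y = np.zeros(T + 1)
    # The initial distribution chi is not specified above; the stationary
    # N(0, sigma^2 / (1 - phi^2)) law is assumed here for illustration.
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))
    y[0] = beta * np.exp(x[0] / 2.0) * rng.normal()
    for t in range(T):
        x[t + 1] = phi * x[t] + sigma * rng.normal()
        y[t + 1] = beta * np.exp(x[t + 1] / 2.0) * rng.normal()
    return x, y

# A trajectory comparable to the one displayed in Figure 2.
x, y = simulate_sv(250)
```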

In the following we assume that we are given a fixed sequence $\{y_t\}_{t \in \mathbb{N}}$ of observations of $\{Y_t\}_{t \in \mathbb{N}}$. Any kind of statistical inference in HMMs typically involves the computation of conditional distributions of the state process, or parts of it, given the observations. We let $\phi_{s:s'|t;\theta} h := \mathbb{E}_\theta[h(X_{s:s'}) \mid Y_{0:t}]$ denote the conditional expectation of a function of the hidden states given the observations. Using Bayes's formula, this expectation can be written as

$$\phi_{s:s'|t;\theta} h = \frac{\int \cdots \int h(x_{s:s'})\, g_\theta(x_0, y_0)\, \chi(x_0) \prod_{\ell=0}^{t-1} g_\theta(x_{\ell+1}, y_{\ell+1})\, q_\theta(x_\ell, x_{\ell+1}) \, dx_{0:t}}{\int \cdots \int g_\theta(x_0, y_0)\, \chi(x_0) \prod_{\ell=0}^{t-1} g_\theta(x_{\ell+1}, y_{\ell+1})\, q_\theta(x_\ell, x_{\ell+1}) \, dx_{0:t}},$$

where we have assumed that the denominator is positive. Of particular interest are the distributions $\phi_{t;\theta} := \phi_{t:t|t;\theta}$ and $\phi_{0:t;\theta} := \phi_{0:t|t;\theta}$, i.e., the filter and smoothing distributions, respectively.

An interesting feature of an HMM is that, conditionally on the observations, the state process is still a Markov chain, in the forward as well as in the time-reversed direction. The conditional density of $X_t$ given $X_{t+1} = x_{t+1}$ and the data $y_{0:t}$, which we refer to as the backward transition density, is denoted by $\overleftarrow{q}_{\phi_{t;\theta}}(x_{t+1}, x_t)$. Using the backward density it is possible to rewrite the smoothing distribution as

$$\phi_{0:t;\theta} h = \int \cdots \int \left( \prod_{s=0}^{t-1} \overleftarrow{q}_{\phi_{s;\theta}}(x_{s+1}, x_s) \right) \phi_{t;\theta}(x_t)\, h(x_{0:t}) \, dx_{0:t}. \tag{2.2}$$

The previous expression is known as the backward decomposition of the smoothing distribution.


As will soon become apparent, one is most often interested in estimating $\phi_{0:t;\theta} h_t$ for $h_t$ of additive form

$$h_t(x_{0:t}) = \sum_{\ell=0}^{t-1} h_\ell(x_{\ell:\ell+1}). \tag{2.3}$$

Before using an HMM in practice, e.g., for prediction, the unknown model parameters $\theta$ have to be calibrated, and such parameter estimation generally involves the computation of smoothed expectations of objective functions of type (2.3). One way of calibrating the parameters is the maximum likelihood method, which returns the parameter estimate $\hat{\theta} = \operatorname{argmax}_\theta \ell(\theta; y_{0:t})$ maximizing the log-likelihood function $\ell(\theta; y_{0:t})$ (i.e., the logarithm of the joint density of the observations $y_{0:t}$ as a function of $\theta$). For an HMM, the log-likelihood is given by

$$\ell(\theta; y_{0:t}) := \log\left( \int \cdots \int g_\theta(x_0, y_0)\, \chi(x_0) \prod_{\ell=0}^{t-1} g_\theta(x_{\ell+1}, y_{\ell+1})\, q_\theta(x_\ell, x_{\ell+1}) \, dx_{0:t} \right),$$

which, again, is intractable for most models. As is generally the case for latent data models, it is easier to consider instead the complete data log-likelihood $\ell(\theta; y_{0:t}, x_{0:t})$, defined as the logarithm of the joint density of the observations as well as the hidden states, viewed as a function of the parameter, i.e.,

$$\ell(\theta; y_{0:t}, x_{0:t}) := \log\left( g_\theta(x_0, y_0)\, \chi(x_0) \prod_{\ell=0}^{t-1} g_\theta(x_{\ell+1}, y_{\ell+1})\, q_\theta(x_\ell, x_{\ell+1}) \right) = \log \chi(x_0) + \sum_{\ell=1}^{t} \log q_\theta(x_{\ell-1}, x_\ell) + \sum_{\ell=0}^{t} \log g_\theta(x_\ell, y_\ell).$$

The complete data log-likelihood also depends on the latent data $X_{0:t}$, which has to be estimated. The expectation-maximization (EM) algorithm (presented in [13]) consists, roughly, in iterating the two steps described below until the parameter estimates converge. More specifically, the EM algorithm introduces the intermediate quantity

$$Q(\theta, \theta_i) := \mathbb{E}_{\theta_i}\left[ \ell(\theta; Y_{0:t}, X_{0:t}) \mid Y_{0:t} \right],$$

where $\mathbb{E}_{\theta_i}$ denotes expectation under the model parametrized by some given parameter estimate $\theta_i$. The parameter estimate is then updated recursively as $\theta_{i+1} = \operatorname{argmax}_\theta Q(\theta, \theta_i)$. Calculating $Q(\theta, \theta_i)$ is referred to as the expectation step (or E-step) and finding the maximizer is referred to as the maximization step (or M-step). The EM algorithm is initialized by setting $\theta_0$ arbitrarily. The convergence of this algorithm was, under certain assumptions, established in [31].


The EM algorithm is particularly easy to implement if the distribution of the complete data $(X_{0:t}, Y_{0:t})$ belongs to an exponential family, i.e., the complete data log-likelihood can be written as

$$\ell(\theta; y_{0:t}, x_{0:t}) = \langle \phi(\theta), s_t(x_{0:t}) \rangle - c(\theta) + \log h(x_{0:t}),$$

where $\langle \cdot, \cdot \rangle$ denotes the scalar product, $\phi(\theta)$ and $c(\theta)$ are known functions, and $s_t(x_{0:t})$ are some (possibly vector-valued) sufficient statistics. In this case, the E-step boils down to computing $\phi_{0:t;\theta_i} s_t$, and maximizing $Q(\theta, \theta_i)$ reduces to maximizing $\langle \phi(\theta), \phi_{0:t;\theta_i} s_t \rangle - c(\theta)$ with respect to $\theta$. This maximization should admit a closed-form solution $\theta_{i+1} = \Lambda(\phi_{0:t;\theta_i} s_t)$, which is often the case. Observe that the smoothed expectation of the sufficient statistics should be computed under the model dynamics determined by the current parameter estimate $\theta_i$. As mentioned earlier, the smoothing distribution lacks a closed-form solution in the general case, and we are hence referred to finding approximations of the smoothed expectations under consideration. This standard version of EM, where the whole trajectory $y_{0:t}$ is processed from scratch before each parameter update, is referred to as the batch EM algorithm.
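To make the structure of the batch algorithm concrete, the following schematic Python sketch iterates the E- and M-steps. The routine smoothed_stats, which should return an estimate of the smoothed sufficient statistics under the current parameters (e.g., obtained with a particle smoother), and the map Lambda are hypothetical placeholders supplied by the user.

```python
import numpy as np

def batch_em(y, theta0, smoothed_stats, Lambda, n_iter=200):
    """Schematic batch EM loop: each iteration reprocesses the whole record
    y_{0:t} to compute the smoothed sufficient statistics under the current
    parameter estimate (E-step) and maps them to new parameters (M-step)."""
    theta = np.asarray(theta0, dtype=float)
    history = [theta.copy()]
    for _ in range(n_iter):
        S = smoothed_stats(y, theta)                # E-step: estimate of phi_{0:t;theta_i} s_t
        theta = np.asarray(Lambda(S), dtype=float)  # M-step: closed-form maximizer
        history.append(theta.copy())
    return theta, np.array(history)
```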

Example 2 (Stochastic volatility, cont.). The model in Example 1 comprises four sufficient statistics of additive form (2.3), with terms given by

$$s_t^1(x_{t:t+1}) = x_t^2, \qquad s_t^2(x_{t:t+1}) = x_t x_{t+1}, \qquad s_t^3(x_{t:t+1}) = x_{t+1}^2, \qquad s_t^4(x_{t:t+1}) = y_{t+1}^2 \exp(-x_{t+1}).$$

For this model, the M-step involves the update

$$\Lambda(z_1, z_2, z_3, z_4) = \left( \frac{z_2}{z_3},\; z_3 - \frac{z_2^2}{z_1},\; z_4 \right).$$

In Figure 3 the batch EM algorithm is used to estimate the parameters of the stochastic volatility model on the basis of the trajectory in Figure 2. Taking the mean of the last 10 parameter estimates yields $\theta^* = (0.917, 0.242^2, 0.904^2)$.

Figure 3. Parameter estimation in the stochastic volatility model using the batch EM algorithm, with the trajectory simulated in Figure 2 as input. Taking the mean over the last 10 iterates yields $\theta^* = (0.917, 0.242^2, 0.904^2)$.
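For Example 2, the additive sufficient-statistic increments and the maximization map $\Lambda$ can be written down directly. The sketch below is a transcription of the expressions displayed above (function names are ours), and it is meant only as an illustration.

```python
import numpy as np

def sv_stat_increment(x_t, x_tp1, y_tp1):
    """Increment s_t(x_t, x_{t+1}) of the four additive sufficient statistics
    of the stochastic volatility model (Example 2)."""
    return np.array([x_t**2,
                     x_t * x_tp1,
                     x_tp1**2,
                     y_tp1**2 * np.exp(-x_tp1)])

def sv_maximization(z):
    """M-step map Lambda, transcribed from the update displayed above:
    maps the (time-averaged) statistics z to (phi, sigma^2, beta^2)."""
    z1, z2, z3, z4 = z
    return np.array([z2 / z3, z3 - z2**2 / z1, z4])
```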

3. Particle methods for filtering and smoothing

SMC methods form a class of genetic-type algorithms that sequentially update a random sample of particles with associated importance weights. The weighted empirical measures formed by the particle samples at the different time steps serve as approximations of the filter distribution flow.


3.1. Particle filter. The bootstrap particle filter, first presented in [20], constructs a sequence of weighted particle samples targeting the filter distribution flow $\{\phi_{t;\theta}\}_{t \in \mathbb{N}}$ given the sequence $\{y_t\}_{t \in \mathbb{N}}$ of observations. In order to describe this most basic SMC algorithm, we proceed recursively and assume that we have at hand a weighted sample $\{(\omega_t^i, \xi_t^i)\}_{i=1}^N$ approximating the filter distribution $\phi_{t;\theta}$ in the sense that

$$\phi_{t;\theta}^N h = \sum_{i=1}^N \frac{\omega_t^i}{\Omega_t}\, h(\xi_t^i), \qquad \text{where } \Omega_t = \sum_{\ell=1}^N \omega_t^\ell,$$

converges to $\phi_{t;\theta} h$ as $N$ tends to infinity. To form a weighted particle sample $\{(\omega_{t+1}^i, \xi_{t+1}^i)\}_{i=1}^N$ targeting the next filter $\phi_{t+1;\theta}$, we begin by resampling the current particles, drawing a vector $\{I_{t+1}^i\}_{i=1}^N$ of indices from the multinomial distribution with probabilities proportional to the weights, i.e.,

$$I_{t+1}^i \sim \Pr\left(\{\omega_t^\ell\}_{\ell=1}^N\right),$$

where $\Pr(\{\omega_t^\ell\}_{\ell=1}^N)$ refers to the discrete probability distribution induced by the weights $\{\omega_t^\ell\}_{\ell=1}^N$. After this, the particles are propagated through the model dynamics by, first, drawing $\xi_{t+1}^i \sim q_\theta(\xi_t^{I_{t+1}^i}, \cdot)$ and, second, assigning each particle the weight $\omega_{t+1}^i = g_\theta(\xi_{t+1}^i, y_{t+1})$. The algorithm is initialized by drawing the initial particles $\xi_0^i$ from the initial distribution $\chi$ and letting the associated importance weights be $\omega_0^i = g_\theta(\xi_0^i, y_0)$.

Figure 4. The historical trajectories of the particles when running the standard particle filter. The grey circles represent the particles and the black line the particles active in the estimate at time 15.

As a by-product, the historical trajectories of the particle filter jointly provide an estimate of the joint smoothing distribution. These trajectories are constructed by linking each generation of particles with the corresponding ancestors. Since the ancestors are selected through resampling, this method suffers from a well-known degeneracy phenomenon, visible as a collapse of the trajectories. In Figure 4, displaying a simulation of this phenomenon, we see clearly that for most time steps only a single particle contributes to the resulting estimator, which, consequently, degenerates. See [28, 6] for further discussion.
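As a minimal sketch of the procedure just described, the following Python code performs one bootstrap-filter step for the stochastic volatility model of Example 1, where the emission density $g_\theta(x, y)$ is the $\mathcal{N}(0, \beta^2 e^{x})$ density implied by (2.1); the function names are ours.

```python
import numpy as np

def bootstrap_step(particles, weights, y_next, phi, sigma, beta, rng):
    """One step of the bootstrap particle filter: multinomial resampling,
    propagation through q_theta, and reweighting with g_theta."""
    N = particles.size
    # Resample: I^i ~ Pr({omega_t^l}), probabilities proportional to the weights.
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    # Propagate through the state dynamics q_theta(x, .).
    new_particles = phi * particles[idx] + sigma * rng.normal(size=N)
    # Reweight with the emission density g_theta(x, y_next) = N(0, beta^2 * exp(x)).
    var = beta**2 * np.exp(new_particles)
    new_weights = np.exp(-0.5 * y_next**2 / var) / np.sqrt(2.0 * np.pi * var)
    return new_particles, new_weights

def filter_expectation(particles, weights, h=lambda x: x):
    """Self-normalized particle estimate of the filter expectation phi_t h."""
    return np.sum(weights * h(particles)) / weights.sum()
```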

3.2. Forward-filtering backward-smoothing. To combat the particle path degeneracy, the backward decomposition (2.2) can be used, granted that we can approximate the backward transition densities $\overleftarrow{q}_{\phi_{s;\theta}}$. As approximations of these transition densities (on the supporting points formed by the particles) we consider

$$\overleftarrow{q}_{\phi_{s;\theta}^N}(\xi_{s+1}^i, \xi_s^j) = \frac{\omega_s^j\, q_\theta(\xi_s^j, \xi_{s+1}^i)}{\sum_{\ell=1}^N \omega_s^\ell\, q_\theta(\xi_s^\ell, \xi_{s+1}^i)}. \tag{3.1}$$


The forward-filtering backward-smoothing (FFBSm) algorithm [17, 22, 25] consists in simply inserting this approximation into (2.2), yielding the approximation

$$\phi_{0:t;\theta}^N h := \sum_{i_0=1}^N \cdots \sum_{i_t=1}^N \left( \prod_{s=0}^{t-1} \frac{\omega_s^{i_s}\, q_\theta(\xi_s^{i_s}, \xi_{s+1}^{i_{s+1}})}{\sum_{\ell=1}^N \omega_s^\ell\, q_\theta(\xi_s^\ell, \xi_{s+1}^{i_{s+1}})} \right) \frac{\omega_t^{i_t}}{\Omega_t}\, h(\xi_0^{i_0}, \ldots, \xi_t^{i_t})$$

of $\phi_{0:t;\theta} h$. For a general objective function $h$, this occupation measure is impractical, as the cardinality of its support grows geometrically fast with time. In the case of an objective function of additive form, the computational complexity is still quadratic in the number of particles, since the normalizing constant $\sum_{\ell=1}^N \omega_s^\ell\, q_\theta(\xi_s^\ell, \xi_{s+1}^{i_{s+1}})$ needs to be computed for each particle and time point.

3.3. Forward-only implementation of FFBSm. As noted in [10], estimates of the sequence $\{\phi_{0:t;\theta} h_t\}_{t \in \mathbb{N}}$ may, in the case of objective functions of additive form (2.3), be computed recursively. This is done by noting that the sequence of functions $T_t h_t(x_t) := \mathbb{E}[h_t(X_{0:t}) \mid X_t = x_t, y_{0:t}]$, $t \in \mathbb{N}$, may be updated recursively according to

$$T_{t+1} h_{t+1}(x_{t+1}) = \mathbb{E}\left[ h_{t+1}(X_{0:t+1}) \mid X_{t+1} = x_{t+1}, y_{0:t+1} \right] \tag{3.2}$$
$$= \mathbb{E}\left[ h_t(X_{t:t+1}) + T_t h_t(X_t) \mid X_{t+1} = x_{t+1}, y_{0:t+1} \right] = \int \overleftarrow{q}_{\phi_{t;\theta}}(x_{t+1}, x_t) \left( h_t(x_{t:t+1}) + T_t h_t(x_t) \right) dx_t. \tag{3.3}$$

Each expectation of interest is then calculated as $\phi_{0:t;\theta} h_t = \phi_{t;\theta} T_t h_t$. Plugging the approximations (3.1) into the recursion, we obtain approximations $\{\tau_t^i\}_{i=1}^N$ of the statistics $\{T_t h_t(\xi_t^i)\}_{i=1}^N$, initialized by $\tau_0^i = 0$. These approximations may be updated sequentially by first evolving the particle filter one step and then setting

$$\tau_{t+1}^i = \sum_{j=1}^N \frac{\omega_t^j\, q_\theta(\xi_t^j, \xi_{t+1}^i)}{\sum_{\ell=1}^N \omega_t^\ell\, q_\theta(\xi_t^\ell, \xi_{t+1}^i)} \left( \tau_t^j + h_t(\xi_t^j, \xi_{t+1}^i) \right), \tag{3.4}$$

yielding the estimate

$$\phi_{0:t+1;\theta}^N h_{t+1} = \sum_{i=1}^N \frac{\omega_{t+1}^i}{\Omega_{t+1}}\, \tau_{t+1}^i$$

of $\phi_{0:t+1;\theta} h_{t+1}$. This allows for online estimation. In addition, the algorithm has the appealing property that only the current statistics and particle sample need to be stored in memory. The computational complexity of this algorithm is, however, still quadratic, since a sum of $N$ terms has to be computed for each particle [11].
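A direct particle implementation of the update (3.4) may be sketched as follows; q_density and h_increment are user-supplied model functions (illustrative names), where q_density(x_prev, x_new) evaluates $q_\theta$ elementwise over the array of previous particles. The nested normalization makes the O(N^2) cost explicit.

```python
import numpy as np

def ffbsm_forward_update(tau, particles_prev, weights_prev, particles_new,
                         h_increment, q_density):
    """Forward-only FFBSm update (3.4): for every new particle, form the
    normalized backward weights over all previous particles and average
    tau + increment.  Cost is O(N^2) per time step."""
    N = particles_new.size
    tau_new = np.zeros(N)
    for i in range(N):
        bw = weights_prev * q_density(particles_prev, particles_new[i])
        bw /= bw.sum()   # normalizing constant: a sum of N terms per particle
        tau_new[i] = np.sum(bw * (tau + h_increment(particles_prev, particles_new[i])))
    return tau_new
```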

3.4. Forward-filtering backward-simulation. In order to remedy the high computational complexity of FFBSm, the forward-filtering backward-simulation (FFBSi) algorithm generates trajectories on the index space of the particle locations produced by the particle filter. This is done by repeatedly simulating a time-reversed, inhomogeneous Markov chain with transition probabilities equal to the particle approximations of the backward densities, i.e., drawing indices $\{J_s\}_{s=0}^t$ with

$$J_t \sim \Pr\left(\{\omega_t^\ell\}_{\ell=1}^N\right),$$

and, recursively,

$$J_s \mid J_{s+1} = j \sim \Pr\left(\{\omega_s^i\, q_\theta(\xi_s^i, \xi_{s+1}^j)\}_{i=1}^N\right).$$

Given $\{J_s\}_{s=0}^t$, an approximate draw from the smoothing distribution is formed by the random vector $(\xi_0^{J_0}, \ldots, \xi_t^{J_t})$. Consequently, the uniformly weighted occupation measure associated with a set of conditionally independent such draws provides a finite-dimensional approximation of the smoothing distribution $\phi_{0:t;\theta}$; see [19]. In this basic formulation of FFBSi, the backward sampling pass requires the normalizing constants of the particle-based backward densities to be computed, and hence the algorithm suffers from quadratic complexity. On the other hand, in contrast to FFBSm, this complexity is the same for all types of objective functions.

However, following [14], it is, under the assumption that the transition density $q_\theta$ is bounded from above by some constant $\varepsilon$, possible to reduce the computational complexity of FFBSi by simulating with the following accept-reject technique. In order to simulate $J_s$ given $J_{s+1} = j$, a candidate $J^*$ drawn from the proposal distribution $\Pr(\{\omega_s^i\}_{i=1}^N)$ is accepted with probability $q_\theta(\xi_s^{J^*}, \xi_{s+1}^j)/\varepsilon$. The procedure is repeated until acceptance. Under the additional assumption that the transition density is also bounded from below, it can be shown (see [14, Proposition 2]) that the expected number of proposals is bounded by a time-independent constant, resulting in an algorithm with only linear computational complexity.
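A sketch of this accept-reject step is given below, under the assumption that $q_\theta \le \varepsilon$ (q_max in the code); the function names are ours.

```python
import numpy as np

def sample_backward_index(weights, particles, x_next, q_density, q_max, rng):
    """Draw a backward index J_s given the next state x_next by accept-reject:
    propose from Pr({omega_s^i}) and accept with probability q_theta / q_max."""
    probs = weights / weights.sum()
    while True:
        j = rng.choice(weights.size, p=probs)
        if rng.uniform() <= q_density(particles[j], x_next) / q_max:
            return j
```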

4. Summary of Paper A

Requiring separate forward and backward processing of the data, the standard design of FFBSi is impracticable in online applications. In Paper A, we propose a novel algorithm, the PaRIS algorithm, which can be viewed as a hybrid between the forward-only implementation of FFBSm and the FFBSi algorithm.

Similarly to the forward-only implementation of FFBSm, the PaRIS algorithm propagates particle estimates $\{\tau_t^i\}_{i=1}^N$ of $\{T_t h_t(\xi_t^i)\}_{i=1}^N$. The key ingredient of PaRIS is the replacement of (3.4), in the spirit of FFBSi, by a Monte Carlo estimate. More specifically, for each particle location $\xi_{t+1}^i$ we draw $\tilde N$ indices $\{J_{t+1}^{(i,j)}\}_{j=1}^{\tilde N}$ according to

$$J_{t+1}^{(i,j)} \sim \Pr\left(\{\omega_t^\ell\, q_\theta(\xi_t^\ell, \xi_{t+1}^i)\}_{\ell=1}^N\right) \tag{4.1}$$

and update the associated auxiliary statistics according to

$$\tau_{t+1}^i = \tilde N^{-1} \sum_{j=1}^{\tilde N} \left( \tau_t^{J_{t+1}^{(i,j)}} + h_t\bigl(\xi_t^{J_{t+1}^{(i,j)}}, \xi_{t+1}^i\bigr) \right).$$

At each time step, an estimator of $\phi_{0:t+1;\theta} h_{t+1}$ is obtained as $\sum_{i=1}^N \omega_{t+1}^i \tau_{t+1}^i / \Omega_{t+1}$. The algorithm is initialized by setting $\tau_0^i = 0$. In order to gain efficiency, the sampling in (4.1) is performed using the accept-reject technique that speeds up the FFBSi algorithm. This results in an algorithm with linear computational complexity.

In Paper A it is shown that the design parameter $\tilde N$ should be at least 2 in order for the algorithm to be numerically stable. Indeed, when $\tilde N = 1$, a phenomenon similar to the collapse of the historical trajectories of the particle filter is observed. Figure 5 illustrates the difference between the cases $\tilde N = 1$ and $\tilde N = 2$; the similarity between the case $\tilde N = 1$ and Figure 4 is clearly visible. Paper A provides a full theoretical study of PaRIS, which is highly non-trivial due to the complex dependence structure induced by the backward sampling approach.

Figure 5. Genealogical traces corresponding to backward simulation in the PaRIS algorithm, for (a) $\tilde N = 1$ and (b) $\tilde N = 2$. Columns of nodes refer to different particle populations (with $N = 3$) at different time points (time increasing rightward), and arrows indicate connections formed through the backward simulation. Black-coloured particles are included in the final estimator, while grey-coloured ones are inactive.
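A sketch of the resulting PaRIS update is given below. It reuses the accept-reject sampler sketched in Section 3.4 (sample_backward_index); q_density, q_max, and h_increment are user-supplied model quantities (illustrative names), and N_tilde plays the role of $\tilde N$.

```python
import numpy as np

def paris_update(tau, particles_prev, weights_prev, particles_new,
                 h_increment, q_density, q_max, N_tilde, rng):
    """PaRIS update: for each new particle, draw N_tilde backward indices by
    accept-reject and average tau + increment over the draws.  N_tilde >= 2
    is required for numerical stability; the overall cost is linear in the
    number of particles (up to the expected number of accept-reject proposals)."""
    N = particles_new.size
    tau_new = np.zeros(N)
    for i in range(N):
        acc = 0.0
        for _ in range(N_tilde):
            j = sample_backward_index(weights_prev, particles_prev,
                                      particles_new[i], q_density, q_max, rng)
            acc += tau[j] + h_increment(particles_prev[j], particles_new[i])
        tau_new[i] = acc / N_tilde
    return tau_new
```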

5. Summary of Paper B

In Paper B, an online EM algorithm for parameter estimation in hidden Markov models is presented. Applying the EM algorithm to HMMs belonging to an exponential family involves estimating $t^{-1}\phi_{0:t;\theta} s_t$, where $s_t$ are (possibly vector-valued) sufficient statistics. Since these sufficient statistics are of additive form, we can use (3.3), in combination with the PaRIS algorithm, to obtain an online algorithm for smoothing the time-averaged sufficient statistics. In the context of EM, the quantity of interest is generally the normalized expectation $t^{-1}\phi_{0:t;\theta} s_t$ rather than $\phi_{0:t;\theta} s_t$, and we hence propagate $t^{-1} T_t s_t$ according to the recursion

$$\gamma_t T_t s_t(x_t) = \int \overleftarrow{q}_{\phi_{t-1;\theta}}(x_t, x_{t-1}) \left\{ (1 - \gamma_t)\, \gamma_{t-1} T_{t-1} s_{t-1}(x_{t-1}) + \gamma_t\, s_{t-1}(x_{t-1:t}) \right\} dx_{t-1}, \tag{5.1}$$

where $\gamma_t = t^{-1}$. The main idea of the online EM algorithm is to sequentially update $t^{-1}\phi_{0:t;\theta} s_t$ through the recursion (5.1) and then use the maximization function $\Lambda$ to update the parameters; see [5, 11]. In practice, instead of $\gamma_t = t^{-1}$, a step size $\gamma_t$ satisfying the standard stochastic approximation requirements $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$ is used. A usual choice is $\gamma_t = t^{-\alpha}$, where $0.5 < \alpha \leq 1$.

Again, exact computation of $\gamma_t T_t s_t(x_t)$ is infeasible for most models, and we hence need to, as before, approximate this quantity. In Paper B this approximation is formed by applying the same technique as in the PaRIS algorithm, using accept-reject sampling to draw a sample from the backward kernel and taking the Monte Carlo average as an estimate. Thus, we propagate in parallel $\{\tau_t^i\}_{i=1}^N$ and the particle filter sample $\{(\omega_t^i, \xi_t^i)\}_{i=1}^N$, where $\tau_t^i$ is an estimate of $t^{-1} T_t s_t(\xi_t^i)$. When updating the estimates, the particle cloud is first propagated under the parameters $\theta_t$, whereupon indices $\{J_{t+1}^{(i,j)}\}_{j=1}^{\tilde N}$ are drawn from $\Pr(\{\omega_t^\ell\, q_{\theta_t}(\xi_t^\ell, \xi_{t+1}^i)\}_{\ell=1}^N)$ using the accept-reject sampling approach described earlier. The auxiliary statistics are then updated through

$$\tau_{t+1}^i = \tilde N^{-1} \sum_{j=1}^{\tilde N} \left\{ (1 - \gamma_{t+1})\, \tau_t^{J_{t+1}^{(i,j)}} + \gamma_{t+1}\, s_t\bigl(\xi_t^{J_{t+1}^{(i,j)}}, \xi_{t+1}^i\bigr) \right\}.$$

Finally, the updated parameters are given by

$$\theta_{t+1} = \Lambda\left( \sum_{i=1}^N \frac{\omega_{t+1}^i}{\Omega_{t+1}}\, \tau_{t+1}^i \right),$$

where $\Lambda$ is the maximization function. The algorithm is initialized by setting $\tau_0^i = 0$ and $\theta_0$ arbitrarily.
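Putting the pieces together, one sweep of the online EM algorithm may be sketched as below. The container model bundles the model-specific quantities used earlier (state propagation, emission weighting, transition density and its bound, the sufficient-statistic increment, and the maximization map Lambda); all of these names are illustrative placeholders, and sample_backward_index is the accept-reject sampler sketched in Section 3.4.

```python
import numpy as np

def online_em_step(t, particles, weights, tau, theta, y_next,
                   model, N_tilde, rng, alpha=0.6):
    """One step of the particle-based online EM algorithm: propagate the
    particle filter under the current parameters, update the time-averaged
    sufficient statistics by PaRIS-type backward sampling with step size
    gamma_t = t^(-alpha), and apply the maximization map Lambda."""
    gamma = (t + 1.0) ** (-alpha)          # step size, with 0.5 < alpha <= 1
    N = particles.size
    # Particle filter step under the current parameter estimate theta.
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    new_particles = model.propagate(particles[idx], theta, rng)
    new_weights = model.emission(new_particles, y_next, theta)
    # PaRIS-type update of the time-averaged sufficient statistics.
    new_tau = np.zeros_like(tau)
    for i in range(N):
        for _ in range(N_tilde):
            j = sample_backward_index(
                weights, particles, new_particles[i],
                lambda xp, xn: model.q_density(xp, xn, theta), model.q_max, rng)
            new_tau[i] += (1.0 - gamma) * tau[j] + gamma * model.stat_increment(
                particles[j], new_particles[i], y_next)
        new_tau[i] /= N_tilde
    # M-step: map the weighted average of the statistics to new parameters.
    avg_stats = np.average(new_tau, axis=0, weights=new_weights / new_weights.sum())
    new_theta = model.Lambda(avg_stats)
    return new_particles, new_weights, new_tau, new_theta
```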

References

[1] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, 1979.

[2] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online non-linear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process., 50:241–254, 2002.

[3] A. Bryson and M. Frazier. Smoothing for linear and nonlinear dynamic systems. Technical Report TDR 63-119, Aero. Sys. Div., Wright-Patterson Air Force Base, Ohio, 1963.

[4] O. Cappé. Ten years of HMMs (online bibliography 1989–2000), March 2001.

[5] O. Cappé. Online EM algorithm for hidden Markov models. Journal of Computational and Graphical Statistics, 20(3):728–749, 2011.

[6] O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.

[7] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer, 2005.

[8] S. Chib, F. Nardari, and N. Shephard. Markov chain Monte Carlo methods for stochastic volatility models. J. Econometrics, 108:281–316, 2002.

[9] P. Del Moral. Feynman-Kac Formulae. Genealogical and Interacting Particle Systems with Applications. Springer, 2004.

[10] P. Del Moral, A. Doucet, and S. Singh. A Backward Particle Interpretation of Feynman-Kac Formulae. ESAIM M2AN, 44(5):947–975, 2010.

[11] P. Del Moral, A. Doucet, and S. Singh. Forward smoothing using sequential Monte Carlo. Technical report, Cambridge University, 2010.


[12] P. Del Moral and A. Guionnet. On the stability of interacting processes with applications to filtering and genetic algorithms. Annales de l'Institut Henri Poincaré, 37:155–194, 2001.

[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39(1):1–38 (with discussion), 1977.

[14] R. Douc, A. Garivier, E. Moulines, and J. Olsson. Sequential Monte Carlo smoothing for general state space hidden Markov models. Ann. Appl. Probab., 21(6):2109–2145, 2011.

[15] R. Douc, E. Moulines, and J. Olsson. Long-term stability of sequential Monte Carlo methods under verifiable conditions. Ann. Appl. Probab., 24(5):1767–1802, 2014.

[16] A. Doucet, N. De Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.

[17] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput., 10:197–208, 2000.

[18] P. Fearnhead, D. Wyncoll, and J. Tawn. A sequential smoothing algorithm with linear computational cost. Biometrika, 97(2):447–464, 2010.

[19] S. J. Godsill, A. Doucet, and M. West. Monte Carlo smoothing for non-linear time series. J. Am. Statist. Assoc., 50:438–449, 2004.

[20] N. Gordon, D. Salmond, and A. F. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, Radar Signal Process., 140:107–113, 1993.

[21] J. Hull and A. White. The pricing of options on assets with stochastic volatilities. J. Finance, 42:281–300, 1987.

[22] M. Hürzeler and H. R. Künsch. Monte Carlo approximations for general state-space models. J. Comput. Graph. Statist., 7:175–193, 1998.

[23] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, 1997.

[24] R. E. Kalman and R. Bucy. New results in linear filtering and prediction theory. J. Basic Eng., Trans. ASME, Series D, 83(3):95–108, 1961.

[25] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist., 1:1–25, 1996.

[26] R. Kohn and C. F. Ansley. A fast algorithm for signal extraction, influence and cross-validation in state space models. Biometrika, 76:65–79, 1989.

[27] T. Koski. Hidden Markov Models for Bioinformatics. Kluwer, 2001.


[28] J. Olsson, O. Cappé, R. Douc, and E. Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in non-linear state space models. Bernoulli, 14(1):155–179, 2008. arXiv:math.ST/0609514.

[29] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–285, February 1989.

[30] L. R. Welch. Hidden Markov models and the Baum-Welch algorithm. IEEE Inf. Theory Soc. Newslett., 53(4), 2003.

[31] C. F. J. Wu. On the convergence properties of the EM algorithm. Ann. Statist., 11:95–103, 1983.