Jonathan Jordan, jonathanjordan.staff.shef.ac.uk/ApplProb/notes.pdf

Applied Probability

MAS371/MAS6071

School of Mathematics and Statistics

University of Sheffield

2019-20

1 Introduction

1.1 Coverage and contents

This module builds on the module MAS275 Probability Modelling. In that module, you

will have seen probability models: for example, discrete variables Y1, . . . , Yn form a (discrete

time) Markov chain if

\[ P(Y_k = y_k \mid Y_{k-1} = y_{k-1}) = p_{y_{k-1}, y_k}, \]

where $p_{y_{k-1}, y_k}$ is the $(y_{k-1}, y_k)$th element of a transition probability matrix $P$.

This module builds on that in a number of ways. One is that it will introduce further types

of probability model; in particular it will extend the idea of a Markov chain to continuous

time, and it will look at models for random scatterings of points. It will also consider the

idea of how we might carry out inference for our probability models. A simple example

is where we have a discrete time Markov chain with two states, but with an unknown

transition matrix

\[ \begin{pmatrix} \theta_1 & 1-\theta_1 \\ 1-\theta_2 & \theta_2 \end{pmatrix}, \]

where we have unknown parameters θ = (θ1, θ2), and we can ask the question of how to

learn about these parameters θ from observations Y1, . . . , Yn. There are lots of special cases,


but for most purposes we use the likelihood

L(θ|y1, . . . , yn) = P (Y1 = y1, . . . , Yn = yn|θ).

or the corresponding log-likelihood

l = logL.

We will also see applied problems in which probability models and statistical inference

for their parameters are useful. Examples of modelling will be drawn from meteorology,

epidemic studies, seismology, disease mapping and elsewhere. We will also look at the wider

question of how to develop models to address particular questions, including issues of model

fit and model comparison.

The module will include theory, examples, and some computation (largely self-contained—

not much prior exposure to R is assumed).

The module will start off by introducing the basic Poisson process, which is both an impor-

tant model in its own right and a starting point for some more advanced models which we

will see later in the course, and review the main concepts of discrete time Markov chains

from MAS275. It will then discuss some of the ideas of likelihood-based inference (some

of which you may have seen before in MAS223), before applying those to discrete time

Markov chains. Later sections of the course will extend the theory to continuous time

Markov chains and to extensions of the Poisson process, while also covering ideas of infer-

ence for these models. It will also introduce important examples of probability models such

as those for epidemics and queues.

1.2 Books

The following books cover parts of the content of this module, as well as lots of other

material. There are lots of other texts that cover the material on stochastic processes, but

many of them are at a more technical level than this module. The book by Guttorp is

closest in overall spirit to what we are doing here.

P. Guttorp (1995) Stochastic Modelling of Scientific Data, Chapman & Hall. There is also

a 2nd edition, by P. Guttorp and V. N. Minin.

H. M. Taylor & S. Karlin (1998) An Introduction to Stochastic Modelling (3rd edn), Aca-

demic Press. There is also a 4th edition, by M. Pinsky and S. Karlin.

2 Review of probability models

2.1 The basic Poisson process

2.1.1 Introduction and definitions

This section will review the basic Poisson process from MAS275. There are many examples

of things whose random occurrences in time can be modelled by Poisson processes, for


example customers arriving in a queue, incoming calls to a phone, eruptions of a volcano,

and so on. Generalizations of the Poisson process will be covered later, in chapter 6.

Assume we start counting at time 0, and let N(t) be the random variable defined as the

number of the events which occur in the interval (0, t]. Note that we can also describe the

number of the events which occur in an interval (u, v] with v > u > 0 without defining any

more random variables, as we can write it as N(v)−N(u). N defined in this way is known

as a (random) counting function, and has some simple properties: N(t) is a non-negative

integer for any t, N is a non-decreasing function and N(0) = 0.

We define the basic (homogeneous, one-dimensional) Poisson process by the following two

assumptions:

1. For any 0 ≤ u ≤ v, the distribution of N(v)−N(u) is Poisson with parameter λ(v−u), where λ > 0 is the rate of the process.

2. If (u1, v1], (u2, v2], . . . , (uk, vk] are disjoint time intervals then

N(v1)−N(u1), N(v2)−N(u2), . . . , N(vk)−N(uk) are independent random variables.

2.1.2 Inter-occurrence times

Let T1 denote the length of time until the first occurrence, T2 denote the length of time

between the first and second occurrences, and so on, so that Tn represents the time between

occurrences n−1 and n. These random variables are called inter-occurrence times. The

two results below were shown in MAS275.

Theorem 1. The probability that in (0, t] two occurrences of a Poisson process with rate λ

occur at exactly the same time is zero.

Theorem 2. Inter-occurrence times are independent of each other, and are exponentially

distributed with parameter λ.
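Theorem 2 suggests one way to generate the occurrence times directly: accumulate independent exponential gaps until the end of the observation window is passed. The following R code is a minimal sketch of that idea; the function name `sim_pp_gaps` and its arguments are illustrative choices, not part of the MAS275 material.

```r
# Simulate the occurrence times of a rate-lambda Poisson process on (0, t_end]
# by summing independent Exp(lambda) inter-occurrence times (Theorem 2).
# sim_pp_gaps is a hypothetical helper name used only for this sketch.
sim_pp_gaps <- function(lambda, t_end) {
  times <- numeric(0)
  current <- rexp(1, rate = lambda)        # time of the first occurrence
  while (current <= t_end) {
    times <- c(times, current)
    current <- current + rexp(1, rate = lambda)  # add the next gap
  }
  times
}

set.seed(1)
w <- sim_pp_gaps(lambda = 2, t_end = 10)
length(w)   # observed N(10); under the model this count is Po(20)
```

Growing the vector inside the loop is slow for very large counts, but it keeps the link to the theorem explicit.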

2.1.3 A conditional property of the Poisson process

The following property of the Poisson process, again from MAS275, is highly useful for

generalizations.

Theorem 3. Given the total number of points of a homogeneous Poisson process in an

interval, the positions of the points are independently uniformly distributed over the interval.

Note that this is equivalent to the statement that, conditional on there being n occurrences

in (s, t], the number of occurrences in any interval (u, v] ⊆ (s, t] (so s ≤ u < v ≤ t) has

a Bi(n, (v − u)/(t − s)) distribution, as each of the n occurrences would have probability

(v − u)/(t− s) of being in (u, v], independently of the others.

This property gives a way to simulate a Poisson process of rate λ on an interval [0, t]:

1. first generate the number of points in [0, t] as an observation n of N(t) from the Poisson distribution Po(λt);

2. then generate n independent uniform U(0, t) variables $U_i$, i = 1, . . . , n, and put a point at each position $U_i$. (The successive positions of the points, reading from left to right, are then the ordered values $W_1 \le \cdots \le W_n$ of the $U_i$.)
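The two steps above can be sketched in R as follows. The function name `sim_pp_unif` and the parameter values are assumptions made for illustration only.

```r
# Simulate a rate-lambda Poisson process on [0, t_end] using the
# conditional (uniform) property of Theorem 3.
# sim_pp_unif is a hypothetical helper name used only for this sketch.
sim_pp_unif <- function(lambda, t_end) {
  n <- rpois(1, lambda * t_end)          # step 1: number of points, Po(lambda * t_end)
  sort(runif(n, min = 0, max = t_end))   # step 2: n uniform positions, ordered
}

set.seed(2)
w <- sim_pp_unif(lambda = 2, t_end = 10)
head(w)     # the first few ordered occurrence times W_1 <= W_2 <= ...
```

Unlike accumulating exponential gaps, this version needs no loop, because the number of points is drawn first.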


2.2 Discrete time Markov chains

The discrete-time Markov chains studied in MAS275 are simple models for dependent vari-

ables. In this section, we recall some of the basic ideas and terminology associated with

them.

2.2.1 Markov property

We start with a general definition of a Markov process:

Definition 4. Let {Xt : t ∈ T} be a set of random variables (a stochastic process), with an

infinite index set T .

If, for each real number x, each n ≥ 1 and t1 < . . . < tn < t with ti ∈ T for all i, the

following condition (the Markov property) holds

P{Xt ≤ x | Xt1 = x1, . . . Xtn = xn} = P{Xt ≤ x | Xtn = xn}

we say Xt is a Markov process.

Thus the dependence among the Xt is of the simplest possible form: given the present, the future depends only on that, and no further on the past.

A Markov chain is a special case of a Markov process, one which stays in a countable set S

(such as the integers):

Definition 5. If {Xt, t ∈ T} is a Markov process and there is a countable set S such that

P{Xt ∈ S} = 1 for all t ∈ T we say that {Xt, t ∈ T} is a Markov chain. S is the state

space of the chain and points i ∈ S are states of the chain.

If T is a set of integers, {Xt} is a discrete time Markov chain. Later we will consider

the continuous-time case, where T is a continuous interval of the real line.

2.2.2 Homogeneous Markov chains

Definition 6. A Markov chain is said to be homogeneous if and only if Xt+h | Xt has

distribution independent of t, so that the probability that you move from one state to another

in a fixed number h of moves is the same whenever the moves begin.

Homogeneous chains provide a rich class of models, so we concentrate on them here.

For a discrete time chain, the above definition is equivalent to saying that

P{Xt+n = j | Xt = i} = P{Xn = j | X0 = i},

with t and n integers, so the following makes sense:

Definition 7. Let $X_n$ be a homogeneous discrete time Markov chain. Then we call the probabilities $P\{X_n = j \mid X_0 = i\}$ the n-step transition probabilities and denote them by $p_{ij}^{(n)}$.

For n = 1 they are called the one-step transition probabilities or simply transition probabilities, and we just write $p_{ij}$ instead of $p_{ij}^{(1)}$.

The n-step transition probabilities form a $|S| \times |S|$ matrix, $P^{(n)} = (p_{ij}^{(n)})$, which is called the n-step transition probability matrix.

For n = 1 it is called the one-step transition matrix, or just the transition matrix, and we write $P$ instead of $P^{(1)}$.

Note that the row sums of a transition matrix $P$ are all equal to 1 (why?): for each $i \in S$,

\[ \sum_{j \in S} p_{ij} = 1. \]

A square matrix of non-negative elements with this property is said to be stochastic.

We now give some examples of transition matrices, most or all of which you will have seen

in MAS275.

Example 1. Weather modelling

Assume that days at a particular location can be classified as either “dry” or “wet”. Let

S = {D,W}, and say, for example, that Xn = W means that day n is wet. Assuming a Markov

chain model, let our transition matrix (rows and columns in the order D, W) be

\[ \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}. \]

Then, given that day n is dry, the probability that day n+ 1 is wet is α, while given that day n

is wet, the probability that day n+ 1 is dry is β.

We will return to this model, with some actual data, later in the course.
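As a rough illustration of Example 1, the following R sketch simulates a sequence of dry and wet days from the two-state chain. The function name `sim_weather` and the values α = 0.3, β = 0.6 are arbitrary assumptions for illustration; they are not estimates from any data.

```r
# Simulate n_days of the dry/wet Markov chain of Example 1.
# sim_weather and the parameter values below are illustrative assumptions.
sim_weather <- function(n_days, alpha, beta, start = "D") {
  states <- c("D", "W")
  P <- matrix(c(1 - alpha, alpha,
                beta,      1 - beta),
              nrow = 2, byrow = TRUE,
              dimnames = list(states, states))
  x <- character(n_days)
  x[1] <- start
  for (n in seq_len(n_days)[-1]) {
    # the next state is drawn from the row of P for the current state
    x[n] <- sample(states, size = 1, prob = P[x[n - 1], ])
  }
  x
}

set.seed(3)
days <- sim_weather(n_days = 1000, alpha = 0.3, beta = 0.6)
table(days) / length(days)   # observed proportions of dry and wet days
```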

Example 2. Random walk on triangle

Label the vertices of a triangle by A,B and C, and assume that a particle moves as a Markov

chain from vertex to vertex, at each time step moving from its current vertex to one of its

neighbours, moving clockwise with probability p and anti-clockwise with probability 1−p. Then

S = {A,B,C}, and the transition matrix (rows and columns in the order A, B, C) is

\[ \begin{pmatrix} 0 & p & 1-p \\ 1-p & 0 & p \\ p & 1-p & 0 \end{pmatrix}. \]

If p = 1/2 then the random walk is symmetric. The more general idea of a random walk on a

graph and the related idea of Google PageRank were seen in MAS275.

Example 3. Ehrenfest model for diffusion

Imagine that we have N molecules, each of which is in one of two containers, A or B. At each

time step, one molecule (of the N) is selected at random (each equally likely to be chosen),

and moved to the other container.

Let Yn be the number of particles in container A at time n. Then (Yn) forms a Markov chain

with state space {0, 1, 2, . . . , N} and transition probabilities given by $p_{k,k-1} = k/N$, $p_{k,k+1} = (N-k)/N$ and $p_{k,j} = 0$ if $j \notin \{k-1, k+1\}$.
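The Ehrenfest transition probabilities above translate directly into a matrix. The R sketch below builds it for a given N and checks that it is stochastic; the function name `ehrenfest_matrix` is a hypothetical choice for this illustration.

```r
# Build the (N+1) x (N+1) transition matrix of the Ehrenfest chain.
# State k (number of molecules in container A) is stored in row/column k + 1.
# ehrenfest_matrix is a hypothetical helper name.
ehrenfest_matrix <- function(N) {
  labels <- as.character(0:N)
  P <- matrix(0, nrow = N + 1, ncol = N + 1,
              dimnames = list(labels, labels))
  for (k in 0:N) {
    if (k > 0) P[k + 1, k]     <- k / N         # p_{k,k-1} = k/N
    if (k < N) P[k + 1, k + 2] <- (N - k) / N   # p_{k,k+1} = (N-k)/N
  }
  P
}

P <- ehrenfest_matrix(4)
rowSums(P)   # every row sums to 1, so P is stochastic
```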


2.2.3 The Chapman-Kolmogorov equations

Let Xn be a discrete time homogeneous Markov chain with state space S. It was shown

in MAS275 that the transition probabilities for Xn satisfy the Chapman-Kolmogorov

equations

\[ p_{ij}^{(m+n)} = \sum_{k \in S} p_{ik}^{(m)} p_{kj}^{(n)}, \]

and that therefore the n-step transition matrix $P^{(n)}$ equals $P^n$, the n-th power of $P$.

This, with knowledge of the initial state, allows us to calculate the probability that the

chain will be in any specified state at time n; that is $P\{X_n = j\}$, say. Writing $\pi_j^{(n)}$ for $P\{X_n = j\}$ and $\pi^{(n)}$ for the row vector with entries $(\pi_i^{(n)} : i \in S)$,

\[ \pi_j^{(n)} = \sum_{i \in S} \pi_i^{(0)} p_{ij}^{(n)}, \]

or in matrix-vector terms:

\[ \pi^{(n)} = \pi^{(0)} P^n. \]
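The matrix-vector form can be evaluated by multiplying the initial row vector by P once per time step. The R sketch below does this; the function name `dist_at_time` and the example matrix are assumptions chosen purely for illustration.

```r
# Distribution of X_n given the initial distribution pi0:
# pi^(n) = pi^(0) P^n, computed by n successive right-multiplications by P.
# dist_at_time is a hypothetical helper name.
dist_at_time <- function(pi0, P, n) {
  pin <- pi0
  for (step in seq_len(n)) pin <- pin %*% P   # row vector times P
  drop(pin)                                   # return a plain vector
}

# Illustrative two-state chain (values are arbitrary)
P <- matrix(c(0.7, 0.3,
              0.6, 0.4), nrow = 2, byrow = TRUE)
pi2 <- dist_at_time(c(1, 0), P, 2)   # start in state 1 with probability 1
pi2
```

Starting in state 1, one step gives (0.7, 0.3) and a second step gives (0.67, 0.33).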

2.2.4 Computation of $P^n$

$P^n$ may be obtained by straightforward matrix multiplication. The following is a simple R function to return successive powers of $P$ for a two-state chain.

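nstep <- function(P, nst)
{
  # n-step transition matrices for a 2 state chain, for n = 1, ..., nst;
  # Parr[,,n] holds the n-step matrix
  Parr <- array(0, dim = c(2, 2, nst))
  Parr[,,1] <- P
  # seq_len(nst)[-1] is empty when nst = 1, whereas 2:nst would count down
  for (i in seq_len(nst)[-1])
    Parr[,,i] <- P %*% Parr[,,i-1]
  Parr
}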

An alternative approach, computationally more efficient in large chains, is to diagonalize

P . This has some advantage too in showing how chains evolve. Suppose P has distinct

eigenvalues. Then matrix theory shows that P can be expressed as

\[ P = T D T^{-1} \tag{1} \]

where D is a diagonal matrix whose diagonal entries are the eigenvalues of P , and T is a

non-singular matrix. Expression (1) is called the spectral representation of P . (In fact the

columns t of T are right eigenvectors of P ; that is, they satisfy P t = dt for an eigenvalue

d of P . Clearly, since P is stochastic, d = 1 is an eigenvalue, corresponding to eigenvector

t′ = (1, . . . , 1). It may be shown that all eigenvalues of a stochastic matrix have modulus

≤ 1, so under our assumption about P , d = 1 is the largest eigenvalue.) It is convenient

to reorder rows and columns so that the diagonal entries in $D$ are in order of decreasing modulus, so $D$ is of the form

\[ D = \begin{pmatrix} 1 & 0 & \cdots & \cdots \\ 0 & d_2 & 0 & \cdots \\ 0 & 0 & d_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}, \]

where $1 \ge |d_2| \ge |d_3| \ge \cdots$.

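From (1)

\[ P^2 = T D T^{-1} \, T D T^{-1} = T D^2 T^{-1}, \]

and in general

\[ P^n = T D^n T^{-1}, \quad n = 1, 2, \ldots. \tag{2} \]

Moreover

\[ D^n = \begin{pmatrix} 1 & 0 & \cdots & \cdots \\ 0 & d_2^n & 0 & \cdots \\ 0 & 0 & d_3^n & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}, \tag{3} \]

so (2) and (3) give a simple way to calculate $P^n$.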

The R function eigen calculates the eigenvalues and eigenvectors of a matrix, and the

function solve finds inverses. The following code uses these functions to give the spectral

representation of a stochastic P .

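# Spectral representation P = T D T^{-1} of a stochastic matrix P
eP <- eigen(P)
eigvals <- eP$values; D <- diag(eigvals)   # eigenvalues on the diagonal of D
Tmat <- eP$vectors; Tinv <- solve(Tmat)    # columns of Tmat are right eigenvectors
print(Tmat); print(D); print(Tinv)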

and the nth power of P may then be calculated by:

Tmat%*%D^n%*%Tinv

where n is an explicit numerical value. The R function spect on the course website applies

the above code for a given matrix P and calculates its nth power.
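For readers without access to the course website, the following is a minimal sketch of how equations (2) and (3) might be wrapped into a function. It is not the course's `spect` function: the name `mat_power_spectral` is hypothetical, and the sketch assumes P has distinct eigenvalues so that the matrix of eigenvectors is invertible.

```r
# n-th power of P via the spectral representation P^n = T D^n T^{-1}.
# Assumes P is diagonalizable (e.g. distinct eigenvalues).
# mat_power_spectral is a hypothetical name, not the course function spect.
mat_power_spectral <- function(P, n) {
  eP <- eigen(P)
  Tmat <- eP$vectors                         # right eigenvectors as columns
  Dn <- diag(eP$values^n, nrow = nrow(P))    # D^n: eigenvalues raised to power n
  Pn <- Tmat %*% Dn %*% solve(Tmat)
  Re(Pn)   # eigenvalues may be complex; P^n itself is real
}

# Illustrative two-state chain (values are arbitrary)
P <- matrix(c(0.7, 0.3,
              0.6, 0.4), nrow = 2, byrow = TRUE)
P3_spectral <- mat_power_spectral(P, 3)
P3_direct <- P %*% P %*% P
max(abs(P3_spectral - P3_direct))   # the two methods agree up to rounding error
```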

2.2.5 Stationary distributions

Definition 8. Let $X_n$ be a discrete time homogeneous Markov chain with one-step transition matrix $P$. Any distribution $\pi$ such that $\pi_j = \sum_{i \in S} \pi_i p_{ij}$ for all $j \in S$ is called a stationary distribution of the chain.

In other words, π = {πk} satisfies π = πP , so it is a left eigenvector of P with eigenvalue

1.


If a Markov chain has a stationary distribution $\pi$ and if the starting state is chosen according to $\pi$, then $X_n \sim \pi$ for all n, since

\[ \pi^{(1)} = \pi P = \pi, \]

\[ \pi^{(2)} = \pi P^2 = (\pi P) P = \pi P = \pi, \]

and so on.

A stationary distribution need not exist or be unique, as seen in MAS275. However Markov

chains that do have a stationary distribution are of particular interest as models for processes

whose overall properties remain stable over time even though the state of the process itself is

continually changing. In MAS275 it was seen that for many Markov chains, π(n) converges

to a stationary distribution, irrespective of the starting distribution, and it is these chains

that will be particularly useful as models for stable systems. In terms of the classification

of chains discussed in MAS275, it is the irreducible aperiodic, positive recurrent chains that

we will mainly use as models. These chains are also called ergodic chains. For them

\[ p_{ij}^{(n)} \to \pi_j > 0 \quad \text{as } n \to \infty \]

and

{πj} is a stationary distribution.
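Since a stationary distribution is a left eigenvector of P with eigenvalue 1, it can be found numerically from the eigen-decomposition of the transpose of P. The R sketch below assumes the stationary distribution is unique (as it is for ergodic chains); the function name `stationary_dist` and the example matrix are illustrative assumptions.

```r
# Stationary distribution as the left eigenvector of P with eigenvalue 1,
# normalised to sum to 1. Assumes the stationary distribution is unique.
# stationary_dist is a hypothetical helper name.
stationary_dist <- function(P) {
  e <- eigen(t(P))                     # right eigenvectors of t(P) are left eigenvectors of P
  k <- which.min(abs(e$values - 1))    # locate the eigenvalue equal to 1
  v <- Re(e$vectors[, k])
  v / sum(v)                           # rescale so the entries sum to 1
}

# Illustrative two-state chain (values are arbitrary)
P <- matrix(c(0.7, 0.3,
              0.6, 0.4), nrow = 2, byrow = TRUE)
pi_stat <- stationary_dist(P)
pi_stat                            # (2/3, 1/3) for this P
max(abs(pi_stat %*% P - pi_stat))  # check that pi = pi P holds numerically
```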

Example 4. Stationary distributions

For the dry/wet model, Example 1, we find a stationary distribution by solving $\pi = \pi P$. With $\pi = (\pi_D \;\; \pi_W)$, this gives us the equations

\[ (1-\alpha)\pi_D + \beta\pi_W = \pi_D \]

\[ \alpha\pi_D + (1-\beta)\pi_W = \pi_W, \]

together with the fact that $\pi_D + \pi_W = 1$ (as we are looking for a distribution). Re-arranging either equation gives $\pi_D = \frac{\beta}{\alpha}\pi_W$, which together with $\pi_D + \pi_W = 1$ gives

\[ \pi = \left( \frac{\beta}{\alpha+\beta} \;\; \frac{\alpha}{\alpha+\beta} \right). \]

It is not hard to check that, for 0 < α, β < 1, the chain is irreducible and aperiodic, and hence ergodic. Hence the probability that day n is dry converges to $\frac{\beta}{\alpha+\beta}$ as $n \to \infty$, by the results in MAS275.

For the Ehrenfest model, Example 3, solving the stationary distribution equations gives a unique stationary distribution with $\pi_j = \binom{N}{j} / 2^N$. However, this chain is not aperiodic (odd and even states alternate) and so the convergence results do not apply.
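Both claims about the Ehrenfest chain can be checked numerically for a small case. The R sketch below uses N = 4 (an arbitrary choice for illustration) and verifies that the binomial probabilities satisfy π = πP, and that a chain started in state 0 can only be in odd states at odd times.

```r
# Numerical check of the Ehrenfest stationary distribution for N = 4
# (N = 4 is an arbitrary illustrative choice).
N <- 4
P <- matrix(0, N + 1, N + 1)           # state k is stored in row/column k + 1
for (k in 0:N) {
  if (k > 0) P[k + 1, k]     <- k / N
  if (k < N) P[k + 1, k + 2] <- (N - k) / N
}

pi_binom <- dbinom(0:N, size = N, prob = 0.5)   # choose(N, j) / 2^N
max(abs(pi_binom %*% P - pi_binom))             # numerically zero: pi = pi P

# Periodicity: starting in state 0, after 3 steps the even states
# 0, 2, 4 (columns 1, 3, 5) have probability exactly zero.
p3 <- c(1, 0, 0, 0, 0) %*% P %*% P %*% P
p3
```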

3 Useful ideas and methods for inference

This chapter collects together some basic mathematical and statistical facts that will be

useful later, in particular relating to the theory of likelihood. Some should be familiar;

others are likely to be new.


3.1 Central Limit Theorem

Theorem 9 (CLT for iid variables). If random variables $X_1, \ldots, X_n$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2 < \infty$, then

\[ \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \to Z \sim N(0, 1), \quad \text{as } n \to \infty. \]

The result generalizes to iid random vectors.

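Theorem 10 (CLT for iid random vectors). If random vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are independent and identically distributed with mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\Sigma$, finite, then

\[ \frac{\sum_{i=1}^n \mathbf{X}_i - n\boldsymbol{\mu}}{\sqrt{n}} \to \mathbf{Z} \sim N(\mathbf{0}, \Sigma), \quad \text{as } n \to \infty. \]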

Notes:

1. The → in these theorems denotes convergence in distribution.

2. Terminology. In cases like these when something converges in distribution to a Normal

distribution it is often said to be asymptotically Normal as n→∞.

3. The Central Limit result remains true for dependent and/or non-identically dis-

tributed random variables/vectors under suitable conditions.

3.2 Likelihood and inference

Suppose observations x1, . . . , xn = x are modelled as the values of random variables X1, . . . ,

Xn = X, and suppose that the probability density function (probability function in the

discrete case) fX of X depends on an unknown parameter θ, one- or multi-dimensional.

Inference consists of drawing conclusions about θ on the basis of x and the model fX . The

notion of likelihood is central to inference.

Definition 11. The likelihood of θ based on observed data x is defined to be the function

of θ:

L(θ) = L(θ;x) = fX(x;θ).

The maximum likelihood estimator of θ, often written $\hat{\theta}$, is the value of θ which maximizes

L(θ).

In the discrete case, for each θ, L(θ) gives the probability of observing the data x if θ is

the true parameter (provided f is from the correct family of distributions). Thus we can

think of L(θ) as a measure of how plausible θ is as the value that generated the observed

data x. In the continuous case a similar statement is true if we recall that in practice all

measurements are made only to a bounded precision.

The ratio $L(\theta_1)/L(\theta_2)$ measures how plausible $\theta_1$ is relative to $\theta_2$ as the value generating the data. If $\hat{\theta}$ is the maximum likelihood estimator, then the relative likelihood is defined to be the ratio

\[ RL(\theta) = L(\theta)/L(\hat{\theta}). \]


Values of θ for which the relative likelihood is not too much different from 1 are plausible

in the light of the observed x.

It is convenient to plot the likelihood on a log scale, and this scale is mathematically

convenient too. So the log-likelihood is defined to be

l(θ) = logL(θ).

Statements about relative likelihoods become statements about differences of log-likelihoods.

An important special case is for independent Xi. Then

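\[ L(\theta) = \prod_{i=1}^n f_{X_i}(x_i; \theta) \tag{4} \]

and

\[ l(\theta) = \sum_{i=1}^n \log f_{X_i}(x_i; \theta) \tag{5} \]

where $f_{X_i}$ denotes the density function of $X_i$.

Often $\hat{\theta}$ may be found as the solution of the likelihood equation(s)

\[ \frac{\partial L(\theta)}{\partial \theta} = 0 \]

or equivalently,

\[ \frac{\partial l(\theta)}{\partial \theta} = 0. \tag{6} \]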

Example 5. Markov chain

We consider a two state Markov chain $(X_n)$, as in Example 1 but with state space S = {1, 2}, with transition matrix

\[ \begin{pmatrix} 1-\theta & \theta \\ \phi & 1-\phi \end{pmatrix}. \]

We assume that the chain is in equilibrium, and we consider finding the likelihood for the parameters $\boldsymbol{\theta} = (\theta, \phi)$.

The stationary distribution here is $\left( \frac{\phi}{\theta+\phi} \;\; \frac{\theta}{\theta+\phi} \right)$, by the same calculations as in Example 4.

Imagine we observe $X_0 = 2$, $X_1 = 1$. Because we assume the chain is in equilibrium, we have $P(X_0 = 2) = \frac{\theta}{\theta+\phi}$, so

\[ P(X_0 = 2, X_1 = 1) = \frac{\theta}{\theta+\phi}\,\phi. \]

Hence this expression also gives us the likelihood of (θ, φ) given our observation, and we can write

\[ L(\theta, \phi; \mathbf{x}) = \frac{\theta\phi}{\theta+\phi}. \]

This is plotted in Figure 1.

Imagine that we go further, and observe the sequence of states 2, 1, 1, 2, 2, 2. Then our likelihood becomes

\[ L(\theta, \phi; \mathbf{x}) = \frac{\theta}{\theta+\phi}\,\phi(1-\theta)\theta(1-\phi)(1-\phi) = \frac{\theta^2\phi(1-\theta)(1-\phi)^2}{\theta+\phi}. \]

This is plotted in Figure 2.
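The likelihood in Example 5 can be computed for any observed sequence by multiplying the stationary probability of the first state by the successive transition probabilities. The following R sketch does this; the function name `chain_lik` and the evaluation point (θ, φ) = (0.4, 0.3) are illustrative assumptions.

```r
# Likelihood of (theta, phi) for a two-state chain in equilibrium,
# given an observed sequence x of states coded 1 and 2.
# chain_lik is a hypothetical helper name.
chain_lik <- function(theta, phi, x) {
  P <- matrix(c(1 - theta, theta,
                phi,       1 - phi), nrow = 2, byrow = TRUE)
  stat <- c(phi, theta) / (theta + phi)   # stationary distribution
  lik <- stat[x[1]]                       # probability of the initial state
  for (k in seq_along(x)[-1]) {
    lik <- lik * P[x[k - 1], x[k]]        # multiply in each observed transition
  }
  lik
}

x <- c(2, 1, 1, 2, 2, 2)
lik_seq <- chain_lik(0.4, 0.3, x)
# Closed form from the text: theta^2 phi (1 - theta)(1 - phi)^2 / (theta + phi)
lik_closed <- 0.4^2 * 0.3 * (1 - 0.4) * (1 - 0.3)^2 / (0.4 + 0.3)
c(lik_seq, lik_closed)
```

Evaluating `chain_lik` over a grid of (θ, φ) values gives the surface shown as contours in the figures.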

Figure 1: Likelihood for two observations from a Markov chain (contour plot over θ and φ, each from 0 to 1)

3.3 Approximating the log likelihood

In one or two dimensions, plots of the log likelihood can be highly informative for inference.

In higher dimensions, and to reveal general features even in one or two dimensions, it is

useful to summarize the log-likelihood. It turns out that in many cases it can usefully be

approximated by a quadratic function of θ, so can be summarized by the position of the

maximum and the curvature there.

Example 6. Exponential sample

Suppose that observations $x_1, \ldots, x_n$ are modelled as a random sample from an exponential distribution with unknown rate parameter θ > 0. (For example, we could observe a Poisson process until we have n occurrences, and let $x_i$ be the ith inter-occurrence time.)

The probability density function for each observation is

\[ f_{X_i}(x; \theta) = \begin{cases} \theta e^{-\theta x} & x \ge 0 \\ 0 & x < 0 \end{cases} \tag{7} \]

and so the log likelihood is

\[ l(\theta) = \begin{cases} n(\log\theta - \bar{x}\theta) & \text{if } \min x_i \ge 0 \\ -\infty & \text{otherwise,} \end{cases} \]

with maximum likelihood estimator $\hat{\theta} = 1/\bar{x}$, where $\bar{x}$ is the sample mean.
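The closed-form maximum likelihood estimator can be compared with a direct numerical maximization of the log likelihood. The R sketch below uses simulated data; the sample size 50, the rate 2.5 and the search interval are arbitrary assumptions for illustration.

```r
# Numerical maximization of the exponential log likelihood,
# compared with the closed form 1 / mean(x).
# The simulated data and the search interval are illustrative choices.
set.seed(5)
x <- rexp(50, rate = 2.5)

loglik <- function(theta) length(x) * (log(theta) - mean(x) * theta)

fit <- optimize(loglik, interval = c(1e-6, 20), maximum = TRUE, tol = 1e-8)
c(numerical = fit$maximum, closed_form = 1 / mean(x))
```

For this one-parameter problem `optimize` is enough; a multi-parameter likelihood would need a routine such as `optim`.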

Figure 3 shows the log relative likelihoods from samples of sizes n = 10, 20, 40 and 80. Each sample had mean $\bar{x} = 0.4$. Evidently as n increases the log-likelihood becomes more peaked around its maximum. Thus it becomes less and less plausible that values of θ a fixed distance away from the maximum generated the data.

Figure 2: Likelihood for six observations from a Markov chain (contour plot over θ and φ, each from 0 to 1)

Definition 12. For 1-dimensional θ the function $J(\theta) = -\partial^2 l/\partial\theta^2$ is called the observed information about θ in the sample.

For p-dimensional θ the observed information is a matrix with components

\[ J(\theta)_{rs} = -\frac{\partial^2 l(\theta)}{\partial\theta_r \partial\theta_s}. \tag{8} \]

Note that at $\hat{\theta}$ in the 1-dimensional case, we’ll usually find $J(\hat{\theta}) > 0$. (In the multi-dimensional case correspondingly the matrix $J(\hat{\theta})$ will be positive definite.)

For most likelihoods, not just the one in the example, it’s true that close to $\hat{\theta}$ the log likelihood is well approximated by a quadratic function of θ:

\[ l(\theta) - l(\hat{\theta}) \approx \frac{1}{2}(\theta - \hat{\theta})^2 \frac{\partial^2 l(\hat{\theta})}{\partial\theta^2} = -\frac{1}{2}(\theta - \hat{\theta})^2 J(\hat{\theta}). \tag{9} \]

This is only useful if ‘close’ includes the values of θ that are plausible. Usually, this is

increasingly true as the amount of information increases, for example as n increases in the

i.i.d. case.

Figure 3: Log-relative-likelihood for exponential samples of size 10, 20, 40, 80 (horizontal axis θ, vertical axis log relative likelihood)

3.4 Approximate inference and asymptotic Normality

How uncertain are findings from the likelihood? One way of answering this question is to think about what would have happened if the data had been different. The mle $\hat{\theta}$ will generally take different values for different data x. For example the plots in Figure 4 come from five samples, each of size 20, from the exponential distribution with rate parameter θ = 2.5. Evidently the value of $\hat{\theta}$ and the range of plausible values of θ vary with the sample. This sampling variability may be addressed by thinking of $\hat{\theta}$ as a random variable $\hat{\theta}(\mathbf{X})$, a function of the random vector $\mathbf{X}$. Denote the maximum likelihood estimate of θ based on a random sample of size n by $\hat{\theta}_n = \hat{\theta}(X_1, \ldots, X_n)$.

Under repeated sampling $\hat{\theta}$ differs from the true value of θ by an amount which for large n is approximately Normally distributed. Furthermore, we can estimate its variance using the log likelihood.

Key Fact 1 (Asymptotic Normality of mles in the iid case).

In the random sample case, under mild conditions, as sample size n → ∞, $\hat{\theta}$ has an approximately multivariate Normal distribution with mean given by the true value of θ and covariance matrix estimated by $J(\hat{\theta})^{-1}$. In the 1-dimensional special case, $\hat{\theta}$ has an approximately Normal distribution with mean given by the true value of θ and variance estimated by $1/J(\hat{\theta})$.

In the 1-dimensional case, we immediately get the approximate 95% confidence interval for the true value $\theta_0$ of θ:

\[ \left( \hat{\theta} - 1.96\sqrt{1/J(\hat{\theta})},\; \hat{\theta} + 1.96\sqrt{1/J(\hat{\theta})} \right). \]

Figure 4: Log-relative-likelihood for different exponential samples of size 20 (horizontal axis θ, vertical axis log relative likelihood)

Example 7. Exponential sample continued

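From Example 6, we have

\[ \partial l/\partial\theta = n(1/\theta - \bar{x}) \]

and

\[ \frac{\partial^2 l(\theta)}{\partial\theta^2} = -\frac{n}{\theta^2}. \]

Hence $J(\theta) = n/\theta^2$, and as $\hat{\theta} = 1/\bar{x}$ we have $J(\hat{\theta}) = n\bar{x}^2$.

Hence an approximate 95% confidence interval for θ is

\[ \left( \frac{1}{\bar{x}} - 1.96\,\frac{1}{\bar{x}\sqrt{n}},\; \frac{1}{\bar{x}} + 1.96\,\frac{1}{\bar{x}\sqrt{n}} \right). \]

For example if $\bar{x} = 0.5$ and n = 20 we get (1.12, 2.88) as our approximate confidence interval for θ. (Recall that the expected value of an Exp(θ) random variable is 1/θ.)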
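The interval in Example 7 is easy to compute in R. In the sketch below the function name `exp_ci` is hypothetical, and the sample size n = 20 is an assumption: it is the value that reproduces the interval (1.12, 2.88) quoted above for a sample mean of 0.5.

```r
# Approximate confidence interval for the rate of an exponential sample,
# based on asymptotic Normality of the mle with J(theta_hat) = n * xbar^2.
# exp_ci is a hypothetical helper name; n = 20 below is an assumed sample size.
exp_ci <- function(xbar, n, level = 0.95) {
  theta_hat <- 1 / xbar
  se <- 1 / (xbar * sqrt(n))            # sqrt(1 / J(theta_hat))
  z <- qnorm(1 - (1 - level) / 2)       # about 1.96 when level = 0.95
  c(lower = theta_hat - z * se, upper = theta_hat + z * se)
}

ci <- exp_ci(xbar = 0.5, n = 20)
round(ci, 2)   # lower 1.12, upper 2.88
```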


3.5 Likelihood Ratio Tests

3.5.1 (a) Likelihood Ratio Test for a simple null hypothesis

Suppose $\theta^*$ is a specific value and we wish to test the hypothesis $H_0 : \theta = \theta^*$ against the alternative $H_1 : \theta \ne \theta^*$.

The relative likelihood $RL(\theta^*) = L(\theta^*)/L(\hat{\theta})$ is useful for this, because

$RL(\theta^*)$ small suggests evidence against $H_0$

$RL(\theta^*)$ close to 1 suggests $H_0$ plausible

Thus, since $\log RL(\theta^*) = l(\theta^*) - l(\hat{\theta})$,

$l(\theta^*) - l(\hat{\theta})$ well below 0 suggests evidence against $H_0$

$l(\theta^*) - l(\hat{\theta})$ close to 0 suggests $H_0$ plausible

Equivalently (for reasons apparent below),

$W = -2(l(\theta^*) - l(\hat{\theta}))$ well above 0 suggests evidence against $H_0$

$W = -2(l(\theta^*) - l(\hat{\theta}))$ close to 0 suggests $H_0$ plausible

How far above 0 could $W = -2(l(\theta^*) - l(\hat{\theta}))$ be even when $H_0$ is true?

Key Fact 2 (Wilks’ Theorem I: Asymptotic $\chi^2$ distribution of W).

In the random sample case, when the true value of θ is $\theta^*$ (i.e. $H_0$ true),

\[ W = -2(l(\theta^*) - l(\hat{\theta}_n)) = -2\log RL(\theta^*) \to \chi^2_p \tag{10} \]

in distribution as n → ∞, where p is the dimension of θ.

Large observed values of W will be critical of $H_0$. Key Fact 2 tells us that when $H_0$ is true, the probability of observing a value of W at least as large as a particular w is

\[ p_{obs} = P(W \ge w \mid H_0) \approx P(\chi^2_p \ge w). \tag{11} \]

Thus the following test procedure is reasonable.

Likelihood Ratio Test for the Simple Hypothesis H0,

1. From the data calculate the observed value w of the test statistic W ,

2. Find (from χ2 tables or via a computer) the probability pobs = P (χ2p ≥ w)

3. Interpret pobs as a measure of the weight of evidence in the data against H0 in the

sense that the smaller pobs, the more surprising would the observed data be if H0 were

true (and therefore the stronger the evidence against H0).
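The three steps of the test can be sketched in R for the exponential model of Example 6, where the log likelihood is n(log θ − x̄θ) and the dimension is p = 1. The function name `lrt_exp`, the simulated data and the hypothesised value 2 are assumptions for illustration.

```r
# Likelihood ratio test of H0: theta = theta_star for an exponential sample.
# lrt_exp is a hypothetical helper name; the data below are simulated.
lrt_exp <- function(x, theta_star) {
  n <- length(x)
  xbar <- mean(x)
  loglik <- function(theta) n * (log(theta) - xbar * theta)
  w <- -2 * (loglik(theta_star) - loglik(1 / xbar))   # step 1: observed W
  p_obs <- pchisq(w, df = 1, lower.tail = FALSE)      # step 2: P(chi^2_1 >= w)
  c(W = w, p_obs = p_obs)
}

set.seed(6)
x <- rexp(30, rate = 2)
res <- lrt_exp(x, theta_star = 2)
res   # step 3: a small p_obs would indicate evidence against H0
```

Note the use of `lower.tail = FALSE`, since `pchisq` returns the lower tail probability by default.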


A likelihood region (or, when p = 1, a likelihood interval) is a set of the form $\{\theta : l(\theta) > l(\hat{\theta}) - c\}$ for some constant c, and can be interpreted as the set of values of θ which are plausible in the light of the data. The choice of the constant c allows a tuning of how strong the interpretation of the word “plausible” is here, and Wilks’ Theorem gives us a way to choose c. If we choose $2c = \chi^2_{p,0.95}$, the 95% point of the $\chi^2_p$ distribution, then, when θ is the true parameter value,

\[ P(l(\theta) > l(\hat{\theta}) - c) = P(l(\hat{\theta}) - l(\theta) < c) \approx P(\chi^2_p < 2c) = 0.95, \]

so the likelihood region with this c is an approximate 95% confidence region. For example, when p = 1 we have $\chi^2_{1,0.95} = 3.84$, so we can choose c = 1.92, sometimes approximated as c = 2.
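For a one-parameter model the end points of a likelihood interval are the two solutions of l(θ) = l(θ̂) − c, one on each side of the mle, and can be found with a root finder. The R sketch below does this for the exponential log likelihood n(log θ − x̄θ); the function name `lik_interval`, the bracketing intervals and the inputs x̄ = 0.5, n = 20 are illustrative assumptions.

```r
# Likelihood interval {theta : l(theta) > l(theta_hat) - c} for an
# exponential sample, found by solving l(theta) - l(theta_hat) + c = 0
# on either side of the mle. lik_interval is a hypothetical helper name.
lik_interval <- function(xbar, n, c_val = qchisq(0.95, df = 1) / 2) {
  theta_hat <- 1 / xbar
  loglik <- function(theta) n * (log(theta) - xbar * theta)
  g <- function(theta) loglik(theta) - loglik(theta_hat) + c_val
  # g is positive at theta_hat and negative far from it on both sides
  lower <- uniroot(g, c(theta_hat * 1e-6, theta_hat), tol = 1e-9)$root
  upper <- uniroot(g, c(theta_hat, theta_hat * 100), tol = 1e-9)$root
  c(lower = lower, upper = upper)
}

ci <- lik_interval(xbar = 0.5, n = 20)
ci
```

Unlike the interval based on asymptotic Normality, this interval need not be symmetric about the mle.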

3.5.2 (b) Generalized Likelihood Ratio Test

The Generalized Likelihood Ratio Test extends the test above to more general (composite) hypotheses $H_0$. Suppose that θ is p-dimensional with values in a set $\Theta \subseteq \mathbb{R}^p$ and suppose that we wish to test a null hypothesis $H_0$ that the true value θ belongs to a subspace $\Theta_0$ of Θ, where $\Theta_0$ is q-dimensional with q < p. The alternative hypothesis $H_1$ is that $\theta \in \Theta \setminus \Theta_0$.

Consider the statistic

\[ GLR = \max_{\theta \in \Theta_0} RL(\theta) = \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta} L(\theta)} = \frac{L(\tilde{\theta})}{L(\hat{\theta})}, \]

where $\hat{\theta}$ is the usual mle (the global mle) and $\tilde{\theta}$ is the value of θ maximizing L within $\Theta_0$ (the restricted mle). Note that when $\Theta_0$ is a single point, $-2\log GLR$ reduces to the W used in subsubsection (a) above.

In general if $H_0$ is true, then the global maximum of L is likely to occur close to $\Theta_0$, so the global mle $\hat{\theta}$ and the restricted mle $\tilde{\theta}$ are likely to nearly coincide and GLR to take a value close to 1. On the other hand, if $H_1$ is true, then the maximum of L within $\Theta_0$ is likely to be considerably less than the global maximum, so GLR will tend to be considerably smaller than 1. This suggests that a test statistic for $H_0$ could be based on GLR. We could use GLR directly, but the fact that $-2\log GLR$ reduces to W above in the special case where $\Theta_0$ is a single point suggests using this instead. We expand the definition of W in Key Fact 2 by writing now $W = -2\log GLR$. Then values of W close to zero are expected under $H_0$ and larger values under $H_1$. The distribution of W under $H_0$ is given by

Key Fact 3 (Wilks’ Theorem II: Asymptotic $\chi^2$ distribution of W).

In the random sample case, when the true value of the parameter $\theta \in \Theta_0$,

\[ W = -2(l(\tilde{\theta}) - l(\hat{\theta})) = -2\log GLR \to \chi^2_{p-q} \tag{12} \]

in distribution, as n → ∞, where p is the dimension of Θ, the full parameter space, and q is the dimension of the restricted parameter space $\Theta_0$.

This gives:

Generalized Likelihood Ratio Test for the Composite Hypothesis H0,

1. From the data calculate the observed value w of the test statistic $W = -2\log GLR$,

2. Find (from $\chi^2$ tables or via a computer) the p-value $P(\chi^2_{p-q} \ge w)$

3. Interpret the p-value as a measure of the weight of evidence in the data against $H_0$ in the sense that the smaller the p-value, the stronger the evidence against $H_0$.

Notes:

1. The degrees of freedom p− q in the limiting χ2 distribution can often be interpreted

as the number of linear restrictions imposed on θ by confining it to Θ0.

2. When Θ0 is a single point then q = 0 and the test reduces to the version in (a) above.

Example 8. Two exponential samples

Imagine we have two samples from exponential distributions with possibly different parameters:

X1, X2, . . . , Xn from an exponential distribution with rate parameter φ and Y1, Y2, . . . , Yn from

an exponential distribution with rate parameter ψ. We are interested in whether φ = ψ; writing

θ = (φ, ψ), the null hypothesis that φ = ψ can be written as θ ∈ Θ0, where Θ0 is the line

ψ = φ. (We assume here that the sample sizes are the same, but the idea is easy to extend to

the case where they are not.)

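The log likelihood given data x and y is

\[ l(\phi, \psi; \mathbf{x}, \mathbf{y}) = n(\log\phi + \log\psi - \bar{x}\phi - \bar{y}\psi). \]

It is easy to see that the global MLE is $\hat{\theta} = (1/\bar{x}, 1/\bar{y})$. Thus

\[ l(\hat{\theta}) = -n(\log\bar{x} + \log\bar{y} + 2). \]

The restricted MLE assuming φ = ψ is $\tilde{\theta} = \left( \frac{2}{\bar{x}+\bar{y}}, \frac{2}{\bar{x}+\bar{y}} \right)$, so

\[ l(\tilde{\theta}) = -n\left( 2\log\left( \frac{\bar{x}+\bar{y}}{2} \right) + \frac{2\bar{x}}{\bar{x}+\bar{y}} + \frac{2\bar{y}}{\bar{x}+\bar{y}} \right). \]

So we have

\[ W = -2(l(\tilde{\theta}) - l(\hat{\theta})) = 2n\left( 2\log\left( \frac{\bar{x}+\bar{y}}{2} \right) - \log\bar{x} - \log\bar{y} \right). \]

For example, with n = 10, $\bar{x} = 2.3$ and $\bar{y} = 1.7$, we get W = 0.455, which is to be compared with $\chi^2_1$. As pchisq(0.455, 1, lower.tail = FALSE) in R gives 0.5, there is no evidence in this case against $H_0$.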
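The calculation in Example 8 can be reproduced with a few lines of R. The function name `glr_two_exp` is a hypothetical choice; the formula for W is the one derived above, and the p-value is the upper tail probability of the chi-squared distribution with p − q = 1 degree of freedom.

```r
# Generalized likelihood ratio test of phi = psi for two exponential
# samples of equal size n with sample means xbar and ybar.
# glr_two_exp is a hypothetical helper name.
glr_two_exp <- function(xbar, ybar, n) {
  w <- 2 * n * (2 * log((xbar + ybar) / 2) - log(xbar) - log(ybar))
  p_value <- pchisq(w, df = 1, lower.tail = FALSE)   # upper tail of chi^2_1
  c(W = w, p_value = p_value)
}

res <- glr_two_exp(xbar = 2.3, ybar = 1.7, n = 10)
round(res, 3)   # W = 0.455, p_value = 0.5
```

Because 0.455 is almost exactly the median of the chi-squared distribution with 1 degree of freedom, the lower and upper tail probabilities both round to 0.5 here; in general only the upper tail gives the p-value.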

4 Inference for discrete time Markov chains

This chapter looks at modelling and statistical inference for Markov chains, considering

a particular modelling problem and examining how far a simple specification of time-

dependence can be useful for it.


4.1 A weather problem

4.1.1 Setting

Weather forecasts such as those issued by the Met Office are based on very detailed obser-

vations made at a large network of weather stations. These data are assimilated into large

meteorological models based on atmospheric physics. Computations of the changes to the

atmosphere are then carried out at a fine grid of points. The results give useful short term

predictions of the weather. However, there is a need – for example in computer models of

longer-term changes in the Earth system – to model weather over longer periods without the

extensive information needed by the short-range meteorological forecasting models. Since

weather is unpredictable, it is natural to seek a probability model.

We focus on the problem of constructing, fitting and checking a model for daily rainfall

at a specific location. As a first step, given observed rainfall amounts, r1, . . . , rn say, on

a sequence of consecutive days, we suppose that daily rainfalls are the realised values of a

sequence of random variables R = R1, . . . , Rn.

What characteristics do we expect the Ri to have? What range of values? Will their

distribution be constant throughout the year?

Figure 5 shows the observed daily rainfall at Snoqualmie Falls, Washington, US, 1948–1983 (given by Guttorp 1995). The measurements were recorded in units of 0.01 inches. At this scale it is difficult to see much structure.

[Figure: "Rainfall at Snoqualmie Falls". Horizontal axis: days from 1st January, 1948 (0 to 12000); vertical axis: daily rainfall, in inches (0 to 4).]

Figure 5: Rainfall at Snoqualmie Falls, Washington State, 1948-1983


4.1.2 Wet and dry

The fact that on some days there is no rain and on others a positive amount suggests that

it might be useful to consider first the succession of dry days and wet days, and later model

the amount of rain falling on a day conditionally on the day being wet.

Let

$$X_i = \begin{cases} 1 & \text{if } R_i > 0 \\ 0 & \text{if not} \end{cases}$$

so that Xi is an indicator variable for day i being wet. Realizations of the Xi in January

1983 are shown in Figure 6. It rained on 26 days that month.

[Figure: "Wet and dry days, January 1983". Horizontal axis: days from 1st January, 1983 (0 to 30); vertical axis: wet day indicator (0.0 to 1.0).]

Figure 6: Wet (1) and dry (0) days at Snoqualmie Falls, January 1983

We seek a simple model for the Xi. It is possible that precipitation patterns vary seasonally,

so initially we concentrate on modelling only within a limited period during which conditions

can be expected to remain reasonably stable. For this reason we concentrate on the sequence

of Xi during Januaries.

4.2 Independence model

The simplest model supposes that the Xi are independent and identically distributed; that

is, they form a Bernoulli sequence.


4.2.1 Fitting the independence model

If $p = P(X_i = 1)$ then the likelihood of $p$ based on observations $x_1, \ldots, x_n$ is

$$L(p) = p^{\sum_{i=1}^n x_i} (1-p)^{n - \sum_{i=1}^n x_i}.$$

Over the Januaries 1948-1983 there were 791 wet days out of 1116 days altogether. Thus $\sum_{i=1}^{1116} x_i = 791$ and the log-likelihood is

$$l(p) = 791 \log(p) + 325 \log(1-p);$$

see Figure 7.

[Figure: "Log-likelihood for probability of a wet day". Horizontal axis: p (0.66 to 0.74); vertical axis: log-likelihood (−679 to −673).]

Figure 7: Log-likelihood for p = P (Wet): independence model.

The MLE is $\hat{p} = \sum_{i=1}^n x_i / n = 791/1116 = 0.71$, and an approximate 95% confidence interval for $p$ obtained from the log-likelihood is (0.68, 0.74).

Notes:

1. In fact in this case we can find a confidence interval directly, since under our assumptions $\sum_{i=1}^n X_i$ has a Binomial Bin(1116, p) distribution, so that the estimated standard error of $\hat{p}$ is $\sqrt{\hat{p}(1-\hat{p})/n}$ and an approximate 95% confidence interval based on the Normal approximation is $\hat{p} \pm 2 \times \text{e.s.e.}$ Here the two intervals, likelihood-based and Binomial-based, are numerically the same to 2 dp.

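2. R code for producing Fig. 7 is

    # Log-likelihood l(p) = 791 log(p) + 325 log(1 - p), plotted near its maximum
    curve(791*log(x) + 325*log(1-x), from=0.66, to=0.75,
          ylab="log-likelihood", xlab="p")
    # Dashed line 1.92 (half the 95% point of chi-squared on 1 df) below the maximum
    abline(h=-675.135, lty=2)
    title("Log-likelihood for probability of a wet day")

   The locator function is useful for reading off the ends of the likelihood interval from the plot.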
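The two intervals discussed in these notes can also be computed numerically rather than read off the plot. The sketch below is one way of doing so, not the method used to produce the figures quoted above. It finds the likelihood interval with `uniroot`, using bracketing points (0.5 and 0.9) chosen here for convenience, and computes the Normal-approximation interval as $\hat{p} \pm 2$ estimated standard errors.

```r
x <- 791    # wet days
n <- 1116   # January days, 1948-1983
p_hat <- x / n

loglik <- function(p) x * log(p) + (n - x) * log(1 - p)

# Normal-approximation interval: p_hat +/- 2 estimated standard errors
ese  <- sqrt(p_hat * (1 - p_hat) / n)
wald <- p_hat + c(-2, 2) * ese

# Likelihood interval: points where l(p) falls qchisq(0.95, 1)/2 below its maximum
cutoff <- loglik(p_hat) - qchisq(0.95, df = 1) / 2
g      <- function(p) loglik(p) - cutoff
lik    <- c(uniroot(g, c(0.5, p_hat), tol = 1e-8)$root,
            uniroot(g, c(p_hat, 0.9), tol = 1e-8)$root)

round(p_hat, 3)
round(wald, 3)
round(lik, 3)
```

The value of `cutoff` is the height of the dashed line drawn by `abline` in the plotting code, so the two roots are the points that `locator` would be used to read off.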

4.2.2 Model adequacy

Is the independence model adequate? Does it faithfully represent features of the data? We

might suspect not, because large weather systems may take several days to pass a point, and

whilst they are doing so the weather conditions at the point will tend to persist, undermining

the independence assumption.

For a rough check, consider the number of dry days followed by wet days. Under independence, the expected number of such pairs in the data is

$$E(\text{no. dry-wet pairs}) = 36 \times 30 \times (1-p)p,$$

since there are 36 Januaries in the data-set and 30 pairs of days in each. Similarly for other combinations of wet and dry. To estimate these expected numbers we replace $p$ by $\hat{p}$.

Table 1 gives the observed frequencies of the four pairings of Wet and Dry days over the 36

Januaries in the data-set, and their estimated expected numbers.

Table 1: Numbers of pairs of days; Januaries 1948–1983 (estimated expected numbers in brackets)

                          Today
                   Wet            Dry            Total
Yesterday  Wet     643 (542.6)    128 (222.9)     771
           Dry     123 (222.9)    186 (91.6)      309
           Total   766            314            1080

Scrutiny of Table 1 shows more Dry-Dry and Wet-Wet pairs than expected, and correspond-

ingly fewer mixed pairs. This is what persistence of weather would suggest.
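The bracketed expected numbers in Table 1 follow directly from $\hat{p}$ and the number of pairs. A minimal R sketch of the calculation is given here; the object names are illustrative.

```r
p_hat   <- 791 / 1116
n_pairs <- 36 * 30   # 36 Januaries, 30 consecutive pairs in each

probs <- c(Wet = p_hat, Dry = 1 - p_hat)

# Under independence P(yesterday = a, today = b) = P(a) P(b)
expected <- n_pairs * outer(probs, probs)

observed <- matrix(c(643, 128,
                     123, 186),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Yesterday = c("Wet", "Dry"),
                                   Today     = c("Wet", "Dry")))

round(expected, 1)
observed - round(expected, 1)   # positive on the diagonal: more persistence than expected
```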

To see whether the discrepancies in Table 1 could reasonably be ascribed to chance under our independence model, we might consider carrying out a goodness of fit test. The Pearson $X^2$ statistic

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

is a natural choice for a test statistic for such a test, since large values of $X^2$ will indicate discrepancy from our expectations under the independence assumption. For a standard $2 \times 2$ contingency table, the observed value of $X^2$ would be compared with a $\chi^2_1$ distribution, which is the approximate distribution of $X^2$ under the hypothesis of independence in a contingency table.

For such a table the underlying model is a Multinomial distribution with four categories,

and the basis for use of this distribution is a sampling scheme in which each of the items


whose frequencies are recorded in the table can fall independently into the four cells with

prescribed probabilities.

Is that appropriate for Table 1?

In Table 1 the items counted in the four cells are pairs of days. From the way the data have been collected it is clear that consecutive pairs of days overlap; the second day of one is the first day of the next (within each January). Thus our null hypothesis of independence and identical distribution of days does not translate immediately into the standard Multinomial basis for an $X^2$ goodness of fit test. We need an extension of the standard theory to the case when items are collected in such a way that each is dependent on its predecessor.

The developments later will provide a proper basis for the test above, but in the meantime it is interesting to resort to a back-of-the-envelope assessment of fit. The difficulty in applying the usual $X^2$ test based on a $\chi^2$ distribution arises because the data come from overlapping pairs of days. If instead we consider non-overlapping consecutive pairs, then individual items being classified would be independent under our hypothesis and the $\chi^2$ theory would apply. Frequencies from non-overlapping pairs are:

Table 2: Numbers of non-overlapping pairs of days; Januaries 1948–1983 (estimated expected numbers in brackets)

                          Today
                   Wet            Dry            Total
Yesterday  Wet     325 (271.3)     55 (111.5)     380
           Dry      66 (111.5)     94 (45.8)      160
           Total   391            149             540

A standard $X^2$ test applied to Table 2 (with expected values calculated from the marginal totals in the table rather than from our estimate $\hat{p} = 0.709$) gives $X^2 = 107$. For comparison the 95% quantile of $\chi^2_1$ is 3.84. There is overwhelming evidence against the independence assumption even with this partial view of the data. Wetness or dryness has a tendency to persist from one day to another to a greater extent than the independence theory would predict.
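A test of this kind can be run in R as sketched below, with expected counts taken from the marginal totals of Table 2. This is an illustration, not necessarily the exact calculation behind the figure quoted above: the statistic depends, for instance, on whether a continuity correction is applied, so the value obtained need not match the quoted one exactly. It is of the same order and leads to the same conclusion.

```r
tab2 <- matrix(c(325, 55,
                  66, 94),
               nrow = 2, byrow = TRUE,
               dimnames = list(Yesterday = c("Wet", "Dry"),
                               Today     = c("Wet", "Dry")))

# Expected counts from the marginal totals: (row total x column total) / n
expected <- outer(rowSums(tab2), colSums(tab2)) / sum(tab2)

# Pearson statistic, without continuity correction
X2 <- sum((tab2 - expected)^2 / expected)

# The same statistic via the built-in test
test <- chisq.test(tab2, correct = FALSE)

X2
unname(test$statistic)
qchisq(0.95, df = 1)   # 95% point of chi-squared on 1 df
```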

This points to the need for the analysis of dependent variables. We need models that allow a variable's value today to affect its value tomorrow. A discrete time Markov chain is such a model.

4.3 Estimation of transition probabilities

We now assume that we have observations of a process which we think might be well

modelled by a Markov chain on a known state space S, but with unknown transition matrix

P , and think about how to estimate P . One way is to use likelihood, generalizing the ideas

of Example 5.

Suppose we have observations $x_0, \ldots, x_n$ modelled as successive states of a Markov chain $X_0, \ldots, X_n$ with transition probabilities $p_{ij}$. Then we have the following expression for the probability of observing a given sequence of states.

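$$
\begin{aligned}
& P(X_0 = x_0, \ldots, X_n = x_n) && (13) \\
&= P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_0 = x_0)\, P(X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) \\
&= P(X_n = x_n \mid X_{n-1} = x_{n-1})\, P(X_{n-1} = x_{n-1} \mid X_{n-2} = x_{n-2}, \ldots, X_0 = x_0) \\
&\qquad \times P(X_{n-2} = x_{n-2}, \ldots, X_0 = x_0) \\
&= P(X_n = x_n \mid X_{n-1} = x_{n-1})\, P(X_{n-1} = x_{n-1} \mid X_{n-2} = x_{n-2}) \cdots \\
&\qquad \cdots \times P(X_1 = x_1 \mid X_0 = x_0) \times P(X_0 = x_0) \\
&= P(X_0 = x_0)\, p_{x_0 x_1}\, p_{x_1 x_2} \cdots p_{x_{n-1} x_n} && (14)
\end{aligned}
$$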

by the Markov property. The expression in (14) is the likelihood L for the parameters

P = (pij).

Denote by $n_{ij}$ the number of transitions observed from state $i$ to state $j$. Then (14) can be written as

$$L = L(P) = P(X_0 = x_0) \prod_{i,j} p_{ij}^{n_{ij}},$$

and the log likelihood for $P$ becomes

$$l(P) = \log P(X_0 = x_0) + \sum_{i,j} n_{ij} \log p_{ij}. \qquad (15)$$

As usual, values of the pij giving large l are plausible in the light of the data.
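To illustrate how (15) ranks candidate transition matrices, the sketch below evaluates the sum $\sum_{i,j} n_{ij} \log p_{ij}$ for the transition counts of Table 1. The first term $\log P(X_0 = x_0)$ is left out here, which is an assumption of this sketch. The function name `markov_loglik` and the two candidate matrices are hypothetical, chosen only for illustration.

```r
# Sum over i, j of n_ij * log(p_ij); the initial-state term is omitted
markov_loglik <- function(counts, P) {
  stopifnot(all(dim(counts) == dim(P)),
            all(abs(rowSums(P) - 1) < 1e-8))
  # Cells with zero count contribute nothing, even where p_ij = 0
  sum(ifelse(counts > 0, counts * log(P), 0))
}

# Transition counts from Table 1, rows = yesterday, columns = today (Dry, Wet)
counts <- matrix(c(186, 123,
                   128, 643), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Dry", "Wet"), c("Dry", "Wet")))

# Two hypothetical candidate matrices, chosen only for illustration
P_indep   <- matrix(c(0.29, 0.71,
                      0.29, 0.71), nrow = 2, byrow = TRUE)
P_persist <- matrix(c(0.60, 0.40,
                      0.20, 0.80), nrow = 2, byrow = TRUE)

markov_loglik(counts, P_indep)
markov_loglik(counts, P_persist)   # larger, so more plausible given the data
```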

The maximum likelihood estimates of the transition probabilities are obtained by maximizing $l(P)$ subject to $P$ being a transition matrix. This means that we will need to apply the constraint that the sum of the entries in each row is 1; we can do this either using Lagrange multipliers (see MAS211) or by, for example, writing the last term in each row in terms of the others as we did in Example 5: if $S = \{1, 2, \ldots, N\}$ then $p_{i,N} = 1 - (p_{i,1} + p_{i,2} + \ldots + p_{i,N-1})$. In some cases we may assume a particular form for the transition matrix where the constraint that the row sums are 1 is already assumed.

There is a question of how to treat the first term in (15), since so far its dependence on the parameters $\{p_{ij}\}$ has not been specified. One approach is to argue conditionally on the observed value $x_0$ of $X_0$. In this case the (conditional) likelihood is defined as $P(X_n = x_n, \ldots, X_1 = x_1 \mid X_0 = x_0)$ and the above argument gives a log-likelihood without the $P(X_0 = x_0)$ term. (Check!) We follow this approach below.

Another approach is to assume that the chain is in equilibrium, meaning that we can take

P (X0 = x0) to be the probability of finding the chain in state x0 under its stationary

distribution, as we did in Example 5.

In practice, when n is large, l is likely to be dominated by its second term, so the precise

treatment of the first term is unimportant anyway.

Suppose the state space $S$ is finite. Then, arguing conditionally on $X_0 = x_0$, we wish to maximize $\sum_{i,j} n_{ij} \log p_{ij}$ subject to $\sum_j p_{ij} = 1$, for $i = 1, \ldots, |S|$.


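Form the Lagrangian function

$$\mathcal{L}(P, \lambda) = \sum_{i,j} n_{ij} \log p_{ij} + \sum_i \lambda_i \Big( \sum_j p_{ij} - 1 \Big).$$

Then

$$\frac{\partial \mathcal{L}}{\partial p_{ij}} = \frac{n_{ij}}{p_{ij}} + \lambda_i, \quad i, j = 1, \ldots, |S|. \qquad (16)$$

Setting (16) equal to zero, we get that

$$\hat{p}_{ij} = \frac{n_{ij}}{-\lambda_i},$$

and thus, using $\sum_j \hat{p}_{ij} = 1$ (which gives $-\lambda_i = \sum_j n_{ij}$), it follows that

$$\hat{p}_{ij} = n_{ij}/n_{i\cdot}. \qquad (17)$$

Here we use $n_{i\cdot}$ to mean $\sum_j n_{ij}$, the total number of transitions from state $i$. Similarly, we will later use $n_{\cdot\cdot}$ to mean $\sum_{i,j} n_{ij}$.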
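In practice (17) amounts to tabulating consecutive pairs of states and dividing each row of counts by its total. The R sketch below shows one way of doing this for an observed sequence. The helper names `transition_counts` and `estimate_P` and the short input sequence are made up for illustration.

```r
# Count transitions i -> j in a sequence of states
transition_counts <- function(x, states = sort(unique(x))) {
  from <- factor(x[-length(x)], levels = states)   # x_0, ..., x_{n-1}
  to   <- factor(x[-1],         levels = states)   # x_1, ..., x_n
  table(from, to)
}

# MLE (17): divide each row of counts by its row total
estimate_P <- function(counts) prop.table(counts, margin = 1)

# A short made-up sequence, purely for illustration
x <- c(0, 0, 1, 1, 1, 0, 1)
counts <- transition_counts(x)
P_hat  <- estimate_P(counts)

counts
P_hat
rowSums(P_hat)   # each row sums to 1
```

A state that is never left in the data has row total zero, in which case its row of the estimate is undefined (R returns `NaN`); such states need separate treatment.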

Example 9. Snoqualmie Falls

Transition count data for Januaries at Snoqualmie Falls 1948–1983 are given in Table 1. We use them to fit a 2-state Markov chain model for the sequence of wet and dry days. In the discussion above it was assumed that the data arose from a single consecutive set of observations. However, the data in Table 1 come from 36 separate Januaries. Under a Markov chain model the log-likelihood has the form (15) for each January, but for the full likelihood we need to consider the relationship between years. In fact it does not seem unreasonable in a first model to suppose that years are independent. We suppose also that the transition probabilities $\{p_{ij}\}$ are the same over the years. There may be some room for doubt about this latter assumption given El Niño/La Niña effects in the Pacific, and given climate change, and this may be a topic to be investigated further later.

Thus as a first model we assume independence from one year to another, and stationarity over years. Let $n^{(k)}_{ij}$ be the observed transition count from $i$ to $j$ in year $k$. Then the likelihood of $P$ based on the observed sequence of wet/dry days, conditionally on the first observation in January each year, is

$$L(P) = \prod_k \prod_{i,j} p_{ij}^{n^{(k)}_{ij}},$$

giving log-likelihood

$$l(P) = \sum_k \sum_{i,j} n^{(k)}_{ij} \log p_{ij} = \sum_{i,j} n_{ij} \log p_{ij},$$

where the $n_{ij}$ are the total transition counts. The maximization used above therefore applies directly, and from Table 1 gives the estimated transition matrix shown in Table 3.

The estimated probabilities are quite different from those based on the independence model,

which would be (0.29, 0.71) in each row.
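The estimates can be reproduced from the Table 1 counts in a few lines of R, and set beside the row that the independence model would give. The object names in this sketch are illustrative.

```r
counts <- matrix(c(186, 123,
                   128, 643), nrow = 2, byrow = TRUE,
                 dimnames = list(Yesterday = c("Dry", "Wet"),
                                 Today     = c("Dry", "Wet")))

# MLE: each count divided by its row total
P_hat <- counts / rowSums(counts)

# Under independence every row would be the marginal distribution of a day
p_wet <- 791 / 1116
indep_row <- c(Dry = 1 - p_wet, Wet = p_wet)

round(P_hat, 3)
round(indep_row, 2)
```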

Table 3: Estimated Transition Matrix; Januaries 1948–1983

                          Today
                   Dry                Wet
Yesterday  Dry     0.602 (186/309)    0.398 (123/309)
           Wet     0.166 (128/771)    0.834 (643/771)

We are of course also interested in the uncertainty in these transition probability estimates. We can address this too by considering the (conditional) likelihood function. To allow for the constraint on the parameters in a given row, we can reparameterise the transition matrix in terms of the first $|S| - 1$ probabilities on each row, and look for a likelihood region for them. This is straightforward when $|S| = 2$, less so in other cases. In practice, often we will look instead at an asymptotic approximation, to be covered in section 4.5.

The Markov chain model in the example has been fitted by considering pairs of days. A

natural question to ask is how well does it represent features of the data other than those

used in its fitting?

Example 10. Snoqualmie Falls: persistence of spells

A feature of the wet/dry sequence is the length of persistence of wet spells and dry spells. The

data from January 1983 are:

1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0

in which 1 represents a wet day and 0 a dry day. Evidently there were 3 dry spells whose

beginnings and ends fell in this month, and each lasted only 1 day. Over all 36 years the average

length of dry spells that did not overlap Jan. 1 or Jan. 31 is found to be 2.21 days and the

standard deviation of the lengths, 1.64 days.
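Spell lengths of this kind can be extracted from a 0/1 sequence with run-length encoding. The sketch below applies R's `rle` to the January 1983 data listed above and keeps only the dry spells that begin and end within the month. The variable names are illustrative.

```r
# Wet (1) / dry (0) indicators for January 1983, as listed above
x <- c(1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,0)

runs <- rle(x)            # maximal runs of equal values
k    <- length(runs$lengths)

# A spell lies wholly within the month if it is neither the first nor the last run
interior   <- seq_len(k) > 1 & seq_len(k) < k
dry_spells <- runs$lengths[runs$values == 0 & interior]

sum(x)        # number of wet days: 26
dry_spells    # 1 1 1
```

The final run (the dry days on the 30th and 31st) is excluded because it may continue into February.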

These are observations. From the Markov properties, however, we can calculate the expected

duration of dry spells in the model, and then compare observed values with calculated to give a

rough check on the model’s realism.

Let T denote the duration of a dry spell. Since for us to observe a dry spell at all there must

be a wet day followed by a dry day, we can start numbering time from a wet day followed by a

dry day. Then

\[ P(T = 1) = P(X_2 = 1 \mid X_1 = 0, X_0 = 1) = p_{01}, \]

and for $k \ge 2$

\[ P(T = k) = P(X_{k+1} = 1, X_k = 0, \ldots, X_2 = 0 \mid X_1 = 0, X_0 = 1) = p_{01}\, p_{00}^{k-1}. \tag{18} \]

This is a geometric distribution with expectation

\[ E(T) = (1 - p_{00}) \sum_{k=1}^{\infty} k\, p_{00}^{k-1} = 1/(1 - p_{00}) \tag{19} \]

and

\[ E(T(T-1)) = (1 - p_{00})\, p_{00} \sum_{k=1}^{\infty} k(k-1)\, p_{00}^{k-2} = 2p_{00}/(1 - p_{00})^2 \]

and so variance

\[ E(T(T-1)) + E(T) - (E(T))^2 = p_{00}/(1 - p_{00})^2. \tag{20} \]

Substitution of $p_{00} = 0.602$ into (19) and (20) gives

Est. mean  2.51        Obs. mean  2.21
Est. var   $1.95^2$    Obs. var   $1.64^2$

Informally the agreement is encouraging.

A slightly closer check on the agreement between means may be carried out as follows. The

number of dry spells observed during the 36 Januaries was 112, so the estimated standard error

of the sample mean duration of dry spells is approximately $1.64/\sqrt{112} = 0.155$. The model-estimated mean (2.51) therefore differs from the observed mean (2.21) by about $0.30/0.155 \approx 1.9$ standard errors, which is reasonably plausible for an adequate model.
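The arithmetic behind this comparison can be checked in a few lines of R. This is only a sketch: the variable names are illustrative, and the observed summaries (2.21, 1.64 and 112) are simply copied from the text.

```r
p00 <- 0.602                        # estimated Dry -> Dry probability (Table 3)

mean_T <- 1 / (1 - p00)             # equation (19)
sd_T   <- sqrt(p00) / (1 - p00)     # square root of equation (20)

obs_mean <- 2.21; obs_sd <- 1.64; n_spells <- 112
se_mean <- obs_sd / sqrt(n_spells)  # standard error of the observed mean
z <- (mean_T - obs_mean) / se_mean  # difference measured in standard errors

round(c(mean = mean_T, sd = sd_T, se = se_mean, z = z), 3)
```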

In fact the comparison above is somewhat unfair to the model since it does not account fully

for the way in which the data were collected; it neglects the fact that only dry spells falling fully

within Januaries were counted. The next subsubsection shows that, after this fact is allowed for,

the difference of estimated and observed means is only about one standard error, giving even

less evidence against the Markov model. MAS6071 students should read section 4.3.1, but it is

optional for MAS371.

4.3.1 Allowing for boundary effects

Example 11. Snoqualmie Falls: persistence of spells, allowing for boundary effects

The comparison of observed and model-estimated means in Example 10 was unfair to the model

since it did not account fully for the way in which the data were collected; it neglected the

fact that only dry spells falling fully within Januaries were counted. Thus, for example, any

dry spell beginning in January but extending into February was ignored. The same was true for

any dry spell that actually began in December, or even on January 1st, and extended later into

January. A dry spell longer than 29 days could never be recorded under this system either. A

little reflection suggests that the qualitative effect is that the distribution of dry spells recorded

in the data is shifted towards smaller durations, and therefore both the mean and standard

deviation of recorded dry spells will be less than the true mean and standard deviation of all dry

spells appropriate for January conditions, which is what the model-derived estimates relate to.

To quantify the effect of the sampling bias on the distribution of dry spells we can argue as

follows. We are interested in the probabilities

\[ q_k = P(T = k \text{ and } T \text{ was observed in the January window}). \]

Saying that a dry spell falls within the January window amounts to saying that the spell’s first

(dry) day is one of the days from 2nd January to 30th January inclusive, and its last dry day does

not extend beyond 30th January. This depends on when the first dry day occurs, suggesting

conditioning on that time:

\[ q_k = \sum_{j=2}^{30} P(T = k,\ T \text{ obs. in Jan window} \mid T \text{ starts on day } j)\, P(\text{starts on day } j) \tag{21} \]

Given that a dry spell begins at all in the period, it is reasonable to suppose that its start day

will be equally likely to be any day, so we can take

P (starts on day j) = constant. (22)

Given that a dry spell starts on day $j$ and is of length $k$ days, it falls within the window iff $j + k - 1 \le 30$. Thus the first probability in the sum in (21) is

\[ P(T = k \text{ and } T \text{ observed} \mid T \text{ starts on day } j) = \begin{cases} P(T = k) & \text{if } j + k - 1 \le 30 \\ 0 & \text{if not} \end{cases} \tag{23} \]

and so from (21)–(23) and (18),

\[ q_k \propto \sum_{j=2}^{31-k} p_{01}\, p_{00}^{k-1} = (30 - k)\, p_{01}\, p_{00}^{k-1}, \quad k = 1, \ldots, 29. \tag{24} \]

The $q_k$ must sum to 1 over $k = 1, \ldots, 29$, so we can find their absolute values by dividing the terms in (24) by their sum. The following R commands carry out the calculation, replacing $p_{00}$ by its estimate 0.602 (the constant factor $p_{01}$ cancels on normalisation):

k <- c(1:29)

qk <- (30-k)*(0.602^(k-1))

qk <- qk/sum(qk)

The resulting probability function is shown (as a line for clarity, though it is defined only for

integer values) in Figure 8. For comparison the geometric probability function (18) is shown by

the dotted line.

As expected, length-biasing has shifted the distribution towards lower values but only by a very

small amount. The effect on mean and variance is calculated by:

meanqk <- sum(k*qk)

sqk <- sum(k*k*qk)

sdqk <- sqrt(sqk-meanqk^2)

giving

Est. length biased mean  2.37        Obs. mean  2.21

Est. length biased var   $1.80^2$    Obs. var   $1.64^2$

Comparison of the observed and model-predicted means now shows that they are only about

one standard error apart, giving no grounds on this aspect to question adequacy of the Markov

chain model.

4.4 Properties of the MLEs

In Chapter 3 it was claimed that random sampling leads to asymptotic normality of maximum likelihood estimators and $\chi^2$ distributions for differences of log likelihoods.

Essentially the same arguments as used for the independence case show that the following

facts about the maximum likelihood estimators themselves and the W statistics are true.

[Figure 8: Length-biased and geometric probability functions for duration of dry spells. Horizontal axis: k (0 to 30); vertical axis: Probability (0.0 to 0.4); legend: Length-biased (points), Geometric (dotted line).]

Theorem 13 (Asymptotic Normal and $\chi^2$ distributions for MLEs and $W$).

If $\{X_n : n = 0, 1, \ldots\}$ is a finite ergodic chain with transition matrix $P$, then as $n \to \infty$:

1. \[ (\hat{p}_{ij} - p_{ij})/\sqrt{V_{ij}} \to N(0, 1), \tag{25} \]

where $V_{ij} = p_{ij}(1 - p_{ij})/N_{i\cdot}$;

2. If $d$ is the number of non-zero entries in $P$, and $s = |S|$ the size of the state space, then when $P = P^*$

\[ W = -2(l(P^*) - l(\hat{P})) \to \chi^2_{d-s}; \tag{26} \]

3. If $\Theta$ is the full parameter space of the transition probabilities $P$ and $\Theta_0$ is a lower dimensional subspace of $\Theta$, then when the true value of $P$ belongs to $\Theta_0$

\[ W = -2(l(\tilde{P}) - l(\hat{P})) \to \chi^2_{d-s-q}, \tag{27} \]

where $\tilde{P}$ is the restricted maximum likelihood estimator of $P$ under the restriction to $\Theta_0$, and $q$ is the dimension of $\Theta_0$.

(Cf. Key Facts 1–3 in section 3.)

4.5 Approximate confidence intervals for pij

The result (25) in Theorem 13 claims that the estimate $\hat{p}_{ij}$ has an approximate normal distribution with mean $p_{ij}$ and variance $p_{ij}(1 - p_{ij})/N_{i\cdot}$.

An immediate application is the construction of approximate confidence intervals for the $p_{ij}$.

Example 12. Snoqualmie Falls

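The estimated transition matrix is

        Dry      Wet
Dry     0.602    0.398
Wet     0.166    0.834

These estimates were based on $n_{D\cdot} = 309$ and $n_{W\cdot} = 771$. Estimated standard errors are therefore

\[ \mathrm{ese}(\hat{p}_{DD}) = \sqrt{\hat{p}_{DD}(1 - \hat{p}_{DD})/n_{D\cdot}} = 0.028 \]

and

\[ \mathrm{ese}(\hat{p}_{WW}) = \sqrt{\hat{p}_{WW}(1 - \hat{p}_{WW})/n_{W\cdot}} = 0.013, \]

giving, for example, approximate 95% confidence intervals (estimate $\pm$ 1.96 estimated standard errors)

$p_{DD}$: (0.55, 0.66),

$p_{WW}$: (0.81, 0.86).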
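These standard errors and intervals can be reproduced directly from the transition counts. The sketch below assumes the counts from Table 3 and uses the normal approximation (25); the object names are illustrative.

```r
n <- matrix(c(186, 123,
              128, 643), nrow = 2, byrow = TRUE,
            dimnames = list(c("Dry", "Wet"), c("Dry", "Wet")))

n_row <- rowSums(n)                        # n_D. = 309, n_W. = 771
p_hat <- n / n_row                         # row-wise proportions (MLEs)
ese   <- sqrt(p_hat * (1 - p_hat) / n_row) # estimated standard errors

z     <- qnorm(0.975)                      # about 1.96
lower <- p_hat - z * ese
upper <- p_hat + z * ese

round(cbind(est = diag(p_hat), ese = diag(ese),
            lower = diag(lower), upper = diag(upper)), 3)
```

Only the diagonal entries are printed here; the off-diagonal entries of `lower` and `upper` give the corresponding intervals for the Dry → Wet and Wet → Dry probabilities.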

4.6 Test for a specified P

In some situations we may have reason to expect data to be modelled by a Markov chain

with a known transition matrix P ∗. The results in section 4.4 allow us to test the idea.

Our null hypothesis will be H0 : P = P ∗ with the alternative H1 usually being that P is

unrestricted. (In some settings certain transitions may be assumed to be impossible, in

which case the relevant entries of P will be assumed to be zero.) To carry out the test we

could use the log likelihood ratio test statistic $W = -2(l(P^*) - l(\hat{P}))$ to compare likelihood under $H_0$ with the maximum value of likelihood as in section 3.5. Large values of $W$ discredit $H_0$. According to (26) in Theorem 13, if $H_0$ is true $W$ should have a distribution that is approximately $\chi^2$ with $d - s$ degrees of freedom.

(When $H_1$ is that $P$ is unrestricted, $d$ here will be the number of entries in the matrix, i.e. $s^2$, so the number of degrees of freedom will be $s^2 - s$.)

The test statistic $W$ is given by

\[ W = -2(l(P^*) - l(\hat{P})) = 2 \sum_{i,j} n_{ij} \log\left( \frac{n_{ij}}{n_{i\cdot}\, p^*_{ij}} \right), \tag{28} \]

where $p^*_{ij}$ is the $(i, j)$ entry of $P^*$.

Example 13. Pseudo-random numbers

Imagine we have a sequence of random digits produced by a computer’s random number gen-

erator: for example

5, 3, 1, 8, 7, 9, 0, 6, 4, 5, . . .

Are they independent, or is there a tendency for certain digits to follow others?

Model the sequence as a Markov chain on the integers 0, 1, . . . , 9. Independence, and each digit

being equally likely, corresponds to the transition matrix being

\[ P^* = \begin{pmatrix} 1/10 & 1/10 & \cdots & 1/10 \\ 1/10 & \ddots & & 1/10 \\ \vdots & & \ddots & \vdots \\ 1/10 & \cdots & \cdots & 1/10 \end{pmatrix}. \]

We can therefore carry out the above test with this choice of $P^*$. The number of degrees of freedom will be $d - s = 100 - 10 = 90$. Thus an approximate p-value for an observed value $w_{obs}$ of $W$ is

\[ p = P(\chi^2_{90} > w_{obs}). \]

For example, the following are 200 random integers between 1 and 10, generated in R:

9 6 7 9 5 8 4 10 7 9 5 3 1 10 3 6 4 10 6 9 4 7 7 9 5 4 5 8 7 5 2 10 6 6 1 2 1 2 7 8 3 2 4 6 2 2

4 6 1 4 6 5 9 3 5 9 7 2 10 2 6 6 4 2 1 8 2 10 10 4 8 7 3 6 2 10 9 7 8 8 6 8 4 5 9 5 10 6 10 2 3

8 9 9 7 1 3 4 9 4 2 7 7 5 9 6 10 10 1 2 2 5 2 10 2 1 1 7 8 4 6 5 9 3 1 2 2 9 10 5 5 2 1 8 5 2 5

5 8 9 1 8 5 6 8 10 2 9 2 2 9 9 2 6 8 2 7 7 1 9 8 5 1 6 3 7 9 6 8 2 6 4 9 8 3 3 10 1 2 9 5 6 5 2

2 7 8 7 4 9 5 5 9 6 8 3 8 5 5 3

The transition counts are:

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 1 5 1 1 0 1 1 3 1 1

[2,] 4 5 1 2 2 3 4 0 4 5

[3,] 2 1 1 1 1 2 1 2 0 1

[4,] 0 2 0 0 2 4 1 1 3 2

[5,] 1 5 2 1 4 2 0 3 6 1

[6,] 2 2 1 3 3 2 1 5 1 2

[7,] 2 1 1 1 2 0 3 4 4 0

[8,] 0 3 3 3 4 1 3 1 2 1

[9,] 1 2 2 2 6 4 3 2 2 1

[10,] 2 4 1 1 1 3 1 0 1 2

The MLE for the transition matrix is:

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 0.067 0.333 0.067 0.067 0.000 0.067 0.067 0.200 0.067 0.067

[2,] 0.133 0.167 0.033 0.067 0.067 0.100 0.133 0.000 0.133 0.167

[3,] 0.167 0.083 0.083 0.083 0.083 0.167 0.083 0.167 0.000 0.083

[4,] 0.000 0.133 0.000 0.000 0.133 0.267 0.067 0.067 0.200 0.133


[5,] 0.040 0.200 0.080 0.040 0.160 0.080 0.000 0.120 0.240 0.040

[6,] 0.091 0.091 0.045 0.136 0.136 0.091 0.045 0.227 0.045 0.091

[7,] 0.111 0.056 0.056 0.056 0.111 0.000 0.167 0.222 0.222 0.000

[8,] 0.000 0.143 0.143 0.143 0.190 0.048 0.143 0.048 0.095 0.048

[9,] 0.040 0.080 0.080 0.080 0.240 0.160 0.120 0.080 0.080 0.040

[10,] 0.125 0.250 0.062 0.062 0.062 0.188 0.062 0.000 0.062 0.125

With these data we obtain W = 98.308. As 1-pchisq(98.308,90) in R gives a p-value of

0.258, there is no evidence against H0.

Notes:

1. If no transitions are observed in the data from state i to state j, then we will have

nij = 0. To interpret (28) in this case, you should interpret 0 log 0, which will appear

in the sum, as zero.

2. In the example above, if we were given the total number, $n_{i\cdot}$, of digits $i$ seen in the data, then, under $H_0$, we would expect to see $n_{i\cdot}\, p^*_{ij} = n_{i\cdot}/10$ transitions $i \to j$. The observed number is $n_{ij}$, so $W$ is a comparison of observed and expected frequencies. Writing $n_{i\cdot}\, p^*_{ij} = e_{ij}$ for notational convenience, we can express $W$ in (28) as

\[ W = 2 \sum_{i,j} n_{ij} \log\left( \frac{n_{ij}}{e_{ij}} \right). \tag{29} \]

This statistic is asymptotically equivalent in large samples to the Pearson $X^2$ statistic for comparing observed and expected frequencies

\[ X^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}. \]

3. The example above uses a W test based on Theorem 13 to check for independence,

but it can be used in the same way to test any other simple hypothesis about the

transition probabilities P .
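Notes 1 and 2 translate directly into code. The function below is a hedged sketch, not the routine used to obtain the value of W in Example 13: the name `w_specified` and its arguments are hypothetical, it assumes every entry of the hypothesised matrix is strictly positive, and the matrix `pstar` in the illustration is an arbitrary choice made only to show a call.

```r
# W statistic (28)/(29) and Pearson X^2 for H0: P = pstar.
# 'tran' is a matrix of transition counts, 'pstar' the hypothesised matrix
# (assumed here to have all entries strictly positive).
w_specified <- function(tran, pstar) {
  e   <- rowSums(tran) * pstar           # expected counts e_ij = n_i. p*_ij
  pos <- tran > 0                        # 0 log 0 is interpreted as zero
  w   <- 2 * sum(tran[pos] * log(tran[pos] / e[pos]))
  x2  <- sum((tran - e)^2 / e)
  df  <- sum(pstar > 0) - nrow(pstar)    # d - s
  c(W = w, X2 = x2, df = df,
    p_value = pchisq(w, df, lower.tail = FALSE))
}

# Hypothetical illustration: Snoqualmie counts against an arbitrary P*
tran  <- matrix(c(186, 123, 128, 643), nrow = 2, byrow = TRUE)
pstar <- matrix(c(0.6, 0.4, 0.2, 0.8), nrow = 2, byrow = TRUE)
res <- w_specified(tran, pstar)
res
```

Returning W and X^2 side by side makes it easy to see how close the two statistics are for a given data set.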

4.7 Goodness of fit: testing independence

In section 4.3 we wished to test independence against the alternative of Markov dependence

in a model for the Snoqualmie Wet/Dry data (but needed to develop more theory before

being able to do so in a way that used all the data). The results in section 4.4 now give a

general approach. The difference from the example in section 4.6 is that H0, the hypothesis

of independence, is now composite: it does not specify the value of P completely, only that

rows of P are identical. Thus H0 is

\[ H_0 : p_{ij} = q_j \text{ for each } i, \]

where $\{q_j\}$ is a probability distribution on the state space $S$. Under $H_0$, the transition matrix $P$ is determined by $q = s - 1$ parameters. Without restriction, $P$ is determined by $s(s-1)$ parameters. We use the test statistic $W$ appropriate for testing composite hypotheses, (27):

\[ W = -2(l(\tilde{P}) - l(\hat{P})), \]

where $\tilde{P}$ is the restricted maximum likelihood estimator of $P$ under $H_0$ and $\hat{P}$ is the unrestricted estimator. Theorem 13 tells us that if $H_0$ is true, then $W$ has approximately a $\chi^2$ distribution on $d - s - q = s(s-1) - (s-1) = (s-1)^2$ degrees of freedom.

To find the restricted maximum likelihood estimates $\tilde{p}_{ij}$, note that under $H_0$ the log likelihood is

\[ l = \sum_{i,j} n_{ij} \log p_{ij} = \sum_{i,j} n_{ij} \log q_j = \sum_j n_{\cdot j} \log q_j, \]

so that $\tilde{p}_{ij} = \hat{q}_j = n_{\cdot j}/n_{\cdot\cdot}$. The test statistic is therefore

\[ W = 2\left[ \sum_{i,j} n_{ij} \log\frac{n_{ij}}{n_{i\cdot}} - \sum_j n_{\cdot j} \log\frac{n_{\cdot j}}{n_{\cdot\cdot}} \right]. \tag{30} \]

Example 14. Snoqualmie Falls (test completed)

From Table 2 the transition counts from 36 Januaries were

        Dry    Wet
Dry     186    123
Wet     128    643

We assume, as in section 4.3, that observations in different years are independent and we argue conditionally on the state on Jan 1 each year. The log likelihoods under the two hypotheses are then sums of log likelihoods arising from the separate years, each of the form above, and it is easily verified that they reduce to the above forms on summation. We can therefore treat the problem as if the data had arisen from a single realization. It is found that $w_{obs} = 193.5$, which in comparison with the quantiles of $\chi^2_1$ (99%, 99.9% and 99.99% quantiles 6.6, 10.8, 15.1) gives emphatic confirmation of our earlier findings with partial data, that an independence model is a very poor description of the data.

The following simple R function calculates the value of W for the test above; zero counts are handled by treating 0 log 0 as zero.

wind <- function(tran) {
  # calculates W statistic for independence from a transition count matrix
  nidot <- rowSums(tran)     # row totals n_i.
  ndotj <- colSums(tran)     # column totals n_.j
  ndotdot <- sum(tran)       # grand total n_..
  # sum of x*log(x), with 0*log(0) interpreted as 0
  xlogx <- function(x) sum(x[x > 0] * log(x[x > 0]))
  w <- xlogx(tran) + xlogx(ndotdot) - xlogx(nidot) - xlogx(ndotj)
  2 * w
}
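As a cross-check on the value quoted in Example 14, W can also be computed straight from its definition (27) as a difference of maximised log likelihoods. This is an illustrative sketch with assumed object names (`p_hat` for the unrestricted estimate, `p_tilde` for the estimate under independence); it relies on there being no zero counts in the table.

```r
n <- matrix(c(186, 123,
              128, 643), nrow = 2, byrow = TRUE)

p_hat   <- n / rowSums(n)                 # unrestricted MLE, n_ij / n_i.
q_hat   <- colSums(n) / sum(n)            # MLE under H0, n_.j / n_..
p_tilde <- matrix(q_hat, nrow = 2, ncol = 2, byrow = TRUE)  # identical rows

loglik <- function(p) sum(n * log(p))     # assumes no zero counts
w_obs  <- -2 * (loglik(p_tilde) - loglik(p_hat))
df     <- (nrow(n) - 1)^2                 # (s - 1)^2

w_obs                                     # compare with 193.5 in Example 14
pchisq(w_obs, df, lower.tail = FALSE)     # approximate p-value
```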

Example 15. Russian linguistics (Guttorp after Markov)

Modelling language using Markov chains and similar methods is an interesting topic with po-

tential applications in areas such as cryptography and textual analysis. For a simple example,

Markov classified the consecutive letters in a piece of text from Pushkin’s Eugene Onegin ac-

cording to whether they were vowels or consonants, obtaining the following results:

             Vowel next    Consonant next
Vowel        1106          7532
Consonant    7533          3829

If we model the sequence of vowels and consonants as a Markov chain with state space $\{V, C\}$, then this gives the maximum likelihood estimate for our transition matrix as

\[ \begin{pmatrix} 0.128 & 0.872 \\ 0.663 & 0.337 \end{pmatrix}. \]

A very simple linguistic model is to assume a constant probability of switching from one type of

letter to the other; that is to use a Markov chain model with

\[ P = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}. \tag{31} \]

Can such a model fit the data? It looks dubious, but can be tested explicitly with $W$.

Under the hypothesis that $P$ is of form (31) the log likelihood reduces to

\[ l = (n_{00} + n_{11}) \log(1-p) + (n_{01} + n_{10}) \log p, \]

which is maximized at $\hat{p} = (n_{01} + n_{10})/n$, where $n$ is the total number of transitions, from which the restricted maximum likelihood estimate $\tilde{P}$ of $P$ is obtained by substitution in (31). The unrestricted estimates of $p_{ij}$ are $\hat{p}_{ij} = n_{ij}/n_{i\cdot}$ as usual. Thus the likelihood ratio statistic $W = -2(l(\tilde{P}) - l(\hat{P}))$ can be found. Its value for the data above turns out to be 1217.7. The degrees of freedom for the null $\chi^2$ distribution are $d - s - q = 2(2-1) - 1 = 1$ since the model (31) has dimension 1. The large value of $w_{obs}$ is extremely strong evidence against the tentative model.

Inspection of the data suggests that there is a stronger tendency for consonants to follow

consonants than for vowels to follow vowels.
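The calculation in Example 15 can be sketched in R as follows. The object names are illustrative (`p_sw` is the estimated switching probability p in (31)), and the code assumes the counts in the table above with rows and columns ordered Vowel, Consonant.

```r
n <- matrix(c(1106, 7532,
              7533, 3829), nrow = 2, byrow = TRUE,
            dimnames = list(c("V", "C"), c("V", "C")))

p_hat   <- n / rowSums(n)                   # unrestricted MLE
p_sw    <- (n[1, 2] + n[2, 1]) / sum(n)     # restricted MLE of p in (31)
p_tilde <- matrix(c(1 - p_sw, p_sw,
                    p_sw, 1 - p_sw), nrow = 2, byrow = TRUE)

loglik <- function(p) sum(n * log(p))
w_obs  <- -2 * (loglik(p_tilde) - loglik(p_hat))

w_obs                                       # compare with 1217.7
pchisq(w_obs, df = 1, lower.tail = FALSE)   # approximate p-value
```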

4.8 Comparing models: homogeneity

Can the same Markov chain model serve for data from different sources? For example, do

we need different models for the weather at different times of year?

Suppose we have data $x = (x_0, \ldots, x_n)$ from source 1 and data $y = (y_0, \ldots, y_m)$ from source 2, where $x$ and $y$ are realizations from Markov chains with transition matrices $P_x$ and $P_y$ respectively. We wish to test $H_0 : P_x = P_y = P$, say. Again we can use the $W$ test statistic.

Under $H_0$, assuming independence of the two sources, the log likelihood is

\[ l = \sum_{i,j} (n^x_{ij} + n^y_{ij}) \log p_{ij} \]

so that the maximum likelihood estimates under $H_0$ are

\[ \tilde{p}_{ij} = (n^x_{ij} + n^y_{ij})/(n^x_{i\cdot} + n^y_{i\cdot}). \]

The unrestricted estimates of $P_x$, $P_y$, each based on its own sample, are of the standard form. Thus $W$ may be calculated straightforwardly. Its null distribution is approximately $\chi^2$ with degrees of freedom

\[ 2(s^2 - s) - (s^2 - s) = s(s-1). \]

Example 16. Snoqualmie wet/dry

Is there evidence of changes through time in the Markov chain model for Wet/Dry?

Transition counts for the years 1948–1965 and 1966–1983 were:

Januaries 1948–1965
        Dry    Wet
Dry      86     63
Wet      66    325

Januaries 1966–1983
        Dry    Wet
Dry     100     60
Wet      62    318

The marginally increased rate Dry → Dry (from 86/149 = 0.577 to 100/160 = 0.625) might suggest that conditions in later years were systematically less wet, although the rate Wet → Wet hardly changed (325/391 = 0.831 and 318/380 = 0.837). The observed value of the $W$ statistic, however, is $w_{obs} = 0.78$, giving no evidence, when compared to $\chi^2_2$ values, of a difference in transition matrices. A conclusion is that the observed discrepancy is within the bounds of random variation within a common model.
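The homogeneity statistic for these two periods can be computed as below. This is a minimal sketch with assumed names (`nx`, `ny`, `max_loglik`), and it relies on all counts being positive so that no 0 log 0 terms arise.

```r
nx <- matrix(c(86, 63, 66, 325), nrow = 2, byrow = TRUE)    # 1948-1965
ny <- matrix(c(100, 60, 62, 318), nrow = 2, byrow = TRUE)   # 1966-1983

# Maximised log likelihood for a count matrix: sum of n_ij log(n_ij / n_i.)
max_loglik <- function(n) sum(n * log(n / rowSums(n)))      # no zero counts

l_separate <- max_loglik(nx) + max_loglik(ny)   # own matrix for each period
l_common   <- max_loglik(nx + ny)               # common matrix under H0

w_obs <- -2 * (l_common - l_separate)
s     <- nrow(nx)
df    <- s * (s - 1)

w_obs                                           # compare with 0.78
pchisq(w_obs, df, lower.tail = FALSE)           # approximate p-value
```

Pooling the counts before maximising is what imposes the common transition matrix; the separate fits correspond to the unrestricted model.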

4.9 Higher order chains

The idea of Markov dependence extends to cases in which dependence reaches further back

than just to the previous time. For example, we might want to consider a model where the

state at time n+ 1 depends on the state at time n− 1 as well as that at time n; this would

give a second order Markov chain. More generally, we make the following definition.

Definition 14. A stochastic process $\{X_i\}$ in discrete time with a discrete state space $S$ is an $r$th order Markov chain if

\[ P(X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_{n-r+1} = x_{n-r+1}). \tag{32} \]

An $r$th order chain can in fact be treated as a first order chain, with an appropriately defined state space, so most of the standard theory can be used. Define the vector process $Y_i = (X_i, X_{i-1}, \ldots, X_{i-r+1})$ on state space $S^r$. Then when $\{X_i\}$ is $r$th order Markov

\[ P(Y_{n+1} = y_{n+1} \mid Y_n = y_n, Y_{n-1} = y_{n-1}, \ldots, Y_{r-1} = y_{r-1}) = P(Y_{n+1} = y_{n+1} \mid Y_n = y_n) \]

so that $\{Y_i\}$ is first order Markov.

The one-step transition probabilities for a second order chain $\{X_i\}$ now depend on the previous two states. If we denote them by $p_{(ij)k} = P(X_2 = k \mid X_0 = i, X_1 = j)$ then the log likelihood for them based on a realization which gave corresponding transition counts $n_{(ij)k}$ is, given the pair of starting states,

\[ l = \sum_{i,j,k} n_{(ij)k} \log p_{(ij)k}. \tag{33} \]

The maximum likelihood estimators are found, as before, to be

\[ \hat{p}_{(ij)k} = n_{(ij)k}/n_{(ij)\cdot}, \]

and approximate confidence intervals follow along the same lines as for first order chains.
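To make the second order counts concrete, the sketch below tabulates them for the January 1983 wet/dry record given in Example 10. It is an illustration only: the names are hypothetical, and a single month is far too short for the resulting proportions to be useful estimates.

```r
x <- c(rep(1, 10), 0, rep(1, 9), 0, rep(1, 6), 0, 1, 0, 0)  # January 1983
m <- length(x)

two_back <- x[1:(m - 2)]
one_back <- x[2:(m - 1)]
current  <- x[3:m]

# Rows: the pair (two days before, one day before); columns: current day
counts <- table(previous = paste0(two_back, one_back), current = current)
p_hat  <- prop.table(counts, margin = 1)   # n_(ij)k / n_(ij).

counts
round(p_hat, 3)
```

Pairs that never occur as a previous pair in the record simply do not appear as rows of the table.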


4.9.1 Second order vs first order?

Given suitable data—such as that in Table 4—we can use W to test whether a first order

chain is an adequate model, as follows.

Within the overall second-order model for the chain, the hypothesis that the chain is first

order is

\[ H_0 : p_{(ij)k} = q_{jk}, \]

and the log-likelihood under $H_0$ becomes from (33)

\[ l = \sum_{i,j,k} n_{(ij)k} \log q_{jk} = \sum_{j,k} n_{(\cdot j)k} \log q_{jk} = \sum_{j,k} n_{jk} \log q_{jk}, \tag{34} \]

writing $n_{jk}$ for $n_{(\cdot j)k} = \sum_i n_{(ij)k}$. The restricted maximum likelihood estimates are therefore $\tilde{p}_{(ij)k} = n_{jk}/n_{j\cdot}$ and so the maximum log likelihood under the restricted model is

\[ l(\tilde{P}) = \sum_{j,k} n_{jk} \log\left( \frac{n_{jk}}{n_{j\cdot}} \right). \]

The maximum without restriction is

\[ l(\hat{P}) = \sum_{i,j,k} n_{(ij)k} \log\left( \frac{n_{(ij)k}}{n_{(ij)\cdot}} \right) \]

and so the GLRT test statistic $W = -2(l(\tilde{P}) - l(\hat{P}))$ may be found. Large values of $W$ discredit $H_0$ in favour of second order dependence. The null distribution of $W$ is approximately $\chi^2$ with degrees of freedom

\[ s^2(s-1) - s(s-1) = s(s-1)^2 \]

since the number of non-zero elements in $P$ is $s^3$, the size of the state space is $s^2$ and under $H_0$ there are $s^2 - s$ independent probabilities $q_{jk}$.

Example 17. Snoqualmie wet/dry

Table 4 shows the necessary data in this case. Here $s = 2$ so the degrees of freedom in the $W$ test are $s(s-1)^2 = 2$. For the transition count data in Table 4 it is found that $w_{obs} = 0.29$ giving a p-value of 0.87 and no reason at all to question the first order Markov dependence model.

Table 4: Wet/Dry in relation to previous 2 days; Januaries 1948–1983

Previous days            Current day     Proportion
2 before    1 before     Dry    Wet      Dry      Wet
Wet         Wet          100    527      0.159    0.841
Dry         Wet           25     94      0.206    0.794
Wet         Dry           70     52      0.574    0.426
Dry         Dry          109     67      0.619    0.381

4.10 Further rain modelling – some possibilities

We have concentrated on models for dry and wet days over only a limited period and we

have not discussed the actual amounts of rain. The following lists some simple possibilities

for developing our models so that they apply throughout the year and account for rainfall.

1. To extend the wet/dry modelling to other months it is natural to ask whether they

too could be modelled with Markov chains, possibly with different transition matrices

in different months or groups of months. If so, a Markov chain model in which

the transition matrix changes over time might be considered. The change might

be discontinuous (at month-ends, say) or be made smooth by interpolating between

transition matrices estimated separately for different months. In either case the result

would be a time-heterogeneous chain, but one simple to simulate from.

2. If Ri is the rainfall on day i, then Ri = 0 if i is dry. A very simple model for the positive

Ri would be to take them to be independent and from a fixed distribution. Checks on

such a model could be based informally on the semi-variogram (for independence) and

on tests for equality of distribution (for the assumption of stationarity), all applied

to observed values of rainfall. It seems possible that the distribution of daily rainfall

amount might vary over the year. If so, then it may be possible to represent the

variation through simple forms of time-varying parameters in a standard distribution,

say a gamma distribution. Again, such a model would be easy to simulate from.

3. Two general strategies for checking the adequacy of a model are (i) to compare the

theoretical value of some feature(s) of the model with data, and (ii) to embed the

model in a broader one and to test the hypothesis that the data are generated from the

smaller model. In (i) it is desirable that the features should not have been used directly

in the fitting of the model. A frequent difficulty is the calculation of the theoretical

values of model features. Simulation may be useful when analytical approaches are

not available. The same difficulty can stand in the way of strategy (ii) and again

simulation methods may offer a way round it.

4. The rainfall model emerging from (1) and (2) above is unsatisfactory in that it makes

minimal use of knowledge of the weather. Rainfall is associated with the passage of

weather systems over the observation site. Different weather systems have different

propensities to generate rain at the site. We might model this by assuming a random

succession of a small number of weather types over the site, each having different

probabilities of causing rain. Observations, as before, are the presence or absence of

rain on successive days. If the succession of weather types is modelled by a Markov

chain whose states are not recorded, the resulting process is called a hidden Markov

model (HMM). It is an example of a state space model. We might assume in addition

different distributions for the amount of rain falling on the wet days generated through

the basic HMM, obtaining a more flexible overall rainfall model than before. This idea

generalizes to models for several sites over a region.
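Possibilities 1 and 2 describe models that are simple to simulate from. The following is a rough sketch under assumed values: the February matrix, the starting state and the gamma shape and rate are all hypothetical and chosen only to show the mechanics, while the January matrix reuses the estimates from Table 3.

```r
set.seed(1)                                  # reproducible illustration

# Transition matrices by month; states 1 = Dry, 2 = Wet
P_jan <- matrix(c(0.602, 0.398, 0.166, 0.834), nrow = 2, byrow = TRUE)
P_feb <- matrix(c(0.650, 0.350, 0.250, 0.750), nrow = 2, byrow = TRUE)  # hypothetical
P_list <- list(P_jan, P_feb)
month  <- rep(1:2, times = c(31, 28))        # month of each simulated day

n_days <- length(month)
state  <- integer(n_days)
state[1] <- 2                                # assumed: start on a wet day
for (t in 2:n_days) {
  P <- P_list[[month[t]]]                    # matrix in force on day t
  state[t] <- sample(1:2, size = 1, prob = P[state[t - 1], ])
}

# Rainfall: zero on dry days, gamma (hypothetical shape and rate) on wet days
rain <- ifelse(state == 2, rgamma(n_days, shape = 0.8, rate = 0.1), 0)

table(state)
summary(rain[state == 2])
```

Here the transition matrix switches abruptly at the month end; a smoothly varying version would replace `P_list[[month[t]]]` by an interpolated matrix for day t.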


5 Continuous time Markov chain models

So far, our Markov chains have been in discrete time: they have been described by a

sequence of random variables Xn for integers n. Many phenomena do not have a natural

“time step”, and so are better described in continuous time; examples include biological

growth and telephone traffic. These may be studied by discretizing time, but the restriction

may fail to represent all of the essential structure of the process. It may not be as simple

technically either.

A continuous time stochastic process is then a set of random variables Xt for real numbers

t; often we assume time starts at 0 so that t ∈ R+. In this course we will be concentrating

on continuous time processes which maintain a constant value for some time (usually a

random time) and then jump to a different value. You have already seen one example of

such a process, the Poisson process, and we will start off this section with a different way

of thinking about the Poisson process, which we will then generalize to develop a family of

continuous time processes known as continuous time Markov chains.

5.1 Events of a Poisson process in a short period of time, and Landau O and o notation

Another way of describing the Poisson process involves the probability that there will be an event of the process in a short period of time. To introduce this we need Landau O and o notation, which is a useful shorthand for limits and bounds on real quantities. It gives a way of making precise statements about approximations.

Definition 15. For sequences $x_n$ and $y_n$,

\[ x_n = o(y_n) \text{ as } n \to \infty \text{ means } \lim_{n \to \infty} x_n/y_n = 0, \]

and

\[ x_n = O(y_n) \text{ as } n \to \infty \text{ means that for some } B,\ |x_n/y_n| \le B \text{ for all } n \text{ sufficiently large.} \]

If instead $f(h)$ and $g(h)$ are (real) functions of a continuous variable $h$, then $f(h) = o(g(h))$ and $f(h) = O(g(h))$ as $h$ tends to a limit have analogous meanings.

We will use these continuous versions mainly in the case $h \to 0$, for example to look at the behaviour of probabilities over very short time intervals.

Example 18. Binomial distribution asymptotics

Let $S_n \sim \mathrm{Bin}(n, p)$ with fixed $p \in (0, 1)$, and consider

\[ x_n = P(S_n \le 1) = (1-p)^n + np(1-p)^{n-1}. \]

As we can write $x_n = (1-p)^{n-1}(1 - p + np)$, we have $x_n \to 0$ as $n \to \infty$, so we can write $x_n = o(1)$ as $n \to \infty$. We can describe the rate of the convergence to zero by comparing with some other functions of $n$ which also go to zero as $n \to \infty$. First, note that

\[ \frac{x_n}{(1-p)^n} = 1 + \frac{np}{1-p} \to \infty \text{ as } n \to \infty, \]

so we cannot write $x_n = O((1-p)^n)$. However,

\[ \frac{x_n}{n(1-p)^n} = \frac{1}{n} + \frac{p}{1-p} \le 1 + \frac{p}{1-p}, \]

so we can write $x_n = O(n(1-p)^n)$ as $n \to \infty$.

On the other hand, $x_n \ne o(n(1-p)^n)$, as $\frac{1}{n} + \frac{p}{1-p}$ does not tend to zero as $n \to \infty$. However,

\[ \frac{x_n}{n^2(1-p)^n} = \frac{1}{n^2} + \frac{p}{n(1-p)} \to 0 \text{ as } n \to \infty, \]

so we do have $x_n = o(n^2(1-p)^n)$.
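The three comparisons in Example 18 can be seen numerically. In the sketch below the value p = 0.3 and the chosen values of n are assumptions made purely for illustration.

```r
p <- 0.3                                 # assumed value for illustration
n <- c(10, 100, 500)

x_n <- pbinom(1, size = n, prob = p)     # P(S_n <= 1)

ratio_a <- x_n / (1 - p)^n               # grows without bound
ratio_b <- x_n / (n * (1 - p)^n)         # tends to p / (1 - p), not to 0
ratio_c <- x_n / (n^2 * (1 - p)^n)       # tends to 0

cbind(n, ratio_a, ratio_b, ratio_c)
```

The middle column settles near p/(1 − p), which is why the O statement holds with that comparison function but the o statement does not.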

Now let $p(t)$ be the probability that there is at least one event of a Poisson process with rate $\lambda$ in the time interval $(0, t]$. By the assumption that the number of events in the interval has a Poisson distribution with parameter $\lambda t$, we have $p(t) = 1 - e^{-\lambda t}$. We consider the asymptotics of this as $t \downarrow 0$.

First of all, by the series expansion of the exponential we have

\[ p(t) = \lambda t - \frac{\lambda^2 t^2}{2} + \frac{\lambda^3 t^3}{6} - \frac{\lambda^4 t^4}{24} + \ldots \]

and thus

\[ \frac{p(t)}{t} = \lambda - \frac{\lambda^2 t}{2} + \frac{\lambda^3 t^2}{6} - \frac{\lambda^4 t^3}{24} + \ldots. \]

The right hand side here tends to $\lambda$ as $t \to 0$, so is bounded for $t$ close to 0, and hence we can write $p(t) = O(t)$.

To be more precise, we consider the difference $p(t) - \lambda t$, and observe

\[ \frac{p(t) - \lambda t}{t} = -\frac{\lambda^2 t}{2} + \frac{\lambda^3 t^2}{6} - \frac{\lambda^4 t^3}{24} + \ldots \to 0 \text{ as } t \downarrow 0. \]

So we can write $p(t) - \lambda t = o(t)$, and we also write this as $p(t) = \lambda t + o(t)$.
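A quick numerical look at these two statements, as a sketch with an assumed rate λ = 2: the first ratio should settle near λ and the second should shrink towards zero as t decreases.

```r
lambda <- 2                              # assumed rate for illustration
t <- 10^(-(1:6))                         # t decreasing towards 0

p_t <- -expm1(-lambda * t)               # 1 - exp(-lambda t), computed stably

ratio     <- p_t / t                     # approaches lambda: p(t) = O(t)
remainder <- (p_t - lambda * t) / t      # approaches 0: p(t) = lambda t + o(t)

cbind(t, ratio, remainder)
```

Using `expm1` rather than `1 - exp(...)` avoids loss of precision when t is very small.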

In fact, the Poisson process can be characterized as follows:

1. If (u1, v1], (u2, v2], . . . , (uk, vk] are disjoint time intervals then

N(v1)−N(u1), N(v2)−N(u2), . . . , N(vk)−N(uk) are independent random variables.

2. N(u) takes non-negative integer values, and is increasing in u.

3. For any time u, as t ↓ 0 we have that P (N(u + t) − N(u) ≥ 1) = λt + o(t) and

P (N(u+ t)−N(u) ≥ 2) = o(t).

(The first of these properties was the second of the properties we assumed originally, but

note that it is not necessary to specify that the distribution of N(t) is Poisson; see Example

26.)


5.2 Continuous time Markov chains

A continuous time stochastic process is a Markov chain if it has discrete states (conventionally labelled by the integers) and satisfies the condition: for $t_n > t_{n-1} > \cdots > t_1 \ge 0$,

\[ P\{X_{t_n} = i_n \mid X_{t_{n-1}} = i_{n-1}, \ldots, X_{t_1} = i_1\} = P\{X_{t_n} = i_n \mid X_{t_{n-1}} = i_{n-1}\} \tag{35} \]

for each $n \ge 2$ (the Markov Property).

As in the discrete time case, we concentrate on chains which are time-homogeneous in that

\[ P(X_{t+\tau} = j \mid X_t = i) = p_{ij}(\tau), \]

not depending on $t$.

In discrete time the n-step transition probabilities were determined by P = (pij), so it was

natural to specify the model by saying what P and the initial distribution π(0) were. The

one-step transition matrix was thus a natural parameter for the chain. What is the analogue

of P in continuous time?

For a continuous-time chain the Chapman-Kolmogorov equations are, by the same argument as in Section 2.2.3,

\[ p_{ij}(t + \tau) = \sum_k p_{ik}(t)\, p_{kj}(\tau), \]

which may be written in matrix terms as

\[ P(t + \tau) = P(t)P(\tau), \tag{36} \]

where $P(t) = (p_{ij}(t))$.

An important consequence of (36) is that if we knew the value of P (t) for all t in a short

interval [0, δt] then we would be able to find it for any larger t too because the C-K equations

(36) would allow successive extensions to [0, 2δt], [0, 4δt] and so on.

This suggests that we should be able to parametrize the chain in terms of the transition

probabilities over arbitrarily short times. As $\delta t \to 0$ it is reasonable to insist that

\[ P(\delta t) \to I, \]

the unit matrix, so we focus on how $p_{ij}(\delta t) \to 0$ ($i \ne j$) and $p_{ii}(\delta t) \to 1$.

Suppose that for $\delta t$ very small the probability of a transition from $i$ to $j$ with $i \ne j$ is

\[ p_{ij}(\delta t) = g_{ij}\,\delta t + o(\delta t), \tag{37} \]

and that

\[ p_{ii}(\delta t) = 1 + g_{ii}\,\delta t + o(\delta t), \tag{38} \]

where the $g_{ij}$ are constants. (Note that $g_{ii}$ will be negative.) The matrix $G$ with elements $g_{ij}$ is called the (infinitesimal) generator of the Markov chain.

Notes:

1. Properties (37) and (38) are sometimes also called the infinitesimal transition scheme

for the chain.

2. Since $\sum_j p_{ij}(\delta t) = 1$, we might expect that

\[ p_{ii}(\delta t) = 1 - \sum_{j \ne i} (g_{ij}\,\delta t + o(\delta t)) = 1 - \Big( \sum_{j \ne i} g_{ij} \Big)\,\delta t + o(\delta t) \]

and so that

\[ g_{ii} = -\sum_{j \ne i} g_{ij}. \tag{39} \]

Property (39) will certainly be true for chains with finite state space. In general it is not guaranteed. However, most chains used in modelling satisfy it. They are called conservative chains. We focus on them from now on.

3. A natural way to think about $g_{ij}$ for $j \ne i$ is as the instantaneous rate at which, given that the chain is in state $i$, a transition to state $j$ is expected to occur.

When a continuous time Markov chain is used for modelling, it is usually not the transition

probabilities pij(t) that are specified, but the generator.

Example 19. A two state chain

Let $S = \{1, 2\}$ and set

\[ G = \begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix}, \]

with α, β > 0. Then G is the generator matrix for a continuous time Markov chain with two

states. (As previously, we could think of this as a weather model, with the states being “wet”

and “dry”.)
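To illustrate how a generator pins down the transition probabilities through (36), (37) and (38), here is a rough numerical sketch with assumed values α = 1, β = 2 and t = 1. It approximates P(δt) by I + Gδt for a very small δt and then squares the matrix repeatedly, each squaring doubling the time span in the manner of the successive extensions described earlier. The closed form used as a check is a standard result for the two-state chain and is not derived in this section.

```r
alpha <- 1; beta <- 2                      # assumed rates
G <- matrix(c(-alpha, alpha,
               beta, -beta), nrow = 2, byrow = TRUE)

t_end <- 1
m     <- 20                                # number of doublings
dt    <- t_end / 2^m

P <- diag(2) + G * dt                      # P(dt) ~ I + G dt, from (37)-(38)
for (k in seq_len(m)) P <- P %*% P         # P(2h) = P(h) P(h), from (36)

# Standard closed form for p_11(t), quoted here only as a check
p11_exact <- beta / (alpha + beta) +
  alpha / (alpha + beta) * exp(-(alpha + beta) * t_end)

P
c(approx = P[1, 1], exact = p11_exact)
```

The approximation error comes from dropping the o(δt) terms in the first step, so it shrinks as the number of doublings m increases.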

Example 20. The Poisson process

Consider $N_t$, the number of events in $(0, t]$ in a Poisson process with rate $\lambda$. We have already seen that $P(N_{t+\delta t} = n+1 \mid N_t = n) = \lambda\,\delta t + o(\delta t)$ and that $P(N_{t+\delta t} > n+1 \mid N_t = n) = o(\delta t)$. So we can consider a Poisson process as a continuous time Markov chain on $\mathbb{N}_0$ with generator given by $g_{n,n+1} = \lambda$, $g_{n,n} = -\lambda$ and other entries zero.

The chain is not irreducible: values can only increase, so it is not possible to get from state i

to state j if j < i.

Example 21. Linear birth process

(Also known as the Yule process.)

Consider a population where each member of the population reproduces at rate λ (meaning that for each member of the population, the probability of a reproduction event between times t and t + δt is λδt + o(δt)) and where there are no other ways the population can change. Then, letting Nt be the size of the population at time t, P(Nt+δt = i + 1 | Nt = i) = λiδt + o(δt) and P(Nt+δt = i | Nt = i) = 1 − λiδt + o(δt), with P(Nt+δt = j | Nt = i) being o(δt) if j is neither i nor i + 1. So we have

\[
g_{ij} = \begin{cases}
\lambda i & j = i+1 \\
-\lambda i & j = i \\
0 & \text{otherwise.}
\end{cases}
\]

Example 22. Linear birth-death process

Extend Example 21 by allowing for deaths as well, for example for a population of bacteria.

Now each member of the population reproduces at rate λ, giving P (Nt+δt = i + 1|Nt = i) =

λiδt + o(δt), and dies at rate µ, giving P (Nt+δt = i− 1|Nt = i) = µiδt + o(δt). We now have

P (Nt+δt = i|Nt = i) = 1− (λ+ µ)iδt + o(δt), and we have

\[
g_{ij} = \begin{cases}
\lambda i & j = i+1 \\
-(\lambda + \mu) i & j = i \\
\mu i & j = i-1 \\
0 & \text{otherwise.}
\end{cases}
\]

The chain is still not irreducible, as it is not possible to leave state 0.

Example 23. M/M/n queue

Imagine that we have a queue with n servers, and assume that:

• Arrivals of customers occur as a Poisson process of rate λ.

• There are n servers, who can each serve one customer at a time, and will automatically

move on to the next customer in the queue when their current customer leaves.

• Each customer currently being served completes their service at rate µ, so that the prob-

ability they leave in (t, t+ δt] is µδt+ o(δt).

(The model is presented in terms of people in a queue, but could also, for example, represent

jobs in a computer system. In the name “M/M/n”, the first M refers to arrivals satisfying the

Markov property, the second M refers to service completions following the Markov property, and

the n refers to the number of servers.)

Let Nt be the number in the queue at time t. If Nt = i with i ≤ n, then in (t, t + δt] we may have an arrival, which is an event of a Poisson process, so has probability λδt + o(δt), or we may have a service completion, which happens with probability µiδt + o(δt), as there are i customers being served, or we may have more than one event with probability o(δt). Hence we have gi,i+1 = λ, gi,i−1 = µi and gii = −(λ + µi). If i > n then gi,i+1 is still λ, but there are now only n customers being served, so gi,i−1 = µn and gii = −(λ + µn).

This chain is irreducible.

Example 24. Generalized birth-death process


All the above are examples of the Generalized Birth-Death process. This has transitions

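\[
i \longrightarrow \begin{cases}
i+1 & \text{with probability } \lambda_i\,\delta t + o(\delta t); \\
i & \text{with probability } 1 - (\lambda_i + \mu_i)\,\delta t + o(\delta t); \\
i-1 & \text{with probability } \mu_i\,\delta t + o(\delta t),
\end{cases}
\]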

corresponding to

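\[
g_{ij} = \begin{cases}
\lambda_i & \text{if } j = i+1 \\
-\lambda_i - \mu_i & \text{if } j = i \\
\mu_i & \text{if } j = i-1 \\
0 & \text{otherwise.}
\end{cases}
\]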
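As an illustration only, the generator of a generalized birth-death process can be assembled numerically once the state space is cut off at a finite size. The sketch below is not part of the notes: the function name `birth_death_generator`, the truncation rule (births out of the top state are simply dropped) and the rate values are all assumptions made for the example. It builds the M/M/n queue of Example 23 with two servers.

```python
def birth_death_generator(lam, mu, n_states):
    """Generator matrix on states 0..n_states-1 for birth rates lam(i), death rates mu(i).

    Assumption: the state space is truncated, so births out of the top state are dropped.
    """
    G = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            G[i][i + 1] = lam(i)      # g_{i,i+1} = lambda_i
        if i >= 1:
            G[i][i - 1] = mu(i)       # g_{i,i-1} = mu_i
        G[i][i] = -sum(G[i])          # conservative chain: g_ii = -sum_{j != i} g_ij
    return G

# Hypothetical M/M/n queue: arrival rate 1.0, service rate 0.5 per server, 2 servers.
arrival, service, n_servers = 1.0, 0.5, 2
G_queue = birth_death_generator(
    lambda i: arrival,
    lambda i: service * min(i, n_servers),
    6,
)
```

Every row of `G_queue` sums to zero, which is property (39) for this truncated chain.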

5.3 Determination of the transition probabilities P (t)

This section outlines heuristically¹ how the transition probabilities describing the behaviour of a model may be obtained from a specified generator matrix.

The generator definitions in (37) and (38) can be written in matrix notation as

P (δt)− I = Gδt+ o(δt), as δt→ 0,

that is,

\[
\lim_{\delta t \to 0} \frac{P(\delta t) - I}{\delta t} = G. \tag{40}
\]

Consider evolution of the chain from its starting state at time 0 to its state at time t+ δt.

From the Chapman-Kolmogorov equations (36), the transitions over the whole period t+δt

may be expressed in terms of those between 0 and t and those in the final short time δt by

P (t+ δt) = P (t)P (δt).

Thus

\[
\frac{P(t + \delta t) - P(t)}{\delta t} = \frac{P(t)\,\bigl(P(\delta t) - I\bigr)}{\delta t}. \tag{41}
\]

On letting δt→ 0 it is reasonable to interpret the limit of the left hand term in (41) as the

derivative P ′(t) of the matrix P at time t. Together with (40) this suggests that

P ′(t) = P (t)G, (42)

known as the forward differential equations.

A similar argument based on decomposing paths over [0, t + δt] into short (δt) and long (t) sections rather than vice versa leads to the backward differential equations

P ′(t) = GP (t). (43)

¹ Aiming to convey the main idea without filling in all formal justifications and conditions.


Aside: rigorous justification of these equations for chains with finite state spaces is imme-

diate. For infinite state space care is needed in the interchange of the limiting operation as

δt→ 0 with the infinite summations.

There is no completely general way to solve equations such as (42) or (43); solution ap-

proaches are usually tailored to specific cases. For finite chains however the following is

sometimes useful.

The equations (42) and (43) are matrix generalizations of the simple differential equation

y′(t) = g y(t) for a scalar variable y(t) and constant g, of which the solution is y(t) = y(0)egt.

With P (0) = I, this suggests a formal solution for the forward and backward equations

\[
P(t) = \exp(Gt). \tag{44}
\]

This uses the matrix exponential function: for a matrix A the exponential function is

defined as

\[
\exp(A) = \sum_{k=0}^{\infty} A^k / k!. \tag{45}
\]

For computation and theoretical discussion it can be helpful to use a spectral decomposition

of G to simplify calculation of the matrix powers in (45).

Assuming G is diagonalizable, the spectral representation of G (cf (1)) is

G = TDT−1

where T is a matrix whose columns are right eigenvectors t of G, so that Gt = dt, and D

is a diagonal matrix of the corresponding eigenvalues d

\[
D = \begin{pmatrix}
0 & 0 & \cdots & \cdots \\
0 & d_2 & 0 & \cdots \\
0 & 0 & d_3 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]

(conservatism of G guaranteeing that 0 is an eigenvalue). Thus G2 = TDT−1 TDT−1 =

TD2T−1 and generally

Gn = TDnT−1, n = 1, 2, . . .

where

\[
D^n = \begin{pmatrix}
0 & 0 & \cdots & \cdots \\
0 & d_2^n & 0 & \cdots \\
0 & 0 & d_3^n & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]

This allows exp(Gt) to be found easily as follows, giving a reasonably general expression for

P (t) = P (0) exp(Gt).

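\[
\begin{aligned}
\exp(Gt) &= \sum_{n=0}^{\infty} (Gt)^n / n! \\
&= T \Bigl( \sum_{n=0}^{\infty} (Dt)^n / n! \Bigr) T^{-1} \\
&= T \begin{pmatrix}
1 & 0 & \cdots & \cdots \\
0 & \sum_n (d_2 t)^n / n! & 0 & \cdots \\
0 & 0 & \sum_n (d_3 t)^n / n! & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix} T^{-1} \\
&= T \begin{pmatrix}
1 & 0 & \cdots & \cdots \\
0 & e^{d_2 t} & 0 & \cdots \\
0 & 0 & e^{d_3 t} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix} T^{-1}.
\end{aligned}
\]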

Example 25. Two state chain

Consider the special case of Example 19 with

\[
G = \begin{pmatrix} -1 & 1 \\ 2 & -2 \end{pmatrix}.
\]

Then G has eigenvalues 0 and −3, and can be diagonalized as

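\[
G = \begin{pmatrix} 1 & 1 \\ 1 & -2 \end{pmatrix}
\begin{pmatrix} 0 & 0 \\ 0 & -3 \end{pmatrix}
\begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & -\tfrac{1}{3} \end{pmatrix}.
\]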

Hence we have

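\[
P(t) = \begin{pmatrix} 1 & 1 \\ 1 & -2 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & e^{-3t} \end{pmatrix}
\begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & -\tfrac{1}{3} \end{pmatrix}
= \begin{pmatrix}
\tfrac{2}{3} + \tfrac{1}{3} e^{-3t} & \tfrac{1}{3} - \tfrac{1}{3} e^{-3t} \\
\tfrac{2}{3} - \tfrac{2}{3} e^{-3t} & \tfrac{1}{3} + \tfrac{2}{3} e^{-3t}
\end{pmatrix}.
\]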

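As can be seen,

\[
P(t) \to \begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} \\ \tfrac{2}{3} & \tfrac{1}{3} \end{pmatrix}
\quad \text{as } t \to \infty.
\]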
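The closed form in Example 25 can be checked numerically against the series definition (45). The following is a minimal sketch, not a recommended numerical method: it truncates the power series after a fixed number of terms, which is only reasonable when the entries of Gt are small. The helper names `mat_mul`, `expm_series` and `p_closed_form` and the choice t = 0.5 are assumptions made for the illustration.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(G, t, terms=60):
    """Approximate exp(G t) by truncating the power series (45) after `terms` terms."""
    n = len(G)
    Gt = [[g * t for g in row] for row in G]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term: I
    term = [row[:] for row in result]
    for k in range(1, terms):
        # After this line, term holds (Gt)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, Gt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def p_closed_form(t):
    """Closed form for P(t) obtained in Example 25."""
    e = math.exp(-3.0 * t)
    return [[2 / 3 + e / 3, 1 / 3 - e / 3],
            [2 / 3 - 2 * e / 3, 1 / 3 + 2 * e / 3]]

G = [[-1.0, 1.0], [2.0, -2.0]]
P_series = expm_series(G, 0.5)
P_exact = p_closed_form(0.5)
```

The two matrices should agree to within rounding error, and each row of either should sum to 1.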

In some cases special techniques give a solution of the equations. For example, we can show

that the continuous time Markov chain characterization of the Poisson process really does

lead to Poisson distributions. Recall from MAS275 that if X is a discrete random variable

taking non-negative integer values then its probability generating function FX(s) is

defined by

\[
F_X(s) = E(s^X) = \sum_{k=0}^{\infty} s^k P(X = k),
\]

and that this sum is convergent at least for s ∈ [0, 1].

Example 26. Solution of forward equations for Poisson process


Assume that we are interested in the distribution of Nt, and note that the probability that

Nt = k is the same as the transition probability p0k(t). For the Poisson process, the forward

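equations P′(t) = P(t)G become

\[
\begin{pmatrix}
p'_{00}(t) & p'_{01}(t) & p'_{02}(t) & \cdots \\
p'_{10}(t) & p'_{11}(t) & p'_{12}(t) & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
=
\begin{pmatrix}
p_{00}(t) & p_{01}(t) & p_{02}(t) & \cdots \\
p_{10}(t) & p_{11}(t) & p_{12}(t) & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\begin{pmatrix}
-\lambda & \lambda & 0 & 0 & \cdots \\
0 & -\lambda & \lambda & 0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}.
\]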

Looking just at the entries in the top row on the left hand side and writing pk(t) = p0k(t) for

simplicity, these give

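\[
\begin{aligned}
p_0'(t) &= -\lambda p_0(t) \\
p_1'(t) &= \lambda p_0(t) - \lambda p_1(t) \\
p_2'(t) &= \lambda p_1(t) - \lambda p_2(t) \\
&\;\;\vdots \\
p_k'(t) &= \lambda p_{k-1}(t) - \lambda p_k(t) \\
&\;\;\vdots
\end{aligned}
\]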

Multiply both sides of the equation for p′k(t) by s^k, and sum all the equations, obtaining

\[
\sum_{k=0}^{\infty} p_k'(t) s^k = \lambda \sum_{k=1}^{\infty} p_{k-1}(t) s^k - \lambda \sum_{k=0}^{\infty} p_k(t) s^k.
\]

Note that the second sum on the right hand side is the probability generating function $F_{N_t}(s)$; similarly the term on the left hand side can be written $\partial F_{N_t}(s)/\partial t$, and the first sum on the right hand side is $s F_{N_t}(s)$. So we have

\[
\frac{\partial F_{N_t}(s)}{\partial t} = \lambda (s - 1) F_{N_t}(s).
\]

If we treat s as fixed, this is an ordinary differential equation with respect to t, and its solution

is

FNt(s) = C exp (λ(s− 1)t) .

To identify the value of C, note that N0 is zero with probability 1, so FN0(s) = 1 for any s,

which tells us that C = 1. Hence

FNt(s) = exp (λ(s− 1)t) .

A Poisson distribution with parameter µ has probability generating function given by

\[
\sum_{k=0}^{\infty} \frac{\mu^k e^{-\mu}}{k!} s^k = e^{-\mu} \sum_{k=0}^{\infty} \frac{(\mu s)^k}{k!} = e^{-\mu} e^{\mu s} = e^{\mu (s-1)},
\]

so we can recognise FNt(s) as the probability generating function of a Poisson distribution with

parameter λt. The probability generating function determines the distribution, so we do indeed

have that Nt ∼ Po(λt).


5.4 Stationary and limiting distributions

In many applications the process studied appears to have been running for a long time and

to have reached an apparently stable equilibrium. Then we are not so much interested in

the transient properties of our models as in the limiting probabilities as t → ∞; that is,

in the steady-state condition reached. It is (heuristically) straightforward to find what the

steady-state must be.

Assume that as t→∞ the transition probabilities pij(t) converge to some limit πj for each

j, whatever the initial state i. In matrix terms this says that each row of P (t) tends to the

row vector, π say, of the πj . Given that the pij(t) converge to constants, we would expect

(unless they are very peculiar) their time derivatives to converge to zero: that is, P ′(t)→ 0.

Thus in the forward equations (42)

P ′(t) = P (t)G,

as t→∞ the first term tends to zero and each row of the second term tends to πG, so that

the limiting probabilities satisfy

πG = 0. (46)

Say that a distribution π = (πj) (a row vector) on the states is stationary if

πP (t) = π for all t > 0.

If P (t) can be expressed as exp(Gt), then it is simple to show that a distribution π is a

stationary distribution if and only if π satisfies (46). In particular this identifies stationary

and limiting distributions for finite chains. The identification holds for many infinite chains

too.

An interesting way to think about the limiting behaviour of a continuous time chain is to

imagine observing the chain only at discrete times 0, h, 2h, . . . . The resulting sequence is a

discrete time Markov chain (called a skeleton chain).

By using the conditions for existence of limiting distributions in discrete chains it is possible

to show that if the continuous time chain is irreducible, meaning that all states communicate

in the sense that pij(t) > 0 and pji(s) > 0 for some t and s, then either the system is stable

and the limiting distribution limt→∞ pij(t) = πj is the unique solution of (46), or pij(t)→ 0

for all states and (46) has no solution. Complications due to periodicity do not arise for

continuous time chains.

Example 27. Two state chain

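We find a stationary distribution for the chain in Example 19. To do this, we solve

\[
\begin{pmatrix} \pi_1 & \pi_2 \end{pmatrix}
\begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix}
= \begin{pmatrix} 0 & 0 \end{pmatrix}.
\]

This gives

\[
\begin{aligned}
-\alpha \pi_1 + \beta \pi_2 &= 0 \\
\alpha \pi_1 - \beta \pi_2 &= 0,
\end{aligned}
\]

either of which gives π2 = (α/β)π1. The constraint that π1 + π2 = 1 gives

\[
\pi = \begin{pmatrix} \dfrac{\beta}{\alpha + \beta} & \dfrac{\alpha}{\alpha + \beta} \end{pmatrix}.
\]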


Example 28. Generalized birth-death process

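\[
i \longrightarrow \begin{cases}
i+1 & \text{with probability } \lambda_i\,\delta t + o(\delta t); \\
i & \text{with probability } 1 - (\lambda_i + \mu_i)\,\delta t + o(\delta t); \\
i-1 & \text{with probability } \mu_i\,\delta t + o(\delta t),
\end{cases}
\]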

giving

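\[
G = \begin{pmatrix}
-\lambda_0 & \lambda_0 & 0 & 0 & 0 & \cdots \\
\mu_1 & -(\lambda_1 + \mu_1) & \lambda_1 & 0 & 0 & \cdots \\
0 & \mu_2 & -(\lambda_2 + \mu_2) & \lambda_2 & 0 & \cdots \\
0 & 0 & \mu_3 & -(\lambda_3 + \mu_3) & \lambda_3 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]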

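The equations for the stationary distribution (π0, π1, . . .)G = 0 are therefore

\[
\begin{aligned}
\lambda_0 \pi_0 &= \mu_1 \pi_1 \\
(\lambda_j + \mu_j)\pi_j &= \lambda_{j-1}\pi_{j-1} + \mu_{j+1}\pi_{j+1}, \quad j \geq 1
\end{aligned}
\]

from which we see

\[
\pi_1 = \frac{\lambda_0}{\mu_1}\pi_0, \qquad
\pi_2 = \frac{1}{\mu_2}\Bigl\{ -\lambda_0 \pi_0 + (\lambda_1 + \mu_1)\frac{\lambda_0 \pi_0}{\mu_1} \Bigr\} = \frac{\lambda_0 \lambda_1}{\mu_1 \mu_2}\pi_0
\]

and in general

\[
\pi_j = \frac{\lambda_0 \cdots \lambda_{j-1}}{\mu_1 \cdots \mu_j}\pi_0
\]

which gives a probability distribution if and only if λ0 ≠ 0 and

\[
\sum_{j=1}^{\infty} \frac{\lambda_0 \cdots \lambda_{j-1}}{\mu_1 \cdots \mu_j} < \infty
\]

and then the distribution has probability function

\[
\pi_0 = 1 \Big/ \Bigl( 1 + \sum_{j=1}^{\infty} \frac{\lambda_0 \cdots \lambda_{j-1}}{\mu_1 \cdots \mu_j} \Bigr), \qquad
\pi_j = \frac{\lambda_0 \cdots \lambda_{j-1}}{\mu_1 \cdots \mu_j}\pi_0, \quad j = 1, 2, \ldots.
\]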

Example 29. M/M/1 queue

For the M/M/1 queue, Example 23 with n = 1, we have λi = λ and µi = µ for all i. Hence the criterion in Example 28 becomes

\[
\sum_{j=1}^{\infty} \frac{\lambda^j}{\mu^j} < \infty,
\]

which is true if and only if λ < µ (i.e. the service rate exceeds the arrival rate). In this case, summing the geometric series gives π0 = (µ − λ)/µ, and so

\[
\pi_j = \frac{\mu - \lambda}{\mu} \Bigl( \frac{\lambda}{\mu} \Bigr)^j.
\]

Hence the stationary distribution is a geometric distribution.


Example 30. Immigration-death process

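Consider a generalized birth-death process with λi = λ for all i, representing a constant rate of immigration, and µi = µ i (that is, µ times i), as in the linear birth-death process. The criterion in Example 28 now becomes

\[
\sum_{j=1}^{\infty} \frac{\lambda^j}{\mu^j j!} < \infty,
\]

which is always satisfied, as $\sum_{j=1}^{\infty} \lambda^j / (\mu^j j!) = e^{\lambda/\mu} - 1$. Then $\pi_0 = 1/e^{\lambda/\mu}$ and

\[
\pi_j = \frac{\lambda^j}{\mu^j j!\, e^{\lambda/\mu}} = \frac{(\lambda/\mu)^j e^{-\lambda/\mu}}{j!},
\]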

showing that the stationary distribution is Poisson with parameter λ/µ.
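The product formula of Example 28 is easy to evaluate numerically, which gives a quick check on Examples 29 and 30. The sketch below is illustrative only: the function name `birth_death_stationary`, the truncation of the infinite sum at 200 states and the rates λ = 1, µ = 2 are assumptions chosen for the example, and the truncation is only sensible when the sum converges.

```python
import math

def birth_death_stationary(lam, mu, n_states):
    """Approximate stationary probabilities pi_0..pi_{n_states-1} from the product formula.

    Assumption: the infinite sum in Example 28 is truncated at n_states terms.
    """
    weights = [1.0]  # weight for j = 0 (empty product)
    for j in range(1, n_states):
        # weight_j = lambda_0 ... lambda_{j-1} / (mu_1 ... mu_j)
        weights.append(weights[-1] * lam(j - 1) / mu(j))
    total = sum(weights)
    return [w / total for w in weights]

lam_rate, mu_rate = 1.0, 2.0

# M/M/1 queue (Example 29): lambda_i = lambda, mu_i = mu.
pi_mm1 = birth_death_stationary(lambda i: lam_rate, lambda i: mu_rate, 200)

# Immigration-death process (Example 30): lambda_i = lambda, mu_i = mu * i.
pi_imm = birth_death_stationary(lambda i: lam_rate, lambda i: mu_rate * i, 200)
```

With these rates `pi_mm1` should match the geometric probabilities (1 − λ/µ)(λ/µ)^j and `pi_imm` the Poisson probabilities with parameter λ/µ, up to truncation and rounding error.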

5.5 Evolution of the chain

Holding Times:

Suppose the chain is initially in state i. Denote by H0 the length of time before it first leaves i, the first holding time, and let F(t) = P(H0 > t).² Then from the Markov property

F(t+ δt) = F(t)(1 + giiδt+ o(δt)).

On subtracting F(t) from both sides, dividing by δt and taking the limit it follows that

F ′(t) = giiF(t).

The solution is

\[
F(t) = e^{g_{ii} t},
\]

so that H0 has an exponential distribution with rate parameter −gii (recall that gii < 0).

For notational convenience we can write −gkk as gk. Then H0 is exponential with rate

parameter gi.

First Jump Probabilities:

When the chain leaves i where does it go? Plausibly

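\[
\begin{aligned}
& P\bigl(i \to j \text{ during } (t, t + \delta t) \mid \text{leave } i \text{ during } (t, t + \delta t)\bigr) \\
&\qquad = \frac{g_{ij}\,\delta t + o(\delta t)}{-g_{ii}\,\delta t + o(\delta t)} \quad \text{for } j \neq i \\
&\qquad \to \frac{g_{ij}}{g_i} = p_{ij}, \text{ say, as } \delta t \to 0.
\end{aligned}
\]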

The pij are called first jump probabilities. For a conservative chain

\[
\sum_{j \neq i} p_{ij} = 1 \quad \text{for each } i,
\]

² So 1 − F(t) is the cumulative distribution function of H0.


so the first jump probabilities form a stochastic matrix (with zero diagonal). The corre-

sponding discrete time chain is called the first jump chain. It is further true that if J is the

destination state on leaving i,

\[
P(J = j \mid H_0 = t) = p_{ij}, \quad j \neq i; \tag{47}
\]

that is, J is independent of H0. The following is a heuristic argument for (47).

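\[
\begin{aligned}
P\bigl(J = j \mid H_0 \in (t, t + \delta t)\bigr)
&= P\bigl(X(t + \delta t) = j \mid H_0 \in (t, t + \delta t)\bigr) + o(\delta t) \\
&\qquad \text{since the chance of 2 jumps in } \delta t \text{ is } o(\delta t) \\
&= \frac{P\bigl(X(t + \delta t) = j,\ H_0 > t\bigr)}{P\bigl(H_0 \in (t, t + \delta t)\bigr)} + o(\delta t) \\
&= \frac{P\bigl(X(t + \delta t) = j \mid X(s) = i,\ s \leq t\bigr)\, P(H_0 > t)}{P\bigl(H_0 \in (t, t + \delta t)\bigr)} \\
&= \frac{\bigl(g_{ij}\,\delta t + o(\delta t)\bigr)\, e^{g_{ii} t}}{(-g_{ii})\, e^{g_{ii} t}\,\delta t + o(\delta t)} \\
&\to \frac{g_{ij}}{g_i} \quad \text{as } \delta t \to 0.
\end{aligned}
\]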

How the chain develops:

Thus the development of the process can be described as follows. The process remains in its

initial state, i say, for an exponentially distributed time with parameter gi. It then jumps to another state j ≠ i chosen according to the first jump probabilities pij, and so on.

The sequence of states passed through, the discrete time Markov chain with transition

matrix P = (pij), describes the space structure of the process; and the rate of movement

through this structure, the chain’s time evolution, is determined by the exponential holding

times.

These facts are widely useful. They give a ready means to simulate continuous time Markov

chains. They are the basis for inference about chains as described in section 5.6. They also

give a means to explore properties of the continuous time chain that depend only on its

space structure. To see how, note that the jump chain, the discrete time chain obtained

by observing the continuous time process only when it changes state, passes through the

same sequence of states as the continuous time process, so properties depending on space

but not time structure – for example, probabilities of absorption – can be obtained from

the discrete jump chain.

Note, however, that the jump chain and the underlying continuous time chain will often

have different stationary distributions.
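The description above translates directly into a simulation recipe: draw an exponential holding time with rate gi, then choose the next state with probabilities gij/gi. The following is a minimal sketch under stated assumptions: a finite state space labelled 0, . . . , n−1, a generator supplied as a list of rows, and a hypothetical function name `simulate_ctmc`. It returns the visited states with their holding times, the last one being cut off at the end of the observation period.

```python
import random

def simulate_ctmc(G, start, t_end, rng):
    """Simulate a continuous time Markov chain with generator G on [0, t_end].

    Returns a list of (state, holding_time); the final holding time is censored at t_end.
    """
    path, state, clock = [], start, 0.0
    while True:
        rate = -G[state][state]           # g_i = -g_ii
        if rate <= 0.0:                   # absorbing state: stay until t_end
            path.append((state, t_end - clock))
            return path
        hold = rng.expovariate(rate)      # exponential holding time with rate g_i
        if clock + hold >= t_end:
            path.append((state, t_end - clock))
            return path
        path.append((state, hold))
        clock += hold
        # Choose the destination with first jump probabilities g_ij / g_i.
        others = [j for j in range(len(G)) if j != state]
        weights = [G[state][j] for j in others]
        state = rng.choices(others, weights=weights)[0]

# Illustrative run on the two state chain of Example 25.
G_two_state = [[-1.0, 1.0], [2.0, -2.0]]
example_path = simulate_ctmc(G_two_state, 0, 10.0, random.Random(0))
```

Over a long run of this two state chain, the fraction of time spent in the first state should settle near 2/3, in line with the limit found in Example 25.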

Example 31. Linear birth-death process

Recall from Example 22 that we have gi,i+1 = λi, gi,i−1 = µi and gi,i = −(λ+ µ)i.

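The holding time in state i thus has an exponential distribution with parameter −gi,i = (λ + µ)i (and hence mean 1/((λ + µ)i)). The first jump probabilities are

\[
p_{i,i+1} = \frac{g_{i,i+1}}{-g_{i,i}} = \frac{\lambda i}{(\lambda + \mu) i} = \frac{\lambda}{\lambda + \mu},
\]

and

\[
p_{i,i-1} = \frac{g_{i,i-1}}{-g_{i,i}} = \frac{\mu i}{(\lambda + \mu) i} = \frac{\mu}{\lambda + \mu},
\]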

for any i ≥ 1. State 0 is a special case; the chain cannot leave.

The behaviour of the jump chain is then similar to the Gambler’s Ruin discrete time chain

studied in MAS275, with the probability of jumping up being λ/(λ + µ) and that of jumping down being µ/(λ + µ), except at state 0 which cannot be left. The only difference is that in MAS275 there

was an upper target N at which the gambler would stop, whereas here the population may

increase indefinitely.

From MAS275, the probability of reaching N before 0 starting at state i is

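\[
q_{N,i} = \begin{cases}
\dfrac{1 - (\mu/\lambda)^i}{1 - (\mu/\lambda)^N} & \lambda \neq \mu \\[2ex]
\dfrac{i}{N} & \lambda = \mu.
\end{cases}
\]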

Letting N → ∞, we see that qN,i → 0 if µ ≥ λ, indicating that the size of the population will not get indefinitely large and will eventually reach 0, so the population will become extinct with probability 1. On the other hand, if µ < λ then there are two possibilities. We have that qN,i → 1 − (µ/λ)^i, indicating that, if the chain starts in state i, there is a probability 1 − (µ/λ)^i that the population will become indefinitely large and not become extinct, but there is also a probability (µ/λ)^i that the population will reach 0, and hence become extinct.
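The extinction probability (µ/λ)^i can be checked by simulating the jump chain alone, since extinction depends only on the sequence of states and not on the holding times. The sketch below is a rough Monte Carlo check under assumptions chosen for illustration: rates λ = 2 and µ = 1, starting size 2, and an upper barrier of 60 standing in for "indefinitely large". The function name `extinction_fraction` is hypothetical.

```python
import random

def extinction_fraction(lam, mu, start, upper, runs, rng):
    """Fraction of simulated jump chains that hit 0 before reaching `upper`."""
    p_up = lam / (lam + mu)               # first jump probability p_{i,i+1}
    extinct = 0
    for _ in range(runs):
        state = start
        while 0 < state < upper:
            state += 1 if rng.random() < p_up else -1
        extinct += (state == 0)
    return extinct / runs

estimate = extinction_fraction(2.0, 1.0, 2, 60, 4000, random.Random(0))
theory = (1.0 / 2.0) ** 2                 # (mu / lambda)^i with i = 2
```

The estimate is subject to Monte Carlo error, so it should be close to, but not exactly equal to, the theoretical value.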

5.6 Likelihood and inference

How do we fit a continuous-time Markov chain model to observations?

Suppose we have a complete record of the chain’s evolution over the period [0, t], {x(s) :

0 ≤ s ≤ t}. Suppose too that the model is parametrized by its generator matrix G. The

likelihood of G is the probability (density) of the observation as a function of the parameters.

According to section 5.5 the chain’s path can be described in terms of the sequence of states

passed through and the lengths of holding time in each.

Thus let {Xk : k = 0, 1, . . . , Nt} denote the sequence of states passed through by time t, Nt being the total number of jumps. The holding time in state Xk is denoted by Hk. A special case is that the last holding time $H_{N_t}$ is incomplete at time t, as the chain has not yet left state $X_{N_t}$ at this time.

Then, conditionally on X0 = x0, the likelihood has the following terms:

• For each completed holding time $h_k$ in state $x_k$, representing an observation from an exponential random variable with parameter $g_{x_k}$, a term $g_{x_k} e^{-g_{x_k} h_k}$.

• For each observed jump from state $x_k$ to $x_{k+1}$, a term $p_{x_k x_{k+1}}$ corresponding to the probability of the jump.

• The holding time $h_{n_t}$ is incomplete at time t. So we know that the realization of the exponential random variable $H_{n_t}$ is at least $h_{n_t}$, which has probability $e^{-g_{x_{n_t}} h_{n_t}}$, and so this is the contribution to the likelihood.


Somewhat unusually, this includes both “discrete” terms corresponding to specific proba-

bilities, and “continuous” terms coming from a probability density function. Multiplying

the contributions together gives the likelihood as

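\[
\begin{aligned}
L(G : x(s), 0 \leq s \leq t)
&= \Bigl\{ \prod_{k=0}^{n_t - 1} g_{x_k} e^{-g_{x_k} h_k}\, p_{x_k x_{k+1}} \Bigr\}\, e^{-g_{x_{n_t}} h_{n_t}} \\
&= \prod_{k=0}^{n_t} e^{-g_{x_k} h_k} \prod_{k=0}^{n_t - 1} \bigl( g_{x_k}\, p_{x_k x_{k+1}} \bigr) \\
&= \prod_{k=0}^{n_t} e^{-g_{x_k} h_k} \prod_{k=0}^{n_t - 1} g_{x_k x_{k+1}} \\
&= \exp\Bigl( -\sum_i g_i a_i \Bigr) \prod_{i \neq j} g_{ij}^{\,n_{ij}},
\end{aligned} \tag{48}
\]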

where ai is the total time spent in state i during [0, t] and nij is the number of transitions

observed from state i to state j during [0, t].

Aside:

A consequence of (48) is that the values of nt, the ai and the nij are the only aspects of

the data relevant for the likelihood L. If these statistics are known, the rest of the record

is irrelevant; they are sufficient statistics.

Maximum Likelihood Estimates:

From (48) the log likelihood is

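\[
l = -\sum_i g_i a_i + \sum_{i \neq j} n_{ij} \log g_{ij},
\]

which, because $g_i = -g_{ii} = \sum_{j \neq i} g_{ij}$, is the same as

\[
l = -\sum_i \sum_{j \neq i} g_{ij} a_i + \sum_i \sum_{j \neq i} n_{ij} \log g_{ij}.
\]

Thus

\[
\frac{\partial l}{\partial g_{ij}} = -a_i + \frac{n_{ij}}{g_{ij}}
\]

and is zero at

\[
\hat{g}_{ij} = \frac{n_{ij}}{a_i}, \quad i \neq j. \tag{49}
\]

The estimate for gi = −gii follows from

\[
\hat{g}_i = -\hat{g}_{ii} = \sum_{j \neq i} \hat{g}_{ij} = \sum_{j \neq i} \frac{n_{ij}}{a_i} = \frac{n_{i\cdot}}{a_i}, \tag{50}
\]

where $n_{i\cdot} = \sum_{j \neq i} n_{ij}$ is the total number of observed jumps out of state i.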

Notes:


1. The estimates (49) have the form (number of transitions i → j) per unit time spent in state i. This is very reasonable in the light of the interpretation from (37) of gij as the rate at which transitions from i to j occur.

2. In many problems the generator elements gij are themselves functions of a lower dimension vector parameter, θ say. In that case the maximum likelihood estimate θ̂ would be found as the value at which l is maximum with respect to θ, and the generator elements would be estimated as gij(θ̂).

3. The asymptotic covariance matrix of the estimates is given as usual by the inverse of the information matrix, where the information matrix is −1 times the matrix of second derivatives of the log-likelihood l with respect to the parameters.
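As a concrete illustration of (49), the sufficient statistics ai and nij can be accumulated from a recorded path and turned into estimates. The sketch below assumes a particular (hypothetical) data layout, a list of (state, holding time) pairs whose final entry is the incomplete holding time, and the function name `generator_mle` and the small data set are made up for the example.

```python
from collections import defaultdict

def generator_mle(path):
    """Maximum likelihood estimates g_hat[(i, j)] = n_ij / a_i from an observed path.

    `path` is a list of (state, holding_time); the last holding time is censored.
    """
    time_in = defaultdict(float)   # a_i: total time spent in state i
    jumps = defaultdict(int)       # n_ij: number of observed transitions i -> j
    for k, (state, hold) in enumerate(path):
        time_in[state] += hold
        if k + 1 < len(path):
            jumps[(state, path[k + 1][0])] += 1
    return {(i, j): n / time_in[i] for (i, j), n in jumps.items()}

# Made-up record of a two state chain: a_0 = 5.0, a_1 = 2.0, n_01 = n_10 = 2.
path = [(0, 1.5), (1, 0.5), (0, 2.5), (1, 1.5), (0, 1.0)]
g_hat = generator_mle(path)
```

For this record the estimates are 2/5 for the rate from state 0 to state 1 and 2/2 for the rate from state 1 to state 0. Only pairs (i, j) with at least one observed jump appear in the result; unobserved transitions have estimate zero.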

Example 32. Inference for a linear birth-death process.

Suppose λn = λ n and µn = µ n (that is, λ times n and µ times n), for n = 0, 1, . . . , where λ and µ are unknown. The generator elements are

\[
g_{ij} = \begin{cases}
\lambda i & j = i+1 \\
-(\lambda + \mu) i & j = i \\
\mu i & j = i-1.
\end{cases}
\]

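Suppose as above that the process has been observed over an interval [0, t], and the times ai and transition counts nij recorded. Then the log-likelihood for λ and µ is

\[
\begin{aligned}
l &= -\sum_i g_i a_i + \sum_{i \neq j} n_{ij} \log g_{ij} \\
&= -\sum_i (\lambda + \mu)\, i\, a_i + \sum_i n_{i,i+1} \log(\lambda i) + \sum_i n_{i,i-1} \log(\mu i).
\end{aligned}
\]

Thus

\[
\frac{\partial l}{\partial \lambda} = -\sum_i i a_i + \frac{1}{\lambda} \sum_i n_{i,i+1}, \qquad
\frac{\partial l}{\partial \mu} = -\sum_i i a_i + \frac{1}{\mu} \sum_i n_{i,i-1},
\]

so that

\[
\hat{\lambda} = \sum_i n_{i,i+1} \Big/ \sum_i i a_i \tag{51}
\]

\[
\hat{\mu} = \sum_i n_{i,i-1} \Big/ \sum_i i a_i. \tag{52}
\]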

These estimators have a reasonable form: $b = \sum_i n_{i,i+1}$ and $d = \sum_i n_{i,i-1}$ are the total number of births and deaths respectively, and $\sum_i i a_i$ is the total accumulated time, T say (person-years, say), during which birth or death could happen to one individual, so (51) is the observed birth rate b/T, and (52) the observed death rate d/T, per individual per unit time.

To find standard errors we calculate the observed information matrix J . From the above

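\[
\frac{\partial^2 l}{\partial \lambda^2} = -b/\lambda^2, \qquad
\frac{\partial^2 l}{\partial \mu\, \partial \lambda} = 0, \qquad
\frac{\partial^2 l}{\partial \mu^2} = -d/\mu^2,
\]

so that

\[
J = \begin{pmatrix} b/\lambda^2 & 0 \\ 0 & d/\mu^2 \end{pmatrix}
\quad \text{and} \quad
J^{-1} = \begin{pmatrix} \lambda^2/b & 0 \\ 0 & \mu^2/d \end{pmatrix}.
\]

Estimated standard errors for λ̂ and µ̂ are obtained from J⁻¹ by taking square roots of the appropriate elements and substituting estimates for unknown parameters. For example

\[
\mathrm{ese}(\hat{\lambda}) = \hat{\lambda}/\sqrt{b},
\]

and an approximate 95% confidence interval for λ is

\[
\hat{\lambda} \pm 2\hat{\lambda}/\sqrt{b}.
\]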
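The estimators (51) and (52) and their standard errors can be computed directly from a record of population sizes and holding times. The following sketch uses an invented record and a hypothetical function name `linear_bd_mle`; it assumes the path is a list of (population size, holding time) pairs with the last holding time incomplete, and that at least one birth and one death were observed (otherwise the standard error formula divides by zero).

```python
import math

def linear_bd_mle(path):
    """Estimates (lambda_hat, mu_hat, ese_lambda, ese_mu) for a linear birth-death process.

    `path` is a list of (population_size, holding_time); the last entry is censored.
    Assumes at least one birth and one death appear in the record.
    """
    births = deaths = 0
    exposure = 0.0                        # T = sum_i i * a_i
    for k, (size, hold) in enumerate(path):
        exposure += size * hold
        if k + 1 < len(path):
            step = path[k + 1][0] - size
            if step == 1:
                births += 1
            elif step == -1:
                deaths += 1
    lam_hat = births / exposure           # equation (51): b / T
    mu_hat = deaths / exposure            # equation (52): d / T
    return lam_hat, mu_hat, lam_hat / math.sqrt(births), mu_hat / math.sqrt(deaths)

# Invented record: sizes 1, 2, 3, 4, 3 give b = 3, d = 1 and T = 12.
record = [(1, 2.0), (2, 1.0), (3, 1.0), (4, 0.5), (3, 1.0)]
lam_hat, mu_hat, ese_lam, ese_mu = linear_bd_mle(record)
```

With so few events the standard errors are large relative to the estimates, which is what the formula λ̂/√b predicts.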

5.7 Beyond . . .

Continuous-time Markov chain models and developments of them are used in many areas;

for example in:

• population studies, where, for example, they have been developed to take account of

age structure and spatial distributions of animal populations;

• manpower planning for large organizations;

• studies of social and occupational mobility;

• diffusion of news, rumours and internet viruses;

• competition, in ecology and in animal populations;

• predator-prey phenomena;

• disease modelling, including cancer growth;

• epidemic modelling.

The following sketch of epidemic modelling aims to indicate the subject-matter and some

of the demands it makes for development of ideas and techniques.


5.7.1 Epidemics

The modelling of epidemics is important because it provides means to formulate and evaluate

possible control measures and eradication and mitigation programmes. Epidemics, both in

human and animal populations, are, of course, a source of high public and governmental

concern.

The following is a simple way to begin thinking about an epidemic. (It’s over-simplified,

but does capture enough of what actually goes on in enough cases to have value.) We

suppose that each individual in the study population may be either susceptible, infective

or removed. A susceptible individual is one who may catch the relevant disease by contact

with an infective individual. The removed individuals are those who, having caught the

disease earlier, have now recovered and become immune or have been removed (by death

or by being isolated, for example) and are no longer open to re-infection themselves or able

to infect any one else.

Suppose that the population consists of n individuals, and denote the numbers who are

susceptible, infective or removed at time t by S(t), I(t) and R(t); thus S(t)+I(t)+R(t) = n.

A susceptible individual can catch the disease by meeting an infectious individual and

being infected by them. At time t the rate of occurrence of such meetings and infections

will depend on the numbers S(t) and I(t) of both susceptibles and infectives, and it may

be reasonable to suppose proportionality, so that the rate of change in the number of

susceptibles is

\[
\frac{dS(t)}{dt} = -\beta S(t) I(t) \tag{53}
\]

for some constant β > 0 that we might call the infection rate. The number of infectives

I(t) will be affected by the same encounters, but will be increased by them. It will also

be subject to reduction, however, as infectives pass through the disease to reach removed

status. The rate at which infectious individuals come to the end of their infectious period

will be larger the more infectious individuals there are at the time. If the rate is assumed

to be proportional to I(t) then

\[
\frac{dI(t)}{dt} = \beta S(t) I(t) - \gamma I(t) = (\beta S(t) - \gamma) I(t), \tag{54}
\]

where γ is a constant referred to as the removal rate.

From (53) and (54) it follows that the rate of change of the number removed is

\[
\frac{dR(t)}{dt} = \gamma I(t).
\]

It would be reasonable to suppose that the epidemic begins from a small number I0 of

infectives at time 0. The solution of equations (53) and (54) then gives a deterministic

prediction for the progress of the epidemic until no new infections are occurring.

One important deduction can be made, however, without solving the equations. An epi-

demic can be said to occur only if at the very least there is an increase in the number of

infective individuals. Thus unless the condition

\[
\frac{dI(t)}{dt} > 0 \quad \text{at } t = 0 \tag{55}
\]

holds, no epidemic can occur. From (54), if

\[
\beta S_0 - \gamma < 0, \quad \text{that is} \quad S_0 < \gamma/\beta = \rho, \text{ say,} \tag{56}
\]

(55) fails and therefore there is no epidemic. Thus there is a threshold number of susceptible

individuals, ρ, called the relative removal rate, needed before an epidemic can take off.

This has an important implication for immunization policies; they are likely to prevent an

epidemic only if they can reduce the number of susceptibles below the threshold level.

For the case S0 > ρ the equations (53) and (54) may be solved to show the course of

the epidemic. Data from epidemics often record the number of deaths per day, which corresponds to dR/dt, often called the epidemic curve, as seen in Figure 9.

Figure 9: An example epidemic curve

The form of the curve predicted by the model reflects a common observation in many actual

epidemics: that the number of new cases reported each day rises to a peak level then falls

away again.

The differential equation approach and the threshold effect above were first proposed by

Kermack & McKendrick in 1927, though it was not until much later that a full solution

of the equations was obtained by D G Kendall. Kermack & McKendrick also showed that

when S0 > ρ, so that an epidemic occurs, the final number of susceptibles remaining at the

end of the epidemic is approximately ρ− (S0 − ρ); that is, the final number is as far below

the threshold as it was above it initially.
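The deterministic equations (53) and (54) can be integrated numerically to see the threshold effect and the final size approximation. The sketch below uses a simple forward Euler scheme, which is chosen here for brevity and is not the method used by Kermack & McKendrick or Kendall. The function name `sir_euler`, the step size, the stopping rule and the parameter values (β = 0.001, γ = 0.1, S0 = 120, I0 = 1) are all assumptions for illustration.

```python
def sir_euler(s0, i0, beta, gamma, dt=0.05, t_max=20000.0):
    """Integrate dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I by Euler steps.

    Stops when the number of infectives is negligible or t_max is reached.
    Returns the final (S, I, R).
    """
    s, i, r, t = float(s0), float(i0), 0.0, 0.0
    while i > 1e-6 and t < t_max:
        new_infections = beta * s * i * dt
        new_removals = gamma * i * dt
        s, i, r = s - new_infections, i + new_infections - new_removals, r + new_removals
        t += dt
    return s, i, r

beta, gamma = 0.001, 0.1
rho = gamma / beta                        # threshold, here about 100
s_final, i_final, r_final = sir_euler(120, 1, beta, gamma)
approx_final = rho - (120 - rho)          # Kermack & McKendrick approximation
```

Here S0 is only slightly above ρ, which is the regime in which the approximation ρ − (S0 − ρ) is expected to be reasonable; the numerical final number of susceptibles should fall below ρ and lie close to `approx_final`.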

Though the deterministic model above carries some useful information, it does not coincide

with the observation in actual epidemics that they do not proceed smoothly, but are subject

to apparently random fluctuations. A stochastic model is therefore of interest. Randomness

can be introduced by replacing the rates of change of numbers in the deterministic equations by the probabilities of transitions. Suppose then that during (t, t + δt)

\[
(S, I) \longrightarrow \begin{cases}
(S - 1, I + 1) & \text{with probability } \beta S I\,\delta t + o(\delta t) \\
(S, I - 1) & \text{with probability } \gamma I\,\delta t + o(\delta t).
\end{cases} \tag{57}
\]

This is a Markov chain on the state space of pairs (S, I). The general solution is quite

complicated, but useful information can be extracted by the following simple argument.

At the beginning of the epidemic the number of susceptibles S will be not many fewer than

n if the initial number of infectives is small, and the reduction in S will be slow. Thus the

transition probabilities in (57) will be approximately

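\[
I \longrightarrow \begin{cases}
I + 1 & \text{with probability } \beta S_0 I\,\delta t + o(\delta t) \\
I - 1 & \text{with probability } \gamma I\,\delta t + o(\delta t),
\end{cases} \tag{58}
\]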

which are the transition probabilities of a simple linear birth-death process with birth rate λi = βS0 i and death rate µi = γ i. From section 5.5 it follows that the infectives will die out if βS0 ≤ γ (that is, if S0 ≤ ρ) and therefore there will be no major epidemic. However, if βS0 > γ (that is, if S0 > ρ), there is still a probability (γ/(βS0))^I0 of extinction, where the exponent is the initial number of infectives I0, so that a major epidemic may take place then rather than will take place.

The stochastic approach therefore yields slightly different conclusions to the deterministic approach, though agreeing in some essentials.
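The stochastic model (57) can be simulated with the holding time and first jump description of section 5.5: in state (S, I) the total rate of leaving is βSI + γI, and the next event is an infection with probability βSI/(βSI + γI). The sketch below is one possible implementation under assumptions made for illustration; the function name `simulate_sir` and the parameter values are not from the notes.

```python
import random

def simulate_sir(s0, i0, beta, gamma, rng):
    """Simulate the stochastic epidemic (57) until no infectives remain.

    Returns a list of (time, S, I) recorded at time 0 and after each event.
    """
    t, s, i = 0.0, s0, i0
    history = [(t, s, i)]
    while i > 0:
        infect_rate = beta * s * i
        remove_rate = gamma * i
        total = infect_rate + remove_rate
        t += rng.expovariate(total)       # exponential holding time in state (S, I)
        if rng.random() < infect_rate / total:
            s, i = s - 1, i + 1           # infection: (S, I) -> (S - 1, I + 1)
        else:
            i -= 1                        # removal: (S, I) -> (S, I - 1)
        history.append((t, s, i))
    return history

history = simulate_sir(50, 2, 0.01, 0.2, random.Random(3))
final_time, final_s, final_i = history[-1]
```

Repeating the simulation with different seeds shows the behaviour described above: some runs die out after a handful of events, while others infect a large part of the population.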

The models above are at best only rough approximations to the complexity of real diseases.

For many diseases there is an incubation period following infection, during the first part of

which (the latent period) the sufferer is not infectious. The infectious period which follows

may begin before symptoms appear at the end of the incubation period, and therefore before the sufferer is aware of having caught the disease and before any isolation or treatment measures could be put into operation. The transmission, even of simple diseases, is unlikely to be

described exactly by the proportional mixing assumptions in the models above. In human

populations, for example, contacts between individuals are not equally likely between all

members. We each have a sub-population of family, friends, colleagues and others with

whom we come into contact to varying degrees, and these sub-populations themselves differ

but overlap in complicated ways. Some diseases such as malaria are transmitted by an

intermediate insect or animal vector, requiring at least a further stage in the epidemic model,

and in all diseases spatial location and movement of susceptibles, infectives and, where

appropriate, vectors, can be key factors for understanding the epidemic. Modern epidemic

models, such as those which have played a key role in responses to recent outbreaks of BSE,

foot-and-mouth disease and avian flu, use a combination of stochastic and deterministic

methods to cope with such complexities. These models also incorporate inference methods

that allow data on the spread of the epidemic as it develops to be assimilated into the model

in real time to continuously update predictions, a vital capability for effective responses.

6 Modelling sets of points

Aspects of many phenomena can be represented by sets of points, so point process models

are widely useful. These sets of points may refer to moments in time, in which case they

are thought of as points in one dimensional space, such as events in a Poisson process (see


section 2.1), or more generally they may refer to locations in space, often two or three

dimensional.

6.1 Examples

Example 33. Floods in Burbage Brook

Figure 10 shows times and sizes of floods in Burbage Brook 1925-1982. In the plot ‘flood’ is taken to be flow over 4 cumecs.³ Initial interest might be in modelling the times of floods

as a point process. Times and magnitudes together can be regarded as a point process in two

dimensions.

Figure 10: Times of floods in Burbage Brook

Example 34. Insurance claims

Figure 11 shows major fire insurance claims in Denmark from 1980 to 1990, from Embrechts, Klüppelberg & Mikosch 1997. There might be interest in modelling the times alone, or the times and sizes combined.

Example 35. Japanese pine saplings

Figure 12 shows the locations of saplings of Japanese black pines, collected by Numata (1961).

Models for spatial patterns like this are of interest; questions could include whether there is any

evidence for clustering or repulsion.

Example 36. East Yorkshire leukaemia cases

Figure 13 gives the locations of cases of leukaemia in children in East Yorkshire from 1974 to 1986, and locations of a second set of children without leukaemia but otherwise matched to the cases. Interest lies in modelling the two patterns and in whether there is evidence for differences between them.

³ 1 cumec is 1 cubic metre per second.


Figure 11: Major fire insurance claims in Denmark, 1980–1990

6.2 Fitting a Poisson process model to data

In this section we consider how to fit a Poisson process model to some data.

Given observation of a point process over an interval [0, t], how can we fit a Poisson process

model? That is, how do we estimate the rate λ? Also, having fitted a Poisson process, how

can we assess the adequacy of the model?

There are two initial possibilities for the estimation:

1. from the observed number of events N(t) = n;

2. by the methods developed for continuous time Markov chains in section 5.6.

Details:

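1. Fitting using $N(t) \sim \mathrm{Po}(\lambda t)$, the log-likelihood having observed $N(t) = n$ is

$$
\begin{aligned}
l &= -\lambda t + n \log(\lambda t) + \text{constant} \\
  &= -\lambda t + n \log(\lambda) + n \log(t) + \text{constant} \\
  &= -\lambda t + n \log(\lambda) + \text{constant} \qquad (59)
\end{aligned}
$$

so the derivative is

$$\frac{\partial l}{\partial \lambda} = -t + \frac{n}{\lambda},$$

so that the maximum likelihood estimator is $\hat\lambda = n/t$, and considering the second derivative, $\mathrm{ese}(\hat\lambda) = \hat\lambda/\sqrt{n}$.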

2. Fitting using the Markov chain method of section 5.6:


Figure 12: Locations of Japanese pine saplings

Figure 13: East Yorkshire leukaemia data, 1974-1986 (Lawson 2003)

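The durations in states $i = 0, \ldots, n$ are

$$a_0 = T_1, \; \ldots, \; a_{n-1} = T_n, \quad a_n = t - \sum_{i=1}^{n} T_i.$$

Also, the transition rates in the chain are simply

$$g_{i,i+1} = \lambda = -g_{ii} = g_i$$

so that the log-likelihood is

$$
\begin{aligned}
l &= -\sum_i g_i a_i + \sum_{i \neq j} n_{ij} \log g_{ij} \\
  &= -\lambda \sum_i a_i + \log\lambda \sum_i n_{i,i+1} \\
  &= -\lambda t + n \log\lambda \qquad (60)
\end{aligned}
$$

since $\sum_i a_i = \sum_{i=1}^{n} T_i + \left(t - \sum_{i=1}^{n} T_i\right) = t$ and $\sum_i n_{i,i+1}$ = total number of points in $[0, t]$.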

Since (60) is the same as the log-likelihood (59) from approach 1, inferences from the

two approaches are the same. (Thus, slightly surprisingly at first sight, what appears

to be extra information here, the values of individual Ti, doesn’t actually make any

difference. The reason will emerge later.)
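The estimate from either approach is simple enough to compute directly. The following is a minimal sketch of that calculation in Python; the function name `fit_homogeneous_rate` and the numbers in the illustrative call are assumptions made here, not part of the notes.

```python
import math

def fit_homogeneous_rate(n_events, t_obs):
    """MLE of a constant Poisson rate and its estimated standard error.

    Maximises l = -lambda*t + n*log(lambda). The second derivative is
    -n/lambda**2, so ese(lambda_hat) = lambda_hat / sqrt(n).
    """
    if n_events <= 0 or t_obs <= 0:
        raise ValueError("need at least one event and a positive observation time")
    rate_hat = n_events / t_obs
    ese = rate_hat / math.sqrt(n_events)
    return rate_hat, ese

# Illustrative call with made-up numbers: 20 events in 10 time units.
rate_hat, ese = fit_homogeneous_rate(20, 10.0)
print(f"rate = {rate_hat:.3f}, ese = {ese:.3f}")
```

Only the count and the length of the observation window enter the function, which mirrors the remark above that the individual times add nothing to the estimate.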

Possibilities for checks on model adequacy:

Any means of checking the properties of a Poisson process can be used. For example,

the interval [0, t] could be divided into a number of equal sub-intervals, the number of

points noted in each, and the resulting data used in a $\chi^2$ goodness-of-fit test for a Poisson

distribution. Similarly a check could be based on the distribution of the intervals between

points, which should be exponential.

Example 37. Burbage Brook floods

Between 1925 and 1983 (inclusive) there were 48 flood events (flows ≥ 4 cumecs) in Burbage

Brook. Since the observation period is $t = 59$ years, the maximum likelihood estimator of the rate of occurrence is

$$\hat\lambda = 48/59 = 0.81, \quad \text{with ese} = 0.12$$

events/year.

Figure 14 shows a histogram of the inter-event times in days.

Figure 14: Burbage interflood times (histogram; x-axis: intervals between floods in days, y-axis: frequency)

The sample mean interval between floods is found to be 423.3 days, and standard deviation 457.9 days. An exact fit to an exponential distribution cannot be expected for these data because of discreteness, but the effect

of discreteness here (making a difference of ±1 in observations whose mean is over 400, and

so accounting for discrepancies of less than ±0.25% from a continuous variable) is negligible.

However the strength of evidence for or against an exponential distribution is not completely

clear from Fig. 14.


The property of the conditional distribution of points of a Poisson process, Theorem 3, gives

another way to write the likelihood for the process:

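$$
\begin{aligned}
L &= P(N(t) = n) \times P(\text{positions of the points} \mid N(t) = n) \\
  &= \frac{e^{-\lambda t}(\lambda t)^n}{n!} \times \frac{1}{t^n} \propto e^{-\lambda t}\lambda^n,
\end{aligned}
$$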

of course leading to the same result as before. The fact that, given the number N(t) of

points in the interval, the conditional distribution of the positions of the points no longer

depends on λ explains the finding that knowledge of the specific locations of points makes

no difference to the estimate of λ once the number of points is known: if the distribution

does not depend on λ then the observations contain no further information about it. (The

statistic N(t) is said to be sufficient for λ in this case.)

The same property suggests a further check on the fit of a Poisson process model: check

whether the positions of points within [0, t] are consistent with having been generated from

a U(0, t) distribution.
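One way to make this uniformity check concrete is a Kolmogorov-Smirnov comparison of the scaled event times with the U(0, 1) distribution function. This is a swapped-in technique chosen here for illustration (the notes do not name a specific test), and the function names and the asymptotic p-value approximation are assumptions of this sketch.

```python
import math

def ks_uniform_statistic(times, t_obs):
    """Kolmogorov-Smirnov distance between event times and U(0, t_obs)."""
    u = sorted(v / t_obs for v in times)
    n = len(u)
    if n == 0:
        raise ValueError("need at least one event time")
    d = 0.0
    for i, x in enumerate(u):
        # Compare the uniform cdf (x itself) with the empirical cdf
        # just after the i-th ordered point and just before it.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d

def ks_asymptotic_pvalue(d, n, terms=100):
    """Asymptotic p-value from the Kolmogorov series (rough for small n)."""
    z = d * math.sqrt(n)
    if z < 0.2:
        return 1.0  # the series converges slowly here; the p-value is essentially 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Made-up event times, bunched at the start of a window of length 100.
clustered = [1.0, 2.0, 3.0, 4.0, 5.0]
d = ks_uniform_statistic(clustered, 100.0)
p = ks_asymptotic_pvalue(d, len(clustered))
```

A small p-value suggests that the positions are not consistent with a uniform distribution, and hence that a constant-rate Poisson process may not be adequate.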

Example 38. Burbage flood dates

Figure 15 shows a histogram of the dates of the Burbage flood events. There is some apparent

unevenness but no more than would be expected from a sample from a uniform distribution.

Figure 15: Histogram of Burbage flood dates (x-axis: days from 01/01/1900; y-axis: frequency)

6.3 Inhomogeneous one-dimensional Poisson processes

The rate λ governs the probability that an event/point occurs in an arbitrarily short interval.

So far we have assumed that it is constant. However in many applications it is plausible

that events could occur randomly and independently, but that the probability of occurrence

could change over time. A flood, for example, might be more likely in winter, arrivals at a

casualty department more likely in the rush hour, etc.


6.3.1 Definition and properties

We therefore now allow λ = λ(t) to depend on time. Generalising the characterisation of

the Poisson process in section 5.1, we now assume that the counting process N(t) satisfies

P (N(t+ h) = i+ 1 |N(t) = i) = λ(t)h+ o(h)

and

P (N(t+ h) = i |N(t) = i) = 1− λ(t)h+ o(h),

and that the probabilities of all other changes are o(h). The resulting process is called an

inhomogeneous Poisson process of rate (or intensity) λ(t).

Let N(t) be an inhomogeneous Poisson process of intensity λ(t), and define

$$\Lambda(t) = \int_0^t \lambda(u)\,du.$$

Suppose we change the time-scale and define a new counting process

M(s) = N(t) where s = Λ(t).

Let t(s) = Λ−1(s) be the inverse transformation of the time scale, taking the new time s

back to t. Note that a small change s→ s+ h on the s-scale corresponds to a change

$$t(s+h) - t(s) = h\frac{dt}{ds} + o(h) = \frac{1}{\lambda(t)}h + o(h) \approx \frac{h}{\lambda(t)}$$

on the t-scale. Thus

$$
\begin{aligned}
P(M(s+h) = i+1 \mid M(s) = i) &= P(N(t(s+h)) = i+1 \mid N(t(s)) = i) \\
 &= \lambda(t)\frac{h}{\lambda(t)} + o\!\left(\frac{h}{\lambda(t)}\right) = h + o(h)
\end{aligned}
$$

where λ(t) > 0. Similarly

P (M(s+ h) = i |M(s) = i) = 1− h+ o(h)

and other transitions have probabilities of smaller order than h. Thus M(s) is a homoge-

neous Poisson process with intensity 1.
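The time change also suggests a way to simulate an inhomogeneous process: generate a unit-rate process on the s-scale and map each point back through the inverse of Λ. The sketch below does this for a hypothetical linear intensity λ(t) = a + bt with a, b > 0, for which the inverse is available in closed form; the function names and parameter values are assumptions for illustration.

```python
import math
import random

def simulate_inhomogeneous(cum_intensity_inverse, big_lambda_t, rng):
    """Simulate event times on [0, t] via the time change s = Lambda(t).

    cum_intensity_inverse: function s -> t solving Lambda(t) = s
    big_lambda_t: the value Lambda(t) at the end of the observation window
    """
    times = []
    s = rng.expovariate(1.0)          # unit-rate process M on the s-scale
    while s <= big_lambda_t:
        times.append(cum_intensity_inverse(s))
        s += rng.expovariate(1.0)
    return times

# Hypothetical intensity lambda(t) = a + b*t, so Lambda(t) = a*t + b*t**2/2
# and the inverse is the positive root of the quadratic in t.
a, b, t_end = 1.0, 0.5, 10.0

def inverse_lambda(s):
    return (-a + math.sqrt(a * a + 2.0 * b * s)) / b

big_lambda_end = a * t_end + 0.5 * b * t_end ** 2
events = simulate_inhomogeneous(inverse_lambda, big_lambda_end, random.Random(1))
```

With these values Λ(10) = 35, so the number of simulated events should average about 35 over repeated runs, and the events should become denser towards the end of the window.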

We can therefore transfer properties of the homogeneous processes to the inhomogeneous

case. In particular:

Properties of the Inhomogeneous Poisson Process N(t):

Proposition 1. $N(t)$ has a Poisson distribution with mean $\Lambda(t) = \int_0^t \lambda(u)\,du$.

Reason: N(t) = M(s) ∼ Po(s) = Po(Λ(t)).


Proposition 2. The numbers of points in disjoint intervals I1, . . . , Ik are independent and

Poisson distributed with means $\int_{I_i}\lambda(u)\,du$, $i = 1, \ldots, k$.

Reason: independence follows from translation of the corresponding property of the basic

Poisson process, and the distributions follow as in Proposition 1 above.

Proposition 3. Given that the total number of points in [0, t] is N(t) = n, the positions of

the points are independently distributed with pdf λ(v)/Λ(t), 0 ≤ v ≤ t.

Reason: conditional independence and identical distribution of positions in the N process

follows from the same properties of those in the M process. Let V denote the position of

a point in the N process. Then the corresponding position for the M process is s(V ) and

in the M process positions are uniformly distributed over [0, s(t)]. Thus the distribution

function of V is

$$P(V \le v) = P(s(V) \le s(v)) = \frac{s(v)}{s(t)} = \frac{\Lambda(v)}{\Lambda(t)}$$

and so the pdf is

$$\frac{dP(V \le v)}{dv} = \frac{\lambda(v)}{\Lambda(t)}.$$

6.3.2 Fitting to data

Proposition 3 enables us to write down the likelihood for λ(·) based on observations over

[0, t]. Suppose that we have observed N(t) = n points and that their positions are v1, . . . , vn.

Then

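$$
\begin{aligned}
L &= P(N(t) = n) \times P(\text{positions of the points} \mid N(t) = n) \\
  &= \frac{e^{-\Lambda(t)}\Lambda(t)^n}{n!} \times \prod_{i=1}^{n} \frac{\lambda(v_i)}{\Lambda(t)} \\
  &= \frac{e^{-\Lambda(t)}\prod_{i=1}^{n} \lambda(v_i)}{n!},
\end{aligned}
$$

so that the log-likelihood is

$$l = -\Lambda(t) + \sum_{i=1}^{n} \log\lambda(v_i) + \text{const.} \qquad (61)$$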

Notes:

1. If λ(t) is a constant λ, then Λ(t) reduces to Λ(t) = λt and (61) becomes

l = −λt+ n log λ+ const

in agreement with (59).

2. In most applications of inhomogeneous Poisson models the function λ(·) is specified

in terms of a finite number of parameters. Their maximum likelihood estimates are

then the values maximizing l in (61).


3. The asymptotic properties of likelihood inference carry over to such inhomogeneous

Poisson processes under reasonable conditions. Approximate standard errors may

then be found from the information matrix and tests may be based on twice the

difference of log-likelihoods.

Example 39. Freezes of Lake Constance 1300–1974

Figure 16 shows the years between 1300AD and 1974AD when major freezes of Lake Constance occurred, a major freeze being defined as one in which the upper lake, which is 7–14km wide, could be crossed by vehicle or foot. (Data from Steinijans (1976), Applied Statistics.⁴)

⁴ There have been no major freezes of the lake since the data set here ended.

Figure 16: Years of Major Freezes of Lake Constance (x-axis: year, 1300 to 1900)

Figure 16 and the histogram and uniform QQ plot for the dates of freezes in Figures 17, 18 do not appear to support a Poisson process model with a constant intensity.

Figure 17: Histogram of Lake Constance Freeze Dates (x-axis: freeze date; y-axis: frequency)

Figure 18: QQ plot of Lake Constance Freeze Dates (x-axis: freeze dates; y-axis: uniform quantiles)

Consider therefore a non-homogeneous Poisson process model. To represent a changing rate in a simple form, we assume

$$\lambda(t) = \alpha + \beta t$$

where α and β are constants, and t, for convenience in this problem, measures years post 1300

in units of 100 years. The cumulative intensity function Λ(t) is

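$$\Lambda(t) = \int_0^t \lambda(u)\,du = \alpha t + \tfrac{1}{2}\beta t^2,$$

so that the log-likelihood is, from (61),

$$l = -\alpha t_o - \tfrac{1}{2}\beta t_o^2 + \sum_{i=1}^{n} \log(\alpha + \beta v_i), \qquad (62)$$

where $t_o$ denotes the length of the observation period, 6.75 centuries, $n$ denotes the number of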

events, and the vi denote the times of occurrence of the events (again, in centuries after 1300).

Note that this log likelihood only makes sense for values of α and β such that α+ βvi > 0 for

all freeze dates vi.

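To find the maximum likelihood estimators $\hat\alpha$ and $\hat\beta$, differentiate:

$$\frac{\partial l}{\partial \alpha} = -t_o + \sum_{i=1}^{n} \frac{1}{\alpha + \beta v_i}, \qquad \frac{\partial l}{\partial \beta} = -\tfrac{1}{2}t_o^2 + \sum_{i=1}^{n} \frac{v_i}{\alpha + \beta v_i}.$$

Hence we need to solve

$$-t_o + \sum_{i=1}^{n} \frac{1}{\hat\alpha + \hat\beta v_i} = 0 \qquad (63)$$

$$-\tfrac{1}{2}t_o^2 + \sum_{i=1}^{n} \frac{v_i}{\hat\alpha + \hat\beta v_i} = 0. \qquad (64)$$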


Multiplying (63) by $\hat\alpha$ and (64) by $\hat\beta$ and adding them together shows that $\hat\alpha t_o + \tfrac{1}{2}\hat\beta t_o^2 = n$, which allows elimination of one variable. However, there is still no explicit solution, so numerical maximization is required. It shows that

$$\hat\alpha = 7.015 \; (1.76), \qquad \hat\beta = -0.81 \; (0.38)$$

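where the values in brackets are estimated standard errors obtained by inversion of the observed information matrix $J$:

$$\begin{pmatrix} \mathrm{Var}(\hat\alpha) & \mathrm{Cov}(\hat\alpha, \hat\beta) \\ \mathrm{Cov}(\hat\alpha, \hat\beta) & \mathrm{Var}(\hat\beta) \end{pmatrix} = J^{-1},$$

where

$$J = -\begin{pmatrix} \dfrac{\partial^2 l}{\partial \alpha^2} & \dfrac{\partial^2 l}{\partial \alpha\,\partial \beta} \\[2ex] \dfrac{\partial^2 l}{\partial \beta\,\partial \alpha} & \dfrac{\partial^2 l}{\partial \beta^2} \end{pmatrix}.$$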

We could calculate J by differentiating (63) and (64) and substituting α and β. However, if the

numerical maximization of l is done in R, numerical differentiation is available as a by-product.

Code for the numerical maximization will be made available on the course website.
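As an indication of what such a maximization involves, here is a minimal sketch in Python (not the course's R code) that maximizes (62) by a damped Newton-Raphson iteration and reads off standard errors from the inverse of the observed information. The event times are synthetic, simulated by thinning from a hypothetical linear intensity; they are not the Lake Constance data, and all function names are assumptions of this sketch.

```python
import math
import random

def loglik(alpha, beta, v, t_o):
    """Log-likelihood (62) for lambda(t) = alpha + beta*t observed on [0, t_o]."""
    return (-alpha * t_o - 0.5 * beta * t_o ** 2
            + sum(math.log(alpha + beta * vi) for vi in v))

def score_and_information(alpha, beta, v, t_o):
    """Gradient of (62) and the entries of J (minus the second derivatives)."""
    r = [1.0 / (alpha + beta * vi) for vi in v]
    g_a = -t_o + sum(r)
    g_b = -0.5 * t_o ** 2 + sum(vi * ri for vi, ri in zip(v, r))
    j_aa = sum(ri ** 2 for ri in r)
    j_ab = sum(vi * ri ** 2 for vi, ri in zip(v, r))
    j_bb = sum((vi * ri) ** 2 for vi, ri in zip(v, r))
    return (g_a, g_b), (j_aa, j_ab, j_bb)

def fit_linear_intensity(v, t_o, max_iter=100):
    """Damped Newton-Raphson; returns (alpha_hat, beta_hat, se_alpha, se_beta)."""
    alpha, beta = len(v) / t_o, 0.0              # start from the constant-rate fit
    for _ in range(max_iter):
        (g_a, g_b), (j_aa, j_ab, j_bb) = score_and_information(alpha, beta, v, t_o)
        det = j_aa * j_bb - j_ab ** 2
        d_a = (j_bb * g_a - j_ab * g_b) / det    # Newton direction J^{-1} * gradient
        d_b = (j_aa * g_b - j_ab * g_a) / det
        current, step = loglik(alpha, beta, v, t_o), 1.0
        while step > 1e-12:
            new_a, new_b = alpha + step * d_a, beta + step * d_b
            # Accept a step only if every alpha + beta*v_i stays positive and
            # the log-likelihood does not decrease (up to rounding error).
            if (all(new_a + new_b * vi > 0 for vi in v)
                    and loglik(new_a, new_b, v, t_o) >= current - 1e-12):
                break
            step *= 0.5
        else:
            break                                # no acceptable step was found
        converged = abs(new_a - alpha) + abs(new_b - beta) < 1e-10
        alpha, beta = new_a, new_b
        if converged:
            break
    _, (j_aa, j_ab, j_bb) = score_and_information(alpha, beta, v, t_o)
    det = j_aa * j_bb - j_ab ** 2
    # The diagonal of J^{-1} gives the estimated variances.
    return alpha, beta, math.sqrt(j_bb / det), math.sqrt(j_aa / det)

def simulate_linear(alpha, beta, t_o, rng):
    """Synthetic event times: thin a homogeneous process of rate lam_max."""
    lam_max = max(alpha, alpha + beta * t_o)
    times, t = [], rng.expovariate(lam_max)
    while t <= t_o:
        if rng.random() < (alpha + beta * t) / lam_max:
            times.append(t)
        t += rng.expovariate(lam_max)
    return times

# Synthetic data only (NOT the Lake Constance record); the true values
# alpha = 6.0 and beta = -0.5 are hypothetical.
t_o = 6.75
v = simulate_linear(6.0, -0.5, t_o, random.Random(2024))
alpha_hat, beta_hat, se_alpha, se_beta = fit_linear_intensity(v, t_o)
```

A useful check on any such fit is the identity derived from (63) and (64): at the maximum, $\hat\alpha t_o + \tfrac{1}{2}\hat\beta t_o^2$ should equal the number of events.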

The estimated rate $\hat\lambda(t) = 7.0 - 0.81t$ is decreasing, in agreement with the indications from

Figures 16 and 17. To assess the objective strength of evidence for a decreasing rate we can

carry out a test of the hypothesis H0 : β = 0, against the alternative H1 : that there is no

restriction on the value of β. The generalized likelihood ratio test gives a general method

for constructing such tests; see section 3.5.2.

Generalized Likelihood Ratio Test

The test statistic is

$$W = -2\{l(\tilde\alpha, \tilde\beta) - l(\hat\alpha, \hat\beta)\}, \qquad (65)$$

where $l(\hat\alpha, \hat\beta)$ is the maximum log-likelihood under $H_1$, and $l(\tilde\alpha, \tilde\beta)$ the maximum under the restriction imposed by $H_0$. Under $H_0$, the distribution of $W$ is approximately $\chi^2$ with

degrees of freedom = no. parameters under $H_1$ − no. parameters under $H_0$,

and under H1 it tends to be larger. Thus large values of W discredit H0 and the p-value

corresponding to an observed value of W may be found from the $\chi^2$ distribution.

Example 40. Lake Constance revisited

For the Lake Constance data the maximum likelihood estimators $\hat\alpha$ and $\hat\beta$ above give the values of α and β that maximize l under H0's alternative H1, so we only need to find the maximizing values under H0, the restricted maximum likelihood estimators $\tilde\alpha$ and $\tilde\beta$.

When H0 is true, λ(t) = α so the Poisson process is time-homogeneous. From section 6.2 therefore the maximum likelihood estimator of α is

$$\tilde\alpha = \frac{n}{t_o} = \frac{29}{6.75} = 4.3 \text{ events/century}$$

(and necessarily $\tilde\beta = 0$).

Substitution into l (62) and use of (65) now gives

$$w = -2(13.275 - 15.249) = 3.948.$$

By comparison with $\chi^2_1$ the p-value is slightly less than 0.05. Thus there is some evidence of

a change in the rate of occurrence of freezing events – which from the sign of β must be a

reduction – but the strength of the evidence is not overwhelming.
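The arithmetic of the test can be checked with a few lines of Python. The sketch below uses the two maximized log-likelihood values quoted above; the function name `glrt_one_df` is an assumption, and the p-value uses the fact that a $\chi^2_1$ variable is the square of a standard normal, so that $P(\chi^2_1 > w) = \mathrm{erfc}(\sqrt{w/2})$.

```python
import math

def glrt_one_df(loglik_h0, loglik_h1):
    """Test statistic (65) and its chi-squared p-value on 1 degree of freedom.

    For one degree of freedom, P(chi2_1 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w/2)).
    """
    w = -2.0 * (loglik_h0 - loglik_h1)
    if w < 0:
        raise ValueError("the restricted maximum cannot exceed the unrestricted one")
    return w, math.erfc(math.sqrt(w / 2.0))

# Maximised log-likelihoods quoted in Example 40 (under H0 and under H1).
w, p_value = glrt_one_df(13.275, 15.249)
print(f"w = {w:.3f}, p-value = {p_value:.4f}")
```

For tests with more than one degree of freedom the error-function shortcut no longer applies, and a general $\chi^2$ distribution function would be needed instead.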

Notes:

1. Other forms for λ(t) may be preferable. For example λ(t) = exp(α + βt) specifies

a rate that could be increasing or decreasing (according to the sign of β) but can

never give a negative value, unlike the linear λ(t) used in the example. If there is a

possibility of periodicity in the rate of occurrence of points (as for floods, for example,

which may be more likely in the winter), then a λ(t) incorporating sines and cosines

might be useful.

2. To check adequacy of a non-homogeneous model we could transform the time-scale

from t to s by the s = Λ(t) transformation. On the new scale the non-homogeneous

process becomes homogeneous, and therefore all the checks for that case (section 6.2)

become available.

6.4 Spatial Poisson processes

6.4.1 Definition

In section 6.3.1 we saw that, given an intensity function λ(·) ≥ 0, a general Poisson process

on the line has the following properties:

• for any interval I, the number of points N(I) of the process in I has a Poisson

distribution with mean $\int_I \lambda(u)\,du$

• for disjoint intervals I1, I2, . . . , Ik, the random variables N(I1), N(I2), . . . , N(Ik) are

independent.

• N is a counting process: the number of points it counts in an interval is the sum of

the numbers in subintervals.

A Poisson process in the plane or in space (or indeed in d dimensions for any positive

integer d) is defined as a counting process with these same properties, where the properties

are merely re-phrased to make sense in higher dimensions.

Definition 16. Suppose that $\lambda(\cdot)$ is a real-valued non-negative function on $\mathbb{R}^2$ and that for each set $B$ in the plane, $N(B)$ is a random variable taking non-negative integer values (interpreted as the number of points of the process in $B$). If

• $N(B)$ has a Poisson distribution with mean $\int_B \lambda(u)\,du$;

• when $B_1, B_2, \ldots, B_k$ are disjoint, the random variables $N(B_1), N(B_2), \ldots, N(B_k)$ are independent;

• $N$ has the additive property $N(\cup_{i=1}^{k} B_i) = \sum_{i=1}^{k} N(B_i)$ for disjoint $B_i$;

then N is called a spatial Poisson process (on the plane), or a planar Poisson process, with

intensity λ(·).

Definition 17. A homogeneous spatial Poisson process is the special case of Definition 16

when λ(·) is constant.

A Poisson process in a three- or higher-dimensional Euclidean space $S = \mathbb{R}^d$, $d = 3, 4, \ldots$, is

defined in the same way; λ is a function of the spatial coordinates, and the B are sets in S.

Aside: Actually there is a bit more work to do to ensure that these definitions work: it’s

necessary to show that the objects described exist in the mathematical sense. For that it’s

necessary to be a little more precise about the sets they are defined on, but the upshot is

that everything works out properly and the model is available for use. We take that as

read.

Notation: To save writing let us define, for sets $B$ in the space of the points ($\mathbb{R}^d$, $d = 2, 3, \ldots$),

$$\Lambda(B) = \int_B \lambda(u)\,du.$$

6.4.2 Some properties of spatial Poisson processes useful in modelling

A Poisson process may be defined on an arbitrary subset of the plane, for example the

region of East Yorkshire in Example 36, by simply restricting the definition above to that

subset.

Proposition 4. Let R denote the distance from the origin to the nearest point in a homo-

geneous planar Poisson process with intensity λ. Then the probability density function of R

is

$$h_R(r) = 2\lambda\pi r e^{-\lambda\pi r^2}, \quad r > 0, \qquad (66)$$

(the density of a Rayleigh distribution).

Reason: The number of points in a circle of radius $r$ centred at the origin has a Poisson distribution with mean $\lambda\pi r^2$. If there are no points in this circle then $R > r$, and conversely. Thus $P(R > r) = \exp(-\lambda\pi r^2)$ and (66) follows by differentiation.

The distribution of R is called the first contact distribution of the Poisson process. It is in

fact the distribution of the distance from any fixed point in the plane to the nearest point

in the process, as can be seen by simply re-defining the origin to be at the fixed point. The

distribution of distance from an arbitrary point of the process itself to its nearest neighbour

may be found too, and for the homogeneous Poisson process this distribution turns out

to be exactly the same as the first contact distribution. The two distributions will not

necessarily be equal generally, so a test for a Poisson process could be based on seeing

whether estimates of the distributions based on observed distances are similar or not.
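The first contact distribution can be checked by simulation. The sketch below simulates a homogeneous planar Poisson process on a square window centred at the origin and records the distance to the nearest point; the window size, intensity, seed and function names are assumptions for illustration. The window is taken large enough that the nearest point almost surely lies inside it, and the sample mean is compared with the mean of density (66), which is $1/(2\sqrt{\lambda})$.

```python
import math
import random

def poisson_count(mean, rng):
    """Poisson(mean) draw: the number of unit-rate arrivals in [0, mean]."""
    count, s = 0, rng.expovariate(1.0)
    while s <= mean:
        count += 1
        s += rng.expovariate(1.0)
    return count

def nearest_distance_to_origin(lam, half_width, rng):
    """Distance from the origin to the nearest point of a homogeneous planar
    Poisson process simulated on the square [-half_width, half_width]^2.
    Returns math.inf in the (here negligible) event that the square is empty."""
    n = poisson_count(lam * (2.0 * half_width) ** 2, rng)
    best = math.inf
    for _ in range(n):
        x = rng.uniform(-half_width, half_width)
        y = rng.uniform(-half_width, half_width)
        best = min(best, math.hypot(x, y))
    return best

rng = random.Random(0)
lam = 1.0
sample = [nearest_distance_to_origin(lam, 3.0, rng) for _ in range(2000)]
sample_mean = sum(sample) / len(sample)
theoretical_mean = 1.0 / (2.0 * math.sqrt(lam))   # mean of the density (66)
```

The same simulated distances could be used to compare an empirical survival function with $\exp(-\lambda\pi r^2)$ at chosen values of r.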


Proposition 5. Thinning of a Poisson process refers to the random deletion of some of the

points. A simple form of thinning is to remove or retain each point independently with fixed

probabilities 1 − p and p, say. If the original process has intensity function λ(·), then the

point process resulting from such independent thinning is a Poisson process with intensity

pλ(·).

Reason: Let N denote the original process and N∗ the thinned process. Independence and

additivity of N∗ follow immediately from independence and additivity of N and the fact

that thinning is carried out independently. Thus the only thing to show is that, for each

set B, N∗(B) has a Poisson distribution with mean pΛ(B). For each r ≥ 0,

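$$
\begin{aligned}
P(N^*(B) = r) &= \sum_{k=r}^{\infty} P(N^*(B) = r \mid N(B) = k)\,P(N(B) = k) \\
 &= \sum_{k=r}^{\infty} \binom{k}{r} p^r (1-p)^{k-r}\, \frac{e^{-\Lambda(B)}\Lambda(B)^k}{k!}
\end{aligned}
$$

since, given k points, the number of points retained has a Binomial Bi(k, p) distribution

$$
\begin{aligned}
 &= \frac{e^{-\Lambda(B)}(p\Lambda(B))^r}{r!} \sum_{k=r}^{\infty} \frac{\{(1-p)\Lambda(B)\}^{k-r}}{(k-r)!} \\
 &= \frac{e^{-\Lambda(B)}(p\Lambda(B))^r}{r!}\, e^{(1-p)\Lambda(B)} \\
 &= \frac{e^{-p\Lambda(B)}(p\Lambda(B))^r}{r!}.
\end{aligned}
$$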

Proposition 5 also holds in one dimension.
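The thinning result is easy to illustrate numerically in one dimension. The following sketch simulates a homogeneous process, retains each point independently with probability p, and compares the mean and variance of the retained counts with pλt, as a Poisson distribution would require. The rate, window, retention probability and seed are hypothetical choices.

```python
import random

def homogeneous_points(lam, t_obs, rng):
    """Event times of a rate-lam Poisson process on [0, t_obs]."""
    times, t = [], rng.expovariate(lam)
    while t <= t_obs:
        times.append(t)
        t += rng.expovariate(lam)
    return times

def thin(points, p, rng):
    """Retain each point independently with probability p."""
    return [x for x in points if rng.random() < p]

rng = random.Random(42)
lam, t_obs, p = 5.0, 10.0, 0.3      # hypothetical values; p*lam*t_obs = 15
counts = [len(thin(homogeneous_points(lam, t_obs, rng), p, rng))
          for _ in range(2000)]
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)
```

Agreement of the sample mean and variance is only a partial check, since equal mean and variance is necessary but not sufficient for a Poisson distribution.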

Proposition 6. Conditional property (cf section 2.1.3 and section 6.3.1). Given the total

number of points of a spatial Poisson process in a region B, the positions V of the points

are independently distributed over B with probability density function

$$f_V(v) = \frac{\lambda(v)}{\Lambda(B)}, \quad v \in B. \qquad (67)$$

6.4.3 Comments and applications

1. If we know the intensity function, Proposition 6 gives a way to simulate a Poisson

process over any region. As in section 6.3.1, we first generate a number n from the

Poisson distribution Po(Λ(B)), then simulate n independent values from the density

(67).

With this ability we could implement a simulation test for a Poisson process using

the comparison of distributions method suggested by Proposition 4. The approach is

feasible even for processes on sets B with irregular shapes.

2. Spatial Poisson process models give a way to formulate problems of detecting spatial

effects in diseases, such as that in Example 36. Suppose that cases occur according


to a Poisson model with intensity λcase(·), and controls according to a Poisson model

with intensity λcontrol(·). If there were no spatial effects associated with the disease,

then the ratio

$$r(x) = \frac{\lambda_{\text{case}}(x)}{\lambda_{\text{control}}(x)}$$

would be constant with respect to spatial location x. Thus a way of detecting spatial

factors would be to estimate the two intensities from observations and map the result-

ing estimated r(x). To allow for inevitable sampling fluctuations, simulation-based

testing (based for example on randomization of the labels of cases and controls) could

be used to show where the evidence of a difference is strong (Kelsall & Diggle

(1995)).

3. Proposition 6 also gives the likelihood for λ based on an observed pattern of points.

If n points are observed in a region B and their positions are vi, i = 1, . . . , n, then the

likelihood function is

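$$
\begin{aligned}
L &= \frac{e^{-\Lambda(B)}\Lambda(B)^n}{n!} \times \prod_{i=1}^{n} \frac{\lambda(v_i)}{\Lambda(B)} \qquad (68) \\
  &= \frac{1}{n!}\, e^{-\Lambda(B)} \prod_{i=1}^{n} \lambda(v_i), \qquad (69)
\end{aligned}
$$

and the log-likelihood

$$l = -\Lambda(B) + \sum_{i=1}^{n} \log\lambda(v_i) + \text{constant}.$$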

The λ function is often specified in terms of a small number of parameters. In that

case fitting of the model by maximization of l and subsequent inference go ahead along

the same lines as before.
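The simulation recipe in comment 1 can be sketched as follows for a rectangular region. The positions are drawn from density (67) by rejection sampling, which is a choice made here rather than something prescribed in the notes; it assumes that an upper bound `lam_max` on the intensity over B is known and that Λ(B) is supplied by the caller. The intensity 100x on the unit square is hypothetical.

```python
import random

def poisson_count(mean, rng):
    """Poisson(mean) draw: the number of unit-rate arrivals in [0, mean]."""
    count, s = 0, rng.expovariate(1.0)
    while s <= mean:
        count += 1
        s += rng.expovariate(1.0)
    return count

def simulate_spatial(intensity, lam_max, big_lambda, width, height, rng):
    """Simulate a Poisson process on B = [0, width] x [0, height].

    Step 1: draw N ~ Po(Lambda(B)).
    Step 2: draw N independent positions with density intensity(v)/Lambda(B),
            here by rejection from the uniform distribution on B.
    """
    n = poisson_count(big_lambda, rng)
    points = []
    while len(points) < n:
        x, y = rng.uniform(0.0, width), rng.uniform(0.0, height)
        # Accept the candidate with probability intensity(x, y) / lam_max.
        if rng.random() * lam_max <= intensity(x, y):
            points.append((x, y))
    return points

# Hypothetical intensity increasing in x on the unit square:
# lambda(x, y) = 100*x, so Lambda(B) = 50 and the maximum over B is 100.
def intensity(x, y):
    return 100.0 * x

pts = simulate_spatial(intensity, 100.0, 50.0, 1.0, 1.0, random.Random(7))
```

For an irregularly shaped region the same idea applies with a bounding rectangle, rejecting candidates that fall outside the region, provided Λ(B) is computed for the region itself.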

6.5 Marked Poisson processes

Several examples in section 6.1 can be thought of as a sequence of time points at each of

which another variable is observed. A marked Poisson process is a simple model for this.

Given a Poisson process N – on the line, plane or in higher dimensions – with intensity

λ(·), associate with each point Xi of the process a random variable Yi, called the mark at

Xi. Then the new process {N,Y1, . . . } is called a marked Poisson process.

For some modelling problems it might be appropriate to take the Yi to be independent

and identically distributed. In others there may be interest in possible dependence between

marks at different points, and in possible changes in the distribution of marks with position

of the point.

6.5.1 Examples

Example 41. Insurance risk


The arrival of claims at an insurance company and the sizes of the claims might be modelled

as a marked Poisson process. An initial assumption, to be checked, might be that marks (claim

amounts) are independent and identically distributed. The difference between premium income

and claim payouts is the key to financial viability of the company. If the ith claim is made at

time Xi and is of size Yi, and premiums bring income at a steady rate ρ net of running costs,

then the assets A(t) of the insurance company at time t are

$$A(t) = A(0) + \rho t - \sum_{i=1}^{N(t)} Y_i,$$

where $N(t)$ is the number of claims up to time $t$. The probability that $A(t)$ remains positive for a long time, and the question of how large the reserves $A(0)$ need to be to make this probability large, are of great interest.

Sums of the form

$$\sum_{i=1}^{N(t)} Y_i,$$

for a Poisson process N with marks Yi arise in many contexts. They are called compound

Poisson processes.
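The asset process of Example 41 can be explored by simulation. The sketch below assumes, purely for illustration, a homogeneous claim process and independent exponential claim sizes; the function name, parameter values and seed are hypothetical. Because assets only fall at claim times, it is enough to check for ruin at those instants.

```python
import random

def simulate_assets(a0, premium_rate, claim_rate, mean_claim, horizon, rng):
    """Simulate A(t) = A(0) + rho*t - (sum of claims up to t) on [0, horizon].

    Assumes a homogeneous Poisson claim process and i.i.d. exponential
    claim sizes. Returns (assets at the horizon, whether A(t) ever went
    negative), where ruin is checked at each claim time.
    """
    t, total_claims, ruined = rng.expovariate(claim_rate), 0.0, False
    while t <= horizon:
        total_claims += rng.expovariate(1.0 / mean_claim)
        if a0 + premium_rate * t - total_claims < 0:
            ruined = True
        t += rng.expovariate(claim_rate)
    return a0 + premium_rate * horizon - total_claims, ruined

# Hypothetical values: reserves 10, premium rate 1.2, one claim per unit
# time on average, mean claim size 1, horizon 50.
rng = random.Random(3)
runs = [simulate_assets(10.0, 1.2, 1.0, 1.0, 50.0, rng) for _ in range(500)]
ruin_fraction = sum(1 for _, ruined in runs if ruined) / len(runs)
```

Repeating the calculation for a range of values of A(0) gives a rough picture of how the initial reserves affect the chance of ruin over the chosen horizon.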

Example 42. Earthquakes

A sequence of earthquakes could be modelled by attaching marks representing earthquake mag-

nitude to the times of occurrence. Questions about dependence between magnitudes close in

time are highly relevant to predictability and the possibility of warning systems. The same ques-

tion arises too about the times themselves and motivates further development of the Poisson

models we have considered in this course.

Example 43. Floods

Floods such as those in Example 33 could be modelled by a marked Poisson process, the mark

for a flood occurrence being the magnitude of the flood. Marks could include more information

too, becoming multi-dimensional. If further data were available, for example about weather

conditions at the times of floods, or environmental conditions such as dryness/wetness of the

ground in the period before the flood, then it too could be modelled as part of a multi-dimensional

mark.

Example 44. Rainfall

A widely-used point process model for rainfall attempts to mimic the occurrence and heaviness of

rain at a place in terms of the passage of rain cells (clouds in which water vapour is condensing)

over the place. The arrivals of rain cells are modelled by a Poisson process and the time a rain

cell takes to pass over the place and the intensity of the rain it brings are attached as random

marks. Marks in this case are two-dimensional.

6.5.2 Likelihood

For a marked Poisson process in which the marks are conditionally independent given

the point process, the likelihood may be written down immediately from (69). Conditionally


independent given the point process means that the marks {Y1, Y2, . . . } are independent given

the positions of the points of the point process N . Nevertheless the mark Yi attached to

the point at position Xi is allowed to have a distribution which depends on Xi: we denote

its conditional probability density function by k(yi |xi). Then the likelihood based on

observation of points in a set B is

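$$
\begin{aligned}
L &= \frac{e^{-\Lambda(B)}\Lambda(B)^n}{n!} \times \prod_{i=1}^{n} \frac{\lambda(x_i)}{\Lambda(B)}\, k(y_i \mid x_i) \\
  &\propto e^{-\Lambda(B)} \prod_{i=1}^{n} \lambda(x_i)\, k(y_i \mid x_i)
\end{aligned}
$$

and the log-likelihood

$$l = -\Lambda(B) + \sum_{i=1}^{n} \{\log\lambda(x_i) + \log k(y_i \mid x_i)\}. \qquad (70)$$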

This is a basis for model fitting and refinement.
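As a small illustration of (70), the sketch below takes the simplest case: a constant intensity on [0, t] and exponential marks whose rate does not depend on position. Under these assumptions (made here for illustration) the two parts of (70) separate and both maximizers have closed forms. The data and function names are hypothetical.

```python
import math

def marked_loglik(rate, mark_rate, times, marks, t_obs):
    """Log-likelihood (70), up to a constant, for a constant intensity 'rate'
    on [0, t_obs] and i.i.d. exponential marks with density
    k(y) = mark_rate * exp(-mark_rate * y)."""
    n = len(times)
    return (-rate * t_obs + n * math.log(rate)
            + n * math.log(mark_rate) - mark_rate * sum(marks))

def fit_marked(times, marks, t_obs):
    """Closed-form maximisers: the point part and the mark part separate."""
    n = len(times)
    return n / t_obs, n / sum(marks)

# Hypothetical data, for illustration only.
times = [0.7, 2.1, 2.9, 5.4, 8.8]
marks = [1.2, 0.3, 2.5, 0.9, 0.6]
rate_hat, mark_rate_hat = fit_marked(times, marks, 10.0)
```

When the intensity or the mark distribution depends on position, the two parts can still be maximized separately if they share no parameters, but each then usually needs numerical maximization.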

Example 45. Burbage floods

An initial model for the times and severities of the Burbage Brook flood events is based on

a marked point process. Dates of floods are assumed to come from a Poisson process, and

the excess flood flows over 4 cumecs are modelled as conditionally independent marks with

exponential distributions whose means 1/µ(t) may depend on time.

6.5.3 Special case

Another way to view a marked Poisson process is as a point process in a higher dimensional

space. For example, if the Poisson process is one-dimensional with points at Xi, i = 1, . . .

and the marks Yi are also one-dimensional, then the points (Xi, Yi) form a point process

in two dimensions. It is a remarkable fact that when the marks Yi are independent and

identically distributed this two-dimensional process is itself a Poisson process. If the original

Poisson process has intensity λ(x) and the mark probability density is k(y) then the intensity

of the two-dimensional Poisson process (Xi, Yi) is µ(x, y) = λ(x) k(y).

To see this, consider a rectangle $(x_1, x_2) \times (y, y + \delta y)$. We know that the number of points of our original process $N$ in $(x_1, x_2)$ has a Poisson distribution with mean $\int_{x_1}^{x_2} \lambda(u)\,du$. For each of the points $X_i \in (x_1, x_2)$, the corresponding $Y_i$ falls in the interval $(y, y + \delta y)$ with probability $p = \int_y^{y+\delta y} k(s)\,ds$ independently of the others. Hence the thinning result of Proposition 5 in Section 6.4.2 shows that the number of pairs $(X_i, Y_i)$ in our rectangle has a Poisson distribution with mean

$$p\int_{x_1}^{x_2} \lambda(u)\,du = \int_{x_1}^{x_2} \lambda(u)\,du \int_y^{y+\delta y} k(s)\,ds = \int_{x_1}^{x_2}\int_y^{y+\delta y} \lambda(u)k(s)\,ds\,du.$$

This suggests that we have a 2-dimensional Poisson process with intensity µ(x, y) = λ(x)k(y),

but the definition requires that we show that for general sets B in the plane, N(B) has a

Poisson distribution, not just for rectangles. However a general (Borel) subset $B$ of $\mathbb{R}^2$


can be decomposed into small rectangles. The Poisson distribution holds for each of those

by the above, and the additivity of N then guarantees that it holds for B. Independence

follows from the independence properties of the one-dimensional Poisson process and the

marks $Y_i$, and additivity is immediate from the set-up.

The result generalizes to higher dimensions (both of N and the Yi) by essentially the same

argument.

Example 46. Moving objects

Suppose, at a particular time, objects (trees in a forest, stars in space, molecules in a container,

people in a country, . . . ) are located in specific places described mathematically as a collection

of points in the appropriate space. Subsequently the objects move,⁵ their new positions forming

a different collection of points. In many instances, although individual points move in ways very

specific to themselves, the overall pattern of points does not change. Why is this? Is there a

mathematical explanation?

For simplicity, consider objects on the line (−∞,∞). Suppose that they are initially at the

points of a homogeneous Poisson process of intensity λ. Now move each object by a random

amount. If Xi is the initial position of the ith object, then after the move it will be at a position

Xi + Yi, say, where Yi is its random displacement. We assume that the displacements Yi are

independent and have a common distribution function K(y) = P (Y ≤ y) with density k(y).

The property in 6.5.3 shows that $(X_i, Y_i)$ is a spatial Poisson process with intensity $\mu(x, y) = \lambda k(y)$. We calculate the distribution of the number of objects in an interval $(t_1, t_2)$ after the move. The number in this interval corresponds to the number of points $(X_i, Y_i)$ in the diagonal band $t_1 < x + y < t_2$ in the x-y-plane. It therefore has a Poisson distribution with mean

$$
\begin{aligned}
\int_{y=-\infty}^{\infty}\int_{x=t_1-y}^{t_2-y} \lambda k(y)\,dx\,dy &= \lambda\int_{y=-\infty}^{\infty} k(y)(t_2 - t_1)\,dy \\
 &= \lambda(t_2 - t_1).
\end{aligned}
$$

Similarly the numbers of objects in disjoint intervals after the move have independent Poisson distributions. The additivity property is clearly satisfied, and so we conclude

that the points after the displacement continue to be a homogeneous Poisson process.

Example 47. The M/G/∞ queue

Suppose that customers, starting at time 0, arrive in a queue as a Poisson process of rate λ, and

are served immediately by one of the infinite number of servers. Each customer has a service

time with probability density function k(y) and distribution function K(y), independently of

other customers.

We denote the arrival time of the ith customer by Xi, and their service time by Yi. Then

the property in 6.5.3 shows that (Xi, Yi) is a spatial Poisson process with intensity function

µ(x, y) = λk(y).

If we are interested in the number of customers in the queue at time t, then we note that

customer i is in the queue at time t if and only if Xi ≤ t (because otherwise the customer has

not yet arrived) and $X_i + Y_i \ge t$ (because otherwise the customer has already left). So we are interested in the number of points of the spatial Poisson process in the set of $(x, y) \in \mathbb{R}^2$ satisfying $x \le t$ and $x + y \ge t$. This will be Poisson distributed with mean

$$
\begin{aligned}
\int_{x=0}^{t}\int_{y=t-x}^{\infty} \lambda k(y)\,dy\,dx &= \lambda\int_{x=0}^{t} \{1 - K(t - x)\}\,dx \\
 &= \lambda\int_0^t \{1 - K(w)\}\,dw \\
 &\to \lambda E(Y), \quad \text{as } t \to \infty
\end{aligned}
$$

(since, as is true generally for positive random variables, $E(Y) = \int_0^\infty \{1 - K(w)\}\,dw$).

⁵ In the case of trees, maybe not individually, but their offspring are in different places.

Thus after the system has attained equilibrium, the number of customers it contains has a

Poisson distribution with mean the product of the arrival rate and the mean service time. The

discussion in Example 46 above shows also that, in equilibrium, the times of departures of

customers from the system form a Poisson process with rate equal to the arrival rate.
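The mean number in the system at time t can be evaluated numerically for any service-time distribution function K. The sketch below uses a midpoint rule and checks it against the closed form for exponential service times; the function name, the choice of integration rule and the parameter values are assumptions for illustration.

```python
import math

def mean_in_system(arrival_rate, service_cdf, t, steps=10000):
    """Approximate lambda * integral_0^t {1 - K(w)} dw by the midpoint rule."""
    h = t / steps
    integral = sum(1.0 - service_cdf((j + 0.5) * h) for j in range(steps)) * h
    return arrival_rate * integral

# Hypothetical example: exponential service times with rate mu, so that
# K(w) = 1 - exp(-mu*w) and the integral equals (1 - exp(-mu*t)) / mu.
lam, mu = 3.0, 0.5

def exp_cdf(w):
    return 1.0 - math.exp(-mu * w)

approx = mean_in_system(lam, exp_cdf, 4.0)
exact = lam * (1.0 - math.exp(-mu * 4.0)) / mu
long_run = mean_in_system(lam, exp_cdf, 60.0)   # should be close to lam/mu = 6
```

For large t the value approaches λE(Y), here 3 × 2 = 6, in line with the equilibrium result above.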

6.6 Beyond . . .

Some areas of application and development of point process models are:

• Doubly stochastic Poisson processes (Cox processes). These are Poisson processes

in which the intensity function λ(·) is itself a random process. Such a model may

be appropriate, for example, in connection with insurance protection against natural

disasters, since the processes leading to the occurrence of disasters are not themselves

fully predictable. Similarly for many other natural phenomena.

• Clustered processes. Each point in a Poisson process may generate others nearby,

creating a cluster. Appropriate for the modelling of certain animal and plant popu-

lations. Used too in the rainfall modelling described above; clusters of rain cells are

found to give a more realistic description of rain systems than single cells.

• Processes with inhibition. Sometimes points do not occur too close to each other.

Trees in a forest, for example, tend to grow no closer to their neighbours than their

canopies will permit. Models which represent such inhibition behaviour are available.

• Line processes. A line in the plane is specified by its slope and its intercept with one of

the axes; that is, it is specified by a pair of numbers. If we regard that pair of numbers

as a point in R2 and generate such points by a spatial Poisson process, then we have a

model for random lines. Such models and developments of them are used for example

in geomorphology, traffic studies and the modelling of the structure of paper. Similar

models for triangles have been used to study the alignment of prehistoric structures.

• Shapes. Extensions of point process ideas to theories of general random shapes are

widely used in microscopy, image processing, face recognition and in modelling the

structures of materials.
