Maritime Abnormality Detection using Gaussian Processessjrob/Pubs/KAIS_Marine-Abnormality.pdfMaritime Abnormality Detection using Gaussian Processes 3 Assessing the performance of

Knowledge and Information Systems

Maritime Abnormality Detection usingGaussian Processes

Mark Smith1, Steven Reece2, Stephen Roberts2, Ioannis Psorakis2, Iead Rezek3

1ISSG, Babcock Marine & Technology Division, Devonport Royal Dockyard, United Kingdom2Department of Engineering Science, University of Oxford, Oxford, United Kingdom3Schlumberger Research, Cambridge, United Kingdom

Abstract. Novelty, or abnormality, detection aims to identify patterns within datastreams that do not conform to expected behaviour. This paper introduces noveltydetection techniques using a combination of Gaussian processes, extreme value theoryand divergence measurement to identify anomalous behaviour in both streaming andbatch data. The approach is tested on both synthetic and real data, showing itself tobe effective in our primary application of maritime vessel track analysis.

1. Introduction

The global picture of maritime traffic is large and complex, consisting of densevolumes of (mostly legal) ship traffic. Techniques that identify illegal traffic couldhelp to reduce the impact from smuggling, terrorism, illegal fishing etc. In thepast, surveillance of such traffic has suffered due to a lack of data. However sincethe advent of electronic tracking the amount of available data has grown beyondan analyst’s ability to process without some form of automation. One part ofthe analyst’s workload lies in the detection of anomalous behaviour in otherwisenormal appearing tracks. Our goal is to detect anomalous vessels using an auto-mated approach. In this paper we exploit techniques from the field of anomalydetection, particularly extreme value theory to identify potential deviations fromnormal behaviour. The latter is modelled using a non-parametric Bayesian ap-proach, namely sequential Gaussian process regression. This is also extendedby applying techniques from the field of divergence measurement namely theHellinger distance to construct an adjacency matrix from which clusters or com-munities of vessel types can be formed, and anomalous tracks identified.

Received Feb 01, 2013Revised May 04, 2013Accepted Jul 17, 2013

2 M. Smith et al

An anomaly has many different interpretations depending on the context inwhich it is used, i.e. it may refer to a data point arising from: a different distri-bution, measurement error, population variability or execution error. However,fundamentally an anomaly is a data point which stands out in contrast to theother data points around it, (Grubbs 1969). It is the task of anomaly detectionto infer whether this data point deviates significantly given the intrinsic variabil-ity of the population of normal data. An effective technique should therefore becapable of recognising and modelling data points that occur due to such anoma-lous events and distinguish these from outliers associated with the tails of thereference distribution of non-anomalous data.

This paper applies extreme value statistics to identify likely anomalous sam-ples which are extreme values. The probability distribution governing these ex-treme values is sequentially updated to enable context sensitive decisions. Thisis achieved by means of linking this distribution to a sequential Gaussian pro-cess model which regresses the vessel’s track and forecasts a distribution overfuture data. Whilst application in this manner can identify individual anoma-lous points it is unable to determine if the entire track can be considered ananomaly. We address this issue by undertaking unsupervised clustering of vesseltracks to identify clusters of a given vessel class and detecting abnormalities fromthe track cluster. This is achieved through bivariate Gaussian process regressionmodelling of vessel track latitudinal and longitudinal data, and applying theinverse Hellinger distance in order to discover an adjacency matrix. An unsuper-vised approach to clustering based on community detection using non-negativematrix factorisation is then used to identify communities of similar tracks withinthe adjacency matrix. Anomalies can then be identified by noting a discrepancybetween the class of vessel and vessels assignment to a labelled community ofvessels.

This paper begins by providing a brief overview of existing approaches tomaritime situational awareness, followed by a description of our methods foridentifying anomalous tracks. The methods section is broken down into two sub-sections. Firstly, the identification of anomalous points is considered, secondlywe consider the detection of anomalous tracks. In each instance the techniquesare applied to both synthetic and real vessel tracks extracted from their GPSco-ordinates. This is to provide both insights into the workings of the approach,and provide an empirical validation of the technique.

2. Current Techniques

The field of marine anomaly detection has employed a variety of methods includ-ing Neural Networks in (Rhodes et al 2005), Bayesian Networks in (Mascaro etal 2011), Support Vector Machines in (Li et al 2006), GPs in (Will et al 2001)and Kalman filters in (Laws et al 2011). Common between all methods are twomain tasks; creating a model of normality (free from the presence of anomalies),and using a metric from this model to (allowing for some quantifiable variability)identify anomalous points. These tasks are inherent within the two main uses forthe detection of anomalies: accommodation and discordancy. Accommodation isthe task in which the goal is to create a model of normality that does not includeanomalous observations. Discordancy tests provide a metric indicator of a pointbeing an anomaly. Both have different aims but within each is some model ofnormality and a measure of deviation.

Maritime Abnormality Detection using Gaussian Processes 3

Assessing the performance of these different methods is a difficult task asthere exists no established benchmarks of what are considered to be marineanomalies, therefore hindering comparison, (Laxhammar 2008). This implies thata data set can be considered under a variety of contexts, leading to different typesof anomalies being identified on the same input data. For example, a sequentialtime series analysis of a vessel track, where the previous location and the dy-namics of the vessel are considered, could highlight sudden changes in vesseldynamics; possibly indicative of evasive manoeuvring. Such a sequential timeseries model has the advantage that it can be used in online analysis, but itmay miss patterns when the entire track is considered as a whole, (Mascaro etal 2011). Other indications of anomalous behaviour within the data could bedeviations from a standard route, unexpected port arrival, close approach andzone entry, (Lane et al 2010). Even when a particular marine anomaly has beenselected for identification, the data needs to be considered in the context of ex-ternal factors, for example, the class of vessel, time of day, tidal status. Sincethese may have to be taken into account when analysing the data as the form ofthe anomaly may vary.

Critical to marine anomaly detection is an interpretation of the data thatallows the salient features of the desired anomaly to be identified, (Laxhammaret al 2009). Further, models for different kinds of anomalies may need to becombined or considered to increase the certainty of an anomaly being detected.For example a model identifying anomalous vessel speeds could be combinedwith a model of anomalous zone entries (anomalous spatial locations). Vesselsidentified as demonstrating anomalous speeds may be pleasure craft in a knownunrestricted speed location, giving false positives if simply considered only onthe basis of the speed model. Conversely a vessel entering a port at high speedmay be highly anomalous behaviour.

3. Identifying Anomalous Points

The approach we detail in this paper develops methods for the two main un-derlying tasks within anomaly detection: modelling normality and subsequentlyusing a cost function to identify points as anomalous. In this section we constructa model of normality using Gaussian processes (GPs) which allows us to cap-ture the dynamics of vessels in a non-anomalaous data set without prescribinga particular parametric form (such as is required for Markov state models, forexample). The GP provides a sequentially updated posterior distribution overunseen data, which we link to an extreme value distribution to provide a robustand adaptive metric for anomaly detection.

3.1. Gaussian Processes

In order to model the vessel track we use a Gaussian process, providing a mecha-nism to continuously predict vessel locations at any future time point, includinga measure of uncertainty about the vessel location. The GP is a stochastic pro-cess, (Rasmussen et al 2006), that expresses the dependent variable, y, in termsof an independent variable (called the input variable) x, via a function f(x). Thisfunction we can see as a draw from a probability distribution over functions,

4 M. Smith et al

y = f(x) ∼ GP(m(x), k(x,x)),

where m(x) describes the mean function of the distribution at x and k is acovariance function which describes the information coupling between two valuesof the independent variable as a function of the distance of their respective inputs.The matrix resulting from application of the covariance function to a vector ofinput variables can then be expressed as K(x,x). The covariance function thusencodes our prior beliefs and assumptions about the function that we wish tomodel, (Rasmussen et al 2006). Valid covariance functions can take a variety offorms which we quantify empirically in Section 4. Denoting r = |xp − xq| as the(Euclidean) distance between two independent values, xp and xq, we considerthree covariance functions: the squared exponential

kSE(r) = σ20exp

(− r2

2λ2

), (1)

the Matern 32

k 32(r) = σ2

0

(1 +

√3r

λ

)exp

(−√

3r

λ

), (2)

and the Matern 12 covariance function

k 12(r) = σ2

0exp(− rλ

). (3)

The above selection was driven by prior knowledge about typical vessel tra-jectories, for example, periodic kernels, (Rasmussen et al 2006), were excludedfrom the set of covariance functions as the data in our feature space is not recur-rent. Hence, we opted for covariance functions capable of reflecting the physicalproperties of shipping vessels, such as smoothness and differentiability.

We also assume that observations of vessel tracks are corrupted by addi-tive i.i.d Gaussian noise with variance component ε2. Thus, the full covariancefunction for the observations is given as

v(xp, xq) = k(xp, xq) + ε2δ(|xp − xq|),

where δ is the Kronecker delta, which is one if p = q and zero otherwise. Thematrix resulting from application of the covariance function to a vector of inputvariables can then be expressed as V (x,x).

The hyperparameters σ0, λ and ε are, respectively, the amplitude, output andnoise scale. They encode the characteristics of the track and so depend on thedynamics of the vessel. A vessel undertaking manoeuvring will not exhibit thesame smooth track characteristics as one exhibiting regular motion. Thus, thehyperparameters need to be learnt from an anomaly free training data set whichconsists of n observations, D = {(xi, yi)|i = 1, . . . , n}. The xi and yi pointsrepresent the independent and dependent variable values respectively.

The nature of the Gaussian process is such that, conditional on observed


data, predictions can be made about the function values, f(x∗) at any locationx∗. The distribution of these values at point x∗ is Gaussian

f∗|x∗,x,y ∼ N (f∗,Var[f∗]).

with the mean and variance given as

f∗ =m(x∗) +K(x, x∗)>V (x,x)−1(y −m(x)),

Var[f∗] =K(x∗, x∗)−K(x, x∗)>V (x,x)−1K(x, x∗).

(4)

3.2. Sequential Gaussian Process Updates

In may real world problems we receive data sequentially and the data set cangrow to an arbitrarily large size. If we were to continue to update our beliefsin the light of new observations we could naively repeat the matrix inversion inEquation 4 with every observation. This inversion is expensive as its computa-tional complexity grows as O(n3) in the number of samples, i.e. the dimensionof V above. Closer inspection however, reveals that matrix V is changed onlyin the addition of a new row and column with increasing n. Hence, it is possi-ble to reformulate the matrix inversion as a sequential Cholesky decomposition,(Osbourne 2010), and the entire n × n matrix inversion does not have to berecalculated each time n increases.

We decompose a matrix into the product of a lower triangular matrix, R,and its conjugate transpose

V (x,x) , R(x,x)>R(x,x).

Based on this decomposition, the parameters of the predictive distribution aregiven as

f∗ =m(x∗) + b>x,x∗axV (x∗, x∗),

Var[f∗] =V (x∗, x∗)− b>x,x∗bx,x∗ ,

(5)

and where a and b are given as

ax ,R(x,x)>\(y −m(x)),

bx,x∗ ,R(x,x)>\V (x, x∗).(6)

When we receive new data, the V matrix is changed only in the addition of somenew rows and columns, i.e.

V (x,x) =

(V (x1:n−1,x1:n−1) V (x1:n−1, xn)V (xn,x1:n−1) V (xn, xn)

).

Consequently the Cholesky decomposition can also be computed iteratively, (Os-bourne 2010), via

6 M. Smith et al

R(x1:n,x1:n) =

(R(x1:n−1,x1:n−1) S

0 U

),

where

S =R(x1:n−1,x1:n−1)>\V (x1:n−1,x1:n),

U =chol(V (xn, xn)− S>S).

Using the iterative Cholesky update, the predictive distribution, Equation 5, canalso be expressed iteratively by computing the vector a, in Equation 6, via thefollowing rule

a1:n =

(a1:n−1

U>\(yn −m(xn)− S>a1:n−1)

).

The recursion avoids the computationally expensive matrix inversion, in Equa-tion 4, and allows the Cholesky factor to be expressed as an efficient update rule.Thus the GP can be adapted to efficiently update its belief in the light of newevidence, as necessitated for online operation. We next consider how to furtheradapt the GP model in order to capture the correlation between time varyinglongitudinal and latitudinal data.

3.3. Multiple-Output Gaussian Processes

The GP model can be adapted to consider the correlation between multipleoutputs through parametrization of the covariance matrix, (Osbourne 2010).Thus it is possible to discover the function that describes the relationship betweenlatitude and longitude.

Through unconstrained optimization, where the upper-triangular elementsin the variance-covariance matrix are re-parameterised in such way that theresulting estimate must be positive semi-definite we can discover the correlationbetween the selected output for the given inputs.

While many techniques for parametrization of the covariance matrix exist,(Pinheiro et al 1996), the spherical parametrization benefits from the computa-tional efficiency of the Cholesky decomposition and allows the parametrisationto be expressed in terms of the variance and covariance between outputs. Inthis form it is the Cholesky factor of the output n × n covariance which is pa-rameterised, Σ = L>(θ)L(θ). Thus it can be ensured that the final covariancematrix remains positive semi-definite. By letting Li denote the ith column of theCholesky factorisation of Σ and li denotes the spherical coordinates of the firsti elements of Li, where i = 2, . . . , n. We can then express the parameterisationsas


[L1]1 = [l1]1[Li]1 = [li]1 cos([li]2)

[Li]2 = [li]1 sin([li]2) cos([li]3)

. . .

[Li]i−1 = [li]1sin([li]2) . . . cos([li]i)

[Li]i = [li]1 sin([li]2) . . . sin([li]i).

The product of the Cholesky factor parametrisation of the output spherical co-variance between two outputs (latitude and longitude) can therefore be expressedas

Σ =

[[l1]21 [l1]1[l2]1 cos([l2]2)

[l1]1[l2]1 cos([l2]2) [l2]22

].

We now consider a principled means of determining whether a new data pointshould be updated into the model of normal system behaviour.

3.4. Extreme Value Theory

Extreme value theory has previously been used in novelty detection (Roberts2000, Lee at al 2008). Extreme value theory provides a probability that a datapoint is anomalous. A threshold probability is chosen beyond which we can quan-tify a value as having not arisen from the underlying distribution. The the-ory itself focuses on the statistical behaviour of Mn = max{X1, . . . , Xn} whereX1, . . . , Xn is a sequence of independent random variables with a distributionfunction F . In theory the distribution of Mn can be derived exactly for all valuesof n, i.e.

Pr{Mn ≤ z} =Pr{X1 ≤ z, . . . , Xn ≤ z}=Pr{X1 ≤ z} × . . .× Pr{Xn ≤ z}={F (z)}n.

(7)

In practice the distribution function F is unknown and extreme value theoryallows us to approximate this distribution. It states that the entire range ofpossible limit distributions for Mn is given by one of three types of cumulativedistribution function, I, II and III, known as the Gumbel, Frechet and Weibull,respectively, and given as

I : G(z) = exp

{− exp

[−(z − βα

)]}−∞ < z <∞

(8)

II : G(z) =

0, z ≤ β,

exp

{−(z − βα

)−ξ}, z > β

8 M. Smith et al

III : G(z) =

exp

{−

[−(z − βα

)ξ]}, z < β

1, z ≥ β

Each family has a scale and location parameter, α and β respectively. Ad-ditionally the Frechet and Weibull families have a shape parameter ξ, (Coles2001). Although we have three models to choose from, the underlying targetdistribution, F , in our case is assumed to be Gaussian, due to the modellingconstraints imposed by the GP. The extreme value probability is then restrictedto the analytical form of the Gumbel distribution.

Assuming that some “normal” data is identically and independently Gaussiandistributed, one can obtain the extreme quantiles by inverting Equation 8

zp = β − α log (− log (p)).

The value of p acts as a novelty threshold, below which a test point is classified“abnormal”. The parameters α and β require estimation and typically dependon the sample size n of the data set. As proposed in (Roberts 2000), we makeuse of decoupled estimators for α and β given, respectively, as

α = (2 log (n))− 1

2 (9)

β = (2 log (n))12 − log (log (n) + log (2π))

2 (2 log (n))12

. (10)

In the next subsection we show how extreme value theory can be applied toanomaly detection in Gaussian processes (GPs).

3.5. Gaussian Process-Extreme Value Theory (GP-EVT)

In much of the existing work on novelty detection using extreme value methods,the work has focused on non-sequential conditions, or more precisely, on a fixedtraining dataset. While the extreme value will adequately account for the changesin our belief about the location of extreme events for a fixed sample size, theframework is rarely extended to account for dynamic changes in the underlyinggenerating distribution and changes in the sample size.

In this work we model the typical system dynamics using GP regression.At some arbitrary point in the future, say x∗, we can interrogate the GP andcompute the predictive (Gaussian) distribution at that point, conditional onthe trajectory’s past samples. This predictive distribution, which now featuresa context (time) dependent mean, f∗, and variance, Var[f∗], allows rescaling ofthe extreme event quantile e,

e = f∗ +√

Var[f∗]zp (11)

and so reflects temporal changes in the statistics of the base distribution.In order to estimate the number density of data points n(x∗) at each x∗ in

Equation 5, a Gaussian kernel smoother is applied using


n(x∗) =

m∑i=1

φh (x∗;xi) , (12)

where m is the total number of previously observed non-anomalous observationsand φh(x∗;xi) is a (non-normalised Gaussian) radial basis function

φh (x∗;xi) = exp

{−| x∗ − xi |

2

2h2

},

in which xi is the most recent observation and | · | denotes the Euclidean distance.The kernel width h is set to be equal to twice the length scale λ in Equations1, 2 or 3, depending on the choice of kernel used to model the vessel tracks.This coupling of λ to the GP regression model ensures that tracks with longcorrelation lengths and smaller sampling rates will feature the same sensitivityto outliers as tracks with short correlation lengths and high sampling rates. Also,the coupling ensures that the smoothing of the sampling processes does not comeat a cost of an additional parameter which would require additional estimation.

With the expected number of observations obtained by Equation 12, theextreme value distribution parameters can be updated in a timely fashion toreflect also the dynamics of the sampling process. Thus, the scaling (Equation9), and location (Equation 10), parameters can be estimated, using the predictednumber of data points, n(x∗), contributing information at the location of interestx∗, (Miller 1992), by

α(n∗) = (2 log (n(x∗)))− 1

2 (13)

and

β(n∗) = (2 log (n(x∗)))12 − log (log n(x∗)) + log (2π)

2 (2 log (n(x∗)))12

(14)

for a fixed novelty detection threshold, p, which in this work is set to 0.95.To reiterate, the GP provides a mechanism to predict the distribution of

future mean values and to adjust the scaling of the extreme value quantile. Also,the kernel smoothing approach to the sampling process provides an estimate ofthe future sample size. Their combination is used for novelty detection. If the newdata point value falls within a novelty measure of the predicted value then thenew data point is included in the model update. The key advantage of using ourapproach is thus the incorporation of future uncertainty in both sampling andobservation processes to provide the means for a more accurate novelty detectionalgorithm. The graphical model representation of the complete model is shownin Figure 1.

4. Application

We demonstrate the efficacy of the approach presented in the previous sectionby application to both synthetic data and real data. We use synthetic data to

10 M. Smith et al

GP

y

n

x∗ε

σ0

λ zp p

Fig. 1. A graphical model representation of the GP-EVT model. At the centreis the Gaussian process (GP) which models the track’s dynamics. It has fixed(pre-inferred) hyper-parameters shown as grey nodes to the left. Also shown isthe estimate of the sample size, n. The extreme value percentile is a deterministicnode, shown as a square box, which depends upon p, the novelty level, and samplesize n.

illustrate some of the features of our method, and provide a real world exampleof its application to vessel tracks.

4.1. Synthetic Data Illustration

Synthetic data was generated from the Matern32 kernel, Equation 2, with pa-

rameters set to σ0 = 1, λ = 2 and σ = 0.01. Anomalies were generated byoffsetting randomly selected samples that were previously drawn from the GPby a fixed offset value, making the point anomalous with respect to surroundingpoints. The GP predictive distribution was calculated for 1000 samples withinthe windowed region of track, the window ending at the time period for the newobserved sample. A fixed kernel width was used in order to estimate n at eachx∗.

GP extreme value theory was then applied by considering each new datapoint with respect to the previously learned underlying function. If the newpoint falls within the predictive uncertainty of the next data point it is includedin the sequential update otherwise it will be excluded. An example of such anupdate step is shown in Figure 3. The new data point falls outside the EVTbound, Equation 11, and so has been excluded. Notice, that if a data point hasnot been observed for a period, the predictive uncertainty grows, allowing forthe possibility of a dynamic change in the underlying base function and the newdata point to be included in the update. In this manner anomalous points withinthe data can be clearly identified while accommodating for the dynamics of theunderlying function and irregular observations. An intuitive illustration of howthe irregularity of observed data points effects the scaling of the extreme valuedistribution, and hence our novelty bounds, is given in Figure 2.


0 10 20 30 40 50 60 70−5

0

5

Time (Arbitrary Units)

Y

0 10 20 30 40 50 60 700

10

20


Num

ber

of P

oint

s

−4 −3 −2 −1 0 1 2 3 40

5

10

Y

f(Y

), fe

(Y)

(a) The sequential GP-EVT (continuous line, upper plot) stopped at a region of high observa-tion density. The estimate of the number of data points which contribute to the GP inferencehas increased significantly, as indicated by the observation density shown in the middle plot.Consequently the location of the extreme value distributions, illustrated by the continuouslines in the bottom plot, move away from the posterior predictive distributions (dashed line).

0 10 20 30 40 50 60 70−5

0

5


Y

0 10 20 30 40 50 60 700

10

20


Num

ber

of P

oint

s

−4 −3 −2 −1 0 1 2 3 40

0.5

1

Y

f(Y

), fe

(Y)

(b) The sequential GP-EVT (continuous line, upper plot), stopped at a region of very lowobservation density. The estimate of the number of data points which contribute to the GPinference has decreased significantly, as indicated by the observation density shown in themiddle plot. Consequently the location of the extreme value distributions, illustrated by thecontinuous lines in the bottom plot, move toward the posterior predictive distributions (dashedline).

Fig. 2. Simulation of the effect of varying observation density on the extremevalue distribution and hence the anomaly detection. Both plots show a snapshotfrom the last observed sample. In 2a the observation rate is high, while in 2bthe observation rate is low. The observation density affects the location of theprobability density function of the extreme value distribution fe(y) (blue lines,lower plot), relative to the predicted Gaussian PDF f(y) (red dashed line, lowerplot), drifting closer to the base distribution as the number density of pointsdecreases. This is due to the relationship between the observation density andthe location and scaling of the extreme value distribution, expressed in Equations13 and 14.

12 M. Smith et al

2 3 4 5 6 7 8−4

−3

−2

−1

0

1

2

3

4

X

Y

(a) The GP predicts forward to the new artificially perturbed data point, and by using GP-EVTthe new observation is classified as an anomaly.

2 3 4 5 6 7 8 9 10−4

−3

−2

−1

0

1

2

3

4

X

Y

(b) GP predicts forward after detecting and excluding the artificial anomaly. The uncertaintybounds continue to increase (until they reach their maximum as set by the prior distribution).The subsequent observations fall well within the error bound and so will be included in thenext update.

Fig. 3. Simulation of the GP prediction and anomaly detection. The continuousline shows the predicted mean function and the grey areas show the EVT boundof the GP predictive distribution for p = 0.95. The bound is open to the rightand widening until the next observation has been included. Once it has, thestandard deviation bound of the GP is updated, as seen between time steps 4and 9. The dashed line shows the error bound produced if we consider the 95%bound from the mean function (1.64 standard deviations from the mean).

4.2. Vessel Track Anomaly Detection

The GP-EVT methodology was also applied to real-world vessel track data,consisting of a set of GPS coordinates. This is presented in an appropriate onedimensional feature space parameterisation, providing a means of identifyingchanges in vessel dynamics.

Feature Extraction In order to convert data to a sufficient feature space rep-resentation we consider the first received data point as the beginning of thevessel track. We relate all subsequent data points to it by computing both thedistance and time taken from this originating sample point. In order to take intoaccount the approximated spherical geometry of the earth’s surface we calculatethis distance by application of the Haversine formula,


Matern 32

Matern 12

SE

0.3345 0.3312 0.3337

Table 1. Table of mean standardised likelihood scores for the Matern 32 , Matern 1

2and standard squared exponential kernels.

0.15

0.2

0.25

0.3

0.35

0.4

0.45

Matern 3/2 Matern 1/2 Squared exponential

Sta

ndar

dise

d Lo

g Li

kelih

ood

Sco

res

Fig. 4. Box plot of log scores for the different covariance functions applied toeach track.

A = cosφs cosφf

∆σ = arctan

(√sin2

(∆φ

2

)+A sin2

(∆λ

2

))where φs and φf are the latitude of two points and ∆λ and ∆φ are their differ-ences in longitude and latitude respectively. This choice of feature space has theadvantage of converting the GPS information into a 1D feature vector, reducingthe computational demands of processing the data. Also, the arc length betweenpoints d for a sphere of radius r and ∆σ is given in radians by d = r∆σ.

Choice of Covariance Functions The choice of covariance function is crucialin the methods ability to provide the most accurate representation of the ves-sel dynamics. To determine the optimal covariance function we investigated theperformance of the standard squared exponential kernel, Equation 1, Matern3

2 ,

Equation 2, and the Matern12 kernel, Equation 3. Clean, i.e. anomaly free train-

ing data, was extracted from the training corpus and the GP kernel functionparameters were estimated by maximising the marginal likelihood of the data,(Rasmussen et al 2006). The likelihoods were standardised to the training datalength and the mean standardised likelihoods are shown in Table 1.

The results suggest almost comparable performance, in terms of goodness offit, of all three tested covariance functions. However, as shown in Figure 4, thereis a substantial difference in the robustness. The Matern 1

2 kernel frequently findspoorer fits to the data. The squared exponential performs in the middle range,occasionally finding worse solutions than the Matern 3

2 kernel but better than

the Matern 12 kernel.

14 M. Smith et al

(a) A plot of the GPS trackin which there were no detectedanomalies.

0 0.02 0.04 0.06 0.08 0.1 0.12−3

−2.5

−2

−1.5

−1

−0.5

0

0.5

1

1.5

Time Difference

Nor

mal

ised

Dis

tanc

e

(b) Sequential predictions applied to feature ex-tracted data, also showing that all data pointsfall within the EVT bound.

Fig. 5. Sequential GP-EVT method applied to a dredging vessel operating offthe coast of France near Le Havre.

Vessel Track Modelling The methodology was also applied to real world ves-sel track data. The Matern 3

2 kernel was chosen to model the underlying dynamicsusing hyperparameters learnt from anomaly free training data, which were cho-sen to be sufficiently long enough so that the underlying dynamics of the vesselscould be captured.

Figure 5 shows an example vessel track without outlying points. The track isfrom a dredger which follows a smooth trajectory and does not make any suddenchanges in acceleration. Shown in Figure 5b are the sequential EVT bounds sea-sawing until the next observation arrives. All observations fall well inside thepredictive boundary of the GP-EVT bound and, consequently, no anomalies aredetected.

Figure 6 shows an example of a vessel track with some points which our modellabels as anomalies. As can be seen in Figure 6a, the vessel remains within a con-fined area and there are short sudden movements, Figure 6b. These are markedas anomalies and are perhaps the result of the vessel drifting, manoeuvring orbeing moored. Figure 6c also shows an enlarged section of the sequential com-putation of the EVT bound and makes clear the non-linear relationship betweenthe GP standard deviation bound and the actually computed EVT bound whichincludes essential parameters such as the inferred number of observations.

5. Comparison of GP-EVT with a Traditional KalmanFilter Approach

In this section we compare a traditional approach to anomaly detection with ourGaussian process and EVT approach. The traditional approach uses a Kalmanfilter (KF) to model the normal behaviour of the ship and then determines thatthe data is anomalous if it is more than a fixed number of standard deviations


(a) A plot of the GPS trackin which there were several de-tected anomalies.

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14−2

−1

0

1

2

3

4

5

Time Difference

Nor

mal

ised

Dis

tanc

e

(b) Sequential predictions applied to feature ex-tracted data, also showing some data pointsthat fall outside the GP-EVT bound.

0 0.005 0.01 0.015 0.02 0.025 0.03−1.5

−1

−0.5

0

0.5

1

1.5

Time Difference

Nor

mal

ised

Dis

tanc

e

(c) Magnified section of the plot in Figure 6b.The GP-EVT bound has predicted forward tothe new data point and included the point inthe update.

Fig. 6. Sequential GP-EVT method applied to a small vessel operating off thecoast of France near Cherbourg and whose track suggests unusual navigationbehaviour.

from the mean, (Markou et al 2003) (typically 3 to 5 standard deviations). Thisapproach to anomaly detection we term the gating approach.

Further, the KF approach requires a process model of the normal behaviourof the ship. Typically, a near constant velocity model is chosen to model thecontinuous trajectory without imposing any excessive smoothness on the trajec-tory, (George et al 2011). We compare our approach to a KF which uses the nearconstant velocity model. This provides a fair comparison as both Matern 3

2 andconstant velocity model are second order differentiable. We further investigateboth a traditional KF using the standard deviation gating approach to excludeanomalies and a KF which uses the EVT in a manner similar to the GP. In sodoing, we are able to compare both models of normal ship behaviour (namelythe Matern 3

2 and the near constant velocity model) and also both approachesto detecting and excluding anomalies (namely, the EVT and standard deviationgating approaches).

When using the KF and GP the mean and standard deviation were predicted

16 M. Smith et al

GP-EVT GP KF-EVT KF

0.8032 0.7889 0.6545 0.6119

Table 2. AUC for KF using the near constant velocity model (with and withoutEVT) and GP using a Matern 3

2 model (again with and without the EVT). Thiswas repeated for a range of confidence regions at 1, 1.64, 3 and 5 standarddeviations. Correspondingly the confidence region of the GP and KF using theextreme value approach was set by selecting probabilities of 0.84, 0.95, 0.99 and0.999.

forward to the same time step as the new observation. If the point lies withina pre-chosen confidence region (defined as a multiple of the standard deviationabout the mean) it is included in the update. ROC curves were plotted for theresults. The resulting area under curve (AUC) which compares the KF using thenear constant velocity model (with and without EVT) against the GP using aMatern 3

2 model (again with and without the EVT) are shown in Table 2.We note that both the KF and GP performance is significantly improved

using the EVT as opposed to gating. This is due to the fact that the gatingapproach uses a fixed threshold which does not take into account the density ofobservations i.e as we observe more samples we gain a better understanding of thetrue distribution of values. The EVT, however, uses a dynamic threshold whichtakes into account the density of observations therefore better utilises availableinformation to adjust the threshold.

Although the results indicate a significant improvement of our model overthe KF approach this is a limitation of the near constant velocity model usedand not a critique of KF based methods. We note that the Matern 3

2 GP modelcan be efficiently implemented within the KF as a Markov process model, (Har-tikainen et al 2010). Furthermore, it is possible to implement the near constantvelocity model as a GP model see, for example (Reece et al 2010). Thus, it ispossible to match the AUC of the KF approach and GP approach by replacingthe near constant velocity model in the KF by the Markovianised Matern model,(Hartikainen et al 2010).

However the results illustrate the significant improvement obtained usingEVT as opposed to a simple gating mechanism based on the number of standarddeviations between a datum and the expected position of the ship.

6. Identifying Anomalous Tracks

Whilst the previous section dealt with the identification of anomalous points,it does not allow the track to be considered as a whole. Thus patterns and ab-normalities at a broader scale may be missed. To address this issue we clusterthe non-anomalous vessel tracks and then detect abnormalities from the identi-fied cluster. Discovery of clusters is dependent on identifying tracks with similarcharacteristics, as determined by the characteristics of the vessel class. We dis-cover similarity (or adjacency) through the use of GP modelling and applicationof the Hellinger distance.


6.1. Hellinger Distance

Detection of clusters is dependent on the creation of a matrix of distances (anadjacency matrix), in this instance this is not simply a distance in the spatialsense but must express the distance in the intrinsic dynamics that created thetrack i.e. the track shape. Fortunately the GP describes a draw from a probabil-ity distribution over functions, and many distance measures between probabilitydistributions exist, (Basseville 1989). However, only true metrics produce a sym-metric adjacency matrix, (Budzynski et al 2008). The Hellinger distance is onesuch measure of similarity between two probability distributions and is definedas

h2 (f (x) , g (x)) =1

2

∫ (√f (x)−

√g (x)

)2dx,

We consider the case in which f and g are GPs, defined via their mean andcovariance functions.

In order to simplify the derivation of the squared Hellinger distance, h2 be-tween GPs, it can be assumed that both distributions have the same zero mean,µf = µg = 0, which we ensure within the data by trend removal. Thus thesquared Hellinger distance can take the simplified form

h2 (f (x;µf ,Σf ) , g (x;µg,Σg)) = 1−| 12Σ1 + 1

2Σ2|−12

|Σf |−14 |Σg|−

14

. (15)

Thus in this case, the distance metric is dependent only on the choice of kernelfunction used to model the data.

Application of a GP model to each track and estimation of the track hyperpa-rameters through maximising the marginal likelihood of the data allow the trackdynamics to be inferred. The combination of GPs and the Hellinger distance inthis manner provides the ability to define a measure between two stochasticallysampled functions. Comparison between tracks can be made through generationof a covariance matrix using hyperparameter values inferred from the GP appliedto each track and a common input scale. An adjacency structure can then beformed by calculating the inverse distance between each GP.

6.2. Community Detection

The use of inverse Hellinger distance between GPs allows us to map our datato a relational space, where each pair i, j of vessel tracks is assigned a similarityvalue sij . We encode all similarity pairs to a matrix S so that sij is the degreeof coupling between the paths of vessels i and j. From such relational structure,we seek to apply an appropriate clustering scheme in order to extract classes ofvessels that exhibit similar movement patterns. In (Reece et al 2011) we notetracks were clustered using GPs and in which the most likely mixture modelfor the data was identified, this corresponding to the cluster or community fora given type of track. The distance measure in this instance was the likelihoodfunction. Unfortunately, this method scales poorly with the number of tracks.

By treating S as an adjacency matrix from a network analysis perspective,where N vessels are nodes and similarities (in the Hellinger metric sense) are

18 M. Smith et al

link weights, we apply the idea of community detection, (Porter et al 2009), inorder to discover groups of strongly connected nodes, so that a given vessel i hasmore similar paths with vessels inside a community than with the ones outside.

Towards the above goal, we employ a Bayesian non-negative matrix factori-sation (NMF) scheme, (Psorakis et al 2011), which has already been successfullyapplied to a wide range of community detection problems, (Psorakis et al 2011,Simpson et al 2013, Psorakis et al 2012). In this approach, communities aretreated as explanatory latent variables for the observed link weights, so that thestronger the similarity between two vessel paths the more likely it is that theybelong to the same community. We extract such latent grouping via an appro-priate factorisation of S 'WH, S ∈ RN×N , W,HT ∈ RN×K , where both theinner rank K and the factor elements wik, hki are inferred via an appropriateMaximum a Posteriori (MAP) scheme. This model has the advantage of not onlydiscovering overlapping communities (soft-partitioning) but also quantifying howstrongly each vessel belongs to a particular class. Such result allows us to quan-tify how the broadcast vessel class “disagrees” with the one we infer from thepath similarity matrix. It also avoids the scaling issues as posed by the methodproposed in (Reece et al 2011).

6.3. Detecting Anomalous Tracks

Anomalous tracks can be identified by noting those tracks not assigned to avessel community i.e. those that have track characteristics sufficiently differentfrom all other vessel tracks and are not assigned to a cluster. Additionally we canidentify discrepancies between the class of vessel from which the vessel claims tobelong and the cluster to which it has been assigned.

Furthermore we note that once a reference cluster has been formed using theNMF community model, and a soft membership score πi obtained for each node,then subsequent tracks can be tested using extreme value theory. By taking eachof the distances di between tracks in a cluster the standard deviation of thegroup can be calculated

sd =

√√√√ 1

N

N∑i=1

πid2i ,

where N is the number of distances within the group. Selecting the trackwith the highest probability of belonging to the community as the centroid wecan now test subsequent tracks to identify anomalies with respect to the cluster.For example a vessel claiming to belong to a given vessel class can be testedagainst a reference community to identify anomalies with respect to the testcommunity. Extreme value theory can again be applied in the same manner asused previously; estimating the variability of the population based on the obser-vation density. The distance between a given test track and the group centroidcan then be determined and appropriately scaled to the group using the standarddeviation calculated above.


010

2030

4050

−0.4

−0.2

0

0.2

0.4−0.4

−0.2

0

0.2

0.4

XY

Z

(a) Functions generated from a multiple-outputGP with hyperparameter values of ε = 5, λ =0.01, [l1]1 = 0.1, [l2]1 = 0.1 and [l2]2 = 0.1 (setA). The mean function using the inferred hy-perparameter values is shown plotted throughthe sample data.

010

2030

4050

−0.1

−0.05

0

0.05

0.1−0.08

−0.06

−0.04

−0.02

0

0.02

0.04

XY

Z

(b) Functions generated from a multiple-outputGP with hyperparameter values of ε = 20,λ = 0.01, [l1]1 = 0.02, [l2]1 = 0.02 and [l2]2 =0.02 (set B). The mean function using the in-ferred hyperparameter values is shown plottedthrough the sample data.

Fig. 7. Plots of points drawn from a multiple output GP and the subsequentmultiple output GP regression.

7. Application

We demonstrate the effiacy of the method through application to both real andsynthetic data. Synthetic data is used to illustrate the principles behind themethodology and a real world example is used to demonstrate its applicability.

7.1. Synthetic Data Illustration

Data was generated from a multiple-output GP using a Matern 32 kernel with

known scaling parameters. For the first 6 generated functions, Figure 7a, hyper-parameter values of ε = 5 and λ = 0.01 were respectively assigned to the outputand noise scale. Additionally the output correlation hyperparameters were as-signed values [l1]1 = 0.1, [l2]1 = 0.1 and [l2]2 = 0.1, these hyperparameter valueswill be referred to as set A. A further 6 functions were generated from a GP, Fig-ure 7b, with hyperparameter values of ε = 20, λ = 0.01, [l1]1 = 0.02, l2]1 = 0.02and [l2]2 = 0.02, these hyperparameter values will be referred to as set B.

Inferring hyperparameter values from the generated functions and taking theinverse of the Hellinger distance metric between tracks provides for the creationof a adjacency matrix, Figure 8. This allows a clear split in the data to beidentified.

Application of the network NMF community detection then correctly iden-tifies functions 1 through 6 (generated from set A hyperparameters) as beingfrom one community, and functions 7 through to 12 as belonging to a distinctlyseparate community (generated from set B hyperparameters).

20 M. Smith et al

2 4 6 8 10 12

2

4

6

8

10

12

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Fig. 8. Synthetic data example: Matrix of the inverse Hellinger distance (adja-cency) between inferred functions. The axis of the matrix correspond to (in order)the distance between functions 1 through 6 (generated from set A hyperparam-eters) and then functions 7 through 12 (generated from set B hyperparameters).Distances which are very close to one another have a value close to 1 and furtherapart close to zero, this is as represented via the colour bar to the right of theFigure.

7.2. Vessel Track Anomaly Detection

The outlined methodology was also applied to real-world vessel track data. Thedata consists of a set of collected AIS data, this was combined with the reg-istered vessel information in order to identify the class of vessel. As such thedata collected can be used to evaluate the methodology due to possession of theground truth. A small illustrative test set of data is used to provide a means ofillustrating the technique. The tracks used were for cargo vessels shown in Figure9, fishing vessels shown in Figure 10, and sailing vessels shown in Figure 11. Theinverse Hellinger distance (adjacency) between tracks is shown in Figure 12. Theresulting community structure from application of the aforementioned techniqueis shown in Figure 13.

Abnormalities can be identified in Figure 13 by noting those tracks unassignedto a community (Tracks 8-12). The majority of the unassigned tracks belong tothe sailing vessel class Figure 11, and it can noted that these tracks do not exhibitcommon movement characteristics.

8. Conclusion

Extreme value theory has proven to be an extremely successful framework foranomaly detection. Unlike novelty detection based directly on the sample dis-tribution, extreme value distributions capture our beliefs that extreme eventsshould become more extreme if large numbers of measurements are expectedand vice versa. Such detection, however, has to be dynamic, context sensitiveand timely if it is to be useful for marine tracking. Extreme value distributionsalone are not readily adapted to perform this task.

In this paper we present an alternative to endowing extreme value distribu-tions directly with dynamic properties. Our approach simultaneously models thedynamic properties of the underlying extreme value generative distribution and


00.1

0.20.3

0.4

−1

−0.5

0

0.5

1−0.2

−0.1

0

0.1

0.2

Time

Cargo 1

Latitude

Long

itude

(a)

00.1

0.20.3

0.4

−1

−0.5

0

0.5

1−0.2

−0.1

0

0.1

0.2

Time

Cargo 2

Latitude

Long

itude

(b)

00.02

0.040.06

0.08

−0.4

−0.2

0

0.2

0.4−0.1

−0.05

0

0.05

0.1

Time

Cargo 3

Latitude

Long

itude

(c)

00.05

0.10.15

0.2

−0.5

0

0.5

1−0.1

−0.05

0

0.05

0.1

Time

Cargo 4

Latitude

Long

itude

(d)

Fig. 9. GP multiple output regression through cargo vessel GPS coordinates.

the dynamic properties of the data sampling process. To our knowledge this isthe first time that extreme value distributions have been made dynamic throughthe use of GP’s.

Our approach offers several advantages. GPs provide a flexible, non-parametricand intuitive tool to describe typical vessel dynamics. Also, measurement andprediction is performed in continuous time thus allowing on-demand anomaly de-tection. The sequential update of the GP covariance matrix bypasses the need forinverting massive matrices and substantially reduces the computational burdensfor which GPs are well known.

Our empirical experiments on vessel data suggest that the method is capa-ble of detecting anomalies that resemble mooring or drifting, and unexpecteddepartures from regular movements. The sample size prediction plays the im-portant role of adapting the observation process in time. As the effective samplesize reduces, the extreme value distribution approaches the regular Gaussiandistribution, as Equation 7 suggests. With increasing density of observations,however, the extreme value distribution diverges and EVT bound increases. Al-though, the choice of GP kernel function becomes less critical with increasingamounts of data, for smaller sample sizes, the kernel function is critical andour empirical results have shown that the Matern3

2 kernel outperforms the nearconstant velocity model.

Furthermore by moving away from the detection of anomalous points and

22 M. Smith et al

00.1

0.20.3

0.4

−0.1

−0.05

0

0.05

0.1−0.02

−0.01

0

0.01

0.02

Time

Fishing 5

Latitude

Long

itude

(a)

00.05

0.10.15

0.20.25

−0.04

−0.02

0

0.02−0.04

−0.02

0

0.02

0.04

Time

Fishing 6

Latitude

Long

itude

(b)

00.01

0.020.03

0.040.05

−0.04

−0.02

0

0.02

0.04−0.015

−0.01

−0.005

0

0.005

0.01

0.015

Time

Fishing 7

Latitude

Long

itude

(c)

0

0.02

0.04

0.06

−0.01

−0.005

0

0.005

0.01−4

−2

0

2

4

6

x 10−3

Time

Fishing 8

Latitude

Long

itude

(d)

Fig. 10. GP multiple output regression through fishing vessel GPS coordinates.

considering the similarity between tracks via use of the Hellinger distance be-tween Gaussian processes we have provided a methodology in which anomaloustracks can be detected. A more appropriate application of the work may be inidentifying deviations from typical shipping routes. Vessels which are forced tooperate under certain operating conditions may belong to a community wheretheir tracks are forced to take a restrictive form; conditioned by the operat-ing conditions. Vessels deviating from the constrained conditions may have adistinctly different functional form which may be identified as belonging to adifferent community structure.

9. Future work

The representative choice of distance as the dependent variable feature for anomalydetection is open to discussion. While it provides a single dimension and, thus,fast estimation it does fail to capture some aspects of ship tracks. To capture suchfeatures the GPS coordinates can be simultaneously modelled with a bivariateGP and extrema modelling.

The training using typical vessel tracks will be extended to shipping lanesand vessel types. This allows anomaly detection not just on the basis of individ-ual points but entire tracks and so offers the possibility of preventing accidents


00.05

0.10.15

0.20.25

−4

−2

0

2

x 10−4

−1.5

−1

−0.5

0

0.5

1

1.5

x 10−4

Time

Sailing 9

Latitude

Long

itude

(a)

00.1

0.20.3

0.4

−5

0

5

10

x 10−5

−6

−4

−2

0

2

4

x 10−5

Time

Sailing 10

Latitude

Long

itude

(b)

00.1

0.20.3

0.4

−2

−1

0

1

2

x 10−4

−2

−1.5

−1

−0.5

0

0.5

1

x 10−4

Time

Sailing 11

Latitude

Long

itude

(c)

00.01

0.020.03

0.040.05

−4

−2

0

2

4

x 10−5

−4

−2

0

2

4

x 10−5

Time

Sailing 12

Latitude

Long

itude

(d)

Fig. 11. GP multiple output regression through sailing vessel GPS coordinates.

2 4 6 8 10 12

2

4

6

8

10

120

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Fig. 12. Vessel track data: Matrix of the inverse Hellinger distance (adjacency)between inferred functions for the vessel data. The axis of the matrix correspondsto (in order) the distance between the functions inferred for cargo vessel tracks1 to 4, fishing vessel tracks 5 to 8 and sailing vessel tracks 9 to 12. Distanceswhich are very close to one another have a value close to 1 and further apartclose to zero, this is as represented via the colour bar to the right of the Figure.

24 M. Smith et al

5

1

2

3

4

6 7

11

10 9

8

12

Fig. 13. Network diagram illustrating the relationship between different vesseltypes. Each edge connection is weighted (illustrated by the varying edge thick-ness), the weights being determined by the inverse Hellinger distance betweenthem. Nodes 1 to 12 relate to the different classes of vessel, cargo (1-4) Figure9, fishing (5-8) Figure 10 and sailing (9-12) Figure 11. Different colours relateto the different communities within the data, and the membership score of eachnode per community is defined by the pie-chart. The black nodes are unassignedand do not belong to a given community.

such as that of the MS Costa Concordia early in 2012. Kernel-regression basedprediction of the sample size can be readily extended using Poisson processes.

10. Acknowledgements

This work was funded by ISSG, Babcock Marine & Technology Division, De-vonport Royal Dockyard. Ioannis Psorakis is funded from a grant via MicrosoftResearch, for which we are most grateful. This work was further supported byEPSRC project EP/I011587/1.

References

[1] Basseville M (1989) Distance measures for signal processing and pattern recognition. SignalProcessing 18(4): pp 349-369

[2] Budzynski R, Kondracki W and Krolak A (2008) Applications of distance between prob-ability distributions to gravitational wave data analysis. Classical and Quantum Gravity25(1): 015005

[3] Coles S (2001) An Introduction to Statistical Modelling of Extreme Values. Springer, UK,pp 45-73

[4] George J, Crassidis J, Singh T et al (2011) Anomaly Detection using Content-Aided TargetTracking. Journal of Advances in Information Fusion 6(1): pp 39-56

[5] Grubbs F (1969) Procedures for detecting outlying observations in samples. Technometrics11(1): pp 1-21

[6] Hartikainen J and Sarkka S (2010) Kalman Filtering and Smoothing Solutions to Tempo-


ral Gaussian Process Regression Models. Proceedings of IEEE International Workshop onMachine Learning for Signal Processing (MLSP), Kittila, Finland, pp 379-384

[7] Lane R, Nevell D, Hayward S et al (2010) Maritime anomaly detection and threat assess-ment. Proceedings of 13th Conference on Information Fusion (FUSION), Edinburgh, UK,pp 1-8

[8] Laws K, Vesecky J and Paduan J (2011) Monitoring coastal vessels for environmentalapplications: Application of Kalman filtering. Proceedings of 10th Current, Waves andTurbulence Measurements (CWTM), Monterey, California, USA, pp 39-46

[9] Laxhammar R (2008) Anomaly detection for sea surveillance. Proceedings of 11th Interna-tional Conference on Information Fusion, Cologne, Germany, pp 1-8

[10]Laxhammar R, Falkman G and Sviestins E (2009) Anomaly detection in sea traffic - acomparison of the Gaussian Mixture Model and the Kernel Density Estimator. Proceedingsof 12th International Conference on Information Fusion, Seattle, Washington, USA, pp 756-763

[11]Lee H and Roberts S (2008) On-Line Novelty Detection Using the Kalman Filter and Ex-treme Value Theory. Proceedings of 19th International Conference on Pattern Recognition,Tampa, Florida, USA, pp 1-4

[12]Li X, Han J and Kim S (2006) Motion-Alert: Automatic Anomaly Detection in MassiveMoving Objects. Proceedings of IEEE Intelligence and Security Informatics, San Diego,California, USA, pp 166-177

[13]Markou M and Singh S (2003) Novelty Detection: A Review - Part 1: Statistical Approaches.Signal Processing 83(12): pp 2481-2497

[14]Mascaro S, Nicholson A and Korb K (2011) Anomaly Detection in Vessel Tracks usingBayesian networks. Proceedings of Eighth UAI Bayesian Modeling Applications Workshop,Barcelona, Spain, pp 99-107

[15]Miller S, Miller W and McWhorter P (1992) Extremal dynamics: A unifying physicalexplanation of fractals, 1/f noise, and activated processes. Journal of Applied Physics 73(6):pp 2617-2628

[16]Osborne M (2010) Bayesian Gaussian Processes for Sequential Prediction, Optimisationand Quadrature. University of Oxford, UK, pp 49-54 and pp 79-90

[17]Pinheiro J and Bates D (1996) Unconstrained parameterizations for variance-covariancematrices. Statistics and Computing 6(3): pp 289-296

[18]Porter M, Onnela J and Mucha P (2009) Communities in Networks. Notices of the AmericanMathematical Society 56(9): pp 1082-1097

[19]Psorakis I, Roberts S, Ebden M et al (2011) Overlapping community detection usingBayesian non-negative matrix factorization. Physical Review E 83(6): 066114

[20]Psorakis I, Rezek I, Roberts S et al (2012) Inferring social network structure in ecologicalsystems from spatio-temporal data streams. Journal of The Royal Society Interface 9(76):pp 3055-3066

[21]Rasmussen C and Williams C (2006) Gaussian Processes for Machine Learning. the MITPress, USA, pp 25-50

[22]Reece S and Roberts S (2010) The Near Constant Acceleration Gaussian Process Kernelfor Tracking. IEEE Signal Processing Letters 17(8): pp 707-710

[23]Reece S, Mann R, Rezek I et al (2011) Gaussian Process Segmentation of Co-MovingAnimals. Proceedings of AIP Conference Proceedings, Chamonix, France, pp 430-437

[24]Rhodes B, Bomberger N, Seibert M et al (2005) Maritime situation monitoring and aware-ness using learning mechanisms. Proceedings of Military Communications Conference, At-lantic City, New Jersey, USA, pp 646-652

[25]Roberts S (2000) Extreme value statistics for novelty detection in biomedical signal pro-cessing. Proceedings of First International Conference on Advances in Medical Signal andInformation Processing, University of Bristol, UK, pp 166-172

[26]Simpson E, Roberts S, Psorakis I et al (2013) Dynamic Bayesian Combination of MultipleImperfect Classifiers, In: Guy T, Karny M and Wolpert D (eds), Decision Making andImperfection. Springer, New York, USA, pp 1-35

[27]Will J, Peel L and Claxton C (2011) Fast Maritime Anomaly Detection using Kd-TreeGaussian Processes. Proceedings of IMA Maths in Defence Conference, Swindon, UK

26 M. Smith et al

Derivation of the Hellinger metric, Equation 15The squared Hellinger distance is a measure of similarity between two prob-

ability distributions and is defined as

h2 (f (x) , g (x)) =1

2

∫ (√f (x)−

√g (x)

)2dx, (16)

where f (x) and g (x) denote probability distributions. Equation 16 can alterna-tively be expressed as

h2 (f (x) , g (x)) =1−∫ (√

f (x)√g (x)

)dx.

In the instance distributions f (x) and g (x) are multivariate Gaussian, theHellinger distance would take the form

h2 (f (x;µf ,Σf ) , g (x;µg,Σg)) =

1− 1

(2π)n2√|Σf |

1

(2π)n2√|Σg|∫ √

exp

(−1

2(x− µf )

>Σf−1 (x− µf )

)√

exp

(1

2(x− µg)

>Σg−1 (x− µg)

)dx,

The terms inside the exponents can also be combined and expressed in quadraticform, (x− µ∗)>C−1(x− µ∗) +B, by making the following associations

µ∗ =(Σf−1 + Σg

−1)−1 (Σf−1µf + Σg

−1µg

),

C−1 =1

2Σf−1 +

1

2Σg−1,

B = (µf − µg)>

(Σg + Σf )−1 1

2(µf − µg) .

The integral can now be solved and the expression simplified

h2 (f (x) , g (x))

= 1−| 12Σf

−1 + 12Σg

−1|− 12

|Σf |14 |Σg|

14

exp

(−1

4(µf − µg)

>(Σg + Σf )

−1(µf − µg)

).

Under the assumption that both distributions have the same zero mean, µf =µg = 0, this can be further simplified


h2 (f (x;µf = 0,Σf ) , g (x;µg = 0,Σg))

= 1−| 12Σf

−1 + 12Σg

−1|− 12

|Σf |14 |Σg|

14

To avoid the inverse covariances in this form, the fraction can be multiplied

top and bottom by (|Σf ||Σg|)−12 , in addition to application of the determinant

identity |AB| = |A||B|, thus

h2 (f (x;µf = 0,Σf ) , g (x;µg = 0,Σg))

= 1−| 12Σf + 1

2Σg|−12

|Σf |−14 |Σg|−

14

= 1−√

2|Σf

14 + Σg

14 |

|Σf + Σg|12

It can be noted, as a means of verifying the result, that by setting Σ = σ2, theform is consistent with the Hellinger distance between two univariate Gaussiandistributions when µf = µg = 0 namely

1−√

2σfσgσ2f + σ2

g

exp

− (µf − µg)2

4(σ2f + σ2

g

) .

Author Biographies

Mark Smith received the B.Eng degree (First class honours) from theUniversity of Wales Bangor, UK, in 2007. He currently works for ISSG,Babcock International Group and is concluding an M.Sc by Researchat the Department of Engineering science, University of Oxford, UK.His research interests include embedded development, data mining,and machine learning.

Steven Reece received the mathematics and physics B.Sc (Hons)from Liverpool University, UK in 1988, part III of the mathematicstripos and the computer science diploma from Cambridge University in1989 and 1990 respectively and the DPhil in engineering science fromOxford University in 1998. He has published over 50 papers in ArtificialIntelligence and machine learning and has six patents. He is a researchfellow in the Engineering Department at Oxford University. His currentresearch interests are in data fusion and multi-sensor systems.

28 M. Smith et al

Stephen Roberts is Professor of Machine Learning in the Depart-ment of Engineering Science, University of Oxford. He studied Physicsat the University of Oxford (1987) and after a period of industrial re-search he returned to Oxford and was awarded his PhD in 1991. He wasappointed to the faculty of Imperial College, London in 1994 and tookup his present position in 1999. His main research interests lie in theapplication and development of mathematical methods in data analy-sis and data-driven machine learning, in particular statistical learningand inference and their application to complex problems in heteroge-neous information fusion. His early contributions to the field includeBayesian models for real-time data modelling and signal processing.More recent research has focused on non-parametric Bayesian mod-els for multi-sensor data fusion, global optimisation, complex systems,game theory and network analysis. Particular emphasis is placed onthe real-world applications of advanced theory and over many yearshe has applied these statistical methods to diverse problems in astro-physics, biology, finance and engineering. He has some 230 publicationsand holds eight patents.

Ioannis Psorakis is currently concluding his D.Phil degree at theUniversity of Oxford, where he attended as a Microsoft Research PhDScholar. He holds a B.Eng from the Technical University of Crete(Greece) and an M.Sc (with Distinction) in Information Technologyfrom the University of Glasgow. His research interests include socialnetwork analysis, data mining and scalable machine learning.

Iead Rezek is a research scientist at Schlumberger Research, Cam-bridge, U.K. He received the Dipl-Ing (FH) degree from the Fach-hochschule Ulm, Germany, and the BSc from the University of Ply-mouth, U.K. in 1991, both in electrical engineering. He received hisMSc in physical sciences and engineering in medicine in 1992, andhis PhD in information theory applied to biomedical data analysis in1997, both from Imperial College, London, U.K. In 2000, he joinedthe University of Oxford in 2000 as postdoctoral research assistantand subsequently held lectureship positions at Aston University andImperial College London. His main research interests lie in modellingthe interactions in human and biological networks. He is a memberof the Royal Statistical Society and a member of the RSS statisticalcomputing committee.

Correspondence and offprint requests to: Mark Smith, Devonport Royal Dockyard, Plymouth,

Devon, PL1 4SG, UK. Email: [email protected]

Documents

Maritime Abnormality Detection using Gaussian Processessjrob/Pubs/KAIS_Marine-Abnormality.pdfMaritime Abnormality Detection using Gaussian Processes 3 Assessing the performance of