
16 Software for Probability Models in Medical Informatics

Richard Dybowski

InferSpace, 143 Village Way, Pinner HA5 5AA, UK. [email protected]

Summary. The purpose of this chapter is to make the reader aware of some of the software packages available that can implement probability models connected with medical informatics. The modelling techniques considered are logistic regression, neural networks, Bayesian networks, class-probability trees, and hidden Markov models.

16.1 Introduction

Several developments over the past 10 years have impacted significantly on software for probabilistic models. Three of these are the substantial advances made in computing technology, the explosive growth of the Internet, and the rise of open-source software.

The 45-fold increase in CPU speed witnessed during the 1990s (in line with Moore's Law, which states that the computer industry doubles the power of microprocessors every 18 months) has enabled a number of computationally-intensive techniques to become readily available to the data analyst. These techniques include optimizations (such as parameter estimation and evolutionary computation [19, 16]), Monte Carlo simulations (particularly bootstrap methods [18, 15] and MCMC sampling [23]), and advanced computer graphics for data visualization [34, 9].

The Internet has enabled those with common interests to communicate with each other via discussion groups. The existence of these virtual forums for software packages allows the users of those packages to readily provide technical support and offer new developments. Furthermore, through the World Wide Web, a number of free, peer-reviewed, online journals have appeared, such as the Journal of Statistical Software and the Journal of Machine Learning Research. In addition, several conferences, such as Uncertainty in Artificial Intelligence and Neural Information Processing Systems, have made their proceedings available online at no charge.

The Internet has also been crucial to the creation and evolution of the open-source movement, which we now describe.



16.2 Open-source Software

The usual economics of software development is that a business organization produces an item of computer software and sells it to those wishing to have it, with the buyers of the product having no control over its development. However, during the 1990s, an alternative economic model for software development emerged, called Open Source [37, 2].

The basic principle of open-source software development is that the human-readable computer program underlying the software (the source code) is available to everyone. This is in stark contrast to the situation with proprietary software, in which only the company producing the software can modify it. Because of the availability of the source code, a large syndicate of interested users can make the improvements they wish to see in the software. The proposed improvements are vetted by a hierarchically coordinated group of volunteer programmers and incorporated in the next release of the software [36, 20]. Thus, a subset of users sorts out known problems and eventually gets the functionality they require. The most famous example of this approach to software development is the success of the Linux operating system [52].

An important device in maintaining the open-source status of software is the use of open-source licences [42], such as the GNU General Public License (GPL) [47]. This legally-binding document ensures that any item of software covered by it remains open source. The license allows people to modify the source code of GPL software, but any modified software they distribute is also covered by the GPL. A person may distribute GPL software for free or sell it for profit, but they cannot sell it under a restrictive license, for that would inhibit development of the software within the open-source framework. Other open-source licences include the Artistic Licence, the Apache License, and the Python License.

We will bring to the reader's attention open-source software relevant to the theme of this chapter.

Cautionary Note:

Most public-domain software (open-source, freeware, and shareware) has not been rigorously tested; therefore, it is more likely to contain programming bugs than commercially-developed software.

16.3 Logistic Regression Models

[Logistic regression was featured in Sections 3.3 and 10.4.]

There are many statistics packages competing with each other, ranging from the more advanced packages, such as S-Plus, designed for professional statisticians, to the more user-friendly packages, such as Data Desk, for non-specialists. All the major commercial statistics packages – such as Genstat, GLIM, SAS, S-Plus, SPSS, and Statistica – provide some means of fitting logistic models to data, although there is


some package-to-package variation in the functionality and diagnostics available. (Updated links to software and discussion groups featured in this chapter are available from http://robots.ox.ac.uk/~parg/pmbmi.html. Comparisons between a number of mathematical and statistical packages – such as Gauss, Mathematica, Matlab, and S-Plus – are available at the Scientific Web website: http://www.scientificweb.com/ncrunch/.) This is not surprising given that some packages were designed, at least initially, for specific types of users. For example, SPSS was designed for social scientists, whereas GLIM was designed for those requiring generalized linear models.

Collett [12, Chap. 9] compared the logistic regression capabilities of six packages; however, given that his review was written in 1991, some of his comments may no longer apply.

16.3.1 S-Plus and R

S-Plus is one of the most sophisticated statistics packages available [10, 49, 50] (http://www.insightful.com/). It is based on the S language [3], and is supported by an active discussion group. Harrell [25] has contributed substantially to the S-Plus code pertaining to regression. This includes routines for contemporary statistical techniques such as bootstrapping for model validation. The GLIB S-Plus package by Raftery and Volinsky [40] aids the use of model-averaged logistic regression (Section 10.5).

Example 1. The logistic regression model

logit(p(Kyphosis = 1|Age, Number, Start)) = β0 + β1Age + β2Number + β3Start

can be fitted to the kyphosis data set [10] using the S-Plus code

kyph.glm <- glm(Kyphosis ~ Age + Number + Start,
                family = binomial, data = kyphosis)

where glm is the generalized linear modelling function. The regression coefficients and associated standard errors can be viewed using the command summary(kyph.glm):

Coefficients:
                  Value  Std. Error   t value
(Intercept) -2.03693225  1.44918287 -1.405573
Age          0.01093048  0.00644419  1.696175
Number       0.41060098  0.22478659  1.826626
Start       -0.20651000  0.06768504 -3.051043

Given a new vector of values xnew for Age, Number and Start, the command

predict(kyph.glm, newdata = xnew, type = "response")

provides the estimated probability p(Kyphosis = 1 | xnew) from the regression model. □


In our opinion, the best open-source statistics package is R [28], a language and environment for statistical computing and graphics (http://www.r-project.org/). R is very similar to S-Plus, though not identical to it, and much code written for S-Plus can be run unchanged within the R environment; for example, the glm and predict commands of Example 1 are also available in R, as the sketch below illustrates. One advantage of R over S-Plus is that its graphics capability is generally superior.
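As a brief illustration (ours, not from the original text): in R, the kyphosis data frame is distributed with the rpart package rather than being built in, so the following sketch assumes that package is installed; otherwise it mirrors Example 1.

library(rpart)   # provides the kyphosis data frame in R
data(kyphosis)

# Fit the logistic regression model of Example 1
kyph.glm <- glm(Kyphosis ~ Age + Number + Start,
                family = binomial, data = kyphosis)
summary(kyph.glm)   # coefficients and standard errors

# Estimated p(Kyphosis = 1 | xnew) for a hypothetical new case
xnew <- data.frame(Age = 60, Number = 4, Start = 10)
predict(kyph.glm, newdata = xnew, type = "response")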

The evolution of R (the R Project) is managed by a group of coordinated teams. There are a number of discussion groups associated with these teams, including a highly effective support forum. There is also a very good online journal for the R community called R News.

16.3.2 BUGS

[MCMC methods were featured in Chapters 2 and 10.]

BUGS (Bayesian inference Using Gibbs Sampling) [45] generates the necessary code to perform MCMC sampling from a model specification supplied by a user (http://www.mrc-bsu.cam.ac.uk/bugs/). The syntax is an extension of the S language.

The original format (Classic BUGS) has a command-line interface that provides

univariate Gibbs sampling and a simple Metropolis-within-Gibbs routine. A more recent version (WinBUGS) provides a GUI for use with Windows, and it has a more sophisticated univariate Metropolis sampler. Although there is, as yet, no version of WinBUGS for the Linux or Unix platforms, WinBUGS can be run on these platforms by using an emulator such as Wine.

A wide range of S-Plus and R routines (CODA) supplements BUGS by providing diagnostics and plots for MCMC analysis and convergence checks.
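As a sketch of how such checks look in R (ours, assuming the coda package is installed and that BUGS has written output and index files under hypothetical names), one might proceed as follows:

library(coda)

# Read a chain written by Classic BUGS (file names are hypothetical)
chain <- read.coda("bugs1.out", "bugs1.ind")

summary(chain)         # posterior means, SDs, and quantiles
traceplot(chain)       # visual check of mixing
autocorr.plot(chain)   # within-chain autocorrelation
geweke.diag(chain)     # convergence diagnostic for each parameter
effectiveSize(chain)   # effective number of independent samples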

Example 2. In this example, adapted from Spiegelhalter et al. [44], WinBUGS is used to estimate the regression coefficients of a random-effects logistic regression model that allows for over-dispersion. The data consist of N plates for which, in the ith plate, there were ri positive outcomes out of ni units. The hierarchical structure of the model is

ri ~ Binomial(pi, ni)
logit(pi) = β0 + β1 x1,i + β2 x2,i + β12 x1,i x2,i + bi
β0, β1, β2, β12 ~ Normal(0, 10^6)
bi ~ Normal(0, τ^−1)
τ ~ Gamma(10^−3, 10^−3)

for which the corresponding BUGS code can be written as

model {
  for (i in 1:N) {
    r[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- beta0 + beta1*x1[i] + beta2*x2[i] +
                   beta12*x1[i]*x2[i] + b[i]
    b[i] ~ dnorm(0.0, tau)
  }
  beta0  ~ dnorm(0.0, 1.0E-6)
  beta1  ~ dnorm(0.0, 1.0E-6)
  beta2  ~ dnorm(0.0, 1.0E-6)
  beta12 ~ dnorm(0.0, 1.0E-6)
  tau    ~ dgamma(0.001, 0.001)
}

Following a “burn-in” of 1000 samples, 9000 samples from the posterior distribution of the parameters provided the following estimates:

                  β        SE
(intercept)  −0.5496   0.1927
x1            0.0772   0.307
x2            1.356    0.2773
x1x2         −0.823    0.4205

□

Several specialized versions of BUGS have been developed for specific domains. One of these is PKBugs, which was designed to provide hierarchical pharmacokinetic and pharmacodynamic models. Further details of PKBugs, along with an example, are given in Chapter 11.

Another MCMC package is Hydra [51], a suite of Java libraries. Although it does provide a range of MCMC samplers, its use requires some familiarity with the Java language.

16.4 Neural Networks

[Neural networks were featured in Chapters 3 and 12, and Section 10.6.]

There are at least 35 commercial packages and 47 freeware/shareware packages for neural computation. James [29] provides a review of some of the commercial packages, and a tabular comparison of 12 packages is given in Table 16.1. Some of the major mathematical packages, such as Mathematica and Matlab, have neural-network toolboxes designed specifically for them. This is also true of some of the statistical packages, including S-Plus and Statistica; however, in our opinion, the toolbox with the most functionality is an open-source package called Netlab.

16.4.1 Netlab

Netlab was developed to accompany the seminal book Neural Networks for Pattern Recognition by Chris Bishop [4], and, in our experience, it has the best functionality of any neural-network package (http://www.ncrg.aston.ac.uk/netlab/).

Matlab [27] is a powerful commercial software package for performing technical computations (http://www.mathworks.com/), and Netlab is a collection of open-source routines designed to be


executed within the Matlab environment. In addition to being open source, Netlab contains a powerful collection of routines for neural computation, including routines for Bayesian computations, latent-variable models (such as Generative Topographic Mapping [48]), and Gaussian processes [22]. Nabney [33] has written an excellent textbook and manual for Netlab, which provides many worked examples based on Bishop's book.

Example 3. The topology of a classification multilayer perceptron (MLP) with a single hidden layer is defined by the number of input nodes nin for the feature vectors x, the number of hidden nodes nhidden, and the number of output nodes nout for the conditional class probabilities p(y = k|x). With these structural values, an MLP can be created in Matlab by using the Netlab mlp routine; for example,

nin = 4; nhidden = 6; nout = 1;
alpha = 0.1;   % weight-decay coefficient
net = mlp(nin, nhidden, nout, 'logistic', alpha);

The string 'logistic' specifies that the logistic function is to be used for the output-node activation function.

The mlp routine initializes the network weights to random values; however, the command net = mlpinit(net, prior) can be used to randomly select the weights from a zero-mean Gaussian distribution with covariance 1/prior.

If a data set consists of input-target pairs (xi, yi), the MLP can now be trained for, say, 1000 cycles using the quasi-Newton optimization algorithm:

options = foptions;   % default algorithm options
options(14) = 1000;   % number of training cycles
[nnet, options] = netopt(net, options, xdata, ydata, 'quasinew');

where xdata is the matrix of xi values, and ydata is the vector of the yi values associated with xdata. Forward propagation of a new feature vector xnew through the trained network is performed with

cpd = mlpfwd(nnet, xnew);

which estimates the conditional probability distribution p(y|xnew). □
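For readers without Matlab, a roughly analogous model can be fitted with R's nnet package. This is our illustrative substitution, not part of Netlab, and the data objects are hypothetical:

library(nnet)

# xdata: matrix of feature vectors; ydata: vector of 0/1 targets
# (hypothetical objects). size plays the role of nhidden, decay the
# role of alpha, and entropy = TRUE gives a logistic output unit
# trained by cross-entropy, so the output estimates p(y = 1 | x).
net <- nnet(xdata, ydata, size = 6, decay = 0.1,
            entropy = TRUE, maxit = 1000)

# xnew: matrix with one row per new feature vector (hypothetical)
p.new <- predict(net, xnew)   # estimated p(y = 1 | xnew)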

16.4.2 The Stuttgart Neural Network Simulator

A limitation of Netlab is that it does not provide networks specifically for the construction of temporal neural networks. In contrast, the Stuttgart Neural Network Simulator (SNNS) supports time-delay neural networks (TDNNs), Jordan networks, Elman networks, and extended hierarchical Elman networks (http://www-ra.informatik.uni-tuebingen.de/SNNS/).

16.5 Bayesian Networks

[Bayesian networks were featured in Chapters 2 and 4 and Section 10.9.]

As a result of the large interest in Bayesian networks (BNs), many software packages have been designed to support BN development. We are aware of at least eight


commercial and 30 academic BN packages, with at least 20 of the latter providing source code. Kevin Murphy has compiled an extensive table that summarizes the features of 37 BN packages; a modified version of this table is shown in Table 16.2. Korb and Nicholson [30] have made a detailed comparison of 12 packages, including Bayes Net Toolbox, Hugin, Bayesware Discoverer, and Tetrad.

Table 16.1. Summary of some academic and commercial software for neural-network development. The “Code” column states whether the source code is available (N ≡ no); if it is, the language used is given. The “GUI” column states whether a GUI is available (Y ≡ yes). The “RBF” column states whether radial basis function networks can be developed. The “t” column states whether temporal (dynamic) networks can be developed beyond the use of delay vectors. The “B” column states whether Bayesian neural techniques can be used. The “Free” column states whether the software is free (R ≡ only a restricted version is free).

Package                             Developer                       Code         GUI  RBF  t    B     Free
BrainMaker                          California Scientific Software  N            Y    N    N    N     N
Mathematica Neural Networks         Wolfram Research                Mathematica  N    Y    Y    N     N
Matlab Neural Network Toolbox [24]  The MathWorks                   Matlab       Y    Y    Y    Y^a   N
Netlab [33]                         I.T. Nabney et al.              Matlab       N    Y    N    Y^b   Y
NeuralWorks                         NeuralWare                      N            Y    Y    Y    N     R
NeuroSolutions [38]                 NeuroDimension                  N            Y    Y    Y    N     R
NuRho                               soNit.biz                       N            Y    N    N    N     R
PDP++ [35]                          R.C. O’Reilly et al.            C++          Y    N    Y    N     Y
SNNS                                University of Stuttgart         C            Y    Y    Y    N     Y
Statistica Neural Networks          StatSoft                        N            Y    Y    N    N     R
ThinksPro                           Logical Designs Consulting      N            Y    Y    Y    N     R
Tiberius                            P. Brierley                     VB/Excel^c   Y    N    N    N     R

^a Bayesian regularization.
^b Includes the evidence framework and MCMC-based approximation.
^c Backpropagation code also available in Fortran 90 and Java.


Table 16.2. Summary of some academic and commercial software for BN development (adapted from Murphy [32], reprinted with permission). The “Code” column states whether the source code is available (N ≡ no); if it is, the language used is given. The “CVN” column states whether continuous-valued nodes can be accommodated (N ≡ restricted to discrete-valued nodes; Y ≡ yes, without discretization; D ≡ yes, but requires discretization). The “GUI” column states whether a GUI is available. The “θ” column states whether parameter learning is possible. The “G” column states whether structure learning is possible. The “Free” column states whether the software is free (R ≡ only a restricted version is free).

Package                Developer               Code       CVN   GUI  θ    G     Free
BayesBuilder           SNN Nijmegen            N          N     Y    N    N     R
BayesiaLab             Bayesia                 N          D     Y    Y    Y     R
Bayesware Discoverer   Bayesware               N          D     Y    Y^a  Y^a   R
BN PowerConstructor    J. Cheng                N          N     Y    Y    Y^b   Y
BNT [32]               K. Murphy               Matlab/C   Y     N    Y    Y     Y
BNJ                    W.H. Hsu et al.         Java       N     Y    N    Y     Y
BUGS [45]              MRC/Imperial College    N          Y     Y    Y    N     Y
CoCo [1]^c             J.H. Badsberg           C/Lisp     N     Y    Y    Y     Y
Deal [6]               S.G. Bøttcher et al.    R          Y^d   Y    Y    Y     Y
GDAGsim [53]           D. Wilkinson            C          Y^e   N    N    N     Y
GRAPPA                 P.J. Green              R          N     N    N    N     Y
Hugin                  Hugin Expert            N          Y     Y    Y    Y     R
Hydra [51]             G. Warnes               Java       Y     Y    Y    N     Y
JavaBayes [14]         F.G. Cozman             Java       N     Y    N    N     Y
MIM [17]^f             Hypergraph Software     Y          Y     Y    Y    Y     R
MSBNx [26]             Microsoft               N          N     Y    N    N     R
Netica                 Norsys Software         N          Y     Y    Y    N     R
Tetrad [43]            P. Spirtes et al.       N          Y     N    Y    Y     Y
WebWeaver              Y. Xiang                Java       N     Y    N    N     Y

^a Uses the “bound and collapse” algorithm [41] to learn from incomplete data.
^b Uses Cheng’s three-phase construction algorithm [11].
^c Analyzes associations between discrete variables of large, complete, contingency tables.
^d Restricted to conditional Gaussian BNs.
^e Restricted to Gaussian BNs.
^f Provides graphical modelling for undirected graphs and chain graphs as well as DAGs.


16.5.1 Hugin and Netica

The best-known commercial packages for BN development are Hugin and Netica, both of which also support the development of influence diagrams.

Hugin (Hugin Expert, http://www.hugin.com/) has an easy-to-use graphical user interface (GUI) for BN construction and inference (Figure 16.1). It supports the learning of both BN parameters and BN structures from (possibly incomplete) data sets of sample cases; the structure learning is done via the PC algorithm [46]. APIs (application programmer interfaces) are available for C, C++ and Java, and an ActiveX server is provided; these enable the inference engine to be used within other programs. Hugin is compatible with the Windows, Solaris and Linux platforms.

Like Hugin, Netica (Norsys Software, http://www.norsys.com/) supports BN construction and inference through an advanced GUI. It has broad platform support (Windows, Linux, Sun Sparc, Macintosh, Silicon Graphics, and DOS), and APIs are available for C, C++, Java, and Visual Basic. Netica enables parameters (but not structures) to be estimated from (possibly incomplete) data. Although the functionality of Netica is less than that of Hugin, it is considerably less expensive.

16.5.2 The Bayes Net Toolbox

In 1997, Kevin Murphy started to develop the Bayes Net Toolbox (BNT) [32] in response to weaknesses of the BN systems available at the time.

BNT is an open-source collection of Matlab routines for BN (and influence diagram) construction and inference, including dynamic probability networks (http://www.ai.mit.edu/~murphyk/Software/BNT/bnt.html). It allows a wide variety of probability distributions to be used at the nodes (e.g., multinomial, Gaussian, and MLP), and both exact and approximate inference methods are available (e.g., junction tree, variable elimination, and MCMC sampling).

Both parameter and structure estimation from (possibly incomplete) data are supported. Structures can be learnt from data by means of the K2 [13] and IC/PC [46] algorithms. When data are incomplete, the structural EM algorithm can be used (see Sections 2.3.6 and 4.4.5).

Example 4. In BNT, a directed acyclic graph (DAG) is specified by a binary-valued adjacency matrix {ei,j}, where ei,j = 1 if a directed edge goes from node i to node j. For the DAG shown in Figure 16.1(a), this matrix is obtained by

N = 4;                        % Number of nodes
dag = zeros(N,N);             % Initially no edges
C = 1; S = 2; R = 3; W = 4;   % IDs for the four nodes
dag(C,[R S]) = 1; dag(R,W) = 1; dag(S,W) = 1;   % Edges defined

Next, the types of the nodes to be used for the BN are defined:

discrete_nodes = 1:N;       % All nodes are discrete-valued
node_sizes = 2*ones(1,N);   % All nodes are binary
bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);



Fig. 16.1. (a) A DAG for the classic “wet grass” scenario, with edges Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, and Rain → WetGrass. (b) A Hugin rendering of this DAG. The histograms show the probability distributions at each node X, which, initially, are the prior probabilities p(X). (c) Cloudy = true; consequently, the probabilities are updated to the posterior distributions p(X|Cloudy = true). (d) Cloudy = true and Rain = false; therefore, the probability distributions are updated to p(X|Cloudy = true, Rain = false).

(In BNT, false = 1 and true = 2.) The BN definition is completed by defining the conditional probability distribution at each node. For this example, binomial distributions are used, and the values are entered manually. If V1, . . . , Vn are the parents of node X, with sizes |V1|, . . . , |Vn|, the conditional probability distribution p(X|V1, . . . , Vn) for node X can be defined as follows:

CPT = zeros(|V1|, ..., |Vn|, |X|);
CPT(v1, ..., vn, x) = p(X = x | V1 = v1, ..., Vn = vn);
    ... repeated for each p(x | v1, ..., vn)
bnet.CPD{X} = tabular_CPD(bnet, X, 'CPT', CPT);

We can perform inferences with this BN. To enter the evidence that Cloudy = true and find the updated probability p(WetGrass = true|Cloudy = true) via the junction-tree algorithm, we can use


evidence = cell(1,N); evidence{C} = 2;   % Cloudy = true
engine = jtree_inf_engine(bnet);         % Use the junction-tree algorithm
[engine, loglikelihood] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, W);        % p(W | Cloudy = true)
prob = marg.T(2);                        % p(W = 2 | Cloudy = true)

An alternative to the above manual approach is to let BNT learn the probabilities from available data. To obtain the maximum-likelihood estimates of the probabilities from a complete data set data, we can first initialize the probabilities to random values,

seed = 0; rand('state', seed);
bnet.CPD{C} = tabular_CPD(bnet, C, 'CPT', 'rnd');
bnet.CPD{R} = tabular_CPD(bnet, R, 'CPT', 'rnd');
bnet.CPD{S} = tabular_CPD(bnet, S, 'CPT', 'rnd');
bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', 'rnd');

and then apply

bnet = learn_params(bnet, data);

There are some limitations to BNT. Firstly, it does not have a GUI. Secondly, BNT requires Matlab to run it; it would be better if a version of BNT were developed that is independent of any commercial software. Furthermore, Matlab is a suboptimal language in that it is slow compared with C, and its object structure is less advanced than that of Java or C++. The desire to overcome these drawbacks was the motivation behind the OpenBayes initiative.

16.5.3 The OpenBayes Initiative

Although there are a number of software packages available for constructing and computing with graphical models, no single package contains all the features that one would like to see, and most of the commercial packages are expensive. Therefore, in order to have a package that contains the features desired by the BN community, InferSpace launched the OpenBayes initiative in January 2001, the aim of which was to prompt the building of an open-source software environment for graphical-model development.

There have been other BN-oriented open-source initiatives, such as Fabio Cozman's JavaBayes system.

16.5.4 The Probabilistic Networks Library

In late 2001, Intel began to develop an open-source C++ library called the Probabilistic Networks Library (PNL), which initially closely modelled the BNT package (http://www.ai.mit.edu/~murphyk/Software/PNL/pnl.html). PNL has been available to the public since December 2003.


16.5.5 The gR Project

In September 2002, the gR project was conceived, the purpose of which is to develop facilities in R for graphical modelling [31] (http://www.r-project.org/gR/gR.html). The project is being managed by Aalborg University.

The software associated with the gR project includes (i) Deal [6], for learning conditionally Gaussian networks in R, (ii) mimR, an interface from R to MIM (which provides graphical modelling for undirected graphs, DAGs, and chain graphs [17]), and (iii) an R port of CoCo [1], which analyzes associations within contingency tables. The ability of R (and S-Plus) to interface with programs written in C++ means that Intel's PNL could become a powerful part of the gR project.

16.5.6 The VIBES Project

The use of variational methods for approximate reasoning in place of MCMC sampling is gaining interest (Chapter 14). In a joint project between Cambridge University and Microsoft Research, a system called VIBES (Variational Inference for Bayesian Networks) [5] is being developed that will allow variational inference to be performed automatically on a BN specified through a GUI.

16.6 Class-probability trees

[Class-probability trees were featured in Section 10.10.]

Two of the original tree-induction packages are C4.5 [39] and CART [7]. C4.5 has been superseded by C5.0, which, like its Windows counterpart See5, is a commercial product developed by RuleQuest Research (http://www.rulequest.com/).

CART introduced the concept of surrogate splits to enable trees to handle missing data (Section 10.10). It is available as a commercial package from Salford Systems (http://www.salford-systems.com/).

Class-probability trees can also be created by several statistical packages. These facilities include the S-Plus tree function and the R rpart function. An advantage of the rpart function is that it can use surrogate variables in a manner closely resembling that proposed by Breiman et al. [7].

Example 5. The R function rpart can grow a tree from the kyphosis data used in Example 1:

kyph.tree <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis, parms = list(split = 'gini'))

In this example, the Gini index of class heterogeneity has been used; the information-theoretic entropy measure is also available. Figure 16.2 shows a visualization of the tree obtained with

plot(kyph.tree); text(kyph.tree, use.n=TRUE)



A new feature vector of values xnew can be dropped down the tree in order to estimate p(Kyphosis = 1|xnew) from a leaf node. This is done using

predict(kyph.tree, newdata = xnew)

In Section 10.10.2, we described Buntine's approach to Bayesian trees [8]. His ideas are implemented in the IND package, which can now be obtained (along with the source code) from the NASA Ames Research Center under a NASA software usage agreement (http://ic.arc.nasa.gov/projects/bayes-group/ind/IND-program.html).

Fig. 16.2. Plot obtained when the R function rpart was used to grow a tree (Example 5). The splits are, from the root down, Start >= 8.5; then Start >= 14.5 and Age < 55; then Age >= 111. The leaf nodes are labelled absent 29/0, absent 12/0, absent 12/2, present 3/4, and present 8/11. At each split, the left branch corresponds to the case when the split criterion is true for a given feature vector, and the right branch to when the criterion is false. Each leaf node is labelled with the associated classification followed by the frequencies of the classes “absent” and “present” at the node (delimited by “/”).

16.7 Hidden Markov Models

[Hidden Markov models were featured in Chapter 14 and Sections 10.11.4, 2.2.2, 4.4.7 and 5.10.]

A hidden Markov model (HMM) consists of a discrete-valued hidden node S linked to a discrete- or continuous-valued observed node X. Figure 16.3 shows the model “unrolled” over three time steps: τ = 1, 2, 3.

Although the majority of statistical software packages enable classical time-series models, such as ARIMA, to be built, tools for modelling with HMMs are not a standard feature. This is also true of mathematical packages such as Matlab and Mathematica.


Fig. 16.3. A graphical representation of an HMM, unrolled over three time slices: hidden nodes S(1) → S(2) → S(3), with an observed node X(τ) attached to each S(τ). The model is defined by the probability distributions p(S(1)), p(S(τ+1)|S(τ)), and p(X(τ)|S(τ)). The last two distributions are assumed to be the same for all time slices τ ≥ 1.
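Because those defining distributions are simply a prior vector, a transition matrix, and an emission matrix, a discrete HMM is straightforward to simulate. The following base-R sketch (ours, with arbitrary illustrative parameter values) generates a sequence from the model of Figure 16.3:

# Illustrative parameter values (not from the original text)
prior    <- c(0.6, 0.4)                           # p(S(1))
transmat <- matrix(c(0.7, 0.3,
                     0.2, 0.8), 2, byrow = TRUE)  # p(S(t+1) | S(t))
obsmat   <- matrix(c(0.9, 0.1,
                     0.3, 0.7), 2, byrow = TRUE)  # p(X(t) | S(t))

len <- 10                                 # length of the simulated series
s <- integer(len); x <- integer(len)
s[1] <- sample(1:2, 1, prob = prior)
x[1] <- sample(1:2, 1, prob = obsmat[s[1], ])
for (t in 2:len) {
  s[t] <- sample(1:2, 1, prob = transmat[s[t-1], ])  # hidden transition
  x[t] <- sample(1:2, 1, prob = obsmat[s[t], ])      # emission
}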

16.7.1 Hidden Markov Model Toolbox for Matlab

Kevin Murphy has written a toolbox for developing HMMs with Matlab (http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html). Tools for this purpose are also available within his BNT package (Section 16.5.2). We illustrate the HMM toolbox with the following example.

Example 6. Suppose we wish to classify a biomedical time series x = {x(1), x(2), . . . , x(T)} by assigning it to one of K classes (for example, K physiological states). We assume that a time series is generated by an HMM associated with a class, there being a unique set of HMM parameters θk for each class k. The required classification can be done probabilistically by assigning a new time series xnew to the class k for which p(θk|xnew) is maximum.

For each of the K classes of interest, we can train an HMM using a sample (data_k) of time series associated with class k. The standard method is to use the EM algorithm to compute the maximum-likelihood estimate of θk with respect to data_k:

[LL, prior_k, transmat_k, obsmat_k] = ...
    dhmm_em(data_k, prior0, transmat0, obsmat0, 'max_iterations', 10);

where prior0, transmat0, and obsmat0 are initial random values respectively corresponding to the prior probability distribution p(S(1)), the transition probability matrix p(S(τ+1)|S(τ)), and the observation probability matrix p(X(τ)|S(τ)). The resulting MLEs for these probabilities are given by prior_k, transmat_k, and obsmat_k, which collectively provide the estimate of θk.

From this estimate, the log-likelihood log p(x_new|θk) for a new time series x_new can be obtained using

loglik = dhmm_logprob(x_new, prior_k, transmat_k, obsmat_k)

which can be related to p(θk|x_new) by Bayes' theorem. □
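Concretely, the Bayes'-theorem step is just a normalization of the prior-weighted per-class likelihoods. A numerically stable R sketch (ours; logliks is a hypothetical vector holding the K values returned by dhmm_logprob) is:

# Posterior class probabilities from per-class log-likelihoods
post_from_logliks <- function(logliks,
                              prior = rep(1/length(logliks), length(logliks))) {
  a <- logliks + log(prior)   # log of the unnormalized posterior
  a <- a - max(a)             # log-sum-exp trick avoids underflow
  p <- exp(a)
  p / sum(p)                  # p(theta_k | x_new) for each class k
}

# Assign x_new to the most probable class (illustrative values)
k.best <- which.max(post_from_logliks(c(-210.3, -198.7, -205.1)))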

Acknowledgments

We would like to thank Paulo Lisboa and Peter Weller for their careful reading and constructive comments on an earlier draft of this chapter.

We are grateful to Kevin Murphy for allowing us to use data from his website Software Packages for Graphical Models/Bayesian Networks (http://www.ai.mit.edu/~murphyk/Software/BNT/bnsoft.html) for Table 16.2.



References

[1] J.H. Badsberg. An Environment for Graphical Models. PhD dissertation, Department of Mathematical Sciences, Aalborg University, 1995.
[2] J.M.G. Barahona, P.D.H. Quiros, and T. Bollinger. A brief history of free software and open source. IEEE Software, 16(1):32–33, 1999.
[3] R.A. Becker, J.M. Chambers, and A.R. Wilks. The New S Language. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1988.
[4] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[5] C.M. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for Bayesian networks. In Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, 2003.
[6] S.G. Bøttcher and C. Dethlefsen. Deal: A package for learning Bayesian networks. Technical report, Department of Mathematical Sciences, Aalborg University, 2003.
[7] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1984.
[8] W. Buntine. A Theory of Learning Classification Rules. PhD dissertation, School of Computing Science, University of Technology, Sydney, February 1990.
[9] S.K. Card, J.D. Mackinlay, and B. Shneiderman. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, San Francisco, CA, 1999.
[10] J.M. Chambers and T.J. Hastie. Statistical Models in S. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA, 1992.
[11] J. Cheng and R. Greiner. Learning Bayesian belief network classifiers: Algorithms and systems. In E. Stroulia and S. Matwin, editors, Proceedings of the 14th Canadian Conference on Artificial Intelligence, Lecture Notes in Computer Science, pages 141–151, Springer-Verlag, New York, 2001.
[12] D. Collett. Modelling Binary Data. Chapman & Hall, London, 1991.
[13] G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[14] F.G. Cozman. The JavaBayes system. The ISBA Bulletin, 7(4):16–21, 2001.
[15] A.C. Davison and D.V. Hinkley. Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge, 1997.
[16] K.A. De Jong. Evolutionary Computation: A Unified Approach. MIT Press, Cambridge, MA, 2003.
[17] D. Edwards. Introduction to Graphical Modelling. Springer-Verlag, New York, 2nd edition, 2000.
[18] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.
[19] D.B. Fogel. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, New York, 1995.
[20] K. Fogel. Open Source Development with CVS. Coriolis, Scottsdale, AZ, 1999.
[21] N. Friedman. The Bayesian structural EM algorithm. In G.F. Cooper and S. Moral, editors, Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 129–138, Morgan Kaufmann, San Francisco, CA, 1998.
[22] M.N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD dissertation, Department of Computing Science, University of Cambridge, 1997.
[23] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, London, 1996.
[24] M.T. Hagan, H.B. Demuth, and M. Beale. Neural Network Design. PWS Publishing, Boston, 1996.
[25] F.E. Harrell. Regression Modeling Strategies. Springer, New York, 2001.
[26] E. Horvitz, D. Hovel, and C. Kadie. MSBNx: A component-centric toolkit for modeling and inference with Bayesian networks. Technical Report MSR-TR-2001-67, Microsoft Research, Redmond, WA, July 2001.
[27] B.R. Hunt, R.L. Lipsman, and J.M. Rosenberg. A Guide to MATLAB: For Beginners and Experienced Users. Cambridge University Press, Cambridge, 2001.
[28] R. Ihaka and R. Gentleman. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.
[29] H. James. Editorial. Neural Computing and Applications, 5:129–130, 1997.
[30] K.B. Korb and A.E. Nicholson. Bayesian Artificial Intelligence. CRC Press, London, 2003.
[31] S.L. Lauritzen. gRaphical models in R. R News, 3(2):39, 2002.
[32] K.P. Murphy. The Bayes Net Toolbox for Matlab. Computing Science and Statistics, 33:331–350, The Interface Foundation of North America, 2001.
[33] I.T. Nabney. NETLAB: Algorithms for Pattern Recognition. Springer, London, 2002.
[34] G.M. Nielson, H. Hagan, and H. Muller. Scientific Visualization: Overviews, Methodologies, and Techniques. IEEE Computer Society, Los Alamitos, CA, 1997.
[35] R.C. O’Reilly and Y. Munakata. Computational Explorations in Cognitive Neuroscience. MIT Press, Cambridge, MA, 2000.
[36] R.C. Pavlicek. Embracing Insanity: Open Source Software Development. SAMS, Indianapolis, IN, 2000.
[37] B. Perens. The Open Source definition. In C. DiBona and S. Ockman, editors, Open Sources: Voices from the Open Source Revolution, pages 171–188. O’Reilly & Associates, Sebastopol, CA, 1999.
[38] J.C. Principe, N.R. Euliano, and W.C. Lefebvre. Neural and Adaptive Systems. John Wiley, New York, 2000.
[39] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[40] A.E. Raftery and C. Volinsky. Bayesian Model Averaging Home Page [WWW], 1999. Available from: http://www.research.att.com/~volinsky/bma.html [accessed 9 July 1999].
[41] M. Ramoni and P. Sebastiani. Learning Bayesian networks from incomplete databases. Technical Report KMI-TR-43, Knowledge Media Institute, Open University, February 1997.
[42] D.K. Rosenberg. Open Source: The Unauthorized White Papers. M & T Books, Foster City, CA, 2000.
[43] R. Scheines, P. Spirtes, C. Glymour, and C. Meek. TETRAD II: Tools for Discovery. Lawrence Erlbaum Associates, Hillsdale, NJ, 1994.
[44] D. Spiegelhalter, A. Thomas, N. Best, and W. Gilks. BUGS: Bayesian inference Using Gibbs Sampling. MRC Biostatistics Unit, Cambridge, 1996.
[45] D.J. Spiegelhalter, A. Thomas, and N.G. Best. WinBUGS Version 1.2 User Manual. MRC Biostatistics Unit, Cambridge, 1999.
[46] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.
[47] R. Stallman. The GNU Operating System and the Free Software Movement. In C. DiBona and S. Ockman, editors, Open Sources: Voices from the Open Source Revolution, pages 53–70. O’Reilly & Associates, Sebastopol, CA, 1999.
[48] M. Svensen. GTM: The Generative Topographic Mapping. PhD dissertation, Neural Computing Research Group, Aston University, April 1998.
[49] W.N. Venables and B.D. Ripley. Modern Applied Statistics with S-Plus. Springer, New York, 3rd edition, 1999.
[50] W.N. Venables and B.D. Ripley. S Programming. Springer, New York, 2000.
[51] G.R. Warnes. HYDRA: a Java library for Markov chain Monte Carlo. Journal of Statistical Software, 7(4), 2002.
[52] M. Welsh, M.K. Dalheimer, and L. Kaufman. Running Linux. O’Reilly & Associates, Sebastopol, CA, 3rd edition, 1999.
[53] D.J. Wilkinson and S.K.H. Yeung. A sparse matrix approach to Bayesian computation in large linear models. Computational Statistics and Data Analysis, 44:423–516, 2004.