17
BioMed Central Page 1 of 17 (page number not for citation purposes) BMC Bioinformatics Open Access Proceedings The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers, Martijn AR Leisink and Hilbert J Kappen* Address: Department of Medical Physics and Biophysics, Radboud University, Nijmegen, The Netherlands Email: Cornelis A Albers - [email protected]; Martijn AR Leisink - [email protected]; Hilbert J Kappen* - [email protected] * Corresponding author Abstract Background: Computing exact multipoint LOD scores for extended pedigrees rapidly becomes infeasible as the number of markers and untyped individuals increase. When markers are excluded from the computation, significant power may be lost. Therefore accurate approximate methods which take into account all markers are desirable. Methods: We present a novel method for efficient estimation of LOD scores on extended pedigrees. Our approach is based on the Cluster Variation Method, which deterministically estimates likelihoods by performing exact computations on tractable subsets of variables (clusters) of a Bayesian network. First a distribution over inheritances on the marker loci is approximated with the Cluster Variation Method. Then this distribution is used to estimate the LOD score for each location of the trait locus. Results: First we demonstrate that significant power may be lost if markers are ignored in the multi-point analysis. On a set of pedigrees where exact computation is possible we compare the estimates of the LOD scores obtained with our method to the exact LOD scores. Secondly, we compare our method to a state of the art MCMC sampler. When both methods are given equal computation time, our method is more efficient. Finally, we show that CVM scales to large problem instances. Conclusion: We conclude that the Cluster Variation Method is as accurate as MCMC and generally is more efficient. Our method is a promising alternative to approaches based on MCMC sampling. Background The goal of genetic linkage analysis is to link phenotype to genotype. Pedigrees are collected where a trait or disease is believed to have a genetic component. The individuals in the pedigree are genotyped for a number of markers on the chromosome. The markers are at known relative recombination frequencies, so that from the genotypes a distribution over inheritances can be inferred. Linkage of the trait to a specific location in the marker map then is quantified by the extent to which the distribution over inheritances as inferred from the markers can explain the observed phenotypes in the pedigree. Parametric linkage analysis In this article we compute linkage likelihoods with the parametric LOD score (log odds ratio) proposed by Mor- ton [1]. The LOD score is the log ratio of the likelihoods of the hypothesis that the disease locus is linked to the marker loci at a specific location and the hypothesis that it is unlinked to the marker loci. The LOD score requires from NIPS workshop on New Problems and Methods in Computational Biology Whistler, Canada. 18 December 2004 Published: 20 March 2006 BMC Bioinformatics 2006, 7(Suppl 1):S1 doi:10.1186/1471-2105-7-S1-S1 <supplement> <title> <p>NIPS workshop on New Problems and Methods in Computational Biology</p> </title> <editor>Gal Chechik, Christina Leslie, Gunnar Rätsch, Koji Tsuda</editor> <note>Proceedings 1471-2105-7-S1-info.doc</note> <url>http://www.biomedcentral.com/content/pdf/1471- 2105-7-S1-info.pdf</url> </supplement>

BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BioMed CentralBMC Bioinformatics

ss

Open AcceProceedingsThe Cluster Variation Method for Efficient Linkage Analysis on Extended PedigreesCornelis A Albers, Martijn AR Leisink and Hilbert J Kappen*

Address: Department of Medical Physics and Biophysics, Radboud University, Nijmegen, The Netherlands

Email: Cornelis A Albers - [email protected]; Martijn AR Leisink - [email protected]; Hilbert J Kappen* - [email protected]

* Corresponding author

AbstractBackground: Computing exact multipoint LOD scores for extended pedigrees rapidly becomesinfeasible as the number of markers and untyped individuals increase. When markers are excludedfrom the computation, significant power may be lost. Therefore accurate approximate methodswhich take into account all markers are desirable.

Methods: We present a novel method for efficient estimation of LOD scores on extendedpedigrees. Our approach is based on the Cluster Variation Method, which deterministicallyestimates likelihoods by performing exact computations on tractable subsets of variables (clusters)of a Bayesian network. First a distribution over inheritances on the marker loci is approximatedwith the Cluster Variation Method. Then this distribution is used to estimate the LOD score foreach location of the trait locus.

Results: First we demonstrate that significant power may be lost if markers are ignored in themulti-point analysis. On a set of pedigrees where exact computation is possible we compare theestimates of the LOD scores obtained with our method to the exact LOD scores. Secondly, wecompare our method to a state of the art MCMC sampler. When both methods are given equalcomputation time, our method is more efficient. Finally, we show that CVM scales to large probleminstances.

Conclusion: We conclude that the Cluster Variation Method is as accurate as MCMC andgenerally is more efficient. Our method is a promising alternative to approaches based on MCMCsampling.

BackgroundThe goal of genetic linkage analysis is to link phenotype togenotype. Pedigrees are collected where a trait or diseaseis believed to have a genetic component. The individualsin the pedigree are genotyped for a number of markers onthe chromosome. The markers are at known relativerecombination frequencies, so that from the genotypes adistribution over inheritances can be inferred. Linkage ofthe trait to a specific location in the marker map then isquantified by the extent to which the distribution over

inheritances as inferred from the markers can explain theobserved phenotypes in the pedigree.

Parametric linkage analysisIn this article we compute linkage likelihoods with theparametric LOD score (log odds ratio) proposed by Mor-ton [1]. The LOD score is the log ratio of the likelihoodsof the hypothesis that the disease locus is linked to themarker loci at a specific location and the hypothesis thatit is unlinked to the marker loci. The LOD score requires

from NIPS workshop on New Problems and Methods in Computational BiologyWhistler, Canada. 18 December 2004

Published: 20 March 2006

BMC Bioinformatics 2006, 7(Suppl 1):S1 doi:10.1186/1471-2105-7-S1-S1<supplement> <title> <p>NIPS workshop on New Problems and Methods in Computational Biology</p> </title> <editor>Gal Chechik, Christina Leslie, Gunnar Rätsch, Koji Tsuda</editor> <note>Proceedings 1471-2105-7-S1-info.doc</note> <url>http://www.biomedcentral.com/content/pdf/1471-2105-7-S1-info.pdf</url> </supplement>

Page 1 of 17(page number not for citation purposes)

Page 2: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

specification of the disease frequency and penetrance val-ues and therefore falls into the category of parametricscoring functions.

Exact computationsSeveral methods for exact computations are in use.

Lander et al. [2] introduced a Hidden Markov Model(HMM) where the meiosis indicators are the unobservedvariables. This method is linear in the number of loci, butexponential in 2n – f, where n is the number of non-founders and f the number of founders. Kruglyak et al. [3]optimized the method in the program Genehunter.

Elston et al. [4] developed an algorithm that is efficient onpedigrees that have little inbreeding. This method is linearin the number of individuals (in case there is no inbreed-ing) but scales with the number of possible multi-locusgenotypes. The method was made computationally effi-cient in the package Vitesse [5].

Both of these methods exploit particular independenceproperties of the statistical model. Within the frameworkof Bayesian networks, this approach has been generalizedin the junction tree algorithm [6,7]. In the computer pro-gram Superlink [8] this approach is implemented for theapplication of linkage analysis and is the first program tomake use of Bayesian networks for computing exact link-age likelihoods.

Although exact algorithms have been substantiallyimproved over the years, the fact remains that they requirean exponential number of operations and have limitedapplicability.

The Cluster Variation MethodThe Cluster Variation Method originated with the work ofBethe [9] and was extended to non-pair wise marginals byKikuchi [10] to compute properties of magnetic materials,such as Ising models. In later years, the method has beenextended and reformulated [11,12]. Recently the methodhas been introduced into the machine learning commu-nity [13,14]as a method for approximate inference inBayesian networks and undirected graphical models.

The Cluster Variation Method approximates an intracta-ble probability distribution in terms of marginal probabil-ity distributions on clusters of variables. These clusters ofvariables are chosen such that exact computations are fea-sible on each cluster. We make explicit use of the formu-lation of linkage analysis in terms of a Bayesian networkto choose which variables will be contained in the clus-ters. In contrast with MCMC the approximation is deter-ministic and yields estimates of the pedigree likelihood.

CVM and linkage analysisAs large complex pedigrees with individuals genotyped ata large number of locations become increasingly availa-ble, along comes the need for methods of estimating like-lihoods on pedigrees where exact computations are notpossible.

In this article we describe in detail how the Cluster Varia-tion Method can be applied to the problem of geneticlinkage analysis on pedigrees without inbreeding. We dis-cuss extension of our approach to inbred pedigrees.

ResultsWe compare the estimates of the LOD score obtained withour method to exact scores as computed with Vitesse [5].We also compare our method to Markov Chain MonteCarlo (MCMC) simulations. For this we have used version2.5 of the Morgan sampler [15]. This MCMC sampler isoptimized for pedigrees for which exact single locus com-putations are possible. To our knowledge this is the mostadvanced sampler for the pedigrees we consider.

We consider CVM converged if the marker marginalschange by no more than 10-3. We use the following set-tings for the Morgan sampler: the number of prior sam-ples and burn-in samples are set to respectively 50 % and10 % of the number of samples used for the actual esti-mates.

We performed all experiments on a Pentium-IV 2.8 GHzwith 1 GB of physical memory running Linux.

SimulationsWe start by motivating the use of approximate methodswith an example. On the pedigree shown in figure 1, wehave simulated a dominant disease with penetrance val-ues f = (0.02, 0.98, 0.98) and trait allele frequencies t =(0.98, 0.02), at 0 cM. This pedigree can be handled byGenehunter [3]. We have simulated 25 pedigrees, wherehalf of the individuals has genotypic and phenotypic data.14 bi-allelic markers were simulated with marker allelefrequencies m = (0.4, 0.6) for all 14 markers. The markerspacings are 1 cM.

In figure 2 we now compare exact LOD scores computedwith all of the 14 available markers to exact LOD scorescomputed with only a subset of the markers. The figureshows that significant power is lost when markers areexcluded from the multi-point analysis. The solid line rep-resents exact LOD scores computed with all availablemarkers. The LOD score peaks at 0 cM, where the diseasewas indeed simulated. The dotted line represents the LODscore based on 5 markers: for each location of the traitlocus, a LOD score is computed by doing a multi-pointcalculation with the 2 markers to the left of the trait locus,

Page 2 of 17(page number not for citation purposes)

Page 3: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

2 markers to the right of the trait locus, and the marker atwhich the trait locus is located. If the trait locus is locatedon the first marker in the marker map, no markers to theleft of this marker are available so that the first 5 markersare used to calculate the LOD score for this location of thetrait locus. This approach can be characterized as a slidingwindow approach.

In this example the pedigree was small so that exact scorescan be computed with Genehunter for a virtually unlim-ited number of markers. However, as the size of the pedi-gree increases, the number of markers that can be

analyzed simultaneaously drops rapidly. In that case sig-nificant power may be lost. Thus, an accurate approximatemethod that can take into account all markers is desirable.

We now compare the estimates of the CVM and MCMC tothe exact scores. The results are obtained on the pedigreeshown in figure 3. There are 48 individuals of which 10are founders. The number of children per nuclear familyincreases from two in the second generation to five in thethird generation. We simulate phenotypes and genotypesaccording to this pedigree. We consider a dominant dis-ease with penetrance values f = (0.02, 0.90, 0.90). The dis-ease allele frequency has been set to 2 %, so that t = (0.98,0.02). We assume that for each individual in the pedigreethe affection status is known. For a marker spacing of 5cM, we simulated 25 pedigrees with 3 markers and at least15 affected individuals per pedigree. The number of alle-les is 5 per marker with equal frequencies. 70 % of theindividuals in the last two generations is genotyped for allmarkers. The individuals in the first two generations arenot genotyped.

In figure 4 we compare the quality of the approximationresulting from two cluster choices C1 and C2. These clusterchoices are specified in figure 5. The error is defined as theabsolute difference between the exact LOD score and theCVM estimate of the LOD score, averaged over all posi-tions of the trait locus. We see that the error of the larger

Power of analysisFigure 2Power of analysis. Power decreases when markers are excluded from the multi-point analysis. On the pedigree of figure 1 a dominant disease is simulated at 0 cM. Solid line represents exact LOD scores based on all 14 markers; dotted line represents exact LOD scores based on 4 markers surrounding the trait locus.

−10 −5 0 5 100

0.2

0.4

0.6

0.8

1

Disease locus (cM)

LOD

all markerssliding window

Pedigree 1Figure 1Pedigree 1. Pedigree used for the results of figure 2.

Page 3 of 17(page number not for citation purposes)

Page 4: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

clusters of C1 is small and an order of magnitude smallerthan the error of the smaller clusters of C2.

This result demonstrates two points. First, it is importantto include the interactions between meiosis variables onadjacent loci into at least one cluster. Second, the accuracyof the approximation can be adapted by increasing thenumber of variables per cluster. Although it is not the casethat an increased number of variables per cluster guaran-tees a higher accuracy of the approximation, in our expe-rience it is generally possible to obtain more accurateestimates by increasing the number of loci and/or thenumber of generations covered by a cluster.

For a difficult problem in the dataset, i.e. one whereMCMC and CVM error are relatively large, we compareour result to MCMC estimates obtained with Morgan. Infigure 6 the decrease of the CVM and MCMC error as afunction of computation time (i.e. number of samples) isshown. We see that a significant increase of the computa-tion time does not significantly decrease the error and var-iance of the MCMC estimate. The error of the CVMestimates obtained with cluster choice C1 is indicated bythe dashed line. The CVM computation time is varied byadjusting the value of the convergence criterion. We con-clude that our method achieves higher accuracy for a givenamount of computation time.

For the other pedigrees in the data set we compare CVM toMCMC, where for each problem MCMC is alotted thecomputation time required by CVM with cluster choiceC1. The results are shown in figure 7. We see that theMCMC estimates are less accurate. The average CVM com-putation time is 700 seconds, although there is a consid-erable degree of variance in the order of 100 seconds.Memory requirements vary between 100 and 250 MB,depending on the informativeness of the markers. We didnot find a correlation between CVM computation timeand the absolute error with respect to the exact distribu-

tion. Also the outliers in the figure are not explained bylarge CVM computation times that consequently lead toan improvement of the MCMC estimate, as the MCMCcomputation time is fixed to the CVM computation time.Since both methods will theoretically converge to theexact solution in the limit of infinite time resources, thisis the only fair comparison. Additional simulations (notshown) indicate, in agreement with the results reported infigure 6, that the MCMC estimates are sufficiently con-verged. The CVM estimates are reproducible; variance is inthe order of the convergence criterion.

We now demonstrate that the method scales to largerproblem instances. We therefore vary the number of

Comparison of cluster choicesFigure 4Comparison of cluster choices. Error of cluster choice C1 versus error of cluster choice C2 for a marker spacing of 5 cM. Error of cluster choice C1 is an order of magnitude smaller than error of cluster choice C2.

−3 −2 −1 0 1−3

−2

−1

0

1

log10

(Error C2)

log 10

(Err

or C

1)

Pedigree 2Figure 3Pedigree 2. Pedigree used for the results of figures 5-8. This pedigree consists of 48 individuals, of which 10 are founders.

Page 4 of 17(page number not for citation purposes)

Page 5: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

markers, since Vitesse can handle very large pedigrees withno loops, but only a small number of markers. We havesimulated a dominant disease on the pedigree of figure 3and 32 bi-allelic markers with equal allele frequencies. Wehave simulated one pedigree where all individuals aregenotyped for all 32 markers. From this instance we create16 problems by selecting a subset of the markers of theoriginal problem.

In figure 8 we show that CVM computation time scalesapproximately linearly with the number of markers, as domemory requirements (not shown). The varying informa-tiveness of the markers explains the fluctuations. Vitessecannot handle more than 10 markers, because memoryrequirements exceed the available 1 GB. Memory require-ment of CVM for 32 markers is 150 MB. In this case,MCMC estimates take several hours to reach convergence.We conclude that our method scales to large probleminstances.

DiscussionWe compared our method to the MCMC implementationof the Morgan sampler, which is to our knowledge themost advanced program for this problem. There are pack-ages that can handle more general pedigrees than Morgan,such as SIMWALK2 [16,17], but here we have investigatedonly pedigrees without inbreeding. Preliminary resultsindicate the CVM approximations based on the clusterchoices presented in this article can give good results oninbred pedigrees. Extension to inbred pedigrees is possi-ble and a direction for further research.

Theoretically, if the sampler is irreducible, MCMC esti-mates should ultimately converge to the exact score. How-ever, in practice this may require extremely long

computation times. The Cluster Variation Method doesnot guarantee that for a fixed choice of clusters the approx-imate marginals will converge to marginals of the exactdistribution, but the same argument that holds for MCMCalso holds for CVM: we can increase the cluster size andconsequently computation time to improve the quality ofthe approximation.

In this article we have proposed cluster choices that gener-ally give good results. Sometimes when many individualsare untyped the estimates can be inaccurate. Interestingly,on these problems the Morgan sampler also experiencedsevere difficulties.

In the approach we have taken, we can define a heuristicto detect errors in the approximation. Suppose we havethree markers; then the LOD score for marker 2 can becomputed either from the marginals defined on thenuclear families on the first and second marker, or fromthe marginals defined on the nuclear families on the sec-ond and third marker. If these LOD scores differ signifi-cantly, one should be very careful in interpreting theestimate and the number of variables per cluster must beincreased. We cannot guarantee that if the LOD scores areconsistent, the approximation is accurate. An obvious anduseful extension would be an automatic procedure thatgives the optimal set of clusters. However such a proce-dure is far from trivial, and the guideline to choose theclusters as large as available memory permits seems towork well in practice.

In the current implementation we have applied a numberof preprocessing techniques to improve the efficiency. Weexpect that even better efficiency can be obtained byapplying more preprocessing techniques such as genotype

Cluster choicesFigure 5Cluster choices. Nodes on neighboring markers l and l + 1 (shaded) that form a cluster. Large node represents the genotype of an individual, a small node the meiosis indicator an individual. In the CVM approximation, only marginals of clusters on the marker loci are computed. A shows cluster choice C1. Genotype nodes of parents and children and meiosis nodes of children in the nuclear family of adjacent marker loci form a single cluster. B shows cluster choice C2. Genotype nodes of parents and children and meiosis nodes of children in a nuclear family on one marker locus form a cluster. The meiosis nodes of the chil-dren form a separate cluster.

� � �� � �� � �� � �

� � �� � �� � �� � �� � �� � �

� � �� � �� � �� � �� � �� � �

� � �� � � � � �� � �� � � � � �� � � � � �� � �

� � �� � �� � �� � �� � �� � �� � �� � �

� � �� � �� � �� � �� � �� � �� � �� � �� �� � � � �� � �

� � �� � �

A

� � �� � �� �� �

� � �� � �� � �� � �� � �� � �

� � �� � �� � �� � �� � �� � �

� � �� � �

! ! !! ! !! ! !" " "" " "" " " # # ## # #$ $$ $ % % %% % %

& && &

' ' '' ' '' ' '( ( (( ( (( ( (

) ) )) ) )) ) )* * ** * ** * *+ + ++ + +, ,, , - - -- - -

. . .. . .

B

Page 5 of 17(page number not for citation purposes)

Page 6: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

elimination [18,19], and techniques specific to Bayesiannetworks such as value abstraction [20] and evidencebased compiling [21]. Also, preliminary simulations indi-cate that smaller clusters can give equally accurate esti-mates with reduced memory requirements.

Other applications fit naturally into the framework pre-sented here. Since the Cluster Variation Method is able toestimate pedigree likelihoods directly, the method pre-sented here can be used directly to estimate recombina-tion frequencies and marker ordering errors with amaximum likelihood approach. Maximum likelihoodhaplotyping on general pedigrees also is very promising inthis framework.

ConclusionIn this article we have demonstrated the feasibility of anew approach to compute linkage likelihoods for linkageproblems that are beyond the reach of exact computa-tions. Previous methods that are suited to deal with theseintractable problems relied on sampled estimates. Wehave shown that a deterministic approach based on theCluster Variation Method is able to obtain accurate esti-mates of LOD scores and generally is more efficient thanMCMC methods.

MethodsA Bayesian network formulationWe briefly describe the Bayesian network that will enableus to compute likelihoods. Any probability distributioncan be represented with a Bayesian network [22]. There-fore the use of Bayesian networks is merely a matter ofconvenient representation of a probability distribution,and is irrelevant to the issue of Bayesian versus frequentiststatistics. Bayesian networks in the context of geneticshave first been applied by Jensen et al. [23]. Theirapproach was extended by Thomas et al. [24]. The use ofBayesian networks for exact computations has been pro-posed by Fishelson et al. [8]. An extensive discussion ofBayesian networks in the context of genetics is given bySheehan et al. [25]. These articles have demonstrated thepower of Bayesian networks for linkage analysis.

The transmission of alleles from parents to children isclearly a directed process. A Bayesian network represents aprobability distribution in terms of a directed graph, i.e. agraph where the links between the variables are directed.A Bayesian network is therefore particularly suited tomodel the probability distribution associated with theproblem of multi-point linkage analysis. By specifyingconditional probability tables for each variable, the for-malism of Bayesian networks guarantees that the corre-sponding probability distribution is consistent andnormalized [22].

Comparison of CVM and MCMC errorFigure 7Comparison of CVM and MCMC error. Error of CVM estimate obtained with cluster choice C1 versus error of MCMC for a marker spacing of 5 cM compared to exact results obtained with Vitesse. MCMC is alotted the same computation time as CVM. CVM yields more accurate esti-mates than MCMC.

−3 −2 −1 0 1−3

−2

−1

0

1

log10

(Error MCMC)

log 10

(Err

or C

VM

)

Error as function of computation timeFigure 6Error as function of computation time. MCMC and CVM error as a function of computation time. Error is defined as the absolute difference between the exact LOD score and the estimated LOD score, averaged over all posi-tions of the trait locus. MCMC error of 10 independent runs do not converge to a better estimate than the CVM estimate with C1. CVM computation time is varied by adjusting the convergence criterion. Thus, CVM achieves higher accuracy for a given amount of computation time. CVM estimates are reproducible.

0 50 100 150 2000

0.5

1

1.5

Computation time (minutes)

Err

or

CVMMCMC

Page 6 of 17(page number not for citation purposes)

Page 7: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

While the Bayesian network is the most convenient formodel specification, it is not possible to apply the ClusterVariation Method directly to a Bayesian network. In orderto perform inference, the Bayesian network is converted toan undirected graphical model by the procedure of moral-ization [7]. Moralization removes the directions of thelinks and adds links between the parent variables of a var-iable, i.e. all variables with a link directed towards a givenvariable. The undirected graphical model representsexactly the same probability distribution as the directedgraphical model; the addition of extra links ensures thatthe correlations encoded in the conditional probabilitytables will be correctly taken into account by the inferencemethod. The formal procedure of converting the Bayesiannetwork is standard practice and used by all of the abovementioned methods that make use of Bayesian networks.

The Cluster Variation Method requires specification ofwhich variables are contained in each cluster, and thisspecification becomes very transparent when the depend-encies between these variables are modeled with a Baye-sian network. The Bayesian network consists of a numberof marker loci, with known relative recombination fre-quencies θ, and a single trait locus linked to the markersat a given position λT. The purpose of linkage analysis isto determine the most likely position of the trait locus rel-ative to the markers. To that end a Bayesian network is

constructed for each possible location λT of the trait locus,so that the likelihood of the trait phenotypes and markergenotypes can be computed for that location of the traitlocus. The ratio of this likelihood and the likelihood ofthe trait locus unlinked to the markers then gives the LODscore for location λT.

Single locus model

First we define the Bayesian network for a single locus.The inheritance model is shown in figure 9A. Each varia-ble is represented graphically by a node. A conditionalprobability table for a variable is defined by the variableitself and all variables (or, equivalently, nodes) that havea link which points to that variable. In the figure, the var-iables are the genotypes and the meiosis indicators. Eachindividual, denoted by the subscript i, possesses two genes

and that correspond to the paternally and

maternally inherited allele, indicated by the superscript p

and m respectively. The meiosis indicators and

indicate whether the paternal or the maternal allele ofrespectively the father and the mother is inherited. The

nodes and take the values 1,..., |ml|, with |ml|

the number of marker alleles for marker locus l. We will

use the shorthand notation = ( , ). The father

and mother of individual i are denoted by f (i) and m (i)respectively, and in the following we will also use

π (i) = (f (i), m (i)) to denote both parents.

Figure 9A is a graphical representation of the followingconditional probability tables in the Bayesian network:

We use boldface to indicate vectors over the missing sub-scripts and superscripts. For each individual i that is not afounder we have two conditional probability tables as inequation 1. If individual i is a founder, we have a prior dis-

tribution on the genotypes instead: P ( |ml), where ml

represents the marker allele frequencies for marker l; on

the trait locus, we have P ( |t), where t represents the

trait allele frequencies.

We note that the genotypes of all non-founders are com-pletely determined by the genotypes of the founders andthe meiosis indicators v. The meiosis indicators com-

Gil p, Gi

l m,

vil p, vi

l m,

Gil p, Gi

l m,

Gil Gi

l p, Gil m,

P P G v G G

P G

il

il

il

il p

il p

f il p

f il m

il m

G v G| , | , ,

|

( ), ,

( ),

( ),

,

π( ) = ( )× vv G Gi

l mm il p

m il m,

( ),

( ),, , . ( )( ) 1

Gil

GilT

Scaling of CVM computation time with number of markersFigure 8Scaling of CVM computation time with number of markers. Scaling of CVM computation time with the number of marker loci is indicated by the crosses. Scaling of computation time is approximately linear. All individuals have marker information, the number of alleles is 2. The red curve represents computation time of Vitesse. Vitesse cannot han-dle more than 10 markers as memory requirements exceed system memory of 1 GB. For the problem of 32 markers, MCMC estimates take several hours to reach convergence.

2 8 14 20 26 320

50

100

150

200

250

Com

puta

tion

time

(s)

Page 7 of 17(page number not for citation purposes)

Page 8: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

pletely specify the flow of the alleles. Thus, the genotypesof non-founders are not strictly necessary for computingthe appropriate likelihood; however, they do simplify thestructure of the Bayesian network such that the ClusterVariation Method can be applied.

The graphical representation for the marker and traitobservations is shown in figure 9B and 9C respectively. Ifan individual i has marker data for marker locus l, then we

have two marker alleles ( , ). However, it is not

known which allele corresponds to the paternal allele andwhich one corresponds to the maternal allele. The phaseambiguity is reflected in the marker observation model.

For example, consider = (1, 2), then the only non-zero

probabilities are

Here we will only consider binary traits, although multi-valued or real-valued traits in principle are possible.Therefore, we need only two possible alleles on the traitlocus, one which is assumed to cause the phenotype andone which is unrelated to the phenotype. The dependence

of the trait Ti on the genotype is specified with the

penetrance values f = (f0, f1, f2). The probabilities f0, f1 and

f2 are the probabilities fn = P (Ti = affected|#g = n), where g

is the number of trait alleles and n = 0, 1, 2. The traitmodel introduces the conditional probability table

Multi-locus modelThe full Bayesian network of the multi-locus model con-sists of the single locus models for all markers and the traitlocus. The recombinations between loci are modelled byadding links between the meiosis indicators of adjacentloci of the same individual, as illustrated in figure 10. Inthe absence of data, the meioses of any two individualsare independent. However, the meioses of a single indi-vidual are not independent. They depend on each otherthrough the relation

These conditional probability tables are parameterized bythe recombination frequency θl+1,l between the adjacentloci1. The first locus does not have a left neighbor, so weuse a flat prior on its meiosis indicators. In the geneticlinkage analysis of a pedigree, it is assumed that therecombination ratios between the markers are known.That is, for any two adjacent markers l, l', the recombina-tion frequency θl,l' is specified. The recombination fre-quency between the markers and the trait locus is fixed fora given position λT of the trait locus and can be deter-mined from the marker map.

mil,1 mi

l,2

Gil

P

P

M G

M G

il

il

il

il

= =

= =

(1,2) (1,2)

(2,1) (1,2)

( ) =

( ) =

1212

and

.

GilT

P Ti il lT| , . ( )G f=( ) 2

P P v v

P v v

il

il

l l il p

il p

l l

il m

il m

v v++

++

+

( ) = ( )×

11

11

1

| , | ,

|

,, ,

,

, ,

θ θ

,, .,θl l+( )1

Coupling of the lociFigure 10Coupling of the loci. Adjacent loci are coupled only

through the meiosis indicators and of the individu-

als, as indicated by the solid lines. Dashed arrows represent links between nodes on the same locus.

vl,pi v

l,mi

vl+1,pi v

l+1,mi

Gl,pi G

l,mi

Gl+1,pi G

l+1,mi

vil p, vi

l m,

Single locus Bayesian networkFigure 9Single locus Bayesian network. A shows transmission

model. The paternal allele of individual i on locus l is

determined by the state of its paternal meiosis indicator

and the genotype of its father denoted by . The mater-

nal allele is analogous. B shows marker observation model and C shows the trait observation model.

Gl,p

f(i)G

l,m

f(i)G

l,p

m(i)G

l,m

m(i)

Gl,pi G

l,mi

vl,pi v

l,mi

.A

Mli

Gl,pi G

l,mi

.B

Ti

GlT ,pi G

lT ,mi

C

Gil p,

vil p,

G f il

( )

Page 8 of 17(page number not for citation purposes)

Page 9: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

Collecting all conditional probability tables, the defini-tion of the full probability distribution is given by:

Here the founder and non-founder individuals aredenoted by F and NF, respectively. The index of the traitlocus depends on the position of the trait locus, so thatlT = lT (λT). The recombination frequency θl,l' depends onλT if either l = lT or l' = lT, so that θl,l' = θl,l' (λT). Otherwise,θl,l' is independent of λT.

This distribution is normalized to one, by construction.Computing marginal distributions is generally intractablebecause the structure of the corresponding Bayesian net-work can be too complex due to loops. Loops are causedby inbreeding in the pedigree and through the coupling ofthe meiosis indicators between different loci.

Calculating LOD scoresThe LOD score of parametric linkage analysis is defined asthe log ratio of the likelihood that the trait locus is linkedto the marker loci at location λT and the likelihood thatthe trait locus is unlinked, denoted by λT = ∞:

The denominator can be rewritten as

P (T, M|f, t, m, θ, λT = ∞) = P (T|f, t) P (M|m, θ),

giving

The denominator has to be computed only once and actsas a normalization constant. We will use the Cluster Vari-ation Method to approximate both likelihoods independ-ently. The likelihood in the numerator has to be estimatedfor each position of the trait locus.

The Cluster Variation MethodIn this section we describe how the Cluster VariationMethod can be used to obtain approximations of mar-ginal distributions of the exact distribution. In order to

apply the Cluster Variation Method more conveniently,we will make a slight change in notation.

The probability distribution of a Bayesian network is ofthe general form

Where π (i) are the nodes with a link directed towardsnode i, xi is the value assumed by node i and x is a vectorof values assumed by all nodes in the Bayesian network. Ifthere are no nodes that have a link pointing to node i, wehave

P (xi|xπ(i)) = P (xi).

We consider evidence to be the observation that a node isclamped to a state, e.g. an individual is affected or hasmarker genotype (1, 2). Suppose we have evidence that

node n is clamped to state , denoted by e = {xn = },

then

For genetic linkage analysis the evidence is on marker gen-otypes M and trait phenotypes T, and we wish to computethe likelihood of these observations given model parame-ters.

We now define

and e = {e1, ..., en}. Here δ (·) is the delta function, whichserves to clamp a node to its observed value. Using thesedefinitions, we can rewrite the likelihood of the evidencee as

so that the probability distribution over nodes withoutevidence xi\e conditional on nodes with evidence xe isgiven by

P

P T P

T

i il

iil

il

l l

T T

( , , , | , , , , )

| , |( )

,

T M v G m t f

G f M G

θ λλ

=

( ) ( )∈ ≠∏F NF TT T

P P

P

lil

il

l l T il

il

il

i

i

( )

, ( )| , ( ) | ,

λ

πθ λ

∏∏

×

( ) ( ) ×−−

∈v v G v G1

1NF

∈∈ ≠∏ ∏( )

F

G t G mil

il l

l l

T T

T T

P( )

( )

| ( | )

( )

λ

λ

3

LOD( | , , , ) log( , | , , , , )

( , | , , , , )λ θ

θ λθ λT

T

T

P

Pf t m

T M f t mT M f t m

== ∞

⎡10

⎣⎣⎢

⎦⎥ .

LOD( | , , , ) log( , , , , )

( | ,. ( )λ θ

θ λT

TP

Pf t m

T M f t mT f t

=⎡

⎣⎢

⎦⎥10 4

| ,)

P P xi ii

( ) ( | ),( )x x= ∏ π

xne xn

e

P e x x P x x

P x

n ne

i ne

i

i ii n

( { }) ( |

( | )

( )

( )

= = = =

×

∑∏≠

x

x

π

π

x

ψ

δ

π

π

π

i i i

i i

i i i ie

i

x

P x

P x x x e

( , )

( | ) :

( | ) ( ) :

( )

( )

( )

x

x

x

=

noevidence

== =

⎧⎨⎪

⎩⎪ { },x xi ie

P xi i ii

( ) ( , ).( )e = ∏∑ ψ πxx

Page 9 of 17(page number not for citation purposes)

Page 10: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

We now have reformulated the probability distribution ofthe Bayesian network in terms of so called potential fun-tions ψ which do not reflect any longer how the linksbetween the nodes were originally directed. Also, thepotential functions ψi (xi, xπ(i)) contain both node i andthe parents of node i. As a result, in the undirected graphassociated with these potential functions all parents ofnode i are connected to each other. This formal procedureis called moralization [7] and is essential to the applica-tion of all inference methods to Bayesian networks. Equa-tion 5 specifies the same distribution as equation 3, butwill be more convenient to apply the Cluster VariationMethod to.

Obtaining the exact distribution from a variational principleThe exact distribution P can be derived from a variationalprinciple:

subject to the constraint that P is normalized to one,where

and KL is the Kullback-Leibler divergence. The solution is

readily given by P = . However, the sum in equation

6 is over an exponential number of states, and is generallyintractable.

At this point one can make various choices in making anapproximation to the optimization problem defined inequation 6. The CVM approximation fits into this frame-work as follows. The approach is to replace P with theproduct

This expression deserves some explanation and motiva-

tion. The labels α ∈ B run over the set of basic clusters B. A

cluster α consists of a set of nodes xα = { , , ..., }

which can in principle be chosen freely. However, the ideais to choose them such that the corresponding marginal

distributions Qα (xα) are tractable for exact computation.

Essential to the CVM approximation is that the clusters αare defined on overlapping subsets of nodes: a singlenode, or even a subset of nodes, can occur in several of the

basic clusters α ∈ B. Although the product2 of the clustermarginals may not be a good approximation of the fulldistribution, the approximation is designed such that thecluster marginals Qα (xα) are accurate approximations of

the exact marginals P (xα).

From the set of the basic clusters B follows the definitionof the set of clusters M. The set M contains all clusters thatcan be constructed by taking intersections of basic clustersα ∈ B, intersections of intersections of basic clusters α ∈B, and so forth. Defining U as B ∪ M, the coefficients αβare defined by

where αγ = 1, ∀γ ∈ B. These coefficients are known as theMoebius numbers or over counting numbers.

How can the form of the distribution of equation 7 andthe coefficients of equation 8 be justified? If the Bayesiannetwork has no loops3 then the exact distributions is ofthe form 7 with αi = {i, π (i)}, B = {αi}. If the Bayesian net-work does have loops, this is not true. However, due to theevidence, many variables become effectively independentand a choice of the basic clusters B exists such that theapproximate marginal distributions Qα (xα) are very closeto the exact marginals P (xα).

Figure 11 shows an example Bayesian network and achoice of clusters indicated by dotted lines, specifying aparticular CVM approximation. We have the variables cor-responding to the paternal (p) and maternal (m) gene ofthe founder individuals 1 and 2 and the child 3, for bothlocus 1 and 2; the index i of the individual is subscripted,

the index l of locus is superscripted: and . Then

we have the paternal and maternal meiosis indicators of

individual 3, also for both loci: and respectively.

In this example, we have chosen the following clusters α∈ B that will determine the approximation:

PP

xi e e i i ii

( | )( )

( , ). ( )\ ( )x xe

x= ∏15ψ π

P P

PP

P

i e ei e e

= =

( )∑

arg min ( || )

( | )log|

( )( )\

\

KL Ψ

Ψx x

x x

xx

6

Ψ Ψ( ) ( , ), ( ),( )x x xx

= = ∑∏ψ πi i ii

x Z

1Z

Ψ

Q Q Qa= ∏ ∏α αα

ββ

ββ

( ) ( ). ( )x x 7

xα1xα2

xαn

a aU

β γγ β∈

= − ∑1 8⊃

, ( )

Gil p, Gi

l m,

vil p, vi

l m,

Page 10 of 17(page number not for citation purposes)

Page 11: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

Clusters α1 and α2 contain all variables of locus 1 and 2respectively; clusters α3 and α4 contain the paternal andmaternal meiosis indicators respectively that link the twoloci. These clusters have the following intersections β ∈ M:

In this example there are no intersections of intersectionsof the basic clusters α ∈ B. This choice of the clusters leadsto the following expression for equation 7:

Approximate free energiesWe now discuss the optimization problem of equation 6,which is to be redefined in terms of the cluster marginals.Inserting the CVM approximation corresponding to equa-tion 7 into expression 6, we obtain

where the minimization is subject to normalization con-straints

α

α

1 31

31

11

11

21

21

31

31

2 3

= { }=

G G G G G G v v

G

p m p m p m p m, , , , , , , ,, , , , , , , ,

2232

12

12

22

22

32

32

3 31

, , , , , , , ,

,

, , , , , , , ,

,

p m p m p m p m

p

G G G G G v v

v

{ }=α vv

v v

p

m m

32

4 31

32

,

, ,

,

, .

{ }= { }α

β β

β β

1 31

2 32

3 31

4 32

= { } = { }= { } = { }

v v

v v

p p

m m

, ,

, ,

, ,

, .

Q Q

Q v Q vp p

α α

β β

1 2

1 2

11

21

31

31

12

22

32

32

31

32

G G G v G G G v, , , , , ,

, ,

( ) ( )( ) (( ) ×

( ) ( )( ) ( )

Q v v Q v v

Q v Q v

p p m m

m m

α α

β β

3 4

3 1

31

32

31

32

31

32

, , , ,

, ,

, ,.

Q Q

Q a F Q

Q

Q Q

γ

γ γ γγ γ

{ } ≈ =

argmin ( || )

argmin ( ) argmin ( ), ( )

KL Ψ

FCVM 9γ∈∈∑

U

Qγ γγx

x∑ =( ) ,1

Example Bayesian networkFigure 11Example Bayesian network. Clusters are indicated by the dotted lines.

G1,p1

G1,m1 G

1,p2

G1,m2

G1,p3

G1,m3

v1,p3

v1,m3

.

G2,p1

G2,m1 G

2,p2

G2,m2

G2,p3

G2,m3

v2,p3

v2,m3

Page 11 of 17(page number not for citation purposes)

Page 12: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

and consistency constraints

Again, aγ are the Moebius numbers. The consistency con-straints ensure that if two clusters have a non-empty inter-section, the marginal distributions on the nodes in theoverlap are consistent.

In equation 9 we have introduced the free energy Fγ (Qγ) ofcluster γ ∈ U:

where

Ψγ contains all conditional probability tables that aredefined on subsets of nodes in cluster γ. The optimizationproblem in equation 9 is now in terms of the distributionsQγ (xγ), which are tractable by choice. The intractable opti-mization problem of equation 6 has been turned into atractable optimization problem, by substituting the exactdistribution P the product Q defined in equation 7. In thenext section we will discuss how the optimization prob-lem of equation 9 can be solved efficiently.

Returning to the example of figure 11, we can now writedown the corresponding free energy which is to be mini-mized with respect to the cluster marginals Qα (xα) and Qβ(xβ):

Here the Moebius numbers are in boldface. The minimi-zation is subject to normalization and consistency con-straints. For example, the consistency constraint betweenclusters α1 and β1:

In the example, we have put all meiosis nodes into differ-ent clusters. In practice this gives inaccurate approxima-tions; it turns out to be neccessary to join all paternal and

maternal meiosis indicators in one cluster, because of thestrong correlations between these variables. The reason isthat if the phase in one of the genotypes of the parents isreversed, for a given state of the meiosis indicators differ-ent alleles are transmitted to the children.

Minimizing the CVM free energyMinimizing the CVM free energy is difficult, since thefunctional FCVM (Q) is high-dimensional and generallynon-convex. Yedidia et al. [13] derived an inference algo-rithm based on the Cluster Variation Method, called Gen-eralized Belief Propagation (GBP). This fixed pointiteration algorithm is not guaranteed to converge, becauseof the non-convexity of the CVM free energy. Convergentalgorithms were proposed by Rangarajan et al. [26] andTeh et al. [27] and more recently by Heskes et al. [28].These so-called double loop algorithms minimize FCVM(Qγ) by iteratively improving a convex upper bound on thenon-convex functional FCVM (Qγ) that can be minimizedby fixed point iteration. The double loop algorithmalways converges to a (local) minimum of the free energy.

We use the double loop approach described in [28].Although single loop algorithms [13] in some cases mayconverge, often they require damping of the fixed pointequations and it can be difficult to find a good trade-offbetween efficiency and robustness of the algorithm. Dou-ble loop algorithms can be slower when single loop algo-rithms converge, but the setting of the parameters of thedouble loop algorithm is less critical and convergence isguaranteed in theory.

We will give an outline of the algorithm; for full details werefer to [28]. The starting point is the issue of the non-con-vexity of FCVM (Q). The free energy of each cluster, Fγ (Qγ(xγ)), is convex in terms of the approximate marginals Qγ(xγ). This can be seen by writing it out:

Here we have introduced the energy E (Qγ) and entropy S(Qγ). These names stem from statistical physics, where theCluster Variation Method is used to compute properties ofcertain metals that can be described as systems of interact-ing magnetic spins. The energy term is linear in the mar-ginal distribution Qγ. By differentiation it can be seen thatthe minus entropy has a positive second derivative, andtherefore – S (Qγ) is convex.

We now take a look at the CVM free energy again andidentify the convex and concave terms:

Q Qγ γ γ γγ γ

γ γ .( ) ( )\

x xx x

= ∀ ⊃ ′′ ′′

F Q QQ

γ γ γ γγ γ

γ γγ

( ) ( )log( )

( ),= ∑ x

x

xx Ψ

Ψγ γπ γ

π( ) ( , ).{ , ( )}

( )x x x=⊂

∏ ψ ii i

i i

F Q Q F Q F Q

F Q F Q

cvm( , ) ( ) ( )

( ) ( )

α α α α α

α α α α

β = + ⋅ + ⋅

+ ⋅ + ⋅

1 1

1 1

1

1 1 2 2

3 3 4 4

⋅⋅ − ⋅

− ⋅ − ⋅

F Q F Q

F Q F Q

β β β β

β β β β

1 1 2 2

3 3 4 4

( ) ( )

( ) ( ).

1

1 1

Q

Q

m

m p

p

α

β

1

31

21

31

3

1

1

1

2

1

3

1

3

1

G G GG G G

, , ,

, , , ,,

, ,

11 v

v v

v

∑ ( )= ( )1 3

1,

F Q Q Q Q E Q S Qγ γ γ γ γ γ γ γ γ γ γ γ γx x x x x( )( ) = ( ) + ( ) ( ) ≡ − ( ) − ( )( )log log .Ψ

Page 12 of 17(page number not for citation purposes)

Page 13: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

Here M+ is the set of clusters β ∈ M : aβ > 0 and M- ∈ M :aβ < 0. Since all free energies are convex, clusters β ∈ M-

with negative Moebius numbers have concave contribu-tions to the total free energy FCVM (Q), which thereforebecomes non-convex.

The double loop algorithm is based on the following idea.Since the fixed point iterations as employed in GBP con-verge if the free energy is convex, a convergent algorithmcan be constructed by iteratively minimizing and improv-ing convex upper bounds to the CVM free energy. Let'sdenote the convex upper bound by Fconv (Q, Q'). Define

as the collection of marginal distributions (i.e. clustermarginals) that are normalized and satisfy all constistencyconstraints between overlapping marginal distributions.Following [28], if the upper bound is at least twice differ-entiable and satisfies the following properties:

1. Fconv (Q, Q') ≥ FCVM (Q) ∀ Q, Q' ∈

2. Fconv (Q, Q) = FCVM (Q) ∀ Q ∈

3. Fconv (Q, Q') is convex in Q ∈ , ∀ Q' ∈ ,

then the algorithm

with Qn the approximate marginals at iteration n, is guar-anteed to converge to a local minimum of the CVM freeenergy FCVM(Q) under the appropriate constraints. Thefree energy decreases with each iteration, since

where the first inequality follows from condition 1 (upperbound) and the second from the definition of the algo-rithm. Condition 2 (touching) in combination with dif-ferentiability ensures that the algorithm is only stationaryin points where the gradient of FCVM is zero. By construc-

tion, Qn ∈ for all n.

A convex upper bound can be obtained easily by bound-ing the concave contributions to the free energy. Since theenergy term in the free energy of each cluster is alreadyconvex, only the concave entropy terms of clusters withnegative Moebius number need to be bounded. A convexupper bound on a concave entropy term can be achievedby linearizing it:

which directly follows from KL (Qβ, ) ≥ 0. Putting this

into expression 10, we obtain for the convex upperbound:

We now see that the both the energy Eβ (Qβ) and the

bounded entropy Sβ (Qβ, ) are linear in the cluster

marginal Qβ (xβ). We can therefore simplify expression 11

by redefining the energies of the basic clusters α ∈ B:

The convex upper bound becomes

This upper bound can be minimized using the single loopalgorithm described in [13].

Thus, the double loop algorithm consists of an outer loopand an inner loop:

Outer loop : compute convex upper bound 12 withQ' = Qn;

F Q F Q a F Q

a F Q

M

M

CVM ( ) = ( ) + ( )

− ( )

∑ ∑

∑∈

+

α α β β ββ

β β ββ

α

. ( )10

Q F Q Qn Q n+1 convargmin ( , ),= ∈

F Q F Q Q

F Q Q F Q

n n n

n n n

CVM conv

conv CVM

( ) ( , )

,,

+ +≤

≤ ( ) = ( )1 1

S Q Q Q

Q Q S Q Q

β β β β β

β β β β β β β

( ) = − ( ) ( ) ≤

− ( ) ( ) ≡ ( )∑

x x

x x

x

x

log

log ,’ ’

ββ

β

,,

Qβ’

F Q Q E Q S Q

a E Q a E Q

B

M M

conv , ’( ) = ( ) − ( )

+ ( ) + ( )∈

∈ ∈

∑ ∑

∑+ −

α αα

α αα

β β ββ

β β ββ∑∑

∑ ∑− ( ) − ( )∈ ∈+ −

a S Q a S Q QM M

β β ββ

β β β ββ

, . ( )’ 11

Qβ’

− ( ) ≡ − ( ) ( )

− ( ) ( )

+

∑ ∑∈ ⊂

E

a

α α α α α α

ββ α

β β β β

β

α

β

Q Q

a QM

x x

x x

x

x

log

log

Ψ

Ψ

ββ αβ β β β

β∈ ⊂−∑ ∑ ( ) ′ ( )M

Q Qx

x xlog

F E Q S Q

a S Q

B

M

conv Q Q, ′( ) = ( ) − ( )

− ( )∈

∑ ∑

∑+

α αα

α αα

β β ββ

. ( )12

Page 13 of 17(page number not for citation purposes)

Page 14: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

Inner loop : minimize Fconv(Q, Q'), using single loopfixed point interations, yielding Qn+1.

The procedure is illustrated in figure 12.

We have described the double loop algorithm with thecase where all subclusters with negative Moebius numbersare bounded. However it is possible to bound convexentropy contributions of sub-clusters with positive Moe-bius numbers as well. This tends to sharpen the boundbecause then the bounding of the convex entropy termscounters the effect of bounding concave entropy terms. Anadvantage of this bound is that the inner loop iterationscheme becomes much simpler. This is the bound that wehave used for the simulations in this article. Even tighterbounds can be obtained by not bounding all concaveentropy terms such that Fconv is convex on the constraintsubset. We refer to [28] for more details on the specificconditions.

Applying the Cluster Variation Method to linkage analysisIn this section we describe how we apply the Cluster Var-iation Method to estimate LOD scores. We outline thealgorithm for the case that the pedigrees is not inbred,

which is the case for which we have performed simula-tions.

From the definition of the Bayesian network in equation3 it follows that the exact likelihood of the phenotypes Tconditional on the marker genotypes M can always berewritten as:

The first factor on the right hand side concerns the likeli-hood of the trait data given an inheritance vector vT on thetrait locus. The second factor is a distribution over traitlocus inheritance vectors, conditional on marker lociinheritance vectors, vM. The last term is the distributionover marker loci inheritance vectors conditional on themarker data M, the marker allele frequencies and therecombination frequencies θ specified by the marker map.Essentially this decomposition is possible because themarker and trait loci are connected only through the mei-osis indicators; the model can be viewed as a HiddenMarkov Model where the meiosis indicators v are the hid-den variables.

Outline The decomposition of the probability distribu-tion P (T|M, f, t, m, θ, λT) of equation 13 is central to ourapproach. We now give an outline of the algorithm andthen discuss each step of the algorithm in more detail.

1. The first step of the algorithm is to make the followingapproximation with the Cluster Variation Method:

where Qγ (xγ) are the approximate marginal distributionsover clusters γ ∈ U = B ∪ M, and αγ the correspondingMoebius numbers. The marginals Qγ are obtained withthe double loop algorithm described in the previous sec-tion. This step is performed only once; the trait locus hasno part in this approximation.

2. In the second step, the likelihood of trait data is com-puted for each location of the trait locus, using theapproximate distribution over inheritances on the markerloci:

for each position of the trait locus λT :

P

P

P

T

T

T

T | , , , , ,

, , ,

, ,

, , ,

M f t m

T G v f t

v vv v G G

θ

θ

λ

λ

( ) =( ) ×

( )

∑ |

|

T

T M

T M T M

PP v G M mM M, | , , ( )θ( ) 13

P QM MU

v G M m x, | , ( ), ,θ γγ

( ) ≈ ( )∈∏ γ

γa 14

Illustration of the double loop algorithmFigure 12Illustration of the double loop algorithm. At iteration n + 1, in the outer loop the convex upper bound Fconv (Q ,Qn+1) to FCVM(Q) is computed, touching FCVM (Q) at Qn. The unique minimum of the convex upper bound is reached using a single loop fixed point iteration scheme in the inner loop and is attained at Qn+1. At this point an outer loop and an inner loop of the double loop algorithm have been completed, and a new upper bound to FCVM (Q) at Q1 is computed in iteration n + 2.

Qnn+1

Q

cvmF (Q)

n+1convF (Q, Q )

convF (Q, Q )n

Page 14 of 17(page number not for citation purposes)

Page 15: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

where we have substituted equation 14 into equation 13.The CVM approximation expressed by the product inequation 14 has the consequence that if the pedigree isnot inbred, the calculations involved in step 15 can beperformed efficiently.

3. Finally, the LOD scores for each location λT are given byequation 4:

Step 1 Simulations indicate that approximating the fulldistribution P (T, M|f, t, m, θ, λT) with the Cluster Varia-tion Method can give bad results when the inheritanceimplied by the trait data T is very different from the inher-itance implied by the marker data M for a given locationof the trait locus. Therefore we choose to approximate thedistribution over marker loci inheritance vectorsP (vM, GM|M, m, θ) independently of the trait data T.

In terms of the conditional probability tables of the Baye-sian network, we have

Here the subscript l runs only over marker loci. This is themulti-locus Bayesian network described previously, butwithout the trait locus. We make the CVM approximation

Now consider for the remainder of this section the casewhere the set of basic clusters B consists of the clusters

where ci are the children in nuclear family i in the pedi-gree, πi are the parents in nuclear family i, G and v are thecorresponding genotype and meiosis nodes and (l, l + 1)represent two adjacent marker loci in the marker map. Thesubscript i runs over all nuclear families in the pedigree,and the subscript l = (1,..., LM - 1), with LM the number ofmarker loci. This is exactly cluster choice C1 of figure 5A.

Given the conditional probability tables that define theBayesian network in equation 16, the approximate mar-ginals can be obtained by minimizing the CVM freeenergy FCVM (Q) of equation 9 corresponding to the clus-ters defined in expression 17. The minimization is doneusing the double loop algorithm.

Step 2 In this step the likelihood of trait data is computedusing the approximate distribution over inheritance vec-tors vM on the marker loci. This computation entails thesummation

Because of factorization assumed by the CVM approxima-tion with the clusters defined in expression 17, this com-putation can be done efficiently. Suppose we would liketo calculate the likelihood of the trait data T for the casewhere the trait locus is located between the fourth andfifth marker:

...M3 → M4 → T ← M5 ← M6... . (19)

Given the marginals , we can derive a distribu-

tion over inheritance vectors vT on the trait locus. This is

possible because the meiosis events on the trait locus arenot directly observed, but only indirectly through the

observed trait data T. We define through the relation

Here Qα is the approximate marginal computed with CVM

only on the adjacent marker loci (l, l + 1) and

is the conditional probability table that

defines the coupling between the meiosis indicators in the

P

P

P

T

T

T

T M f t m

T G v f t

v v

v v G G

| , , , , ,

, , ,, , ,

θ

θ

λ

λ

( ) ≈( ) ×

( )

∑ |

| , ,

T

T M

T M T M

QQaγ γ

γ

γ x( )∏ , ( )15

LOD λT

T

f t m

T M f t m

T f t

, , ,

, , , ,

,

θ

θ λ

( ) =

( )( )

⎣⎢⎢

⎦⎥⎥

log,

.10P

P

P P

P P

M M il

il

li

il

il

l l il

v G M m M G

v v G

,

|

F,NF

, | , |

, ,

θ

θ

( ) = ( )× ( )

∏∏∈

−−

11 ||

| . ( )

( )v G

G m

il

il

li

l

li

P

,NF

F

π( )× ( )

∏∏

∏∏∈

∈il 16

P QM MU

v G M m x, ,, ,θ γγ

( ) ≈ ( )∈∏ γ

γa

αi,ll l l l l

i i i i i i= { }+G v G v Gc c c, , ,G ,c

+1 +1π π, , ( )l 1 17

P

P

P Q

T

T

Ta

T M f t m

T G v f t

v v

v v G G

, , , , ,

, , ,, , ,

θ

θ γγ

λ

λ

( ) ≈

( )

( )

∑ T

T M

T M T M

, , xxγγ

( )∏ . ( )18

QUl lγ ∈ 4 , 5

Qα~

Q

Q P

l l l l

l l l lil

il

i

α

αα

v G v G

v G v G v v

, , ,

, , , | , .~

+ +

+ + +

( ) ≡

( ) ( )

1 1

1 1 1 θ∏∏

P il

ilv v| ,+( )1 θ

Page 15 of 17(page number not for citation purposes)

Page 16: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

Bayesian network. As both of these terms are known,

together they define . We can now define a distribu-

tion over trait locus inheritance vectors as follows:

Summing over all states of the trait locus meiosis indica-

tors { } yields again Qα:

As a result we have an effective distribution over trait locusinheritance vectors defined by the following product. Forthe example where the trait locus is located betweenmarker 4 and marker 5, it is given by:

Note that the marginal distributions of the intersectionsare the unprimed Qβ, since we have equality 20 and thefact that the trait meiosis nodes are not contained in anyintersection of the basic clusters α ∈ B. If the pedigree isnot inbred, this product defines a proper probability dis-tribution.

The summation in equation 18 now becomes

where we have defined l (λT) as the pair of markers flank-ing the trait locus to the left and right. We now observethat if the pedigree is not inbred, the summation involvedin equation 21 can be performed efficiently with the junc-tion tree algorithm. If the pedigree is too inbred, then thislast step can be done using an additional CVM approxi-mation.

Step 3 The last step is straightforward once the likelihoodof the trait has been computed for every location of thetrait locus in the second step.

A heuristic for detecting inaccurate approximationsThe approach we have outlined here allows for a heuristicthat indicates whether the approximation of the trait datalikelihood conditional on the approximate marginalsover marker loci is not accurate. We compute the likeli-hood of the trait data for a given location of the trait locusfrom the marginals on the marker loci flanking the traitlocus. However, if the trait locus is located at the exactposition of a marker, there are two possibilities for themarker loci l (λT). Suppose the trait locus is at marker l3,then one could take either the marginals defined onmarker loci (l2, l3) or the marginals defined on marker loci(l3, l4). The likelihood of the trait data must be the samefor either choice. However this only holds if the approxi-mation is valid. Therefore, in the case that the trait locusis located at a marker, we compute the LOD score for bothoptions to detect a possible inconsistency, indicating aninaccurate approximation. Conversely, we cannot guaran-tee that if there is no inconcistency, the approximation isaccurate.

PreprocessingCurrently we apply three preprocessing steps.

1. The phase of the genotypes of the founders can beclamped on one marker locus as this does not change thelikelihood.

2. In this step, genotypic configurations (assignments ofalleles to the genotypes of the individuals) that are notconsistent with the observed marker alleles M, areremoved from the cluster potentials ψα, α ∈ B. First we runthe double loop algorithm on each marker locus seper-ately. Then some states xα in the cluster marginals Qα (xα)will be assigned zero probability, because the correspond-ing genotypic configuration is not consistent with themarker genotypes observed for that locus.

As an example consider a bi-allelic marker. If both parentshave genotype (1, 1), then the children cannot have thegenotype (1, 2). Consequently, any state in a cluster mar-ginal which corresponds to a child having genotype (1, 2)will have zero probability and does not contribute to thelikelihood. These states can therefore be removed fromthe potentials.

3. Nodes that are not in any intersection of the basic clus-ters B can be integrated out from the potentials ψα beforerunning the double loop algorithm. If individuals are notgenotyped this can give substantial reductions in the

Qα~

Q

Q

P

l l TT

l l l l

il

iT

α

α

θ

’ , , , ,

, , ,

,

v G v v G

v G v G

v v

l l+ +1 1 λ

λ

( ) ≡

( ) ×+ +1 1

, TT iT( ) ( )+∏ P i

lT

i

v v 1, .θ ,λ

viT

Q

Q

T

l l T l lT

l l l l

α

α

’ , , , ,

, , , . ( )

v

v∑ + +

+ +

( ) =

( )G v v G

v G v G

1 1

1 1 20

λ

P

Q Q

T Ml l

Ml l

T

MB l ll

v v G M m

x x

, , , , ,

( ).

, ,

,

4 5 4 5

4 54

θ

α α β ββα

λ( ) ≈

( )∈∈∏

,,l5∏

P

P

Q

T

T T

T Ml T

T Ml T

T M,f,t,m, ,

T G v f,t

x

v v ,G ,G

θ λ

α α

( ) ≈

( ) ×( ) ( )

∑ , ,,

λ λ

(( ) ( )∈∈ ( )( )∏∏ QMB l Tl T

β ββα

x , ( )λλ

21

Page 16 of 17(page number not for citation purposes)

Page 17: BMC Bioinformatics BioMed Central - COREBMC Bioinformatics Proceedings Open Access The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees Cornelis A Albers,

BMC Bioinformatics 2006, 7:S1

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

number of states per cluster marginal that have to bestored in memory.

Authors' contributionsCAA developed the method, performed the simulationsand prepared the manuscript. MARL and HJK co-devel-oped the method and supervised the project.

Notes1It is also possible to choose the parameterization

, which corresponds to reversing the

direction of the links. This choice is equivalent.

2We explicitly do not say that Q is also a probability dis-tribution, because normalization of this product is notguaranteed. This is the price that is paid for having a trac-table optimization problem of the form of equation 9.The reason that normalization of the product cannot beguaranteed is that the clusters α ∈ B are defined on subsetsof variables that are not disjoint. Thus computing the nor-malization constant of the product of marginals is as com-plex as computing the exact likelihood P(e). It is possibleto define a factorization in terms of disjoint subsets of var-iables. This factorization can be guaranteed to be normal-ized, since normalization of the marginal distributions ondisjoint subsets of variables ensures normalization of theproduct of the marginal distributions. Such an approxi-mation does not fit into the framework of the Cluster Var-iation Method. Exactly because the subsets of variables aredisjoint and correlations between variables in disjointsubsets are neglected, the approximation tends to be lesspowerful.

3By following the links (ignoring direction) there is onlyone path from one node to another.

AcknowledgementsWe would like to thank Han Brunner and Tom Heskes for useful discus-sions.

References1. Morton NE: Sequential tests for the detection of linkage. Am J

Hum Genet 1955, 7:277-318.2. Lander ES, Green P: Construction of multilocus genetic linkage

maps in humans. Proc Natl Acad Sci U S A 1987, 84:2363-2367.3. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES: Parametric and

non-parametric linkage analysis: A unified multipointsapproach. Am J Hum Genet 1996, 58:1347-1363.

4. Elston RC, Stewart J: A general model for the analysis of pedi-gree data. Hum Hered 1971, 21:523-542.

5. O'Connell JR: Rapid multipoint linkage analysis via inheritancevectors in the Elston-Stewart algorithm. Hum Hered 2001,51:226-240.

6. Lauritzen SL, Spiegelhalter D: Local computations with probabil-ities on graphical structures and their application to expertsystems. J Royal Statistical Society B 1988, 50:154-227.

7. Jensen : An introduction to Bayesian networks Edited by: Lewin LA. UCLPress; 1996.

8. Fishelson J, Geiger D: Exact genetic linkage computations forgeneral pedigrees. Bioinformatics 2002, 18:S189-S198.

9. Bethe HA: Proc Roy Soc London 1935, 150:552-575.10. Kikuchi R: A Theory of cooperative phenomena. Physical Review

1951, 81:988.11. An G: A note on the Cluster Variation Method. Journal of Sta-

tistical Physics 1988, 52:727-734.12. Morita T: Cluster Variation Method and Moebius inversion

formula. Journal of Statistical Physics 1990, 59:819-825.13. Yedidia JS, Freeman WT, Weiss Y: Generalized Belief Propaga-

tion. Advances in Neural Information Processing Systems 132001:689-695.

14. Yedidia JS, Freeman WT, Weiss Y: Constructing Free-Energyapproximations and Generalized Belief Propagation algo-rithms. IEEE Transactions on Information Theory 2005, 51:2282-2312.

15. Thompson EA, George AW: Discovering disease genes:multipoint linkage analyses via a new markov chain MonteCarlo approach. Statistical Science 2003, 18:515-535.

16. Lange K, Sobel E: Descent graphs in pedigree analysis: applica-tion to haplotyping, location scores, and marker-sharing sta-tistics. Am J Hum Genet 1996, 58:1323-1337.

17. Sobel E, Sengul H, Weeks DE: Multipoint estimation of identity-by-descent probabilities at arbitrary positions amongmarker loci on general pedigrees. Hum Hered 2001, 52:121-131.

18. Lange K, Goradia TM: An algorithm for automatic genotypeelimination. Am J Hum Genet 1987, 40:250-256.

19. Du FX, Hoeschele I: A note on algorithms for genotype andallele elimination in complex pedigrees with incompletegenotype data. Genetics 2000, 156:2051-2062.

20. Friedman N, Geiger D, Lotner N: Likelihood computation withvalue abstraction. Proceedings of Uncertainty in AI 2000:192-200.

21. Chavira M, Allen D, Darwiche A: Exploiting evidence in probabi-listic inference. Proceedings of Uncertainty in AI 2005:112-119.

22. Pearl J: Probabilistic reasoning in intelligent systems: networks of plausibleinference Morgan Kaufmann Publishers Inc; 1988.

23. Jensen C, Kong A: Blocking-Gibbs sampling for linkage analysisin large pedigrees with many loops. Am J Hum Genet 1999,65:885-902.

24. Thomas A, Abkevich V, Bansai A: Multilocus linkage analysis byblocked Gibbs sampling. Statistics and Computing 2000,10:259-269.

25. Sheehan NA, Lauritzen SL: Graphical models for genetic analy-ses. Statistical Science 2003, 4:489-514.

26. Rangarajan A, Yuille AL: The convex-concave principle. Advancesin Neural Information Processing Systems 14 2002:1033-1040.

27. Welling M, Teh Y: The Unified Propagation and Scaling algo-rithm. Advances in Neural Information Processing Systems 142002:953-960.

28. Heskes T, Albers CA, Kappen HJ: Approximate inference andconstrained optimization. Proceedings of Uncertainty in AI2003:313-320.

P v vil m

il m

l l( | , ), ,,

++

11θ

Page 17 of 17(page number not for citation purposes)