

Aalborg Universitet

Bayesian Model Comparison With the g-Prior

Nielsen, Jesper Kjær; Christensen, Mads Græsbøll; Cemgil, Ali Taylan; Jensen, Søren Holdt

Published in: IEEE Transactions on Signal Processing

DOI (link to publication from Publisher): 10.1109/TSP.2013.2286776

Publication date: 2014

Document Version: Accepted author manuscript, peer reviewed version

Link to publication from Aalborg University

Citation for published version (APA): Nielsen, J. K., Christensen, M. G., Cemgil, A. T., & Jensen, S. H. (2014). Bayesian Model Comparison With the g-Prior. IEEE Transactions on Signal Processing, 62(1), 225-238. https://doi.org/10.1109/TSP.2013.2286776

General rights: Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners, and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.
• You may freely distribute the URL identifying the publication in the public portal.

Take down policy: If you believe that this document breaches copyright please contact us at [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from vbn.aau.dk on: April 22, 2020


Bayesian Model Comparison with the g-Prior

Jesper Kjær Nielsen, Member, IEEE, Mads Græsbøll Christensen, Senior Member, IEEE, Ali Taylan Cemgil, Member, IEEE, and Søren Holdt Jensen, Senior Member, IEEE

Abstract—Model comparison and selection is an important problem in many model-based signal processing applications. Often, very simple information criteria such as the Akaike information criterion or the Bayesian information criterion are used despite their shortcomings. Compared to these methods, Djuric's asymptotic MAP rule was an improvement, and in this paper we extend the work by Djuric in several ways. Specifically, we consider the elicitation of proper prior distributions, treat the case of real- and complex-valued data simultaneously in a Bayesian framework similar to that considered by Djuric, and develop new model selection rules for a regression model containing both linear and non-linear parameters. Moreover, we use this framework to give a new interpretation of the popular information criteria and relate their performance to the signal-to-noise ratio of the data. By use of simulations, we also demonstrate that our proposed model comparison and selection rules outperform the traditional information criteria both in terms of detecting the true model and in terms of predicting unobserved data. The simulation code is available online.

Index Terms—Bayesian model comparison, Zellner's g-prior, AIC, BIC, asymptotic MAP.

I. INTRODUCTION

ESSENTIALLY, all models are wrong, but some are useful [1, p. 424]. This famous quote by Box accurately reflects the problem that scientists and engineers face when they analyse data originating from some physical process. As the exact description of a physical process is usually impossible due to the sheer amount of complexity or an incomplete knowledge, simplified and approximate models are often used instead. In this connection, model comparison and selection methods are vital tools for the elicitation of one or several models which can be used to make inference about physical quantities or to make predictions. Typical model selection problems are to find the number of non-zero regression parameters in linear regression [2]–[4], the number of sinusoids in a periodic signal [5]–[9], the orders of an autoregressive moving average (ARMA) process [10]–[15], and the number of clusters in a mixture model [16]–[18]. For several decades, a large variety

Manuscript received July 04, 2012; revised July 31, 2013; accepted October 08, 2013. Date of publication is October 22, 2013. Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

J. K. Nielsen and S. H. Jensen are with the Signal and Information Processing Section, Department of Electronic Systems, Aalborg University, 9220 Aalborg, Denmark (e-mail: {jkn,shj}@es.aau.dk).

M. G. Christensen is with the Audio Analysis Lab, Department of Architecture, Design & Media Technology, Aalborg University, 9220 Aalborg, Denmark (e-mail: [email protected]).

A. T. Cemgil is with the Department of Computer Engineering, Bogaziçi University, 34342 Bebek, Istanbul, Turkey (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSP.2013.2286776

of model comparison and selection methods have been developed (see, e.g., [3], [19]–[22] for an overview). These methods can basically be divided into three groups, with the first group being those methods which require an a priori estimate of the model parameters, the second group being those methods which do not require such estimates, and the third group being those methods in which the model parameters and model are estimated and detected jointly [15]. The widely used information criteria such as the Akaike information criterion (AIC) [23], the corrected AIC (AICc) [24], the generalised information criterion (GIC) [25], the Bayesian information criterion (BIC) [26], the minimum description length (MDL) [27], [28], the Hannan-Quinn information criterion (HQIC) [10], and the predictive least squares [29] belong to the first group of methods. The methods in the second group typically utilise a principal component analysis of the data by analysing the eigenvalues [11], [15], [30], the eigenvectors [31], [32], or the angles between subspaces [33]. In the third group, the Bayesian methods are found. Although these methods are widely used in the statistical community [3], [34]–[37], their use in the signal processing community has been limited (see, e.g., [7], [8], [14], [38] for a few notable exceptions) compared to the use of the information criteria. The main reasons for this are the high computational costs of running these algorithms and the difficulty of specifying proper prior distributions. A few approximate methods have therefore been developed circumventing most of these issues. Two examples of such approximate methods are the BIC [26] and the asymptotic maximum a posteriori (MAP) rule [39], [40].

The original BIC in [26] and the original MDL principle in [27] are identical in form, but they are derived using very different arguments [22, App. C]. Although this type of rule is one of the most popular model selection methods, it suffers from the fact that every model parameter contributes the same penalty to the overall model complexity penalty term in the model selection method. Djuric's asymptotic MAP rules [40] improve on this by accounting for the fact that the magnitude of the penalty should depend on the type of models and model parameters being used. For example, the frequency parameter of a sinusoidal signal is shown to contribute a three times larger penalty term than the sinusoidal amplitude and phase. The asymptotic MAP rules are derived in a Bayesian framework and are therefore sometimes also referred to as Bayesian information criteria [20], [41] when the name alludes to the underlying principle rather than the specific rule suggested in [26]¹. In order to obtain very simple expressions for the asymptotic MAP rules, Djuric uses asymptotic considerations and improper priors, and he also neglects lower order terms

¹In this paper, the terms MDL and MAP are therefore preferred over BIC.


during the derivations. The latter is a consequence of the use of improper priors.

In this paper, we extend the work by Djuric in several ways. First, we treat the difficult problem of eliciting proper and improper prior distributions on the model parameters. In this connection, we use a prior of the same form as Zellner's g-prior [42], discuss its properties, and re-parametrise it in terms of the signal-to-noise ratio (SNR) to facilitate a better understanding of it. Second, we treat real- and complex-valued signals simultaneously and propose a few new model selection rules, and third, we derive the most common information criteria in our framework. The latter is useful for assessing the conditions under which, e.g., the AIC and the MDL are accurate. As opposed to the various information criteria, which are generally derived from cross-validation using the Kullback-Leibler (KL) divergence, we analyse the model comparison problem in a Bayesian framework for numerous reasons [34], [35]; Bayesian model comparison is consistent under very mild conditions, naturally selects the simplest model which explains the data reasonably well (the principle of Occam's razor), takes model uncertainty into account for estimation and prediction, works for non-nested models, enables a more intuitive interpretation of the results, and is conceptually the same, regardless of the number and types of models under consideration. The two major disadvantages of Bayesian model comparison are that the computational cost of running the resulting algorithms may be too high, and that the use of improper and vague prior distributions only leads to sensible answers under certain circumstances. In this paper, we discuss and address both of these issues.

The paper is organised as follows. In Sec. II, we give an introduction to model comparison in a Bayesian framework and discuss some of the difficulties associated with the elicitation of prior distributions and the evaluation of the marginal likelihood. In Sec. III, we propose a general regression model consisting of both linear and non-linear parameters. For known non-linear parameters, we derive two model comparison algorithms in Sec. IV and give a new interpretation of the traditional information criteria. For unknown non-linear parameters, we also derive a model comparison algorithm in Sec. V. Through simulations, we evaluate the proposed model comparison algorithms in Sec. VI, and Sec. VII concludes this paper.

II. BAYESIAN MODEL COMPARISON

Assume that we observe some real- or complex-valued data

$$\mathbf{x} = \begin{bmatrix} x(t_0) & x(t_1) & \cdots & x(t_{N-1}) \end{bmatrix}^T \tag{1}$$

originating from some unknown model. Since we are unsure about the true model, a set of K candidate parametric models M_1, M_2, ..., M_K is elicited to be compared in the light of the data x. Each model M_k is parametrised by the model parameters θ_k ∈ Θ_k where Θ_k is the parameter space of dimension d_k. The relationship between the data x and the model M_k is given by the probability distribution with density² p(x|θ_k, M_k) which is called the observation model. When viewed as a function of the model parameters, the observation model is referred to as the likelihood function. The likelihood function plays an important role in statistics where it is used for parameter estimation. However, model selection cannot be solely based on comparing candidate models in terms of their likelihood, as a complex model can always be made to fit the observed data better than a simple model. The various information criteria are alternative ways of resolving this by introducing a term that penalises more complex models. This is a manifestation of the well-known Occam's razor principle which states that if two models explain the data equally well, the simplest model should always be preferred [43, p. 343].

In a Bayesian framework, the model parameters and the model are random variables with the pdf p(θ_k|M_k) and pmf p(M_k), respectively. We refer to these distributions as the prior distributions as they contain our state of knowledge before any data are observed. After observing data, we update our state of knowledge by transforming the prior distributions into the posterior pdf p(θ_k|x, M_k) and pmf p(M_k|x). The prior and posterior distributions for the model parameters and the model are connected by Bayes' theorem

$$p(\boldsymbol{\theta}_k|\mathbf{x},\mathcal{M}_k) = \frac{p(\mathbf{x}|\boldsymbol{\theta}_k,\mathcal{M}_k)\,p(\boldsymbol{\theta}_k|\mathcal{M}_k)}{p(\mathbf{x}|\mathcal{M}_k)} \tag{2}$$

$$p(\mathcal{M}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{M}_k)\,p(\mathcal{M}_k)}{p(\mathbf{x})} \tag{3}$$

where

$$p(\mathbf{x}|\mathcal{M}_k) = \int_{\Theta_k} p(\mathbf{x}|\boldsymbol{\theta}_k,\mathcal{M}_k)\,p(\boldsymbol{\theta}_k|\mathcal{M}_k)\,d\boldsymbol{\theta}_k \tag{4}$$

is called the marginal likelihood or the evidence. For model comparison, we often compare the odds of two competing models M_j and M_i. In this connection, we define the posterior odds which are given by

$$\frac{p(\mathcal{M}_j|\mathbf{x})}{p(\mathcal{M}_i|\mathbf{x})} = \mathrm{BF}[\mathcal{M}_j;\mathcal{M}_i]\,\frac{p(\mathcal{M}_j)}{p(\mathcal{M}_i)} \tag{5}$$

where the Bayes’ factor is given by

$$\mathrm{BF}[\mathcal{M}_j;\mathcal{M}_i] = \frac{p(\mathbf{x}|\mathcal{M}_j)}{p(\mathbf{x}|\mathcal{M}_i)} \triangleq \frac{m_j(\mathbf{x})}{m_i(\mathbf{x})} \tag{6}$$

and m_k(x) is an unnormalised marginal likelihood whose normalisation constant must be the same for all models. Working with m_k(x) rather than the normalised marginal likelihood p(x|M_k) is usually much simpler. Moreover, p(x|M_k) does not even exist if improper priors are used. We return to this in Sec. II-A. Since the prior and posterior distributions of the model are discrete, it is easy to find the posterior odds and the posterior distribution once the Bayes' factors are known. For example, we may rewrite the posterior distribution for the models in terms of the Bayes' factors as

$$p(\mathcal{M}_k|\mathbf{x}) = \frac{\mathrm{BF}[\mathcal{M}_k;\mathcal{M}_b]\,p(\mathcal{M}_k)}{\sum_{i=1}^{K} \mathrm{BF}[\mathcal{M}_i;\mathcal{M}_b]\,p(\mathcal{M}_i)} \tag{7}$$

²In this paper, we have used the generic notation p(·) to denote both a probability density function (pdf) over a continuous parameter and a probability mass function (pmf) over a discrete parameter.


where M_b is some user-selected base model which all other models are compared against. Therefore, the main computational challenge in Bayesian model comparison is to compute the unnormalised marginal likelihoods constituting the Bayes' factors for competing pairs of models. We return to this in Sec. II-B. The posterior distribution on the models may be used to select the most probable model. However, as the posterior distribution contains the probabilities of all candidate models, all models may be used to make inference about the unknown parameters or to predict unobserved data points. This is called Bayesian model averaging. For example, assume that we are interested in predicting a future data vector x_p using all models. The predictive distribution then has the density

$$p(\mathbf{x}_p|\mathbf{x}) = \sum_{k=1}^{K} p(\mathcal{M}_k|\mathbf{x})\,p(\mathbf{x}_p|\mathbf{x},\mathcal{M}_k)\,. \tag{8}$$

Thus, the model-averaged prediction is a weighted sum of the predictions from every model.
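Although the rules derived later in the paper are analytical, the bookkeeping in (5)–(8) is simple to carry out numerically. The following minimal Python sketch (not part of the original paper; all Bayes' factors and per-model predictions are hypothetical numbers) normalises log Bayes' factors against a base model into posterior model probabilities as in (7) and forms the model-averaged prediction of (8):

```python
# Minimal sketch of (7)-(8); the numeric inputs are made-up examples.
import numpy as np

def posterior_model_probs(log_bf, log_prior):
    """Eq. (7): p(M_k|x) from log Bayes' factors BF[M_k; M_b] against a
    common base model and log prior probabilities p(M_k)."""
    log_post = log_bf + log_prior
    log_post -= np.max(log_post)           # guard against overflow
    w = np.exp(log_post)
    return w / np.sum(w)

# Three hypothetical models compared against model 1 as the base model.
log_bf = np.array([0.0, 2.3, -1.2])        # ln BF[M_k; M_1]
log_prior = np.log(np.full(3, 1.0 / 3.0))  # uniform prior p(M_k) = 1/K
probs = posterior_model_probs(log_bf, log_prior)

# Eq. (8): the model-averaged prediction is the posterior-weighted sum of
# the per-model predictive means (arbitrary example values here).
pred_per_model = np.array([1.0, 1.4, 0.8])
pred_averaged = np.dot(probs, pred_per_model)
print(probs, pred_averaged)
```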

A. On the Use of Improper Prior Distributions

Like Djuric [39], [40], we might be tempted to use improper prior distributions when we have no or little prior information before observing any data. Whereas this usually works for inference about model parameters, it usually leads to indeterminate Bayes' factors. To see this, let the prior distribution on the model parameters of the k'th model have the joint density p(θ_k|M_k) = c_k⁻¹ h(θ_k|M_k) where

$$c_k = \int_{\Theta_k} h(\boldsymbol{\theta}_k|\mathcal{M}_k)\,d\boldsymbol{\theta}_k$$

is the normalisation constant. In the limit c_k → ∞, the prior distribution is said to be improper. An example of a popular improper prior pdf is h(θ_k|M_k) = 1 so that p(θ_k|M_k) ∝ 1 where ∝ denotes proportionality. The posterior distribution on the model parameters has the pdf

$$p(\boldsymbol{\theta}_k|\mathbf{x},\mathcal{M}_k) = \frac{p(\mathbf{x}|\boldsymbol{\theta}_k,\mathcal{M}_k)\,p(\boldsymbol{\theta}_k|\mathcal{M}_k)}{p(\mathbf{x}|\mathcal{M}_k)} \tag{9}$$

$$= \frac{p(\mathbf{x}|\boldsymbol{\theta}_k,\mathcal{M}_k)\,h(\boldsymbol{\theta}_k|\mathcal{M}_k)}{\int_{\Theta_k} p(\mathbf{x}|\boldsymbol{\theta}_k,\mathcal{M}_k)\,h(\boldsymbol{\theta}_k|\mathcal{M}_k)\,d\boldsymbol{\theta}_k}\,. \tag{10}$$

Thus, provided that the integral

$$p(\mathbf{x}|\mathcal{M}_k) = \int_{\Theta_k} p(\mathbf{x}|\boldsymbol{\theta}_k,\mathcal{M}_k)\,h(\boldsymbol{\theta}_k|\mathcal{M}_k)\,d\boldsymbol{\theta}_k \tag{11}$$

converges, the posterior pdf p(θ_k|x, M_k) is proper even for an improper prior distribution. For two competing models M_j and M_i, the Bayes' factor is

$$\mathrm{BF}[\mathcal{M}_j;\mathcal{M}_i] = \frac{c_i}{c_j}\,\frac{p(\mathbf{x}|\mathcal{M}_j)}{p(\mathbf{x}|\mathcal{M}_i)}\,. \tag{12}$$

The ratio p(x|M_j)/p(x|M_i) is well-defined if the posterior distributions on the model parameters θ_j and θ_i are proper. For proper prior distributions, the scalars c_i and c_j are finite, and the Bayes' factor is therefore well-defined. However, for improper prior distributions, the Bayes' factor is in general indeterminate. Specifically, for the improper prior distribution with h(θ_j|M_j) = h(θ_i|M_i) = 1, it can be shown that [44]

$$\frac{c_i}{c_j} = \begin{cases} 0\,, & d_j > d_i \\ 1\,, & d_j = d_i \\ \infty\,, & d_j < d_i \end{cases} \tag{13}$$

where d_j and d_i are the numbers of model parameters in θ_j and θ_i, respectively. That is, the simplest model is always preferred over more complex models, regardless of the information in the data. This phenomenon is known as Bartlett's paradox³ [45]. Due to Bartlett's paradox, the general rule is that one should use proper prior distributions for model comparison. However, there exists one important exception to this rule which we consider below. From (12), we also see that vague prior distributions may give misleading answers. For example, a vague distribution such as a normal distribution with a very large variance leads to an arbitrarily large normalising constant c_k which strongly influences the Bayes' factor [35]. Therefore, the elicitation of proper prior distributions is very important for Bayesian model comparison.

1) Common Model Parameters: Consider the case where one model, the null model M_N, is a sub-model⁴ of all other candidate models. That is, M_N ⊆ M_k for k = 1, ..., K. We denote the null model parameters as θ_N and the model parameters of the k'th model as θ_k = [θ_N^T ψ_k^T]^T where (·)^T denotes matrix transposition. The prior distribution on θ_k now has the pdf

$$p(\boldsymbol{\theta}_k|\mathcal{M}_k) = p(\boldsymbol{\psi}_k|\boldsymbol{\theta}_N,\mathcal{M}_k)\,p(\boldsymbol{\theta}_N|\mathcal{M}_k)\,. \tag{14}$$

If the null model parameters θ_N and the additional parameters ψ_k are orthogonal⁵, then knowledge of the true model does not change the knowledge about θ_N, and we therefore have that p(θ_N|M_k) = p(θ_N|M_N) [35], [37]. Thus, using the prior pdf p(θ_N|M_N) = c_b⁻¹ h(θ_N|M_N), the Bayes' factor is

$$\mathrm{BF}[\mathcal{M}_k;\mathcal{M}_N] = \frac{\int_{\Theta_k} p(\mathbf{x}|\boldsymbol{\theta}_k,\mathcal{M}_k)\,p(\boldsymbol{\psi}_k|\boldsymbol{\theta}_N,\mathcal{M}_k)\,h(\boldsymbol{\theta}_N|\mathcal{M}_N)\,d\boldsymbol{\theta}_k}{\int_{\Theta_b} p(\mathbf{x}|\boldsymbol{\theta}_N,\mathcal{M}_N)\,h(\boldsymbol{\theta}_N|\mathcal{M}_N)\,d\boldsymbol{\theta}_N} \tag{16}$$

which is proper if the posterior distribution on the null model parameters and the prior distribution with pdf p(ψ_k|θ_N, M_k) are proper. That is, the Bayes' factor is well-defined since c_b = c_i = c_j even if an improper prior distribution is selected on the null model parameters, provided that they are orthogonal to the additional model parameters ψ_k.

B. Computing the Marginal Likelihood

As alluded to earlier, the main computational difficulty in computing the posterior distribution on the models is the evaluation of the marginal likelihood in (4). The integral may not have a closed-form solution, and direct numerical evaluation may be infeasible if the number of model parameters is too large. Numerous solutions to this problem have been proposed, and they can broadly be dichotomised

³Bartlett's paradox is also called Lindley's paradox, Jeffreys' paradox, and various combinations of the three names.

⁴Instead of the null model, the full model, which contains all other candidate models, can also be used [4].

⁵If one set of parameters θ_N is orthogonal to another set of parameters ψ_k, the Fisher information matrix of the joint parameter vector θ_k = [θ_N^T ψ_k^T]^T is block diagonal. That is,

$$\mathcal{I}(\boldsymbol{\theta}_k) = \mathcal{I}(\boldsymbol{\theta}_N,\boldsymbol{\psi}_k) = \begin{bmatrix} \mathcal{I}(\boldsymbol{\theta}_N) & \mathbf{0} \\ \mathbf{0} & \mathcal{I}(\boldsymbol{\psi}_k) \end{bmatrix}. \tag{15}$$


into stochastic methods and deterministic methods. In the stochastic methods, the integral is evaluated using numerical sampling, also known as Monte Carlo techniques [46]. Popular techniques are importance sampling [47], Chib's methods [48], [49], reversible jump Markov chain Monte Carlo [50], and population Monte Carlo [51]. An overview and comparison of several methods are given in [52]. An advantage of the stochastic methods is that they can in principle generate exact results. However, it might be difficult to assess the convergence of the underlying stochastic integration algorithm. On the other hand, the deterministic methods can only generate approximate results since they are based on analytical approximations which make the evaluation of the integral in (4) possible. These methods are also sometimes referred to as variational Bayesian methods [53], and a simple and widely used example of these methods is the Laplace approximation [54]. In order to derive the original BIC and the asymptotic MAP rule, and since the Laplace approximation is used later in this paper, we briefly describe it here.

1) The Laplace Approximation: Denote the integrand of an integral such as in (4) by f(ξ_k) where ξ_k = [Re(θ_k^T) Im(θ_k^T)]^T is a vector of d_k real parameters with support Ξ_k. Moreover, suppose there exists a suitable one-to-one transformation ξ_k = h(φ_k) such that the logarithm of the integrand

$$q(\boldsymbol{\varphi}_k) = \left|\frac{\partial \mathbf{h}(\boldsymbol{\varphi}_k)}{\partial \boldsymbol{\varphi}_k}\right| f\big(\mathbf{h}(\boldsymbol{\varphi}_k)\big) \tag{17}$$

can be accurately approximated by the second-order Taylor expansion around a mode φ̂_k of q(φ_k). That is,

$$\ln q(\boldsymbol{\varphi}_k) \approx \ln q(\hat{\boldsymbol{\varphi}}_k) + \frac{1}{2}(\boldsymbol{\varphi}_k - \hat{\boldsymbol{\varphi}}_k)^T \mathbf{H}(\hat{\boldsymbol{\varphi}}_k)(\boldsymbol{\varphi}_k - \hat{\boldsymbol{\varphi}}_k) \tag{18}$$

where

$$\mathbf{H}(\boldsymbol{\varphi}_k) = \frac{\partial^2 \ln q(\boldsymbol{\varphi}_k)}{\partial \boldsymbol{\varphi}_k\,\partial \boldsymbol{\varphi}_k^T} \tag{19}$$

is the Hessian matrix. Under certain regularity conditions [40], the Laplace approximation is then given by

$$\int_{\Phi_k} q(\boldsymbol{\varphi}_k)\,d\boldsymbol{\varphi}_k \approx q(\hat{\boldsymbol{\varphi}}_k)\,(2\pi)^{d_k/2}\,\big|{-\mathbf{H}(\hat{\boldsymbol{\varphi}}_k)}\big|^{-1/2} \tag{20}$$

where Φ_k is the support of φ_k. The main difficulty in computing the Laplace approximation is to find a suitable parametrisation of the integrand so that the second-order Taylor expansion of ln q(φ_k) is accurate. If q(φ_k) consists of multiple, significant, and well-separated peaks, an integral can be approximated by a Laplace approximation to each peak at their respective modes [55].
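To make the recipe concrete, the following self-contained Python sketch applies (20) to a 1-D toy integrand with a known closed-form integral; the integrand is our own choice and does not appear in the paper:

```python
# Laplace approximation (20), d_k = 1, for q(phi) = exp(N(ln(phi) - phi))
# on (0, inf), whose exact integral is Gamma(N+1)/N^(N+1). Toy example only.
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

N = 20
ln_q = lambda phi: N * (np.log(phi) - phi)

phi_hat = 1.0                   # mode: d/dphi ln q = N(1/phi - 1) = 0
H = -N / phi_hat**2             # second derivative of ln q at the mode

laplace = np.exp(ln_q(phi_hat)) * np.sqrt(2.0 * np.pi / -H)    # eq. (20)
exact = np.exp(gammaln(N + 1) - (N + 1) * np.log(N))           # Gamma form
numeric = quad(lambda p: np.exp(ln_q(p)), 0.0, np.inf)[0]
print(laplace, exact, numeric)  # the agreement improves as N grows
```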

2) The Original BIC and the Asymptotic MAP: The original BIC [26] and the asymptotic MAP rule [40] are based on the Laplace approximation with h(·) being the identity function so that

$$f(\boldsymbol{\xi}_k) = q(\boldsymbol{\varphi}_k) = p(\mathbf{x}|\boldsymbol{\xi}_k,\mathcal{M}_k)\,p(\boldsymbol{\xi}_k|\mathcal{M}_k)\,. \tag{21}$$

By neglecting terms of order O(1) and assuming a flat prior around ξ̂_k, the marginal likelihood in the asymptotic MAP rule is

$$\int_{\Xi_k} f(\boldsymbol{\xi}_k)\,d\boldsymbol{\xi}_k \approx p(\mathbf{x}|\hat{\boldsymbol{\xi}}_k,\mathcal{M}_k)\,\big|{-\mathbf{H}(\hat{\boldsymbol{\varphi}}_k)}\big|^{-1/2}\,. \tag{22}$$

In the MAP rule, the determinant of the observed information matrix −H(φ̂_k) is evaluated using asymptotic considerations, and the asymptotic result therefore depends on the specific structure of H(φ̂_k), the number of data points, and the SNR [41]. For the original BIC, however, this determinant is assumed to grow linearly in the sample size N so that

$$\big|{-\mathbf{H}(\hat{\boldsymbol{\varphi}}_k)}\big| = \left|-\frac{N}{\alpha}\,\frac{\alpha}{N}\,\mathbf{H}(\hat{\boldsymbol{\varphi}}_k)\right| = \left(\frac{N}{\alpha}\right)^{d_k} O(1) \tag{23}$$

where α is an arbitrary constant. In the original BIC, α = 1, and the original BIC is therefore

$$\int_{\Xi_k} f(\boldsymbol{\xi}_k)\,d\boldsymbol{\xi}_k \approx p(\mathbf{x}|\hat{\boldsymbol{\xi}}_k,\mathcal{M}_k)\,N^{-d_k/2}\,, \tag{24}$$

but α can be selected arbitrarily, which we find unsatisfactory. In [40], Djuric shows that the MAP rule and the original BIC/MDL coincide for autoregressive models and sinusoidal models with known frequencies. However, he also shows that they differ for polynomial models, sinusoidal models with unknown frequencies, and chirped signal models.

III. MODEL COMPARISON IN REGRESSION MODELS

Bayesian model comparison as outlined in Sec. II is applicable to any model, but we have to work with a specific model to come up with specific algorithms for model comparison. In the rest of this paper, we therefore focus on regression models of the form

$$\mathcal{M}_k:\ \mathbf{x} = \mathbf{s}_k(\boldsymbol{\phi}_k,\boldsymbol{\psi},\boldsymbol{\alpha}_k) + \mathbf{e} = \mathbf{B}\boldsymbol{\psi} + \mathbf{Z}_k(\boldsymbol{\phi}_k)\boldsymbol{\alpha}_k + \mathbf{e} \tag{25}$$

where s_k(φ_k, ψ, α_k) and e form a Wold decomposition of the real- or complex-valued data x into a predictable part and a non-predictable part, respectively. Since the model parameters are treated as random variables, the predictable part s_k(φ_k, ψ, α_k) is also stochastic like the non-predictable part. All models include the same null model

$$\mathcal{M}_N:\ \mathbf{x} = \mathbf{B}\boldsymbol{\psi} + \mathbf{e} \tag{26}$$

where B and ψ are a known N × l_N system matrix and a known or unknown vector of l_N linear parameters, respectively. Usually, the predictable part of the null model is either taken to be a vector of ones, so that ψ acts as an intercept, or not present at all. In the latter case, the null model is simply the noise-only model. The various candidate models differ in terms of the l_k linear parameters in the vector α_k and the N × l_k system matrix Z_k(φ_k), which is parametrised by the ρ_k real-valued and non-linear parameters in the vector φ_k. These non-linear parameters may be either known, unknown, or not present at all. We discuss the first and latter case in Sec. IV and the case of unknown non-linear parameters in Sec. V. Without loss of generality, we assume that the columns of B and Z_k(φ_k) are orthogonal to each other so that ψ has the same interpretation in all models and can therefore be assigned an improper prior if ψ is unknown. If the columns of B and Z_k(φ_k) are not orthogonal to each other, s_k(φ_k, ψ, α_k) can be re-parametrised so that the columns of the two system matrices are orthogonal [56]. We focus on the regression model in (25) for several reasons. First of all, many common signal models


used in signal processing can be written in the form of (25). Examples of such models are the linear regression model, the polynomial regression model, the autoregressive signal model, the sinusoidal model, and the chirped signal model, and these signal models were also considered by Djuric in [40]. Second, the regression model in (25) is analytically tractable and therefore results in computational algorithms with a tractable complexity. Moreover, the analytical tractability facilitates insight into, e.g., the various information criteria. Finally, the regression model in (25) can be viewed as an approximation to more complex models [3].

A. Elicitation of Prior Distributions

In the Bayesian framework, the unknown parameters are random variables. In addition to specifying a distribution on the noise vector, we therefore also have to elicit prior distributions on these unknown parameters. The elicitation of prior distributions is a controversial aspect of Bayesian statistics as it is often argued that subjectivity is introduced into the analysis. We here take a more practical view of this philosophical problem and consider the elicitation as a consistent and explicit way of stating our assumptions. In addition to the philosophical issue, we also face two practical problems in the context of eliciting prior distributions for model comparison. First, if we assume that l_k ≤ L, we can select a subset of columns from Z_k(φ_k) in K = 2^L different ways. A careful elicitation of the prior distribution for the model parameters in each model is therefore infeasible if L is too large, and we therefore prefer to do the elicitation in a more generic way. Second, even if we have only vague prior knowledge, the use of improper or vague prior distributions in an attempt to be objective may lead to bad or non-sensible answers [35]. As we discussed in Sec. II, this approach usually works for making inference about model parameters, but may lead to Bartlett's paradox for model selection.

1) The Noise Distribution: In order to deduce the observation model, we have to select a model for the non-predictable part e of the model in (25). As it is purely stochastic, it must have zero mean, and we assume that it has a finite variance. As advocated by Jaynes and Bretthorst [57]–[59], we select the distribution which maximises the entropy under these constraints. It is well known that this distribution is the (complex) normal distribution with pdf

$$p(\mathbf{e}|\sigma^2) = \left[r\pi\sigma^2\right]^{-N/r} \exp\left(-\frac{\mathbf{e}^H\mathbf{e}}{r\sigma^2}\right) \tag{27}$$

$$= \begin{cases} \mathcal{CN}(\mathbf{e};\mathbf{0},\sigma^2\mathbf{I}_N)\,, & r = 1 \\ \mathcal{N}(\mathbf{e};\mathbf{0},\sigma^2\mathbf{I}_N)\,, & r = 2 \end{cases} \tag{28}$$

where (·)^H denotes conjugate matrix transposition, I_N is the N × N identity matrix, and r is either 1 if x is complex-valued or 2 if x is real-valued. To simplify the notation, we use the non-standard notation N_r(·) to refer to either the complex normal distribution with pdf CN(·) for r = 1 or the real normal distribution with pdf N(·) for r = 2. It is important to note that the noise variance σ² is a random variable. As opposed to the case where it is simply a fixed but unknown quantity, the noise distribution marginalised over this random noise variance is able to model noise with heavy tails and is robust towards outliers. Another important observation is that (28) does not explicitly model any correlations in the noise. However, including correlation constraints in the elicitation of the noise distribution lowers the entropy of the noise distribution, which is therefore more informative [58, Ch. 7], [59]. This leads to more accurate estimates when there is genuine prior information about the correlation structure. However, if nothing is known about the correlation structure, the noise distribution in (28) is the best choice since it is the least informative distribution and is thus able to capture every possible correlation structure in the noise [59], [60].

The Gaussian assumption on the noise implies that the observed data are distributed as

$$p(\mathbf{x}|\boldsymbol{\alpha}_k,\boldsymbol{\psi},\boldsymbol{\phi}_k,\sigma^2,\mathcal{M}_k) = \mathcal{N}_r\big(\mathbf{x};\ \mathbf{B}\boldsymbol{\psi} + \mathbf{Z}_k(\boldsymbol{\phi}_k)\boldsymbol{\alpha}_k,\ \sigma^2\mathbf{I}_N\big)\,. \tag{29}$$

The Fisher information matrix (FIM) for this observation model is derived in App. A and given by (79). The block diagonal structure of the FIM means that the common parameters ψ and σ² are orthogonal to the additional model parameters and can therefore be assigned improper prior distributions.

2) The Noise Variance: Since the noise variance is a common parameter in all models and orthogonal to all other parameters, it can be assigned an improper prior. The Jeffreys' prior p(σ²) = (σ²)⁻¹ is a widely used improper prior for the noise variance which we also adopt in this paper. Its popularity primarily stems from the fact that the prior is invariant under transformations of the form σ^m for all m ≠ 0. Thus, the Jeffreys' prior encodes the same prior knowledge whether we parametrise our model in terms of the noise variance σ², the standard deviation σ, or the precision parameter λ = σ⁻².

3) The Linear Parameters: Since we have assumed that B^H Z_k(φ_k) = 0, the linear parameters ψ of the null model are orthogonal to the remaining parameters. We can therefore use the improper prior distribution with pdf p(ψ) ∝ 1 for ψ. This prior is often used for location parameters as it is translation invariant. As the dimension of the vector α_k of linear parameters varies between models, a proper prior distribution must be assigned to it. For linear regression models, Zellner's g-prior given by [42]

$$p(\boldsymbol{\alpha}_k|g,\sigma^2,\boldsymbol{\phi}_k,\mathcal{M}_k) = \mathcal{N}_r\big(\boldsymbol{\alpha}_k;\ \mathbf{0},\ g\sigma^2[\mathbf{Z}_k^H(\boldsymbol{\phi}_k)\mathbf{Z}_k(\boldsymbol{\phi}_k)]^{-1}\big) \tag{30}$$

has been widely adopted since it leads to analytically tractable marginal likelihoods and is easy to understand and interpret [4]. The g-prior can be interpreted as the posterior distribution on α_k arising from the analysis of a conceptual sample x_0 = 0 given the non-linear parameters φ_k, a uniform prior on α_k, and a scaled variance gσ² [61]. Given φ_k, the covariance matrix of the g-prior also coincides with a scaled version of the inverse Fisher information matrix. Consequently, a large prior variance is assigned to parameters which are difficult to estimate. We can also make a physical interpretation of the scalar g when the null model is the noise-only model. In this case, the mean of the prior on the average signal-to-noise ratio


(SNR) is [62]

$$\mathrm{E}[\eta|g,\mathcal{M}_k] = l_k g/N\,. \tag{31}$$

Moreover, this value is also the mode of the prior on the average SNR in dB [62].
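As a quick illustration of (30)–(31), the following Python sketch (our own toy construction, not from the paper) draws α_k from the g-prior for a random real-valued system matrix and checks the mean of the average SNR empirically:

```python
# Toy check of the SNR interpretation (31) of the g-prior (30), r = 2.
# Z is a random matrix purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, l_k, g, sigma2 = 200, 3, 50.0, 2.0
Z = rng.standard_normal((N, l_k))

# Covariance of the g-prior: g * sigma^2 * (Z^T Z)^{-1}.
cov = g * sigma2 * np.linalg.inv(Z.T @ Z)

alphas = rng.multivariate_normal(np.zeros(l_k), cov, size=5000)
s = alphas @ Z.T                            # each row is one draw of Z alpha
etas = np.sum(s**2, axis=1) / (N * sigma2)  # average SNR of each draw
print(etas.mean(), l_k * g / N)             # both close to 0.75 here
```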

If the hyperparameter g is treated as a fixed but unknown quantity, its value must be selected carefully. In, e.g., [2], [4], [63], the consequences of selecting various fixed choices of g have been analysed. In [4], [64], the hyperparameter g was also treated as a random variable and integrated out of the marginal likelihood, thus avoiding the selection of a particular value for it. For the prior distribution on g, a special case of the beta prime or inverted beta distribution with pdf

$$p(g|\delta) = \frac{\delta - r}{r}\,(1 + g)^{-\delta/r}\,,\qquad \delta > r\,, \tag{32}$$

was used. The hyperparameter δ should be selected in the interval r < δ ≤ 2r [4]. Besides having some desirable analytical properties, p(g|δ) reduces to the Jeffreys' prior and the reference prior for a linear regression model when δ = r [65]. However, since this prior is improper, it can only be used when the prior probability of the null model is zero.

4) The Non-linear Parameters: The elicitation of the prior distribution on the non-linear parameters φ_k is hard to do in general. In this paper, we therefore treat the case of non-linear parameters with a uniform prior of the form

$$p(\boldsymbol{\phi}_k|\mathcal{M}_k) = W_k^{-1}\,I_{\Phi_k}(\boldsymbol{\phi}_k) \tag{33}$$

where I_{Φ_k}(·) is the indicator function on the support Φ_k and W_k is the normalisation constant. This uniform prior is often used for the non-linear parameters of sinusoidal and chirped signal models.

5) The Models: For the prior on the model, we select a uniform prior of the form p(M_k) = K⁻¹ I_K(k) where K = {1, 2, ..., K}. For a finite number of models, however, it is easy to use a different prior in our framework through (7).

B. Bayesian Inference

So far, we have elicited our probability model consisting of the observation model in (29) and the prior distributions on the model parameters. These distributions constitute the integrand of the integral representation of the marginal likelihood in (4), and we now evaluate this integral. After some algebra, the integrand can be rewritten as

$$\begin{aligned}
&p(\mathbf{x}|\boldsymbol{\alpha}_k,\boldsymbol{\psi},\boldsymbol{\phi}_k,\sigma^2,\mathcal{M}_k)\,p(\boldsymbol{\alpha}_k|g,\sigma^2,\boldsymbol{\phi}_k,\mathcal{M}_k)\,p(\boldsymbol{\psi})\,p(\sigma^2)\,p(g|\delta)\,p(\boldsymbol{\phi}_k|\mathcal{M}_k)\\
&\propto \mathcal{N}_r\big(\boldsymbol{\alpha}_k;\ g\mathbf{m}_k/(1+g),\ g\sigma^2[\mathbf{Z}_k^H(\boldsymbol{\phi}_k)\mathbf{Z}_k(\boldsymbol{\phi}_k)]^{-1}/(1+g)\big)\\
&\quad\times \mathcal{N}_r\big(\boldsymbol{\psi};\ \mathbf{m}_N,\ \sigma^2[\mathbf{B}^H\mathbf{B}]^{-1}\big)\,\text{Inv-G}\left(\sigma^2;\ \frac{N - l_N}{r},\ \frac{N\hat{\sigma}_k^2}{r}\right)\\
&\quad\times \frac{m_N(\mathbf{x})}{(1+g)^{l_k/r}}\left(\frac{\hat{\sigma}_N^2}{\hat{\sigma}_k^2}\right)^{(N-l_N)/r} p(g|\delta)\,p(\boldsymbol{\phi}_k|\mathcal{M}_k)
\end{aligned} \tag{34}$$

where Inv-G denotes the inverse gamma distribution. Moreover, we have defined

$$\mathbf{m}_k \triangleq [\mathbf{Z}_k^H(\boldsymbol{\phi}_k)\mathbf{Z}_k(\boldsymbol{\phi}_k)]^{-1}\mathbf{Z}_k^H(\boldsymbol{\phi}_k)\,\mathbf{x} \tag{35}$$

$$\mathbf{m}_N \triangleq (\mathbf{B}^H\mathbf{B})^{-1}\mathbf{B}^H\mathbf{x} \tag{36}$$

$$\hat{\sigma}_k^2 \triangleq \frac{\mathbf{x}^H\big(\mathbf{I}_N - \mathbf{P}_B - \tfrac{g}{1+g}\mathbf{P}_Z\big)\mathbf{x}}{N} \tag{37}$$

$$m_N(\mathbf{x}) \triangleq \frac{\Gamma\big((N - l_N)/r\big)}{(N\pi\hat{\sigma}_N^2)^{(N-l_N)/r}\,|\mathbf{B}^H\mathbf{B}|^{1/r}} \tag{38}$$

where P_B and P_Z are the orthogonal projection matrices of B and Z_k(φ_k), respectively, and σ̂_k² is asymptotically equal to the maximum likelihood (ML) estimate of the noise variance in the limit, σ̂²_ML = lim_{g→∞} σ̂_k². The estimate σ̂_N² is the estimated noise variance of the null model, and it is defined as σ̂_k² for P_Z = 0. Finally, m_N(x) is the unnormalised marginal likelihood of the null model. The linear parameters and the noise variance are now easily integrated out of the marginal likelihood in (4). Doing this, we obtain

$$p(\mathbf{x}|g,\boldsymbol{\phi}_k,\mathcal{M}_k) \propto \frac{m_N(\mathbf{x})}{(1+g)^{l_k/r}}\left(\frac{\hat{\sigma}_N^2}{\hat{\sigma}_k^2}\right)^{(N-l_N)/r} \tag{39}$$

$$= \frac{m_N(\mathbf{x})\,(1+g)^{(N-l_N-l_k)/r}}{\big(1 + g[1 - R_k^2(\boldsymbol{\phi}_k)]\big)^{(N-l_N)/r}} \tag{40}$$

which we define as the unnormalised marginal likelihood m_k(x|g, φ_k) of model M_k given g and φ_k. Moreover,

$$R_k^2(\boldsymbol{\phi}_k) \triangleq \frac{\mathbf{x}^H\mathbf{P}_Z\,\mathbf{x}}{\mathbf{x}^H(\mathbf{I}_N - \mathbf{P}_B)\,\mathbf{x}} \tag{41}$$

resembles the coefficient of determination from classical linear regression analysis, where it measures how well the data set fits the regression. Whereas the linear parameters and the noise variance were easily integrated out of the marginal likelihood, the hyperparameter g and the non-linear parameters φ_k are not. In the next two sections, we therefore propose approximate ways of performing the integration over these parameters.
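For a concrete feel for (40)–(41), the following Python sketch (a toy example of our own; the paper works analytically) computes R_k² and the log of the unnormalised marginal likelihood ratio m_k(x|g, φ_k)/m_N(x) for a real-valued linear model with the noise-only null model:

```python
# Toy evaluation of (40)-(41) with r = 2, l_N = 0 (so P_B = 0).
import numpy as np

rng = np.random.default_rng(1)
N, l_k, r, g = 100, 4, 2, 20.0
Z = rng.standard_normal((N, l_k))
x = Z @ rng.standard_normal(l_k) + rng.standard_normal(N)

# Orthogonal projection onto the column space of Z.
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
R2 = (x @ PZ @ x) / (x @ x)                    # eq. (41) with P_B = 0

# log of eq. (40): ((N-l_N-l_k)/r) ln(1+g) - ((N-l_N)/r) ln(1+g(1-R2)).
l_N = 0
log_ratio = ((N - l_N - l_k) / r) * np.log1p(g) \
            - ((N - l_N) / r) * np.log1p(g * (1.0 - R2))
print(R2, log_ratio)
```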

IV. KNOWN SYSTEM MATRIX

In this section, we consider the case where there are either no non-linear parameters or they are known.

A. Fixed Choices of g

We first assume that g is a fixed quantity. From (6) and(39), the Bayes’ factor is therefore

BF[Mk;MN |g,φk] = (1 + g)−lk/r(σ2N

σ2k

)(N−lN )/r

. (42)

With a uniform prior on the models, it follows from (7) that the Bayes' factor is proportional to the posterior pmf p(M_k|x, g, φ_k) on the models. The model with the highest posterior probability is the solution to

$$\hat{k} = \arg\max_{k\in\mathcal{K}}\ p(\mathcal{M}_k|\mathbf{x},g,\boldsymbol{\phi}_k) \tag{43}$$

$$= \arg\max_{k\in\mathcal{K}}\ \big[-(N - l_N)\ln\hat{\sigma}_k^2 - l_k\ln(1+g)\big]\,. \tag{44}$$
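The rule (44) is straightforward to implement once σ̂_k² in (37) is available; the sketch below (toy data and a hypothetical nested model family, not from the paper) selects among nested real-valued linear models for a fixed g:

```python
# Toy implementation of the fixed-g rule (44) with r = 2 and the
# noise-only null model (l_N = 0); sigma2_k implements (37) with P_B = 0.
import numpy as np

rng = np.random.default_rng(4)
N, g = 80, 30.0
Ztot = rng.standard_normal((N, 6))
x = Ztot[:, :2] @ np.array([1.5, -1.0]) + rng.standard_normal(N)

def score(Z):
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    sigma2_k = x @ (np.eye(N) - (g / (1 + g)) * PZ) @ x / N   # eq. (37)
    return -N * np.log(sigma2_k) - Z.shape[1] * np.log1p(g)   # eq. (44)

models = {k: Ztot[:, :k] for k in range(1, 7)}   # nested candidate models
k_hat = max(models, key=lambda k: score(models[k]))
print(k_hat)                                      # 2 for most seeds
```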


As alluded to in Sec. III-A3, the value of g is vital in model selection. From (40), we see that if g → ∞, the Bayes' factor in (42) goes to zero. The null model is therefore always the most probable model, regardless of the information in the data (Bartlett's paradox). Another problem occurs if we assume that the least squares estimate m_k → ∞ or, equivalently, that R_k²(φ_k) → 1 so that the null model cannot be true. Although we would expect the Bayes' factor to also go to infinity, it converges to the constant (1+g)^{(N−l_N−l_k)/r}, and this is called the information paradox [4], [35], [66]. For these two reasons, the value of g should depend on the data in some way. A local empirical Bayesian (EB) estimate is a data-dependent estimate of g; it is the maximiser of the marginal likelihood w.r.t. g [4]

$$\hat{g}_k^{\mathrm{EB}} = \arg\max_{g\in\mathbb{R}_+}\ p(\mathbf{x}|g,\boldsymbol{\phi}_k,\mathcal{M}_k) \tag{45}$$

$$= \max\left(\frac{(N - l_N)\,R_k^2(\boldsymbol{\phi}_k) - l_k}{\big(1 - R_k^2(\boldsymbol{\phi}_k)\big)\,l_k},\ 0\right),\qquad l_k > 0 \tag{46}$$

where R_+ is the set of non-negative real numbers. This choice of g clearly resolves the information paradox. Inserting the EB estimate of g into (44) gives, for R_k²(φ_k) > l_k/N, the empirical BIC (e-BIC)

$$\hat{k} = \arg\max_{k\in\mathcal{K}}\ \big[-(N - l_N)\ln\hat{\sigma}_{\mathrm{ML}}^2 - l_k\ln(1 + \hat{g}_k^{\mathrm{EB}}) + (N - l_N)\ln\big(1 - l_k/(N - l_N)\big)\big] \tag{47}$$

whose form is similar to most of the information criteria. When the null model is the noise-only model so that l_N = 0, these information criteria can be written as [20]⁶

$$\hat{k} = \arg\max_{k\in\mathcal{K}}\ \big[-2N\ln\hat{\sigma}_{\mathrm{ML}}^2 - r\,\nu_k\,h(\nu_k,N)\big] \tag{48}$$

(48)

where νk is the number of real-valued independent param-eters in the model, and h(νk, N) is a penalty coefficient.For h(νk, N) = {2, lnN}, we get the AIC and the MDL,respectively. Note that νk is not always the same as thenumber of unknown parameters [30]. Moreover, if the penaltycoefficient does not depend on the candidate model, νk maybe interpreted as the number of independent parameters whichare not in all candidate models. In nested models with whiteGaussian noise and a known system matrix, this means thatthe noise variance parameter does not have to be countedas an independent parameter. Thus, selecting νk as eitherνk = 2lk/r + 1 or νk = 2lk/r does not change, e.g., theAIC and the MDL.

1) Interpretation of the e-BIC: To gain some insight intothe behaviour of the e-BIC, we here compare it to the AICand the MDL in the context of a linear regression model withN � lk and lN = 0. Under these assumptions, the penaltycoefficient of the e-BIC in (47) reduces to

h(νk, N) = ln(1 + gEBk )−N ln(1− lk/N)/lk (49)

≈ ln(1 + gEBk ) + 1 = ln(1 +NηEB

k /lk) + 1 (50)

6The cost function must be divided by 2r when the information criteriaare used for model averaging and comparison in the so-called multi-modalapproach [21].

where the approximation follows from the assumption thatN � lk so that ln(1−lk/N) ≈ −lk/N . From the approximatee-BIC (ae-BIC) in (50), several interesting observations canbe made. When the SNR is large enough to justify thatNηEB

k � lk, the e-BIC is basically a corrected MDL whichtakes the estimated SNR of the data into account. The penaltycoefficient grows with the estimated SNR and the chance ofover-fitting thus becomes very low, even under high SNR con-ditions where the AIC, but also the MDL tend to overestimatethe model order [67]. When the estimated SNR on the otherhand becomes so low that NηEB

k � lk, the e-BIC reduces to anAIC-like rule which has a constant penalty coefficient. In theextreme case of an estimated SNR of zero, the e-BIC reducesto the so-called no-name rule [20]. Interestingly, empiricalstudies [40], [68] have shown that the AIC performs betterthan the MDL when the SNR in the data is low, and thisis automatically captured by the e-BIC. The e-BIC thereforeperforms well across all SNR values as we have demonstratedfor the polynomial model in [62].

B. Integration over gAnother way to resolve the information paradox is by

treating g as a random variable and integrate it out of themarginal likelihood. For the prior distribution on g in (32)and the unnormalised marginal likelihood in (40), we obtainthe Bayes’ factor given by

BF[Mk;MN |φk] =

∫ ∞0

mk(x|g,φk)

mN (x)p(g|δ)dg

=δ − r

lk + δ − r 2F1

(N − lN

r, 1;

lk + δ

r;R2

k(φk)

)(51)

where 2F1 is the Gaussian hypergeometric function [69,p. 314]. When N is large or R2

k(φk) is very close to one,numerical and computational problems with the evaluation ofthe Gaussian hypergeometric function may be encountered[70]. From a computational point of view, it may thereforenot be advantageous to marginalise (51) w.r.t. g analytically.Instead, the Laplace approximation can be used as a simplealternative. Using the procedure outlined in Sec. II-B1 and theresults in App. B, we get that

BF[Mk;MN |φk]

≈ BF[Mk;MN |g,φk]g(δ − r)r(1 + g)δ/r

√2πγ(g|φk) (52)

where g = exp(τ) and γ(g|φk) can be found from (83) and(84), respectively, with v = 1, w = (N − lN − lk − δ)/r,and u = (N− lN )/r. Since the marginal posterior distributionon g does not have a symmetric pdf and in order to avoidedge effects near g = 0, the Laplace approximation was madefor the parametrisation τ = ln g [4]. This parametrisationsuggests that the posterior distribution on g is approximately alog-normal distribution. The model with the highest posteriorprobability can be found by maximising (52) w.r.t. the modelindex and this yields the Laplace BIC (lp-BIC)

k = arg maxk∈K

[− (N − lN ) ln σ2

k − lk ln(1 + g)

− δ ln(1 + g) + r ln g + (r/2) ln γ(g|φk)]

(53)


Rule    | h(ν_k, N)                               | Bayes' factor
--------|------------------------------------------|--------------
AIC     | 2                                        | See [21]
MDL     | ln N                                     | See [21]
MAP     | Model dependent                          | See [21]
e-BIC   | ln(1 + ĝ_k^EB) − N ln(1 − l_k/N)/l_k     | (42)
ae-BIC  | ln(1 + ĝ_k^EB) + 1                       | (42)
lp-BIC  | No short expression; see (53) instead.   | (52)

TABLE I: Penalty terms and Bayes' factors for regression models with a known system matrix and the noise-only model as the null model.

Compared to the maximisation in (44), (53) differs in terms of the estimate of g and the last three terms. These terms account for the uncertainty in our point estimate of g. In Table I, we have compared the proposed model selection and comparison rules with the AIC, the MDL, and the MAP rule for regression models with a known system matrix and l_N = 0.
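When the hypergeometric form of (51) is numerically fragile, one simple alternative to the Laplace approximation is direct 1-D quadrature in τ = ln g, cf. (56). The sketch below (our own illustration with example inputs, following the r = 2, l_N = 0 conventions of the earlier sketches) integrates the ratio (40) against the prior (32):

```python
# Numerical evaluation of the Bayes' factor (51) by quadrature in tau = ln g.
import numpy as np
from scipy.integrate import quad

def log_bf_given_g(g, R2, N, l_k, l_N, r):
    # log of m_k(x|g, phi_k)/m_N(x) from eq. (40)
    return ((N - l_N - l_k) / r) * np.log1p(g) \
           - ((N - l_N) / r) * np.log1p(g * (1.0 - R2))

def bayes_factor(R2, N, l_k, l_N=0, r=2, delta=3):
    # integrand of (51) after the change of variables tau = ln g:
    # BF(g) p(g|delta) dg = BF(e^tau) p(e^tau|delta) e^tau dtau, cf. (56)
    def integrand(tau):
        g = np.exp(tau)
        log_prior = np.log((delta - r) / r) - (delta / r) * np.log1p(g)
        return np.exp(log_bf_given_g(g, R2, N, l_k, l_N, r) + log_prior + tau)
    val, _ = quad(integrand, -30.0, 30.0, limit=200)
    return val

print(bayes_factor(R2=0.6, N=100, l_k=4))
```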

V. UNKNOWN NON-LINEAR PARAMETERS

In this section, the ρ_k real-valued and non-linear parameters φ_k are also assumed unknown, and they must therefore be integrated out of the marginal likelihood in (40). Since an analytical marginalisation is usually not possible, we here consider doing the joint integral over φ_k and g using the Laplace approximation with the change of variables τ = ln g. Dividing (40) by m_N(x) yields the following integral representation of the Bayes' factor in (6)

$$\mathrm{BF}[\mathcal{M}_k;\mathcal{M}_N] = \int_{-\infty}^{\infty}\int_{\Phi_k} q(\boldsymbol{\phi}_k,\tau)\,d\boldsymbol{\phi}_k\,d\tau \tag{54}$$

where the integrand is given by

$$q(\boldsymbol{\phi}_k,\tau) = \mathrm{BF}[\mathcal{M}_k;\mathcal{M}_N|\exp(\tau),\boldsymbol{\phi}_k]\,p(\boldsymbol{\phi}_k|\mathcal{M}_k)\,p(\tau|\delta) \tag{55}$$

with

$$p(\tau|\delta) = \exp(\tau)\,p(g|\delta)\big|_{g=\exp(\tau)}\,. \tag{56}$$

According to the procedure outlined in Sec. II-B1, we need to find the mode and Hessian of ln q(φ_k, τ) to approximate the integrand by a normal pdf. For the uniform prior on φ_k in (33), the mode of ln q(φ_k, τ) w.r.t. φ_k is given by

$$\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}} = \arg\max_{\boldsymbol{\phi}_k\in\Phi_k} p(\mathbf{x}|g,\boldsymbol{\phi}_k,\mathcal{M}_k) = \arg\max_{\boldsymbol{\phi}_k\in\Phi_k} p(\mathbf{x}|\boldsymbol{\phi}_k,\mathcal{M}_k) = \arg\max_{\boldsymbol{\phi}_k\in\Phi_k} R_k^2(\boldsymbol{\phi}_k) = \arg\max_{\boldsymbol{\phi}_k\in\Phi_k} C_k(\boldsymbol{\phi}_k) \tag{57}$$

where we have defined

$$C_k(\boldsymbol{\phi}_k) \triangleq \mathbf{x}^H\mathbf{P}_Z\,\mathbf{x}\,. \tag{58}$$

Note that C_k(φ_k) does not depend on the hyperparameter g (and equivalently τ), so the MAP estimator φ̂_k^MAP is independent of the prior on g. Depending on the structure of Z_k(φ_k), it might be hard to perform the maximisation of C_k(φ_k). In App. C, we have therefore derived the first and second order differentials of an orthogonal projection matrix as these are useful in numerical optimisation algorithms for maximising C_k(φ_k). We also note in passing that the MAP estimator is

identical to the ML estimator for the non-linear regression model in (25). Evaluated at the mode φ̂_k^MAP, the Hessian matrix H(φ_k) is given by

$$\mathbf{H}(\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}}) = \frac{\exp(\tau)(N - l_N)}{rN[1 + \exp(\tau)]\hat{\sigma}_k^2}\,\mathbf{D} \tag{59}$$

where we have defined

$$\mathbf{D} \triangleq \left.\frac{\partial^2 C_k(\boldsymbol{\phi}_k)}{\partial\boldsymbol{\phi}_k\,\partial\boldsymbol{\phi}_k^T}\right|_{\boldsymbol{\phi}_k=\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}}}\,. \tag{60}$$

Using the results in App. C, the (n,m)'th element of D can be written as

$$[\mathbf{D}]_{nm} = 2\,\mathrm{Re}\Big\{\mathbf{w}^H\big[\boldsymbol{\Lambda}_{nm} - \mathbf{T}_n\mathbf{S}_k^{-1}\mathbf{Z}_k^H(\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}})\mathbf{T}_m - \mathbf{T}_m\mathbf{S}_k^{-1}\mathbf{Z}_k^H(\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}})\mathbf{T}_n\big]\mathbf{m}_k + \mathbf{w}^H\mathbf{T}_n\mathbf{S}_k^{-1}\mathbf{T}_m^H\mathbf{w} - \mathbf{m}_k^H\mathbf{T}_n^H(\mathbf{I}_N - \mathbf{P}_Z)\mathbf{T}_m\mathbf{m}_k\Big\} \tag{61}$$

where we have defined

$$\mathbf{w} \triangleq \mathbf{x} - \mathbf{Z}_k(\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}})\,\mathbf{m}_k \tag{62}$$

$$\mathbf{S}_k \triangleq \mathbf{Z}_k^H(\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}})\,\mathbf{Z}_k(\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}}) \tag{63}$$

$$\mathbf{T}_i \triangleq \left.\frac{\partial\mathbf{Z}_k(\boldsymbol{\phi}_k)}{\partial\phi_i}\right|_{\boldsymbol{\phi}_k=\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}}} \tag{64}$$

$$\boldsymbol{\Lambda}_{nm} \triangleq \left.\frac{\partial^2\mathbf{Z}_k(\boldsymbol{\phi}_k)}{\partial\phi_n\,\partial\phi_m}\right|_{\boldsymbol{\phi}_k=\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}}}\,. \tag{65}$$

As we demonstrate in Sec. VI, the value of [D]_{nm} can often be approximated by only the last term in (61).

Since φ̂_k^MAP does not depend on the value of τ, the mode and second-order derivative of ln q(φ_k, τ) w.r.t. τ are the same as in Sec. IV-B and can be found in App. B with v = 1, w = (N − l_N − l_k − δ)/r, and u = (N − l_N)/r. Thus, the Laplace approximation of the Bayes' factor in (54) is

$$\mathrm{BF}[\mathcal{M}_k;\mathcal{M}_N] \approx \mathrm{BF}[\mathcal{M}_k;\mathcal{M}_N|\hat{g},\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}}]\,\frac{\hat{g}(\delta - r)}{r(1 + \hat{g})^{\delta/r}}\times W_k^{-1}(2\pi)^{(\rho_k+1)/2}\sqrt{\gamma(\hat{g}|\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}})}\,\big|{-\mathbf{H}(\hat{\boldsymbol{\phi}}_k^{\mathrm{MAP}})}\big|^{-1/2}\,. \tag{66}$$

When q(φ_k, τ) consists of multiple, significant, and well-separated peaks, the integral in (54) can be approximated by a Laplace approximation to each peak at their respective modes [55]. In this case, the Bayes' factor in (66) becomes a sum over each of these peaks. Since it is not obvious how the number of peaks should be selected in a computationally simple manner, we consider only one peak in the simulations in Sec. VI. Although this is often a crude approximation for low SNRs, we demonstrate that other model selection rules are still outperformed.
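The first step of this procedure, the MAP/ML estimate (57), often reduces to a simple grid search. The sketch below (a toy single-sinusoid example of our own; the paper's general case uses the differentials of App. C) maximises C_k(φ) = x^H P_Z x over a frequency grid:

```python
# Grid-search MAP estimate (57) of a single sinusoidal frequency,
# using C_k(phi) = x^H P_Z x from (58); toy signal parameters.
import numpy as np

rng = np.random.default_rng(2)
N, omega_true = 64, 0.9
n = np.arange(N)
x = 2.0 * np.exp(1j * omega_true * n) + rng.standard_normal(N) \
    + 1j * rng.standard_normal(N)

def C(omega):
    z = np.exp(1j * omega * n)                 # Z_k(phi_k) with l_k = 1
    # For a single column, x^H P_Z x = |z^H x|^2 / (z^H z) = |z^H x|^2 / N
    return np.abs(np.vdot(z, x))**2 / N

grid = np.linspace(0.05, np.pi, 4096)
omega_map = grid[np.argmax([C(w) for w in grid])]
print(omega_map)                                # close to omega_true = 0.9
```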

VI. SIMULATIONS

We demonstrate the applicability of our model comparison algorithms by three simulation examples. In the first example, we compare the penalty coefficient of our proposed algorithms with the penalty coefficient of the AIC, the MDL, and the MAP rule. In the second and third examples, we consider model


Fig. 1. Interpretation of the various information criteria for l_k = 5. The plots show the penalty coefficient h(l_k, N) as a function of R_k²(φ_k) and the number of data points N. In the left plot, N = 95, and in the right, R_k²(φ_k) = 0.95 for the e-BIC, the ae-BIC, and the lp-BIC.

Fig. 3. Prediction performance versus the SNR (top row) and versus the prediction step at an SNR of 0 dB (bottom row) for a periodic signal model. In the plots in the left column, only the model with the highest posterior probability was used, whereas all models were used in the plots in the right column.

comparison in models containing unknown non-linear parameters. Specifically, we first consider a periodic signal model which contains a single non-linear parameter, the fundamental frequency, and then consider the uniform linear array model which contains multiple non-linear parameters, the directions of arrival (DOAs). Similar simulations comparing the performance of the e-BIC in (42) and the lp-BIC in (52) to other order selection rules for linear and polynomial models can be found in [4] and [62], respectively. The simulation code can be found at http://kom.aau.dk/~jkn/publications/publications.php.

A. Penalty Coefficient

In Sec. IV-A1, we considered the interpretation of the AIC [23] and the MDL [26], [27] for a regression model with a known system matrix when the null model is the noise-only model and N ≫ l_k. For the linear regression model, the MDL and the MAP are equivalent [40]. Here, we give some more insight by use of a simple simulation example in which the penalty coefficients of the AIC, the MDL/MAP, the e-BIC, the approximate e-BIC (ae-BIC), and the lp-BIC methods

were found as a function of the coefficient of determination R_k²(φ_k) and the number of data points N. We fixed the number of linear parameters to l_k = 5, and Fig. 1 shows the results. In the left plot, the penalty coefficients h(l_k, N) were computed as a function of R_k²(φ_k) for N = 95. Since the AIC and the MDL/MAP do not depend on the data, their penalty coefficients were constant. On the other hand, the penalty coefficients of the e-BIC, the ae-BIC, and the lp-BIC are data dependent and increased with the coefficient of determination. In the right plot, the penalty coefficients h(l_k, N) were computed as a function of the number of data points N for R_k²(φ_k) = 0.95. Note that the MDL/MAP had the same trend as the e-BIC, the ae-BIC, and the lp-BIC, although shifted. The vertical distance between these penalties depends on the particular value of R_k²(φ_k). In Fig. 1, we set R_k²(φ_k) = 0.95, but if R_k²(φ_k) ≈ 0.648 was selected instead, the e-BIC and the MDL/MAP would coincide for large values of N.


Fig. 2. The first three plots show the percentage of correctly detected, overestimated, and underestimated numbers of harmonic components versus the SNR for the harmonic signal model. The last plot shows the RMSE of the estimated number of harmonic components.

B. Periodic Signal Model

We consider a complex periodic signal model given by

$$\mathcal{M}_k:\ x(n) = \sum_{i=1}^{L} \alpha_i \exp(j\omega i n)\,I_{\mathcal{L}_k}(i) + e(n) \tag{67}$$

for n = 0, 1, ..., N − 1 where I_{L_k}(i) indicates whether the

i'th harmonic component is included in the model M_k or not. This model is a special case of the model in (25) with the null model being the noise-only model, φ_k = ω, and α_k being the complex amplitudes. Since no closed-form solution exists for the posterior distribution on the models for the periodic signal model, we consider the approximation in (66), which we refer to as the Laplace (LP) method. The method is compared to the AIC and the asymptotic MAP rule by Djuric, with the latter having the penalty coefficient in (48) given by [9]

$$h(l_k,N) = \ln N + \frac{3}{2 l_k}\ln N\,. \tag{68}$$

For the periodic signal model, the Hessian matrix in (59) is a scalar which can be approximated by [71]

$$H(\hat{\omega}_k^{\mathrm{MAP}}) \approx -\frac{\hat{g}N(N^2 - 1)\sum_{i=1}^{L}\big|[\mathbf{m}_k]_i\big|^2\,i^2\,I_{\mathcal{L}_k}(i)}{6(1 + \hat{g})\hat{\sigma}_k^2} \tag{69}$$

where ω̂_k^MAP is the ML estimate of the fundamental frequency.

Fig. 4. The first three plots show the percentage of correctly detected, overestimated, and underestimated numbers of sources versus the SNR for the uniform linear array model. The last plot shows the RMSE of the estimated number of sources.

In the simulations, we set the maximum number of harmonic components to L = 10 and considered K = 2^L − 1 = 1023 models. Zero prior probability was assigned to the noise-only model as the model comparison performance was evaluated against the SNR. Moreover, this permits the use of the improper prior p(g|δ = r = 1) since g is now a common parameter in all models. For each SNR from −10 dB to 20 dB in steps of 1 dB, we ran 1000 Monte Carlo runs. As recommended in [21], a data vector x consisting of N = 50 samples was generated in each run by first randomly selecting a model from a uniform prior on the models. For this model, we then randomly selected the fundamental frequency and the phases of the complex amplitudes from uniform distributions on the intervals [(2π)⁻¹, 2π/max(L_k)] and [0, 2π], respectively. The amplitudes of the harmonics in the selected model were all set to one. Finally, a complex noise vector was generated and normalised so that the data had the desired SNR. Besides generating a data vector, we also generated a vector x_p of unobserved data for n = N, N + 1, ..., 2N − 1.

In Fig. 2, the percentage of correctly detected models, overestimated models, and underestimated models, and the root-mean-squared error (RMSE) of the estimated model versus the SNR are shown. The RMSE is defined as

$$E(\mathcal{L}_k) = \sqrt{\sum_{i=1}^{L}\big(I_{\mathcal{L}_k}(i) - I_{\hat{\mathcal{L}}_k}(i)\big)^2} \tag{70}$$


where L̂_k is the set containing the harmonic numbers of the most likely model. For an SNR below 0 dB, the LP method and the asymptotic MAP rule had a similar performance and were better than the AIC. For SNRs above 5 dB, the LP method also outperformed the asymptotic MAP rule. In terms of the RMSE, similar observations are made, except that the asymptotic MAP rule performs worse than the other methods for low SNRs. However, it should be noted that the percentage of correctly detected models is not necessarily the best way of benchmarking model selection methods. As exemplified in [21], the true model does not always give the best prediction performance, and it may therefore be advantageous to either over- or underestimate the model order. Using the same Monte Carlo setup as above, we have therefore also investigated the prediction performance, and the results are shown in Fig. 3. In the plots in the left column, only the single model with the largest posterior probability was used for making the predictions of the predictable part s_p, whereas all models were used as in (8) in the plots in the right column. The prediction based on a single model and on all models was the mean of p(x_p|x, M_k) and p(x_p|x), respectively, where the latter depends on the former as in (8) with

$$p(\mathbf{x}_p|\mathbf{x},\mathcal{M}_k) = \int_{\Theta_k} p(\mathbf{x}_p|\boldsymbol{\theta}_k,\mathcal{M}_k)\,p(\boldsymbol{\theta}_k|\mathbf{x},\mathcal{M}_k)\,d\boldsymbol{\theta}_k\,. \tag{71}$$

In the top row, the MSE of the total prediction error versus the SNR is shown, and in the bottom row, the MSE of the prediction error for each prediction step at an SNR of 0 dB is shown. In the four plots, the Oracle knew the true model but not the model parameters. From the four figures, we again see that the LP method outperformed the other methods, with the AIC being the overall worst. For low SNRs, we also see that the MSE of the prediction errors was significantly lower when model averaging was used. Moreover, the performance was also better than the Oracle's performance, and this demonstrates, as discussed above, that the true model does not always give the best prediction performance. For high SNRs, only the AIC performed slightly worse than the other methods, which performed almost as well as the Oracle. Moreover, there was basically no difference between the single- and multi-model predictions since a single model received all posterior probability.

C. Uniform Linear Array Signal Model

In the third and final simulation example, we consider the problem of estimating the number of narrowband source signals impinging on a uniform linear array (ULA) consisting of M calibrated sensors. For this problem, the model for the m'th sensor signal is given by [22, Ch. 6]

\mathcal{M}_k : \; y_m(n) = \sum_{i=1}^{l_k} s_i(n) \exp(-j\omega_i m) + e_m(n) \qquad (72)

for n = 0, 1, . . . , N − 1, where ω_i is the spatial frequency in radians per sample of the i'th source. The spatial frequency is related to the direction of arrival (DOA) θ_i of the source signal by

\omega_i = \frac{\omega_c d \sin(\theta_i)}{v} \qquad (73)

where ω_c, d, and v are the carrier frequency in radians per second, the sensor distance in meters, and the propagation speed in meters per second, respectively. The signal s_i(n) is the baseband signal of the i'th source. The signal model in (72) can be written in the form of (25) as

\mathrm{vec}(Y) = (I_N \otimes Z(\omega_k))\,\mathrm{vec}(S_k) + \mathrm{vec}(E) \qquad (74)

where vec(·) and ⊗ denote the vectorisation operator and the Kronecker product, respectively. The M × N matrices Y and E contain the observed sensor signals and the noise realisations, and the l_k × N matrix S_k contains the baseband signals. Finally, the M × l_k matrix Z(ω_k) contains the l_k steering vectors, with the (m, i)'th element given by exp(−jω_i(m−1)). As in the previous example, no closed-form expression exists for the posterior distribution on the models, and we therefore again consider the Laplace approximation in (66). By keeping only the last term of (61) and by making the approximation Z(ω_k)^H Z(ω_k) ≈ M I_{l_k}, the determinant of the negative of the Hessian matrix in (59) can be approximated by

|-H(\omega_k^{\mathrm{MAP}})| \approx \left( \frac{g M^3}{6(1+g)\sigma_k^2} \right)^{l_k} \prod_{i=1}^{l_k} \sum_{n=0}^{N-1} |s_i(n)|^2 \qquad (75)

where ω_k^{MAP} coincides with the maximum likelihood estimate of ω_k, which we have computed using the RELAX algorithm [72]. Using a Monte Carlo simulation consisting of 1000 runs for every SNR from −20 dB to 40 dB in steps of 2 dB, we evaluated the model detection performance for N = 100 snapshots and M = 10 sensors. As in the previous simulation, we generated the model parameters at random in every run, with the baseband signals being realisations from a complex-valued white Gaussian process. The true number of sources was either one, two, or three. In addition to comparing the proposed method to the AIC and the asymptotic MAP rule, we also compared it to two subspace-based methods which are often used in array processing: the MUSIC method using angles between subspaces (AbS) [33], [73] and the estimation error (ESTER) method [31] based on ESPRIT. Since neither of these methods is able to detect whether a source is present or not, the all-noise model was not included in the set of candidate models, which was set to consist of a maximum of K = 5 sources. Fig. 4 shows the results of the simulation. The proposed method (LP) performed better than the other rules for SNRs up to approximately 15 dB, where the asymptotic MAP rule achieved the same performance. For low SNRs, the AIC performed better than the asymptotic MAP rule. The MUSIC and ESTER methods performed well across all SNRs and only slightly worse than the proposed method. All methods except the AIC appear to be consistent order selection rules.
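To make the setup concrete, the following Python sketch (ours, with illustrative names such as simulate_ula) generates snapshots from the model (72)/(74) and evaluates the logarithm of the right-hand side of (75); the half-wavelength spacing ω_c d/v = π in the example is an assumption, not a setting reported in the paper.

```python
import numpy as np

def simulate_ula(omegas, N=100, M=10, noise_std=0.1, rng=None):
    """Draw ULA snapshots following (72)/(74) with white Gaussian baseband signals."""
    if rng is None:
        rng = np.random.default_rng()
    lk = len(omegas)
    # M x lk steering matrix; (m, i)'th element is exp(-j * omega_i * (m - 1))
    Z = np.exp(-1j * np.outer(np.arange(M), np.asarray(omegas)))
    S = (rng.standard_normal((lk, N)) + 1j * rng.standard_normal((lk, N))) / np.sqrt(2)
    E = noise_std * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    # Column-wise form of vec(Y) = (I_N ⊗ Z(omega_k)) vec(S_k) + vec(E)
    return Z @ S + E, S

def log_det_neg_hessian(S, g, sigma2, M):
    """Logarithm of the approximation (75) to |-H(omega_MAP)|."""
    lk = S.shape[0]
    energies = np.sum(np.abs(S) ** 2, axis=1)  # sum_n |s_i(n)|^2 for each source
    return lk * np.log(g * M ** 3 / (6 * (1 + g) * sigma2)) + np.log(energies).sum()

# Two sources; spatial frequencies via (73) with assumed half-wavelength spacing
omegas = np.pi * np.sin(np.deg2rad([10.0, 35.0]))
Y, S = simulate_ula(omegas, rng=np.random.default_rng(0))
print(log_det_neg_hessian(S, g=10.0, sigma2=0.01, M=10))
```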

VII. CONCLUSION

Model comparison and selection is a difficult and important problem, and a lot of methods have therefore been proposed. In this paper, we first gave an overview of how model comparison is performed for any model in a Bayesian framework. We also discussed the two major issues of doing the model comparison in a Bayesian framework, namely the elicitation of prior distributions and the evaluation of the


marginal likelihood. Specifically, we reviewed the conditions for using improper prior distributions, and we briefly discussed approximate numerical and analytical algorithms for evaluating the marginal likelihood. In the second part of the paper, we analysed a general regression model in a Bayesian framework. The model consisted of both linear and non-linear parameters, and we used and motivated a prior of the same form as Zellner's g-prior for this model. Many of the information criteria can be interpreted in a new light using this model with known non-linear parameters. These interpretations also gave insight into why the AIC often overestimates the model complexity for a high SNR, and why the MDL underestimates the model complexity for a low SNR. For unknown non-linear parameters, we proposed an approximate way of integrating them out of the marginal likelihood using the Laplace approximation, and we demonstrated through two simulation examples that our proposed model comparison and selection algorithm outperformed other algorithms such as the AIC, the MDL, and the asymptotic MAP rule, both in terms of detecting the true model and in making predictions.

APPENDIX A
FISHER INFORMATION MATRIX FOR THE OBSERVATION MODEL

Let γ denote a mixed parameter vector of complex-valued and real-valued parameters. Using the procedure in [74, App. 15C], it can be shown that the (n,m)'th element of the Fisher information matrix (FIM) for the normal distribution N_r(x; μ(γ), Σ(γ)) is given by

[I(\gamma)]_{nm} =
\frac{1}{r}\left(\frac{\partial \mu^*(\gamma)}{\partial \gamma_n^*}\right)^{T}
\Sigma^{-1}(\gamma)
\left(\frac{\partial \mu^*(\gamma)}{\partial \gamma_m^*}\right)^{*}
+ \frac{1}{r}\left(\frac{\partial \mu(\gamma)}{\partial \gamma_n^*}\right)^{H}
\Sigma^{-1}(\gamma)
\left(\frac{\partial \mu(\gamma)}{\partial \gamma_m^*}\right)
+ \frac{1}{r}\,\mathrm{tr}\!\left(\Sigma^{-1}(\gamma)\,
\frac{\partial \Sigma(\gamma)}{\partial \gamma_n^*}\,
\Sigma^{-1}(\gamma)
\left(\frac{\partial \Sigma(\gamma)}{\partial \gamma_m^*}\right)^{H}\right) . \qquad (76)

For the observation model in (29), the parameter vector is given by \gamma = [\psi^T \; \alpha_k^T \; \phi_k^T \; \sigma^2]^T, and the mean vector and covariance matrix are given by

\mu(\gamma) = B\psi + Z_k(\phi_k)\alpha_k \qquad (77)
\Sigma(\gamma) = \sigma^2 I_N \,. \qquad (78)

Computing the derivatives in (76) for the observation model in (29) yields the FIM given by

I(\gamma) = \frac{1}{\sigma^2}
\begin{bmatrix}
B^H B & 0 & 0 \\
0 & I(\alpha_k, \phi_k) & 0 \\
0 & 0 & \frac{N}{r\sigma^2}
\end{bmatrix} \qquad (79)

where

I(\alpha_k, \phi_k) =
\begin{bmatrix}
Z_k^H(\phi_k) Z_k(\phi_k) & Z_k^H(\phi_k) Q_k(\phi_k) \\
Q_k^H(\phi_k) Z_k(\phi_k) & \frac{2}{r}\,\mathrm{Re}\!\left(Q_k^H(\phi_k) Q_k(\phi_k)\right)
\end{bmatrix}

Q_k(\phi_k) \triangleq \frac{\partial \left(Z_k(\phi_k)\alpha_k\right)}{\partial \phi_k} \,.

Note that I(γ) is block diagonal, which follows from the assumption that B^H Z_k(φ_k) = 0.
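As a sanity check, the FIM in (79) can be assembled numerically; the sketch below is ours, uses illustrative names, and assumes the orthogonality B^H Z_k(φ_k) = 0 noted above so that the blocks decouple.

```python
import numpy as np
from scipy.linalg import block_diag

def fisher_information(B, Z, Q, sigma2, N, r=1):
    """Assemble the block-diagonal FIM of (79) for the model in (29)."""
    I_lin = B.conj().T @ B                      # block for the linear parameters psi
    # I(alpha_k, phi_k) block: cross terms between amplitudes and non-linear params
    I_af = np.block([
        [Z.conj().T @ Z,            Z.conj().T @ Q],
        [Q.conj().T @ Z, (2.0 / r) * np.real(Q.conj().T @ Q)],
    ])
    I_sigma = np.array([[N / (r * sigma2)]])    # scalar block for sigma^2
    return block_diag(I_lin, I_af, I_sigma) / sigma2
```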

APPENDIX B
LAPLACE APPROXIMATION WITH THE HYPER-G PRIOR

For the hyper-g prior in (32), the integral in (51) can, with the change of variables τ = ln g, be written in the form

\int_0^\infty g^{v-1}(1+g)^w \left[1 + g\left(1 - R_k^2(\phi_k)\right)\right]^{-u} dg
= \int_{-\infty}^{\infty} \exp(v\tau)\left(1 + \exp(\tau)\right)^w \left[1 + \exp(\tau)\left(1 - R_k^2(\phi_k)\right)\right]^{-u} d\tau \,.

Taking the derivative of the logarithm of the integrand and equating it to zero leads to the quadratic equation

0 = \alpha_\tau \exp(2\tau) + \beta_\tau \exp(\tau) + v \qquad (80)

where we have defined

\alpha_\tau \triangleq (1 - R_k^2(\phi_k))(v + w - u) \qquad (81)
\beta_\tau \triangleq (u - v) R_k^2(\phi_k) + 2v + w - u \,. \qquad (82)

For u − w > v, the only positive solution to this quadratic equation is

\tau = \ln\!\left(\frac{\beta_\tau + \sqrt{\beta_\tau^2 - 4\alpha_\tau v}}{-2\alpha_\tau}\right) \qquad (83)

which is the mode of the normal approximation to the integrand. The corresponding variance at this mode, with g = exp(τ), is

\gamma(g|\phi_k) = \left[\frac{g u \left(1 - R_k^2(\phi_k)\right)}{\left[1 + g\left(1 - R_k^2(\phi_k)\right)\right]^2} - \frac{g w}{(1+g)^2}\right]^{-1} . \qquad (84)
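The mode (83) and variance (84) translate directly into code; the following sketch (ours, with an illustrative function name) assumes u − w > v as stated above.

```python
import numpy as np

def laplace_mode_and_variance(R2, u, v, w):
    """Mode tau from (83) and variance (84) of the Laplace approximation,
    assuming u - w > v so the quadratic (80) has the stated positive root."""
    alpha = (1.0 - R2) * (v + w - u)                  # alpha_tau, (81)
    beta = (u - v) * R2 + 2.0 * v + w - u             # beta_tau, (82)
    tau = np.log((beta + np.sqrt(beta ** 2 - 4.0 * alpha * v))
                 / (-2.0 * alpha))                    # mode, (83)
    g = np.exp(tau)
    var = 1.0 / (g * u * (1.0 - R2) / (1.0 + g * (1.0 - R2)) ** 2
                 - g * w / (1.0 + g) ** 2)            # variance, (84)
    return tau, var
```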

APPENDIX C
DIFFERENTIALS OF A PROJECTION MATRIX

Let P = G(G^H G)^{-1} G^H denote an orthogonal projection matrix, and let S = G^H G denote an inner matrix product. The differential of S is then given by

dS = (dG)^H G + G^H (dG) \,. \qquad (85)

This result can be used to show that

dS^{-1} = -S^{-1}\left[(dG)^H G + G^H (dG)\right] S^{-1} \,, \qquad (86)

and that

dP = P^{\perp}(dG) S^{-1} G^H + G S^{-1} (dG)^H P^{\perp} \qquad (87)

where P^⊥ = I − P is the complementary projection of P. Let δ denote another differential operator. From the above results, we obtain after some algebra that

\delta(dP) = P^{\perp}(\delta(dG)) S^{-1} G^H + G S^{-1} (\delta(dG))^H P^{\perp}
+ P^{\perp}\left[(dG) S^{-1} (\delta G)^H + (\delta G) S^{-1} (dG)^H\right] P^{\perp}
- P^{\perp}\left[(\delta G) S^{-1} G^H (dG) + (dG) S^{-1} G^H (\delta G)\right] S^{-1} G^H
- G S^{-1}\left[(\delta G)^H G S^{-1} (dG)^H + (dG)^H G S^{-1} (\delta G)^H\right] P^{\perp}
- G S^{-1}\left[(dG)^H P^{\perp} (\delta G) + (\delta G)^H P^{\perp} (dG)\right] S^{-1} G^H \,.


REFERENCES

[1] G. E. P. Box and N. R. Draper, Empirical Model-Building and Response Surfaces. New York, NY, USA: Wiley, Jan. 1987.
[2] E. I. George and D. P. Foster, "Calibration and empirical Bayes variable selection," Biometrika, vol. 87, no. 4, pp. 731–747, Dec. 2000.
[3] M. Clyde and E. I. George, "Model uncertainty," Statist. Sci., vol. 19, no. 1, pp. 81–94, Feb. 2004.
[4] F. Liang, R. Paulo, G. Molina, M. A. Clyde, and J. O. Berger, "Mixtures of g priors for Bayesian variable selection," J. Amer. Statistical Assoc., vol. 103, pp. 410–423, Mar. 2008.
[5] L. Kavalieris and E. J. Hannan, "Determining the number of terms in a trigonometric regression," J. of Time Series Analysis, vol. 15, no. 6, pp. 613–625, Nov. 1994.
[6] B. G. Quinn, "Estimating the number of terms in a sinusoidal regression," J. of Time Series Analysis, vol. 10, no. 1, pp. 71–75, Jan. 1989.
[7] C. Andrieu and A. Doucet, "Joint Bayesian model selection and estimation of noisy sinusoids via reversible jump MCMC," IEEE Trans. Signal Process., vol. 47, no. 10, pp. 2667–2676, 1999.
[8] M. Davy, S. J. Godsill, and J. Idier, "Bayesian analysis of polyphonic western tonal music," J. Acoust. Soc. Am., vol. 119, no. 4, pp. 2498–2517, Apr. 2006.
[9] M. G. Christensen and A. Jakobsson, Multi-Pitch Estimation, B. H. Juang, Ed. San Rafael, CA, USA: Morgan & Claypool, 2009.
[10] E. J. Hannan and B. G. Quinn, "The determination of the order of an autoregression," J. Royal Stat. Soc., Series B, vol. 41, no. 2, pp. 190–195, 1979.
[11] G. Liang, D. M. Wilkes, and J. A. Cadzow, "ARMA model order estimation based on the eigenvalues of the covariance matrix," IEEE Trans. Signal Process., vol. 41, no. 10, pp. 3003–3009, Oct. 1993.
[12] B. Choi, ARMA Model Identification. New York, NY, USA: Springer-Verlag, Jun. 1992.
[13] S. Koreisha and G. Yoshimoto, "A comparison among identification procedures for autoregressive moving average models," International Statistical Review, vol. 59, no. 1, pp. 37–57, Apr. 1991.
[14] J. Vermaak, C. Andrieu, A. Doucet, and S. J. Godsill, "Reversible jump Markov chain Monte Carlo strategies for Bayesian model selection in autoregressive processes," J. of Time Series Analysis, vol. 25, no. 6, pp. 785–809, Nov. 2004.
[15] T. Cassar, K. P. Camilleri, and S. G. Fabri, "Order estimation of multivariate ARMA models," IEEE J. Sel. Topics Signal Process., vol. 4, no. 3, pp. 494–503, Jun. 2010.
[16] Z. Liang, R. Jaszczak, and R. Coleman, "Parameter estimation of finite mixtures using the EM algorithm and information criteria with application to medical image processing," IEEE Trans. Nucl. Sci., vol. 39, no. 4, pp. 1126–1133, Aug. 1992.
[17] C. E. Rasmussen, "The infinite Gaussian mixture model," in Adv. in Neural Inf. Process. Syst., 2000, pp. 554–560.
[18] M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain, "Simultaneous feature selection and clustering using mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1154–1166, Sep. 2004.
[19] C. R. Rao and Y. Wu, "On model selection," Institute of Mathematical Statistics Lecture Notes – Monograph Series, vol. 38, pp. 1–57, 2001.
[20] P. Stoica and Y. Selén, "Model-order selection: a review of information criterion rules," IEEE Signal Process. Mag., vol. 21, no. 4, pp. 36–47, Jul. 2004.
[21] P. Stoica, Y. Selén, and J. Li, "Multi-model approach to model selection," Digital Signal Process., vol. 14, no. 5, pp. 399–412, Sep. 2004.
[22] P. Stoica and R. L. Moses, Spectral Analysis of Signals. Englewood Cliffs, NJ, USA: Prentice Hall, May 2005.
[23] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Autom. Control, vol. 19, no. 6, pp. 716–723, Dec. 1974.
[24] C. M. Hurvich and C.-L. Tsai, "A corrected Akaike information criterion for vector autoregressive model selection," J. of Time Series Analysis, vol. 14, no. 3, pp. 271–279, May 1993.
[25] S. Konishi and G. Kitagawa, "Generalised information criteria in model selection," Biometrika, vol. 83, no. 4, pp. 875–890, Dec. 1996.
[26] G. Schwarz, "Estimating the dimension of a model," Ann. Stat., vol. 6, no. 2, pp. 461–464, Mar. 1978.
[27] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, Sep. 1978.
[28] ——, "Estimation of structure by minimum description length," Circuits, Systems, and Signal Process., vol. 1, no. 3, pp. 395–406, 1982.
[29] ——, "A predictive least-squares principle," IMA J. Math. Control Inf., vol. 3, no. 2–3, pp. 211–222, 1986.
[30] M. Wax and T. Kailath, "Detection of signals by information theoretic criteria," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 387–392, Apr. 1985.
[31] R. Badeau, B. David, and G. Richard, "A new perturbation analysis for signal enumeration in rotational invariance techniques," IEEE Trans. Signal Process., vol. 54, no. 2, pp. 450–458, Feb. 2006.
[32] J.-M. Papy, L. De Lathauwer, and S. Van Huffel, "A shift invariance-based order-selection technique for exponential data modelling," IEEE Signal Process. Lett., vol. 14, no. 7, pp. 473–476, Jul. 2007.
[33] M. G. Christensen, A. Jakobsson, and S. H. Jensen, "Sinusoidal order estimation using angles between subspaces," EURASIP J. on Advances in Signal Process., vol. 2009, pp. 1–11, Nov. 2009.
[34] J. O. Berger and L. R. Pericchi, "The intrinsic Bayes factor for model selection and prediction," J. Amer. Statistical Assoc., vol. 91, no. 433, pp. 109–122, Mar. 1996.
[35] ——, "Objective Bayesian methods for model selection: Introduction and comparison," Institute of Mathematical Statistics Lecture Notes – Monograph Series, vol. 38, pp. 135–207, 2001.
[36] L. Wasserman, "Bayesian model selection and model averaging," J. of Mathematical Psychology, vol. 44, no. 1, pp. 92–107, Mar. 2000.
[37] A. F. Deltell, "Objective Bayes criteria for variable selection," Ph.D. dissertation, Universitat de València, Valencia, Spain, 2011.
[38] P. M. Djuric and S. M. Kay, "Model selection based on Bayesian predictive densities and multiple data records," IEEE Trans. Signal Process., vol. 42, no. 7, pp. 1685–1699, Jul. 1994.
[39] P. M. Djuric, "A model selection rule for sinusoids in white Gaussian noise," IEEE Trans. Signal Process., vol. 44, no. 7, pp. 1744–1751, Jul. 1996.
[40] ——, "Asymptotic MAP criteria for model selection," IEEE Trans. Signal Process., vol. 46, no. 10, pp. 2726–2735, Oct. 1998.
[41] P. Stoica and P. Babu, "On the proper forms of BIC for model order selection," IEEE Trans. Signal Process., vol. 60, no. 9, pp. 4956–4961, Sep. 2012.
[42] A. Zellner, "On assessing prior distributions and Bayesian regression analysis with g-prior distributions," in Bayesian Inference and Decision Techniques. New York, NY, USA: Elsevier, 1986.
[43] D. J. C. MacKay, Information Theory, Inference & Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, Jun. 2002.
[44] R. Strachan and H. K. v. Dijk, "Improper priors with well defined Bayes' factors," Dept. of Econ., Univ. of Leicester, Leicester, U.K., Discussion Papers in Economics 05/4, 2005.
[45] C. P. Robert, The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York, NY, USA: Springer, May 2001.
[46] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed. New York, NY, USA: Springer-Verlag, Jul. 2004.
[47] C. Andrieu, A. Doucet, and C. P. Robert, "Computational advances for and from Bayesian analysis," Statist. Sci., vol. 19, no. 1, pp. 118–127, Feb. 2004.
[48] S. Chib, "Marginal likelihood from the Gibbs output," J. Amer. Statistical Assoc., vol. 90, no. 432, pp. 1313–1321, Dec. 1995.
[49] S. Chib and I. Jeliazkov, "Marginal likelihood from the Metropolis-Hastings output," J. Amer. Statistical Assoc., vol. 96, no. 453, pp. 270–281, Mar. 2001.
[50] P. Green, "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination," Biometrika, vol. 82, pp. 711–732, 1995.
[51] M. Hong, M. F. Bugallo, and P. M. Djuric, "Joint model selection and parameter estimation by population Monte Carlo simulation," IEEE J. Sel. Topics Signal Process., vol. 4, no. 3, pp. 526–539, Jun. 2010.
[52] C. Han and B. P. Carlin, "Markov chain Monte Carlo methods for computing Bayes factors: A comparative review," J. Amer. Statistical Assoc., vol. 96, no. 455, pp. 1122–1132, Sep. 2001.
[53] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, Aug. 2006.
[54] L. Tierney and J. B. Kadane, "Accurate approximations for posterior moments and marginal densities," J. Amer. Statistical Assoc., vol. 81, no. 393, pp. 82–86, Mar. 1986.
[55] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd ed. London, U.K.: Chapman & Hall/CRC, Jul. 2003.
[56] A. Zellner and A. Siow, "Posterior odds ratios for selected regression hypotheses," Trabajos de Estadística y de Investigación Operativa, vol. 31, pp. 585–603, 1980.
[57] E. T. Jaynes, "Prior probabilities," IEEE Trans. Syst. Sci. Cybern., vol. 4, no. 3, pp. 227–241, 1968.
[58] ——, Probability Theory: The Logic of Science, G. L. Bretthorst, Ed. Cambridge, U.K.: Cambridge Univ. Press, Apr. 2003.


[59] G. L. Bretthorst, "The near-irrelevance of sampling frequency distributions," in Max. Entropy and Bayesian Methods, 1999, pp. 21–46.
[60] E. T. Jaynes, "Bayesian spectrum and chirp analysis," in Maximum Entropy and Bayesian Spectral Analysis and Estimation Problems, C. R. Smith and G. J. Erickson, Eds. Dordrecht, Holland: D. Reidel Publishing Company, 1987, pp. 1–37.
[61] D. S. Bové and L. Held, "Hyper-g priors for generalized linear models," Bayesian Analysis, vol. 6, no. 3, pp. 387–410, 2011.
[62] J. K. Nielsen, M. G. Christensen, and S. H. Jensen, "Bayesian model comparison and the BIC for regression models," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2013, pp. 6362–6366.
[63] C. Fernández, E. Ley, and M. F. J. Steel, "Benchmark priors for Bayesian model averaging," J. Econometrics, vol. 100, no. 2, pp. 381–427, Feb. 2001.
[64] W. Cui and E. I. George, "Empirical Bayes vs. fully Bayes variable selection," J. Stat. Planning and Inference, vol. 138, no. 4, pp. 888–900, Apr. 2008.
[65] R. Guo and P. L. Speckman, "Bayes factor consistency in linear models," in The 2009 International Workshop on Objective Bayes Methodology, Jun. 2009.
[66] A. Zellner, "Comments on 'Mixtures of g-priors for Bayesian variable selection' by F. Liang, R. Paulo, G. Molina, M. A. Clyde and J. O. Berger," Jul. 2008.
[67] Q. Ding and S. M. Kay, "Inconsistency of the MDL: On the performance of model order selection criteria with increasing signal-to-noise ratio," IEEE Trans. Signal Process., vol. 59, no. 5, pp. 1959–1969, May 2011.
[68] Q. T. Zhang and K. M. Wong, "Information theoretic criteria for the determination of the number of signals in spatially correlated noise," IEEE Trans. Signal Process., vol. 41, no. 4, pp. 1652–1663, Apr. 1993.
[69] I. S. Gradshteyn, I. M. Ryzhik, and A. Jeffrey, Table of Integrals, Series, and Products. New York, NY, USA: Academic Press, 2000.
[70] R. W. Butler and A. T. A. Wood, "Laplace approximations for hypergeometric functions with matrix argument," Ann. Stat., vol. 30, no. 4, pp. 1155–1177, Aug. 2002.
[71] J. K. Nielsen, M. G. Christensen, and S. H. Jensen, "Default Bayesian estimation of the fundamental frequency," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 3, pp. 598–610, Mar. 2013.
[72] J. Li and P. Stoica, "Efficient mixed-spectrum estimation with applications to target feature extraction," IEEE Trans. Signal Process., vol. 44, no. 2, pp. 281–295, Feb. 1996.
[73] M. G. Christensen and J. K. Nielsen, "Joint direction-of-arrival and order estimation in compressed sensing using angles between subspaces," in Proc. IEEE Workshop on Stat. Signal Process., Jun. 2011, pp. 449–452.
[74] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Englewood Cliffs, NJ, USA: Prentice Hall PTR, Mar. 1993.

Jesper Kjær Nielsen (S'12–M'13) was born in Struer, Denmark, in 1982. He received the B.Sc., M.Sc. (Cum Laude), and Ph.D. degrees in electrical engineering from Aalborg University, Aalborg, Denmark, in 2007, 2009, and 2012, respectively.

He is currently with the Department of Electronic Systems, Aalborg University, and Bang & Olufsen as an industrial postdoctoral researcher. He has been a Visiting Scholar at the Signal Processing and Communications Laboratory, University of Cambridge, and at the Department of Computer Science, University of Illinois at Urbana-Champaign. His research interests include spectral estimation, (sinusoidal) parameter estimation as well as statistical and Bayesian methods for signal processing.

Mads Græsbøll Christensen (S'00–M'05–SM'11) was born in Copenhagen, Denmark, in March 1977. He received the M.Sc. and Ph.D. degrees in 2002 and 2005, respectively, from Aalborg University (AAU) in Denmark, where he is also currently employed at the Dept. of Architecture, Design & Media Technology as Professor in Audio Processing. At AAU, he is head of the Audio Analysis Lab, which conducts research in audio signal processing.

He was formerly with the Dept. of Electronic Systems, Aalborg University, and has been a Visiting Researcher at Philips Research Labs, ENST, UCSB, and Columbia University. He has published more than 100 papers in peer-reviewed conference proceedings and journals as well as 2 research monographs. His research interests include digital signal processing theory and methods with application to speech and audio, in particular parametric analysis, modeling, enhancement, separation, and coding.

Prof. Christensen has received several awards, including an ICASSP Student Paper Award, the Spar Nord Foundation's Research Prize for his Ph.D. thesis, a Danish Independent Research Council Young Researcher's Award, and the Statoil Prize, as well as grants from the Danish Independent Research Council and the Villum Foundation's Young Investigator Programme. He is an Associate Editor for the IEEE Transactions on Audio, Speech, and Language Processing and has previously served as an Associate Editor for IEEE Signal Processing Letters.

A. Taylan Cemgil (M’04) received his Ph.D. (2004)from Radboud University Nijmegen, the Nether-lands. Between 2004 and 2008 he worked as apostdoctoral researcher at Amsterdam Universityand the Signal Processing and CommunicationsLab., University of Cambridge, UK. He is currentlyan associate professor of Computer Engineeringat Bogaziçi University, Istanbul, Turkey. He is amember of the IEEE Machine Learning for SignalProcessing Technical Committee and an associateeditor of IEEE Signal processing Letters and Digital

Signal Processing. His research interests are in Bayesian statistical methods,approximate inference, machine learning and audio signal processing.

Søren Holdt Jensen (S'87–M'88–SM'00) received the M.Sc. degree in electrical engineering from Aalborg University, Aalborg, Denmark, in 1988, and the Ph.D. degree in signal processing from the Technical University of Denmark, Lyngby, Denmark, in 1995. Before joining the Department of Electronic Systems of Aalborg University, he was with the Telecommunications Laboratory of Telecom Denmark, Ltd, Copenhagen, Denmark; the Electronics Institute of the Technical University of Denmark; the Scientific Computing Group of Danish Computing Center for Research and Education (UNI•C), Lyngby; the Electrical Engineering Department of Katholieke Universiteit Leuven, Leuven, Belgium; and the Center for PersonKommunikation (CPK) of Aalborg University.

He is Full Professor and is currently heading a research team working in the area of numerical algorithms, optimization, and signal processing for speech and audio processing, image and video processing, multimedia technologies, and digital communications.

Prof. Jensen was an Associate Editor for the IEEE Transactions on Signal Processing, Elsevier Signal Processing, and the EURASIP Journal on Advances in Signal Processing, and is currently an Associate Editor for the IEEE Transactions on Audio, Speech and Language Processing. He is a recipient of a European Community Marie Curie Fellowship and a former Chairman of the IEEE Denmark Section and the IEEE Denmark Section's Signal Processing Chapter. He is a member of the Danish Academy of Technical Sciences and was in January 2011 appointed a member of the Danish Council for Independent Research—Technology and Production Sciences by the Danish Minister for Science, Technology and Innovation.