
Adrion and Mansmann BMC Medical Research Methodology 2012, 12:137
http://www.biomedcentral.com/1471-2288/12/137

RESEARCH ARTICLE Open Access

Bayesian model selection techniques as decision support for shaping a statistical analysis plan of a clinical trial: An example from a vertigo phase III study with longitudinal count data as primary endpoint

Christine Adrion* and Ulrich Mansmann

Abstract

Background: A statistical analysis plan (SAP) is a critical link between how a clinical trial is conducted and the clinical study report. To secure objective study results, regulatory bodies expect that the SAP will meet requirements in pre-specifying inferential analyses and other important statistical techniques. Writing a good SAP for model-based sensitivity and ancillary analyses involves non-trivial decisions on, and justification of, many aspects of the chosen setting. In particular, trials with longitudinal count data as primary endpoints pose challenges for model choice and model validation. In the random effects setting, frequentist strategies for model assessment and model diagnosis are complex, not easily implemented, and have several limitations. Therefore, it is of interest to explore Bayesian alternatives which provide the needed decision support to finalize a SAP.

Methods: We focus on generalized linear mixed models (GLMMs) for the analysis of longitudinal count data. A series of distributions with over- and under-dispersion is considered. Additionally, the structure of the variance components is modified. We perform a simulation study to investigate the discriminatory power of Bayesian tools for model criticism in different scenarios derived from the model setting. We apply the findings to the data from an open clinical trial on vertigo attacks. These data are seen as pilot data for an ongoing phase III trial. To fit GLMMs we use a novel Bayesian computational approach based on integrated nested Laplace approximations (INLAs). The INLA methodology enables the direct computation of leave-one-out predictive distributions. These distributions are crucial for Bayesian model assessment. We evaluate competing GLMMs for longitudinal count data according to the deviance information criterion (DIC) or probability integral transform (PIT), and by using proper scoring rules (e.g. the logarithmic score).

Results: The instruments under study provide excellent tools for preparing decisions within the SAP in a transparent way when structuring the primary analysis, sensitivity or ancillary analyses, and specific analyses for secondary endpoints. The mean logarithmic score and DIC discriminate well between different model scenarios. It becomes obvious that the naive choice of a conventional random effects Poisson model is often inappropriate for real-life count data. The findings are used to specify an appropriate mixed model employed in the sensitivity analyses of an ongoing phase III trial.

*Correspondence: [email protected]
Institute for Medical Information Sciences, Biometry and Epidemiology (IBE), Ludwig-Maximilians University, Marchioninistr. 15, 81377 Munich, Germany

© 2012 Adrion and Mansmann; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Conclusions: The proposed Bayesian methods are not only appealing for inference but notably provide a sophisticated insight into different aspects of model performance, such as forecast verification or calibration checks, and can be applied within the model selection process. The mean of the logarithmic score is a robust tool for model ranking and is not sensitive to sample size. Therefore, these Bayesian model selection techniques offer helpful decision support for shaping sensitivity and ancillary analyses in a statistical analysis plan of a clinical trial with longitudinal count data as the primary endpoint.

Keywords: Statistical analysis plan, Sensitivity analysis, Longitudinal count data, Bayesian generalized linear mixed models, INLA, Predictive performance, Bayesian model evaluation, Informed model choice

Background
A statistical analysis plan (SAP) is a critical link between how a clinical trial is conducted and the clinical study report. To secure objective study results, regulatory bodies expect that the SAP will meet requirements in pre-specifying inferential analyses and other important statistical techniques. Writing a good SAP for a model-based sensitivity or ancillary analysis [1,2] involves non-trivial decisions on, and justification of, many aspects of the chosen model setting. In particular, trials with longitudinal count data as primary endpoint pose challenges for model choice and model validation. This paper explores tools for this decision process when sensitivity analyses are performed using generalized linear mixed models (GLMMs) for the analysis of longitudinal count data. These tools can be used to build transparent strategies for shaping the final models reported in the SAP.

The documentation of longitudinal profiles for the primary endpoint offers many advantages. They are more informative compared with a single timepoint analysis and give information about the 'speed of efficacy' [3]. Treatment effects evaluated by comparing change over time in quantitative outcome variables between the treatment groups are of great interest [4,5]. The analysis of longitudinal profiles offers an effective way to handle composite endpoints like: (1.) the long-term effect of experimental treatment (E) is better than that of standard treatment (S), and (2.) patients under E reach a pre-specified effect faster than those under S.

We are interested in parametric modeling approaches for quantifying absolute effects, adjusting for baseline covariates and handling stratification. There is a rich literature on nonparametric methods for longitudinal data, for example, Brunner et al. [6]. These models do, in general, allow estimation of relative effects. Omar et al. [7] provide an overview of several alternative parametric approaches in trying to deal with individual longitudinal profiles: (i) the 'summary statistic method' [8] using a suitable summary measure (e.g. rates of change, post-treatment mean, last value of the outcome measure, or area under a curve) calculated for each subject, and subsequently analyzed with rather simple statistical techniques; (ii) repeated measures analysis of variance; (iii) marginal models based on generalized estimating equations (GEE) [9]; (iv) mixed effects modeling approaches involving fixed and random effects components [10,11].

Mixed effects (or random effects) models allow us to investigate the profile of individual patients, estimate patient effects and describe the heterogeneity of treatment effects over individual patients. They account for different sources of variation (patient effects, center effects, measurement errors) and provide direct estimates of the variance components, which might be of interest in their own right. Furthermore, they allow us to address various covariance structures and are useful for accommodating the overdispersion often observed among count response data [10-12].

The EMA Guideline on Missing Data in Confirmatory Clinical Trials from 2010 [13] explicitly considers random effects approaches (i.e. generalized linear mixed effects models (GLMMs) in the case of a non-Gaussian response) as an approach to handling trials with a series of primary endpoints measured repeatedly over time. Mixed models are also helpful for handling missing values. They are applicable under missing completely at random (MCAR) as well as missing at random (MAR) [14], while simple repeated univariate analyses for each time point using test procedures such as the t-test, ANOVA, or the Wilcoxon rank sum test rely on the more restrictive assumption of MCAR. Also, for non-ignorable missing data mechanisms, newer model-based strategies for longitudinal analyses are increasingly available and offer the opportunity to account for dropout patterns (e.g. pattern mixture models [15]). To be fully compatible with the intention-to-treat (ITT) principle, one has to explicitly consider incomplete individual profiles to correctly incorporate the information available for all randomized patients.

These points in summary may explain why our interest focuses on GLMMs as a powerful tool for the sensitivity analysis of longitudinal count data. What we need is to pre-specify in detail a robust, valid, and parsimonious strategy for the data to come (see ICH E9, EMA or PSI Guidelines [13,16,17]). Writing the SAP prospectively for a randomized clinical trial with longitudinal counts as the primary endpoint asks for a series of decisions when specifying a GLMM for the analysis. Consideration should mainly be given to the following issues:

(i) Distributional assumptions: Poisson, negative binomial, or more sophisticated extensions, e.g., accounting for zero-inflation.

(ii) Transformation of outcome variable: e.g. log-transformation for skewed continuous positive variables [18], or variance-stabilizing transformations (e.g. the inverse hyperbolic sine transformation for non-negative count variables). An FDA guideline [19] postulates that a rationale for the choice of data transformation, along with the interpretation of the estimates of treatment effects based on the transformed scale, should be provided. In some situations, transformation of endpoint data is indicated and preferred to untransformed analyses on the original scale. However, careful consideration should be given to the choice of transformation, which should be pre-specified at the protocol design stage.

(iii) Variance-covariance structure: specifying whether random effects (e.g. patient-specific intercepts, patient-specific slopes) are appropriate; specification of the within-error structure. Altogether, random effects selection can be challenging, particularly when the outcome distribution is not normal (see [20-22] for more details).

(iv) Methods for handling dropouts: e.g. dealing with informative dropouts, applying an analysis in which the last observation is carried forward, accounting for non-ignorable missing data mechanisms (pattern-mixture models). This approach must be fully compatible with the intention-to-treat principle.

(v) Use of covariate- or baseline-adjusted analyses, handling multi-center data: specifying the mean structure by identifying the fixed effects terms.

The last issue is addressed by Pocock et al. [23] for avoiding misuse and data-driven selection of covariates within the clinical trial setting. The typical strategy for settling this complex issue is to decide on a simple model on which the primary analysis is based and to use sensitivity analyses to assess the robustness of the derived result under realistic model deviations.

In this paper, we propose using pilot or pre-study data to make an informed choice about the sensitivity analysis stated in the SAP. Pilot or pre-study (commonly called "feasibility" or "vanguard" study) data come from a trial in an earlier phase, from a registry, or from a proof-of-concept study. For phase III trials, data from phase II trials generally exist [24]. In this respect we could also use data from the comparable treatment arms of related studies. Using these data helps to shape and justify in advance the modeling strategy for analyzing the main trial data, and to check the validity and the appropriateness of several model assumptions. It is imperative to minimize misspecification of the assumed GLMM, and this analysis enables the trial statistician to define a broad and robust setting for the final choice of the model.

Having determined the main focus of this paper, we need to motivate our choice of Bayesian tools for achieving our goal. Within the GLMM framework, analytical methods for model assessment and goodness-of-fit criteria are not straightforward, and frequentist approaches remain limited. The inclusion of random effects makes theoretical derivations rather complex, imposing computational challenges. Some proposed model evaluation procedures focus on checking the deterministic components (i.e. mean and variance-covariance structure) of a GLMM based on the cumulative sums of residuals, or assess the overall adequacy by means of a goodness-of-fit statistic which can be used in a manner similar to the well-known R2 criterion [25,26]. Furthermore, for small sample sizes, likelihood-based inference via penalized quasi-likelihood in the case of a longitudinal discrete outcome can be unreliable, with variance components being difficult to estimate. In contrast, many easy-to-implement tools are available within the Bayesian framework. We will briefly review Bayesian tools developed recently and demonstrate their usefulness: For assessing goodness-of-fit and performing model validation, we apply the probability integral transform (PIT) [27-29] as a graphical posterior model check. We implement formal statistical criteria such as the deviance information criterion (DIC) [30], the conditional predictive ordinate (CPO) [31,32], or proper scoring rules [28,29,33-36] to compare the forecasting capability of different competing GLMMs. A further objective is exploring the behavior of these different Bayesian methods for model validation concerning the coherence of their preference for a certain candidate model.

The article is organized as follows: The Methods section reviews Bayesian strategies for GLMMs in the count response situation. The main idea of integrated nested Laplace approximation (INLA) proposed by Rue et al. [37] is described briefly. We also introduce tools for model ranking and for evaluating the performance of the proposed model alternatives based on a prediction-oriented approach. Additionally, a case study is presented which will be used in the subsequent section to motivate the methodology. The Results section applies the proposed Bayesian methodology to clinical trial data on vertigo attacks and presents the findings of our simulation study. The Discussion section contains the limitations of the methods proposed. More technical material is provided in the Appendix. Selected R-INLA code with further details concerning the INLA approach is included in the Web Supplementary Material of this paper [see Additional files 1 and 2].

For data analysis, the inla program [38], based on the open-source software R version 2.12.1 [39], was used to demonstrate the applicability of the Bayesian toolbox.

Methods
Bayesian generalized linear mixed models for longitudinal count data
In the following, regression approaches to modeling discrete count outcomes are briefly outlined. In the clinical trial setting, we assume that each patient i, i = 1, ..., N, is randomized to a new drug (xi = 1) or a standard treatment (xi = 0). The observations yij for each patient are counts measured in the course of time during each study visit, j = 1, ..., ni (allowing an unbalanced design), with time tij ∈ R and ti1 = 0. The linear predictor is defined as

$$\eta_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t_{ij} + \beta_2\, x_i + \beta_3\, x_i t_{ij}, \qquad (1)$$

with β = (β0, β1, β2, β3)^T being the population-level parameter vector (fixed effects), b0i denoting patient-specific random intercepts and b1i patient-specific random slopes. The fixed effects (in a frequentist framework) account for group-specific effects (e.g. treatment group or time), serving at the same time as parameters of interest in a clinical trial. We want to relate the count response to explanatory variables such as time and treatment. In the most general case, a standard assumption for a GLMM with both random intercept and slope is that bi = (b0i, b1i)^T follows a bivariate normal distribution with mean zero and an unknown precision matrix Q = Q(φ) depending on parameters φ, i.e.

$$b_i \mid Q \sim_{iid} N_2(0, Q^{-1}).$$

The variance-covariance matrix Q^{-1} for the variance components φ is parameterized in terms of precisions and a correlation parameter. That is,

$$Q^{-1} = \begin{pmatrix} 1/\tau_{b_0} & \rho/\sqrt{\tau_{b_0}\tau_{b_1}} \\ \rho/\sqrt{\tau_{b_0}\tau_{b_1}} & 1/\tau_{b_1} \end{pmatrix},$$

where τb0 and τb1 refer to the marginal precisions of the random effects b0i and b1i. Therefore, it is necessary to allow for the correlation ρ between random intercepts and slopes. In GLMMs formulated within a Bayesian framework, a non-Gaussian hyperprior distribution must be assigned to the precision matrix Q(φ), where the τ's and ρ represent the hyperparameters. As proposed by Fong et al. [40] and Wakefield [41], we assume

$$Q \sim \text{Wishart}_2(r, R^{-1}).$$

The prior parameters of the Wishart prior are (r, R11, R22, R12), where r > 1 (to obtain a proper prior) in the case of two dependent random effects. R12 is element (1, 2) of the inverse of R, and R12 = R21 because of symmetry. Integration over Q gives a marginal $t_2(0, [(r-1)]^{-1} R,\, r-1)$ distribution for bi = (b0i, b1i)^T, with t2 denoting the bivariate Student's t distribution, here with r − 1 degrees of freedom.

Poisson GLMM
Poisson loglinear regression is a common choice for modeling count response data. The probability function can be expressed as

$$\Pr_{\mu}(y) = \exp(-\mu)\, \mu^{y} / y!$$

for y = 0, 1, 2, ... and μ > 0. For longitudinal count data with i = 1, ..., N subjects and j = 1, ..., ni measurements per subject, the observed counts yij are conditionally independent Poisson variables Yij ∼ Poi(μij), with the conditional mean of Yij related to the linear predictor by a logarithmic link function. Let μij = E(Yij | β, bi). Hence, the resulting predictor in a standard Poisson GLMM for predicting the mean rate is

$$\log(\mu_i) = \eta_i = X_i \beta + Z_i b_i,$$

where Xi is an ni × p matrix and Zi is an ni × q matrix, with β a p × 1 vector of population-level parameters (fixed effects) and bi a q × 1 vector of zero-mean normally distributed random effects. In the longitudinal setting described in equation (1), p = 4, q = 2 and Zi = (1, ti). The primary Poisson assumption is equidispersion, i.e. the equality of the mean and the variance functions. However, this is often inconsistent with empirical evidence. In reality, the value of the variance often exceeds that of the mean μij, resulting in overdispersion. Thus, although they are widely used to model count data, Poisson GLMMs may not be well suited to the types of count outcomes arising in specific applications.
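As a concrete illustration, a random-intercept Poisson GLMM of this form can be fitted with R-INLA roughly as sketched below. The data frame and variable names (vertigo, attacks, time, dosage, patient) are hypothetical placeholders, not the trial's actual dataset:

    library(INLA)  # R-INLA package, see www.r-inla.org

    ## Model (I)-type predictor: fixed effects for time and the
    ## dosage-by-time interaction, plus an iid patient-level random intercept.
    formula.I <- attacks ~ time + dosage:time + f(patient, model = "iid")

    fit <- inla(formula.I, family = "poisson", data = vertigo,
                control.compute = list(dic = TRUE, cpo = TRUE))
    summary(fit)  # posterior summaries of fixed effects and hyperparameters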

Negative binomial GLMM
A conventional modeling approach for overdispersed count data, where the variance exceeds the mean (i.e. a given rate μij), is negative binomial (NB) loglinear regression. In the classical univariate setting (dropping the subscript i), the NB density can be written as

$$\Pr_{k,p}(y) = \frac{\Gamma(y+k)}{\Gamma(k)\,\Gamma(y+1)}\; p^{k} (1-p)^{y},$$

for y = 0, 1, 2, ..., probability 0 < p ≤ 1, and k ∈ R, k > 0. Γ(n) = (n − 1)! denotes the Gamma function, and y represents the number of failures which occur in a sequence of Bernoulli trials before a target number of successes is reached. Additionally, the hyperparameter k (often called "size") plays the role of an overdispersion parameter. For negative binomial data, the corresponding mean and variance are then given by

$$\mu = \frac{k(1-p)}{p} \quad\text{and}\quad \sigma^2 = \mu + \mu^2/k = \frac{k(1-p)}{p^2},$$

with p = k/(k + μ) = μ/σ² and k = μ²/(σ² − μ).

Overdispersion in the negative binomial model can be interpreted as unobserved heterogeneity among the observations y. If this phenomenon is not taken into account in the modeling process, it can lead to an underestimated variance, which then leads to incorrect posterior inference. It must be kept in mind that in NB regression the dispersion parameter takes observation-specific values. In the limit k → ∞, holding μ fixed, the variance approaches the mean and therefore the negative binomial NB(k, p) converges in distribution to Poi(μ) (with μ = k(1 − p)/p).
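In R-INLA, only the likelihood family needs to change relative to the Poisson sketch above; the following minimal sketch assumes the "nbinomial" family, under which the size parameter k is estimated as an additional hyperparameter (names reused from the hypothetical Poisson example):

    ## NB GLMM: same linear predictor, but with the "nbinomial" likelihood;
    ## the overdispersion parameter k appears among the hyperparameters.
    fit.nb <- inla(formula.I, family = "nbinomial", data = vertigo,
                   control.compute = list(dic = TRUE, cpo = TRUE))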

Zero-inflated GLMM
In many biometrical and epidemiological applications, the count data encountered often contain a high proportion of extra zeros relative to the Poisson distribution, which is routinely applied in these situations. Therefore, a major source of overdispersion in these situations is a preponderance of zero counts. Zero-inflated count models provide a parsimonious yet powerful way to model this type of situation. Such models assume that the data originate from a mixture of two separate processes: one generates only zeros, and the other is either a Poisson or a negative binomial data-generating process. The result of a Bernoulli trial is used to determine which of the two processes generates an observation.

Hence, zero-inflated estimation methods in general combine two regression equations: one predicting membership in the "always zero" group, and a second predicting the frequency of counts in the "not always zero" group. While these differences are not modeled with standard Poisson or negative binomial regression, zero-inflated models first account for the excessive zeros by predicting group membership (i.e. an unobserved latent dichotomous outcome) based on the constellation of covariates included in the model, and then predict the frequency of counts only for those in the "not always zero" group. The zero-inflated version of a distribution D of a random variable Y ∼ ZID(π0, θ), where ZID denotes a zero-inflated distribution, has a probability function of the form

$$f_{ZID}(y) = \pi_0\, I_{[y=0]} + (1-\pi_0)\, f_D(y; \theta),$$

where fD(y; θ) is the probability function of distribution D with parameters θ. Hence, fZID(y) exhibits an additional zero-inflation hyperparameter π0 for the proportion of additional zeros. From the equation above, the probability of zero is equal to π0 + (1 − π0) fD(0; θ), while the probability of y > 0 is given by (1 − π0) fD(y; θ).

zeros are the zero-inflated Poisson (ZIP) and the zero-inflated negative binomial (ZINB). The ZIP distributionintroduced by Lambert [42] is the simplest ZID.In the longitudinal setting, the full ZIP mixed effects

model has the following representation:

Yij ∼ ZIP(π0ij,μij) and

Yij ∼{0, with probability π0,ijPoi(μij), with probability (1 − π0,ij).

A ZIP model will reflect the data accurately whenoverdispersion is caused by an excess of zeros. In gen-eral, a ZIP mixed effects model can be used when one isnot certain about the nature of the source of zeros, andobservations are overdispersed and simultaneously corre-lated because of the sampling design or the data collectionprocedure. By contrast, if overdispersion is attributed tofactors beyond the inflation of zeros, a ZINB model ismore appropriate [43]. Furthermore, the rate of zero-inflationmay change over time. This problem goes beyondthe scope of this paper, and we focus on ZIP GLMMs as analternative to the Poisson GLMM generally used for ana-lyzing longitudinal counts. More details concerning theseissues can be found in Hilbe [44] or Lambert [42].Again, a Bayesian approach provides an easy way to deal

with zero-inflated hierarchical count data by incorporat-ing a prior for π0 (generally beta prior or a uniform priorwhen no information is available). For longitudinal datawith repeated observations, the correlation structure maybe taken into account by introducing random effects in theproposed zero-inflated model. More details can be foundin Dagne [45] or Ghosh et al. [46].
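A hedged sketch of a ZIP GLMM in R-INLA follows; it assumes the "zeroinflatedpoisson1" likelihood (a point mass at zero mixed with a Poisson component, the zero-inflation probability being a hyperparameter) and reuses the hypothetical names from the Poisson sketch:

    ## ZIP GLMM: the zero-inflation probability pi0 is estimated as an
    ## additional hyperparameter of the likelihood.
    fit.zip <- inla(formula.I, family = "zeroinflatedpoisson1", data = vertigo,
                    control.compute = list(dic = TRUE, cpo = TRUE))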

NMM with variance-stabilizing transformation
It is not uncommon for a regression model to be inappropriate for a given response variable but reasonable for some transformation of it, provided that the transformation is methodologically justified. For a longitudinal count outcome, this means that a normal mixed effects model (NMM) should be considered as an alternative modeling strategy, with an assumption of Gaussian errors on the transformed scale: an inverse hyperbolic sine transformation [47] of the response y via

$$\text{arcsinh}(y) = \log\left(y + \sqrt{y^2 + 1}\right)$$

can be performed to accomplish stabilization of the variance and is often useful in practice. For more details concerning the asymptotic variance-stabilizing transformation resulting from negative binomial count data, see Appendix A1. This approach is motivated by analyzing the data with a robust and well-understood algorithm as regards parameter estimation. Particularly in a frequentist framework, likelihood-based inference is far less straightforward for GLMMs than it is for NMMs. Analytical intractability is the reason why a variety of numerical integration techniques for maximizing the likelihood have been developed (e.g. Gauss quadrature or penalized quasi-likelihood). In a Bayesian framework, computation is a major issue for complex hierarchical GLMMs, since the usual implementation based on the Markov chain Monte Carlo (MCMC) method tends to exhibit poor performance, lack of convergence, or slow mixing when applied to such models. As regards computational cost, NMMs clearly outperform mixed models for non-Gaussian responses.
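In R this transformation is available as the base function asinh(), so a minimal sketch of the NMM alternative (again with the hypothetical vertigo data frame) is:

    ## NMM on the variance-stabilized scale: arcsinh(y) = log(y + sqrt(y^2 + 1)).
    vertigo$attacks.vst <- asinh(vertigo$attacks)
    fit.nmm <- inla(attacks.vst ~ time + dosage:time + f(patient, model = "iid"),
                    family = "gaussian", data = vertigo,
                    control.compute = list(dic = TRUE, cpo = TRUE))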

Bayesian inference using the INLA approach
For Bayesian GLMMs, an analytical computation of the posterior marginals of the unknown fixed parameters and hyperparameters is not possible: the posterior marginals are not available in closed form because of the non-Gaussian outcome. Hence, the standard approach used to obtain posterior estimates is the MCMC method [48-50]. However, within the MCMC framework several problems in terms of both convergence and computational time occur in practical applications. Recently, Rue et al. [37] proposed an approximate alternative for parameter estimation in a subclass of Bayesian hierarchical models, the so-called latent Gaussian models. These are models with a structured additive predictor

$$\eta_i = \alpha + \sum_{l=1}^{n_f} f^{(l)}(u_{li}) + \sum_{g=1}^{n_\beta} \beta_g x_{gi} + \varepsilon_i, \qquad (2)$$

where f^(l)(·) represents an unknown function of continuous covariates u, comprising for example nonlinear effects of covariates, time trends, spatial dependencies, or independent identically distributed individual-level parameters (random effects). The βg's denote the linear effects of some covariates x, and the εi's are unstructured terms. Gaussian priors are assigned to α, f^(l)(·), βg and ε, whereas the priors for the hyperparameters φ do not have to be Gaussian. Random effects are introduced by defining f(ui) = fi and letting {fi} be independent, zero-mean and Gaussian distributed. INLA is a new computational approach to statistical inference for latent Gaussian Markov random field (GMRF) models that can bypass MCMC. Known problems with MCMC no longer apply when using INLA, as no Monte Carlo inference is involved. The theoretical background and computational issues are described in detail in Rue et al. [37,51]. In short, a latent GMRF model, which underlies INLA, is a hierarchical model which can be characterized through three stages. In the first stage, the distributional assumption is formulated for the observables yi, usually assumed to be conditionally independent given some latent parameters and, possibly, some additional hyperparameters. In the second stage, an a priori model for the unknown parameters is assigned and the corresponding GMRF is specified. The third and last stage of the model consists of determining the prior distributions for the hyperparameters. With this method, a recipe for fast Bayesian inference is provided in a fully automated way, using accurate, deterministic approximations to the marginal posterior density of the hyperparameters and the marginal posterior densities of the latent variables. The INLA computational approach combines Laplace approximations and numerical integration in a very efficient manner. Three types of approximation are available: Gaussian, full Laplace, or simplified Laplace approximation. Each of these approaches has different features varying in accuracy and computational cost. In this article, we used the full Laplace approximation for the numerically inaccessible integrals of the posterior marginal density, as this approximation is supposed to be the most accurate [37,52].

Using the INLA approach it is also possible to challenge the model itself. For example, a set of competing GLMMs can be assessed through cross-validation in a reasonable time without reanalyzing the model after omission of observation yij. Hence, within the INLA framework, GLMMs can be fitted at low computational cost, giving access to various predictive measures for model comparison. Additionally, this approach facilitates the validation of distributional assumptions concerning the model being studied.

Details on how to use the open-source software inla can be found in the manual offered by Martino and Rue [38] or [53], and on the website www.r-inla.org. The inla program, written in C and bundled within an R interface [39] called R-INLA, can be downloaded from the webpage for Windows, MAC and Linux, or installed simply by typing the following command line within R: source("http://www.math.ntnu.no/inla/givemeINLA.R"). Accordingly, R-INLA permits model specification and post-processing of results directly in R. All analyses in this paper were run using the R-INLA package built in October 2011.

Methods for model assessment and comparison
Diagnostic checking of the model against the data completes the model building process. The aim of diagnostic checking is to compare the data with the fitted model in such a way that it is possible to detect any systematic discrepancies. Forms of model assessment common to both frequentist and Bayesian methods involve measuring the goodness-of-fit to evaluate whether the chosen final model provides an adequate fit to the longitudinal data and to firmly establish the model's credibility (model assessment). For example, we can check whether a specific random effect at a certain grouping level is warranted or whether it can be eliminated. To identify model deficiencies and facilitate model comparison and model selection, several Bayesian tools recently proposed by various authors are available. These tools can be applied to address the predictive performance of a model, or to identify model deficiencies and to detect possible outliers or surprising observations yij that do not fit the given model and may require further attention. Additionally, methods for model comparison should provide information about which model performs best.

Deviance information criterion (DIC)
Appropriate statistical selection of the best model from a collection of hierarchical GLMMs is problematic, mainly because of ambiguity in the "size" of such models arising from the shrinkage of their random effects towards a common value. To address this problem, Spiegelhalter et al. [30] suggest DIC as a generalization of the Akaike information criterion (AIC) which can be used as a Bayesian approach for model comparison and to assess the adequacy of hierarchical models. DIC compares the global performance and predictive accuracy of different alternative models, accounting for model complexity. Like AIC, the basic idea of DIC is a trade-off between model fit and model complexity. Models with more parameters tend to fit the data better than models with fewer parameters. However, a larger set of parameters makes the model more complex, with the danger of overfitting. Hence, model selection should account for both goodness-of-fit and complexity of the model. The smaller the DIC, the better the trade-off between model fit and complexity. The model with the smallest DIC is considered to be the model that would best predict a replicate data set of the same structure as that currently observed. DIC is based on the posterior distribution of the Bayesian deviance statistic,

$$D(\theta) = -2 \log f(y \mid \theta) + 2 \log h(y), \qquad (3)$$

where f(y | θ) is the likelihood function for the observed data vector y given the parameter vector θ, and h(y) is some standardizing function of the data (thus not having an impact on model selection). In this approach, the fit of a model is summarized by the posterior expectation of the deviance, $\bar{D} = E_{\theta|y}[D]$, while the complexity of a model is captured by the effective number of free parameters pD, which is typically less than the total number of parameters. For non-hierarchical models, pD should be approximately the true number of parameters. pD can be thought of as the "posterior mean of the deviance" minus the "deviance evaluated at the posterior means",

$$p_D = E_{\theta|y}[D] - D(E_{\theta|y}[\theta]) = \bar{D} - D(\bar{\theta}).$$

DIC is then defined as

$$DIC = 2\bar{D} - D(\bar{\theta}) = D(\bar{\theta}) + 2 p_D = \bar{D} + p_D. \qquad (4)$$

DIC is scale-free. Because of the standardizing function h(y) in (3), DIC values have no intrinsic meaning, and only differences in DIC across candidate models are meaningful. The question of what constitutes a noteworthy difference in DIC between two candidate models has not yet received a satisfactory answer. Differences of 3 to 5 are normally thought of as the smallest that are still noteworthy [49,50].

Spiegelhalter et al. [30] and Plummer [54] discuss some limitations of DIC: Although widely used, DIC lacks a clear theoretical foundation. It can be shown that DIC is an approximation of a penalized loss function based on the deviance, with a penalty derived from a cross-validation argument. However, this approximation is valid only when the effective number of parameters in the model is much smaller than the number of independent observations. The ratio pD/n may be used as an indicator of the validity of DIC. In disease mapping or random effects models for longitudinal data this assumption often does not hold, and DIC therefore under-penalizes more complex models.

Computational details DIC is simple to calculate using MCMC simulation and is routinely implemented in WinBUGS [55-58]. With the INLA approach, both components of DIC, pD and D̄, can be computed by setting the option dic = TRUE in the control.compute statement within the inla(.) call. For further details see [37].
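For example, assuming the model fits from the earlier sketches, the DIC components are returned in the $dic element of an inla result object and can be tabulated side by side:

    ## Compare DIC and effective number of parameters across candidate GLMMs.
    fits <- list(Poisson = fit, ZIP = fit.zip, NB = fit.nb, NMM = fit.nmm)
    sapply(fits, function(f) c(DIC = f$dic$dic, p.eff = f$dic$p.eff))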

Conditional predictive ordinate (CPO)
As a device for detecting possible outliers or surprising observations yij within a posited model, and therefore for checking the model fit, the conditional predictive ordinate (CPO) can be computed for each observation [59]. To be more precise, this predictive quantity, given by

$$CPO_{ij} = \pi(y_{ij} \mid y_{-ij}), \qquad (5)$$

constitutes the position of the observed value yij within the leave-one-out cross-validatory posterior predictive distribution evaluated at the observed value yij. A small value of CPOij indicates an observation yij that is unlikely under the model fit without the observation in question, i.e. 'surprising' in the light of the prior knowledge and the other observations [60]. Accordingly, this observation is not expected under the cross-validation posterior predictive distribution of the current model. The CPO measure is discussed, among others, by Gelfand et al. [32], Congdon [61] and Gilks et al. [48]. Since for discrete data CPOij can be used to estimate the probability of observing yij in the future when y−ij is already observed, it can be interpreted easily.

Computational details Examination of model performance at the level of the individual observation can provide added insight into discovering the reasons for poor global performance. For each observation yij of the model, we use the value of the cross-validation predictive density at the observed data point as a local discrepancy measure. A plot of CPOij versus ij can be used informally as a diagnostic tool to reveal surprising observations. With MCMC sampling, calculating the CPO predictive quantity requires refitting the model by single case omission. With inla, CPO can be returned for each observation at low computational cost without rerunning the analysis, by using the option cpo = TRUE. However, in practice, the assumptions behind the numerical computation might fail for some observations. For these points, the CPO values have to be re-computed manually. That is, yij is removed from the model, which is then simply refitted, computing only the posterior marginals of the linear predictor for this observation. (As the results from fitting the whole model can be used to improve, e.g., initial values, this allows a more efficient implementation.) For further reading refer to [51].
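A short sketch of this workflow, using the field names returned by an inla result object and the inla.cpo() helper for recomputing flagged values (the fit object is the hypothetical Poisson fit from above):

    ## Inspect CPO values and re-compute those where the numerical
    ## assumptions failed (flagged in fit$cpo$failure).
    cpo.ij <- fit$cpo$cpo
    plot(cpo.ij, xlab = "observation index ij", ylab = "CPO")  # informal outlier check
    if (any(fit$cpo$failure > 0, na.rm = TRUE)) {
      fit <- inla.cpo(fit)  # manual re-computation for flagged observations
    }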

Proper scoring rules as a toolbox for the assessment of prognostic performance
Besides model choice criteria such as DIC, CPO or graphical techniques, the comparison and ranking of different competing models can be based on proper scoring rules, which were proposed by Gneiting and Raftery [33] for assessing the quality of probabilistic forecasts (see [34-36,62] for more details). Scoring rules provide a suitable summary measure for the evaluation of probabilistic forecasts by assigning a numerical score based on the posterior predictive distribution P and on the event y that materializes. We take scoring rules to be negatively oriented penalties that a forecaster wishes to minimize: specifically, if the forecaster quotes the predictive distribution P and y is the observed value, the penalty is s(P, y). We write s(P, Q) for the expected value of s(P, Y) when Y ∼ Q. Models with smaller score values should be preferred to models with larger values. Additionally, propriety is an essential property of a scoring rule that encourages honest and coherent predictions. Gneiting and Raftery [33] contend that the goal of probabilistic forecasting is to maximize the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between the probabilistic forecasts and the observations y, and is a joint property of the predictive distributions and the actual observations. Sharpness refers to the concentration of the predictive distribution, and is a property of the forecast only [28,29,63]. Hence, in the context of model comparison, scoring rules provide a diagnostic approach to assessing the predictive performance of a model.

The most prominent example of a strictly proper scoring rule is the logarithmic score [64], which is defined as

$$LS_{ij} = -\log(\pi_{y_{ij}}), \qquad (6)$$

where $\pi_{y_{ij}} = \Pr(Y_{ij} = y_{ij} \mid y_{-ij})$ indicates the cross-validated leave-one-out predictive probability mass at the observed value yij, i = 1, ..., N, j = 1, ..., ni. The subscript −ij in y−ij denotes that for patient i observation j is removed. Concerning logarithmic score values, the following relationship holds:

$$LS_{ij} = -\log(CPO_{ij}).$$

Gneiting and Raftery [33] proposed ranking competing forecast procedures (i.e. competing models) on the basis of their mean scores, e.g. $\overline{LS} = \left(\sum_{i=1}^{N} n_i\right)^{-1} \sum_{i,j} LS_{ij}$, and not by graphical methods such as boxplots. Differences in the mean scores can be considered, since only the mean scores are still proper. Therefore, we want to compare the mean scores of two rival models by using a formal significance test to assess whether score differences are statistically significant at a certain level. The paired Monte Carlo permutation test [65,66] based on the observation-level scores provides a convenient approach, as unlike the paired t-test it does not require distributional assumptions (e.g. normality of individual scores) or trust in asymptotic behavior. Permutation tests compare the observed score values, suitably summarized in a test statistic, with randomly permuted score values, which can be viewed as samples under the null hypothesis H0 of no difference.
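A minimal sketch of such a paired permutation test, based on the identity LSij = −log(CPOij) and random sign flips of the observation-level score differences under H0 (fit1 and fit2 are hypothetical inla objects for two rival models):

    ## Paired Monte Carlo permutation test for the difference in mean
    ## logarithmic scores between two rival models.
    ls1 <- -log(fit1$cpo$cpo)
    ls2 <- -log(fit2$cpo$cpo)
    d   <- ls1 - ls2                 # observation-level score differences
    t.obs  <- mean(d)                # observed test statistic
    t.perm <- replicate(10000,       # sign-flipped replicates under H0
                        mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
    p.value <- mean(abs(t.perm) >= abs(t.obs))  # two-sided Monte Carlo p-value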

Computational details To calculate the scores in the MCMC setting, a leave-one-out cross-validation approach using the posterior predictive distribution is the gold standard, obtained by reanalyzing the data without a suspect statistical unit. However, full and exact cross-validation is extremely time-consuming in practice and often generally infeasible within an MCMC analysis. Marshall and Spiegelhalter [67] proposed the "full-data mixed approach" (ghost sampling), generating full 'ghost' sets of random effects for each unit without repeatedly fitting the model with one particular observation removed (for more details also compare [68] or [69]).

As an alternative to MCMC, the INLA approach can be applied to compute omnibus predictive performance measures, such as the mean cross-validated logarithmic score, for different competing models. Using inla, the quantities needed for calculating these score values are available as a by-product of the main computations when setting the option cpo = TRUE in the control.compute statement within the inla(.) call [37,38,51].

Probability integral transform (PIT)
Unusually small values of CPO indicate surprising observations. However, what is meant by 'small' has to be calibrated to the level of the Gaussian field in order to compare CPO values. One possible calibration procedure is to compute the probability integral transform (PIT) proposed by Dawid [27]. In the univariate case, PIT is a tool for assessing calibration and therefore evaluates the adequacy of a single model. The PIT value for each single observation is defined as

$$PIT_{ij} = \Pr(y_{ij}^{new} \le y_{ij} \mid y_{-ij}), \qquad (7)$$

with y−ij being the observation vector with the ijth component omitted, and is simply the value that the predictive cumulative distribution function (CDF) attains at the observation yij. This procedure is performed in cross-validation mode, meaning that in each step of the validation process the ensuing leave-one-out posterior predictive distribution is calculated.

Unusually large or small values of PITij indicate possible outliers or surprising observations not supported by the model under consideration. If the observation is drawn from the predictive distribution, which is an ideal and desirable situation, and the predictive distribution is continuous, the PIT has a uniform distribution on the unit interval [0, 1]. To evaluate whether a data vector y seems to come from a specific distribution, i.e. to check calibration empirically, a histogram of all PIT values can be plotted and checked for uniformity [28,29,33]. A histogram of the PITs far from the uniform might indicate a questionable model and hint at reasons for forecast failures and model deficiencies. U-shaped histograms indicate under-dispersed predictive distributions, inverse-U-shaped histograms point at over-dispersion, and skewed histograms occur when central tendencies are biased. In the case of count data, the predictive distribution is discrete, resulting in PIT values that are no longer uniform under the null hypothesis of an ideal forecast. To overcome this problem, several authors suggest a "non-randomized" version of the PIT values (see [28] for more technical details). Hence, an adjusted PIT can be used instead, defined as

$$PIT_{ij} = \Pr(y_{ij}^{new} < y_{ij} \mid y_{-ij}) + 0.5 \cdot \Pr(y_{ij}^{new} = y_{ij} \mid y_{-ij}). \qquad (8)$$

These adjusted PIT values can be interpreted in exactly the same way as in applications with continuous outcome data. However, when using PIT as a diagnostic tool, it has to be considered that PIT does not take into account the sharpness of the density forecast, as opposed to proper scoring rules, which provide a combined assessment of both calibration and sharpness simultaneously.

Computational details In the MCMC setting, computing nonrandomized PIT values for count outcomes is cumbersome and rather time-consuming because of the leave-one-out cross-validation approach. To reduce the computational burden, the INLA approach can be applied to compute PITij, i = 1, ..., N, j = 1, ..., ni.
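With R-INLA, the adjusted PIT values of equation (8) are returned alongside the CPO values, so the calibration histogram reduces to a single call (a sketch, reusing the hypothetical fit object):

    ## PIT histogram as a calibration check: roughly uniform bars suggest
    ## adequate calibration; U-shaped or inverse-U-shaped bars indicate
    ## under- or over-dispersed predictive distributions.
    hist(fit$cpo$pit, breaks = 10, xlab = "PIT", main = "")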

Motivating example: vertigo phase III dose-finding study (BEMED trial)
Study synopsis The BEMED trial (Medical treatment of Meniere's disease with Betahistine; EudraCT No.: 2005-000752-32; BMBF 177zfyGT; Trial Registration: Current Controlled Trials ISRCTN44359668) is an ongoing investigator-initiated, multi-center, national, randomized, double-blind, placebo-controlled clinical trial with a parallel group design. This dose-finding phase III trial, recruiting patients from several dizziness units throughout Germany, comprises three arms: therapy with high dose betahistine-dihydrochloride (3 × 48 mg per day) vs. a low dose (2 × 24 mg per day) vs. placebo. Total treatment time will be nine months with a three month follow-up. The objective of this study is to evaluate the effects of betahistine in high dosage vs. low dosage vs. placebo on the occurrence of vertigo attacks. The study was approved by the local ethics committee and is performed in accordance with the International Conference on Harmonization Guidelines for Good Clinical Practice, as well as with the Declaration of Helsinki. Written informed consent was obtained from patients who met the study inclusion criteria.

Design aspects and statistical analyses A sample size of n = 138 patients in total (46 in each group) to be analyzed was considered necessary. The total treatment time will be nine months with a three month follow-up. The primary efficacy outcome is the number of vertigo attacks in the three treatment arms during the last three months of the 9-month treatment period. The primary efficacy analysis is nonparametric and will be performed according to the ITT principle. The closed testing procedure is used to avoid adjusting the significance level. Sensitivity analyses will be performed using a longitudinal approach to quantify patient profiles and the 'speed of efficacy', i.e., how quickly a reduction in attack frequency is achieved in the three treatment groups. For the prospectively specified SAP, it has to be decided which candidate set of mixed effects models proposed at the beginning of the Methods section seems appropriate for analyzing the counts.

Informed model choice The decision on the models for sensitivity and ancillary analyses and the handling of informative missing data in the large phase III BEMED trial is based on data from a pre-study with a comparable study population, a comparable intervention, and the same definition of the primary outcome (frequency of vertigo attacks).

Results
Application to clinical trial data
Vertigo pre-study: data and modeling details
To demonstrate the applicability of the Bayesian toolbox within the GLMM framework, we used real-life longitudinal count data from an open, non-masked, exploratory trial conducted by the dizziness unit, Department of Neurology, University Hospital Munich, Germany [70]. 112 patients between the ages of 18 and 80 years with Meniere's disease received either a low dosage of betahistine-dihydrochloride, i.e. 16 or 24 mg tid, or a higher dosage of 48 mg tid for at least 12 months. 50 patients were in the low dosage group (coded as zero) and 62 in the high dosage group (coded as one). The two treatment groups did not differ with respect to patient characteristics at baseline measurement (ti1 ≡ t1 = 0 for all i = 1, ..., 112). In particular, there was no significant difference in the number of attacks at baseline (see [70] for more details). The full dosage was given from the beginning of the treatment. Since the major aim of the treatment of Meniere's disease is reducing the attack frequency, the efficacy outcome variable was the number of vertigo attacks per month during a 3-month period, i.e. during a period of 3 months preceding treatment and then every 3 months for up to 12 months. Follow-up examination every 3 months showed that the mean number of attacks per month decreased in both groups over time, and was significantly lower in the high dosage than in the low dosage group; the longer the treatment, the greater the difference between the two treatment groups. The longitudinal data are displayed in Figure 1. Moderate vertical differences between the individual profiles could be identified.

We consider a count outcome variable, yij, which in our example represents the number of vertigo attacks per month for the ith patient measured at time tij ≡ tj = 0, 3, 6, 9, 12, for j = 1, 2, 3, 4, 5. To account for between-patient variability we introduced patient-level random intercepts as well as patient-specific slopes, and then fitted main effects and interaction models:

$$\eta_{ij} = (\beta_0 + b_{0i}) + \beta_1\, time_j + \beta_2\, dosage_i \cdot time_j \qquad \text{model (I)}$$

$$\eta_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, time_j + \beta_2\, dosage_i \cdot time_j \qquad \text{model (IS)}$$

The main effect of treatment group, defined by dosagei, was left out of the systematic part, since the treatment effect was expected to develop slowly with time rather than being established shortly after baseline and remaining stable for the duration of the longitudinal observation.

We considered flexible models allowing for overdispersion and zero-inflation, respectively. Hence, for both models of type (I) and (IS), we investigated four different types of GLMM by changing the distributional assumption:

a) Poisson model for yij ∼ Poi(μij). The Poisson GLMM was used as the "reference model", as this distributional assumption is often the default choice.

b) Zero-inflated Poisson (ZIP) model, which will explain the mean attack frequency and the zero-inflation probability (i.e. assuming an excess of zero observations).

c) Negative binomial (NB) model, as a robust alternative to accommodate substantial extra variation or overdispersion.

d) Normal mixed effects model (NMM), for the arcsinh-transformed outcome "attack frequency", as an alternative modeling strategy to accomplish stabilization of the variance.

All models included patient-specific random intercepts $b_{0i} \mid \sigma_b^{-2} \sim_{iid} N(0, \sigma_b^2)$, while the need for patient-specific slopes associated with time was investigated for all candidate models. Therefore, for the latter type of GLMM, correlated patient-specific intercepts and slopes following a zero-mean bivariate normal distribution were assumed, i.e. $(b_{0i}, b_{1i})^T \mid Q \sim_{iid} N_2(0, Q^{-1})$. For models of type (I), a Gamma prior was assigned to the precision $\sigma_b^{-2}$. According to Fong et al. [40], for models of type (IS) we assumed Q to follow a Wishart2(r, R^{-1}) distribution with R = I2. In general, independent zero-mean Gaussian priors with fixed small precisions were assigned to each component of the population-level parameter vector β. As the accuracy of the simplified Laplace approximation is often not sufficient for the computation of predictive measures [52], the full Laplace approximation was chosen in the following application, in combination with the so-called GRID integration strategy for numerically exploring the approximate posterior marginal densities (for more details concerning this issue see [37]).
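A hedged sketch of how the type (IS) Poisson model with correlated random intercepts and slopes can be specified in R-INLA, following the "iid2d" construction of Fong et al. [40]; the index and variable names are again hypothetical placeholders:

    ## Correlated patient-specific intercepts and slopes via "iid2d":
    ## indices 1..N address the intercepts, N+1..2N the slopes.
    N <- 112
    vertigo$id.b0 <- vertigo$patient       # random-intercept index
    vertigo$id.b1 <- vertigo$patient + N   # random-slope index
    formula.IS <- attacks ~ time + dosage:time +
      f(id.b0, model = "iid2d", n = 2 * N) +  # bivariate normal, Wishart hyperprior
      f(id.b1, time, copy = "id.b0")          # slope on time, sharing the iid2d field
    fit.IS <- inla(formula.IS, family = "poisson", data = vertigo,
                   control.compute = list(dic = TRUE, cpo = TRUE),
                   control.inla = list(strategy = "laplace",      # full Laplace
                                       int.strategy = "grid"))    # GRID integration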

Vertigo pre-study: analysis resultsIn Table 1 INLA summaries for the vector of population-level parameters (fixed effects) are described. Addition-ally, 95% credible intervals are reported. These 95%equal-tail intervals correspond to the 2.5% and 97.5% per-centiles of the corresponding posterior distribution andenable assessment of whether, e.g., time profiles of the pri-mary efficacy outcome variable differ in both treatmentgroups (dosage∗time).We conclude that posterior esti-mates for models of type (I) and of type (IS), respectively,agree between differing distributional assumptions.In Figure 1B) and C), the conditional mean estimates of

the number of attacks depending upon fixed effects for


[Figure 1: A) Original data, B) Poisson, I, C) Poisson, IS; x-axis: time since baseline [months] (0, 3, 6, 9, 12); y-axis: number of attacks per month; separate panels for the low-dosage and high-dosage groups.]

Figure 1 Trajectory plots for vertigo data. Effect of betahistine-dihydrochloride on the frequency of attacks of vertigo in a total of 112 Meniere's disease patients; 2 treatment groups: "low-dosage" (50 patients) vs. "high-dosage" (62 patients). A) Individual trajectories for the vertigo data. B) and C) display the conditional posterior mean trajectories of the number of attacks depending upon fixed and random effects after fitting a Poisson GLMM (I: model with random intercepts; IS: model with random intercepts and slopes). The same color is used to indicate observations and model-based estimates for the same patient.

In Figure 1B) and C), the conditional mean estimates of the number of attacks depending upon the fixed effects for time, the interaction of treatment group and time, and the random effects are visualized by means of trajectory plots, assuming a Poisson (IS) GLMM. Furthermore, Figure 2 illustrates the approximated posterior marginals for the most important fixed effects by comparing INLA results with those obtained using the MCMC approach (see Appendix A3).

However, our key scientific problem was to quantify the goodness of competing models in terms of prediction accuracy. The question to be answered was how structural differences concerning random effects or distributional assumptions affect the performance of a posited model. A calibration check was performed via PIT histograms serving as an informal tool for discordancy diagnostics (see Figure 3). In contrast to the NB GLMM and the arcsinh NMM, which seem to be sufficiently well calibrated for type (I) models, the PIT histograms of the Poisson (I) and the ZIP (I) model were slightly U-shaped, with higher columns at the right-hand end of the histograms indicating a worse predictive performance.

Table 1 INLA summaries for estimated posterior means of population-level parameters (together with 2.5% and 97.5% posterior quantiles) using full Laplace approximations

Model        | Intercept            | time                    | dosage∗time
Poisson, I   | 1.366 (1.123, 1.603) | -0.051 (-0.063, -0.038) | -0.130 (-0.150, -0.109)
Poisson, IS  | 1.638 (1.432, 1.837) | -0.189 (-0.275, -0.107) | -0.173 (-0.288, -0.061)
ZIP, I       | 1.375 (1.121, 1.620) | -0.049 (-0.062, -0.036) | -0.115 (-0.137, -0.093)
ZIP, IS      | 1.628 (1.421, 1.830) | -0.209 (-0.302, -0.119) | -0.198 (-0.323, -0.075)
NB, I        | 1.447 (1.193, 1.695) | -0.069 (-0.090, -0.050) | -0.127 (-0.156, -0.098)
NB, IS       | 1.642 (1.433, 1.840) | -0.190 (-0.289, -0.101) | -0.168 (-0.289, -0.049)
arcsinh*, I  | 2.056 (1.853, 2.259) | -0.067 (-0.084, -0.051) | -0.074 (-0.096, -0.052)
arcsinh*, IS | 2.055 (1.854, 2.255) | -0.068 (-0.114, -0.022) | -0.073 (-0.134, -0.012)

* INLA posterior means and 95% credibility intervals for the arcsinh-transformed outcome modeled as a continuous response with Gaussian error terms. I: model with random intercept; IS: model with random intercept and slope associated with time.


Figure 2 Vertigo data: INLA vs. MCMC approach. Bayesian inference for fixed effects (Poisson random slope model): comparison of samples from a long MCMC chain with the posterior marginals computed with the Laplace approximation obtained by using INLA. The vertical blue line shows the posterior mean.

Visual assessment of the PIT histograms for the type (IS) models revealed noticeable deviations from uniformity due to miscalibration of density forecasts.

Additionally, the competing random slope models did not clearly outperform each other. The difference between negative binomial and Poisson was marginal because of a small degree of overdispersion: e.g. the hyperparameter k was estimated to be rather large for the NB (I) model with random intercepts, with a posterior mean of 7.03 and a 95% credible interval of [4.65, 10.36]. For the NB (IS) model, the posterior estimates for k were even larger (data not shown). Likewise, there was no convincing evidence for zero-inflation (e.g. the posterior mean for the zero-probability hyperparameter π0 was estimated to be 0.09, 95% credible interval [0.05, 0.14], for ZIP (I); for model type ZIP (IS), the posterior mean for π0 was even smaller).

[Figure 3 panels: PIT histograms (PIT vs. relative frequency) for Poisson I/IS, ZIP I/IS, NB I/IS, and arcsinh LMM I/IS.]

Figure 3 Vertigo data: PIT histograms for all candidate models. U-shaped histograms indicate under-dispersed predictive distributions, hump- or inverse-U-shaped histograms point at overdispersion, and skewed histograms occur when central tendencies are biased. Dashed gray lines show the histogram height corresponding to perfect calibration.


Modeling the arcsinh-transformed outcome by means of an NMM has several computational advantages, and the PIT values seemed reasonably close to uniform for model type (I), hence yielding fairly correct forecasts. Nevertheless, this does not take the sharpness of the density forecasts into account, as opposed to proper scoring rules.
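Both diagnostics are available directly from an inla() fit. A minimal sketch, assuming the hypothetical fit object fit.poi.I from the earlier sketch was obtained with cpo = TRUE and dic = TRUE:

    ## PIT histogram and mean logarithmic score from an R-INLA fit (sketch)
    pit <- fit.poi.I$cpo$pit                 # leave-one-out PIT values
    hist(pit, breaks = seq(0, 1, by = 0.1), freq = FALSE,
         main = "PIT histogram", xlab = "PIT")
    abline(h = 1, lty = 2, col = "gray")     # height under perfect calibration

    ## mean logarithmic score: LS = -mean(log CPO); smaller values are better
    LS  <- -mean(log(fit.poi.I$cpo$cpo))
    DIC <- fit.poi.I$dic$dic                 # for the DIC-based ranking below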

Table 2 enables a comparison of LS and DIC for all eight types of GLMM (the lowest mean score and DIC are printed in bold face). The Poisson (IS) model was ranked best in the leave-one-out predictive assessment by the logarithmic score. A permutation test was applied to decide whether the difference in mean log scores was significant at the 5% level. More exactly, we used the Poisson (IS) model emerging with the lowest LS as the reference model and tested in a pairwise manner. The last column of Table 2 reports Monte Carlo p-values based on permutation tests (9999 permutations) comparing the LS of the Poisson (IS) GLMM with the LS obtained if one of the remaining competing models is chosen for data analysis. "n.a." means that a permutation test is not applicable because of the back-transformation of LS obtained within the arcsinh NMM (see Appendix A2 for further details).
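A minimal sketch of such a paired Monte Carlo permutation test, assuming hypothetical vectors ls.ref and ls.alt of observation-level log scores for the reference and a competing model:

    ## paired Monte Carlo permutation test for individual log scores (sketch)
    perm_test_ls <- function(ls.ref, ls.alt, n.perm = 9999) {
      d   <- ls.alt - ls.ref                 # paired differences in log scores
      obs <- mean(d)
      ## under H0 the model labels are exchangeable within each pair,
      ## so randomly flip the sign of each difference
      perm <- replicate(n.perm,
                        mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
      mean(c(abs(perm) >= abs(obs), TRUE))   # two-sided Monte Carlo p-value
    }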

The Poisson (IS) model and its negative binomial (IS) counterpart do not differ significantly with respect to their mean cross-validated logarithmic scores; the same holds for the ZIP (IS) alternative. Ranking these models by means of their DIC values (disregarding the NMM types) revealed that they are very close to each other.

In summary, there was no evidence of considerable over-dispersion or an excess of zeros. Inclusion of a zero-inflation component is apparently not necessary for these pre-study data. Applying the mean log score and DIC to rank all eight models considered so far suggests that the random intercept models are inferior to the random intercept and slope models. Hence, we are inclined to believe that a Poisson random intercept and slope model is suitable for these longitudinal count response data.

SAP for BEMED trial: selection of candidate models
After the detailed analyses of the pre-study data described above, we will present these results in the SAP and choose a negative binomial model with random intercepts as well as random slopes as a robust candidate to conduct sensitivity analyses for the efficacy data of the BEMED trial. Accordingly, this proposed modeling strategy will be determined in the SAP.

Simulation study
Sampling details
In the last section, a prediction-oriented Bayesian toolbox was applied to real-life clinical count data. It is also important to investigate whether these tools help to evaluate different model alternatives and whether the model comparisons are valid. To assess the discriminatory power as well as the properties of DIC and the mean logarithmic score in the longitudinal count response situation, a simulation study was carried out. Following the real data structure of our clinical trial on patients with vertigo attacks, a parallel group design was assumed with four measurements occurring at times t = (t1, t2, t3, t4) = (0, 1, 2, 3) (exactly balanced design) for all subjects. There are two groups, each of size n, with different fixed time slopes, parameterized by β1 = −0.3 and β2 = −0.5, but equal starting points at time t1 = 0. To be more detailed, we considered the repeated count outcomes to follow a negative binomial distribution, conditional on the random effects. Accordingly, the true sampling model is Yij | μij ∼iid NB(k, pi), i = 1, ..., 2n, j = 1, ..., 4. To account for patient-specific variability, a random intercept ai was introduced, so the model can be summarized as

log μij = α + ai + tij [ β2 Gi + β1 (1 − Gi) ],

with ai | σa^−2 ∼ N(0, σa^2), and Gi representing the placebo and the verum group, respectively.

Table 2 Posterior mean of the deviance (D), deviance of the mean (D(θ)), effective number of parameters (pD) as a measure of model complexity, DIC value, and mean of logarithmic scores (LS)

Model       | D    | D(θ) | pD   | DIC   | LS    | p-value‡
Poisson, I  | 2034 | 1931 | 103  | 2137  | 2.046 | < 0.0001
Poisson, IS | 1633 | 1463 | 170  | 1803  | 1.641 | ref.
ZIP, I      | 2020 | 1917 | 103  | 2123  | 2.018 | < 0.0001
ZIP, IS     | 1637 | 1467 | 170  | 1807  | 1.644 | 0.5797
NB, I       | 2006 | 1903 | 103  | 2109  | 1.932 | < 0.0001
NB, IS      | 1659 | 1489 | 170  | 1829  | 1.665 | 0.6841
arcsinh, I  | 994* | 886* | 108* | 1102* | 1.967 | n.a.
arcsinh, IS | 834* | 636* | 198* | 1032* | 1.960 | n.a.

* Comparison of DIC for NMMs is not applicable because of the differently scaled arcsinh-transformed outcome. ‡ Monte Carlo permutation test for paired individual logarithmic scores.


The standard deviation of the random intercept was set to σa = 0.3 and the population intercept was fixed at α = 3. The following candidate GLMMs are ranked by DIC as well as evaluated with respect to their forecasting capability:

• negative binomial (true data-generating distribution),
• Poisson,
• zero-inflated Poisson,
• zero-inflated negative binomial,
• NMM for the arcsinh-transformed count outcome.

The ZIP and ZINB models were chosen to investigate whether the zero-inflated component improves the model performance. To define the simulation scenarios, we varied the sample size (n, number per group) and the degree of over-dispersion as follows: n = 20, 50, 100 and k = 0.5, 1, 5, 10, 20, 50. By combining the possible values of the sample size and the overdispersion parameter k, we therefore obtain count data for 18 different simulation scenarios, which can be analyzed with all five rival models described above. Each model scenario provided r = 100 simulation runs to assess the variability of the results.
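The paper's own generator is provided as make.negbin.rfc(.) in Additional file 2; the following independent sketch merely illustrates the data-generating process with the stated parameter values (function and variable names are ours):

    ## one simulation run of the NB random-intercept process (sketch)
    sim.negbin <- function(n = 50, k = 5, alpha = 3, beta1 = -0.3, beta2 = -0.5,
                           sigma.a = 0.3, times = 0:3) {
      N <- 2 * n
      G <- rep(0:1, each = n)                      # group indicator
      a <- rnorm(N, 0, sigma.a)                    # patient-specific intercepts
      d <- expand.grid(j = seq_along(times), i = seq_len(N))
      d$time <- times[d$j]
      d$G    <- G[d$i]
      mu  <- exp(alpha + a[d$i] + d$time * (beta2 * d$G + beta1 * (1 - d$G)))
      d$y <- rnbinom(nrow(d), size = k, mu = mu)   # NB counts, overdispersion 1/k
      d
    }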

All analyses were performed using the INLA approach. To get reliable results and to enhance the accuracy of the Bayesian predictive measures (i.e. DIC and logarithmic CPO values), the full Laplace approximation in combination with the so-called GRID integration scheme was chosen as the strategy for the deterministic approximation of the latent Gaussian field and the posterior marginals of the hyperparameters. For more details of the estimation procedure, the reader is referred to [37,51] and the Additional files 1 and 2 (Supplementary Material).

When working with small data sets, the prior distribution can become influential in the posterior results, especially with respect to the spread of the posterior distribution, even if non-informative settings are chosen. This can particularly be an issue with prior distributions on the variance components. Therefore, for the prior specification we followed the procedure outlined in [40] so as not to favor one modeling strategy over another. We assumed a marginal Cauchy distribution for the patient-specific intercept ai. A 95% range of [−0.6, 0.6] for ai gives the prior σa^−2 ∼ Ga(0.5, 0.001115); hence, integration over σa^−2 gives the marginal distribution of ai as t1(0, 0.00223, 1).
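In R-INLA such a Gamma prior on the random-intercept precision can be set through the hyper argument of f(); a minimal sketch with the values above (the response formula and the index variable i are our own illustrative choices):

    ## random-intercept term with a Ga(0.5, 0.001115) prior on the precision;
    ## "loggamma" parameterizes this prior on the log-precision scale (sketch)
    formula.sim <- y ~ time:G + time:I(1 - G) +
      f(i, model = "iid",
        hyper = list(prec = list(prior = "loggamma", param = c(0.5, 0.001115))))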

Simulation results
Figure 4 depicts boxplots for the different simulation scenarios when the count response data {yij^(r)}, i = 1, ..., 2n; j = 1, ..., 4; r = 1, ..., 100, are analyzed by choosing a Poisson GLMM, i.e. a wrong modeling strategy in the case of high overdispersion. Both DIC as a measure of model selection and the mean of the logarithmic scores (LS^(r), r = 1, ..., 100) were calculated for all 100 runs. The striking feature of these plots is that for all 6 setups, DIC and LS discriminate strongly between the wrong model and the true negative binomial model generating the counts. If a Poisson model is chosen for data with a considerably high amount of overdispersion (small k), higher score values are assigned to the predictive distribution. DIC is clearly influenced by the sample size because the deviance measure depends on the likelihood, whereas for LS the number of sampling units does not affect the scaling of the mean of the scores.

Based on these simulations, we conclude that DIC and LS provide suitable measures for ranking and evaluating model alternatives defined by different error distributions or variance structures.

For all three sample size situations (i.e. 40, 100, 200 units in total), Figure 5 reveals the difference in mean log scores for the true negative binomial GLMM compared with the following model alternatives: Poisson (neglecting over-dispersion), zero-inflation (assuming an excess of zeros), and a Gaussian response model after arcsinh-transformation of the counts yij. The inadequacy of the Poisson model in terms of probabilistic forecasting is evident in the case of high overdispersion, denoted by the parameter k. If k → ∞, the LS of the "wrong" Poisson model approximates the mean log score of the true NB model.

Furthermore, Table 3 reports the area under the receiver operating curve (AUC) as a summary measure for LS^(r), r = 1, ..., 100, of the true NB model and a competing model alternative, as displayed in Figure 5. For each combination of k (degree of overdispersion) and n (sample size), the discriminatory power of the mean log score was investigated. Perfect discrimination corresponds to an AUC value of 1, while random discrimination corresponds to an AUC value of 0.5. The AUC can be interpreted as the probability that the LS of the wrong model exceeds that of the true NB model, i.e. the probability that the wrong model has a lower predictive performance compared with the true data-generating distribution. Accordingly, it is the probability that results from a randomly selected pair of LS^(r) values for the wrong model and the true NB model are correctly ordered, namely Pr(LS^(r,wrong) > LS^(r′,true)), r ≠ r′.

An AUC near 1 indicates that the mean log score perfectly discriminates between the true NB model and the competing (wrong) model adopted for a particular scenario. Since the true data-generating distribution was negative binomial without an excess of zeros, the ZINB model did not perform worse than the true NB model and is suitable for prediction in almost the same manner, resulting in an AUC value of approximately 0.5 for all scenarios. For small k, the NB model clearly outperforms the competing (zero-inflated) Poisson models that do not account for overdispersion.
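A minimal sketch of this AUC computation over the r = 100 paired mean log scores, assuming hypothetical vectors ls.wrong and ls.true of per-run scores:

    ## AUC = Pr(LS_wrong > LS_true), estimated over all pairs of simulation runs
    ## (ties counted as one half, as in the Mann-Whitney statistic)
    auc <- mean(outer(ls.wrong, ls.true, ">") +
                0.5 * outer(ls.wrong, ls.true, "=="))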


Figure 4 Simulation study: discriminatory power of DIC and LS for different scenarios (100 runs per scenario). Data-generating process: longitudinal, negative binomial counts with subject-specific intercept (balanced design); modeling strategy: Poisson GLMM with random intercept; number of subjects per group: n = 20, 50, 100; degree of overdispersion: k = 0.5, 1, 5, 10, 20, 50. As k → ∞, the degree of overdispersion decreases and the negative binomial converges to a Poisson distribution. Hence, DIC and LS decline. Note that the range of DIC increases in the case of a larger sample size.

Analyzing negative binomial data with an arcsinh NMM as an alternative to accomplish variance stabilization, the AUC is lower than that of a wrong Poisson model. However, if k → ∞ and the amount of overdispersion goes down, the choice of an NMM for arcsinh-transformed counts results in an AUC clearly larger than 0.5. Hence, the quality of the observation-level predictions of the NMM is worse than that of the (zero-inflated) Poisson. If the negative binomial converges in distribution to the Poisson, the arcsinh-transformation of the count outcome is no longer appropriate.

Discussion
We have discussed Bayesian strategies for the model evaluation of GLMMs for longitudinal count data and used integrated nested Laplace approximations to do the calculations. We especially looked at tools such as the DIC, logarithmic score, and PIT. These techniques for model assessment are implemented in the package R-INLA, which can easily be used in R, and aim to score the models with respect to their appropriateness for explaining the observed data. Therefore, a very practical toolbox is at hand for statisticians. It must be noted that other instruments such as pivotal quantities [71] or different proper scoring rules [28] can be used if the calculations are done with MCMC methods (e.g. using WinBUGS [55,72]).

We applied this toolbox to the typical task of a clinical trial statistician of making decisions for pre-specified sensitivity analyses or the efficacy analysis in a statistical analysis plan. Data from a former trial were used as pilot data for an ongoing phase III trial. Our interest was to give some insight and guidance in the most important aspects of deciding on a final SAP. The main task consisted of deciding which GLMM should be used for the longitudinal count data. To this end, we performed a Bayesian analysis of the pilot data with different models and employed a prediction-based approach to derive statements on model fit.


Figure 5 Simulation study: variability of mean LS within different simulation scenarios. Variability of the mean logarithmic score LS^(r) for the true negative binomial (NB) model compared with the arcsinh, (zero-inflated) Poisson, and zero-inflated NB models (r = 1, ..., 100 iterations per scenario). Sample size: n = 20, 50, 100 subjects per group; k = 0.5, 1, 5, 10, 20, 50 determines the amount of overdispersion. Each row is a different value of k (amount of overdispersion), and each plot shows the mean LS^(r) for each competing model. The column of panels on the right has the largest sample size n; the top row exhibits the results for highly overdispersed counts (k = 0.5).


Table 3 Area under the curve (AUC) for comparison of the mean logarithmic score LS^(r) of the true vs. a wrong modeling strategy (r = 1, ..., 100 iterations per simulation scenario)

k (overdispersion) | Model   | AUC, n = 20 | AUC, n = 50 | AUC, n = 100
0.5 | Poi     | 1     | 1     | 1
0.5 | ZIP     | 1     | 1     | 1
0.5 | arcsinh | 0.796 | 0.781 | 0.953
0.5 | ZINB    | 0.498 | 0.498 | 0.499
1   | Poi     | 1     | 1     | 1
1   | ZIP     | 1     | 1     | 1
1   | arcsinh | 0.683 | 0.831 | 0.883
1   | ZINB    | 0.497 | 0.501 | 0.499
5   | Poi     | 0.999 | 1     | 1
5   | ZIP     | 0.999 | 1     | 1
5   | arcsinh | 0.673 | 0.796 | 0.877
5   | ZINB    | 0.526 | 0.513 | 0.509
10  | Poi     | 0.903 | 0.988 | 1
10  | ZIP     | 0.909 | 0.990 | 1
10  | arcsinh | 0.733 | 0.826 | 0.913
10  | ZINB    | 0.562 | 0.526 | 0.515
20  | Poi     | 0.675 | 0.831 | 0.920
20  | ZIP     | 0.686 | 0.837 | 0.925
20  | arcsinh | 0.773 | 0.902 | 0.966
20  | ZINB    | 0.628 | 0.564 | 0.548
50  | Poi     | 0.526 | 0.567 | 0.644
50  | ZIP     | 0.548 | 0.582 | 0.658
50  | arcsinh | 0.799 | 0.914 | 0.985
50  | ZINB    | 0.742 | 0.667 | 0.512

True model: negative binomial GLMM; competing modeling strategies: Poisson, ZIP, NMM for arcsinh-transformed counts, ZINB. Sample size: n = 20, 50, 100 patients per group; degree of overdispersion: k = 0.5, 1, 5, 10, 20, 50. AUC can be interpreted as a summary measure for the goodness of discrimination between the true negative binomial model generating the longitudinal data and rival models which should be taken into consideration in practice. For k → ∞, the difference between a negative binomial and the alternative Poisson model dissolves because of convergence in distribution; therefore, AUC → 0.5 and both model alternatives become comparable with respect to their forecasting ability.

We next discuss four important aspects of this process: prior distributions, the normality assumption for random effects, Bayesian model evaluation, and the modeling of clinical trial data.

Prior distributions
A Bayesian analysis needs a specification of prior distributions. However, when fitting a GLMM in a Bayesian setting, specifying prior distributions is not straightforward; this is particularly true for the variance components. Fong et al. [40] pointed out that the priors for variance components should be chosen carefully. To quantify the sensitivity of the posterior distributions with respect to changes in the priors for the random effects precision parameters, Roos & Held [73] discuss a measure based on the so-called Hellinger distance for GLMMs with binary outcome, but not for count data. Adapting their approach to count data is a topic for future research. In this study, we followed advice from the literature: in the case of negative binomial models, the estimation of the posterior mean of the dispersion parameter can be affected when a vague prior specification is used to characterize the gamma hyperparameter. To circumvent the problem of distorted posterior inferences, e.g. Lord et al. [74] recommend a non-vague prior distribution for the dispersion parameter to minimize the risk of a mis-estimated posterior mean and to obtain stable and valid results. This issue is particularly relevant for data characterized by a small sample size in combination with low sample mean values. The situation is quite complex, and the only practical way to handle this issue is a careful simulation study to investigate whether changing the priors would influence the decision on the relevant model. The material provided in the Web Supplement may help a statistician set up such simulation studies.

Gaussian random effects
Throughout our article the distribution of the random effects was assumed to be Gaussian. One reason was that Bayesian inference was based on the INLA approach. Within the INLA methodology, an extension to non-Gaussian random effects is not straightforward due to the central role of the latent Gaussian field. The main challenge in applying INLA to latent models is that the approach depends heavily on the latent Gaussian prior assumption to work properly. For further details on this issue see [75]. Recently, Martins & Rue [75] proposed an extension that allows INLA to be applied to models where some independent components of the latent field have a so-called "near-Gaussian" prior distribution. All in all, the assumption of Gaussian distributed random effects that is usually taken for granted may be subject to criticism, and there are a number of situations in which it might not be realistic. From a theoretical point of view, this normality assumption may be dropped in favour of other symmetric but heavier-tailed densities, such as the Student t-distribution, which allows one to identify and accommodate outliers both on the level of the within-group errors and at the level of the random effects [76]. Further research is needed to investigate the impact of inappropriate distributional assumptions, i.e. to understand their influence not only on posterior inference, but also on the various Bayesian instruments applied for model evaluation.


Bayesian model evaluation
The INLA approach for approximate fully Bayesian inference on the class of latent Gaussian models provides an attractive and convenient alternative to an inference scheme based on sampling-based methods such as MCMC, and avoids their computational burden. By taking advantage of the properties of latent Gaussian models, INLA outperforms MCMC schemes in terms of both accuracy and speed. Bayesian approaches naturally lead to posterior predictive distributions, from which any desired functional can readily be computed. We earlier discussed Bayesian methods for assessing probabilistic forecasts via proper scoring rules serving as a loss function. These scores can be used for an omnibus evaluation of both sharpness and calibration of predictive distributions and provide a usable instrument for assessing the validity of different competing, non-nested modeling strategies. It should also be noted that there is a variety of proper scoring rules, each with a unique and well-defined underlying decision problem, which can be applied in a given situation as well. According to Gneiting [62], there are many options and considerations in choosing a specific scoring function, and there is a need for theoretically principled guidance. In this article, we have focused on the logarithmic score, which is easily calculated and available from INLA. The mean logarithmic score was competitive for the simulated negative binomial data and, most importantly, it was able to identify the correct model as the one best suited for prediction: the true model generating the data emerged with the smallest LS. In contrast to proper scoring rules, PIT histograms allow evaluation of the predictive quality of a model with respect to calibration only, neglecting sharpness. Furthermore, the DIC was applied as a common model selection criterion that takes into account goodness of fit while penalizing models for overfitting. Despite its computational simplicity, DIC does have several drawbacks, particularly a tendency to under-penalize complex hierarchical models. Likewise, DIC is not suitable for comparing a model for a transformed outcome with competing models for data on the original scale. Accordingly, predictive checks should be preferred to rank different non-nested GLMM alternatives. Nevertheless, the question of what constitutes a noteworthy difference in DIC or mean scores to distinguish between competing model types has not yet received a satisfactory answer. For Bayes factors, calibration scales have been proposed, but no credible scale has been proposed for the difference in DIC or the difference in mean scores [54]. In conclusion, we recommend using several instruments for model evaluation to gain further insight into different aspects of a statistical model, such as forecasting ability, combined assessment of calibration and sharpness, and comparison of features of the model-based posterior predictive distribution to equivalent features of the observed data.

Modeling of clinical trial data
In the clinical trial setting, Bayesian instruments based on the INLA approach can be applied as decision support to pre-specify a suitable final model for sensitivity analyses. Provided that adequate pilot data exist, an appropriate modeling strategy is developed using prior information obtained from a trial in an earlier phase. Sensitivity analyses are important to investigate the effects of deviations from a statistical model and its underlying assumptions. Furthermore, it is necessary to assess in what way the (posterior) inference depends on extreme observations and on unverifiable model assumptions. Altogether, the situation-specific robustness of the proposed analyses must be checked carefully.

Conclusions
The statistical model must be specified in the SAP before acquiring the real trial data. Similar independent data (for example, from patients treated with the standard treatment) may serve as a basis for decision-making. The analyses proposed in the SAP have to be appropriate and should rely on a minimum of assumptions [77,78]. Sensitivity analyses help to assess whether, for example, results from simple testing procedures applied to the primary efficacy analysis agree with the results obtained by additional, more complex analyses. These analyses may consider more complex settings for individual-level parameters, or different distributional forms of individual inhomogeneity. By studying the agreement between different strategies via sensitivity analyses in the statistical report, such as simple tests for the primary analysis together with modeling approaches for the sensitivity analyses, it is possible to explore the robustness and accurate estimation of the treatment effects.

We look at the situation in which there are various possible analyses of a given hypothesis (in our case: no treatment×time interaction), all of which have different distributional assumptions (specified by different assumptions on the random effects structure and the corresponding distributional assumptions). In this case, robustness would come from different analyses, with different assumptions, showing substantial agreement. At first glance, there is no ordering that would allow us to declare one analysis better than another by virtue of relying on a specific distributional assumption. On a second look, the logarithmic score, the DIC, and the PIT provide scores to establish such an ordering.


We concentrated on justifying the distributional assumptions in the count response situation, that is, checking deviations from the assumptions regarding the stochastic part of the hierarchical model, since there was no evidence for specific prognostic factors or factors for baseline adjustment, which would improve the precision of the results when considered in the model. We also explored the effect of random effects structures that include subject-specific intercepts and/or slopes. The simulation could also demonstrate how much power would be lost if we chose a more general model (NB distribution, random intercept and random slope) compared with a simple model (Poisson distribution, random intercept, no random slope) when the simple model is true.

However, the Bayesian toolbox used is all-purpose and can be applied to detect other, more complex forms of misspecification as well, such as non-Gaussian distributed random effects [75], alternative functional relationships in the population-mean structure, random effects precisions depending on a binary covariate (e.g. treatment group), alternative prior distributions, or different hyperprior parameter values. Finally, part of the sensitivity analyses of the trial data may also be checking whether the modeling assumptions for the primary efficacy analysis are reasonable or not.

Appendix
A1. Variance stabilizing transformation for negative binomial outcome
Assume a random variable Y ∼ NB(k, p), using the notation as described before. In the limit, as the dispersion parameter k moves to infinity, Eμ[Y] = Varμ[Y] = μ and the negative binomial NB(k, p) converges to Poi(μ) in distribution. Hence, the variance of Y ∼ NB(k, p) can be described as a function of μ, i.e. Varμ[Y] = μ(μ/k + 1) := v(μ), for μ ≥ 0.

Searching for an asymptotic variance-stabilizing function means searching for a function T: R → R with the following property:

Varμ[T(Y)] ≈ const := c.  (9)

The transformation T is assumed to be strictly monotone, and without loss of generality it is assumed to be strictly increasing. Transformations satisfying (9) do not necessarily exist; but if they do exist, they are unique [79]. Using the Taylor approximation

T(Y) ≈ T(μ) + T′(μ) · (Y − μ)

we can write

Varμ[T(Y)] ≈ Varμ[T(μ) + T′(μ) · (Y − μ)]
           = (T′(μ))^2 · Varμ[Y − μ]
           = (T′(μ))^2 · v(μ).

The goal is to find a real-valued, measurable transformation T such that 1 = Varμ[T(Y)] ≈ (T′(μ))^2 · v(μ). This produces the following

Lemma. Let Yμ be a family of random variables with mean Eμ[Y] = μ and variance Varμ[Y] = v(μ). Then the asymptotic variance-stabilizing transformation for Y is given by

T(y) = ∫_{−∞}^{y} 1/√(v(u)) du,

achieving that Varμ[T(Y)] is independent of μ. □

Hence, according to this Lemma the asymptotic variance-stabilizing transformation for a negative binomial distribution is given by

T(y) = ∫_{0}^{y} 1/√(v(u)) du = ∫_{0}^{y} 1/√(u(zu + 1)) du = arcsinh(√(z y))/√z,  (10)

where z ≡ 1/k represents an overdispersion parameter which is defined by specifying NB(k, p). For z → 0 (i.e. k → ∞),

arcsinh(√(z y))/√z −→ √y.

Therefore, the variance stabilizing transformation for the Poisson distribution is √y. See [79,80] for a more detailed derivation.
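As a quick numeric sanity check of (10), note that the delta-method calculation above gives Var[T(Y)] ≈ (T′(μ))^2 v(μ) = 1/4 for every μ. This is our own illustration, not part of the original derivation:

    ## numeric check of the variance-stabilizing property (sketch):
    ## var(arcsinh(sqrt(z*y))/sqrt(z)) should be close to 1/4 for any mean mu
    z <- 1 / 5                                  # z = 1/k, i.e. k = 5
    for (mu in c(5, 20, 80)) {
      y <- rnbinom(1e5, size = 1 / z, mu = mu)  # NB counts with E[Y] = mu
      print(var(asinh(sqrt(z * y)) / sqrt(z)))  # approximately 0.25 throughout
    }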

A2. Backcalculation of the mean logarithmic score for arcsinh-transformed outcome to the original count scale
If count data exhibit overdispersion with respect to a Poisson model, a normal mixed effects model (NMM) for the arcsinh-transformed count response can be used to accomplish stabilization of variance. However, this implies that predictive performance measures such as the proper scoring rules are not computed on the original scale. In this section, we detail the calculation of a "correction term" needed to back-transform the mean log score for the arcsinh NMM to the mean log score on the original count scale. Note that only the mean of the log scores can be back-transformed, not the log score values at the observation level.

Let φ be the probability density of Y on R, and ψ be the probability density of Z on R, with z = g(y) = arcsinh(y) = log(y + √(y^2 + 1)), g^−1(z) = (exp(z) − exp(−z))/2, and the derivative g′(y) = 1/√(1 + y^2).


Suppose we have computed the mean log score on the transformed scale, denoted as mean(LS)|arcsinh. Using the substitution y = g^−1(z) and dy = [g^−1]′(z) dz, the following equations hold:

mean(LS)|original = E[−log φ(Y)] = ∫ [−log φ(y)] · φ(y) dy  (11)
  = ∫ [−log φ(g^−1(z))] · φ(g^−1(z)) · [g^−1]′(z) dz
  = ∫ [−log φ(g^−1(z))] · φ(g^−1(z)) / g′[g^−1(z)] dz
  = ∫ [−log{ψ(z) · g′[g^−1(z)]}] · ψ(z) dz
  = E[−log ψ(Z)] − E[log{g′[g^−1(Z)]}],

where the first term is mean(LS)|arcsinh, [g^−1]′(z) = 1/g′[g^−1(z)], and ψ(z) = φ(g^−1(z))/g′[g^−1(z)]. Hence, the correcting subtrahend required to convert the mean log score for the arcsinh NMM to the original scale results from E[log{g′[g^−1(Z)]}]. This is simply the empirical sample mean of a certain transformation of the materialized and the arcsinh-transformed counts, respectively. After this backcalculation, the mean log score for the arcsinh NMM can be compared with the mean log scores of the other model alternatives when assessing their forecasting capability.
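Operationally, the correction is a single line of R; a sketch assuming a hypothetical vector y of observed counts and a mean log score ls.arcsinh computed on the arcsinh scale:

    ## back-correction of the arcsinh-scale mean log score to the count scale:
    ## g'(y) = 1/sqrt(1 + y^2), so E[log g'(g^{-1}(Z))] is the sample mean below
    correction  <- mean(log(1 / sqrt(1 + y^2)))
    ls.original <- ls.arcsinh - correction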

A3. Vertigo data: comparison of INLA and MCMC
Figure 2 illustrates the approximated posterior marginals for the fixed effects (posterior marginal distributions for the intercept and the hyperparameters not shown). The dotted curve with overlaid histogram is the posterior marginal density resulting from an MCMC run based on 1000 (near independent) samples; the output was constructed with the built-in MCMC sampler (more precisely, the "one-block MCMC sampler" described in [81, chapter 4]) available within the inla program. Apparently, the Laplace approximation gives an accurate fit, indicating that MCMC and INLA provide comparable posterior estimates in this longitudinal setting.

Additional files

Additional file 1: Web-based Supplementary Material. This document contains further technical details concerning the INLA approach. Furthermore, chunks of R code illustrating the use of the R package R-INLA are provided.

Additional file 2: R file to generate the NB GLMM data used in the simulation study. The R function make.negbin.rfc(.) creates a data set with longitudinal counts (data generating process used: negative binomial random intercept models for varying degrees of overdispersion and sample sizes).

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
The authors' responsibilities were as follows: CA (guarantor) did the statistical analysis, designed the simulation study and wrote the first draft of the manuscript. UM reviewed and critiqued the manuscript and made substantial contributions to subsequent drafts. All authors checked and approved the final version of the manuscript.

Acknowledgements
This work was supported by the LMUinnovativ research priority project Munich Center of Health Sciences (MC-Health) of the LMU. Access to the vertigo data was kindly facilitated by Prof. Michael Strupp, Department of Neurology and German Dizziness Center (IFB-LMU), Campus Grosshadern. We are grateful to Prof. Havard Rue, Department of Mathematical Sciences, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, for helping with the R-INLA package and for his support and advice regarding the INLA methodology. We thank the associate editor and reviewers for their insightful comments, which were very useful in preparing the final version of this paper.

Received: 23 April 2012. Accepted: 2 August 2012. Published: 10 September 2012.

References
1. Schulz KF, Altman DG, Moher D: CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010, 340:c332.
2. Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG: CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010, 340:c869.
3. Brandes J, Saper J, Diamond M, Couch J, Lewis D, Schmitt J, Neto W, Schwabe S, Jacobs D, MIGR-002 Study Group: Topiramate for migraine prevention: a randomized controlled trial. JAMA 2004, 291(8):965–973.
4. Shih WJ, Quan H: Planning and analysis of repeated measures at key time-points in clinical trials sponsored by pharmaceutical companies. Stat Med 1999, 18(8):961–973.
5. Galbraith S, Marschner IC: Guidelines for the design of clinical trials with longitudinal outcomes. Controlled Clin Trials 2002, 23(3):257–273.
6. Brunner E, Domhof S, Langer F: Nonparametric Analysis of Longitudinal Data in Factorial Experiments. Wiley Series in Probability and Statistics, New York: Wiley; 2002.
7. Omar RZ, Wright EM, Turner RM, Thompson SG: Analysing repeated measurements data: a practical comparison of methods. Stat Med 1999, 18(13):1587–1603.
8. Matthews J, Altman D, Campbell M, Royston P: Analysis of serial measurements in medical research. BMJ 1990, 300(6719):230–5.
9. Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models. Biometrika 1986, 73:13–22.
10. Verbeke G, Molenberghs G: Models for Discrete Longitudinal Data. Springer Series in Statistics, New York: Springer; 2005.
11. Diggle P, Heagerty P, Liang KY, Zeger S: Analysis of Longitudinal Data. 2nd edition. Oxford: Oxford University Press; 2003.
12. Breslow NE, Clayton DG: Approximate inference in generalized linear mixed models. J Am Stat Assoc 1993, 88(421):9–25.


13. CHMP: Guideline on Missing Data in Confirmatory Clinical Trials. London, UK: Committee for Medicinal Products for Human Use (CHMP), European Medicines Agency (EMA); 2010. [http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2010/09/WC500096793.pdf]. [EMA/CPMP/EWP/1776/99 Rev. 1. Date for coming into effect: 1 January 2011. (This guideline replaces Points to Consider on Missing Data in Clinical Trials (CPMP/EWP/1776/99) from 2001)].
14. Little R, Rubin D: Statistical Analysis with Missing Data. 2nd edition, Wiley Series in Probability and Statistics, New York: Wiley; 2002.
15. Fairclough DL: Design and Analysis of Quality of Life Studies in Clinical Trials. 2nd edition, Interdisciplinary Statistics, Boca Raton, FL: Chapman & Hall/CRC; 2010.
16. ICH Harmonised Tripartite Guideline: Statistical principles for clinical trials – E9. International Conference on Harmonisation E9 Expert Working Group; 1998. [http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Efficacy/E9/Step4/E9_Guideline.pdf]. [CPMP/ICH/363/96. Current Step 4 version].
17. PSI Professional Standards Working Party (Statisticians in the Pharmaceutical Industry): Guidelines for standard operating procedures for good statistical practice in clinical research. Drug Inf J 1998, 32:665–682. [http://www.psiweb.org/docs/gsop.pdf].
18. Keene O: The log transformation is special. Stat Med 1995, 14(8):811–9.
19. FDA: Guideline for the Format and Content of the Clinical and Statistical Sections of an Application. Rockville, MD: Food and Drug Administration, Center for Drug Evaluation and Research, Department of Health and Human Services. Published by U.S. Dept. of Health and Human Services, Public Health Service; 1988.
20. Cai B, Dunson D: Bayesian covariance selection in generalized linear mixed models. Biometrics 2006, 62(2):446–457.
21. Abad AA, Litiere S, Molenberghs G: Testing for misspecification in generalized linear mixed models. Biostatistics 2010, 11(4):771–786. [http://dx.doi.org/10.1093/biostatistics/kxq019].
22. Litiere S, Alonso A, Molenberghs G: The impact of a misspecified random-effects distribution on the estimation and the performance of inferential procedures in generalized linear mixed models. Stat Med 2008, 27(16):3125–3144.

23. Pocock SJ, Assmann SE, Enos LE, Kasten LE: Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med 2002, 21(19):2917–2930.
24. Thabane L, Ma J, Chu R, Cheng J, Ismaila A, Rios L, Robson R, Thabane M, Giangregorio L, Goldsmith C: A tutorial on pilot studies: the what, why and how. BMC Medical Res Methodology 2010, 10:1–10.
25. Pan Z, Lin D: Goodness-of-fit methods for generalized linear mixed models. Biometrics 2005, 61(4):1000–1009.
26. Vonesh E, Chinchilli V, Pu K: Goodness-of-fit in generalized nonlinear mixed-effects models. Biometrics 1996, 52(2):572–587.
27. Dawid AP: Statistical theory: the prequential approach. J R Stat Soc Ser A (Gen) 1984, 147(2):278–292.
28. Czado C, Gneiting T, Held L: Predictive model assessment for count data. Biometrics 2009, 65(4):1254–1261.
29. Gneiting T, Stanberry L, Grimit E, Held L, Johnson N: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST 2008, 17:211–235.
30. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A: Bayesian measures of model complexity and fit (with discussion). J R Stat Soc: Ser B (Stat Methodology) 2002, 64(4):583–639.
31. Gelman A, Meng X: Model checking and model improvement. In Markov Chain Monte Carlo in Practice, Interdisciplinary Statistics. Edited by Gilks WR, Richardson S, Spiegelhalter DJ. Boca Raton, FL: Chapman & Hall/CRC; 1996:189–202.
32. Gelfand AE, Dey DK, Chang H (Eds): Model determination using predictive distributions with implementation via sampling-based methods. Oxford: Oxford University Press; 1992. [http://statistics.stanford.edu/~ckirby/techreports/ONR/SOL%20ONR%20462.pdf].
33. Gneiting T, Raftery AE: Strictly proper scoring rules, prediction and estimation. J Am Stat Assoc 2007, 102(477):359–378.
34. Gneiting T, Balabdaoui F, Raftery AE: Probabilistic forecasts, calibration and sharpness. J R Stat Soc: Ser B (Stat Methodology) 2007, 69(2):243–268.
35. Gneiting T, Ranjan R: Comparing Density Forecasts Using Threshold and Quantile Weighted Scoring Rules. Technical Report 533, Department of Statistics, University of Washington, Washington 2008. [http://www.stat.washington.edu/research/reports/2008/tr533.pdf].
36. Ehm W, Gneiting T: Local Proper Scoring Rules. Technical Report 551, Department of Statistics, University of Washington, Washington 2009. [http://www.stat.washington.edu/research/reports/2009/tr551.pdf].
37. Rue H, Martino S, Chopin N: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc: Ser B (Stat Methodology) 2009, 71(2):319–392.
38. Martino S, Rue H: Implementing Approximate Bayesian Inference using Integrated Nested Laplace Approximation: a manual for the inla program. Department of Mathematical Sciences, NTNU, Trondheim, Norway 2010. [http://www.math.ntnu.no/~hrue/r-inla.org/doc/inla-manual/inla-manual.pdf].
39. R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. [http://www.R-project.org]. [ISBN 3-900051-07-0].
40. Fong Y, Rue H, Wakefield J: Bayesian inference for generalized linear mixed models. Biostatistics 2010, 11(3):397–412.
41. Wakefield J: Multi-level modelling, the ecologic fallacy, and hybrid study designs. Int J Epidemiol 2009, 38(2):330–336.
42. Lambert D: Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34:1–14.
43. Yau KKW, Wang K, Lee AH: Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometrical J 2003, 45(4):437–452.
44. Hilbe JM: Negative Binomial Regression. 2nd edition. Cambridge: Cambridge Univ. Press; 2011.
45. Dagne GA: Hierarchical Bayesian analysis of correlated zero-inflated count data. Biometrical J 2004, 46(6):653–663.
46. Ghosh SK, Mukhopadhyay P, Lu JC: Bayesian analysis of zero-inflated regression models. J Stat Planning and Inference 2006, 136(4):1360–1375.
47. Jeffrey A: Inverse trigonometric and hyperbolic functions. In Handbook of Mathematical Formulas and Integrals. Edited by Jeffrey A, Dai H. Orlando, FL: Academic Press; 2000:124–144.
48. Gilks WR, Richardson S, Spiegelhalter DJ (Eds): Markov Chain Monte Carlo in Practice. 1st edition, Interdisciplinary Statistics, Boca Raton, FL: Chapman & Hall/CRC; 1996.
49. Gelman A: Bayesian Data Analysis. 2nd edition, Texts in Statistical Science, Boca Raton, FL: Chapman & Hall; 2004.
50. Carlin BP, Louis TA: Bayesian Methods for Data Analysis, Volume 78. Boca Raton, FL: Chapman & Hall/CRC; 2009.
51. Martino S, Rue H: Case studies in Bayesian computation using INLA. In Complex Data Modeling and Computationally Intensive Statistical Methods. Edited by Mantovan P, Secchi P. Milan: Springer Verlag Italia; 2010:99–114.
52. Held L, Schrodle B, Rue H: Posterior and cross-validatory predictive checks: a comparison of MCMC and INLA. In Statistical Modelling and Regression Structures – Festschrift in Honour of Ludwig Fahrmeir. Edited by Kneib T, Tutz G. Heidelberg: Physica-Verlag; 2010:91–110.
53. Simpson D, Lindgren F, Rue H: Fast approximate inference with INLA: the past, the present and the future. Technical Report at arxiv.org, Department of Mathematical Sciences, Norwegian University of Science and Technology (NTNU), Trondheim, Norway 2011.
54. Plummer M: Penalized loss functions for Bayesian model comparison. Biostatistics 2008, 9(3):523–539.
55. Lunn DJ, Thomas A, Best N, Spiegelhalter D: WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 2000, 10:325–337.
56. Spiegelhalter D, Thomas A, Best N, Lunn D: WinBUGS Version 1.4 User Manual. MRC Biostatistics Unit, Cambridge 2007. [www.mrc-bsu.cam.ac.uk/bugs].
57. Cowles MK: Review of WinBUGS 1.4. The Am Stat 2004, 58(4):330–336.
58. Sturtz S, Ligges U, Gelman A: R2WinBUGS: a package for running WinBUGS from R. J Stat Software 2005, 12(3):1–16.
59. Geisser S: Predictive Inference: An Introduction, Volume 55 of Monographs on Statistics and Applied Probability. New York: Chapman & Hall; 1993.
60. Pettit LI: The conditional predictive ordinate for the normal distribution. J R Stat Soc: Ser B (Stat Methodology) 1990, 52:175–184.


61. Congdon P: Applied Bayesian Modelling. Wiley Series in Probability and Statistics, Chichester: Wiley; 2003.
62. Gneiting T: Making and evaluating point forecasts. J Am Stat Assoc 2011, 106(494):746–762. [http://pubs.amstat.org/doi/abs/10.1198/jasa.2011.r10138].
63. Held L: Methoden der statistischen Inferenz: Likelihood und Bayes. Heidelberg: Spektrum Akad. Verlag; 2008.
64. Good IJ: Rational decisions. J R Stat Soc: Ser B (Stat Methodology) 1952, 14:107–114.
65. Ludbrook J, Dudley H: Why permutation tests are superior to t and F tests in biomedical research. Am Stat 1998, 52(2):127–132.
66. Paul M, Held L: Predictive assessment of a non-linear random effects model for multivariate time series of infectious disease counts. Stat Med 2011, 30(10):1118–1136.
67. Marshall EC, Spiegelhalter DJ: Approximate cross-validatory predictive checks in disease mapping models. Stat Med 2003, 22(10):1649–1660.
68. Marshall EC, Spiegelhalter DJ: Identifying outliers in Bayesian hierarchical models: a simulation-based approach. Bayesian Analysis 2007, 2(2):409–444.
69. Stern HS, Cressie N: Posterior predictive model checks for disease mapping models. Stat Med 2000, 19(17-18):2377–2397.
70. Strupp M, Hupert D, Frenzel C, Wagner J, Hahn A, Jahn K, Zingler VC, Mansmann U, Brandt T: Long-term prophylactic treatment of attacks of vertigo in Meniere's disease – comparison of a high with a low dosage of betahistine in an open trial. Acta Oto-Laryngologica 2008, 128(5):520–524.
71. Johnson V: Bayesian model assessment using pivotal quantities. Bayesian Analysis 2007, 2(4):719–733. [http://ba.stat.cmu.edu/journal/2007/vol02/issue04/johnson.pdf].
72. Lunn D, Spiegelhalter D, Thomas A, Best N: The BUGS project: evolution, critique and future directions. Stat Med 2009, 28(25):3049–3067. [http://dx.doi.org/10.1002/sim.3680].
73. Roos M, Held L: Sensitivity analysis in Bayesian generalized linear mixed models for binary data. Bayesian Analysis 2011, 6(2):259–278.
74. Lord D, Miranda-Moreno L: Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: a Bayesian perspective. J Saf Sci 2008, 46:751–770.
75. Martins TG, Rue H: Extending INLA to a class of near-Gaussian latent models. Technical Report, Preprint Statistics No. 6/2012, Department of Mathematical Sciences, Norwegian University of Science and Technology (NTNU), Trondheim, Norway 2012. [http://www.math.ntnu.no/preprint/statistics/2012/S6-2012.pdf].
76. Pinheiro JC, Liu C, Wu YN: Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. J Comput and Graphical Stat 2001, 10(2):249–276. [http://www.jstor.org/stable/1391011].
77. Berger V, Lunneborg C, Ernst M, Levine J: Parametric analyses in randomized clinical trials. Journal of Modern Applied Statistical Methods 2002, 1:74–82.
78. Gelfond JA, Heitman E, Pollock BH, Klugman CM: Principles for the ethical analysis of clinical and translational research. Stat Med 2011, 30(23):2785–2792. [http://dx.doi.org/10.1002/sim.4282].
79. Tibshirani R: Estimating transformations for regression via additivity and variance stabilization. J Am Stat Assoc 1988, 83(402):394–405.
80. Hastie T, Tibshirani R: Generalized Additive Models. Monographs on Statistics and Applied Probability, London: Chapman & Hall/CRC; 1990.
81. Rue H, Held L: Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability, London: Chapman & Hall/CRC; 2005.

doi:10.1186/1471-2288-12-137
Cite this article as: Adrion and Mansmann: Bayesian model selection techniques as decision support for shaping a statistical analysis plan of a clinical trial: An example from a vertigo phase III study with longitudinal count data as primary endpoint. BMC Medical Research Methodology 2012, 12:137.
