model-based gene set analysis of genome-scale data

GOing Bayesian: model-based gene set analysis of genome-scale data

Sebastian Bauer1, Julien Gagneur2 and Peter N. Robinson1,3,4,*

1Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, 2European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, 3Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin and 4Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Berlin, Germany

Received November 30, 2009; Revised January 14, 2010; Accepted January 15, 2010

ABSTRACT

The interpretation of data-driven experiments in genomics often involves a search for biological categories that are enriched for the responder genes identified by the experiments. However, knowledge bases such as the Gene Ontology (GO) contain hundreds or thousands of categories with very high overlap between categories. Thus, enrichment analysis performed on one category at a time frequently returns large numbers of correlated categories, leaving the choice of the most relevant ones to the user’s interpretation.

Here we present model-based gene set analysis (MGSA) that analyzes all categories at once by embedding them in a Bayesian network, in which gene response is modeled as a function of the activation of biological categories. Probabilistic inference is used to identify the active categories. The Bayesian modeling approach naturally takes category overlap into account and avoids the need for multiple testing corrections met in single-category enrichment analysis. On simulated data, MGSA identifies active categories with up to 95% precision at a recall of 20% for moderate settings of noise, leading to a 10-fold precision improvement over single-category statistical enrichment analysis. Application to a gene expression data set in yeast demonstrates that the method provides high-level, summarized views of core biological processes and correctly eliminates confounding associations.

*To whom correspondence should be addressed. Tel: +49 30 450569122; Fax: +49 30 450569915; Email: [email protected]

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Published online 19 February 2010. Nucleic Acids Research, 2010, Vol. 38, No. 11, 3523–3532. doi:10.1093/nar/gkq045

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION

Many studies in functional genomics follow a data-driven approach. Experiments such as transcriptional profiling with microarrays, ChIP-on-chip or gene knock-out screens are done at the scale of the whole genome without specifying a prior hypothesis. Instead, one seeks to discover new phenomena and generate new hypotheses from the data. Loosely formulated, the main question driving the analysis of data-driven experiments is: what is going on?

Although for technical and biological reasons the nature of the data differs between these types of experiments, they often can be summarized by a list of genes which responded to the experiment, e.g. genes found to be differentially expressed, bound by a particular transcription factor or whose knock-down elicits a phenotype of interest. However, extensive lists of responder genes are not per se useful to describe the experiment. A practical way to address the question of ‘what is going on?’ is to perform a gene category analysis, i.e. to ask whether these responder genes (which we will refer to as the study set) share some biological features that distinguish them among the set of all genes tested in the experiment (which we will refer to as the population). Gene category analysis involves a list of gene categories, such as those provided by the Gene Ontology (GO) (1) or the pathways of the KEGG database (2), and a statistical method for identifying enriched categories such as overrepresentation analysis using Fisher’s exact test (3), gene set enrichment (4–7), logistic regression (8), random-set analysis (9) and Bayesian techniques for analyzing GO terms in a context where not all annotated genes are observed (10).

Gene category analyses that follow the mentioned approaches often return a large number of significant categories, which are related to one another and leave to the user the task of choosing the most meaningful categories at the risk of relying on biased judgments. The reason for the correlation is that genes can belong to multiple categories, so that if one category is significantly overrepresented in the study set, then it is more likely that other categories with many genes in common with it will also be significantly overrepresented. While these enrichments are correct in a statistical sense, long lists of results can make it difficult to focus on the most important ones in a biological sense.

With GO, annotations are understood to follow the true-path rule, meaning that whenever a gene is annotated to a term it is also implicitly associated with all parents of that term. This is an important cause of gene sharing, a phenomenon which we have termed the inheritance problem (11). A number of methods have been proposed to deal with these problems. The elim algorithm examines the GO graph in a bottom-up fashion (12). Once a GO term is found to be significantly overrepresented according to Fisher’s exact test, all genes annotated to it are removed from further analysis. A variant of this algorithm downweights the genes rather than eliminating them completely (12). The parent–child algorithm determines overrepresentation of a term in the context of annotations to the term’s parents by redefining the groups used to calculate the hypergeometric distribution (11).

All of the aforementioned procedures successively test overrepresentation for each of the categories. The above methods make use of the structure of GO to address statistical dependencies, but they are limited to ontologies and do not fundamentally differ from the original paradigm of term-for-term testing with the exact Fisher test. More recently, an approach called GenGO was presented, which fits a model on all the terms simultaneously. GenGO models the set of responder genes as members of a set of active GO terms. The objective of GenGO is to identify the most likely combination of active terms, while allowing for some false positive and some false negative responder genes. The fitting procedure optimizes an objective function that combines the likelihood of the model with a penalization that tends to limit the overall number of active terms. This procedure was shown to outperform other methods for detecting GO-term overrepresentation on simulated data and to identify concise yet biologically relevant sets of significantly overrepresented GO terms when applied to real data sets (13).

Here we present model-based gene set analysis (MGSA), a model-based approach that significantly improves over GenGO by scoring terms with their posterior probabilities, leading to more robust results and to improvements over standard methods across a broader range of sensitivity cutoffs. Our benchmarks on simulated data confirm the drastic improvements of model-based approaches over term-by-term methods. MGSA uses a simple and adaptable Bayesian network that provides a great deal more flexibility as to the kinds of data that can be used in the analysis and to extensions of the model. MGSA is implemented as part of the free and open-source software Ontologizer (14), a Java application that implements a large number of GO enrichment methods and enables visual exploration of the results.

METHODS

Model

We model gene response in a genome-wide experiment as the result of an activation of a number of biological categories. These categories can be pathways as defined by the KEGG database (2), GO terms (15) or any other scheme (5,16) that associates genes to potentially overlapping biologically meaningful categories. Because we primarily work with GO, we call these categories terms. Our method does not make use of the graph structure of GO other than utilizing the true-path rule, which states that if a gene is associated to a term, then it is also associated to all terms along the path up to the root of the ontology. Apart from that we make no explicit use of the structure.

We assume that the experiment attempts to detect genes that have a particular state (such as differential expression), which can be on or off. The true state of any gene is hidden. The experiment and its associated analysis provide observations of the gene states that are associated with unknown false positive (α) and false negative (β) rates, which we will assume to be identical and independent for all genes.

For instance, in the setting of a microarray experiment, the on state would correspond to differential expression, and the off state would correspond to a lack of differential expression of a gene. Our model, hence, assumes that differential expression is the consequence of the annotation to some terms that are on.

An additional parameter p represents the prior probability of a term being in the on state. The probability p is typically low (<0.5), introducing an effective penalization for the number of active terms. This ingredient promotes parsimonious explanations of the data.

More formally, our model is a Bayesian network with three layers augmented with a set of parameters (Figure 1):

(1) The term layer $T = \{T_1, \ldots, T_m\}$ consists of Boolean nodes that represent the m terms. There is a Boolean variable associated with each node that can have the state values on (1) or off (0).

(2) The hidden layer $H = \{H_1, \ldots, H_n\}$ contains Boolean nodes that represent the n annotated genes. There are edges from the terms to their annotated genes. For instance, if gene $H_1$ is annotated to terms $T_1$ and $T_2$ then there is an edge between $T_1$ and $H_1$ and another edge between $T_2$ and $H_1$. The state of the nodes reflects the true activation pattern of the genes. Each node can have the state values on (1) or off (0).

(3) The observed layer $O = \{O_1, \ldots, O_n\}$ contains Boolean nodes reflecting the state of all observed genes. The observed gene state nodes are directly connected to the corresponding hidden gene state nodes in a one-to-one fashion.

(4) The parameter set contains continuous nodes with values in [0,1] corresponding to the parameters of the model α, β and p. These parameterize the distributions of the observed and the term layer as detailed below.

Figure 1. A Bayesian network to model gene response with gene categories. Gene categories, or terms ($T_i$, ellipses) can be either on or off. Terms that are on activate the hidden state ($H_j$, rectangles) of all genes annotated to them, the other genes remain off. The observed states ($O_j$, diamonds) of the genes are noisy observations of their true hidden state. The parameters of the model (light gray nodes) are the prior probability of each term to be active, p, the false positive rate, α, and the false negative rate, β.

For didactic purposes, we will initially explain a simplified version of our procedure in which the parameters α, β and p are considered to have known, fixed values. We will then show how the Bayesian network can be augmented to search for optimal values for α, β and p.

The state propagation of the nodes can be modeled using various local probability distributions (LPDs), denoted by P. The joint probability distribution for this Bayesian network can be written as

$$P(T,H,O) = P(T)\,P(H \mid T)\,P(O \mid H) = P(T) \prod_{i=1}^{n} P(H_i \mid T)\,P(O_i \mid H_i). \tag{1}$$

We model the state of each $T_j \in T$ according to a Bernoulli distribution with hyperparameter p, i.e. $P(T_j = 1) = p$. Denote by $m_{x|T}$ the number of terms that have state x for a given T, i.e. $m_{x|T} = |\{ j \mid T_j = x \}|$; then

$$P(T) = p^{m_{1|T}} (1-p)^{m_{0|T}}. \tag{2}$$

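In the following, we denote by $T(H_i) \subseteq T$ the set of terms to which gene $H_i$ is annotated, i.e. the terms that annotate $H_i$ directly together with their ancestors. For the $T \to H$ links, any node $H_i \in H$ is on ($H_i = 1$) if at least one term in $T(H_i)$ is on. Otherwise it is off:

$$P(H_i = 1 \mid T) = \begin{cases} 1 & \text{if } \exists\, T_j \in T(H_i) : T_j = 1 \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$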

Note that this transition is deterministic. For the $H \to O$ connection, we choose the following two Bernoulli distributions: $P(O_i = 1 \mid H_i = 0) = \alpha$ and $P(O_i = 0 \mid H_i = 1) = \beta$.

Therefore, α is the probability that a gene i is observed to be on (i.e. $O_i = 1$), although its true hidden state is actually off (i.e. $H_i = 0$) and thus, none of the terms which annotate the gene are on. Correspondingly, β is the probability of a gene being observed to be off, although at least one term that annotates it is on.

Denote by $n_{xy|T} = |\{ i \mid O_i = x \wedge H_i = y \}|$ the number of genes having observed activation x and true activation y according to the states of T. For instance, $n_{01|T}$ corresponds to the number of genes observed to be not differentially expressed but whose true activation state is on. Then, by considering the LPDs of nodes, we get the following product of Bernoulli distributions for $P(O \mid T) = \prod_{i=1}^{n} P(H_i \mid T)\,P(O_i \mid H_i)$:

$$P(O \mid T) = \alpha^{n_{10|T}} (1-\alpha)^{n_{00|T}} (1-\beta)^{n_{11|T}} \beta^{n_{01|T}}. \tag{4}$$
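As an illustration, Equations (2)–(4) can be evaluated in log space with a few lines of Python. This is a minimal sketch and not the Ontologizer implementation; the function names and the data layout (`gene_terms[i]` lists the indices of all terms annotating gene i, ancestors included) are assumptions made for the example.

```python
import math

def hidden_states(term_on, gene_terms):
    """Equation (3): a gene is on iff at least one of its annotating terms is on."""
    return [any(term_on[j] for j in terms) for terms in gene_terms]

def log_prior(term_on, p):
    """Logarithm of Equation (2)."""
    m1 = sum(term_on)
    m0 = len(term_on) - m1
    return m1 * math.log(p) + m0 * math.log(1.0 - p)

def log_likelihood(term_on, gene_terms, observed, alpha, beta):
    """Logarithm of Equation (4), built from the counts n_xy (x observed, y hidden)."""
    hidden = hidden_states(term_on, gene_terms)
    n = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for o, h in zip(observed, hidden):
        n[(int(o), int(h))] += 1
    return (n[(1, 0)] * math.log(alpha) + n[(0, 0)] * math.log(1.0 - alpha)
            + n[(1, 1)] * math.log(1.0 - beta) + n[(0, 1)] * math.log(beta))

# Toy data (hypothetical): 3 terms, 4 genes, only term 0 is on.
gene_terms = [[0], [0, 1], [1], [2]]
term_on = [1, 0, 0]
observed = [1, 0, 0, 1]
print(log_prior(term_on, p=0.2), log_likelihood(term_on, gene_terms, observed, 0.1, 0.4))
```

Working with logarithms avoids numerical underflow when n is in the thousands, as in the data sets considered here.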

Markov Chain Monte Carlo algorithm for marginal probabilities inference with known parameters

As is often the case, marginal posteriors for our network cannot be derived analytically. We estimate these values using a Metropolis–Hastings algorithm, which is a Markov chain Monte Carlo (MCMC) method (17–19). The MCMC algorithm performs a random walk over the term and parameter configurations, which asymptotically provides a random sampler according to the target distribution $P(T \mid O)$.

Given the current configuration of the terms denoted by $T^t$, the algorithm proposes a neighbor state $T^p$ according to a proposal density function $Q_T(\cdot \mid T^t)$. We sample a value r uniformly from the range (0,1). Then, if

$$r < P_{\mathrm{accept}}(T^t, T^p) = \frac{P(T^p \mid O)\,Q_T(T^t \mid T^p)}{P(T^t \mid O)\,Q_T(T^p \mid T^t)} \tag{5}$$

the proposal is accepted, i.e. $T^{t+1} = T^p$, otherwise it is rejected, i.e. $T^{t+1} = T^t$. Using Bayes’ law, we have

$$P(T^p \mid O) = \frac{P(O \mid T^p)\,P(T^p)}{P(O)} \tag{6}$$

and similarly for $T^t$. Substituting these expressions for $P(T^p \mid O)$ and $P(T^t \mid O)$ cancels out the normalization constant $P(O)$. The acceptance probability is then:

$$P_{\mathrm{accept}}(T^t, T^p) = \frac{P(O \mid T^p)\,P(T^p)\,Q_T(T^t \mid T^p)}{P(O \mid T^t)\,P(T^t)\,Q_T(T^p \mid T^t)}. \tag{7}$$

Equation (5) is used iteratively to define a random walk through the space of configurations. A burn-in period consisting of a certain number of iterations is used to initialize the MCMC chain (in our implementation, the default is 20 000 iterations). Following this, l further iterations (default $10^6$) are performed. Let $C(T_i)$ be the number of samples in which term $T_i$ was on. Then

$$P(T_i \mid O) \approx \frac{C(T_i)}{l}.$$

In order to finish the description of the algorithm, we need to define the classes of operations from which a proposal is chosen, that is, we need to specify $Q_T(T^p \mid T^t)$. We denote by $T^p \leftrightarrow_T T^t$ the binary relation that states that $T^p$ can be constructed from $T^t$ by either

• toggling the on/off state of a single term, or by
• exchanging the state of a pair of terms that contains a single on term and a single off term.

We denote by $N(T)$ the size of the neighborhood of a given configuration T, that is, the number of different operations that can be applied once to T in order to get a new configuration. First, there are m terms in total, each of which can be toggled. In addition, there are $m_{0|T}\,m_{1|T}$ possibilities to combine terms that are on with terms that are off. Thus, there are a total of $N(T) = m + m_{0|T}\,m_{1|T}$ valid state transitions. We would like to sample the valid proposals with equal probability, therefore the proposal distribution $Q_T$ is determined by

$$Q_T(T^p \mid T^t) = \begin{cases} \dfrac{1}{N(T^t)} & \text{if } T^p \leftrightarrow_T T^t \\ 0 & \text{otherwise,} \end{cases} \tag{8}$$

which we can use to rewrite Equation (7) to:

$$P_{\mathrm{accept}}(T^t, T^p) = \frac{P(O \mid T^p)\,P(T^p)\,N(T^t)}{P(O \mid T^t)\,P(T^t)\,N(T^p)}.$$
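For numerical work it can be convenient to evaluate this ratio in log space. The following is a sketch obtained by substituting Equations (2) and (4) into the ratio above; the shorthand $\theta_{xy}$ is introduced here for illustration only and is not part of the original notation.

```latex
\log P_{\mathrm{accept}}(T^t, T^p)
  = \sum_{x,y \in \{0,1\}} \left( n_{xy|T^p} - n_{xy|T^t} \right) \log \theta_{xy}
  + \left( m_{1|T^p} - m_{1|T^t} \right) \log \frac{p}{1-p}
  + \log \frac{N(T^t)}{N(T^p)},
\qquad
\theta_{10} = \alpha,\;
\theta_{00} = 1-\alpha,\;
\theta_{11} = 1-\beta,\;
\theta_{01} = \beta .
```

The prior term uses $m_{0|T} = m - m_{1|T}$, so a toggle contributes $\pm\log\frac{p}{1-p}$ and an exchange contributes nothing. Because r is drawn from (0,1), any ratio above 1 is always accepted, so the effective acceptance probability is $\min(1, P_{\mathrm{accept}})$.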

The procedure is shown in Algorithm 1. For simplicity, the burn-in period is omitted from the pseudocode. It is easy to see that all states of the chain are reachable from any state, as the Markov chain is finite and it is possible to reach an arbitrary state from any other state by a finite number of operations. This accounts for the irreducibility of the chain. Moreover, the chain is aperiodic as it is always possible to stay in the same state, as any proposal can be rejected. Therefore, the resulting Markov chain is ergodic, which is a sufficient condition for convergence to a stationary distribution, which matches the desired target distribution.
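The sampler with known parameters can be sketched in Python as follows. This is an illustrative reading of the procedure described above, not the Ontologizer code: the function names, the data layout (`gene_terms[i]` lists the terms annotating gene i) and the toy data are assumptions, and the posterior is recomputed from scratch at every step for clarity rather than updated incrementally.

```python
import math
import random

def log_posterior_unnorm(term_on, gene_terms, observed, alpha, beta, p):
    """log P(O|T) + log P(T), i.e. the logarithms of Equations (4) and (2)."""
    m1 = sum(term_on)
    logp = m1 * math.log(p) + (len(term_on) - m1) * math.log(1.0 - p)
    for terms, o in zip(gene_terms, observed):
        if any(term_on[j] for j in terms):          # hidden state on, Equation (3)
            logp += math.log(1.0 - beta) if o else math.log(beta)
        else:                                       # hidden state off
            logp += math.log(alpha) if o else math.log(1.0 - alpha)
    return logp

def neighborhood_size(term_on):
    """N(T) = m + m0 * m1."""
    m1 = sum(term_on)
    return len(term_on) + (len(term_on) - m1) * m1

def propose(term_on, rng):
    """Pick one of the N(T) toggle or exchange moves uniformly at random."""
    m = len(term_on)
    on = [j for j, s in enumerate(term_on) if s]
    off = [j for j, s in enumerate(term_on) if not s]
    k = rng.randrange(m + len(on) * len(off))
    new = list(term_on)
    if k < m:                                       # toggle a single term
        new[k] = 1 - new[k]
    else:                                           # exchange an on term with an off term
        k -= m
        i, j = on[k // len(off)], off[k % len(off)]
        new[i], new[j] = 0, 1
    return new

def mgsa_fixed(gene_terms, observed, m, alpha, beta, p, steps, burn_in=0, seed=0):
    """Estimate P(T_i = 1 | O) as C(T_i) / steps after an optional burn-in."""
    rng = random.Random(seed)
    current = [0] * m
    cur_lp = log_posterior_unnorm(current, gene_terms, observed, alpha, beta, p)
    counts = [0] * m
    for t in range(burn_in + steps):
        cand = propose(current, rng)
        cand_lp = log_posterior_unnorm(cand, gene_terms, observed, alpha, beta, p)
        log_a = (cand_lp - cur_lp + math.log(neighborhood_size(current))
                 - math.log(neighborhood_size(cand)))
        if rng.random() < math.exp(min(0.0, log_a)):  # accept with probability min(1, a)
            current, cur_lp = cand, cand_lp
        if t >= burn_in:
            for j, s in enumerate(current):
                counts[j] += s
    return [c / steps for c in counts]

# Hypothetical toy data: term 0 explains the five observed genes.
gene_terms = [[0], [0], [0], [0, 1], [0, 1], [1], [1], [1], [2], [2], [], []]
observed = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
posterior = mgsa_fixed(gene_terms, observed, m=3, alpha=0.1, beta=0.25, p=0.2,
                       steps=20000, burn_in=1000, seed=1)
print(posterior)
```

On this toy input, term 1 shares two observed genes with term 0 but should receive a low marginal probability, because term 0 alone already explains the observations.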

MCMC algorithm for marginal probabilities inference with unknown parameters

The estimation of the parameters α, β and p can be easily integrated directly into the MCMC algorithm. The parameters now must be explicitly considered in the joint probability distribution:

$$P(p,T,H,\alpha,\beta,O) = P(p)\,P(T \mid p)\,P(H \mid T)\,P(\alpha)\,P(\beta)\,P(O \mid H,\alpha,\beta), \tag{9}$$

where $P(T \mid p)$ is given by Equation (2), $P(H \mid T)$ is given by Equation (3) and $P(O \mid H,\alpha,\beta)$ corresponds to $P(O \mid H)$ of the basic model. As p, α and β are now true random variables, we must define a prior distribution on them as well. Here, we have used uniform distributions to introduce as little bias as possible.

We seek a scheme to sample from the joint posterior distribution

$$P(p,T,\alpha,\beta \mid O) = \frac{P(p,T,\alpha,\beta,O)}{P(O)}.$$

In order to utilize the Metropolis–Hastings algorithm for this purpose, we are required to provide an efficient calculation for the numerator. This is straightforward, because the numerator factors to

$$P(p,T,\alpha,\beta,O) = P(p)\,P(T \mid p)\,P(\alpha)\,P(\beta)\,P(O \mid T,\alpha,\beta), \tag{10}$$

and moreover, $P(O \mid T,\alpha,\beta)$ can be determined using Equation (4).

In addition to term state transitions, we also need to take parameter transitions within the proposal density into account. We define the new proposal density as a mixture of the state transition density $Q_T$ and a parameter transition density $Q_\Theta$. We denote the current realization of the parameters by $\Theta^t = \{\alpha^t, \beta^t, p^t\}$ and by $\Theta^p \leftrightarrow_\Theta \Theta^t$ the relation of whether $\Theta^p$ can be constructed from $\Theta^t$. The fully specified proposal density is then

$$Q_s(T^p,\Theta^p \mid T^t,\Theta^t) = \begin{cases} Q_T(T^p \mid T^t)\,s & \text{if } T^p \leftrightarrow_T T^t \text{ and } \Theta^p = \Theta^t \\ Q_\Theta(\Theta^p \mid \Theta^t)\,(1-s) & \text{if } \Theta^p \leftrightarrow_\Theta \Theta^t \text{ and } T^p = T^t \\ 0 & \text{otherwise.} \end{cases}$$

The parameter $s \in (0,1)$ can be used to balance state transition proposals against parameter proposals. That is to say, depending on the outcome of a Bernoulli process with hyperparameter s, we either propose a new state transition or a new parameter setting. For the experiments described in this article, s was set to 0.5.

Many possibilities for the proposal density of the parameter transition $Q_\Theta$ and for the relation $\leftrightarrow_\Theta$ can be envisaged. We have considered transitions $\Theta^p \leftrightarrow_\Theta \Theta^t$ for which $\Theta^p$ differs from $\Theta^t$ in the realization of not more than a single variable.

In contrast with the configuration space of the terms’ activation state, the domain of these new variables is continuous. However, an internal study revealed that the algorithm is not overly sensitive to the exact parameter settings. Therefore, we can restrict the range of the variables to a set of discrete values. For the experiments described in this work, we used the restrictions $\alpha, \beta \in \{0.05k \mid 0 < k < 20\}$ and $p \in \{1/m, \ldots, 20/m\}$, where m is the number of terms.

Algorithm 1: A Metropolis–Hastings algorithm to estimate $P(T_1 = 1 \mid O)$.

Data: O, l (number of steps)
Result: $P(T_1 = 1 \mid O), \ldots, P(T_m = 1 \mid O)$

$T^t \leftarrow (0, \ldots, 0)$ (m times);
for $t \leftarrow 1$ to l do
    $T^p \sim Q_T(\cdot \mid T^t)$, i.e. choose a neighbor candidate by either
        • toggling a term
        • exchanging an active term with an inactive one
    $a \leftarrow \dfrac{P(O \mid T^p)\,P(T^p)\,N(T^t)}{P(O \mid T^t)\,P(T^t)\,N(T^p)}$
    $r \sim U(0,1)$
    if $r < a$ then
        $T^t \leftarrow T^p$
    end
end
return $\left( \dfrac{C(T_1)}{l}, \ldots, \dfrac{C(T_m)}{l} \right)$

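Finally, we can state the proposal density function for parameter transitions

$$Q_\Theta(\Theta^p \mid \Theta^t) = \begin{cases} \dfrac{1}{|A| + |B| + |P|} & \text{if } \Theta^p \leftrightarrow_\Theta \Theta^t \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$

in which A, B and P stand for the domain of the parameters α, β and p, respectively. Note that $Q_\Theta$ is symmetric, i.e. $Q_\Theta(\Theta^t \mid \Theta^p) = Q_\Theta(\Theta^p \mid \Theta^t)$.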
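One way to read the mixture proposal and Equation (11) is sketched below in Python. The function names and the dictionary representation of the parameters are hypothetical, and the way a single parameter is redrawn (uniformly over all |A| + |B| + |P| grid values, which may occasionally reproduce the current value) is an assumption consistent with the stated density, not a description of the authors' code.

```python
import random

def parameter_grids(m):
    """Discrete domains A, B and P as restricted in the text (m = number of terms)."""
    a_grid = [0.05 * k for k in range(1, 20)]   # alpha in {0.05k | 0 < k < 20}
    b_grid = [0.05 * k for k in range(1, 20)]   # beta on the same grid
    p_grid = [k / m for k in range(1, 21)]      # p in {1/m, ..., 20/m}; needs m > 20 so that p < 1
    return a_grid, b_grid, p_grid

def propose_parameters(theta, grids, rng):
    """Redraw one parameter; each of the |A|+|B|+|P| grid values is equally likely."""
    names = ("alpha", "beta", "p")
    flat = [(name, value) for name, grid in zip(names, grids) for value in grid]
    name, value = flat[rng.randrange(len(flat))]
    new = dict(theta)
    new[name] = value
    return new

def propose_mixture(terms, theta, grids, propose_terms, s, rng):
    """With probability s propose a term move, otherwise a parameter move."""
    if rng.random() < s:
        return propose_terms(terms, rng), theta
    return terms, propose_parameters(theta, grids, rng)
```

Because the parameter proposal is symmetric and the priors on α, β and p are uniform, a parameter move is accepted with probability min(1, ratio), where the ratio is that of $P(T \mid p)\,P(O \mid T,\alpha,\beta)$ evaluated at the proposed and at the current parameter values.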

Simulation and evaluation of the performance

The simulations were based on revision 1.846 (dated October 21, 2009) of the GO term definition file. We restricted the entire simulation study to genes of Drosophila melanogaster. Annotations for this species were taken from revision 1.157 (dated October 19, 2009) of the gene association file provided by FlyBase (20), using all annotations regardless of their evidence code. This results in a total of 12 484 genes that are annotated directly or by inheritance to 7078 GO terms.

Study sets were generated according to our model as follows. One value for the false positive rate α and one for the false negative rate β were set. A number (varying from one to five) of unrelated terms (i.e. pairs of terms related by parent–child relationships were avoided) were randomly picked to be in the on state. In the remainder of this section, we denote by $l_{ij}$ the state or label of term i within study set j, i.e. $l_{ij} = 1$ if term i is on, or $l_{ij} = 0$ otherwise.

Each single study set j is then filled with all genes that are annotated to the term $T_i$ for all $l_{ij} = 1$. Next, the noisy observations are simulated by removing each gene with a probability of β from the study set. Then, genes from the population not annotated to any of the active GO terms were added to the study set with a probability of α. The whole procedure was repeated 1500 times for each combination of α and β, providing 1500 different study sets of varying sizes for that combination.
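The noise step of this generation procedure can be sketched as follows. The function name and data layout are hypothetical, and the selection of unrelated active terms, which requires the ontology graph, is left out of this example.

```python
import random

def simulate_study_set(gene_terms, active_terms, alpha, beta, rng):
    """Return the indices of genes observed as on in one simulated study set.

    gene_terms[i] lists the terms annotating gene i (ancestors included).
    """
    active = set(active_terms)
    study_set = []
    for gene, terms in enumerate(gene_terms):
        truly_on = any(t in active for t in terms)
        if truly_on:
            keep = rng.random() >= beta   # a truly on gene is removed with probability beta
        else:
            keep = rng.random() < alpha   # a truly off gene is added with probability alpha
        if keep:
            study_set.append(gene)
    return study_set

# Hypothetical toy population of five genes and three terms.
gene_terms = [[0], [0, 1], [1], [2], [2]]
print(simulate_study_set(gene_terms, active_terms=[0], alpha=0.1, beta=0.4,
                         rng=random.Random(42)))
```

As noted in the text, α and β only fix the expected proportions of false positives and false negatives, so individual study sets drawn this way vary in size.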

All tested algorithms were then applied to the simulated study sets. Note that the study set generation procedure controls merely the expected values of the proportion of false positive and false negative genes for the study sets, whereas the actual proportion of each individual study set may differ. MGSA′ and GenGO′ (which use fixed values of the parameters α, β and p) were supplied with the values of α and β used for the simulation and p was set according to the number of GO terms that were set to on. The application of the algorithms results in prediction values (scores) for $l_{ij}$, denoted by $p_{ij}$. We remark that for posterior marginal probabilities higher values (rather than lower as with P-values) indicate stronger support for the state on.

Benchmarking of the methods was done by using standard measures for the evaluation of discrimination procedures. We made use of receiver operating characteristic (ROC) curves and precision/recall curves, pooling the results of all study sets with identical parameter combinations. In addition to the values of a ROC analysis, we calculated the k-truncated ROC value for each study set j via

$$\mathrm{ROC}_k(j) = \frac{1}{kP} \sum_{i=1}^{k} t_i,$$

in which $P = \sum_i l_{ij}$ is the total number of positives and $t_i$ represents the number of true positives above the i-th false positive (21,22). We reported the average over k-truncated ROC values of all study sets for k = 10.
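The k-truncated ROC value can be computed directly from ranked scores, as in the following sketch. The function name is hypothetical, ties are broken by input order, and the treatment of lists with fewer than k negatives (the missing $t_i$ are set to P) is an assumption that the text does not specify.

```python
def roc_k(scores, labels, k):
    """k-truncated ROC value for one study set; higher scores rank first."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    true_pos = 0     # true positives seen so far in the ranking
    false_pos = 0    # false positives seen so far
    total = 0        # running sum of t_i
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
            total += true_pos          # t_i: true positives above the i-th false positive
            if false_pos == k:
                break
    # Assumption: with fewer than k negatives, the remaining t_i equal P.
    total += (k - false_pos) * positives
    return total / (k * positives)

print(roc_k([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], k=2))
```

For methods that report P-values, the negated P-value can be passed as the score so that stronger support still ranks first.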

Analysis of yeast growth media

Raw tiling array data comparing yeast fermentative growth (YPD: Yeast extract Peptone Dextrose) and respiratory growth (YPE: Yeast extract Peptone Ethanol) (23) were processed to provide normalized intensity values for each probe in each hybridization. The expression level of each transcript in each growth condition was estimated by the midpoint of the shorth (shortest interval covering half of the values) of the probe intensities of the transcript across all arrays of the growth condition. Transcripts were called expressed if their expression level was above the threshold (24). Transcript expression levels of the two conditions ‘YPD’ and ‘YPE’ were normalized against each other using the vsn method (25) as differential expression at the transcript level appeared to still depend on average expression value. Next, transcripts were called differentially expressed if they showed at least 2-fold change between the two conditions. We then compared a study set of 510 differentially expressed genes to the population of 5308 genes (23) (Supplementary Table 1). We used GO annotations obtained from the Saccharomyces Genome Database (26) as of October 22, 2009 and restricted our analysis to the biological process ontology.
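The shorth midpoint used as the expression estimate can be illustrated with a short Python sketch. The function name is hypothetical and the convention for ‘half of the values’ (here the smallest count of at least n/2 points) is an assumption; published implementations differ slightly on this point and on tie-breaking, and this is not claimed to be the routine used in the analysis.

```python
import math

def shorth_midpoint(values):
    """Midpoint of the shortest interval that covers half of the values."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    h = math.ceil(n / 2)                 # points the interval must contain (assumed convention)
    best_lo, best_hi = xs[0], xs[h - 1]
    for i in range(1, n - h + 1):        # slide a window of h consecutive sorted values
        lo, hi = xs[i], xs[i + h - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return (best_lo + best_hi) / 2.0

print(shorth_midpoint([1.0, 2.0, 2.1, 2.2, 10.0, 20.0]))
```

Unlike the mean, this estimate is driven by the densest half of the probe intensities and is therefore little affected by a few outlying probes.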

RESULTS

Bayesian networks to model experimental observations

In order to summarize the meaning of a long list of genes by naming biologically meaningful categories or terms, we propose a knowledge-based system in the form of a Bayesian network. We model the state of the genes as a function of the activity of the associated terms. The true state of the genes is hidden and propagated to corresponding entities at the observation layer, by which we reflect the noisy nature of the data. The errors between the observation and the hidden states are assumed to be independent and to occur with a potentially unknown false positive (α) and false negative rate (β), identical for all genes. Furthermore, an additional parameter p represents the prior probability of a term being in the on state (Figure 1). The purpose of this model is to infer the marginal posterior probability of each term i being active, i.e. $P(T_i \mid O)$, given the observations of the experiment. See the ‘Methods’ section for a formal introduction of the model and the inference process.



In practice, the parameters α, β and p are not known. While the estimation is taken into account by the specification of the algorithm, we also run instances in which these parameters are fixed a priori to the true ones. This provides an upper bound on the performance attainable with estimated parameters. In the following, we refer to the general case as MGSA and to the case with known parameters as MGSA′.

Performance on simulated data

We simulated 1500 study sets in which the number of active terms varied from one to five (‘Methods’ section). The simulations were performed with 12 484 Drosophila genes that are annotated directly or indirectly via parent–child relationships to 7078 GO terms. We followed this approach for each combination of $\alpha \in \{0.1, 0.4\}$ and $\beta \in \{0.25, 0.4\}$, resulting in a total of 6000 simulated study sets. We then compared MGSA and MGSA′ against three single-term association procedures: the standard term-for-term (TfT) GO overrepresentation analysis by Fisher’s exact test (3), Parent–Child union (PCU) analysis (11) and topological weight (TopW) analysis (12), and against the other global model approach, GenGO (13). Similar to our approach, GenGO has two parameters that are intended to capture false positive and false negative responders and an additional parameter that penalizes superfluous terms. In the original implementation of GenGO (13), a heuristic procedure was used to search for the best values of these parameters. Unfortunately, the full GenGO software is not applicable for batched runs. We have implemented the algorithm denoted as GenGO′ in the simple case where the parameters are known. For the simulations described here, we follow the authors’ recommendation to set the penalty parameter to 3, while the remaining parameters were set to the optimal values. This provides an upper bound on the performance of the GenGO procedure with unknown parameters.
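For reference, the term-for-term baseline amounts to a one-sided Fisher’s exact test per term, i.e. an upper tail of the hypergeometric distribution. The sketch below uses only the Python standard library; the function and argument names are hypothetical, and multiple-testing correction is not included.

```python
from math import comb

def overrepresentation_pvalue(pop_size, term_in_pop, study_size, term_in_study):
    """P(X >= term_in_study) for X ~ Hypergeometric(pop_size, term_in_pop, study_size)."""
    upper = min(term_in_pop, study_size)
    total = comb(pop_size, study_size)
    # Sum the probabilities of drawing x annotated genes for all x at least as large as observed.
    tail = sum(comb(term_in_pop, x) * comb(pop_size - term_in_pop, study_size - x)
               for x in range(term_in_study, upper + 1))
    return tail / total

# Hypothetical example: 4 of 10 population genes carry the term, 2 of 3 study genes do.
print(overrepresentation_pvalue(10, 4, 3, 2))
```

Each term is tested in isolation here, which is why terms that share many genes tend to become significant together.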

GO analyses typically contain a very large number of terms. Therefore, an important issue is whether a GO analysis method inflates the number of terms reported as significant. The most critical measure is therefore the precision, i.e. the proportion of true positives among the true and false positives. Comparing the precision of the different methods as a function of the recall, which is the proportion of true positives among all positive terms, demonstrates the drastic improvement of global model approaches. Both global model methods, GenGO′ and MGSA, dominate all three single-term association approaches by a factor of at least 3 (5 for MGSA) in precision at 20% recall (Figure 2A, B) across all investigated parameter settings. For a false positive rate α of 0.1, the improvement reaches even 8- to 10-fold. Moreover, MGSA largely outperforms GenGO′ in all settings, for example, with a precision of ~95% versus ~80% for GenGO′ in the case of α = 0.1 and β = 0.4. Values of k-truncated ROC scores (‘Methods’ section and Table 1) confirm the ranking of these methods when focusing on stringent cutoffs. Moreover, these improvements of MGSA are seen at any cutoff. MGSA outperforms all other methods for the whole range of recall cutoffs and with all investigated parameter settings (Figure 2C, D).

[Figure 2, panels A–F. Panel titles: ‘Precision at Recall of 0.2’, ‘Precision/Recall’ (x-axis: Recall, y-axis: Precision) and ‘ROC’ (x-axis: False positive rate, y-axis: True positive rate), each for α = 0.1, β = 0.4 and for α = 0.4, β = 0.25. Methods are grouped as global model (GenGO′, MGSA′, MGSA) and single-term associations (TopW, PCU, TfT). Values given in the ROC legends: for α = 0.1, β = 0.4, GenGO′ 0.605, MGSA′ 0.957, MGSA 0.922, TopW 0.777, PCU 0.817, TfT 0.864; for α = 0.4, β = 0.25, GenGO′ 0.515, MGSA′ 0.913, MGSA 0.821, TopW 0.716, PCU 0.758, TfT 0.81.]

Figure 2. Benchmarking on simulated data set. Performance of the TfT, PCU, TopW, GenGO′, MGSA′ and MGSA algorithms on simulated data set with different settings of false positive (α) and false negative (β) rates. In each row, the leftmost panel shows the precision for a recall of 0.2 (A, B), the middle panel precision as a function of recall (C, D) and the rightmost panel the ROC curve (E, F).

Notably, the performance of GenGO′, which reports only a single maximum likelihood solution and discards any alternative solution, even if it is almost as likely, drops off much earlier than that of MGSA (Figure 2C, D). This behavior is more apparent in receiver operating characteristic curves (ROC curves, Figure 2E, F), which plot the true positive rate (or recall) as a function of the false positive rate (proportion of false positives among all negative terms). Indeed, away from the most stringent zone, GenGO′ appears as the least accurate of all tested methods. See also Supplementary Figures S1–S12 for results on other parameter combinations.
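The precision at a fixed recall, the measure reported in Figure 2A and B, can be computed from ranked scores as in this sketch. The function name is hypothetical, ties are broken by input order, and taking the first rank cutoff that reaches the target recall is an assumption about how the curve is read off.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first rank cutoff whose recall reaches target_recall."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / positives >= target_recall:
            return true_pos / rank       # true positives among all terms called so far
    return positives / len(ranked)       # only reached if target_recall exceeds 1

print(precision_at_recall([0.9, 0.8, 0.7, 0.6, 0.5], [0, 1, 1, 0, 1], 0.2))
```

In the benchmark, scores and labels are pooled over all study sets with the same parameter combination before such a curve is evaluated.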

Together these results on simulation confirm the drastic improvement of global model approaches and demonstrate that our marginal posterior method, MGSA, largely outperforms GenGO by showing an accurate behavior on the whole range of cutoffs.

Dealing with unknown values of the parameters α, β and p required a significant extension of our basic algorithm (‘Methods’ section). The simulation data allowed us to investigate the ability of the full MGSA algorithm to cope with unknown parameter values. We ran the basic version of the algorithm, MGSA′, in which the parameters are known and fixed a priori. The performance of MGSA′ is displayed in the precision–recall and the ROC curves (Figure 2) and represents what MGSA could reach if the parameters were known. As Figure 2 shows, the performance of the algorithm is not drastically affected by having to estimate the parameters, showing that the full MGSA algorithm performs reasonably well when dealing with unknown parameter values.

Analysis of expression data from fermentative and respiratory growth in yeast

We applied our method to 510 yeast genes found to be differentially expressed in fermentative growth compared with respiratory growth (‘Methods’ section). Respiration and fermentation are two well-studied growth modes of yeast, thus facilitating the interpretation of the results. We ran TfT, PCU and MGSA on the biological process ontology of GO. Using Benjamini–Hochberg correction for multiple testing and a false discovery rate cutoff of 0.1, the term-for-term method returns 193 significantly enriched terms, many of which are highly related (Supplementary Table 1). We investigated how the results of MGSA fluctuate by running 20 independent Markov chains, each of length $10^7$, using a cutoff of 0.5 on the posterior probability to call a term on, i.e. a level at which a term is estimated to be more likely to be on than to be off. MGSA reports only seven terms with a marginal posterior probability >0.5 (in at least one of the 20 runs). Out of these seven terms, six showed a posterior probability >0.5 consistently across the 20 chains (Figure 3 and Supplementary Table 1), while the seventh one, monocarboxylic acid metabolic process, had estimated posterior probability between 0.410 and 0.502. Hence, results for the most likely terms were reproducible between runs. We checked the robustness of these results against variations in the study set by creating 2000 random subsamples of the study set containing 90% of the original genes. The terms identified by the original analysis were consistently identified in the subsamples (Supplementary Figure 13).

Among the seven terms, oxidation reduction summarizes the main biological process that distinguishes growth in these two different media, namely the use of oxidative phosphorylation during respiration to regenerate ATP. The other terms, such as carbohydrate transport or monocarboxylic acid metabolic process, capture processes that are linked to the change of carbon source but not directly involved in the oxidation reduction reactions. Hence, MGSA provides a high-level, summarized view of the core biological process, respiration, avoiding redundant results while still keeping the necessary level of granularity in other branches of the ontology.

The term cell death illustrates very well the difference between single-term association approaches and global model approaches. Both tested enrichment-based approaches, TfT and PCU, report cell death as an enriched term whereas MGSA does not. It happens that mitochondria are implicated both in cell death and in respiration (27). The differentially expressed genes annotated to cell death encode mitochondrial proteins and are also involved in respiration. Hence, it is correct to report cell death as enriched for differentially expressed genes. However, cells are not dying in either of these two conditions. The enrichment is due to the sharing of genes with respiration, a process which is genuinely differentially activated. In this study set, 113 genes are annotated to oxidation reduction including 25 out of 28 genes annotated to cell death. MGSA, which infers the terms that are on and not simply enriched, does not report cell death. One should also note that cell death is not a type of respiratory pathway or vice versa. Methods such as TfT that examine the statistical significance of each term separately cannot compensate for correlations between terms due to gene sharing. Although methods such as PCU and TopW can compensate for some kinds of statistical correlations that arise because of the inheritance of annotations from descendant nodes in the GO graph (11,12), they fail in

Table 1. ROC10 Analysis

� � GenGO0 MGSA0 MGSA TfT PCU TopW

0.1 0.4 0.35 0.44 0.41 0.31 0.24 0.260.4 0.25 0.22 0.28 0.25 0.22 0.17 0.15

k-truncated ROC curves were generated for the simulated data shownin Figure 2 for k ¼ 10. The ROC10 score is the area under the ROCcurve up to the tenth false positive. k-truncated ROC scores range from0 to 1, with 1 corresponding to the most sensitive and selective result.


Dow



ecember 2021

situations such as the one described here because,oxidation reduction and cell death share some annotatedgenes but are not directly connected to one another bythe graph structure of GO.
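The ROC10 score reported in Table 1 can be illustrated with a minimal sketch. It assumes that each term carries a score (higher meaning more likely active) and a ground-truth label, and it follows the usual definition of a k-truncated ROC score as the normalized count of true positives ranked ahead of each of the first k false positives. The function name `roc_k` and the handling of rankings with fewer than k negatives are illustrative choices, not taken from the benchmark implementation.

```python
def roc_k(scores, labels, k=10):
    """Area under the ROC curve up to the k-th false positive, scaled to [0, 1].

    scores: one number per item, higher = ranked earlier.
    labels: 1/True for truly active items, 0/False otherwise.
    """
    total_pos = sum(1 for is_pos in labels if is_pos)
    if total_pos == 0:
        raise ValueError("at least one positive item is required")

    # Rank items from highest to lowest score (ties keep their input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    true_pos = 0    # positives seen so far in the ranking
    false_pos = 0   # negatives seen so far in the ranking
    area = 0        # sum of true-positive counts at each false positive
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos
            if false_pos == k:
                break

    # Assumption: if the ranking holds fewer than k negatives, the missing
    # false positives are treated as ranked below every positive.
    area += (k - false_pos) * true_pos
    return area / (k * total_pos)


# Two positives and two negatives, interleaved; truncated at k = 2.
print(roc_k([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], k=2))  # 0.75
```

For MGSA the score would be the marginal posterior probability of a term, whereas for TfT, PCU and TopW one would rank by decreasing significance, for instance by using the negated P-value as the score.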

DISCUSSION

Of trees and forests

Data-driven molecular biology experiments can be used to identify a list of genes that respond in the context of a given experiment. With the advent of technologies such as microarray hybridization and next-generation sequencing, which enable biologists to generate data reflecting the response profiles of thousands of genes or proteins, gene category analysis has become ever more important as a means of understanding the salient features of such experiments and for generating new hypotheses. By using the knowledge as it is provided by GO, KEGG or other similar systems of categorization, these analyses have become a de facto standard for molecular biological research. Almost all previous methods are based on algorithms that analyze each term in isolation. For each term under consideration, the methods consider whether the study set is significantly enriched in genes annotated to the term compared with what one would expect based on the frequency of annotations to the term in the entire population of genes or using other related statistical models (28).
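As an illustration of such a single-term procedure, the following sketch tests each term in isolation with a one-sided hypergeometric (Fisher) test and then applies the Benjamini–Hochberg adjustment. The function names, the data layout (a dictionary from term to gene set) and the toy genes are hypothetical and are not taken from the Ontologizer code.

```python
from math import comb


def hypergeom_upper_tail(pop_size, pop_hits, study_size, study_hits):
    """P(X >= study_hits) when study_size genes are drawn without replacement
    from pop_size genes, pop_hits of which are annotated to the term."""
    max_hits = min(pop_hits, study_size)
    tail = sum(
        comb(pop_hits, x) * comb(pop_size - pop_hits, study_size - x)
        for x in range(study_hits, max_hits + 1)
    )
    return tail / comb(pop_size, study_size)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def term_for_term(annotations, population, study):
    """Test every term separately; return {term: (P-value, adjusted P-value)}."""
    terms = sorted(annotations)
    pvalues = [
        hypergeom_upper_tail(
            len(population),
            len(annotations[t] & population),
            len(study),
            len(annotations[t] & study),
        )
        for t in terms
    ]
    return dict(zip(terms, zip(pvalues, benjamini_hochberg(pvalues))))


# Hypothetical toy data: term B overlaps term A and inherits part of its signal.
population = {f"g{i}" for i in range(20)}
annotations = {"A": {"g0", "g1", "g2", "g3", "g4"}, "B": {"g3", "g4", "g5", "g6"}}
study = {"g0", "g1", "g2", "g3", "g4"}
for term, (p, q) in term_for_term(annotations, population, study).items():
    print(term, p, q)
```

Because each term is tested on its own, a term that merely shares genes with a truly responding term can still receive a small P-value, which is the behavior that motivates the model-based approach.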

We suggest that single-term association methods that determine the significance of each term in isolation essentially do ‘not see the forest for the trees’, by which we mean that they tend to return many related terms which are statistically significant if considered individually, but they are not designed to return a set of core terms that together best explain the set of genes in the study set. Although some methods have been developed that partially compensate for statistical dependencies in GO (11,12), the work of Lu and colleagues (13) addressed the problem by modeling the gene responses using all categories together.

Modeling requires formulating a generative process of the data. We, and Lu and colleagues (13), considered the categories as the potential cause of the gene responses. Fitting the model then enables the causal categories (according to the model) to be distinguished from the categories merely associated with gene response. Although one cannot conclude that the identified categories are causal in reality (this is only a model and one only has observational data), this feature of model-fitting explains why it provides a better answer to the question of what is going on than testing for associations on a term-for-term basis.
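The generative view can be made concrete with a short simulation sketch. It assumes, in line with the description of the model elsewhere in this article, that each term is active with prior probability p, that a gene is truly on if it is annotated to at least one active term, and that α and β act as false-positive and false-negative rates on the observed gene states. All function names and the toy annotations are hypothetical.

```python
import random


def sample_active_terms(terms, p, rng):
    """Switch each term on independently with prior probability p."""
    return {t for t in terms if rng.random() < p}


def simulate_observations(annotations, genes, active_terms, alpha, beta, rng):
    """Sample observed gene states given the set of active terms.

    A gene is truly on if it is annotated to at least one active term.
    alpha: probability that a truly off gene is observed on (false positive).
    beta:  probability that a truly on gene is observed off (false negative).
    """
    hidden_on = set()
    for term in active_terms:
        hidden_on |= annotations[term]

    observed = {}
    for gene in genes:
        if gene in hidden_on:
            observed[gene] = rng.random() >= beta   # missed with probability beta
        else:
            observed[gene] = rng.random() < alpha   # spurious with probability alpha
    return observed


# Hypothetical toy annotations with overlapping terms.
annotations = {"A": {"g1", "g2", "g3"}, "B": {"g3", "g4"}, "C": {"g5"}}
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
rng = random.Random(1)
active = sample_active_terms(annotations, p=0.3, rng=rng)
study = simulate_observations(annotations, genes, active, alpha=0.1, beta=0.25, rng=rng)
print(sorted(active), study)
```

Inference runs this process in reverse: given the observed gene states, it asks which term activations are probable, so a term whose genes are already explained by another active term gains little support.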

Searching for an optimal set of terms that together explain a biological observation is a more difficult problem than examining each term for enrichment one at a time. In particular, the model used to find term sets must specify how the terms interact with one another. This imposes assumptions on the model that must be kept in mind when the results are interpreted. For instance, the model presented in this work assumes that activation of a single term suffices to activate genes, and does not require that a certain minimum number of genes are annotated by the term set. The Bayesian framework we present can easily be adapted for different kinds of models encoding different biological assumptions by choosing different priors or distributions. This will be the subject of future research.

Figure 3. Application to a respiratory versus fermentative growth expression dataset in yeast. (A) Ranked list of the 192 overrepresented terms using a term-for-term Fisher’s test with Benjamini–Hochberg correction for multiple testing. Many of the top terms are redundant and relate to similar functions. The term cell death (highlighted in blue) is a spurious association (see text). (B) Ranked list of the top 10 terms identified by a single run of MGSA (six of them with a posterior >0.5 in green). (C) Error bars (95% confidence intervals) obtained with 20 runs of MGSA. Each of the seven terms was identified with a posterior >0.5 in at least one of the 20 runs.

A Bayesian framework for inference

In contrast to the method of Lu and coworkers (13), we embed the model into a Bayesian network, to which we apply standard methods of probabilistic inference. This not only leads to an intuitive derivation of the score, but also increases the versatility of the framework. That is, although we have demonstrated our method with simple classes of LPDs for the nodes, one can easily use more involved distributions. Moreover, we show that by augmenting the Bayesian network, we get a streamlined approach that includes both the inference of the states and of the parameters. This in principle also enables the inclusion of prior knowledge, be it parameters that are known or estimated in the form of P(α) and P(β), or a specific term configuration in the form of P(T) or P(T | p).

Instead of scoring terms by finding the maximum a posteriori configuration as in the GenGO method (13), we use marginal posterior probabilities. The maximum a posteriori estimate provides a single combination of active terms and is not informative about alternative solutions. If several solutions show near-maximal likelihood, then it is implausible that the single one with the largest likelihood is always the right solution. In contrast, the marginal posterior used in MGSA assigns each term a natural weight that reflects a measure of certainty of its involvement in the process. Importantly, using marginal posterior probabilities increases the robustness of the procedure and lowers its sensitivity to the high variance related to the multimodality of the problem, i.e. the existence of local maxima of the likelihood function. As we demonstrated using simulations, MGSA is indeed more robust than GenGO.
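The difference between the two scores can be seen on a toy problem that is small enough to enumerate exhaustively. The sketch below assumes fixed, known values of α, β and p, interpreted as false-positive rate, false-negative rate and prior activation probability; the terms, genes and function names are hypothetical. Terms A and B explain the observed genes equally well, so the maximum a posteriori configuration picks only one of them, while the marginals show that the two are about equally plausible.

```python
from itertools import product
from math import exp, log


def log_joint(active, annotations, genes, observed, alpha, beta, p):
    """log P(observed | term states) + log P(term states) for one configuration."""
    hidden_on = set()
    for term, is_on in active.items():
        if is_on:
            hidden_on |= annotations[term]
    total = 0.0
    for gene in genes:
        if gene in hidden_on:
            total += log(1 - beta) if gene in observed else log(beta)
        else:
            total += log(alpha) if gene in observed else log(1 - alpha)
    n_on = sum(active.values())
    return total + n_on * log(p) + (len(active) - n_on) * log(1 - p)


def exact_posterior(annotations, genes, observed, alpha, beta, p):
    """Enumerate all 2^m term configurations (only feasible for tiny m).

    Returns the maximum a posteriori configuration and the marginal
    posterior probability of each term being on.
    """
    terms = sorted(annotations)
    configs, weights = [], []
    for states in product([False, True], repeat=len(terms)):
        active = dict(zip(terms, states))
        configs.append(active)
        weights.append(exp(log_joint(active, annotations, genes, observed,
                                     alpha, beta, p)))
    z = sum(weights)
    marginals = {
        t: sum(w for c, w in zip(configs, weights) if c[t]) / z for t in terms
    }
    best = max(range(len(weights)), key=weights.__getitem__)
    return configs[best], marginals


# Hypothetical toy data: A and B are equally good explanations, C is unrelated.
annotations = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3", "g5"},
    "C": {"g6", "g7"},
}
genes = [f"g{i}" for i in range(1, 11)]
observed = {"g1", "g2", "g3"}
map_config, marginals = exact_posterior(annotations, genes, observed, 0.1, 0.25, 0.1)
print(map_config)
print(marginals)
```

In this example the single best configuration hides the ambiguity between A and B, whereas the marginals make it visible. Exhaustive enumeration is used here only because the example has three terms; realistic ontologies require sampling.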

The Java implementation that we provide in the Ontologizer estimates the marginal posteriors with an MCMC algorithm and is fast enough to perform 10^6 steps in <3 s on a standard 2.5 GHz PC. Since runs of MCMC are not guaranteed to converge in any a priori defined number of steps, we suggest that users repeat the analysis in order to see how the reported marginal probabilities of the top terms fluctuate. If fluctuations are too large, the number of MCMC steps should be increased.
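The suggested check, repeating the analysis and inspecting how the marginals fluctuate, can be sketched as follows. This is a deliberately simplified Metropolis sampler that toggles one randomly chosen term per step and keeps α, β and p fixed; it is not the proposal scheme or the parameter handling of the Ontologizer implementation, and all names and toy data are hypothetical.

```python
import random
from math import exp, log


def log_joint(active, annotations, genes, observed, alpha, beta, p):
    """log P(observed | active terms) + log P(active terms)."""
    hidden_on = set()
    for term in active:
        hidden_on |= annotations[term]
    total = 0.0
    for gene in genes:
        if gene in hidden_on:
            total += log(1 - beta) if gene in observed else log(beta)
        else:
            total += log(alpha) if gene in observed else log(1 - alpha)
    n_off = len(annotations) - len(active)
    return total + len(active) * log(p) + n_off * log(1 - p)


def run_chain(annotations, genes, observed, alpha, beta, p, steps, burn_in, seed):
    """One Metropolis chain; returns the estimated marginal posterior per term."""
    rng = random.Random(seed)
    terms = sorted(annotations)
    active = set()
    current = log_joint(active, annotations, genes, observed, alpha, beta, p)
    counts = {t: 0 for t in terms}
    for step in range(steps):
        proposal = active ^ {rng.choice(terms)}   # toggle a single term
        candidate = log_joint(proposal, annotations, genes, observed, alpha, beta, p)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if candidate >= current or rng.random() < exp(candidate - current):
            active, current = proposal, candidate
        if step >= burn_in:
            for t in active:
                counts[t] += 1
    kept = steps - burn_in
    return {t: counts[t] / kept for t in terms}


def run_chains(annotations, genes, observed, alpha, beta, p,
               n_chains=5, steps=5000, burn_in=1000):
    """Run independent chains; return {term: (lowest, highest) marginal}."""
    results = [
        run_chain(annotations, genes, observed, alpha, beta, p, steps, burn_in, seed)
        for seed in range(n_chains)
    ]
    return {
        t: (min(r[t] for r in results), max(r[t] for r in results))
        for t in sorted(annotations)
    }


# Hypothetical toy data: term A explains all observed genes, term C none.
annotations = {"A": {"g1", "g2", "g3", "g4"}, "C": {"g6", "g7"}}
genes = [f"g{i}" for i in range(1, 11)]
observed = {"g1", "g2", "g3", "g4"}
print(run_chains(annotations, genes, observed, alpha=0.1, beta=0.25, p=0.1))
```

A wide interval between the lowest and highest estimate for a term indicates that the chains have not run long enough, in which case the number of steps should be increased as recommended above.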

Using a Bayesian approach that models the data with all categories simultaneously, rather than using hypothesis testing on each category, avoids the issue of multiple testing. Moreover, the interpretation of the score, which is simply the probability that a category is active, might appear more natural than a P-value. Finally, one should note that the ranking of the scores is reversed relative to the values given by hypothesis-based procedures, i.e. high marginals give high support in the Bayesian setting, while high confidence in the hypothesis-based approach is indicated by low P-values.

CONCLUSION

We have addressed the question of gene category analysis using a model-based approach, MGSA. In this Bayesian model, the genes responding to the experiment are assumed to belong to a small number of ‘active’ categories. Therefore, to answer the question of what is going on in an experiment, MGSA infers the ‘active’ categories, among all considered categories, given the actual gene state observations. We have shown that under the assumptions of our model our approach is better at identifying the causal sets than other procedures. We suggest that considering the forest instead of the trees is an advantageous strategy for gene category analysis, and that global model procedures such as the one presented in this article may be better able to describe the biological meaning of high-throughput data sets than are procedures that examine associations of categories one at a time. We note that the Bayesian network analyzed in this article is but one of many potential network structures that are made possible with our framework.

Our methods have been integrated into the Ontologizer project, an easy-to-use Java Webstart application for performing analysis of overrepresentation. The Ontologizer as well as the implementation of the described benchmark procedure have been released under the terms of the modified BSD licence. The application is available from http://compbio.charite.de/index.php/ontologizer2.html. The source code is available from http://sourceforge.net/projects/ontologizer/.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We are grateful to Emilie Fritsch for insightful discussions about the yeast gene expression dataset analysis and to Simon Anders for critical reading of the manuscript.

FUNDING

This work was supported by grants from the Deutsche Forschungsgemeinschaft (DFG SFB 760 and RO 2005/4-1). Additional support was provided by the Lab of Lars Steinmetz. Funding for open access charge: Deutsche Forschungsgemeinschaft.

Conflict of interest statement. None declared.



REFERENCES

1. The Gene Ontology Consortium. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.

2. Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.

3. Rhee,S.Y., Wood,V., Dolinski,K. and Draghici,S. (2008) Use and misuse of the Gene Ontology annotations. Nat. Rev. Genet., 9, 509–515.

4. Jiang,Z. and Gentleman,R. (2007) Extensions to gene set enrichment. Bioinformatics, 23, 306–313.

5. Mootha,V.K., Lindgren,C.M., Eriksson,K.-F., Subramanian,A., Sihag,S., Lehar,J., Puigserver,P., Carlsson,E., Ridderstråle,M., Laurila,E. et al. (2003) PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273.

6. Nam,D. and Kim,S.-Y. (2008) Gene-set approach for expression pattern analysis. Brief. Bioinform., 9, 189–197.

7. Oron,A.P., Jiang,Z. and Gentleman,R. (2008) Gene set enrichment analysis using linear models and diagnostics. Bioinformatics, 24, 2586–2591.

8. Sartor,M.A., Leikauf,G.D. and Medvedovic,M. (2009) LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25, 211–217.

9. Newton,M.A., Quintana,F.A., den Boon,J.A., Sengupta,S. and Ahlquist,P. (2007) Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann. Appl. Stat., 1, 85–106.

10. Vencio,R.Z.N., Koide,T., Gomes,S.L. and Pereira,C.A.B. (2006) BayGO: Bayesian analysis of ontology term enrichment in microarray data. BMC Bioinformatics, 7, 86.

11. Grossmann,S., Bauer,S., Robinson,P.N. and Vingron,M. (2007) Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics, 23, 3024–3031.

12. Alexa,A., Rahnenfuhrer,J. and Lengauer,T. (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22, 1600–1607.

13. Lu,Y., Rosenfeld,R., Simon,I., Nau,G.J. and Bar-Joseph,Z. (2008) A probabilistic generative model for GO enrichment analysis. Nucleic Acids Res., 36, e109.

14. Bauer,S., Grossmann,S., Vingron,M. and Robinson,P.N. (2008) Ontologizer 2.0 – a multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics, 24, 1650–1651.

15. Barrell,D., Dimmer,E., Huntley,R.P., Binns,D., O’Donovan,C. and Apweiler,R. (2009) The GOA database in 2009 – an integrated Gene Ontology Annotation resource. Nucleic Acids Res., 37, D396–D403.

16. Subramanian,A., Tamayo,P., Mootha,V.K., Mukherjee,S., Ebert,B.L., Gillette,M.A., Paulovich,A., Pomeroy,S.L., Golub,T.R., Lander,E.S. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102, 15545–15550.

17. Andrieu,C., De Freitas,N., Doucet,A. and Jordan,M.I. (2003) An introduction to MCMC for machine learning. Mach. Learn., 50, 5–43.

18. Diaconis,P. (2009) The Markov chain Monte Carlo revolution. Bull. Am. Math. Soc., 46, 179–205.

19. Diaconis,P. and Saloff-Coste,L. (1995) What do we know about the Metropolis algorithm? STOC’95: Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing. ACM, pp. 112–129.

20. Tweedie,S., Ashburner,M., Falls,K., Leyland,P., Mcquilton,P., Marygold,S., Millburn,G., Osumi-Sutherland,D., Schroeder,A., Seal,R. et al. (2009) FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res., 37(Suppl. 1), D555–D559.

21. Gribskov,M. and Robinson,N.L. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20, 25–33.

22. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.

23. Xu,Z., Wei,W., Gagneur,J., Perocchi,F., Clauder-Munster,S., Camblong,J., Guffanti,E., Stutz,F., Huber,W. and Steinmetz,L.M. (2009) Bidirectional promoters generate pervasive transcription in yeast. Nature, 457, 1033–1037.

24. David,L., Huber,W., Granovskaia,M., Toedling,J., Palm,C.J., Bofkin,L., Jones,T., Davis,R.W. and Steinmetz,L.M. (2006) A high-resolution map of transcription in the yeast genome. Proc. Natl Acad. Sci. USA, 103, 5320–5325.

25. Huber,W., von Heydebreck,A., Sultmann,H., Poustka,A. and Vingron,M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18(Suppl. 1), S96–S104.

26. Hong,E.L., Balakrishnan,R., Dong,Q., Christie,K.R., Park,J., Binkley,G., Costanzo,M.C., Dwight,S.S., Engel,S.R., Fisk,D.G. et al. (2008) Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res., 36, D577.

27. Green,D.R. and Kroemer,G. (2004) The pathophysiology of mitochondrial cell death. Science, 305, 626–629.

28. Goeman,J.J. and Buhlmann,P.P. (2007) Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics, 23, 980–987.

