



THE CHALLENGES OF SYSTEMS BIOLOGY

Lessons from the DREAM2 Challenges

A Community Effort to Assess Biological Network Inference

Gustavo Stolovitzky (a,b), Robert J. Prill (a), and Andrea Califano (b)

(a) IBM Computational Biology Center, Yorktown Heights, New York, USA
(b) Department of Biomedical Informatics, Columbia University, New York, New York, USA

Regardless of how creative, innovative, and elegant our computational methods, the ultimate proof of an algorithm's worth is the experimentally validated quality of its predictions. Unfortunately, this truism is hard to reduce to practice. Usually, modelers produce hundreds to hundreds of thousands of predictions, most (if not all) of which go untested. In a best-case scenario, a small subsample of predictions (three to ten usually) is experimentally validated, as a quality control step to attest to the global soundness of the full set of predictions. However, whether this small set is even representative of the global algorithm's performance is a question usually left unaddressed. Thus, a clear understanding of the strengths and weaknesses of an algorithm most often remains elusive, especially to the experimental biologists who must decide which tool to use to address a specific problem. In this chapter, we describe the first systematic set of challenges posed to the systems biology community in the framework of the DREAM (Dialogue for Reverse Engineering Assessments and Methods) project. These tests, which came to be known as the DREAM2 challenges, consist of data generously donated by participants to the DREAM project and curated in such a way as to become problems of network reconstruction whose solutions, the actual networks behind the data, were withheld from the participants. The explanation of the resulting five challenges, a global comparison of the submissions, and a discussion of the best performing strategies are the main topics discussed.

Key words: systems biology; network theory; pathway inference; reverse engineering; mathematical modeling; assessment methods; machine learning

You may say I’m a DREAMer, but I’m not the only one . . .

John Lennon

Introduction

The Dialogue for Reverse Engineering Assessments and Methods (DREAM) project had its formal kickoff in the spring of 2006, when a group of systems biologists, from both computational and experimental extraction, convened

Address for correspondence: Gustavo Stolovitzky, IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598. Voice: 914-945-1292; fax: 914-945-4217. [email protected]

at the New York Academy of Sciences in Manhattan. What distinguished these scientists was a common passion for understanding biology at a genome-wide level and a common frustration with the increasingly artificial boundaries between the experimental and the computational sciences in modern biological research. In a very insightful meeting1 the group discussed the need for, and timeliness of, an effort to “take the pulse” of how well the theoretical reverse-engineering community was objectively doing in reconstructing protein–protein, gene regulatory, signaling, and metabolic networks.2,3 The consensus was that a full-fledged effort to harness and understand the limitations and strengths of the plethora of biological

The Challenges of Systems Biology: Ann. N.Y. Acad. Sci. 1158: 159–195 (2009). doi: 10.1111/j.1749-6632.2009.04497.x © 2009 New York Academy of Sciences.



FIGURE 1. The number of publications retrieved from a PubMed search of “Pathway Inference” or “Reverse Engineering.” On average, this number has been doubling every two years since about 1995. (Color is shown in online version.)

reverse-engineering methods was going to be a key prerequisite to their increasing value to biology. Indeed, at that time, the field of reverse engineering biological networks was beginning to experience considerable expansion, which generated much confusion about which methods were truly valuable from a practical perspective. Evidence of this trend is shown in Figure 1, where we report the number of publications retrieved from a PubMed search of “Pathway Inference” or “Reverse Engineering.” Figure 1 shows what appears to be an exponential growth, in which citations to these key words have been roughly doubling every two years for the last decade or so.

While such growth in the number of reverse engineering–related publications was fueled by very innovative and elegant computational methods for network reconstruction—arising from physics, computer science, mathematics, and engineering—the group shared the feeling that, ultimately, an algorithm's worth was to be found in the experimentally validated quality of its predictions. A key problem is

that computational methods can, in the blink of an eye, generate large numbers of predictions, from a few hundred to hundreds of thousands, most (if not all) of which usually go untested. Even worse, and this would be a best case scenario, a very small subsample of predictions—usually three or more, but rarely more than ten—would be validated using sound experimental assays and then presented as valuable criteria for the soundness of the entire set of predictions. Thus, a clear characterization of the relative strengths and weaknesses of the algorithms on an objective basis was usually missing. It should be noted that the same could be said for high-throughput (and even low-throughput) experimental approaches, whose false positive and, equally importantly, false negative rates are rarely considered a requisite for publication. This generated, for instance, more than a few puzzled looks when the first experimentally generated, genome-wide interactomes in yeast4–6 showed minimal overlap.

At the spring 2006 meeting, it was agreed that while the obstacles for the creation of a


methodology of reverse engineering assessment would be formidable, the value of such efforts would also be considerable and thus justified. The difficulties in creating well-posed challenges for network reconstruction are multifold, including what type(s) of experimental data to provide for the reconstruction, how these data should be generated, how to ensure appropriate characterization of the ground truth (i.e., the molecular interaction model) supported by these data, and how to objectively evaluate the performance of the challenge participants, among many others. A discussion of these topics was presented in a previous work,7 and it is unnecessary to further discuss these considerations here. It is sufficient to say that the outcome of the thinking and discussions of the DREAM community8 culminated in the first set of challenges that was presented to the community in mid-2007: the DREAM2 challenges.

Similar to CASP, the well-known Critical Assessment of Structure Predictions community-wide experiment,9 the DREAM2 Challenges were based on donated data sets. The identity of the data donors was kept anonymous, and some information pertaining to the data was masked, withheld from public disclosure, or otherwise obfuscated. Depending on the particular challenge, the objective of the DREAM2 challenges was to infer an underlying molecular interaction network, including transcriptional and post-translational interactions, as well as mixed biological circuits of various kinds. In this chapter we describe the five challenges of DREAM2, the community response to these challenges (the submitted reverse engineered networks), and the comparative performance of the different challenge participant groups. We conclude with global lessons extracted from this community-wide exercise.

The DREAM2 Challenges

The BCL6 Transcriptional-Target Challenge

BCL6 is a transcription factor that plays a key role in both normal and pathological B

FIGURE 2. Venn diagram illustrating the selection of the BCL6 targets and decoy genes for Challenge 1. We took as targets the 53 genes resulting from the intersection of genes identified through ChIP-on-chip as bound to BCL6 (“ChIP-on-chip Positives” set) and genes that responded to BCL6 perturbations (“Functionally validated” set), but that were not identified in the Polo et al. ChIP-on-chip experiment (“Polo et al.” set). One hundred forty-seven decoy genes were randomly chosen among the genes that were not in any of these three data sets. (Color is shown in online version.)

cell physiology.10 Unpublished data on BCL6 genomic binding sites formed the basis of a challenge in which participants attempted to discriminate functionally validated BCL6 transcriptional targets from “decoy” targets for which evidence indicates no physical and functional control by BCL6.

The Venn diagram of Figure 2 illustrates the procedure for selecting the BCL6 transcriptional and decoy target genes. We took advantage of an unpublished ChIP-on-chip analysis dataset on centroblast B cells (provided by the laboratory of Riccardo Dalla Favera and Andrea Califano at Columbia University), which was used to identify a set of 4,616 genes to whose promoter regions BCL6 appears to bind with a False Discovery Rate less than 10−4, based on the recently published CSA algorithm.11 A second data set corresponded to a panel of Burkitt Lymphoma lines (Ramos) whose gene expression levels were profiled under a series of molecular perturbations that lead to loss of endogenous BCL6 (e.g., CD40 receptor activation), with and without expression of exogenous BCL6.12 This analysis, denoted as the “Functionally Validated” set in Figure 2, identified 102 functional BCL6 gene targets.


Intersection of these two datasets provided a set of 56 genes whose promoter regions are bound by BCL6 and that respond to BCL6 perturbations. An independently published dataset of ChIP-on-chip data measured in diffuse large B cell lymphomas13 identified a cohort of 437 genes whose promoters bound to BCL6. Only 3 of these 437 genes were also part of the 56 genes, leaving only 53 previously unknown genes that were identified as BCL6 targets with evidence of BCL6 function. An additional set of 147 decoy genes were randomly selected among the genes that were at “the bottom” of the ChIP-on-Chip statistical significance (P > 0.5) and that were not identified as candidate BCL6 targets either by the functional analysis or by the previously published ChIP-on-Chip experiment.13
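As a quick check of this bookkeeping, the following is a minimal sketch of the target/decoy selection logic. The gene identifiers are hypothetical placeholders; only the set sizes (4,616, 102, 437, 56, 3, 53, 147, and 200) come from the text.

    # Minimal sketch of the Challenge 1 target/decoy bookkeeping (Python).
    # Gene identifiers are hypothetical; only the set sizes come from the text.
    chip_positives = {f"g{i}" for i in range(56)} | {f"g{i}" for i in range(200, 4760)}  # 4,616 genes
    functional = {f"g{i}" for i in range(102)}                                           # 102 responders
    polo_et_al = {"g53", "g54", "g55"} | {f"g{i}" for i in range(5000, 5434)}            # 437 published targets

    bound_and_responsive = chip_positives & functional   # 56 genes bound by and responsive to BCL6
    targets = bound_and_responsive - polo_et_al          # 53 previously unreported targets
    decoys = {f"g{i}" for i in range(10000, 10147)}      # 147 genes outside all three sets

    challenge_gene_list = sorted(targets | decoys)       # the 200 genes given to participants
    print(len(bound_and_responsive), len(targets), len(challenge_gene_list))  # 56 53 200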

Participants were provided the set of 200 genes and asked to identify the BCL6 targets from the decoys. The information of how many of the 200 genes were actual BCL6 target genes was withheld from the participants. A panel of 336 microarray gene expression profiles for a wide variety of normal, tumor related, and experimentally manipulated human B-cell populations previously published12 was also provided to the participants. This panel contained the measurements originally used to identify the 102 genes that responded to the BCL6 perturbations, but this information was not made explicit to the participants. ChIP-on-chip assays were not disclosed to the participants. While the independent ChIP-on-chip data of Polo et al.13 was publicly available, only 3 of the 56 targets were included in that set. Thus it was assessed that the challenge would not be biased by availability of these results. Participants could use any information in the public domain, including DNA sequence and gene ontology annotations. Submissions were requested in a standardized format in which the list of 200 genes had to be ordered according to the confidence that a gene is a BCL6 target. In other words, a perfect prediction would consist of a submission in which BCL6 targets were to appear at the top of the list and decoys at the bottom of the list.

The Protein–Protein Subnetwork Challenge

The yeast two-hybrid (Y2H) system is a high-throughput method for discovering interactions between pairs of proteins.14 The basic idea behind this technology is the activation of a reporter gene by the binding of a transcription factor onto an upstream activating sequence. In the Y2H screening, the transcription factor is divided into two separate fragments: the binding domain (BD) and the activating domain (AD). The BD (the domain responsible for binding the promoter of the reporter gene) is fused to one of our proteins of interest, called the bait. The AD (the domain responsible for activation of transcription) is fused to the other protein of interest, called the prey. The expression of the reporter gene is taken as an indication that the bait captures (interacts with) the prey.

The procedure for the generation of the “gold standard” for this challenge was as follows. In experiments performed in the laboratory of Dr. Marc Vidal at the Dana Farber Cancer Institute, 67 proteins were used as bait (that is, were fused with the BD) and 80 proteins were used as prey (that is, fused to the AD). In independent experiments, each of the 67 × 80 assays of protein–protein interactions was performed and replicated nine times. If a pair of proteins was observed to interact in three or more replicas, the pair was marked as a putative protein–protein interaction. If any of those pairs had been reported to interact previously in the literature, then the pair was removed from our list of interacting proteins (because we desired that participants only predict interactions that were not previously reported). This process yielded 48 interacting protein pairs that interacted in at least three replicas and were previously unknown to interact. The number of unique proteins participating in these 48 interacting pairs was 47. The 48 interactions


FIGURE 3. Schematics of the procedure to choose the gold standard set in Challenge 2. A set of 47 proteins (the DREAM data set) was identified that contained some protein–protein interaction pairs. Of the 1,128 possible protein interaction pairs among these 47 proteins, three nonoverlapping sets were identified. The “gold standard positive” set was composed of 48 protein–protein interactions not previously reported in the literature but validated in at least three out of nine independent Y2H experiments. The “gold standard negative” set was composed of 1,022 pairs not previously reported in the literature and never seen to interact in any of the nine replicates of the Y2H screen. The remaining 58 protein pairs forming the “Undecided set” were omitted from the positive and negative sets and ignored in the scoring because either the interaction was already published, it was not tested in our Y2H, it was detected as a positive in one or two replicates, or some combination of these situations. (Color is shown in online version.)

composed the gold standard positive set and formed the basis for the creation of the negative set as shown in Figure 3 and explained below. (The term positive set denotes the set of protein pairs that interact and negative set denotes the set of protein pairs that do not interact.) To arrive at the gold standard negative set, we enumerated the 1,128 possible protein interactions among 47 proteins. (In general, there are n(n + 1)/2 possible interactions among n proteins, counting self-interactions.) The gold standard negative set was composed of the subset of the 1,128 possible interactions for which (1) the interaction was negative in all replicates of the Y2H screen, and (2) there are no reports of that interaction in the literature. These criteria yielded a gold standard negative set consisting of 1,022 non-interacting protein pairs. The remaining 58 of the 1,128 possible protein pairs were omitted from the positive and negative sets and ignored in the scoring because either the interaction was already published, it was not tested in our Y2H, it was detected as a positive in one or two replicates, or some combination of these situations.
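As a sanity check on these counts, here is a minimal sketch (protein labels are hypothetical; the counts 1,128 = 47·48/2, 48, 58, and 1,022 come from the text):

    from itertools import combinations_with_replacement

    # Minimal sketch of the pair bookkeeping in Challenge 2. Protein labels are
    # hypothetical; only the counts come from the text.
    proteins = [f"p{i}" for i in range(1, 48)]                    # 47 proteins
    all_pairs = list(combinations_with_replacement(proteins, 2))  # unordered pairs, incl. self-pairs
    assert len(all_pairs) == 47 * 48 // 2 == 1128

    n_positive, n_undecided = 48, 58
    n_negative = len(all_pairs) - n_positive - n_undecided
    print(n_negative)  # 1022, the size of the gold standard negative set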

The protein–protein subnetwork challenge consisted of discriminating the protein pairs that interact (positives) from those that do not (negatives) among the list of 47 proteins. A guiding principle in the construction of the challenge was to highlight what may be called ab initio prediction, as opposed to literature mining, so interactions described in the literature were eliminated from the positive and negative sets, and predictions of published interactions were ignored in the scoring. All data in the


FIGURE 4. The five-gene network of Challenge 3. (A) Promoter regions (1 pr1, 1 pr2, 4 pr, and 5 pr) were responsive only to the network genes 1, 3, 4, and 5 but not to the endogenous genes that could potentially activate these promoter regions. In a galactose-rich medium, gene 2 is repressed. Gene 2 interacts with gene 5 through their protein products. (B) Graph representation of the network in (A), used as gold standard in Challenge 3. (Color is shown in online version.)

public domain was available to participants, but the unpublished Y2H screen on which the challenge was based was not disclosed for the duration of the challenge. Participants submitted a ranked list of the 1,128 protein pairs ordered from high to low confidence that a pair of proteins interacts. In other words, interacting pairs of proteins were to appear at the top of the list and non-interacting pairs were to appear at the bottom of the list. Only the 1,070 pairs that constituted the gold standard were evaluated.

The Five-Gene Network Challenges

In a fruitful experimental and computational collaboration, Dr. Maria Pia Cosma and Dr. Diego diBernardo, of the TIGEM Telethon Institute of Genetics and Medicine, in Naples, Italy, designed a synthetic five-gene network that was then integrated in the Saccharomyces cerevisiae genome. The synthetic-biology network, named IRMA (for In vivo Reverse engineering and Modeling Assessment), has since been used by its authors for “benchmarking” modeling and reverse-engineering methods.15 Data from this network, the design of which is schematically shown in Figure 4A, formed the basis of the DREAM2 five-gene network challenge. The network was designed in such a way that the promoter regions of the genes (represented by 1 pr1, 1 pr2, 4 pr, and 5 pr in Fig. 4A) were responsive to the network genes 1, 3, 4, and 5, as shown in Figure 4A. This was achieved by

knocking out the endogenous genes that could potentially activate these promoter regions. A positive feedback loop (formed by 4 → 5 → 1 → 4) and two negative feedback loops (4 → 5 → 1 → 3 –| 4 and 5 → 1 → 2 –| 5) made for an interesting network dynamic. When growing in a glucose medium, the organism harboring this synthetic circuit grows normally and reaches a steady state. When switched to a galactose-rich medium, gene 2 is repressed, changing the network dynamics to a new steady state. In the change from glucose to galactose, the transient dynamics of the system can in principle be used to infer the topology of the network through reverse engineering algorithms. This was the rationale for two related DREAM2 challenges based on this synthetic network.

Two slightly different versions of the same network were built. In a cell cycle–dependent version of the network, gene 1, which encodes a transcription factor, was activated and translocated to the nucleus only in the late G1 phase of the cell cycle. In a second version of the network (a cell cycle–independent version), gene 1 was constitutively active independent of the cell cycle. Both versions of the network can be switched “on” and “off” by changing the concentration of galactose. In both cases the network is on when yeast cells are grown in galactose medium, and it is off when grown in glucose medium. DREAM2 participants attempted to infer the network structure from two independent datasets constituting two


independent subchallenges: FiveGeneNet1 and FiveGeneNet2.

FiveGeneNet1 Subchallenge

The data set for this subchallenge was collected using the cell cycle–independent network and consisted of two time series based on quantitative PCR (qPCR) measurements of the five network genes. In each of the time series, yeast cultures were grown in a minimal medium with glucose for 10 hours and then shifted to a galactose medium. Cells were collected from the galactose medium at regular intervals, and the mRNA of the five genes in the synthetic network was quantified by qPCR. The culture was not cell-cycle synchronized in this experiment. The two time series provided in this subchallenge corresponded to samples taken at regular intervals of 20 minutes for 3 hr (time series qPCR_A) and for 5 hr (time series qPCR_B).

FiveGeneNet2 Subchallenge

In this subchallenge, the yeast populations containing the cell cycle–dependent network were cell-cycle synchronized by adding alpha factor to their medium and then exposed to two different treatments. In treatment 1, cells grew in glucose-supplemented minimal medium, whereas in treatment 2 cells grew in galactose-supplemented minimal medium. Following synchronization, the populations were released into the cell cycle, and samples were collected every 20 minutes for 3 hours. Whole genome expression profiles were obtained using Affymetrix chips. The datasets provided to the DREAM2 participants contained two time series (corresponding to the two different treatments) for 588 genes from the original Affymetrix microarray, which included the five genes in the synthetic network plus genes known in the literature to be regulated by the network genes. The identity of the 588 genes was kept anonymous, including which of the 588 were the five genes of interest.

For each of the two subchallenges, participants were given the freedom to submit predicted networks in one or more of a variety of formats in which the directionality of edges was either specified or not, and the signs of edges were either specified as excitatory or inhibitory, or not. The resulting six categories were Undirected Unsigned, Undirected Excitatory, Undirected Inhibitory, Directed Unsigned, Directed Excitatory, and Directed Inhibitory. In all cases, participants submitted a list of network edges ordered by the confidence that the edge existed. The gold standards for the FiveGeneNet1 subchallenge are shown in Table 1 and are an example of the format requested from the participants. (The gene numbers correspond to the graph of Fig. 4B.) In the undirected submissions, the order of the first and second columns was inessential, but if the pair gene_k gene_l appeared in the list, then the pair gene_l gene_k should not have appeared. In the directed submissions, the first and second columns indicated the source and destination of a directed edge, respectively. In both directed and undirected submissions, the third column indicated an arbitrary measure of confidence of the existence of an edge. The third column was requested at submission to verify that the order of the submitted pairs was indeed in the order of higher to lower confidence. In other words, regulatory interactions were to appear at the top of the list and non-interactions were to appear at the bottom of the list.

In the FiveGeneNet2 subchallenge we used the same gold standards as the FiveGeneNet1 subchallenge described above. Of course, the underlying network is the same; only the measurements had changed. We disregarded all predicted edges for which both genes were not members of the five-gene synthetic network.

The in Silico Network Challenges

In silico networks formed the basis of a set of challenges in which participants attempted to infer the underlying network structure from simulated data. These challenges, while somewhat contrived, enabled the assessment of network reconstructions for which the underlying


TABLE 1. Gold standards for the six categories in the five-gene network challenge (a)

Undirected_Unsigned Undirected_Signed-Excitatory Undirected_Signed-Inhibitory

gene_1 gene_2 1     gene_1 gene_2 1     gene_2 gene_5 1
gene_1 gene_3 1     gene_1 gene_3 1     gene_3 gene_4 1
gene_1 gene_4 1     gene_1 gene_4 1     gene_1 gene_1 0
gene_1 gene_5 1     gene_1 gene_5 1     gene_1 gene_2 0
gene_2 gene_5 1     gene_4 gene_5 1     gene_1 gene_3 0
gene_3 gene_4 1     gene_1 gene_1 0     gene_1 gene_4 0
gene_4 gene_5 1     gene_2 gene_2 0     gene_1 gene_5 0
gene_1 gene_1 0     gene_2 gene_3 0     gene_2 gene_2 0
gene_2 gene_2 0     gene_2 gene_4 0     gene_2 gene_3 0
gene_2 gene_3 0     gene_2 gene_5 0     gene_2 gene_4 0
gene_2 gene_4 0     gene_3 gene_3 0     gene_3 gene_3 0
gene_3 gene_3 0     gene_3 gene_4 0     gene_3 gene_5 0
gene_3 gene_5 0     gene_3 gene_5 0     gene_4 gene_4 0
gene_4 gene_4 0     gene_4 gene_4 0     gene_4 gene_5 0
gene_5 gene_5 0     gene_5 gene_5 0     gene_5 gene_5 0

Directed_Unsigned Directed_Signed-Excitatory Directed_Signed-Inhibitory

gene_1 gene_2 1     gene_1 gene_2 1     gene_2 gene_5 1
gene_1 gene_3 1     gene_1 gene_3 1     gene_3 gene_4 1
gene_1 gene_4 1     gene_1 gene_4 1     gene_1 gene_1 0
gene_2 gene_5 1     gene_4 gene_5 1     gene_1 gene_2 0
gene_4 gene_5 1     gene_5 gene_1 1     gene_1 gene_3 0
gene_5 gene_1 1     gene_1 gene_1 0     gene_1 gene_4 0
gene_3 gene_4 1     gene_1 gene_5 0     gene_1 gene_5 0
gene_1 gene_1 0     gene_2 gene_1 0     gene_2 gene_1 0
gene_1 gene_5 0     gene_2 gene_2 0     gene_2 gene_2 0
gene_2 gene_1 0     gene_2 gene_3 0     gene_2 gene_3 0
gene_2 gene_2 0     gene_2 gene_4 0     gene_2 gene_4 0
gene_2 gene_3 0     gene_2 gene_5 0     gene_3 gene_1 0
gene_2 gene_4 0     gene_3 gene_1 0     gene_3 gene_2 0
gene_3 gene_1 0     gene_3 gene_2 0     gene_3 gene_3 0
gene_3 gene_2 0     gene_3 gene_3 0     gene_3 gene_5 0
gene_3 gene_3 0     gene_3 gene_4 0     gene_4 gene_1 0
gene_3 gene_5 0     gene_3 gene_5 0     gene_4 gene_2 0
gene_4 gene_1 0     gene_4 gene_1 0     gene_4 gene_3 0
gene_4 gene_2 0     gene_4 gene_2 0     gene_4 gene_4 0
gene_4 gene_3 0     gene_4 gene_3 0     gene_4 gene_5 0
gene_4 gene_4 0     gene_4 gene_4 0     gene_5 gene_1 0
gene_5 gene_2 0     gene_5 gene_2 0     gene_5 gene_2 0
gene_5 gene_3 0     gene_5 gene_3 0     gene_5 gene_3 0
gene_5 gene_4 0     gene_5 gene_4 0     gene_5 gene_4 0
gene_5 gene_5 0     gene_5 gene_5 0     gene_5 gene_5 0

(a) The table exemplifies the submission format. In the Undirected submissions, the order of the first and second columns was inessential. In the Directed submissions, the first and second columns represented, respectively, the source and destination of a directed edge. The third column, indicating an arbitrary measure of confidence that the corresponding edge was in the actual network, was requested at submission to verify that the order of the submitted pairs was indeed in the order of higher to lower confidence.


network, dynamical equations, and noise model were known with certainty to the evaluator. In this way, we trade realism of the problem for accuracy of the gold standard.

Three in silico networks were created and endowed with deterministic dynamics without noise using the methods described by Mendes and collaborators.16 Dr. Pedro Mendes provided the networks and the corresponding data. The phenomenological models of gene regulatory networks had Hill-type kinetics for induction/repression, and only additive interactions were assigned. This challenge consisted of three subchallenges.

InSilico1 Subchallenge

The challenge consisted of reverse engineering a network of 50 nodes and 98 directed edges. The network was constructed to have Erdos-Renyi topology (see Fig. 5A).

InSilico2 Subchallenge

The challenge consisted of identifying a network of 50 nodes and 99 directed edges. The network was constructed to have scale-free topology (see Fig. 5B).

InSilico3 Subchallenge

The challenge consisted of reverse engineering an integrated gene-protein-metabolite network consisting of 24 metabolites, 23 proteins, and 20 genes connected through 146 directed edges. The network was constructed manually to mimic known biochemical network structures (see Fig. 5C).

In the first two networks, an edge represented a causal relationship between the expression levels of two genes. For the latter network, edges represented processes such as protein translation, protein–protein interactions, and enzyme-mediated metabolic reactions.

Simulated data for the InSilico1 and InSilico2 subchallenges mimicked typical functional genomic experiments, such as gene expression microarray data or quantitative PCR data, with the additional advantage that the number of in silico experiments relative to the number of genes was larger than usual. The data for the reconstruction of the InSilico3 network also contained the equivalent of the protein and metabolite concentrations, in addition to the mRNA abundance. For the first two subchallenges, the data provided to the participants consisted of the following.

• Knock-down data: This dataset contained steady-state mRNA abundance for the 50 genes under 51 experimental conditions. In addition to the expression of the “wild-type” strain, steady-state expression profiles of the 50 heterozygous (+/–) knock-down strains were provided.

• Knock-out data: This dataset contained steady-state mRNA abundance for the 50 genes under 51 experimental conditions. In addition to the expression of the wild-type strain, steady-state expression profiles of the 50 null-mutant (–/–) strains were provided.

• Time-series data: This dataset contained the mRNA concentration for the 50 genes as their abundance evolved dynamically in the wild-type strain (+/+). Time courses (trajectories) of the network recovering from several external perturbations induced by different initial conditions were provided. There were 23 time courses differing in the initial conditions; each time course was measured at 26 time points.

For the third subchallenge, the data was slightly different. It consisted of the following.

• Knock-down data: This dataset contained steady-state levels of metabolites, proteins, and mRNA for the wild-type and the 20 heterozygous knock-down strains for each gene (+/–).

• Knock-out data: This dataset contained steady-state levels of metabolites, proteins, and mRNA for the wild-type and the 20 null-mutant strains for each gene (–/–). The knock-out of the gene encoding G14


FIGURE 5. Gold standard networks used in Challenge 4. (A) The InSilico1 network consisted of 50 nodes and 98 directed edges and Erdos-Renyi topology. (B) The InSilico2 network consisted of 50 nodes and 99 directed edges and a scale-free topology. (C) The InSilico3 network consisted of 24 metabolites (ellipses), 23 proteins (rectangles), and 20 genes (parallelograms) connected through 146 directed edges. The network was constructed manually to mimic known biochemical network structures. (Color is shown in online version.)

was in silico “lethal” (i.e., most of the steady states turned out to be zero).

• Time-series data: This dataset contained time courses (trajectories) of the network recovering from several external perturbations. There were 23 time series differing in the initial conditions; each time series had 26 time points.


Participants were given the freedom to submit predicted networks in one or more of a variety of formats in which the directionality of edges was either specified or not, and the signs of edges were either specified as excitatory/inhibitory, or not. As was the case in the five-gene network challenge, these options resulted in the following six categories per subchallenge: Undirected Unsigned, Directed Unsigned, Undirected Excitatory, Undirected Inhibitory, Directed Excitatory, Directed Inhibitory. In all cases, participants submitted a ranked list of network edges ordered from high to low confidence that an edge was present. In other words, regulatory interactions were to appear at the top of the list and non-interactions were to appear at the bottom of the list.

The convention for the format of the submissions was not straightforward for the InSilico3 subchallenge due to the existence of enzyme-mediated reactions and bi-molecular reactions. The specific conventions used to format the prediction files can be found in the DREAM2 in silico challenges data description page of the DREAM website.17

The Genome Scale Network Challenge

A long-standing question in the reverse engineering community is: To what extent can a gene regulatory network be reverse engineered from steady-state gene expression profiles? A compendium of more than 500 normalized E. coli Affymetrix microarrays representing about 200 distinct conditions18 formed the basis of a challenge in which participants attempted to reverse engineer the E. coli gene regulatory network from microarray data. Of the more than 500 original microarrays, more than 300 were collected at the laboratories of James J. Collins and Tim Gardner at Boston University, and more than 200 were downloaded from the Gene Expression Omnibus (GEO) website. The approximately 500 experimental conditions were concentrated mainly on antibiotic, SOS, and stress response. The selected 300 microarrays reasonably covered

many stress pathways. The currently known E. coli gene regulatory network as reported by RegulonDB,19 arguably the most complete gene regulatory network database for any organism, was the source of the gold standard set for this challenge. The participants were not informed that the data pertained to E. coli. Genes in the microarray data were anonymized by removing the labels and adding a small amount of Gaussian noise so that the data sources could not be easily identified by Internet searches on the expression values. Included in the microarray data were 3,456 coding regions that included all genes in RegulonDB and randomly chosen additional genes. Among the included genes were 320 known and putative transcription factors annotated in RegulonDB.

To arrive at the gold standard, we began with the then current annotated transcription factor–gene regulation as curated in RegulonDB version 4 (beta). To this we added publicly available sigma factor regulation available from the RegulonDB website, and removed self-regulation. The gold standard network so constructed contained 1,095 genes and 152 transcription factors (some of them counted among the 1,095 genes) with annotated interactions, and 3,102 edges, about one third of which were sigma factor interactions. Some caveats related to using RegulonDB as a gold standard for assessment of predicted networks are illustrated in Figure 6. At best, the RegulonDB network is incomplete. Based on ChIP-qPCR data of three transcription factors, it was estimated that RegulonDB is 85% complete.20 Therefore, some false positives may be incorrectly assigned. The functional depth of the microarray compendium may not be sufficient to recover large portions of the RegulonDB network. It is easy to imagine the situation in which a high-quality network prediction can have a low recall but high precision.

Participants were given the freedom to submit predicted directed networks in which the signs of edges were either specified as excitatory/inhibitory or not specified. These options resulted in the following categories: Directed


FIGURE 6. Relationships among predicted, gold standard, and actual networks in Challenge 5. The “gold standard” set consisted of the transcription factor–target interactions as curated in RegulonDB v.4 (beta) enhanced with publicly available sigma factor regulation also available from the RegulonDB website and with self-regulation removed. This gold standard is a subset of the actual E. coli transcriptional network, represented as the “Real network” in the figure. The “Predicted network” set may contain some incorrectly assigned false positives. Insufficiently sampled conditions in the provided data may have led to predictions with a low recall but high precision. (Color is shown in online version.)

Unsigned, Directed Excitatory, and Directed Inhibitory. The direction was specified by the fact that for the resulting network, any edge had to originate from a TF. In all cases, participants submitted a ranked list of network edges ordered from high to low confidence that an edge is present. In other words, regulatory interactions were to appear at the top of the list and non-interactions were to appear at the bottom of the list.

Participants predicted a network of 3,456 genes × 320 transcription factors. For scoring, this network was mapped to the 1,095 genes × 152 transcription factors in the gold standard. Transcription factor–gene edges that are present in RegulonDB form the gold standard positive set. The gold standard negative set was defined as the transcription factor–gene entries that were absent in RegulonDB. It is clear, however, that absence of an edge in RegulonDB doesn't mean that the edge is nonexistent in the actual network.
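A minimal sketch of this mapping step is shown below. The identifiers and data structures are hypothetical; only the dimensions (3,456 × 320 predicted, 1,095 × 152 scored) and the positive/negative definitions come from the text.

    # Sketch of restricting a genome-scale prediction to the RegulonDB-based gold
    # standard (Challenge 5). Containers are hypothetical placeholders.
    def restrict_to_gold_standard(ranked_edges, gs_tfs, gs_genes, gs_positive_edges):
        """Keep only TF->gene predictions whose TF and gene are in the gold standard,
        preserving the submitted order, and label each as positive (present in
        RegulonDB) or negative (absent from RegulonDB)."""
        scored = []
        for tf, gene, confidence in ranked_edges:        # ordered high -> low confidence
            if tf in gs_tfs and gene in gs_genes:
                label = 1 if (tf, gene) in gs_positive_edges else 0
                scored.append((tf, gene, confidence, label))
        return scored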

Assessment of Predicted Networks

In order to assess how well a predicted network approximates the gold standard network that it attempts to recapitulate, we require a metric, or a suite of metrics. Broadly speaking, an assessment of a predicted network should express the concept of distance between the predicted network and the gold standard network. Consider, for example, the edit distance (or Hamming distance), which can be defined on a network as the number of corrections (e.g., added or removed edges) to transform one network into another. Edit distance is an attractive metric given its simplicity; however, it does not discriminate between the two types of errors that occur when predicting a network edge, false positives (FP) and false negatives (FN). Since biological networks are sparse, containing many fewer edges than is theoretically possible, it is important to characterize a predicted network with respect to each type of error. Since network prediction is similar to other types of binary prediction problems, it is advantageous to employ a generalized assessment of prediction accuracy.

As posed to the DREAM2 participants, network prediction can be thought of as a binary classification task in which every potential network edge is classified as either present or absent. In some cases we could have had three-class predictions. For example, in a directed regulatory network an edge could be either (1) absent, (2) present and excitatory, or (3) present and inhibitory. The assessment of prediction of networks whose edges could belong to more than two classes, however, posed some nontrivial technical difficulties. Therefore we decided to pose the network challenges as binary problems. The simplification of problems to binary classes came at the price of complicating the categories of submission. In the example above, rather than the category Directed Signed, for which the predicted edges could belong to more than two classes, we created two categories, Directed Excitatory and Directed Inhibitory.
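A minimal sketch of this reduction from a three-class (signed, directed) prediction to the two binary categories follows; the edge list and its field layout are hypothetical, and only the Directed Excitatory / Directed Inhibitory split comes from the text.

    # Sketch of splitting one signed, directed prediction into the two binary
    # DREAM2 categories. The example edges are hypothetical.
    signed_edges = [
        ("gene_1", "gene_2", "+", 0.90),   # source, target, sign, confidence
        ("gene_3", "gene_4", "-", 0.75),
        ("gene_5", "gene_1", "+", 0.60),
    ]
    directed_excitatory = [(s, t, c) for s, t, sign, c in signed_edges if sign == "+"]
    directed_inhibitory = [(s, t, c) for s, t, sign, c in signed_edges if sign == "-"]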


The advantage of using binary classes is that binary classification is a standard paradigm in the field of machine learning with well-established evaluation metrics. Because there are two types of errors (also, two types of correct predictions), it is common to simultaneously explore complementary metrics over a range of parameter values (e.g., varying the cutoff of the decision boundary). One such pair of complementary metrics is precision and recall. Precision is a measure of fidelity, whereas recall is a measure of completeness. Another common pair of metrics is true positive rate (TPR) and false positive rate (FPR), the axes of the receiver operator characteristic (ROC) curve. These concepts are reviewed below.

To Threshold or Not to Threshold?

Any algorithm designed to decide the class that each element in a set belongs to (say Positive and Negative elements) will have to output two sets: the set of elements predicted to be Positive, and the set of elements predicted to be Negative. At some point the researcher is confronted by the need to choose a cutoff that determines the boundary between classes. This cutoff is typically controlled by one or more parameters intrinsic to the algorithm. The choice of this cutoff is usually arbitrary, such as, for example, determining a P-value below which a prediction is deemed significant.

One of the problems in assessing methods that use a threshold is that the evaluation will simultaneously assess the choice of the threshold along with the method used for the predictions. If a method is excellent but the choice of the cutoff is suboptimal, the evaluation might give a bad score to the method. If instead we free the evaluation from a choice of the cutoff, the assessment will only consider the merits of the classification algorithm. The latter is the strategy that we chose for the assessment of DREAM2 predictions. Of course, availability of an objective approach to determine an optimal threshold is an important aspect of reverse engineering. For instance, an algorithm may be far superior to others in the overall threshold-less metric. Yet, in the absence of a reasonable approach to determine the threshold (which may be context-dependent), the algorithm may behave poorly in practical conditions. This potential oversight will be considered in future challenge editions.

Format of Predictions

If the network predictions included the choice of a threshold, the submissions would have resulted in two sets, one class with pairs of nodes deemed to have an edge, and the other set with the pairs of nodes predicted not to have an edge. Instead, DREAM2 participants were asked to submit their predictions in the form of an ordered list of predicted network edges. The list was constructed in decreasing order of reliability that an edge is in the network. In this way, the first list entry is the network edge predicted with highest confidence, and the last list entry is the network edge predicted with the lowest confidence that the edge belongs in the network. Let k be a pointer to the current list entry, k = 1, . . . , L, where L is the length of the list. If a particular k was chosen as a threshold to determine the boundary of positive and negative sets, then the edges from 1 to k would be the set of network edges predicted at cutoff k. Rather than considering a particular cutoff k, we considered all possible choices of k. In other words, k is a parameter that incorporates progressively more edges in the predicted network as k is increased from 1 to L. Suppose that the challenge demanded the prediction of T total edges (for example, for a network with n nodes, a challenge requesting the prediction of a directed network without self-interactions would have T = n(n − 1) edges). If L = T, then k truly ranges from a positive set with just one edge (k = 1) to a positive set including all edges (k = T). However, the participant may submit a truncated list (L < T). In such a case, the edges not included in the submitted prediction should be “added” in random order at the end of the list to complete the submission. We say “added” in quotation marks because the addition of missing edges takes place in an analytical way. In other words, the final assessment was made on the average performance that would result if the missing edges were added in all possible orders (see Appendix).
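One way to make this analytical completion concrete is sketched below, under the assumption that the undetected positives are uniformly spread over the T − L unsubmitted edges, which is what averaging over all possible orders implies; the exact DREAM2 procedure is given in the Appendix.

    # Sketch of the expected true-positive count for depths beyond a truncated
    # list of length L, assuming the remaining positives are uniformly distributed
    # among the T - L unsubmitted edges (illustrative only; see the Appendix for
    # the exact procedure).
    def expected_tp(k, L, T, tp_at_L, P):
        """Expected TP(k) for k > L in the analytically completed list."""
        if k <= L:
            raise ValueError("use the observed TP(k) for k <= L")
        remaining_positives = P - tp_at_L
        return tp_at_L + (k - L) * remaining_positives / (T - L)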

Precision-Recall Curve

As explained above, for each value of the parameter k indexing the position of an entry in the submitted edge list, we have a concrete network prediction. On this prediction a measure of fidelity of the positive set can be defined as the precision at depth k of the prediction as

prec(k) = TP(k) / [TP(k) + FP(k)] = TP(k) / k

where TP(k) is the number of true positives (edges predicted to be in the gold standard network that actually are in it), and FP(k) is the number of false positives (edges predicted to be in the gold standard network that actually are not in it), both at depth k in the list (note that TP(k) + FP(k) = k). Another useful measure of accuracy of a prediction is the recall, a measure of completeness of a prediction, defined as

rec(k) = TP(k) / P

where P is the number of positives (i.e., the number of actual edges in the network), and therefore is independent of k. The values of recall and precision range from zero to one, with one being the theoretically optimal value for precision, and min(k/P, 1) being the theoretical optimum for recall at depth k. As k increases from 1 to T, the precision-recall curve graphically explores changes in accuracy as we add more edges to our putative positive set.
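As an illustration, a minimal sketch of computing prec(k) and rec(k) from a ranked edge list and a gold standard follows; the variable names and the toy example are hypothetical, and only the definitions prec(k) = TP(k)/k and rec(k) = TP(k)/P come from the text.

    # Sketch of the precision-recall computation described above.
    def precision_recall_curve(ranked_edges, gold_positive):
        """ranked_edges: edges ordered from high to low confidence.
        gold_positive: set of edges actually in the network (P positives)."""
        P = len(gold_positive)
        tp = 0
        prec, rec = [], []
        for k, edge in enumerate(ranked_edges, start=1):
            if edge in gold_positive:
                tp += 1                  # TP(k)
            prec.append(tp / k)          # prec(k) = TP(k) / k
            rec.append(tp / P)           # rec(k)  = TP(k) / P
        return prec, rec

    # Toy example on a hypothetical five-node network:
    gold = {("gene_1", "gene_2"), ("gene_1", "gene_4"), ("gene_4", "gene_5")}
    ranking = [("gene_1", "gene_2"), ("gene_2", "gene_3"), ("gene_1", "gene_4")]
    print(precision_recall_curve(ranking, gold))  # approx. ([1.0, 0.5, 0.67], [0.33, 0.33, 0.67])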

ROC Curve

The receiver operating characteristic (ROC) curve graphically explores the tradeoff between the true positive rate TPR(k) and the false positive rate FPR(k), as k ranges from 1 to T. The true positive rate at depth k is computed as

TPR(k) = TP(k) / P

and represents the fraction of positives that are correctly predicted at depth k in the submitted list. Note that TPR(k) is equivalent to rec(k). The false positive rate at depth k in the submitted list, FPR(k), is defined as

FPR(k) = FP(k) / N

where N is the total number of node pairs that are not connected by edges; FPR(k) represents the fraction of negatives that are incorrectly predicted as positives at depth k. Both TPR(k) and FPR(k) range from zero to one. As k increases from 1 to T, there is a tradeoff between FPR(k) and TPR(k). The perfect true positive rate will be achieved at k = T, by predicting the completely connected network (indeed, TP(k = T) = P, and TPR(k = T) = 1). However, FPR(k = T) would be the highest (equal to 1) due to the fact that at k = T all the edges that are actually negatives would be predicted positive (FP(k = T) = N). Thus, the tradeoff between TPR(k) (which we desire to be high) and FPR(k) (which we desire to be low) can be inspected directly from the ROC curve.

Area Under the Curves

To compare the submitted predictions against a corresponding gold standard, we use a method based on the “Area Under the Curve,” a rather standard measure of classification performance in the machine learning community. The area under the ROC curve (AUROC) is a single number that summarizes the tradeoff between TPR(k) and FPR(k) as k is varied. To compute it, we simply plot the points (TPR(k), FPR(k)) in a TPR versus FPR plane and compute the integral of the curve that results from a linear interpolation between the points. A perfect network prediction would yield an AUROC of one.
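A minimal sketch of an AUROC computation by linear (trapezoidal) interpolation of the ROC points, following the definitions above, is given next; the inputs, the handling of pairs outside the gold standard, and the inclusion of a (0, 0) starting point are illustrative assumptions.

    # Sketch of AUROC by trapezoidal integration of the ROC points.
    def auroc(ranked_edges, gold_positive, gold_negative):
        P, N = len(gold_positive), len(gold_negative)
        tp = fp = 0
        fpr_pts, tpr_pts = [0.0], [0.0]            # start the curve at the origin (assumption)
        for edge in ranked_edges:                  # ordered high -> low confidence
            if edge in gold_positive:
                tp += 1
            elif edge in gold_negative:
                fp += 1                            # pairs in neither set are ignored
            fpr_pts.append(fp / N)                 # FPR(k) = FP(k) / N
            tpr_pts.append(tp / P)                 # TPR(k) = TP(k) / P
        area = 0.0
        for i in range(1, len(fpr_pts)):           # trapezoid rule (linear interpolation)
            area += (fpr_pts[i] - fpr_pts[i - 1]) * (tpr_pts[i] + tpr_pts[i - 1]) / 2
        return area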

The area under the precision-recall curve (AUPR) is a single number that summarizes


FIGURE 7. Statistics of the AUPR and AUROC for the DREAM2 predictions. (A) Scatter plot of random network predictions for the five-gene-network challenge, Undirected Unsigned category. Two independent predictions with the same AUPR can have vastly different AUROCs. (B) and (C) Empirical distribution (blue dots) and approximate fit (black solid line) of the AUPR (B) and the AUROC (C) for random predictions for the InSilico1 network challenge, category Undirected Unsigned. The approximate fit is given by stretched exponential functions with different parameters to the right and left of the mode of the distributions. (Color is shown in online version.)

the precision-recall tradeoff. To compute it, we simply plot the points (prec(k), rec(k)) in a precision versus recall plane and compute the integral of the curve that results from a nonlinear interpolation between the points. The rationale for the nonlinear form of the interpolation, and an algorithm to compute the AUPR used in DREAM2, is presented in the Appendix.

A perfect network prediction would yield anAUPR of one.

The value of the area under the curves depends on the particular trajectory taken by the points in the precision-recall curve and the ROC curve. It could very well happen that two predictions have the same AUPR and vastly different AUROCs. This is clearly seen in Figure 7A, where we show the results of creating a large number of random network predictions for the five-gene-network challenge, in the category Undirected Unsigned. It is clear that for a given value of AUROC, there is a range of possible values of the AUPR. For example, the ensemble of predictions with AUROC = 0.7 can have values of the AUPR between 0.55 and 0.8. Therefore both AUPR and AUROC can be used as measures of the performance of a network submission, as each gives a subtly different summarization of the TP(k) and FP(k) measures over the whole range of k.

Scoring a Network Submission

To gain a sense of the statistical significance of a particular area under the curve, we require a null hypothesis from which a P-value could be computed. A reasonable null hypothesis is a random list of network edge predictions. For many such random predictions, we computed the AUROC and the AUPR. By sampling the space of random predictions (see Fig. 7A), the joint probability density function of obtaining simultaneously a pair (AUROC, AUPR) could, in principle, be estimated. The P-value for an algorithm that simultaneously yields areas under the curve aROC and aPR could be computed as the probability of AUROC > aROC and AUPR > aPR. (Other definitions of a P-value for a bivariate distribution are possible.) In practice, sampling this bivariate probability density with sufficient precision was unfeasible because the space of high AUROC and high AUPR was extremely unlikely.

Instead we took the strategy of computing independently the marginal P-values:

pROC(aROC) = Prob(AUROC > aROC)

and

pPR(aPR) = Prob(AUPR > aPR)

and we defined the score of a submission with areas under the curve aROC and aPR as

score(aPR, aROC) = −log10[ pPR(aPR) × pROC(aROC) ].

With this definition, the higher the score, the better the merits of the submission.
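A minimal sketch of the score, taken directly from the definition above, is shown here; the example P-values are the two marginal P-values reported for Team 2 in Table 2.

    import math

    # Sketch of the DREAM2 score for one submission.
    def dream2_score(p_aupr, p_auroc):
        """score = -log10( pPR(aPR) * pROC(aROC) )"""
        return -math.log10(p_aupr * p_auroc)

    print(dream2_score(1.4e-12, 7.4e-14))  # ~24.98, matching Team 2's score in Table 2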

We finish this section with a technical but important point. In many of the challenges the submitted networks obtained values of the AUROC and AUPR that were too big to be reached by sampling the space of random predictions. We addressed this problem with an empirical approach that is illustrated in Figures 7B and 7C. Shown in blue in Figure 7B is the probability density function of the AUPR that resulted from sampling 100,000 random predictions for the InSilico1 network challenge, category Undirected Unsigned. The maximum value of AUPR sampled was 0.2, but the AUPR obtained by the best performer team was 0.6. Estimation of a P-value for this team entails integrating the distribution between 0.6 and 1, but we didn't have any sampled value of the probability distribution in this range. (It should not come as a surprise that some teams performed much more accurately than random chance.) Therefore we parameterized the null probability with a function that approximates it very well, as shown in the black line that fits the blue points extremely well. The approximation was given by stretched exponentials with different parameters to the right and left of the mode of the distribution, with functional form

pdf(x) = hmax exp[−b>(x − xmax)^c>]   for x ≥ xmax
pdf(x) = hmax exp[−b<(xmax − x)^c<]   for x < xmax

For Figure 7B, the parameters were given by hmax = 49.9655, xmax = 0.0785, c< = 2.8031, b< = 470,450, c> = 1.1565, and b> = 198.86. With this functional form, the resulting P-value could be computed to yield ∼10−41. A similar procedure done for the distribution of AUROC (Fig. 7C) yielded a P-value of ∼10−41. In this way, the score for this team was 61, an extremely satisfactory performance.
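The following is a sketch of the stretched-exponential null model for the AUPR distribution of Figure 7B, using the parameters quoted above; the simple numerical tail integration is an illustrative assumption, not the exact procedure used for the reported P-values.

    import math

    # Stretched-exponential null fit (parameters from the text, Figure 7B).
    H_MAX, X_MAX = 49.9655, 0.0785
    C_LT, B_LT = 2.8031, 470450.0    # parameters for x < x_max
    C_GT, B_GT = 1.1565, 198.86      # parameters for x >= x_max

    def null_pdf(x):
        if x >= X_MAX:
            return H_MAX * math.exp(-B_GT * (x - X_MAX) ** C_GT)
        return H_MAX * math.exp(-B_LT * (X_MAX - x) ** C_LT)

    def tail_p_value(aupr, upper=1.0, n_steps=100000):
        """P(AUPR > aupr) under the fitted null, by a midpoint Riemann sum."""
        dx = (upper - aupr) / n_steps
        return sum(null_pdf(aupr + (i + 0.5) * dx) for i in range(n_steps)) * dx

    print(tail_p_value(0.6))  # a tiny tail probability, of the order of the ~10^-41 quoted above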

We applied the procedure just described to every category in all challenges. In this way we could score the performance of all the DREAM2 submissions.


FIGURE 8. The statistics of submissions to the five DREAM2 challenges. The most popular of the challenges was the In Silico challenge, followed by the BCL6 target prediction challenge. (Color is shown in online version.)

Best Performers

Best performers were selected in each category of each challenge. They were identified on the basis of the following criteria:

(i) AUPR P-value ≤ 0.05.
(ii) AUROC P-value ≤ 0.05.
(iii) Highest score of its category among the teams for which conditions (i) and (ii) are satisfied.
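A minimal sketch of applying these criteria to a table of scored submissions (the field names and data structure are hypothetical):

    # Sketch of the best-performer selection criteria listed above.
    def best_performer(submissions):
        """submissions: list of dicts with keys 'team', 'p_aupr', 'p_auroc', 'score'."""
        eligible = [s for s in submissions
                    if s["p_aupr"] <= 0.05 and s["p_auroc"] <= 0.05]          # criteria (i) and (ii)
        return max(eligible, key=lambda s: s["score"]) if eligible else None  # criterion (iii)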

Considered as a sample of the community performance in reverse engineering problems, the compilation of DREAM2 submissions constitutes a true community-wide experiment that can be studied as data in its own right. In the following sections we report on the assessment of the performance of the participants in the different DREAM2 challenges.

Results

Statistics of Submissions

On July 22, 2007 the challenges were posted on the DREAM website, and submissions in response to the challenges were accepted until October 23, 2007. During these three months, 109 teams, from 17 different countries (Argentina, Belgium, Canada, China, France, Germany, Hungary, India, Italy, Japan, Korea, Russia, Singapore, Sweden, Switzerland, UK, USA) downloaded the challenges. Of those, 36 teams submitted for evaluation a total of 110 predictions. The statistics of submissions can be seen in Figure 8. The most popular of the challenges was the In Silico challenge, followed by the BCL6 target prediction challenge. In the following subsections we will present analyses of performance for a subset of all these submissions. The remaining results can be obtained from the DREAM website: http://wiki.c2b2.columbia.edu/dream/.

The BCL6 Transcriptional-Target Challenge

Eleven teams submitted predictions for the BCL6 target discovery challenge. Predictions were assessed as described in the section Assessment of Predicted Networks and shown in Table 2. We can see that the best performing team according to the criteria delineated in the previous section was Team 2. Note, however,


TABLE 2. Performance of submissions to the BCL6 target identification challenge^a

          Precision at nth correct prediction
Team      1st     2nd     5th     20th    AUPR    P-value AUPR    AUROC   P-value AUROC   Score
Team 1    1       1       0.83    0.83    0.69    1.2 × 10^−11    0.852   3.3 × 10^−14    24.38
Team 2    1       1       1       0.80    0.72    1.4 × 10^−12    0.847   7.4 × 10^−14    24.99
Team 3    1       1       1       0.67    0.67    4.9 × 10^−11    0.83    5.4 × 10^−13    22.57
Team 4    0.33    0.50    0.38    0.45    0.40    0.002           0.72    1.3 × 10^−6     8.53
Team 5    1       0.67    0.62    0.41    0.38    0.005           0.63    0.0022          4.97
Team 6    1       1       0.45    0.30    0.33    0.057           0.54    0.21            1.93
Team 7    0.50    0.67    0.45    0.27    0.31    0.11            0.52    0.30            1.49
Team 8    0.09    0.13    0.21    0.26    0.25    0.73            0.50    0.51            0.43
Team 9    1       0.40    0.55    0.24    0.29    0.23            0.46    0.81            0.73
Team 10   0.11    0.07    0.10    0.21    0.20    0.998           0.39    0.99            0.005
Team 11   0.04    0.06    0.09    0.16    0.18    1               0.28    1               0

^a The precision at the 1st, 2nd, 5th, and 20th correct predictions for each team is listed in the 2nd to 5th columns. AUPR and AUROC indicate, respectively, the area under the precision-recall curve and the area under the ROC curve. The last column lists the score, defined as minus the logarithm, to the base 10, of the product of the AUPR P-value (7th column) and the AUROC P-value (9th column).

Note, however, that the AUROC of the second best scoring team (Team 1) was larger than that of Team 2, and that the difference in score between Team 2 and Team 1 is rather small. In view of these observations, the uncertainties of our P-value calculations, and the non-uniqueness of the evaluation methodology, we decided that the performances of Teams 1 and 2 were, for all intents and purposes, a tie. It is also interesting to observe that the jump of 14 units of score between the 3rd and 4th best scoring teams (Team 3 and Team 4) was much bigger than the jump of 1.8 units between the 2nd and 3rd best scoring teams (Team 1 and Team 3).

To visually compare the performance of the submissions, we performed agglomerative unsupervised hierarchical clustering on the vectors prec(k) for k = 1, . . . , 200 (see Fig. 9), the precision at depth k in the submitted list. (See the legend to Fig. 9 for further explanation.) The precision of the three best performers nicely clustered with the precision of the gold standard (a perfect prediction should have a prec(k) identical to the gold standard), whereas the remaining eight teams agglomerated in a separate cluster, with Teams 6 to 11 performing not significantly better than random.

The AUPR gives a sense of the typical precision of a method if we don't have a particular threshold or cutoff for our predictions. In real applications, however, we do need some criterion to separate the two classes under scrutiny. For this challenge, we need to choose a cutoff depth k_c, call the predictions from 1 to k_c targets, and call the predictions at depth k > k_c non-targets. Columns 2 through 5 in Table 2 give us information on the precision at a given depth in the list. The entries in these columns indicate the precision at the n-th correct prediction. As the precision is computed as prec(k) = n/k, the depth k needed to achieve the n-th correct prediction is k = n/prec(k). In order to get the 20th BCL6 target (representing a recall of 38%), we need to go to a depth of 24, 25, 30, 44, 49, 67, 74, 77, 83, 95, and 125 for Teams 1 through 11, respectively. The average precision expected for a random prediction is 27%. Table 2 shows that at recall 38%, the precision of Teams 1 through 3 is considerably better than what is expected by chance, while the precision of Teams 6 through 11 is essentially what is expected by chance. Teams 4 and 5 are marginally better than chance.
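These depths follow directly from the relation k = n/prec(k) and the fifth column of Table 2, as the short sketch below reproduces.

```python
# Depth k needed to reach the n-th correct prediction, given prec(k) = n/k.
precision_at_20th = {  # "20th" column of Table 2
    "Team 1": 0.83, "Team 2": 0.80, "Team 3": 0.67, "Team 4": 0.45,
    "Team 5": 0.41, "Team 6": 0.30, "Team 7": 0.27, "Team 8": 0.26,
    "Team 9": 0.24, "Team 10": 0.21, "Team 11": 0.16,
}
depths = {team: round(20 / prec) for team, prec in precision_at_20th.items()}
# -> 24, 25, 30, 44, 49, 67, 74, 77, 83, 95, and 125 for Teams 1 through 11
```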

The overall picture is that Teams 1, 2, and 3 are clearly the best performers in this challenge.


FIGURE 9. Heat map and hierarchical cluster analysis of the precision for the gold standard and the 11 submissions for the BCL6 target identification challenge. Precision is encoded in a color scale (shown in the palette at the upper right of the figure) going from blue (representing the precision for a random submission) to red (representing a submission with a precision of 1). The precision is parameterized by the prediction number (i.e., the depth on the prediction list) at which the cutoff is made. The gold standard has precision 1 (saturated red) up to position 53 in the list, below which the decoys started to degrade the prediction, down to prediction number 200, at which the precision was 53/200 (saturated blue), the average precision expected at any cutoff in random predictions. (Color is shown in online version.)

(We included Team 3 even though it does not strictly meet the requirement for best performer given earlier. However, it is clear from the present analysis that Team 3 belongs in the group of best performers.)

The best performing teams were (team leaders are underlined):

• Team 1: W.H. Lee, V. Narang, H. Xu, F. Lin, K.C. Chin, and W.K. Sung, from the Genome Institute of Singapore

• Team 2: Vinsensius B. Vega, Xing Yi Woo, Habib Hamidi, Hock Chuan Yeo, Zhen Xuan Yeo, Guillaume Bourque, and Neil D. Clarke, from the Genome Institute of Singapore

• Team 3: Matti Nykter, Harri Lahdesmaki, Alistair Rust, Vesteinn Thorsson, and Ilya Shmulevich, from The Institute for Systems Biology.

These teams discuss their best performing strategies in their respective chapters later in this volume.21–23 Even though the methods used by these three teams were different in many ways, it is interesting to highlight important similarities in their methods.

(1) Avoid giving too much weight to "known" binding sites. Team 1 chose not to use the "known" BCL6 binding site. Teams 2 and 3 used a weighted consensus of attributes based on gene expression experiments, sequence motifs, and Gene Ontology. Therefore, possible artifacts that the "known" BCL6 motifs could have introduced into the analyses may have been compensated by the other information included in the consensus.

(2) Use gene expression that is only relevant to the transcription factor under scrutiny. Of the 336 gene expression experiments, Teams 1, 2, and 3 only used the gene expression data


collected under conditions relevant to the BCL6 biology and neglected the rest of the data.

(3) Use previously known target genes as prior information. All three teams collected a training dataset of previously known BCL6 targets and used it to train a machine learning algorithm, which was later used to classify the test gene set.

TABLE 3. Performance of submissions to the protein–protein interaction subnetwork challenge^a

          Precision at nth correct prediction
Team      1st     2nd     5th     20th    AUPR    P-value AUPR    AUROC   P-value AUROC   Score
Team 1    0.06    0.09    0.10    0.10    0.08    0.01            0.63    1.3 × 10^−3     4.9
Team 2    0.14    0.10    0.07    0.06    0.05    0.31            0.56    0.077           1.6
Team 3    0.33    0.02    0.04    0.05    0.05    0.31            0.49    0.59            0.7
Team 4    0.50    0.25    0.03    0.04    0.05    0.31            0.48    0.68            0.7
Team 5    0.14    0.15    0.04    0.04    0.04    0.84            0.48    0.68            0.2

^a The columns are as in Table 2.

We asked the remaining teams to voluntarily share with us what their methodology was, so that we could also extract lessons from the strategies that performed sub-optimally. Only Team 6 responded. In their approach they first looked for matches to a BCL6 weight matrix in the promoter regions of the 200 candidate genes. Those genes with a motif match above a relatively lenient threshold were then ranked based on how well their mRNA expression profile across all phenotypes correlated with the average expression profile of the genes. In their methodology, this team seems to have taken the opposite approach to the items shared by the best performing teams and enumerated above. Neither their AUPR nor their AUROC yielded values significantly different from random predictions.

The Protein–Protein Subnetwork Challenge

Five teams submitted predictions for the protein–protein interaction subnetwork challenge. Predictions were assessed as described in the section Assessment of Predicted Networks and are summarized in Table 3. From Table 3 we can see that one team had significant P-values for both the AUPR and AUROC. (For the other four teams, neither the AUPR nor the AUROC yielded significant P-values.)

To visually compare the performance of the submissions, we performed agglomerative unsupervised hierarchical clustering on the vectors prec(k), k = 1, . . . , 200 (Fig. 10), the precision at depth k in the submitted list. We only show the precision up to depth k = 200, rather than the full length of the prediction list (1,070 total predictions), to better visualize the precision at the most trusted predictions. None of the predictions clustered with the gold standard; rather, all predictions grouped together. However, Team 1 more consistently predicted correct interactions down to deeper entries in the prediction list. Interestingly, the first correct prediction of Team 1 was made only at depth 16 (after removing the pairs not considered in the gold standard). For Teams 3 and 4, the first correct prediction was the third and second, respectively. At their 50th prediction (k = 50), Team 1 had a precision of 12%, whereas Teams 2 through 5 had precisions of 6%, 2%, 4%, and 6%, respectively.

The best performing team was (the team leader is underlined):

• Team 1: Hon Nian Chua, Willy Hugo, Guimei Liu, Xiaoli Li, Limsoon Wong, and See-Kiong Ng, from Singapore's Institute for Infocomm Research, and the National University of Singapore, Singapore.


FIGURE 10. Heat map and hierarchical cluster analysis of the precision for the gold standard and the five submissions for the protein–protein interaction subnetwork challenge. The precision is shown down to the 200th position, out of the 1,070 possible entries in the prediction list. See Figure 9 legend for an explanation of the color scale. (Color is shown in online version.)

This team discusses its best performing strategy in the chapters dedicated to the best performing teams, later in this volume.24 The approach taken by the best performing team integrates seven protein–protein interaction prediction techniques, including domain–domain interactions, interaction motifs, paralogous interactions, Gene Ontology–based protein function similarity, and graph theoretic techniques. Using currently known physical protein–protein interactions from the BioGRID database as a training set, Team 1 trained each of their methods, and then integrated those methods into their prediction framework.

We asked the remaining teams to voluntarily share their methodologies with us, so that we could also extract lessons from the strategies that performed sub-optimally. Only Team 4 responded. Team 4 described a protein–protein interaction pair by a total of 28 features representing 17 distinct groups of biological data sources. These data sources could be divided into four categories: direct experimental data sets (two-hybrid screens and mass spectrometry); indirect high-throughput data sets (gene expression, protein–DNA binding, biological function, biological process, protein localization, protein class, essentiality, etc.); sequence-based data sources (domain information, gene fusion, etc.); and features representing relationships between proteins on various graphs (e.g., 2-path on the two-hybrid graph). They applied a logistic regression classifier to predict whether a protein pair interacts or not. Their training set for this classification was derived from the BioGRID yeast PPI dataset.
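As a generic illustration of this type of feature-based pair classifier (a sketch under our own assumptions, not Team 4's code), a pair-level logistic regression can be trained on labeled pairs and used to rank candidate pairs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_candidate_pairs(train_X, train_y, test_X, test_pairs):
    """Train on pairs with known labels (1 = interacts, 0 = does not),
    then rank the candidate pairs by their predicted interaction probability."""
    model = LogisticRegression(max_iter=1000).fit(train_X, train_y)
    scores = model.predict_proba(test_X)[:, 1]
    order = np.argsort(-scores)          # most confident predictions first
    return [(test_pairs[i], float(scores[i])) for i in order]
```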

It is hard to ascertain why Team 1 did better than Team 4. It is possible that the use of the full complement of protein–protein interactions from BioGRID by Team 4, rather than the pruned version (leaving out the nonphysical interactions) used by Team 1, may have harmed Team 4's performance.

The Five-Gene Network Challenges

A total of 11 teams submitted predictions to the five-gene network challenges, and each of the 12 categories received at least one submission.


TABLE 4. Summary of the participation in the different categories of the five-gene network challenge^a

                            Directed                                       Undirected
                            Unsigned        Excitatory     Inhibitory      Unsigned        Excitatory     Inhibitory
FiveGeneNet1 (qPCR data)    (5) non-sgnif   (6) Team 0     (6) non-sgnif   (4) Team 1      (3) non-sgnif  (3) non-sgnif
FiveGeneNet2 (Chip data)    (1) non-sgnif   (1) non-sgnif  (1) non-sgnif   (2) non-sgnif   (1) non-sgnif  (1) non-sgnif

^a FiveGeneNet1 (qPCR data) and FiveGeneNet2 (Chip data) correspond to two different subchallenges. Shaded cells are discussed further in the text. In each table entry, the number between parentheses indicates the number of participants in the corresponding category, and the text indicates either the best performing team or that no submission attained a significant score (non-sgnif).

TABLE 5. Performance of submissions to the FiveGeneNet1 subchallenge, Undirected Unsigned category, qPCR data modality^a

          Precision at nth correct prediction
Team      1st     2nd     5th     AUPR    P-value AUPR    AUROC   P-value AUROC   Score
Team 1    1       1       0.71    0.87    0.008           0.87    0.007           4.25
Team 2    1       0.67    0.83    0.76    0.042           0.86    0.011           3.35
Team 3    1       1       0.52    0.79    0.036           0.70    0.102           2.44
Team 4    0.50    0.50    0.50    0.48    0.55            0.55    0.36            0.71

^a The columns are as in Table 2.

TABLE 6. Performance of submissions to the FiveGeneNet1 subchallenge, Directed Signed Excitatory category, qPCR data modality^a

          Precision at nth correct prediction
Team      1st     2nd     5th     AUPR    P-value AUPR    AUROC   P-value AUROC   Score
Team 0    1       1       0.20    0.72    0.0036          0.79    0.021           4.1
Team 1    1       0.29    0.29    0.43    0.096           0.70    0.085           2.09
Team 4    1       0.50    0.20    0.41    0.108           0.59    0.265           1.54
Team 3    1       0.50    0.20    0.41    0.108           0.59    0.265           1.54
Team 5    1       0.25    0.20    0.35    0.189           0.49    0.513           1.01
Team 6    0.17    0.19    0.20    0.17    0.718           0.45    0.616           0.35

^a The columns are as in Table 2.

The breakdown of participants per challenge is summarized in Table 4. The bulk of the submissions were in the FiveGeneNet1 (qPCR data) subchallenge. However, only two categories (those shaded in Table 4) had statistically significant scores. Interestingly, of the FiveGeneNet2 (chip data) subchallenge submissions, none of the participants made predictions for edges involving genes in the five-gene network (recall that these 5 genes were hidden among the 588 genes whose gene expression data was provided): none of the submissions contained any of the network genes.

The details of the two categories that yielded significant predictions in this challenge (shaded in Table 4) are listed in Tables 5 and 6. Predictions were assessed as described in earlier sections.


FIGURE 11. Heat map and hierarchical cluster analysis of the precision for the five-gene network challenge. (A) The gold standard and the four submissions for the Undirected Unsigned category. (B) The gold standard and the six submissions for the Directed Excitatory category. See Figure 9 legend for an explanation of the color scale. (Color is shown in online version.)

From Table 5 we can see that in the Undirected Unsigned category of the FiveGeneNet1 (qPCR data) subchallenge, two of the four teams had statistically significant submissions. Even though Team 1 was clearly the best performing team, Team 2 was not far behind. We can see some of these teams' performance details in Figure 11A, which depicts a hierarchical clustering of precision. While the second interaction predicted by Team 2 was not correct, the 3rd to 7th gene-to-gene connections were correct. Indeed, at depth k = 7, this team had a precision of 86%. However, Team 1 was more consistent in its performance, which gave it the edge.

The Directed Excitatory category of the FiveGeneNet1 (qPCR data) subchallenge had a clear best performing team, as can be seen in the heat map and dendrogram of Figure 11B depicting a hierarchical clustering of the precision. Team 0 correctly predicted the first three out of the 5 excitatory connections in this network and clustered tightly with the gold standard precision vector.

The best performing teams therefore were (team leaders are underlined):

• Team 0: Angela Baralla, Wieslawa Mentzen, and Alberto de la Fuente, from the CRS4 Bioinformatica, Universita degli Studi di Sassari, Italy

• Team 1: Daniel Marbach, Claudio Mattiussi, and Dario Floreano, from the Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland.

These teams discuss their best performing strategies in their respective chapters later in this volume.25,26 Even though the methods used by these two teams were different in many ways, it is interesting to highlight the important similarities that these methods share.

The approach taken by Team 1 was a very innovative one. They proposed a variant of


a genetic algorithm that encoded the logic of gene regulatory networks, thus integrating prior knowledge into the reverse engineering procedure itself. In this way, the search was biased toward biologically plausible solutions. They ran their Analog Genetic Encoding (AGE) algorithm 50 times and estimated the reliability of predicted edges by the frequency with which a particular connection occurred in the ensemble of predictions.

Team 0 modeled the kinetics of the system with a set of ordinary differential equations whose parameters represented the network connections. They determined the parameters using nonlinear optimization to fit the dynamics of their model to different combinations of the data provided in the challenge. Independent fits to these different combinations provided many results in an ensemble of predictions that were averaged to yield the final parameters from which the connections were inferred.
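As a greatly simplified sketch of this kind of kinetic-model fitting (a linear model fitted by least squares, with derivatives estimated by finite differences; this is our own simplification, not Team 0's nonlinear procedure), one could write:

```python
import numpy as np

def infer_connections_from_time_course(t, X):
    """Approximate dx/dt ~= A x + b by estimating derivatives with finite
    differences and solving a linear least-squares problem.
    t: array of time points; X: (time points x genes) expression matrix.
    Returns A, read as gene-to-gene influences (A[i, j]: effect of gene j on gene i)."""
    n_genes = X.shape[1]
    dXdt = np.gradient(X, t, axis=0)                   # estimated time derivatives
    design = np.hstack([X, np.ones((X.shape[0], 1))])  # columns: x_1..x_n, constant term
    coef, *_ = np.linalg.lstsq(design, dXdt, rcond=None)
    return coef[:n_genes].T
```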

We asked the remaining teams to voluntarily share with us what their methodology was, so that we could also extract lessons from the strategies that performed sub-optimally. Three teams responded. One of the responding teams, which participated in the Directed Unsigned category of FiveGeneNet1 (qPCR data) (summary tables not provided for this category), obtained AUPR and AUROC P-values of 0.99 and 0.88. They suggest that the poor performance of their method may have been due to the fact that they did not focus on pairwise connections. The Bayesian network with continuous variables method they used learns the best set of parents for each variable. Thus, if a variable has more than one parent, it is difficult to assign a confidence to the edge between that variable and each parent, independently of the other parents. Team 4 used Random Forests composed of 2,000 decision trees, ranking how informative the other genes were to a particular gene. This was done by fitting a regression model for each gene in terms of the values (past and present) of the other genes. Directions were resolved by using information theory criteria. Team 5 used the time course PCR data and applied a recursive L1 optimization to obtain a sparse model.

Let us point out an important difference between the best performing teams and the other teams. The best performers modeled the five-gene network system using the knowledge that the system consisted of a gene regulatory network. Even though the details differ, both best performers used kinetic models to represent the evolving system and found the best models that fit the data. The poorly performing methods, instead, used Bayesian network inference and regression models and attempted to use probability theory to capture a causal kinetics that was probably better modeled by rate equations.

The In Silico Network Challenges

A total of 25 teams submitted predictions to the in silico network challenges. Not all of the 18 categories received submissions, however. The breakdown of participants per challenge is summarized in Table 7, where the number of teams that submitted in each category appears between parentheses, and each table entry shows the best performing team(s), if any.

The entries where there are two teams separated by a slash sign are the result of an interesting turn of events. After the submission deadline, Team 2 participants realized that they had a formatting error in their submission of directed networks. In the submission format, the source of each network edge had to be indicated in the first column and the destination in the second column. Team 2 submissions had the order inverted, causing the directionality of their edges to be the reverse of their intended directionality. That was not a problem for the undirected networks, where they obtained a remarkable score of 61.3 in the unsigned category. However, the mistake was devastating for the directed network submission. Team 2 noticed this mistake after the results and best performing teams had been announced.


TABLE 7. Summary of the participation in the different categories of the In Silico network challenges^a

             Directed                                                    Undirected
             Unsigned            Excitatory          Inhibitory          Unsigned       Excitatory   Inhibitory
InSilico1    (7) Team 2s/Team 1  (8) Team 2s/Team 1  (8) Team 1/Team 2s  (2) Team 2     No Subm      No Subm
InSilico2    (6) Team 3          (6) non-sgnif       (6) Team 1          (2) non-sgnif  No Subm      No Subm
InSilico3    (1) Team 1          (2) Team 4          (1) non-sgnif       No Subm        No Subm      No Subm

^a The rows indicate the InSilico1, 2, and 3 subchallenges. Shaded cells are discussed further in the text. In each table entry, the number between parentheses indicates the number of participants in the corresponding category, and the text indicates either the best performing team(s), that no submission attained a significant score (non-sgnif), or that there were no submissions (No Subm).

TABLE 8. Performance of submissions to the InSilico1 network challenge, Directed Unsigned category^a

          Precision at nth correct prediction
Team      1st     2nd     5th     20th    AUPR    P-value AUPR     AUROC   P-value AUROC   Score
Team 1    0.5     0.5     0.42    0.32    0.20    1.05 × 10^−16    0.813   1.1 × 10^−22    37.95
Team 3    1       0.18    0.26    0.18    0.12    1.88 × 10^−9     0.72    1.1 × 10^−12    20.70
Team 5    1       1       0.62    0.06    0.094   4.48 × 10^−6     0.56    0.014           7.19
Team 6    0.5     0.25    0.36    0.12    0.085   3.24 × 10^−5     0.57    7.3 × 10^−3     6.63
Team 7    0.5     0.4     0.13    0.067   0.057   0.011            0.53    0.12            2.84
Team 8    0.02    0.026   0.033   0.038   0.038   0.72             0.49    0.57            0.39
Team 2    0.018   0.033   0.037   0.036   0.038   0.72             0.49    0.65            0.33
Team 2s   1       1       1       1       0.58    1.58 × 10^−54    0.82    3.06 × 10^−24   77.3

^a The columns are as in Table 2.

They requested that we evaluate the same directed submissions they had sent, with the first and second columns swapped. This constituted a new (unofficial) entry, which we denoted Team 2s (s for swapped). Team 2s performed better than the official best performer in two of the directed categories for the InSilico1 challenge. However, Team 2s was not considered an official participant of DREAM2, and wherever Team 2s shares an entry with Team 1 in Table 7, Team 1 was the official best performer.

The Undirected Signed categories didn't receive any submissions, and the Undirected Unsigned categories received only four submissions in total, only one of which was significant. It seems that, except for Team 2, teams with algorithms to predict directed networks were not interested in making predictions for the undirected categories.

The bulk of the submissions were in the directed categories for the InSilico1 and InSilico2 challenges. Even within the directed categories, participants seem to have been reluctant to participate in the more realistic and difficult InSilico3 challenge. However, there were submissions with significant scores in some of the InSilico3 directed categories.

The scores for all the teams in the Directed Unsigned categories of the InSilico1 and InSilico2 challenges (those shaded in Table 7) are listed in Tables 8 and 9. Team 2s had an overall score of 77.3 in the InSilico1 Directed Unsigned category, and its first 33 predictions were actual edges in the InSilico1 network. Team 1 also had an excellent score of 37.95 for the same category, but its precision was not as impressive as that of Team 2s. Figure 12A shows a heat map of the precision as we go down the list of predicted edges. Team 2s clearly clusters with the gold standard, showing an incredible consistency in its predictions.


TABLE 9. Performance of submissions to the InSilico2 network challenge, Directed Unsigned category^a

          Precision at nth correct prediction
Team      1st     2nd     5th     20th    AUPR    P-value AUPR     AUROC   P-value AUROC   Score
Team 3    1       0.67    0.71    0.46    0.26    2.64 × 10^−22    0.75    2.4 × 10^−16    37.2
Team 7    1       1       1       0.14    0.15    4.87 × 10^−12    0.66    3.9 × 10^−8     18.7
Team 5    0.5     0.67    0.56    0.11    0.11    5.6 × 10^−8      0.60    3.2 × 10^−4     10.7
Team 1    1       0.5     0.12    0.04    0.059   0.01             0.49    0.57            2.2
Team 2    0.14    0.25    0.077   0.045   0.049   0.084            0.513   0.32            1.57
Team 8    0.038   0.04    0.039   0.04    0.040   0.52             0.50    0.50            0.58

^a The columns are as in Table 2.

FIGURE 12. Heat map and hierarchical cluster analysis of the precision for the In Silico network challenge. (A) The gold standard and the eight submissions for the InSilico1 Directed Unsigned category. (B) The gold standard and the six submissions for the InSilico2 Directed Unsigned category. The precision is shown down to the 200th position, out of the 2,450 possible entries in the prediction list. See Figure 9 legend for an explanation of the color scale. (Color is shown in online version.)


Team 1 also shows consistency in having its correct predictions toward the top of the list. (This can be seen in the tendency for the Team 1 precision to have a darker red coloration in the top predictions.)

In the InSilico2 challenge, it was Team 3 that topped the scores. Interestingly, Teams 1 and 2 did not score as well in this challenge as in the InSilico1 challenge. Recall that the difference between these two challenges resides in the random topology of InSilico1 and the scale-free topology of InSilico2. This is a clear demonstration of the influence of topology on the reconstruction of the networks, with some algorithms better suited to a particular global topological structure.

Overall there were two teams, Team 1 and Team 2 (and its swapped version 2s), that had an excellent performance throughout all the categories. Team 3 performed very well in the InSilico2 Directed Unsigned category, and Team 4 had a modest but nonetheless best performing score of 6.3 in the InSilico3 Directed Excitatory category.

The best performer teams were as follows (team leaders are underlined):

• Team 1: Alan Scheinine, Wieslawa Mentzen, G. Fotia, E. Pieroni, F. Maggio, G. Mancosu, and Alberto de la Fuente, CRS4 Bioinformatica, Italy

• Team 2 (and 2s): Mario Lauri, Francesco Iorio, and Diego di Bernardo, Telethon Institute of Genetics and Medicine, University of Naples "Federico II," and University of Salerno, Italy.

• Team 3: Mika Gustafsson, Michael Hornquist, Jesper Lundstrom, Johan Bjorkegren, and Jesper Tegner, from Linkoping University and Karolinska Universitetssjukhuset, Sweden

• Team 4: Tejaswi Gowda, Sarma Vrudhula, and Seungchan Kim, from Arizona State University, Tempe, Arizona, and the Translational Genomics Research Institute, Phoenix, Arizona.

These teams discuss their best performing strategies in their respective chapters later in this volume.27–30 Team 1 used a linearized dynamical system model under perturbations. For the InSilico1 and 3 challenges only the heterozygous knock-out data were considered. For InSilico2 only the time-series data were used. For InSilico3, many edges were assumed to be nonexistent based on biological reasoning. In all cases a matrix had to be inverted to obtain the connectivity networks. Team 2 used their NIR (Network Identification by Reverse-engineering) method, based also on a linearized dynamical system model under perturbations, but with the constraint that each gene can be controlled by, at most, ten other genes. Like Team 1, Team 2 only used the heterozygous knock-out data. Team 3 also described the underlying system as a set of differential equations in which the dependence of a gene on another gene was mediated by a function chosen from a library of functional forms. The connectivity assignment, parameter estimation, and selection of nonlinear transfer functions were performed by least angle regression and cross-validation on all the available data. Team 4 modeled the interaction between each gene and its controlling genes in terms of a so-called threshold function. This is a function that assumes the value 1 if the linear combination of the levels of the controlling genes is above a threshold and 0 otherwise. The gene abundance was discretized into four levels. The weights of the linear combination were found by optimization against the time course data, and the connectivity was computed by adding only genes that reduce the errors in an iterative procedure.
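A minimal sketch of the generic linearized steady-state idea behind the first two approaches (our own simplification, not the code of Teams 1 or 2): if A is the influence matrix, X collects the steady-state expression changes with one column per knock-down, and P collects the applied perturbations, then A X + P ≈ 0, and A can be recovered by matrix inversion or least squares.

```python
import numpy as np

def infer_influence_matrix(X, P):
    """X: (genes x experiments) steady-state expression deviations from wild type,
    one column per perturbation experiment; P: (genes x experiments) applied
    perturbations. Assuming A @ X + P = 0 at steady state, recover A by
    least squares (equivalent to -P @ inv(X) when X is square and invertible)."""
    A_T, *_ = np.linalg.lstsq(X.T, -P.T, rcond=None)
    return A_T.T   # A[i, j]: influence of gene j on gene i
```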

We asked the remaining teams to voluntarily share their methodology with us, so that we could also extract lessons from the strategies that performed sub-optimally. Two teams responded. Team 7 used a method that assumed a nonlinear differential equation model of mRNA transcription and used regularized kernel regression to assess the degree to which a set of potential regulators is able to explain the expression of each gene.


TABLE 10. Summary of the participation in the different categories of the Genome Scale network challenges^a

Genome scale challenge   Directed Unsigned   Directed Excitatory        Directed Inhibitory
Regular submission       (5) Team 1          (2) Team 2 – score: 9.6    (2) non-sgnif
w/Dimitris effect        N/A                 (3) Team 1 – score: 26.4   (3) Team 1 – score: 26.5

^a The row labeled "w/Dimitris effect" explores the result of using the Directed Unsigned submission of Team 1 as either Directed Excitatory or Directed Inhibitory. The shaded cell is discussed further in the text. In each table entry, the number between parentheses indicates the number of participants in the corresponding category, and the text indicates either the best performing team, that no submission attained a significant score (non-sgnif), or the score if the best performing team is not discussed in Table 11.

Note that Team 7 obtained quite a significant score in the InSilico2 Directed Unsigned category. Figure 12B also shows that the precision for its top 5 predicted interactions was 100%. A second team shared their method as well. It used a regression model on the time course data and applied a recursive L1 optimization to obtain a sparse model.

It seems that Teams 1 and 2, which had the winning strategies, managed to capture the in silico dynamics with linear models and perturbations. The added sparsity condition of Team 2's NIR algorithm seems to have favored the reconstruction of the random network topology, but not the scale-free topology, presumably because it might not have captured the dynamics at the hub nodes.

The Genome-Scale Network Challenge

A total of six teams submitted predictions to the Genome Scale network challenges. The breakdown of participants per subchallenge is summarized in Table 10, where the number of teams that submitted in each category appears between parentheses, and each table entry shows the best performing team, if any. Each of the three categories in Table 10 had at least two submissions. The Directed Unsigned category (shaded in Table 10) was the most populated one, with five submissions.

The row labeled Regular Submission was the one announced to the participants. After the results of the predictions were posted, however, the best performer of the Directed Unsigned category, Team 1 (Prof. Dimitris Anastassiou), made a curious request. Given that he hadn't submitted predictions for the signed categories, he asked whether we could evaluate his unsigned submission as the prediction to be used for the Directed Excitatory and Directed Inhibitory categories. After we did the evaluations, it turned out that Team 1 was the best performing team for the signed categories as well! The fact that an unsigned submission performed very well in a signed category was sufficiently perplexing to give it a name. We called it the "Dimitris effect," for the originator of the request. The row "w/Dimitris effect" indicates the corresponding scores.

For the regular submission, the Directed Excitatory category best performer was Team 2, with a score of 9.6 (see Table 10). Adding Team 1's submission with the Dimitris effect brought the best performer score up to 26.4. Likewise, it can be seen in Table 10 that for the regular submission, the Directed Inhibitory category had no significant prediction (out of the two submissions). Again, scoring Team 1's submission with the Dimitris effect brought the best performer score up to 26.5.

Table 11 shows the details of the submissions to the Directed Unsigned category. While the P-values for the AUROC are very small (and the P-values for the AUPR couldn't be extrapolated, and are therefore upper bounds based on sampling), the precision starts to degrade very quickly after about the 300th prediction, even for Team 1, as can be seen in Figure 13.


TABLE 11. Performance of submissions to the Genome Scale network challenge, Directed Unsigned category^a

          Precision at nth correct prediction
Team      1st   2nd   5th     20th    100th   500th   AUPR    P-value AUPR   AUROC   P-value AUROC   Score
Team 1    1     1     0.833   0.8     0.82    0.102   0.097   <10^−5         0.61    3.0 × 10^−36    40.5
Team 3    1     0.4   0.556   0.769   0.654   0.045   0.063   <10^−5         0.575   6.4 × 10^−21    25.2
Team 4    1     1     1       0.909   0.699   0.069   0.08    <10^−5         0.571   8.8 × 10^−20    24.1
Team 5    1     1     1       0.952   0.351   0.045   0.057   <10^−5         0.56    1.9 × 10^−14    18.7
Team 6    1     1     0.556   0.278   0.086   0.028   0.03    <10^−5         0.53    9.6 × 10^−6     10.0

^a The columns are as in Table 2.

FIGURE 13. Heat map and hierarchical cluster analysis of the precision for the gold standard and the five submissions for the Directed Unsigned category of the Genome Scale Network Challenge. The precision is shown down to the 3,102nd position (the number of elements in the gold standard positive set) in the prediction list. See Figure 9 legend for an explanation of the color scale. (Color is shown in online version.)

Also in Figure 13, we can see that Teams 3 and 4 have reasonable precision up to the first 150 predictions, after which the precision degrades considerably. Indeed, no team seems to have a significantly better than random precision after the 500th prediction. This indicates a need to prune out false positives. Figure 13 shows that the precision of Team 4 is better than that of Team 3, even though the latter scores better than the former in Table 11. This is due to the fact that the AUROC of Team 3 is marginally better than that of Team 4 and that we couldn't resolve the AUPR P-values of these submissions beyond the upper bound of 10^−5. It is very likely that if our AUPR P-value had been more precise, Team 4 would have scored higher than Team 3, as suggested by its clustering with Team 1 in Figure 13.

The best performer team was (the team leader is underlined):

• Team 1: John Watkinson, Kuo-ching Liang, Xiaodong Wang, Tian Zheng, and Dimitris Anastassiou, all from Columbia University.

This team describes its best performing strategy in a chapter later in this volume.31 The strategy of Team 1 was based on the context likelihood of relatedness (CLR) algorithm,20


enhancing it with additional complementary information using the information-theoretic measure of synergy (which makes use of three-variable associations) and assigning a score to each ordered pair of genes measuring the degree of confidence that the first gene regulates the second.

We asked the remaining teams to voluntarily share with us what their methodology was, so that we could also extract lessons from the strategies that performed sub-optimally. Team 3, whose performance was the second best, responded. They used their LARS algorithm (a linear regression with a sparsity-promoting prior/penalty term) and used regularization to minimize the number of genes regulated by each transcription factor.

A take-home lesson is that the measure of synergy used by Team 1, which goes beyond pairwise mutual information to include the state of a third gene that might enhance the interaction between the pair, seems to have given Team 1 an extra push to achieve the best performance in this challenge.
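As a rough sketch of the CLR scoring idea that Team 1 built upon (a simplified version under our own assumptions, without the synergy term), each mutual-information value can be z-scored against its row and column backgrounds:

```python
import numpy as np

def clr_like_scores(mi):
    """mi: symmetric (genes x genes) mutual-information matrix.
    Each value is z-scored against the background of its row and its column,
    negative z-scores are clipped to zero, and the two are combined."""
    z_row = (mi - mi.mean(axis=1, keepdims=True)) / mi.std(axis=1, keepdims=True)
    z_col = (mi - mi.mean(axis=0, keepdims=True)) / mi.std(axis=0, keepdims=True)
    z_row, z_col = np.clip(z_row, 0, None), np.clip(z_col, 0, None)
    return np.sqrt(z_row ** 2 + z_col ** 2)
```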

Conclusions

The lessons learned in DREAM2 can be divided into two classes: lessons for future development of DREAM challenges, and lessons on the current state of affairs of reverse engineering methods. We will address the former class first.

At the end of the DREAM2 conference, held at the New York Academy of Sciences on December 2 and 3, 2007, where the results from the challenges were discussed, there was a lively discussion in which a number of different ideas were suggested.

An important, if somewhat philosophical, comment was whether we should predict a network model, which is a concept rather than an observation, or a measurement related to an observation (e.g., gene expression or fluorescence intensity). In the same vein as the equations of mathematical physics do not "hide" within the physical phenomena they describe, the networks we proposed as gold standard models for inference in the DREAM2 challenges do not truly "exist" in nature. Networks are a conceptual representation of cellular interactions. By giving an element taken from the realm of ideas an absolute reality, we are upholding the tenets of the Platonic idealistic philosophy. On the other side is the positivistic philosophy, which postulates that scientific knowledge can only be based on actual measurements. In other words, the positivist would posit that instead of predicting networks, DREAM challenges should be about predicting experimental results. This is an interesting discussion that makes the point that the original emphasis on network inference may be excessive. Future DREAM challenges will thus de-emphasize network inference in favor of methods that may use networks (or other models) to make predictions about observations that can be experimentally measured.

Another point of discussion was the anonymity of the teams. There are pros and cons to disclosing the identity of all teams, but we prefer to stick to the Hippocratic dictum: do no harm. It is possible that revealing the identity of poorly performing teams could hurt their members in different ways (e.g., funding, tenure, or professional credibility). Thus, some potential participants may be deterred from participating because of the possible repercussions. What we need to know, in any case, is what method a team used, to be aware of its limitations. There are ways to disclose a methodology without disclosing its proponents, but the discussion is still open.

Some people felt that the design of our challenges ignored years and years of molecular biology. These critics propose to create challenges whose outcomes are biological predictions that extend and supersede what is already known. We believe that some challenges actually did that. For example, the BCL6 challenge asked for targets that are not in the literature, and the work describing the five-gene synthetic network was recently accepted for publication in the journal Cell. However, the main


objective of the DREAM challenges is to assess the ability of our methods to do what we claim they do, for example, to find the hidden network or to predict measurable results. Notwithstanding, it may be possible to create challenges that may both address the DREAM aims and generate novel biological hypotheses, emphasizing those predictions that have the greatest consensus among participants. We are working on it, but new ideas will be welcomed.

At DREAM2 some colleagues noted that calling ourselves "reverse engineers" implies that there is an "engineer" at work designing the biological systems. In the preface to a previous volume ("Reverse Engineering Biological Networks," Annals of the NYAS, volume 1115), we explained why we chose those terms:

"In the process of reverse engineering biological circuits, we are trying to 'read the mind' of the unintelligent designer (evolution) that engineered the system. . . . We favor the terms reverse engineering because the knowledge it generates can then be used to forward engineer biological systems (i.e., to design novel working biological systems from first principles), which constitutes the basis for the budding field of synthetic biology."

We also confess that we like the "ring" of reverse engineering.

There were comments to the effect that the number of categories in Challenges 3, 4, and 5 should be reduced. As Alberto de la Fuente put it in the DREAM discussion forum:8 "Gene networks are directed networks, so we should infer directed networks; there is no need for an undirected category." And then he added: "there should be one overall signed category rather than decomposing into excitatory and inhibitory." The reason we decomposed it into excitatory and inhibitory categories was to make the evaluation over a binary prediction. But we agree that in this case less is more, and future challenges will have fewer categories.

Let us now focus on the lessons learned from the submissions to the challenges. One important observation can be made from the BCL6 challenge submissions: of the 11 participating teams, the predictions of 6 of them were no better than a coin toss. This conclusion is worrisome. It means that many researchers who are experts in target identification may base their conclusions on potentially wrong premises. We are pleased that the team leader of Team 6 in Table 2 expressed appreciation for how much they had learned from having failed to produce a significant prediction in Challenge 1. This is an important call for caution. On the brighter side, we believe that the three best performing teams showed proof of principle that it is possible to infer transcriptional targets accurately. It is very interesting to note that the aggregate predictions of these three teams performed much better than even the best of them. To aggregate the three team predictions, we did the following. Each team gave each presumed target a rank in the prediction list. For each gene, we summed the ranks of the three teams, yielding a list of genes with new ranks. Reordering this list according to the rank sum, we obtained a new prediction. The results were fantastic. The best of the three AUPRs was 0.72, with a P-value of 1.4 × 10^−12; the new prediction yielded an AUPR of 0.85 and a P-value of 3.85 × 10^−17. The best of the three AUROCs was 0.85, with a P-value of 3.3 × 10^−14; the new prediction yielded an AUROC of 0.93 and a P-value of 3.61 × 10^−20. The best of the three overall scores was 24.99; the new prediction yielded a score of 35.86. Finally, the team with the longest run of correct predictions at the top of the list had 10 correct predictions before the first wrong prediction; the new prediction had 19 correct predictions before the first wrong prediction. Clearly, this shows that the "intelligence" behind the three methods can be easily aggregated to produce an integrative approach that is better than any method in isolation. This calls for a reverse-engineering-by-consensus approach that has not yet been explored by the community.
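The aggregation itself is a few lines of code; the sketch below assumes all three lists rank the same 200 candidate genes (list names are hypothetical).

```python
def aggregate_by_rank_sum(ranked_lists):
    """Each input list ranks the same candidate genes from most to least confident.
    A gene's consensus rank is the sum of its ranks across the lists; re-sorting by
    that sum (smallest first) yields the aggregate prediction."""
    rank_sums = {}
    for ranking in ranked_lists:
        for rank, gene in enumerate(ranking, start=1):
            rank_sums[gene] = rank_sums.get(gene, 0) + rank
    return sorted(rank_sums, key=rank_sums.get)

# consensus = aggregate_by_rank_sum([team1_list, team2_list, team3_list])
```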

The upbeat message arising from transcriptional predictions, unfortunately, does not translate to the protein–protein interaction challenge. Here, four out of five participating teams had non-significant scores (i.e., no better than random). Even the best performing team, while statistically significant, had a


disappointingly low score. This is probably little surprise to the field of protein–protein interaction inference, which has struggled with a high rate of both false positive and false negative errors arising from the technology involved. These may contaminate both the challenge gold standards and the training sets used by the participants; or else, the computational methods are not yet ready for prime time in this field. Most likely, the poor results arise from a combination of the two. In particular, we feel that false negative errors from high-throughput experimental techniques may still be too high to produce accurate gold standard negative sets. Additionally, protein–protein interactions may provide fewer clues for computational methods. In any case, it seems that we still have much to learn in the area of protein–protein interaction prediction.

There were many non-significant categories in the five-gene network challenge, and the scores of those who performed best were modest. It is surprising that a simple network of five genes could be so hard to reconstruct. However, the qPCR data available for the reconstruction were scant and probably noisy. M. P. Cosma, D. di Bernardo, and collaborators have recently written an interesting paper15 on this synthetic biology network, with more data of better quality. It would be interesting for the teams that participated in this challenge to revisit it with the new data. In the chip-data category, the challenge was akin to finding a needle (the five-gene network) in a haystack (the other 583 genes), and no inferred connection contained any of the genes in question. However, there were too few teams in this challenge to extract any meaningful conclusions.

The InSilico1 and InSilico2 network prediction challenges had high participation, but not so the more realistic InSilico3. It seems to be the case that there are fewer methods that deal with the more complicated and realistic setting that the InSilico3 challenge presents. This is clearly a niche that needs to be filled. One clear lesson learned from Challenge 4 is that one size doesn't fit all. Topology is very important in the ability of a method to reconstruct the network, as the team that did so well in the Erdos-Renyi topology performed very poorly in the scale-free topology network. Challenge 4 Team 2 clearly shone in the InSilico1 challenge, most likely because the data fit the assumptions of the algorithm. This is a good lesson to learn. However elegant an algorithm is, it needs to be designed using all the available biological knowledge of the system. Statistical methods of reconstruction fared less well than mechanistic methods using rate equations, probably because the latter were a better match to the nature of the data.

The genome scale challenge can be seen through two lenses. On the one hand, it is encouraging that all five teams participating in the Directed Unsigned category obtained significant scores. On the other hand, Figure 13 shows that even the best performer is only scratching the surface when it comes to precision: precision degraded to close to random levels after the first 300 interaction predictions, a mere 10% of the 3,102 true positive edges. Most likely, some gene interactions can be dissected only under context-specific conditions. Thus, the imprecision cannot be completely accounted for by the algorithms. In other words, the gold standard was much richer than the experimental sets provided for the inference. A curious phenomenon discovered was the "Dimitris effect." The prediction of Team 1, which was designed for unsigned predictions without any attempt to infer the signs of the connections, fared better in the signed categories than the methods created with the intention of discovering the sign of the interaction. We do not fully understand why this was the case. But it is very likely that the ability of Team 1 to predict transcription factor–target interactions was better than that of Team 2, thus compensating for the lack of consideration of the signs in the method.

There is much to mine in the submitted data, and that also requires a community effort. All the predictions from all the teams will be posted on the DREAM website (http://wiki.c2b2.columbia.edu/dream/) and will be


publicly available. It is plausible that a meta-analysis of the whole prediction set, beyond what we did in this chapter, might provide us with a few additional ideas to improve our algorithms. As much as the "reverse engineers" created their methods of network inference, those of us who are in the business of creating well-posed challenges have a long way to go, and many more lessons to learn. We are clearly at a time when methods proliferate, with some faring well and some faring rather poorly. So the question remains: when it comes to reverse engineering biological circuits, is the glass half full or half empty?

Acknowledgments

Many people contributed to the DREAM project and to DREAM2 in particular. Thanks to J. Jeremy Rice, with whom GS first dreamt of DREAM on a train trip. Thanks also to Fritz Roth, Joel Bader, Peicheng Du, Guillermo Cecchi, and Ajay Royyuru for fruitful discussions, and to Dan Gallahan for his encouragement. We are grateful to the labs of Andrea Califano, Riccardo Dalla Favera, Marc Vidal, Maria Pia Cosma, Pedro Mendes, and Tim Gardner, who graciously provided the data for the challenges. We are indebted to Kai Wang, Adam Margolin, Haiyuan Yu, Diego di Bernardo, Irene Cantone, and Boris Hayete for invaluable help in the curation of the challenges. The DREAM project would not exist without the teams, many of them anonymous, participating in the challenges. They put time and brains into solving the reverse engineering problems. We hope the fruit of their efforts will benefit the whole reverse engineering community. Financial support for DREAM from the IBM Computational Biology Center and the NIH Roadmap Initiative, through the Columbia University Center for Multiscale Analysis Genomic and Cellular Networks (MAGNet), is acknowledged.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix

Precision Versus Recall Curve Beyond the Maximum Recall Value of a Set of Predictions

Suppose we have a sample space with T elements (in our case, possible edges between nodes in a graph). These elements belong to one of two classes: the class of positive examples (edges of the graph), with P elements, and the class of negative examples (edges absent from the graph), with N elements. Clearly T = P + N. The fact that we know the class to which each example belongs makes this sample space a gold standard.

A prediction is made in the form of an ordered list of L samples taken from our sample space. This list is ordered such that the examples at the top of the list are those for which we have the highest confidence of being in the positive class. We will assume that the whole list contains TP_L true positive predictions and FP_L false positive predictions (if the whole list with L entries were to be taken as a prediction of the positive set). Clearly L = TP_L + FP_L. We now add, in random order, the remaining T − L samples (on which no prediction was made) to the bottom of the original list of L examples. We want to compute the precision and recall corresponding to the prediction that the first k (k > L) samples in the resulting list are positive.

We first estimate the number of TP and FP resulting from adding k − L random examples. These examples were randomly sampled from a set of P − TP_L positives and N − FP_L negatives. Therefore the probability ρ that any of those k − L examples is a positive example is ρ = (P − TP_L)/(T − L), and the number ΔTP of true positives gained in the added samples (between samples L and k) follows a binomial distribution with parameters k − L and probability ρ, whose average is ⟨ΔTP⟩ = (k − L)ρ. The precision p(k) and recall r(k) at depth k in the list of predictions, therefore, are functions of the random variable ΔTP:

p(k) = (TP_L + ΔTP)/k ;   r(k) = (TP_L + ΔTP)/P .   (A1)

It follows that the average precision and recall at depth k are:

⟨p⟩(k) = [TP_L + (k − L)ρ]/k ;   (A2)

⟨r⟩(k) = [TP_L + (k − L)ρ]/P .   (A3)

By allowing k to span the range between L and T, the previous two equations give us the shape of the precision and recall curve, parameterized by k.

We can also write down an explicit function for the precision versus recall curve. Solving for k in Equation (A3) and replacing it in Equation (A2) we obtain

⟨p⟩ = ρP⟨r⟩ / [P(⟨r⟩ − r_L) + Lρ] ,   (A4)

where r_L = TP_L/P is the recall at depth L. Note that if L = 0 (no predictions submitted), then r_L = 0 and ⟨p⟩ = P/T, that is, a constant. In other words, the average precision-recall curve for a random prediction list is a constant and does not depend on the recall. In this case the integral of the area under the precision versus recall curve (AUPR for short, denoted A) is A = P/T.

In the general case, Equation (A4) shows that the precision versus recall curve, after the addition of T − L random predictions at the end of the original L predictions, has a hyperbolic shape and goes through the points (r_L, p_L) (where p_L = TP_L/L is the precision at depth L) and (1, P/T).

The slope of the curve for recall greater than r_L is

d⟨p⟩/d⟨r⟩ = (P/T − TP_L/L) × ρPTL / {(T − L)[Lρ + P(⟨r⟩ − r_L)]^2} ,

which clearly is negative (decreasing curve) if the precision at recall r_L is larger than random (P/T), or positive (increasing curve) if it is smaller than random. The added area under the curve, that is, the area under the precision-recall curve for recall above r_L, is

A = ∫_{r_L}^{1} ⟨p⟩ dr = ρ(1 − r_L) + ρ(r_L − Lρ/P) ln{[Lρ + P(1 − r_L)]/(Lρ)} .
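For reference, Equations (A2)–(A4) and the added-area formula translate directly into code; the sketch below assumes L > 0 and TP_L < P.

```python
import math

def expected_precision_recall(TP_L, L, P, T, k):
    """Expected precision and recall at depth k (L <= k <= T) after appending the
    T - L unranked items in random order (Eqs. A2 and A3)."""
    rho = (P - TP_L) / (T - L)            # chance that an appended item is a positive
    tp = TP_L + (k - L) * rho             # expected number of true positives at depth k
    return tp / k, tp / P

def added_area(TP_L, L, P, T):
    """Area under the expected PR curve for recall beyond r_L (the formula above)."""
    rho = (P - TP_L) / (T - L)
    r_L = TP_L / P
    return rho * (1 - r_L) + rho * (r_L - L * rho / P) * math.log(
        (L * rho + P * (1 - r_L)) / (L * rho))
```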

How to Integrate the Precision Versus Recall Curve to Find the AUC

A practical issue to address is how to interpolate between points in the precision versus recall curve when we try to compute the AUPR. Once we have decided on the appropriate interpolation, it will be straightforward to proceed with the integration of the curve.32 Each time we go down the list of predictions from item k to item k + 1, two things can happen:

(a) We gain an FP: FP_{k+1} = FP_k + 1. In this case, the number of TPs didn't change (TP_{k+1} = TP_k), and the PR curve moved from point k at (TP_k/P, TP_k/(TP_k + FP_k)) to point k + 1 at (TP_k/P, TP_k/(TP_k + FP_k + 1)). The area under the curve therefore didn't change: A_{k+1} = A_k.

(b) We gain a TP: TP_{k+1} = TP_k + 1. In this case, the number of FPs didn't change (FP_{k+1} = FP_k), and the PR curve moved from point k at (TP_k/P, TP_k/(TP_k + FP_k)) to point k + 1 at ((TP_k + 1)/P, (TP_k + 1)/(TP_k + FP_k + 1)). Assume that there is a continuous change in the number of TPs from TP_k to TP_k + 1, and call this continuous variable x. Between point k and point k + 1, we traverse the curve ((TP_k + x)/P, (TP_k + x)/(TP_k + FP_k + x)) for x between 0 and 1.

Calling r(x) = (TP_k + x)/P (r as in recall), we see that the PR curve can be parameterized by r as (r, rP/(rP + FP_k)) for r between TP_k/P and (TP_k + 1)/P.


FIGURE A1. Pseudo-code to determine the precision versus recall curves, as well as the ROC curves, of a set of complete predictions (L = T) or possibly incomplete predictions (L < T). (Color is shown in online version.)

Integrating this curve, we obtain that the area under the curve changes as:

A_{k+1} = A_k + ∫_{TP_k/P}^{(TP_k+1)/P} rP/(rP + FP_k) dr = A_k + (1/P)[1 − FP_k ln((k + 1)/k)] .

Note that if the interpolation between point k and point k + 1 were to be done linearly instead of hyperbolically, the resulting areas A_k would be

A_{k+1} = A_k + (1/P)[1 − FP_k/(k(k + 1))]   (linear interpolation case).

It is then clear that a linear interpolation would result in a more optimistic AUPR than the hyperbolic, and more accurate, functional form.

Determination of the Prediction Accuracy for DREAM2

Given the considerations of the previous sections of this Appendix, the pseudo-code to determine the precision versus recall curve, as well as the ROC curve, of a set of (possibly) incomplete predictions (i.e., L < T) is shown in Figure A1.
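Since Figure A1 is not reproduced here, the sketch below gives a minimal AUPR computation consistent with the hyperbolic update rules derived above; it is not the Figure A1 pseudo-code itself.

```python
import math

def aupr_hyperbolic(ordered_labels, P):
    """Area under the PR curve of a ranked prediction list, accumulated with the
    hyperbolic interpolation derived above. `ordered_labels` lists the predictions
    from most to least confident (True = gold-standard positive); P is the total
    number of positives in the gold standard."""
    area, TP, FP = 0.0, 0, 0
    for is_positive in ordered_labels:
        k = TP + FP                      # depth reached before this prediction
        if is_positive:
            if k == 0:
                area += 1.0 / P          # first prediction correct: precision 1 over recall 1/P
            else:
                area += (1.0 - FP * math.log((k + 1) / k)) / P
            TP += 1
        else:
            FP += 1                      # a false positive leaves the area unchanged
    return area
```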

References

1. The DREAM project: Assessing the accuracy of reverse engineering methods, NYAS e-briefing (http://www.nyas.org/ebriefreps/splash.asp?intEbriefID=534).


2. Rice, J.J. & G. Stolovitzky. 2004. Making the most ofit: Pathway reconstruction and integrative simulationusing the data at hand. Biosilico 2: 70–77.

3. Rice, J.J., A. Kershenbaum & G. Stolovitzky. 2005.Analyzing and reconstructing gene regulatory net-works. “Specialist review” to The Encyclopedia ofGenetics, Genomics, Proteomics and Bioinformat-ics. Jorde, Little, Dunn & Subramaniam, Eds. JohnWiley & Sons. Ltd. Chichester.

4. Uetz, P., L. Giot, G. Cagney, et al. 2000. Acomprehensive analysis of protein-protein interac-tions in Saccharomyces cerevisiae. Nature 403: 623–627.

5. Ito, T., T. Chiba, R. Ozawa, et al. 2001. A com-prehensive two-hybrid analysis to explore the yeastprotein interactome. Proc. Natl. Acad. Sci. USA. 98:4569–4574.

6. Gavin, A.C., P. Aloy, P. Grandi, et al. 2006. Proteomesurvey reveals modularity of the yeast cell machinery,Nature 440: 631–636.

7. Stolovitzky, G., D. Monroe & A. Califano. 2007. Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann. N. Y. Acad. Sci. 1115: 1–22.

8. DREAM Discussion Forum website, http://wiki.c2b2.columbia.edu/dream/discuss/.

9. Moult, J. 2006. Rigorous performance evaluation in protein structure modelling and implications for computational biology. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361: 453–458.

10. Jardin, F., P. Ruminy, C. Bastard & H. Tilly. 2007. The bcl6 proto-oncogene: a leading role during germinal center development and lymphomagenesis. Pathol. Biol. (Paris) 55: 73–83.

11. Margolin, A.A., T. Palomero, P. Sumazin, et al. 2009. ChIP-on-chip significance analysis reveals large-scale binding and regulation by human transcription factor oncogenes. Proc. Natl. Acad. Sci. USA 106: 244–249.

12. Basso, K., A.A. Margolin, G. Stolovitzky, et al. 2005. Reverse engineering of regulatory networks in human B cells. Nat. Genet. 37: 382–390.

13. Polo, J.M., P. Juszczynski, S. Monti, et al. 2007. Transcriptional signature with differential expression of BCL6 target genes accurately identifies BCL6-dependent diffuse large B cell lymphomas. Proc. Natl. Acad. Sci. USA 104: 3207–3212.

14. Fields, S. & O. Song. 1989. A novel genetic system to detect protein-protein interactions. Nature 340: 245–246.

15. Cantone, I., L. Marucci, F. Iorio, et al. A yeast synthetic network for in-vivo reverse-engineering and modelling assessment in systems and synthetic biology. Cell. In press.

16. Mendes, P., W. Sha & K. Ye. 2003. Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics Suppl. 2: 122–129.

17. The In-Silico-Network Challenges. Description (http://wiki.c2b2.columbia.edu/dream/challenges/dream2).

18. Faith, J.J., M.E. Driscoll, V.A. Fusaro, et al. 2008. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res. D866–D870.

19. Gama-Castro, S., V. Jimenez Jacinto, M. Peralta-Gil, et al. 2008. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res. 36(Database issue): D120–124.

20. Faith, J.J., B. Hayete, J.T. Thaden, et al. 2007. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5: e8.

21. Lee, W.H., V. Narang, H. Xu, et al. 2009. DREAM2 Challenge: Integrated multi-array supervised learning algorithm for BCL6 transcriptional targets prediction. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03755.x.

22. Nykter, M., H. Lahdesmaki, A. Rust, et al. 2009. A data integration framework for prediction of transcription factor targets: a BCL6 case study. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03758.x.

23. Vega, V.B., X.Y. Woo, H. Hamidi, et al. 2009. Inferring direct regulatory targets of a transcription factor in the DREAM2 Challenge. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03759.x.

24. Chua, H.N., W. Hugo, G. Lui, et al. 2009. A probabilistic graph-theoretic approach to integrate multiple predictions for the protein–protein subnetwork prediction challenge. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03760.x.

25. Marbach, D., C. Mattiussi & D. Floreano. 2009. Replaying the evolutionary tape: biomimetic reverse engineering of gene networks. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03944.x.

26. Baralla, A., W. Mentzen & A. de la Fuente. 2009. Inferring gene networks: dream or nightmare? Part 1: Challenges 1 and 3. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.04099.x.

27. Lauria, M., F. Iorio & D. di Bernardo. 2009. NIRest: a tool for gene network and mode of action inference. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03761.x.

28. Gustafsson, M., M. Hornquist, J. Lundstrom, et al. 2009. Reverse engineering of gene networks with LASSO and nonlinear basis functions. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03764.x.

29. Gowda, T., S. Vrudhula & S. Kim. 2009. Prediction of pairwise gene interaction using threshold logic. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03763.x.

30. Scheinine, A., W. Mentzen, G. Fotia, et al. 2009. Inferring gene networks: dream or nightmare? Part 2: Challenges 4 and 5. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.04100.x.

31. Watkinson, J., K.-C. Liang, X. Wang, et al. 2009. Inference of regulatory gene interactions from expression data using three-way mutual information. Ann. N. Y. Acad. Sci. 1465: doi: 10.1111/j.1749-6632.2008.03757.x.

32. Davis, J. & M. Goadrich. 2006. The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA.