Computational methods for inferring gene regulatory networks

STAT/BMI 877: Statistical Methods for Molecular Biology
https://www.biostat.wisc.edu/~kendzior/STAT877/
Feb 21st, 2017

Some of the material covered in this lecture is adapted from BMI 576.
Why networks?

"A system is an entity that maintains its function through the interaction of its parts"
– Kohl & Noble

To understand cells as systems: measure, model, predict, refine.

Uwe Sauer, Matthias Heinemann, Nicola Zamboni, Science 2007
Plan for this segment

• Introduction to molecular networks and expression-based network inference (today, Feb 21st)
• Module-based network inference methods (Thursday, Feb 23rd)
• Integrative network inference (Thursday, Feb 23rd)
Goals for today

• Background
  – Introductory graph theory
  – Different types of cellular networks
• Expression-based regulatory network inference
  – Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
  – Bayesian networks (Sparse Candidates)
  – Dependency networks (GENIE3)
A network

• Describes connectivity patterns between the parts of a system
  – Vertices/Nodes: parts, components
  – Edges/Interactions/links: connections
• Edges can have signs, directions, and/or weights
• A network is represented as a graph
  – Node and vertex are used interchangeably
  – Edge, link, and interaction are used interchangeably

[Figure: an example graph with vertices A–F connected by edges]
Defining a graph

• A graph is defined by two sets: V and E
• V: set of vertices
• E: set of edges
• Depending upon the graph, we may include additional attributes about the vertices and edges
A few graph-theoretic concepts

• Directed graph
• Undirected graph
• Weighted graph
• Degree of a node
• Subnetworks/subgraphs
  – Path
  – Cycle
  – Cliques
  – Connected component
Representing a graph

• Adjacency matrix
• Adjacency list
Adjacency matrix

Adapted from Barabási & Oltvai, Nature Reviews Genetics 2004

Matrix-based representation (rows and columns ordered A–H):

    A B C D E F G H
A   0 1 1 1 1 0 0 1
B   1 0 1 0 0 0 1 0
C   1 1 0 0 0 0 0 0
D   1 0 0 0 0 0 0 0
E   1 0 0 0 0 1 0 0
F   0 0 0 0 1 0 1 0
G   0 1 0 0 0 1 0 1
H   1 0 0 0 0 0 1 0

V = {A, B, C, D, E, F, G, H}
E = set of vertex pairs (u, v) connected with an edge
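As a concrete sketch, the matrix above can be built from an edge list in a few lines of Python (the edge list is read off the matrix; the variable names are ours):

```python
# Build the slide's 8-node undirected graph as an |V| x |V| adjacency matrix.
nodes = list("ABCDEFGH")
index = {v: i for i, v in enumerate(nodes)}
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "H"),
         ("B", "C"), ("B", "G"), ("E", "F"), ("F", "G"), ("G", "H")]

n = len(nodes)
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[index[u]][index[v]] = 1
    adj[index[v]][index[u]] = 1  # undirected: store both directions

# For an undirected graph the matrix is symmetric.
symmetric = all(adj[i][j] == adj[j][i] for i in range(n) for j in range(n))
```

Note the space cost is O(|V|^2) no matter how few edges the graph has.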
Adjacency list

List-based representation; space efficient:

A: B, C, D, E, H
B: A, C, G
C: A, B
D: A
E: A, F
F: G, E
G: B, F, H
H: A, G
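The same graph as a Python dictionary of lists, a common adjacency-list idiom (names are ours):

```python
# Adjacency-list representation of the slide's undirected graph.
from collections import defaultdict

edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "H"),
         ("B", "C"), ("B", "G"), ("E", "F"), ("F", "G"), ("G", "H")]

adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)  # undirected: record the edge in both endpoint lists

# Space is O(|V| + |E|) rather than O(|V|^2), which matters for sparse graphs.
```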
Directed graphs

Edges have directionality on them. The adjacency matrix is no longer symmetric (rows and columns ordered A–H; row u, column v is 1 if there is an edge u → v):

    A B C D E F G H
A   0 0 0 1 1 0 0 0
B   1 0 0 0 0 0 1 0
C   1 1 0 0 0 0 0 0
D   0 0 0 0 0 0 0 0
E   0 0 0 0 0 0 0 0
F   0 0 0 0 1 0 1 0
G   0 0 0 0 0 0 0 1
H   1 0 0 0 0 0 0 0
Weighted graphs

Edges have weights on them; we can have directed and undirected weighted graphs. The example shown is a directed weighted graph (rows and columns ordered A, B, C, D, E, H):

     A    B    C    D    E    H
A    0    0    0   0.9  1.4   0
B   1.6   0    0    0    0    0
C   0.2  0.6   0    0    0    0
D    0    0    0    0    0    0
E    0    0    0    0    0    2
H   3.1   0    0    0    0    0
Node degree

• Undirected network
  – Degree, k: number of neighbors of a node
• Directed network
  – In-degree, k_in: number of incoming edges
  – Out-degree, k_out: number of outgoing edges

In the directed example graph, the in-degree of B is 1 and the out-degree of A is 2.
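A minimal sketch of computing in- and out-degrees, with the edge list transcribed from the directed adjacency matrix above:

```python
# In- and out-degrees from a directed edge list (the slide's example graph).
edges = [("A", "D"), ("A", "E"), ("B", "A"), ("B", "G"), ("C", "A"),
         ("C", "B"), ("F", "E"), ("F", "G"), ("G", "H"), ("H", "A")]

in_degree, out_degree = {}, {}
for u, v in edges:
    out_degree[u] = out_degree.get(u, 0) + 1  # the edge leaves u
    in_degree[v] = in_degree.get(v, 0) + 1    # the edge enters v
```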
Subgraphs

A subgraph of a graph is defined by a subset of nodes and edges. We will use subgraph and subnetwork interchangeably.

Path: a set of connected edges from one node to another
Path length: the total number of edges in a path
Shortest path: the path between two vertices with the shortest path length
[Figure: paths from B to D in the example graph]

Cycle: a path which starts and ends at the same node
[Figure: a cycle in the example graph]
Subnetworks/Subgraphs

Clique: a set of nodes that have edges between all pairs of vertices
[Figure: a 5-node clique]

Connected component: a subgraph spanning a vertex subset in which every vertex can be "reached" from every other vertex
[Figure: a graph with two connected components]
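Connected components can be found with a breadth-first search. A sketch in Python, using an illustrative two-component edge list (the exact edges in the slide's figure are assumed, not transcribed):

```python
# BFS-based connected components for an undirected graph.
from collections import deque

def connected_components(nodes, edges):
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # A BFS from any unvisited node collects exactly one component.
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components

# Illustrative graph with two components.
comps = connected_components(list("ABCDEFGH"),
                             [("A", "B"), ("B", "C"), ("C", "D"),
                              ("E", "F"), ("F", "G"), ("G", "H")])
```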
Different types of networks

• Physical networks
  – Transcriptional regulatory networks: interactions between regulatory proteins (transcription factors) and genes
  – Protein-protein: interactions among proteins
  – Signaling networks: protein-protein and protein-small molecule interactions that relay signals from outside the cell to the nucleus
• Functional networks
  – Metabolic: reactions through which enzymes convert substrates to products
  – Genetic: interactions among genes which, when perturbed together, produce a more significant phenotype than when perturbed individually
Transcriptional regulatory networks

[Figure 1: Representation of the E. coli transcriptional regulatory network. a) The transcription-factor gene regulatory network of E. coli: green circles represent transcription factors, brown circles denote regulated genes, and nodes with both functions are coloured red. Projections of the network onto b) transcription factor (TF) and c) regulated gene (RG) nodes are also shown.]
Regulatory network of E. coli: 153 TFs (green & light red), 1319 targets.

Vargas and Santillán, 2008
[Figure: transcription factors (TFs) A and B bind the DNA at gene C's promoter; as a graph, nodes A and B point to node C]

• Directed, signed, weighted graph
• Nodes: TFs and target genes
• Edges: an edge (A, B) means A regulates B's expression level
Metabolic networks

[Figure: reactions associated with galactose metabolism; metabolites and the enzymes (e.g. M, N, O) that interconvert them; source: KEGG]

• Unweighted graph
• Nodes: metabolic enzymes
• Edges: enzymes M and N share a compound
Protein-protein interaction networks

[Figure 2: Yeast protein interaction network (Barabási & Oltvai, Nature Reviews Genetics 2004). A map of protein-protein interactions in Saccharomyces cerevisiae, based on early yeast two-hybrid measurements, illustrates that a few highly connected nodes (hubs) hold the network together. The largest cluster, containing ~78% of all proteins, is shown; node colour indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).]

• Un/weighted graph
• Nodes: proteins
• Edges: protein X physically interacts with protein Y
Goals for today

• Background
  – Introductory graph theory
  – Different types of cellular networks
• Expression-based regulatory network inference
  – Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
  – Bayesian networks (Sparse Candidates)
  – Dependency networks (GENIE3)
Computational inference of transcriptional regulatory networks

• We will focus on transcriptional regulatory networks
• These networks control what genes get activated when
• Precise gene activation or inactivation is crucial for many biological processes
• Microarrays and RNA-seq allow us to systematically measure gene activity levels
Why do we need to computationally infer transcriptional networks?

• Why infer transcriptional networks?
  – To understand how gene expression is controlled
  – To understand how cells process extra-cellular cues
  – Many diseases are associated with changes in transcriptional networks
• Why do so computationally?
  – Experimental detection of networks is hard and expensive
  – It is a first step towards having an in silico model of a cell
  – A model can be used to make predictions that can be tested, and the results in turn refine the model
Methods to infer regulatory networks

• Experimental
  – ChIP-chip and ChIP-seq
  – Factor/regulator knockout followed by genome-wide transcriptome measurements
  – DNase I/ATAC-seq + motif finding
• Computational
  – Supervised network inference
  – Unsupervised network inference
    • Expression-based network inference
Types of data for reconstructing transcriptional networks

• Node-specific datasets
  – Transcriptomes: genome-wide mRNA levels
  – Can potentially recover genome-wide regulatory networks
• Edge-specific datasets
  – ChIP-chip and ChIP-seq
  – Sequence-specific motifs
  – Factor knockout followed by whole-transcriptome profiling
[Figure: a gene expression matrix (gene expression levels x samples). Shown: Figure 3, "Variation in gene expression in S. cerevisiae isolates", from "Phenotypic Variation in Yeast", PLoS Genetics 2008, 4(10):e1000223; each row is a gene and each column a strain, with red indicating higher and green lower expression than the reference.]
A simple example of transcriptional regulation

Assume HSP12's expression is dependent upon Hot1 and Sko1 binding to HSP12's promoter.

[Figure: with both Hot1 and Sko1 bound at the promoter, HSP12 is ON; with Sko1 alone, HSP12 is OFF]

Inputs: transcription factor levels (trans) and transcription factor binding sites (cis). Output: mRNA levels.
What do we want a model for a regulatory network to capture?

HSP12's expression is dependent upon Hot1 and Sko1 binding to HSP12's promoter.

Structure: who are the regulators? Hot1 regulates HSP12; HSP12 is a target of Hot1.

Function: how do the regulators determine expression levels? X3 = ψ(X1, X2), where ψ may be Boolean, linear, differential equations, probabilistic, ...
Mathematical representations of regulatory networks

Input: expression/activity of the regulators X1, X2. Output: expression of the target gene X3. Models differ in the function f that maps regulator input levels to target levels.

Boolean networks (input/output tables):

X1 X2 | X3
 0  0 |  0
 0  1 |  1
 1  0 |  1
 1  1 |  1

Differential equations (rate equations):

dX3(t)/dt = g(X1(t), X2(t)) - r X3(t)

Probabilistic graphical models (probability distributions):

P(X3 | X1, X2) = N(a X1 + b X2, σ)
Regulatory network inference from expression
[Figure 1: The DREAM5 network inference challenge. Anonymized gene expression data (from experiments such as knockouts, overexpression, and antibiotic treatments, plus simulated data) are provided to many inference methods; the inferred networks are (4) integrated into a community network, and (5) performance is evaluated against experimentally determined interactions (ChIP, motifs, etc.) not used by the methods.]
Expression-based network inference

Input: an expression matrix (genes x experiments); entry (i, j) is the expression level of gene i in experiment j.

Output:
• Structure: which regulators (e.g. X1, X2) are connected to which target (e.g. X3)
• Function: X3 = f(X1, X2)
Expression-based regulatory network inference

• Given
  – A set of measured mRNA levels across multiple biological samples
• Do
  – Infer the regulators of genes
  – Infer how regulators specify the expression of a gene
• Algorithms for network reconstruction vary based on what they mean by an "interaction"
Two classes of expression-based methods

• Per-gene/direct methods (today)
• Module-based methods (Thursday)

[Figure: per-gene methods connect regulators X1, X2 directly to individual targets such as X3, X4, X5; module-based methods connect regulators to a module of genes]
Per-gene methods

• Key idea: find the regulators that "best explain" the expression level of a gene
• Probabilistic graphical methods
  – Bayesian networks
    • Sparse Candidates
  – Dependency networks
    • GENIE3, TIGRESS
• Information-theoretic methods
  – Context Likelihood of Relatedness
  – ARACNE
Module-based methods

• Find regulators for an entire module
  – Assume genes in the same module have the same regulators
• Module Networks (Segal et al. 2005)
• Stochastic LeMoNe (Joshi et al. 2008)

[Figure: per-module inference connects regulators Y1, Y2 to a module of genes X1, X2]
A non-exhaustive list of expression-based network inference methods

Method name        Per-module  Per-gene  Parameters  Model type
Sparse Candidates              ✓         ✓           Bayesian network
CLR                            ✓                     Information-theoretic
ARACNE                         ✓                     Information-theoretic
TIGRESS                        ✓                     Dependency network
Inferelator                    ✓         ✓           Dependency network
GENIE3                         ✓                     Dependency network
Module Networks    ✓                     ✓           Bayesian network
LemonTree          ✓                                 Dependency network
WGCNA              ✓                                 Correlation
Goals for today

• Background
  – Introductory graph theory
  – Different types of cellular networks
• Expression-based regulatory network inference
  – Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
  – Bayesian networks (Sparse Candidates)
  – Dependency networks (GENIE3)
Probabilistic graphical models (PGMs)

• A marriage between probability and graph theory
• Nodes on the graph represent random variables
• The graph structure specifies the statistical dependency structure
• The graph parameters specify the nature of the dependency
• PGMs can be directed or undirected
• Examples of PGMs: Bayesian networks, dependency networks, Markov networks, factor graphs
Different types of probabilistic graphs

• In each graph type we can assert different conditional independencies
• Correlation networks
• Gaussian graphical models
• Dependency networks
• Bayesian networks
Correlational networks

• An undirected graph
• Edges represent high correlation
  – Need to determine what "high" is
• Edge weights denote different values of correlation
• Cannot discriminate between direct and indirect correlations

The random variables represent gene expression levels (e.g. X1 = Msb2, X2 = Sho1, X3 = Ste20). The result is an undirected weighted graph; the edge weights (e.g. w13, w23) are a measure of statistical correlation (e.g. Pearson's correlation).
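A sketch of building a correlational network in Python: compute pairwise Pearson correlations on toy expression data and keep the edges above a chosen cutoff. The data, gene names, and the 0.8 threshold are illustrative assumptions, not from the slides:

```python
# Correlational network: edges = gene pairs with |Pearson r| above a cutoff.
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [a + 0.1 * random.gauss(0, 1) for a in x1]  # strongly co-expressed with X1
x3 = [random.gauss(0, 1) for _ in range(200)]    # an unrelated gene

genes = {"X1": x1, "X2": x2, "X3": x3}
threshold = 0.8  # what counts as "high" must be chosen by the analyst
names = list(genes)
edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if abs(pearson(genes[a], genes[b])) >= threshold]
```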
Popular examples of correlational networks

• Weighted Gene Co-expression Network Analysis (WGCNA) – Zhang and Horvath, 2005
• Relevance Networks – Butte & Kohane, 2000, Pacific Symposium on Biocomputing
Limitations of correlational networks

• Correlational networks cannot distinguish between direct and indirect dependencies
• This makes them less interpretable than other PGMs

[Figure: Different mechanisms can explain coexpression (BMC Bioinformatics 2007, 8(Suppl 6):S5). Three coexpressed genes X, Y, Z form a clique in the coexpression graph; three regulatory relationships can explain the coexpression: the genes are regulated in a cascade, or one gene regulates both others, or there is a common "hidden" regulator H, which is not part of the model.]

[Figure: A small Gaussian graphical model. Missing edges between nodes indicate independencies of two genes given all the other genes in the model: X ⊥ W | {Y, Z}, Y ⊥ W | {X, Z}, and X ⊥ Z | {Y, W}.]

• For any co-expression network, there are several possible regulatory networks that can explain the observed correlations.
• What we would like is to be able to discriminate between direct and indirect dependencies.
• For this we need to review conditional independencies.
Conditional independencies in PGMs

• The different classes of models we will see are based on a general notion of specifying statistical independence
• Suppose we have two genes X and Y. We add an edge between X and Y if X and Y are not independent given a third set Z.
• Depending upon Z we will have a family of different PGMs
Conditional independence and PGMs

• Correlational networks
  – Z is the empty set
• Markov networks
  – X and Y are not independent given all other variables
  – Gaussian graphical models are a special case
• Dependency networks
  – Approximate Markov networks
  – May not be associated with a valid joint distribution
• First-order conditional independence models
  – Explain the correlation between two variables by a third variable
• Bayesian networks
  – Generalize first-order conditional independence models
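The first-order conditional independence idea can be illustrated with partial correlation: in a simulated cascade X → Y → Z, X and Z are marginally correlated but become nearly uncorrelated once Y is controlled for. The simulation and function names are ours:

```python
# Partial correlation separates direct from indirect dependencies.
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

def partial_corr(x, z, given):
    """First-order partial correlation of x and z, controlling for one variable."""
    rxz, rxg, rzg = pearson(x, z), pearson(x, given), pearson(z, given)
    return (rxz - rxg * rzg) / sqrt((1 - rxg ** 2) * (1 - rzg ** 2))

# Simulated cascade X -> Y -> Z: X affects Z only through Y.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(500)]
y = [a + 0.3 * random.gauss(0, 1) for a in x]
z = [b + 0.3 * random.gauss(0, 1) for b in y]
```

Here the coexpression graph would wrongly put an edge between X and Z; conditioning on Y removes it.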
Bayesian networks (BNs)

• A special type of probabilistic graphical model
• Has two parts:
  – A graph which is directed and acyclic
  – A set of conditional distributions
• Directed Acyclic Graph (DAG)
  – The nodes denote random variables X1 ... XN
  – The edges
    • encode statistical dependencies between the random variables
    • establish parent-child relationships
• Each node Xi has a conditional probability distribution (CPD) representing P(Xi | Pa(Xi)), where Pa denotes parents
• Provides a tractable way to represent large joint distributions
Key questions in Bayesian networks

• What do the CPDs look like?
• What independence assertions can be made in Bayesian networks?
• How can we learn them from expression data?
Notation

• Xi: the ith random variable
• If there are few random variables, we will just use uppercase letters, e.g. A, B, C
• X = {X1, ..., Xp}: set of p random variables
• x_i^k: an assignment of Xi in the kth sample
• x_-i^k: the set of assignments to all variables other than Xi in the kth sample
• D: a dataset constituting the collection of all joint assignments
An example Bayesian network

Adapted from Kevin Murphy, Intro to Graphical Models and Bayes Networks: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

Nodes: Cloudy (C), Sprinkler (S), Rain (R), WetGrass (W). C is a parent of S and R; S and R are parents of W.

P(C=F)  P(C=T)
 0.5     0.5

C | P(S=F)  P(S=T)
F |  0.5     0.5
T |  0.9     0.1

C | P(R=F)  P(R=T)
F |  0.8     0.2
T |  0.2     0.8

S R | P(W=F)  P(W=T)
F F |  1.0     0.0
T F |  0.1     0.9
F T |  0.1     0.9
T T |  0.01    0.99
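A sketch of this network's factored joint distribution in Python, with the CPTs transcribed from the tables above (variable names are ours):

```python
# Sprinkler network: P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R).
from itertools import product

p_c_true = 0.5                                    # P(C = T)
p_s_true_given_c = {False: 0.5, True: 0.1}        # P(S = T | C)
p_r_true_given_c = {False: 0.2, True: 0.8}        # P(R = T | C)
p_w_true_given_sr = {(False, False): 0.0, (True, False): 0.9,
                     (False, True): 0.9, (True, True): 0.99}  # P(W = T | S, R)

def bern(p_true, value):
    """Probability that a binary variable takes `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    # Multiply one factor per node, each conditioned on its parents.
    return (bern(p_c_true, c) * bern(p_s_true_given_c[c], s)
            * bern(p_r_true_given_c[c], r) * bern(p_w_true_given_sr[(s, r)], w))

# Sanity check: the factored joint sums to 1 over all 2^4 assignments.
total = sum(joint(*assign) for assign in product([False, True], repeat=4))
```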
Expression data matrix

p genes (rows) x N experiments/time points (columns).
A column holds the observations (expression levels) of all variables in sample k; a row holds the observations of variable Xi in all N experiments.
x_-j^k: the assignments to all variables other than Xj in sample k.
Bayesian networks compactly represent joint distributions

P(X1, ..., Xp) = ∏_{i=1}^{p} P(Xi | Pa(Xi))
Example Bayesian network of 5 variables

[Figure: X1 and X2 are parents of child X3; X3 and X4 are parents of child X5]

With no independence assertions:
P(X) = P(X1, X2, X3, X4, X5)
Assuming each Xi is binary, this needs 2^5 measurements.

With the independence assertions of the graph:
P(X) = P(X1) P(X2) P(X4) P(X3 | X1, X2) P(X5 | X3, X4)
The largest factor, e.g. P(X3 | X1, X2), needs only 2^3 measurements.
CPDs in Bayesian networks

• The CPD P(Xi | Pa(Xi)) specifies a distribution over values of Xi for each combination of values of Pa(Xi)
• The CPD P(Xi | Pa(Xi)) can be parameterized in different ways
• If the Xi are discrete random variables
  – Conditional probability table or tree
• If the Xi are continuous random variables
  – The CPD can be a conditional Gaussian or a regression tree
• Consider four binary variables X1, X2, X3, X4
Representing CPDs as tables

[Figure: a DAG in which X1, X2, X3 are the parents of X4, i.e. Pa(X4) = {X1, X2, X3}.]

P(X4 | X1, X2, X3) as a table:

X1  X2  X3 | P(X4=t)  P(X4=f)
 t   t   t |   0.9      0.1
 t   t   f |   0.9      0.1
 t   f   t |   0.9      0.1
 t   f   f |   0.9      0.1
 f   t   t |   0.8      0.2
 f   t   f |   0.5      0.5
 f   f   t |   0.5      0.5
 f   f   f |   0.5      0.5
Estimating a CPD table from data
• Assume we observe the following N = 7 assignments for X1, X2, X3, X4:

X1 X2 X3 X4
T  F  T  T
T  T  F  T
T  T  F  T
T  F  T  T
T  F  T  F
T  F  T  F
F  F  T  F

For each joint assignment to X1, X2, X3, estimate the probabilities for each value of X4.
For example, consider X1=T, X2=F, X3=T:
P(X4=T | X1=T, X2=F, X3=T) = 2/4
P(X4=F | X1=T, X2=F, X3=T) = 2/4
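The counting procedure on this slide takes a few lines; `estimate_cpt` is a hypothetical helper name (not from the lecture), and the data below are the seven samples shown above:

```python
from collections import Counter, defaultdict

def estimate_cpt(samples, parents, child):
    """Maximum-likelihood CPT: for each joint parent assignment,
    the empirical distribution of the child's values.
    samples: list of dicts mapping variable name -> value."""
    counts = defaultdict(Counter)
    for s in samples:
        counts[tuple(s[p] for p in parents)][s[child]] += 1
    cpt = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        cpt[key] = {v: c / total for v, c in ctr.items()}
    return cpt

# The seven samples from the slide.
data = [dict(zip(["X1", "X2", "X3", "X4"], row)) for row in
        [("T","F","T","T"), ("T","T","F","T"), ("T","T","F","T"),
         ("T","F","T","T"), ("T","F","T","F"), ("T","F","T","F"),
         ("F","F","T","F")]]
cpt = estimate_cpt(data, ["X1", "X2", "X3"], "X4")
```

For the parent assignment (T, F, T) this reproduces the 2/4 vs 2/4 split worked out on the slide.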
Bayesian network representation of a regulatory network

[Figure: inside the cell, the regulators Sko1 and Hot1 control the target Hsp12. As a Bayesian network, the regulators (parents) X1 and X2, with distributions P(X1) and P(X2), point to the target (child) X3, with CPD P(X3 | X1, X2). The random variables correspond to Sko1, Hot1, and HSP12.]
Learning problems in Bayesian networks
• Parameter learning on a known graph structure
– Given a set of joint assignments of the random variables, estimate the parameters of the model
• Structure learning
– Given a set of joint assignments of the random variables, estimate the structure and parameters of the model
– Structure learning subsumes parameter learning
Structure learning using score-based search

[Figure: candidate networks B1, B2, B3, ..., Bm, each evaluated with Score(B1), Score(B2), Score(B3), ..., Score(Bm).]

Score(B) describes how well B describes the data.
Exhaustive search is not computationally tractable.
Scores for Bayesian networks
• Maximum likelihood:
Score_ML = max_θ log P(D | G, θ)
• Regularized maximum likelihood (BIC):
Score_BIC = max_θ log P(D | G, θ) − (d/2) log N
d: no. of parameters, N: no. of samples
• Bayesian score:
Score_Bayes = P(G | D) ∝ P(D | G) P(G)
Decomposability of scores
• The score of a Bayesian network B decomposes over individual variables
• Enables efficient computation of the score change due to local changes

Score(B) = Σ_i S(Xi),  where  S(Xi) = ∏_{d=1}^N P(Xi = x_i^d | Pa(Xi) = pa_{Xi}^d)

pa_{Xi}^d: joint assignment to Pa(Xi) in the dth sample
Graph structure learning from expression is a computationally difficult problem
• Given 2 TFs and 3 nodes, how many possible networks can there be?

[Figure: a (non-exhaustive) set of the possible networks.]

There can be a total of 2^6 = 64 possible networks.
Greedy hill climbing to search Bayesian network space
• Input: data D, an initial Bayesian network B0 = {G0, Θ0}
• Output: Bbest
• Loop for r = 1, 2, ... until convergence:
– {B_r^1, ..., B_r^m} = Neighbors(B_r), obtained by making local changes to B_r
– B_{r+1} = argmax_j Score(B_r^j)
• Termination:
– Bbest = B_r
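The loop above can be sketched as follows (all helper names are assumptions; the score function is pluggable, e.g. a BIC score). Neighbor networks come from single-edge additions and deletions, with a cycle check on additions so the graph stays acyclic.

```python
from itertools import permutations

def creates_cycle(edges, new_edge):
    # Adding (u, v) closes a directed cycle iff u is already reachable from v.
    u, v = new_edge
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(w for (x, w) in edges if x == node)
    return False

def neighbors(nodes, edges):
    for e in edges:                       # single-edge deletions
        yield edges - {e}
    for u, v in permutations(nodes, 2):   # single-edge additions
        if (u, v) not in edges and not creates_cycle(edges, (u, v)):
            yield edges | {(u, v)}

def hill_climb(nodes, score, edges=frozenset()):
    # Greedily move to the best-scoring neighbor until no neighbor improves.
    while True:
        best = max(neighbors(nodes, edges), key=score, default=edges)
        if score(best) <= score(edges):
            return edges
        edges = best
```

Real implementations exploit score decomposability so that only the terms of the variables touched by a local change are recomputed.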
Local changes to Br

[Figure: the current network Br over nodes A, B, C, D, and two neighbors: B_r^1 obtained by adding an edge, and B_r^2 obtained by deleting an edge. Edge additions must be checked for cycles.]
Challenges with applying Bayesian networks to genome-scale data
• The number of variables, p, is in the thousands
• The number of samples, N, is in the hundreds
Bayesian network-based methods to handle genome-scale networks
• Sparse candidate algorithm
– Friedman, Nachman, Pe'er 1999
– Friedman, Linial, Nachman, Pe'er 2000
• Module networks
– Segal, Pe'er, Regev, Koller, Friedman 2005
The Sparse Candidate algorithm: structure learning in Bayesian networks
• A fast Bayesian network learning algorithm
• Key idea: identify k "promising" candidate parents for each Xi
– k << p, p: number of random variables
– Candidates define a "skeleton graph" H
• Restrict the graph structure to select parents from H
• Early choices in H might exclude other good parents
– Resolve using an iterative algorithm
Sparse candidate algorithm
• Input:
– A dataset D
– An initial Bayes net B0
– A parameter k: max number of parents per variable
• Output:
– Final B_r
• Loop for r = 1, 2, ... until convergence:
– Restrict
• Based on D and B_{r-1}, select candidate parents C_i^r for Xi
• This defines a skeleton directed network H_r
– Maximize
• Find the network B_r that maximizes Score(B_r) among networks satisfying Pa(Xi) ⊆ C_i^r
• Termination: return B_r
Selecting candidate parents in the Restrict step
• A good parent for Xi is one with strong statistical dependence with Xi
– Mutual information I(Xi; Xj) provides a good measure of statistical dependence
– Mutual information should be used only as a first approximation
• Candidate parents need to be iteratively refined to avoid missing important dependences
• A good parent for Xi has the highest score improvement when added to Pa(Xi)
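The first approximation in the Restrict step can be sketched with empirical mutual information over discretized expression profiles (`candidate_parents` is an assumed name, not code from the paper):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X;Y) estimated from empirical co-occurrence counts (natural log)
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((nxy / n) * math.log((nxy / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), nxy in pxy.items())

def candidate_parents(profiles, target, k):
    """Rank the k genes whose discretized profiles share the most
    mutual information with the target gene's profile."""
    scores = {g: mutual_information(p, profiles[target])
              for g, p in profiles.items() if g != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

This is exactly the "first approximation" the slide warns about: an important parent whose effect is masked by another regulator can score low on marginal MI, which is why the candidate sets are re-estimated across iterations.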
Sparse candidate learns good networks faster than hill-climbing

[Figure 4 of Friedman et al. 1999, panels Text 100, Text 200, and Clc 800: performance of the different algorithms on the text and biological domains. The top row plots score (y-axis) vs. running time (x-axis); the bottom row shows the same runs as score vs. number of sufficient statistics computed. The methods vary in the candidate selection measure (Disc: discrepancy, Shld: shielding, Score: score-based) and the size of the candidate set, and are compared against greedy hill climbing. The text datasets use the 99 or 199 most common words (plus the newsgroup designator) from 10,000 messages of the 20-newsgroups data; the third dataset is synthetic data derived from real gene expression data.]

Dataset 1: 100 variables. Dataset 2: 200 variables.
Greedy hill climbing takes much longer to reach a high-scoring Bayesian network (score: higher is better).
Some comments about choosing candidates
• How to select k in the sparse candidate algorithm?
• Should k be the same for all Xi?
• Regularized regression approaches can be used to estimate the structure of an undirected graph
– Schmidt, Niculescu-Mizil, Murphy 2007
– Estimate an undirected dependency network
– Learn a Bayesian network constrained on the dependency network structure
Assessing confidence in the learned network
• Typically the number of training samples is not sufficient to reliably determine the "right" network
• One can however estimate the confidence of specific features of the network
– Graph features f(G)
• Examples of f(G)
– An edge between two random variables
– Order relations: is X an ancestor of Y?
How to assess confidence in graph features?
• What we want is P(f(G) | D), which is

Σ_G f(G) P(G | D)

• But it is not feasible to compute this sum
• Instead we will use a "bootstrap" procedure
Bootstrap to assess graph feature confidence
• For i = 1 to m
– Construct dataset Di by sampling with replacement N samples from dataset D, where N is the size of the original D
– Learn a network Bi = {Gi, Θi}
• For each feature of interest f, calculate the confidence

Conf(f) = (1/m) Σ_{i=1}^m f(Gi)
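The procedure above, as a sketch (the `learner` argument is a stand-in for any structure-learning routine, e.g. sparse candidate; the names are mine):

```python
import random
from collections import Counter

def bootstrap_edge_confidence(data, learner, m=100, seed=0):
    """data: list of samples; learner: samples -> set of edges.
    Returns Conf(edge) = fraction of bootstrap networks containing it."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(m):
        # N draws with replacement, N = size of the original dataset
        resampled = [rng.choice(data) for _ in data]
        for edge in learner(resampled):
            counts[edge] += 1
    return {edge: c / m for edge, c in counts.items()}
```

An edge feature is simply the indicator "edge present in Gi", so its confidence is the fraction of the m bootstrap networks that contain it.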
Does the bootstrap confidence represent real relationships?
• Compare the confidence distribution to that obtained from randomized data
• Shuffle the columns of each row (gene) separately
• Repeat the bootstrap procedure

[Figure: a genes x experimental-conditions expression matrix in which each row is randomized independently.]
Bootstrap-based confidence differs between real and randomized data

[Figure 3 of Friedman et al. 2000: plots of the abundance of features with different confidence levels for the cell cycle dataset (solid line) and the randomized dataset (dotted line). The x-axis denotes the confidence threshold and the y-axis the number of features with confidence equal to or higher than that threshold. The left column shows Markov features and the right column Order features; the top row uses the multinomial model and the bottom row the linear-Gaussian model. Inset in each graph is a plot of the tail of the distribution.]

Friedman et al. 2000
Example of a high-confidence sub-network

[Figure: (left) one learned Bayesian network, an unconstrained acyclic network in which each gene can have a different regulator set (a fragment of a network learned by Pe'er et al.); (right) a summary of direct-neighbor relations among the same genes based on bootstrap estimates, with degrees of confidence denoted by edge thickness. The bootstrapped-confidence network highlights a subnetwork of genes with high-confidence relations that are involved in the yeast mating pathway; colors indicate genes with known functions in mating, including signal transduction (yellow), transcription factors (blue), and downstream effectors (green).]

Nir Friedman, Science 2004
Goals for today
• Background
– Introductory graph theory
– Different types of cellular networks
• Expression-based regulatory network inference
– Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
– Sparse Candidate Bayesian networks
– GENIE3: regression-based methods
Limitations of Bayesian networks
• Cannot model cyclic dependencies
• In practice have not been shown to be better than dependency networks
– However, most of the evaluation has been done on structure, not function
• Directionality is often not associated with causality
– Too many hidden variables in biological systems
Dependency network
• A type of probabilistic graphical model
• Approximate Markov networks
– Are much easier to learn from data
• As in Bayesian networks, has
– A graph structure
– Parameters capturing dependencies between a variable and its parents
• Unlike Bayesian networks
– Can have cyclic dependencies
– Computing a joint probability is harder
• It is approximated with a "pseudo" likelihood

Dependency Networks for Inference, Collaborative Filtering and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000
Original motivation of dependency networks
• Introduced by Heckerman, Chickering, Meek, et al. 2000
• Oftentimes Bayesian networks can get confusing
– The Bayesian networks learned represent correlational or predictive relationships
– But the directionality of the edges is mistakenly interpreted as causal connections
• (Consistent) dependency networks were introduced to distinguish between these cases
Dependency network vs Bayesian network

[Figure from Heckerman et al. 2000: (a) a Bayesian network and (b) the corresponding dependency network over the variables Age, Gender, and Income.]

Oftentimes, the Bayesian network on the left is read as if "Age" determines "Income". However, all this model is capturing is that "Age" is predictive of "Income".

Dependency Networks for Inference, Collaborative Filtering and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000
Learning dependency networks: estimate the best predictors of a variable

[Figure: unknown candidate regulators ("?") feed into a function fj that predicts Xj.]

• fj can be of different types
• Learning requires estimation of each of the fj functions
• In all cases, learning requires us to minimize an error of predicting Xj from its neighborhood
Different representations of fj
• If Xj is continuous
– fj can be a linear function
– fj can be a regression tree
– fj can be an ensemble of trees
• E.g. random forests
• If Xj is discrete
– fj can be a conditional probability table
– fj can be a conditional probability tree
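As a deliberately tiny sketch of the continuous case (a stand-in for the lasso or tree fits used by real methods; the function names are my own): learn each fj as the single best least-squares linear predictor of Xj among the other genes.

```python
def fit_linear(x, y):
    # least-squares slope a, intercept b, and squared error for y ~ a*x + b
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) or 1e-12  # guard constant profiles
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b = my - a * mx
    sse = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return a, b, sse

def best_predictor(profiles, target):
    """Return the gene whose linear fit to the target has the lowest error."""
    fits = {g: fit_linear(p, profiles[target])[2]
            for g, p in profiles.items() if g != target}
    return min(fits, key=fits.get)
```

Real implementations fit all candidate regulators jointly with a sparsity constraint rather than one at a time, but the objective is the same: minimize the error of predicting Xj from its neighborhood.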
Popular dependency network implementations
• Learned by solving a set of linear regression problems
– TIGRESS (Haury et al. 2010)
• Uses a constraint to learn a "sparse" Markov blanket
• Uses "stability selection" to estimate the confidence of edges
• Learned by solving a set of non-linear regression problems
– Non-linearity captured by a regression tree (Heckerman et al. 2000)
– GENIE3: non-linearity captured by a random forest (Huynh-Thu et al. 2010)
– Inferelator (Bonneau et al., Genome Biology 2005)
• Can handle time course and single time point data
• Non-linear regression is done using a logistic transform
• Handles linear and non-linear regression
GENIE3: GEne Network Inference with Ensemble of trees
• Solves a set of regression problems
– One per random variable
• Uses an ensemble of regression trees to represent fj
– Models non-linear dependencies
• Outputs a directed, cyclic graph with a confidence for each edge
• Focuses on generating a ranking over edges rather than a graph structure and parameters

Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. Van Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Pierre Geurts, PLoS ONE 2010
A regression tree
• A rooted binary tree T
• Each node in the tree is either an interior node or a leaf node
• Interior nodes are labeled with a binary test Xi < u, where u is a real number observed in the data
• Leaf nodes are associated with univariate distributions of the child
A very simple regression tree for two variables

[Figure: a tree specifying X3 as a function of X2. The root is the interior node X2 > e1; if NO, a leaf with distribution (μ1, σ1); if YES, a second interior node X2 > e2; if NO, a leaf (μ2, σ2); if YES, a leaf (μ3, σ3). Interior nodes hold the tests; leaf nodes hold the univariate distributions.]
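The example tree above can be encoded directly. A minimal sketch, where the thresholds e1 < e2 and the leaf means are made-up illustrative numbers:

```python
def make_leaf(mu):
    return ("leaf", mu)

def make_node(threshold, no, yes):
    # interior node testing x2 > threshold
    return ("node", threshold, no, yes)

def predict(tree, x2):
    # walk the tree until a leaf is reached; return the leaf's mean
    if tree[0] == "leaf":
        return tree[1]
    _, threshold, no, yes = tree
    return predict(yes if x2 > threshold else no, x2)

# X2 > e1? NO -> mu1; YES -> (X2 > e2? NO -> mu2; YES -> mu3)
e1, e2 = 0.5, 1.5
tree = make_node(e1, make_leaf(0.1),
                 make_node(e2, make_leaf(1.0), make_leaf(2.3)))
```

Each leaf would also carry a variance in the full model; prediction uses the leaf mean.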
Learning a regression tree
• Assume we are searching for the neighbors of a variable X3, and it already has two parents, X1 and X2
• X4 will be considered as a new neighbor using "split" operations on existing leaf nodes

[Figure: before the split, a tree with tests X1 > e1 and X2 > e2 and leaves N1, N2, N3, where Nl is the Gaussian associated with leaf l. After splitting a leaf with the new test X4 > e3, leaves N2 and N3 remain and the split leaf is replaced by two new leaves, N4 and N5.]
An Ensemble of trees
• A single tree is prone to "overfitting"
• Instead of learning a single tree, ensemble models make use of a collection of trees
– The prediction is the average of the individual tree predictions

A Random forest: an Ensemble of Trees

[Figure: trees 1 through T, each consisting of split nodes and leaf nodes; each tree is applied to the same input x_{-j} and outputs a prediction μt. Taken from the ICCV09 tutorial by Kim, Shotton and Stenger: http://www.iis.ee.ic.ac.uk/~tkkim/iccv09_tutorial]

x̂_j = (1/T) Σ_{t=1}^T μ_t
GENIE3 algorithm sketch
• For each Xj, generate learning samples of input/output pairs
– LSj = {(x_{-j}^k, x_j^k), k = 1..N}
– On each LSj, learn fj to predict the value of Xj
– fj is an ensemble of regression trees
– Estimate wij for all genes i ≠ j
• wij quantifies the confidence of the edge between Xi and Xj
• Associated with the decrease in variance of Xj when Xi is included in fj
• Generate a global ranking of edges based on each wij
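A stdlib-only caricature of this procedure (the real method fits full random forests; the stump-based importance, the bootstrap loop, and all names below are simplifying assumptions): each candidate regulator is credited with the output-variance reduction achieved by its best single split, accumulated over bootstrap rounds, and edges are ranked by the accumulated weights w_ij.

```python
import random

def variance(vals):
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def best_split_gain(x, y):
    # best single-threshold split of y by x, scored by variance reduction
    base = variance(y) * len(y)
    gains = []
    for t in set(x):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gains.append(base - variance(left) * len(left)
                          - variance(right) * len(right))
    return max(gains) if gains else 0.0

def genie3_like(profiles, rounds=50, seed=0):
    """profiles: dict gene -> list of expression values.
    Returns edges (regulator, target) ranked by accumulated importance."""
    rng = random.Random(seed)
    n = len(next(iter(profiles.values())))
    weights = {}
    for target, y_all in profiles.items():
        for _ in range(rounds):
            idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
            y = [y_all[i] for i in idx]
            for reg, x_all in profiles.items():
                if reg == target:
                    continue
                x = [x_all[i] for i in idx]
                weights[(reg, target)] = (weights.get((reg, target), 0.0)
                                          + best_split_gain(x, y))
    return sorted(weights, key=weights.get, reverse=True)
```

Note that, as in GENIE3 itself, the output is a ranked edge list with weights, not a parameterized graphical model.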
GENIE3 algorithm sketch

[Figure 1 of Huynh-Thu et al.: the GENIE3 procedure. Each gene's expression is modeled as x_j^k = f_j(x_{-j}^k) + ε_k, where ε_k is random noise with zero mean and f_j only exploits the expression of the direct regulators of gene j. Recovering the regulators of gene j thus amounts to finding the genes whose expression is predictive of the target gene's expression: a feature selection/ranking problem in regression. For each gene j = 1, ..., p, a learning sample LSj is generated with the expression levels of j as output values and the expression levels of all other genes as input values; a function fj is learned from LSj by minimizing the square error Σ_k (x_j^k − f_j(x_{-j}^k))^2 with tree-based ensemble methods, and a local predictor ranking of all genes except j is computed. The p local rankings are then aggregated (after appropriate normalization of each expression vector) to get a global ranking of all regulatory links.]

Figure from Huynh-Thu et al.
Evaluation of network inference methods
• Assume we know what the "right" network is
• One can use precision-recall curves to evaluate the predicted network
• The area under the PR curve (AUPR) quantifies performance

Precision = # of correct edges / # of predicted edges
Recall = # of correct edges / # of true edges
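The two formulas above, applied at every cutoff of a ranked edge list, give the PR curve; a sketch (step-wise AUPR approximation, names my own):

```python
def precision_recall(ranked_edges, true_edges):
    # precision and recall at each rank cutoff k = 1, 2, ...
    points, correct = [], 0
    for k, e in enumerate(ranked_edges, start=1):
        if e in true_edges:
            correct += 1
        points.append((correct / k, correct / len(true_edges)))
    return points  # list of (precision, recall)

def aupr(points):
    # step-wise approximation of the area under the PR curve
    area, prev_recall = 0.0, 0.0
    for precision, recall in points:
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```

A perfect ranking (all true edges first) gives an AUPR of 1; random rankings of sparse networks give values close to the edge density, which is why AUPR is preferred over ROC-based measures here.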
DREAM: Dialogue for reverse engineering assessments and methods
Community effort to assess regulatory network inference

[Figure 1 of Marbach et al.: the DREAM5 network inference challenge. (1) Anonymized gene expression compendia for an in silico benchmark network and real networks (E. coli, S. cerevisiae) are provided. (2) Inference methods are applied blinded by participating teams, alongside off-the-shelf methods. (3) Each method yields an inferred network. (4) The predictions are integrated into a community network. (5) Performance is evaluated against the true in silico network and experimentally determined interactions.]

DREAM5 challenge. Previous challenges: 2006, 2007, 2008, 2009, 2010.
Marbach et al. 2012, Nature Methods
Where do different methods rank?

[Figure 2 of Marbach et al. 2012: evaluation of network inference methods. (a) Performance on the individual networks (area under the precision-recall curve, AUPR, for in silico, E. coli, and S. cerevisiae) and an overall score summarizing performance across networks, for methods grouped as regression, mutual information (MI), correlation, Bayesian networks, other, and meta, together with random (R) and integrated community (C) predictions. (b) Methods grouped according to the similarity of their predictions via principal-component analysis: C1 regression, C2 Bayesian, C3 MI & correlation, C4 other. (c) A heat map of method-specific biases in predicting network motifs (knockouts, directionality, feed-forward loops, cascades, fan-in, fan-out); red and blue mark interactions that are easier and harder to detect, respectively.]

These approaches were mostly per-gene.
Methods tend to cluster together.

Marbach et al., 2012
[From Marbach et al. 2012, Nature Methods 9:796-804: methods that explicitly used transcription-factor knockouts recovered the targets of deleted transcription factors more reliably, and such knockouts also helped to draw the direction of edges between transcription factors. Community networks outperform individual inference methods: the integrated community network, built by rescoring interactions according to their average rank across all methods, ranks first for in silico, third for E. coli, and sixth for S. cerevisiae out of the 35 applied inference methods, and has by far the best overall score. The community approach performs robustly across diverse data sets, improves with the number and diversity of the integrated methods, and is robust to the inclusion of a limited subset (up to ~20%) of poorly performing methods.]

These approaches were mostly per-gene.
Marbach et al., 2012
Take-away points
• Network inference from expression provides a promising approach to identify cellular networks
• Graphical models are one representation of networks that have a probabilistic and a graphical component
– Network inference naturally translates to learning problems in these models
• Different types of algorithms learn these networks
– Per-gene methods
• Sparse Candidate, GENIE3: learn regulators for individual genes
– Per-module methods
• Module networks: learn regulators for sets of genes/modules
• Successful application of Bayesian networks to expression data requires additional considerations
– Reduce the set of potential parents
– Bootstrap-based confidence estimation
• Network reconstruction from expression alone is challenging
– New methods aim to combine other types of measurements (next lecture)
References
• Kim, Harold D., Tal Shay, Erin K. O'Shea, and Aviv Regev. "Transcriptional Regulatory Circuits: Predicting Numbers from Alphabets." Science 325 (July 2009): 429-432.
• De Smet, Riet and Kathleen Marchal. "Advantages and limitations of current network inference methods." Nature Reviews Microbiology 8 (October 2010): 717-729.
• Markowetz, Florian and Rainer Spang. "Inferring cellular networks - a review." BMC Bioinformatics 8 Suppl 6 (2007): S5.
• N. Friedman, M. Linial, I. Nachman, and D. Pe'er. "Using Bayesian networks to analyze expression data." Journal of Computational Biology, vol. 7, no. 3-4, pp. 601-620, Aug. 2000. http://dx.doi.org/10.1089/106652700750050961