Computational methods for inferring gene regulatory networks

STAT/BMI 877: Statistical Methods for Molecular Biology
https://www.biostat.wisc.edu/~kendzior/STAT877/
Feb 21st, 2017

Some of the material covered in this lecture is adapted from BMI 576.
Why networks?

"A system is an entity that maintains its function through the interaction of its parts"
– Kohl & Noble

To understand cells as systems: measure, model, predict, refine.

Uwe Sauer, Matthias Heinemann, Nicola Zamboni, Science 2007
Plan for this segment

• Introduction to molecular networks and expression-based network inference (today, Feb 21st)
• Module-based network inference methods (Thursday, Feb 23rd)
• Integrative network inference (Thursday, Feb 23rd)
Goals for today

• Background
  – Introductory graph theory
  – Different types of cellular networks
• Expression-based regulatory network inference
  – Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
  – Bayesian networks (Sparse Candidates)
  – Dependency networks (GENIE3)
A network

• Describes connectivity patterns between the parts of a system
  – Vertices/Nodes: parts, components
  – Edges/Interactions/links: connections
• Edges can have signs, directions, and/or weights
• A network is represented as a graph
  – Node and vertex are used interchangeably
  – Edge, link, and interaction are used interchangeably

[Figure: an example graph with vertices A–F connected by edges]
Defining a graph

• A graph is defined by two sets: V and E
• V: set of vertices
• E: set of edges
• Depending upon the graph, we may include additional attributes about the vertices and edges
A few graph-theoretic concepts

• Directed graph
• Undirected graph
• Weighted graph
• Degree of a node
• Subnetworks/subgraphs
  – Path
  – Cycle
  – Cliques
  – Connected component
Representing a graph

• Adjacency matrix
• Adjacency list
Adjacency matrix

Adapted from Barabási & Oltvai, Nature Reviews Genetics 2004

Matrix-based representation (rows and columns ordered A–H):

    A B C D E F G H
A   0 1 1 1 1 0 0 1
B   1 0 1 0 0 0 1 0
C   1 1 0 0 0 0 0 0
D   1 0 0 0 0 0 0 0
E   1 0 0 0 0 1 0 0
F   0 0 0 0 1 0 1 0
G   0 1 0 0 0 1 0 1
H   1 0 0 0 0 0 1 0

V = {A, B, C, D, E, F, G, H}
E = set of vertex pairs (u, v) connected with an edge
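As a concrete sketch, the matrix above can be built from an edge list in a few lines of Python (the edge list is read off the matrix; the variable names are ours):

```python
# Build the slide's 8-node undirected graph as an |V| x |V| adjacency matrix.
nodes = list("ABCDEFGH")
index = {v: i for i, v in enumerate(nodes)}
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "H"),
         ("B", "C"), ("B", "G"), ("E", "F"), ("F", "G"), ("G", "H")]

n = len(nodes)
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[index[u]][index[v]] = 1
    adj[index[v]][index[u]] = 1  # undirected: store both directions

# For an undirected graph the matrix is symmetric.
symmetric = all(adj[i][j] == adj[j][i] for i in range(n) for j in range(n))
```

Note the space cost is O(|V|^2) no matter how few edges the graph has.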
Adjacency list

List-based representation; space efficient:

A: B, C, D, E, H
B: A, C, G
C: A, B
D: A
E: A, F
F: G, E
G: B, F, H
H: A, G
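The same graph as a Python dictionary of lists, a common adjacency-list idiom (names are ours):

```python
# Adjacency-list representation of the slide's undirected graph.
from collections import defaultdict

edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "H"),
         ("B", "C"), ("B", "G"), ("E", "F"), ("F", "G"), ("G", "H")]

adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)  # undirected: record the edge in both endpoint lists

# Space is O(|V| + |E|) rather than O(|V|^2), which matters for sparse graphs.
```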
Directed graphs

Edges have directionality on them. The adjacency matrix is no longer symmetric (rows and columns ordered A–H; row u, column v is 1 if there is an edge u → v):

    A B C D E F G H
A   0 0 0 1 1 0 0 0
B   1 0 0 0 0 0 1 0
C   1 1 0 0 0 0 0 0
D   0 0 0 0 0 0 0 0
E   0 0 0 0 0 0 0 0
F   0 0 0 0 1 0 1 0
G   0 0 0 0 0 0 0 1
H   1 0 0 0 0 0 0 0
Weighted graphs

Edges have weights on them; we can have directed and undirected weighted graphs. The example shown is a directed weighted graph (rows and columns ordered A, B, C, D, E, H):

     A    B    C    D    E    H
A    0    0    0   0.9  1.4   0
B   1.6   0    0    0    0    0
C   0.2  0.6   0    0    0    0
D    0    0    0    0    0    0
E    0    0    0    0    0    2
H   3.1   0    0    0    0    0
Node degree

• Undirected network
  – Degree, k: number of neighbors of a node
• Directed network
  – In-degree, k_in: number of incoming edges
  – Out-degree, k_out: number of outgoing edges

In the directed example graph, the in-degree of B is 1 and the out-degree of A is 2.
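A minimal sketch of computing in- and out-degrees, with the edge list transcribed from the directed adjacency matrix above:

```python
# In- and out-degrees from a directed edge list (the slide's example graph).
edges = [("A", "D"), ("A", "E"), ("B", "A"), ("B", "G"), ("C", "A"),
         ("C", "B"), ("F", "E"), ("F", "G"), ("G", "H"), ("H", "A")]

in_degree, out_degree = {}, {}
for u, v in edges:
    out_degree[u] = out_degree.get(u, 0) + 1  # the edge leaves u
    in_degree[v] = in_degree.get(v, 0) + 1    # the edge enters v
```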
Subgraphs

A subgraph of a graph is defined by a subset of nodes and edges. We will use subgraph and subnetwork interchangeably.

Path: a set of connected edges from one node to another
Path length: the total number of edges in a path
Shortest path: the path between two vertices with the shortest path length
[Figure: paths from B to D in the example graph]

Cycle: a path which starts and ends at the same node
[Figure: a cycle in the example graph]
Subnetworks/Subgraphs

Clique: a set of nodes that have edges between all pairs of vertices
[Figure: a 5-node clique]

Connected component: a subgraph spanning a vertex subset in which every vertex can be "reached" from every other vertex
[Figure: a graph with two connected components]
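Connected components can be found with a breadth-first search. A sketch in Python, using an illustrative two-component edge list (the exact edges in the slide's figure are assumed, not transcribed):

```python
# BFS-based connected components for an undirected graph.
from collections import deque

def connected_components(nodes, edges):
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # A BFS from any unvisited node collects exactly one component.
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components

# Illustrative graph with two components.
comps = connected_components(list("ABCDEFGH"),
                             [("A", "B"), ("B", "C"), ("C", "D"),
                              ("E", "F"), ("F", "G"), ("G", "H")])
```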
Different types of networks

• Physical networks
  – Transcriptional regulatory networks: interactions between regulatory proteins (transcription factors) and genes
  – Protein-protein: interactions among proteins
  – Signaling networks: protein-protein and protein-small molecule interactions that relay signals from outside the cell to the nucleus
• Functional networks
  – Metabolic: reactions through which enzymes convert substrates to products
  – Genetic: interactions among genes which, when perturbed together, produce a more significant phenotype than when perturbed individually
Transcriptional regulatory networks

[Figure 1: Representation of the E. coli transcriptional regulatory network. a) The transcription-factor gene regulatory network of E. coli: green circles represent transcription factors, brown circles denote regulated genes, and nodes with both functions are coloured red. Projections of the network onto b) transcription factor (TF) and c) regulated gene (RG) nodes are also shown.]
Regulatory network of E. coli: 153 TFs (green & light red), 1319 targets.

Vargas and Santillán, 2008
[Figure: transcription factors (TFs) A and B bind the DNA at gene C's promoter; as a graph, nodes A and B point to node C]

• Directed, signed, weighted graph
• Nodes: TFs and target genes
• Edges: an edge (A, B) means A regulates B's expression level
Metabolic networks

[Figure: reactions associated with galactose metabolism; metabolites and the enzymes (e.g. M, N, O) that interconvert them; source: KEGG]

• Unweighted graph
• Nodes: metabolic enzymes
• Edges: enzymes M and N share a compound
Protein-protein interaction networks

[Figure 2: Yeast protein interaction network (Barabási & Oltvai, Nature Reviews Genetics 2004). A map of protein-protein interactions in Saccharomyces cerevisiae, based on early yeast two-hybrid measurements, illustrates that a few highly connected nodes (hubs) hold the network together. The largest cluster, containing ~78% of all proteins, is shown; node colour indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).]

• Un/weighted graph
• Nodes: proteins
• Edges: protein X physically interacts with protein Y
Goals for today

• Background
  – Introductory graph theory
  – Different types of cellular networks
• Expression-based regulatory network inference
  – Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
  – Bayesian networks (Sparse Candidates)
  – Dependency networks (GENIE3)
Computational inference of transcriptional regulatory networks

• We will focus on transcriptional regulatory networks
• These networks control what genes get activated when
• Precise gene activation or inactivation is crucial for many biological processes
• Microarrays and RNA-seq allow us to systematically measure gene activity levels
Why do we need to computationally infer transcriptional networks?

• Why infer transcriptional networks?
  – To understand how gene expression is controlled
  – To understand how cells process extra-cellular cues
  – Many diseases are associated with changes in transcriptional networks
• Why do so computationally?
  – Experimental detection of networks is hard and expensive
  – It is a first step towards having an in silico model of a cell
  – A model can be used to make predictions that can be tested, and the results in turn refine the model
Methods to infer regulatory networks

• Experimental
  – ChIP-chip and ChIP-seq
  – Factor/regulator knockout followed by genome-wide transcriptome measurements
  – DNase I/ATAC-seq + motif finding
• Computational
  – Supervised network inference
  – Unsupervised network inference
    • Expression-based network inference
Types of data for reconstructing transcriptional networks

• Node-specific datasets
  – Transcriptomes: genome-wide mRNA levels
  – Can potentially recover genome-wide regulatory networks
• Edge-specific datasets
  – ChIP-chip and ChIP-seq
  – Sequence-specific motifs
  – Factor knockout followed by whole-transcriptome profiling
[Figure: a gene expression matrix (gene expression levels x samples). Shown: Figure 3, "Variation in gene expression in S. cerevisiae isolates", from "Phenotypic Variation in Yeast", PLoS Genetics 2008, 4(10):e1000223; each row is a gene and each column a strain, with red indicating higher and green lower expression than the reference.]
A simple example of transcriptional regulation

Assume HSP12's expression is dependent upon Hot1 and Sko1 binding to HSP12's promoter.

[Figure: with both Hot1 and Sko1 bound at the promoter, HSP12 is ON; with Sko1 alone, HSP12 is OFF]

Inputs: transcription factor levels (trans) and transcription factor binding sites (cis). Output: mRNA levels.
What do we want a model for a regulatory network to capture?

HSP12's expression is dependent upon Hot1 and Sko1 binding to HSP12's promoter.

Structure: who are the regulators? Hot1 regulates HSP12; HSP12 is a target of Hot1.

Function: how do the regulators determine expression levels? X3 = ψ(X1, X2), where ψ may be Boolean, linear, differential equations, probabilistic, ...
Mathematical representations of regulatory networks

Input: expression/activity of the regulators X1, X2. Output: expression of the target gene X3. Models differ in the function f that maps regulator input levels to target levels.

Boolean networks (input/output tables):

X1 X2 | X3
 0  0 |  0
 0  1 |  1
 1  0 |  1
 1  1 |  1

Differential equations (rate equations):

dX3(t)/dt = g(X1(t), X2(t)) - r X3(t)

Probabilistic graphical models (probability distributions):

P(X3 | X1, X2) = N(a X1 + b X2, σ)
Regulatory network inference from expression
[Figure 1: The DREAM5 network inference challenge. Anonymized gene expression data (from experiments such as knockouts, overexpression, and antibiotic treatments, plus simulated data) are provided to many inference methods; the inferred networks are (4) integrated into a community network, and (5) performance is evaluated against experimentally determined interactions (ChIP, motifs, etc.) not used by the methods.]
Expression-based network inference

Input: an expression matrix (genes x experiments); entry (i, j) is the expression level of gene i in experiment j.

Output:
• Structure: which regulators (e.g. X1, X2) are connected to which target (e.g. X3)
• Function: X3 = f(X1, X2)
Expression-based regulatory network inference

• Given
  – A set of measured mRNA levels across multiple biological samples
• Do
  – Infer the regulators of genes
  – Infer how regulators specify the expression of a gene
• Algorithms for network reconstruction vary based on what they mean by an "interaction"
Two classes of expression-based methods

• Per-gene/direct methods (today)
• Module-based methods (Thursday)

[Figure: per-gene methods connect regulators X1, X2 directly to individual targets such as X3, X4, X5; module-based methods connect regulators to a module of genes]
Per-gene methods

• Key idea: find the regulators that "best explain" the expression level of a gene
• Probabilistic graphical methods
  – Bayesian networks
    • Sparse Candidates
  – Dependency networks
    • GENIE3, TIGRESS
• Information-theoretic methods
  – Context Likelihood of Relatedness
  – ARACNE
Module-based methods

• Find regulators for an entire module
  – Assume genes in the same module have the same regulators
• Module Networks (Segal et al. 2005)
• Stochastic LeMoNe (Joshi et al. 2008)

[Figure: per-module inference connects regulators Y1, Y2 to a module of genes X1, X2]
A non-exhaustive list of expression-based network inference methods

Method name        Per-module  Per-gene  Parameters  Model type
Sparse Candidates              ✓         ✓           Bayesian network
CLR                            ✓                     Information-theoretic
ARACNE                         ✓                     Information-theoretic
TIGRESS                        ✓                     Dependency network
Inferelator                    ✓         ✓           Dependency network
GENIE3                         ✓                     Dependency network
Module Networks    ✓                     ✓           Bayesian network
LemonTree          ✓                                 Dependency network
WGCNA              ✓                                 Correlation
Goals for today

• Background
  – Introductory graph theory
  – Different types of cellular networks
• Expression-based regulatory network inference
  – Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
  – Bayesian networks (Sparse Candidates)
  – Dependency networks (GENIE3)
Probabilistic graphical models (PGMs)

• A marriage between probability and graph theory
• Nodes on the graph represent random variables
• The graph structure specifies the statistical dependency structure
• The graph parameters specify the nature of the dependency
• PGMs can be directed or undirected
• Examples of PGMs: Bayesian networks, dependency networks, Markov networks, factor graphs
Different types of probabilistic graphs

• In each graph type we can assert different conditional independencies
• Correlation networks
• Gaussian graphical models
• Dependency networks
• Bayesian networks
Correlational networks

• An undirected graph
• Edges represent high correlation
  – Need to determine what "high" is
• Edge weights denote different values of correlation
• Cannot discriminate between direct and indirect correlations

The random variables represent gene expression levels (e.g. X1 = Msb2, X2 = Sho1, X3 = Ste20). The result is an undirected weighted graph; the edge weights (e.g. w13, w23) are a measure of statistical correlation (e.g. Pearson's correlation).
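A sketch of building a correlational network in Python: compute pairwise Pearson correlations on toy expression data and keep the edges above a chosen cutoff. The data, gene names, and the 0.8 threshold are illustrative assumptions, not from the slides:

```python
# Correlational network: edges = gene pairs with |Pearson r| above a cutoff.
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [a + 0.1 * random.gauss(0, 1) for a in x1]  # strongly co-expressed with X1
x3 = [random.gauss(0, 1) for _ in range(200)]    # an unrelated gene

genes = {"X1": x1, "X2": x2, "X3": x3}
threshold = 0.8  # what counts as "high" must be chosen by the analyst
names = list(genes)
edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if abs(pearson(genes[a], genes[b])) >= threshold]
```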
Popular examples of correlational networks

• Weighted Gene Co-expression Network Analysis (WGCNA) – Zhang and Horvath, 2005
• Relevance Networks – Butte & Kohane, 2000, Pacific Symposium on Biocomputing
Limitations of correlational networks

• Correlational networks cannot distinguish between direct and indirect dependencies
• This makes them less interpretable than other PGMs

[Figure: Different mechanisms can explain coexpression (BMC Bioinformatics 2007, 8(Suppl 6):S5). Three coexpressed genes X, Y, Z form a clique in the coexpression graph; three regulatory relationships can explain the coexpression: the genes are regulated in a cascade, or one gene regulates both others, or there is a common "hidden" regulator H, which is not part of the model.]

[Figure: A small Gaussian graphical model. Missing edges between nodes indicate independencies of two genes given all the other genes in the model: X ⊥ W | {Y, Z}, Y ⊥ W | {X, Z}, and X ⊥ Z | {Y, W}.]

• For any co-expression network, there are several possible regulatory networks that can explain the observed correlations.
• What we would like is to be able to discriminate between direct and indirect dependencies.
• For this we need to review conditional independencies.
Conditional independencies in PGMs

• The different classes of models we will see are based on a general notion of specifying statistical independence
• Suppose we have two genes X and Y. We add an edge between X and Y if X and Y are not independent given a third set Z.
• Depending upon Z we will have a family of different PGMs
Conditional independence and PGMs

• Correlational networks
  – Z is the empty set
• Markov networks
  – X and Y are not independent given all other variables
  – Gaussian graphical models are a special case
• Dependency networks
  – Approximate Markov networks
  – May not be associated with a valid joint distribution
• First-order conditional independence models
  – Explain the correlation between two variables by a third variable
• Bayesian networks
  – Generalize first-order conditional independence models
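The first-order conditional independence idea can be illustrated with partial correlation: in a simulated cascade X → Y → Z, X and Z are marginally correlated but become nearly uncorrelated once Y is controlled for. The simulation and function names are ours:

```python
# Partial correlation separates direct from indirect dependencies.
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

def partial_corr(x, z, given):
    """First-order partial correlation of x and z, controlling for one variable."""
    rxz, rxg, rzg = pearson(x, z), pearson(x, given), pearson(z, given)
    return (rxz - rxg * rzg) / sqrt((1 - rxg ** 2) * (1 - rzg ** 2))

# Simulated cascade X -> Y -> Z: X affects Z only through Y.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(500)]
y = [a + 0.3 * random.gauss(0, 1) for a in x]
z = [b + 0.3 * random.gauss(0, 1) for b in y]
```

Here the coexpression graph would wrongly put an edge between X and Z; conditioning on Y removes it.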
Bayesian networks (BNs)

• A special type of probabilistic graphical model
• Has two parts:
  – A graph which is directed and acyclic
  – A set of conditional distributions
• Directed Acyclic Graph (DAG)
  – The nodes denote random variables X1 ... XN
  – The edges
    • encode statistical dependencies between the random variables
    • establish parent-child relationships
• Each node Xi has a conditional probability distribution (CPD) representing P(Xi | Pa(Xi)), where Pa denotes parents
• Provides a tractable way to represent large joint distributions
Key questions in Bayesian networks

• What do the CPDs look like?
• What independence assertions can be made in Bayesian networks?
• How can we learn them from expression data?
Notation

• Xi: the ith random variable
• If there are few random variables, we will just use uppercase letters, e.g. A, B, C
• X = {X1, ..., Xp}: set of p random variables
• x_i^k: an assignment of Xi in the kth sample
• x_-i^k: the set of assignments to all variables other than Xi in the kth sample
• D: a dataset constituting the collection of all joint assignments
An example Bayesian network

Adapted from Kevin Murphy, Intro to Graphical Models and Bayes Networks: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

Nodes: Cloudy (C), Sprinkler (S), Rain (R), WetGrass (W). C is a parent of S and R; S and R are parents of W.

P(C=F)  P(C=T)
 0.5     0.5

C | P(S=F)  P(S=T)
F |  0.5     0.5
T |  0.9     0.1

C | P(R=F)  P(R=T)
F |  0.8     0.2
T |  0.2     0.8

S R | P(W=F)  P(W=T)
F F |  1.0     0.0
T F |  0.1     0.9
F T |  0.1     0.9
T T |  0.01    0.99
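A sketch of this network's factored joint distribution in Python, with the CPTs transcribed from the tables above (variable names are ours):

```python
# Sprinkler network: P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R).
from itertools import product

p_c_true = 0.5                                    # P(C = T)
p_s_true_given_c = {False: 0.5, True: 0.1}        # P(S = T | C)
p_r_true_given_c = {False: 0.2, True: 0.8}        # P(R = T | C)
p_w_true_given_sr = {(False, False): 0.0, (True, False): 0.9,
                     (False, True): 0.9, (True, True): 0.99}  # P(W = T | S, R)

def bern(p_true, value):
    """Probability that a binary variable takes `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    # Multiply one factor per node, each conditioned on its parents.
    return (bern(p_c_true, c) * bern(p_s_true_given_c[c], s)
            * bern(p_r_true_given_c[c], r) * bern(p_w_true_given_sr[(s, r)], w))

# Sanity check: the factored joint sums to 1 over all 2^4 assignments.
total = sum(joint(*assign) for assign in product([False, True], repeat=4))
```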
Expression data matrix

p genes (rows) x N experiments/time points (columns).
A column holds the observations (expression levels) of all variables in sample k; a row holds the observations of variable Xi in all N experiments.
x_-j^k: the assignments to all variables other than Xj in sample k.
Bayesian networks compactly represent joint distributions

P(X1, ..., Xp) = ∏_{i=1}^{p} P(Xi | Pa(Xi))
Example Bayesian network of 5 variables

[Figure: X1 and X2 are parents of child X3; X3 and X4 are parents of child X5]

With no independence assertions:
P(X) = P(X1, X2, X3, X4, X5)
Assuming each Xi is binary, this needs 2^5 measurements.

With the independence assertions of the graph:
P(X) = P(X1) P(X2) P(X4) P(X3 | X1, X2) P(X5 | X3, X4)
The largest factor, e.g. P(X3 | X1, X2), needs only 2^3 measurements.
CPDs in Bayesian networks

• The CPD P(Xi | Pa(Xi)) specifies a distribution over values of Xi for each combination of values of Pa(Xi)
• The CPD P(Xi | Pa(Xi)) can be parameterized in different ways
• If the Xi are discrete random variables
  – Conditional probability table or tree
• If the Xi are continuous random variables
  – The CPD can be a conditional Gaussian or a regression tree
• Consider four binary variables X1, X2, X3, X4
Representing CPDs as tables

[Figure: a DAG in which X1, X2, X3 are the parents of X4, i.e. Pa(X4) = {X1, X2, X3}.]

P(X4 | X1, X2, X3) as a table:

X1  X2  X3 | P(X4=t)  P(X4=f)
 t   t   t |   0.9      0.1
 t   t   f |   0.9      0.1
 t   f   t |   0.9      0.1
 t   f   f |   0.9      0.1
 f   t   t |   0.8      0.2
 f   t   f |   0.5      0.5
 f   f   t |   0.5      0.5
 f   f   f |   0.5      0.5
Estimating a CPD table from data
• Assume we observe the following N = 7 assignments for X1, X2, X3, X4:

X1 X2 X3 X4
T  F  T  T
T  T  F  T
T  T  F  T
T  F  T  T
T  F  T  F
T  F  T  F
F  F  T  F

For each joint assignment to X1, X2, X3, estimate the probabilities for each value of X4.
For example, consider X1=T, X2=F, X3=T:
P(X4=T | X1=T, X2=F, X3=T) = 2/4
P(X4=F | X1=T, X2=F, X3=T) = 2/4
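The counting procedure on this slide takes a few lines; `estimate_cpt` is a hypothetical helper name (not from the lecture), and the data below are the seven samples shown above:

```python
from collections import Counter, defaultdict

def estimate_cpt(samples, parents, child):
    """Maximum-likelihood CPT: for each joint parent assignment,
    the empirical distribution of the child's values.
    samples: list of dicts mapping variable name -> value."""
    counts = defaultdict(Counter)
    for s in samples:
        counts[tuple(s[p] for p in parents)][s[child]] += 1
    cpt = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        cpt[key] = {v: c / total for v, c in ctr.items()}
    return cpt

# The seven samples from the slide.
data = [dict(zip(["X1", "X2", "X3", "X4"], row)) for row in
        [("T","F","T","T"), ("T","T","F","T"), ("T","T","F","T"),
         ("T","F","T","T"), ("T","F","T","F"), ("T","F","T","F"),
         ("F","F","T","F")]]
cpt = estimate_cpt(data, ["X1", "X2", "X3"], "X4")
```

For the parent assignment (T, F, T) this reproduces the 2/4 vs 2/4 split worked out on the slide.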
Bayesian network representation of a regulatory network

[Figure: inside the cell, the regulators Sko1 and Hot1 control the target Hsp12. As a Bayesian network, the regulators (parents) X1 and X2, with distributions P(X1) and P(X2), point to the target (child) X3, with CPD P(X3 | X1, X2). The random variables correspond to Sko1, Hot1, and HSP12.]
Learning problems in Bayesian networks
• Parameter learning on a known graph structure
– Given a set of joint assignments of the random variables, estimate the parameters of the model
• Structure learning
– Given a set of joint assignments of the random variables, estimate the structure and parameters of the model
– Structure learning subsumes parameter learning
Structure learning using score-based search

[Figure: candidate networks B1, B2, B3, ..., Bm, each evaluated with Score(B1), Score(B2), Score(B3), ..., Score(Bm).]

Score(B) describes how well B describes the data.
Exhaustive search is not computationally tractable.
Scores for Bayesian networks
• Maximum likelihood:
Score_ML = max_θ log P(D | G, θ)
• Regularized maximum likelihood (BIC):
Score_BIC = max_θ log P(D | G, θ) − (d/2) log N
d: no. of parameters, N: no. of samples
• Bayesian score:
Score_Bayes = P(G | D) ∝ P(D | G) P(G)
Decomposability of scores
• The score of a Bayesian network B decomposes over individual variables
• Enables efficient computation of the score change due to local changes

Score(B) = Σ_i S(Xi),  where  S(Xi) = ∏_{d=1}^N P(Xi = x_i^d | Pa(Xi) = pa_{Xi}^d)

pa_{Xi}^d: joint assignment to Pa(Xi) in the dth sample
Graph structure learning from expression is a computationally difficult problem
• Given 2 TFs and 3 nodes, how many possible networks can there be?

[Figure: a (non-exhaustive) set of the possible networks.]

There can be a total of 2^6 = 64 possible networks.
Greedy hill climbing to search Bayesian network space
• Input: data D, an initial Bayesian network B0 = {G0, Θ0}
• Output: Bbest
• Loop for r = 1, 2, ... until convergence:
– {B_r^1, ..., B_r^m} = Neighbors(B_r), obtained by making local changes to B_r
– B_{r+1} = argmax_j Score(B_r^j)
• Termination:
– Bbest = B_r
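The loop above can be sketched as follows (all helper names are assumptions; the score function is pluggable, e.g. a BIC score). Neighbor networks come from single-edge additions and deletions, with a cycle check on additions so the graph stays acyclic.

```python
from itertools import permutations

def creates_cycle(edges, new_edge):
    # Adding (u, v) closes a directed cycle iff u is already reachable from v.
    u, v = new_edge
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(w for (x, w) in edges if x == node)
    return False

def neighbors(nodes, edges):
    for e in edges:                       # single-edge deletions
        yield edges - {e}
    for u, v in permutations(nodes, 2):   # single-edge additions
        if (u, v) not in edges and not creates_cycle(edges, (u, v)):
            yield edges | {(u, v)}

def hill_climb(nodes, score, edges=frozenset()):
    # Greedily move to the best-scoring neighbor until no neighbor improves.
    while True:
        best = max(neighbors(nodes, edges), key=score, default=edges)
        if score(best) <= score(edges):
            return edges
        edges = best
```

Real implementations exploit score decomposability so that only the terms of the variables touched by a local change are recomputed.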
Local changes to Br

[Figure: the current network Br over nodes A, B, C, D, and two neighbors: B_r^1 obtained by adding an edge, and B_r^2 obtained by deleting an edge. Edge additions must be checked for cycles.]
Challenges with applying Bayesian networks to genome-scale data
• The number of variables, p, is in the thousands
• The number of samples, N, is in the hundreds
Bayesian network-based methods to handle genome-scale networks
• Sparse candidate algorithm
– Friedman, Nachman, Pe'er 1999
– Friedman, Linial, Nachman, Pe'er 2000
• Module networks
– Segal, Pe'er, Regev, Koller, Friedman 2005
The Sparse Candidate algorithm: structure learning in Bayesian networks
• A fast Bayesian network learning algorithm
• Key idea: identify k "promising" candidate parents for each Xi
– k << p, p: number of random variables
– Candidates define a "skeleton graph" H
• Restrict the graph structure to select parents from H
• Early choices in H might exclude other good parents
– Resolve using an iterative algorithm
Sparse candidate algorithm
• Input:
– A dataset D
– An initial Bayes net B0
– A parameter k: max number of parents per variable
• Output:
– Final B_r
• Loop for r = 1, 2, ... until convergence:
– Restrict
• Based on D and B_{r-1}, select candidate parents C_i^r for Xi
• This defines a skeleton directed network H_r
– Maximize
• Find the network B_r that maximizes Score(B_r) among networks satisfying Pa(Xi) ⊆ C_i^r
• Termination: return B_r
Selecting candidate parents in the Restrict step
• A good parent for Xi is one with strong statistical dependence with Xi
– Mutual information I(Xi; Xj) provides a good measure of statistical dependence
– Mutual information should be used only as a first approximation
• Candidate parents need to be iteratively refined to avoid missing important dependences
• A good parent for Xi has the highest score improvement when added to Pa(Xi)
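The first approximation in the Restrict step can be sketched with empirical mutual information over discretized expression profiles (`candidate_parents` is an assumed name, not code from the paper):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X;Y) estimated from empirical co-occurrence counts (natural log)
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((nxy / n) * math.log((nxy / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), nxy in pxy.items())

def candidate_parents(profiles, target, k):
    """Rank the k genes whose discretized profiles share the most
    mutual information with the target gene's profile."""
    scores = {g: mutual_information(p, profiles[target])
              for g, p in profiles.items() if g != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

This is exactly the "first approximation" the slide warns about: an important parent whose effect is masked by another regulator can score low on marginal MI, which is why the candidate sets are re-estimated across iterations.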
Sparse candidate learns good networks faster than hill-climbing

[Figure 4 of Friedman et al. 1999, panels Text 100, Text 200, and Clc 800: performance of the different algorithms on the text and biological domains. The top row plots score (y-axis) vs. running time (x-axis); the bottom row shows the same runs as score vs. number of sufficient statistics computed. The methods vary in the candidate selection measure (Disc: discrepancy, Shld: shielding, Score: score-based) and the size of the candidate set, and are compared against greedy hill climbing. The text datasets use the 99 or 199 most common words (plus the newsgroup designator) from 10,000 messages of the 20-newsgroups data; the third dataset is synthetic data derived from real gene expression data.]

Dataset 1: 100 variables. Dataset 2: 200 variables.
Greedy hill climbing takes much longer to reach a high-scoring Bayesian network (score: higher is better).
Some comments about choosing candidates
• How to select k in the sparse candidate algorithm?
• Should k be the same for all Xi?
• Regularized regression approaches can be used to estimate the structure of an undirected graph
– Schmidt, Niculescu-Mizil, Murphy 2007
– Estimate an undirected dependency network
– Learn a Bayesian network constrained on the dependency network structure
Assessing confidence in the learned network
• Typically the number of training samples is not sufficient to reliably determine the "right" network
• One can however estimate the confidence of specific features of the network
– Graph features f(G)
• Examples of f(G)
– An edge between two random variables
– Order relations: is X an ancestor of Y?
How to assess confidence in graph features?
• What we want is P(f(G) | D), which is

Σ_G f(G) P(G | D)

• But it is not feasible to compute this sum
• Instead we will use a "bootstrap" procedure
Bootstrap to assess graph feature confidence
• For i = 1 to m
– Construct dataset Di by sampling with replacement N samples from dataset D, where N is the size of the original D
– Learn a network Bi = {Gi, Θi}
• For each feature of interest f, calculate the confidence

Conf(f) = (1/m) Σ_{i=1}^m f(Gi)
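The procedure above, as a sketch (the `learner` argument is a stand-in for any structure-learning routine, e.g. sparse candidate; the names are mine):

```python
import random
from collections import Counter

def bootstrap_edge_confidence(data, learner, m=100, seed=0):
    """data: list of samples; learner: samples -> set of edges.
    Returns Conf(edge) = fraction of bootstrap networks containing it."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(m):
        # N draws with replacement, N = size of the original dataset
        resampled = [rng.choice(data) for _ in data]
        for edge in learner(resampled):
            counts[edge] += 1
    return {edge: c / m for edge, c in counts.items()}
```

An edge feature is simply the indicator "edge present in Gi", so its confidence is the fraction of the m bootstrap networks that contain it.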
Does the bootstrap confidence represent real relationships?
• Compare the confidence distribution to that obtained from randomized data
• Shuffle the columns of each row (gene) separately
• Repeat the bootstrap procedure

[Figure: a genes x experimental-conditions expression matrix in which each row is randomized independently.]
Bootstrap-based confidence differs between real and randomized data

[Figure 3 of Friedman et al. 2000: plots of the abundance of features with different confidence levels for the cell cycle dataset (solid line) and the randomized dataset (dotted line). The x-axis denotes the confidence threshold and the y-axis the number of features with confidence equal to or higher than that threshold. The left column shows Markov features and the right column Order features; the top row uses the multinomial model and the bottom row the linear-Gaussian model. Inset in each graph is a plot of the tail of the distribution.]

Friedman et al. 2000
Example of a high-confidence sub-network

[Figure: (left) one learned Bayesian network, an unconstrained acyclic network in which each gene can have a different regulator set (a fragment of a network learned by Pe'er et al.); (right) a summary of direct-neighbor relations among the same genes based on bootstrap estimates, with degrees of confidence denoted by edge thickness. The bootstrapped-confidence network highlights a subnetwork of genes with high-confidence relations that are involved in the yeast mating pathway; colors indicate genes with known functions in mating, including signal transduction (yellow), transcription factors (blue), and downstream effectors (green).]

Nir Friedman, Science 2004
Goals for today
• Background
– Introductory graph theory
– Different types of cellular networks
• Expression-based regulatory network inference
– Per-gene vs per-module methods
• Probabilistic graphical models to represent regulatory networks
– Sparse Candidate Bayesian networks
– GENIE3: regression-based methods
Limitations of Bayesian networks
• Cannot model cyclic dependencies
• In practice have not been shown to be better than dependency networks
– However, most of the evaluation has been done on structure, not function
• Directionality is often not associated with causality
– Too many hidden variables in biological systems
Dependency network
• A type of probabilistic graphical model
• Approximate Markov networks
– Are much easier to learn from data
• As in Bayesian networks, has
– A graph structure
– Parameters capturing dependencies between a variable and its parents
• Unlike Bayesian networks
– Can have cyclic dependencies
– Computing a joint probability is harder
• It is approximated with a "pseudo" likelihood

Dependency Networks for Inference, Collaborative Filtering and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000
Original motivation of dependency networks
• Introduced by Heckerman, Chickering, Meek, et al. 2000
• Oftentimes Bayesian networks can get confusing
– The Bayesian networks learned represent correlational or predictive relationships
– But the directionality of the edges is mistakenly interpreted as causal connections
• (Consistent) dependency networks were introduced to distinguish between these cases
Dependency network vs Bayesian network

[Figure from Heckerman et al. 2000: (a) a Bayesian network and (b) the corresponding dependency network over the variables Age, Gender, and Income.]

Oftentimes, the Bayesian network on the left is read as if "Age" determines "Income". However, all this model is capturing is that "Age" is predictive of "Income".

Dependency Networks for Inference, Collaborative Filtering and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000
Learning dependency networks: estimate the best predictors of a variable

[Figure: unknown candidate regulators ("?") feed into a function fj that predicts Xj.]

• fj can be of different types
• Learning requires estimation of each of the fj functions
• In all cases, learning requires us to minimize an error of predicting Xj from its neighborhood
Different representations of fj
• If Xj is continuous
– fj can be a linear function
– fj can be a regression tree
– fj can be an ensemble of trees
• E.g. random forests
• If Xj is discrete
– fj can be a conditional probability table
– fj can be a conditional probability tree
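As a deliberately tiny sketch of the continuous case (a stand-in for the lasso or tree fits used by real methods; the function names are my own): learn each fj as the single best least-squares linear predictor of Xj among the other genes.

```python
def fit_linear(x, y):
    # least-squares slope a, intercept b, and squared error for y ~ a*x + b
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) or 1e-12  # guard constant profiles
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b = my - a * mx
    sse = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return a, b, sse

def best_predictor(profiles, target):
    """Return the gene whose linear fit to the target has the lowest error."""
    fits = {g: fit_linear(p, profiles[target])[2]
            for g, p in profiles.items() if g != target}
    return min(fits, key=fits.get)
```

Real implementations fit all candidate regulators jointly with a sparsity constraint rather than one at a time, but the objective is the same: minimize the error of predicting Xj from its neighborhood.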
Popular dependency network implementations
• Learned by solving a set of linear regression problems
– TIGRESS (Haury et al. 2010)
• Uses a constraint to learn a "sparse" Markov blanket
• Uses "stability selection" to estimate the confidence of edges
• Learned by solving a set of non-linear regression problems
– Non-linearity captured by a regression tree (Heckerman et al. 2000)
– GENIE3: non-linearity captured by a random forest (Huynh-Thu et al. 2010)
– Inferelator (Bonneau et al., Genome Biology 2005)
• Can handle time course and single time point data
• Non-linear regression is done using a logistic transform
• Handles linear and non-linear regression
GENIE3: GEne Network Inference with Ensemble of trees
• Solves a set of regression problems
– One per random variable
• Uses an ensemble of regression trees to represent fj
– Models non-linear dependencies
• Outputs a directed, cyclic graph with a confidence for each edge
• Focuses on generating a ranking over edges rather than a graph structure and parameters

Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. Van Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Pierre Geurts, PLoS ONE 2010
A regression tree
• A rooted binary tree T
• Each node in the tree is either an interior node or a leaf node
• Interior nodes are labeled with a binary test Xi < u, where u is a real number observed in the data
• Leaf nodes are associated with univariate distributions of the child
A very simple regression tree for two variables

[Figure: a tree specifying X3 as a function of X2. The root is the interior node X2 > e1; if NO, a leaf with distribution (μ1, σ1); if YES, a second interior node X2 > e2; if NO, a leaf (μ2, σ2); if YES, a leaf (μ3, σ3). Interior nodes hold the tests; leaf nodes hold the univariate distributions.]
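The example tree above can be encoded directly. A minimal sketch, where the thresholds e1 < e2 and the leaf means are made-up illustrative numbers:

```python
def make_leaf(mu):
    return ("leaf", mu)

def make_node(threshold, no, yes):
    # interior node testing x2 > threshold
    return ("node", threshold, no, yes)

def predict(tree, x2):
    # walk the tree until a leaf is reached; return the leaf's mean
    if tree[0] == "leaf":
        return tree[1]
    _, threshold, no, yes = tree
    return predict(yes if x2 > threshold else no, x2)

# X2 > e1? NO -> mu1; YES -> (X2 > e2? NO -> mu2; YES -> mu3)
e1, e2 = 0.5, 1.5
tree = make_node(e1, make_leaf(0.1),
                 make_node(e2, make_leaf(1.0), make_leaf(2.3)))
```

Each leaf would also carry a variance in the full model; prediction uses the leaf mean.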
Learning a regression tree
• Assume we are searching for the neighbors of a variable X3, and it already has two parents, X1 and X2
• X4 will be considered as a new neighbor using "split" operations on existing leaf nodes

[Figure: before the split, a tree with tests X1 > e1 and X2 > e2 and leaves N1, N2, N3, where Nl is the Gaussian associated with leaf l. After splitting a leaf with the new test X4 > e3, leaves N2 and N3 remain and the split leaf is replaced by two new leaves, N4 and N5.]
An Ensemble of trees
• A single tree is prone to "overfitting"
• Instead of learning a single tree, ensemble models make use of a collection of trees
– The prediction is the average of the individual tree predictions

A Random forest: an Ensemble of Trees

[Figure: trees 1 through T, each consisting of split nodes and leaf nodes; each tree is applied to the same input x_{-j} and outputs a prediction μt. Taken from the ICCV09 tutorial by Kim, Shotton and Stenger: http://www.iis.ee.ic.ac.uk/~tkkim/iccv09_tutorial]

x̂_j = (1/T) Σ_{t=1}^T μ_t
GENIE3 algorithm sketch
• For each Xj, generate learning samples of input/output pairs
– LSj = {(x_{-j}^k, x_j^k), k = 1..N}
– On each LSj, learn fj to predict the value of Xj
– fj is an ensemble of regression trees
– Estimate wij for all genes i ≠ j
• wij quantifies the confidence of the edge between Xi and Xj
• Associated with the decrease in variance of Xj when Xi is included in fj
• Generate a global ranking of edges based on each wij
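A stdlib-only caricature of this procedure (the real method fits full random forests; the stump-based importance, the bootstrap loop, and all names below are simplifying assumptions): each candidate regulator is credited with the output-variance reduction achieved by its best single split, accumulated over bootstrap rounds, and edges are ranked by the accumulated weights w_ij.

```python
import random

def variance(vals):
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def best_split_gain(x, y):
    # best single-threshold split of y by x, scored by variance reduction
    base = variance(y) * len(y)
    gains = []
    for t in set(x):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gains.append(base - variance(left) * len(left)
                          - variance(right) * len(right))
    return max(gains) if gains else 0.0

def genie3_like(profiles, rounds=50, seed=0):
    """profiles: dict gene -> list of expression values.
    Returns edges (regulator, target) ranked by accumulated importance."""
    rng = random.Random(seed)
    n = len(next(iter(profiles.values())))
    weights = {}
    for target, y_all in profiles.items():
        for _ in range(rounds):
            idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
            y = [y_all[i] for i in idx]
            for reg, x_all in profiles.items():
                if reg == target:
                    continue
                x = [x_all[i] for i in idx]
                weights[(reg, target)] = (weights.get((reg, target), 0.0)
                                          + best_split_gain(x, y))
    return sorted(weights, key=weights.get, reverse=True)
```

Note that, as in GENIE3 itself, the output is a ranked edge list with weights, not a parameterized graphical model.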
GENIE3 algorithm sketch

[Figure 1 of Huynh-Thu et al.: the GENIE3 procedure. Each gene's expression is modeled as x_j^k = f_j(x_{-j}^k) + ε_k, where ε_k is random noise with zero mean and f_j only exploits the expression of the direct regulators of gene j. Recovering the regulators of gene j thus amounts to finding the genes whose expression is predictive of the target gene's expression: a feature selection/ranking problem in regression. For each gene j = 1, ..., p, a learning sample LSj is generated with the expression levels of j as output values and the expression levels of all other genes as input values; a function fj is learned from LSj by minimizing the square error Σ_k (x_j^k − f_j(x_{-j}^k))^2 with tree-based ensemble methods, and a local predictor ranking of all genes except j is computed. The p local rankings are then aggregated (after appropriate normalization of each expression vector) to get a global ranking of all regulatory links.]

Figure from Huynh-Thu et al.
Evaluation of network inference methods
• Assume we know what the "right" network is
• One can use precision-recall curves to evaluate the predicted network
• The area under the PR curve (AUPR) quantifies performance

Precision = # of correct edges / # of predicted edges
Recall = # of correct edges / # of true edges
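The two formulas above, applied at every cutoff of a ranked edge list, give the PR curve; a sketch (step-wise AUPR approximation, names my own):

```python
def precision_recall(ranked_edges, true_edges):
    # precision and recall at each rank cutoff k = 1, 2, ...
    points, correct = [], 0
    for k, e in enumerate(ranked_edges, start=1):
        if e in true_edges:
            correct += 1
        points.append((correct / k, correct / len(true_edges)))
    return points  # list of (precision, recall)

def aupr(points):
    # step-wise approximation of the area under the PR curve
    area, prev_recall = 0.0, 0.0
    for precision, recall in points:
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```

A perfect ranking (all true edges first) gives an AUPR of 1; random rankings of sparse networks give values close to the edge density, which is why AUPR is preferred over ROC-based measures here.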
DREAM: Dialogue for reverse engineering assessments and methods
Community effort to assess regulatory network inference

[Figure 1 of Marbach et al.: the DREAM5 network inference challenge. (1) Anonymized gene expression compendia for an in silico benchmark network and real networks (E. coli, S. cerevisiae) are provided. (2) Inference methods are applied blinded by participating teams, alongside off-the-shelf methods. (3) Each method yields an inferred network. (4) The predictions are integrated into a community network. (5) Performance is evaluated against the true in silico network and experimentally determined interactions.]

DREAM5 challenge. Previous challenges: 2006, 2007, 2008, 2009, 2010.
Marbach et al. 2012, Nature Methods
Where do different methods rank?

[Figure 2 of Marbach et al. 2012: evaluation of network inference methods. (a) Performance on the individual networks (area under the precision-recall curve, AUPR, for in silico, E. coli, and S. cerevisiae) and an overall score summarizing performance across networks, for methods grouped as regression, mutual information (MI), correlation, Bayesian networks, other, and meta, together with random (R) and integrated community (C) predictions. (b) Methods grouped according to the similarity of their predictions via principal-component analysis: C1 regression, C2 Bayesian, C3 MI & correlation, C4 other. (c) A heat map of method-specific biases in predicting network motifs (knockouts, directionality, feed-forward loops, cascades, fan-in, fan-out); red and blue mark interactions that are easier and harder to detect, respectively.]

These approaches were mostly per-gene.
Methods tend to cluster together.

Marbach et al., 2012
[From Marbach et al. 2012, Nature Methods 9:796-804: methods that explicitly used transcription-factor knockouts recovered the targets of deleted transcription factors more reliably, and such knockouts also helped to draw the direction of edges between transcription factors. Community networks outperform individual inference methods: the integrated community network, built by rescoring interactions according to their average rank across all methods, ranks first for in silico, third for E. coli, and sixth for S. cerevisiae out of the 35 applied inference methods, and has by far the best overall score. The community approach performs robustly across diverse data sets, improves with the number and diversity of the integrated methods, and is robust to the inclusion of a limited subset (up to ~20%) of poorly performing methods.]

These approaches were mostly per-gene.
Marbach et al., 2012
Take-away points
• Network inference from expression provides a promising approach to identify cellular networks
• Graphical models are one representation of networks that have a probabilistic and a graphical component
– Network inference naturally translates to learning problems in these models
• Different types of algorithms learn these networks
– Per-gene methods
• Sparse Candidate, GENIE3: learn regulators for individual genes
– Per-module methods
• Module networks: learn regulators for sets of genes/modules
• Successful application of Bayesian networks to expression data requires additional considerations
– Reduce the set of potential parents
– Bootstrap-based confidence estimation
• Network reconstruction from expression alone is challenging
– New methods aim to combine other types of measurements (next lecture)
References
• Kim, Harold D., Tal Shay, Erin K. O'Shea, and Aviv Regev. "Transcriptional Regulatory Circuits: Predicting Numbers from Alphabets." Science 325 (July 2009): 429-432.
• De Smet, Riet and Kathleen Marchal. "Advantages and limitations of current network inference methods." Nature Reviews Microbiology 8 (October 2010): 717-729.
• Markowetz, Florian and Rainer Spang. "Inferring cellular networks - a review." BMC Bioinformatics 8 Suppl 6 (2007): S5.
• N. Friedman, M. Linial, I. Nachman, and D. Pe'er. "Using Bayesian networks to analyze expression data." Journal of Computational Biology, vol. 7, no. 3-4, pp. 601-620, Aug. 2000. http://dx.doi.org/10.1089/106652700750050961