Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
582746 Modelling and Analysis in Bioinformatics
Lecture 1: Global Network Models
26.10.2015
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
Course topics
I Computational models for biological networks (Leena Salmela)
I Gene regulation (Antti Honkela)
I Probabilistic analysis of sequence level problems(Veli Makinen)
Practical arrangements
I Mondays: Lectures to introduce the topics
I Thursday mornings: Study group to deepen the knowledge onthe subject
I Thursday afternoons: Exercise sessions
I 23.11.-27.11. Visiting lecturers (no exercise session)
How to pass the course?
I Attending study groups on Thursday mornings is mandatory
I Attending visiting lectures on Monday 23.11. and Thursday26.11. is mandatory
I Submit the exercises and get at least 6 points for each threeexercise sets (network models, gene regulation, probabilisticanalysis of sequence-level problems)
I If you miss a study group or visiting lecture, contact thelecturers for an alternative assignment
Grading
I Grading is based on submitted exercises
I 60 points will be available
I 30 points =⇒ Passed, 50 points =⇒ 5
I No exam
I Not possible to pass with a separate exam
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
Protein-protein interaction network
I Vertices are proteins
I The proteins are connected if they interact with each other.
Metabolic network
I Vertices are metabolites,i.e. chemical compounds
I Edges describe how thecell can transform ametabolite into another
Gene regulatory networkI Vertices are genesI Genes are linked if one regulates the other
Source: Shen-Orr SS, Milo R, Mangan S, Alon U. 2002
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
(Shortest path) distance
I Distance dij is the length of theshortest path between vertices ni
and nj , i.e. the minimal number ofedges one needs to traverse to getfrom ni to nj
I The shortest path may not beunique, but the length of theshortest path is unique
I In directed network, we may havedij 6= dji
I If there is no path between ni andnj , we have dij =∞
Ignoring weights, above wehave d12 = 2,d13 = 1,d14 = 2,. . .
Diameter and average path length
I The diameter dm = max(dij ) isthe maximal distance between anytwo nodes (= the longest shortestpath)
I Average or characteristic pathlengthd = 〈dij〉 = 1
N2V
∑NVi=1
∑NVj=1 dij Ignoring weights, above
dm = 3, d ≈ 1.57
Efficiency
I Efficiency, or average inverse pathlength: deff = 〈1/dij〉
I Useful when average path lengthis infinite (disconnected network)
I Fully connected network hasefficiency deff = 1, graph with noedges has deff = 0 Ignoring weights, above
deff ≈ 0.73
Weighted graphs
I If the edges in the graph haveassociated weights wij , it isnatural to define distances basedon the weights:
I dij as the sum of weights in theminimum weight path between ni
and nj
I Maximum and average pathlength as well as efficiencynaturally generalize by changingthe distance measure to theweighted version
With weights, d12 = 40 (redpath), d ≈ 24.2,dm = 50 = d25, deff = 0.06
Finding shortest paths
Finding shortest paths in graphs is part of classical algorithmtheory, two efficient algorithms
I Dijkstra’s algorithm: given a vertex find shortest paths to allother vertices, basic implementation runs in O(N2
V ) time, canbe implemented faster for sparse graphs
I Floyd-Warshall algorithm: find shortest paths for all pairs ofvertices in the graph in O(N3
V ) time; outputs a distance
matrix (dij )NVi ,j=1 in same time.
I Both work with weighted formulations
Shortest path distances in empirical networks
Path length analysis of many networksthat occur in nature reveals thesmall-world property
I Metabolite graphs: average pathlength d ≈ 3 (NV ≈ 103 − 104)
I WWW: links chains between twoweb documents d ≈ 16(NV > 109)
I Erdos number: shortest co-authorchain to Paul Erdos, d = 4.65(NV ≈ 4× 105)
I ...
Paul Erdos (1913-1996, a Hungarian
mathematician, published over 1400 scientific
papers over his lifetime with over 500
different co-authors
Node degree
I Degree ki of vertex ni is thenumber of edges adjacent to avertex
I In a network without self-loopsand without multiple edgesbetween any pair of edges: degree= number of neighbours
I In directed networks: in-degree isthe number of incoming edges andout-degree is the number outgoingedges
Degree distribution
I Given a fixed set of vertices, p(k) denotes the probability thata randomly chosen vertex has degree k .
I (Empirical) degree distribution is the list of probabilities (orrelative frequencies) p(k), k = 0 . . .NV .
I Analysis of the degree distribution is an important means tocharacterize networks
Degree distribution
Fitting the empirical degreedistribution to a theoretical distributiongiven by a mathematical law is animportant tool for network analysis
I Regular lattice: p(k) ≈ 1, where kis a constant
I Scale free network: p(k) ∝ k−γ
I Random network:p(k) ∝
(NV−1k
)pk (1− p)NV−1−k
Degree distribution
The degree distributions of of scale-free network and randomnetwork look markedly different
I Scale free network: p(k) ∝ k−γ (power law, heavy tail)
I Random network: p(k) ∝(NV−1
k
)pk (1− p)NV−1−k (binomial,
light tail)
Fitting degree distributions
I Typically the fitting of the empirical distribution is based onthe histogram of observations for p(k)
I This is prone to errors in the region of high degree nodes dueto low number of observations
I Binning can help: divide the range of k into intervals and putall observations in the interval into a common bin
I Cumulative degree distribution pc (k) =∑∞
l=k p(l), thelikelyhood that a given node has degree at least k , is morereliable and does not require binning
Degree correlations and assortative mixing
Degree correlation is a statistic that reveals additional informationof the connection patterns of the nodes
I Assortative networks: high correlation between the degrees ofadjacent nodes; highly connected nodes mostly connect toother highly connected nodes
I Disassortative networks: highly connected nodes mostlyconnect to low degree nodes
I Assortativity index −1 ≤ r ≤ 1: Pearson correlationcoefficient of degrees of adjacent nodes, r > 0 assortative,r < 0 disassortative
Examples
Social networks are typically assortative, technological andbiological networks tend to be disassortative
(M. Newman. Phys. Rev. E 67, 026126 (2003) )
Clustering coefficient
I Clustering coefficient measures theprobability that two vertices witha common neighbor are connected
I Let Ei denote the number of edgesbetween the neighbors of vi , andEmax = ki (ki − 1)/2 thetheoretical maximum. Clusteringcoefficient for vertex ni is now
Ci =Ei
Emax=
2Ei
ki (ki − 1)
I Clustering coefficient for the wholegraph is obtained by averagingover the vertices
Clustering coefficient in natural networks
I Natural networks often haverelatively high clusteringcoefficient indicating localclustering within the network
I Negative correlation between thedegree and the clusteringcoefficient has also been observed;
I Low degree nodes lie in localclusters, while the neighbors ofhigh degree nodes are less oftenconnected
I Indicates modular networkstructure
Example: PPIs in Mouse andHuman:
(http://bccs.bristol.ac.uk/toProgramme
/ project/2008/Angela Onslow S08/)
Matching index
To be functionally related, twovertices do not need to beconnected, examples:
I Two transcription factorproteins regulating thesame gene
I Two metabolite moleculestaking part in similarreactions
Zamora-Lopez. Frontiers in Neuroinformatics 4, 2010
Matching index
I Matching index measuresthe amount of neighborsthe two nodes share:
MIij =Sharedij
ki + kj − Sharedij
I Similarity in terms ofperceiving theneighborhood similarly
Zamora-Lopez. Frontiers in Neuroinformatics 4, 2010
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
Models of complex networks
I Theoretical models of networks areneeded as a basis for comparisonto determine the significance ofglobal properties or non-trivialsubstructures of natural networks.
I We will look at three specificmodels
I Erdos-Renyi ModelI Watts-Strogatz ModelI Barabasi-Albert model
Erdos-Renyi Model
I ER network consists of NV vertices
I Edge is drawn between a pair ofnodes randomly with probability p
I Degree distribution of the ERmodel is binomial:p(k) ∝
(NV−1k
)pk (1− p)NV−1−k
I Degree distribution can beapproximated by Poissondistribution for large graphs
Erdos-Renyi Model
Significant body of theoretical research exists for the ER model,e.g.
I For NV p < 1 the network almost surely has no largeconnected components
I For NV p ≈ 1 the network will almost surely have one largeconnected component
I For NV p > logNV the network will almost surely beconnected
I ER network has the small-world property when p > 1/NV
with average path length scaling as l ∼ logNV
I No local clustering, expected clustering coefficientC = p = 〈k〉 /NV for all nodes
Watts-Strogatz model
1. Arrange vertices in a ring structure
2. Connect each vertex to K closestneighbours
3. With probability prew , rewire eacheach edge by detaching from oneend and attaching to a randomlychosen vertex.
After steps (1-2) there is localclustering, step (3) lowers average pathlength by creating shortcuts
http://en.wikipedia.org/wiki/File:Watts strogatz.svg
Watts-Strogatz model
I Even for low rewiring probability(prew << 1) the average pathlength goes down rapidly
I Small average path length andlocal clustering is retained forintermediate prew
I When prew → 1, we get ER model,i.e. local clustering is destroyed
I Degree distribution is similar toER graph: homogeneuos andpeaked around k = K
http://en.wikipedia.org/wiki/File:Watts strogatz.svg
Barabasi-Albert model
I Start with an initial smallconnected network of N0 vertices
I Iteratively add new vertices andconnect the new vertex to m ≤ N0
vertices
I Draw the nodes that will beconnected the the new vertex withprobability proportional to theirdegree (preferential attachment):
ρ(ni ) = ki/∑
j
kj (BA graph from http://melihsozdinler.blogspot.com/)
Barabasi-Albert model
I Unlike ER or WS model,Barabasi-Albert model explain theinhomogeneuos degree distributionobserved in natural graphs
I With enough iterations, the degreedistribution of the BA model isscale-free, with p(k) ∼ k−3
I Average path length in BAnetworks has been found to besmaller than in ER and WS models
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
Robustness and Attack tolerance
I Robustness against pertubations(mutations, environment changes)is a preferable property forbiological networks
I Networks analysis is interested inpreservation of network topologyunder perturbations (usually:removals of vertices or edges)
Robustness and Attack tolerance
I Both ER networks and scale-freenetworks (such as BA model) arerobust towards random deletionsof nodes and connections
I A random mutation is likely tohit a low degree node in BAmodel
I Scale-free networks are not robusttowards intentioanl attacks
I Removal a set of highlyconnected nodes may collapsethe global structure
I ”Robust, yet fragile”
Modularity and hierarchical organization
I Many natural networks areobserved to possesmodular structure withdensely connectedfunctional clusters ofnodes that are sparselyconnected to other nodes.
I Also, hierarchicalorganization of networkstructure can be observed
I The random networkmodels discussed above,do not directly explainthese phenomena
(Zhao et al. BMC Bioinformatics 2006, 7:386)
Modularity and hierarchical organization
I Barabasi and Albert model hasbeen later extended to thatdirection
I Based on replicating basicmodules and wiring them to thecentral module of rest of thenetwork
I Recursive application leads tohierarchical organization
I Deterministic rather thanrandom procedure
Outline
Course introduction
Examples of biological networks
Global properties of networksDistance measuresDegree measuresLocal clusters
Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model
More Network Properties
Statistical Testing of Network Properties
Statistical testing of network properties
I How to determine if an observed property of the network issignificant or if it occured just by chance?
I Set up a null hypothesis
I Test if the observed property is consistent with the nullhypothesis
Statistical testing of network properties: Example
I Suppose that we have observed aclustering coefficient C for a givennetwork. Is the network highly clustered?
I Null hypothesis: The clustering coefficientis consistent with a network of the samesize and degree distribution.
I Create an ensemble of random networkswith same size and degree distribution andcompute the clustering coefficient of eachnetwork
I Reject the null hypothesis if theprobability of a network with clusteringcoefficient of at least C is low enough
I If the null hypothesis can be rejected, wecan conclude that the network is highlyclustered as compared to the null model.
What next?
I Thursday 10-12: Study group on analytical properties of ERnetworks
I Thursday 12-14: Exercise session
Moodle Enrolment
I https:
//moodle.helsinki.fi/course/view.php?id=18471
I Enrolment key: BIOMODELS