47
582746 Modelling and Analysis in Bioinformatics Lecture 1: Global Network Models 26.10.2015

582746 Modelling and Analysis in Bioinformatics

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 582746 Modelling and Analysis in Bioinformatics

582746 Modelling and Analysis in Bioinformatics

Lecture 1: Global Network Models

26.10.2015

Page 2: 582746 Modelling and Analysis in Bioinformatics

Outline

Course introduction

Examples of biological networks

Global properties of networksDistance measuresDegree measuresLocal clusters

Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model

More Network Properties

Statistical Testing of Network Properties

Page 3: 582746 Modelling and Analysis in Bioinformatics

Outline

Course introduction

Examples of biological networks

Global properties of networksDistance measuresDegree measuresLocal clusters

Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model

More Network Properties

Statistical Testing of Network Properties

Page 4: 582746 Modelling and Analysis in Bioinformatics

Course topics

I Computational models for biological networks (Leena Salmela)

I Gene regulation (Antti Honkela)

I Probabilistic analysis of sequence level problems(Veli Makinen)

Page 5: 582746 Modelling and Analysis in Bioinformatics

Practical arrangements

I Mondays: Lectures to introduce the topics

I Thursday mornings: Study group to deepen the knowledge onthe subject

I Thursday afternoons: Exercise sessions

I 23.11.-27.11. Visiting lecturers (no exercise session)

Page 6: 582746 Modelling and Analysis in Bioinformatics

How to pass the course?

I Attending study groups on Thursday mornings is mandatory

I Attending visiting lectures on Monday 23.11. and Thursday26.11. is mandatory

I Submit the exercises and get at least 6 points for each threeexercise sets (network models, gene regulation, probabilisticanalysis of sequence-level problems)

I If you miss a study group or visiting lecture, contact thelecturers for an alternative assignment

Page 7: 582746 Modelling and Analysis in Bioinformatics

Grading

I Grading is based on submitted exercises

I 60 points will be available

I 30 points =⇒ Passed, 50 points =⇒ 5

I No exam

I Not possible to pass with a separate exam

Page 8: 582746 Modelling and Analysis in Bioinformatics

Outline

Course introduction

Examples of biological networks

Global properties of networksDistance measuresDegree measuresLocal clusters

Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model

More Network Properties

Statistical Testing of Network Properties

Page 9: 582746 Modelling and Analysis in Bioinformatics

Protein-protein interaction network

I Vertices are proteins

I The proteins are connected if they interact with each other.

Page 10: 582746 Modelling and Analysis in Bioinformatics

Metabolic network

I Vertices are metabolites,i.e. chemical compounds

I Edges describe how thecell can transform ametabolite into another

Page 11: 582746 Modelling and Analysis in Bioinformatics

Gene regulatory networkI Vertices are genesI Genes are linked if one regulates the other

Source: Shen-Orr SS, Milo R, Mangan S, Alon U. 2002

Page 12: 582746 Modelling and Analysis in Bioinformatics

Outline

Course introduction

Examples of biological networks

Global properties of networksDistance measuresDegree measuresLocal clusters

Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model

More Network Properties

Statistical Testing of Network Properties

Page 13: 582746 Modelling and Analysis in Bioinformatics

(Shortest path) distance

I Distance dij is the length of theshortest path between vertices ni

and nj , i.e. the minimal number ofedges one needs to traverse to getfrom ni to nj

I The shortest path may not beunique, but the length of theshortest path is unique

I In directed network, we may havedij 6= dji

I If there is no path between ni andnj , we have dij =∞

Ignoring weights, above wehave d12 = 2,d13 = 1,d14 = 2,. . .

Page 14: 582746 Modelling and Analysis in Bioinformatics

Diameter and average path length

I The diameter dm = max(dij ) isthe maximal distance between anytwo nodes (= the longest shortestpath)

I Average or characteristic pathlengthd = 〈dij〉 = 1

N2V

∑NVi=1

∑NVj=1 dij Ignoring weights, above

dm = 3, d ≈ 1.57

Page 15: 582746 Modelling and Analysis in Bioinformatics

Efficiency

I Efficiency, or average inverse pathlength: deff = 〈1/dij〉

I Useful when average path lengthis infinite (disconnected network)

I Fully connected network hasefficiency deff = 1, graph with noedges has deff = 0 Ignoring weights, above

deff ≈ 0.73

Page 16: 582746 Modelling and Analysis in Bioinformatics

Weighted graphs

I If the edges in the graph haveassociated weights wij , it isnatural to define distances basedon the weights:

I dij as the sum of weights in theminimum weight path between ni

and nj

I Maximum and average pathlength as well as efficiencynaturally generalize by changingthe distance measure to theweighted version

With weights, d12 = 40 (redpath), d ≈ 24.2,dm = 50 = d25, deff = 0.06

Page 17: 582746 Modelling and Analysis in Bioinformatics

Finding shortest paths

Finding shortest paths in graphs is part of classical algorithmtheory, two efficient algorithms

I Dijkstra’s algorithm: given a vertex find shortest paths to allother vertices, basic implementation runs in O(N2

V ) time, canbe implemented faster for sparse graphs

I Floyd-Warshall algorithm: find shortest paths for all pairs ofvertices in the graph in O(N3

V ) time; outputs a distance

matrix (dij )NVi ,j=1 in same time.

I Both work with weighted formulations

Page 18: 582746 Modelling and Analysis in Bioinformatics

Shortest path distances in empirical networks

Path length analysis of many networksthat occur in nature reveals thesmall-world property

I Metabolite graphs: average pathlength d ≈ 3 (NV ≈ 103 − 104)

I WWW: links chains between twoweb documents d ≈ 16(NV > 109)

I Erdos number: shortest co-authorchain to Paul Erdos, d = 4.65(NV ≈ 4× 105)

I ...

Paul Erdos (1913-1996, a Hungarian

mathematician, published over 1400 scientific

papers over his lifetime with over 500

different co-authors

Page 19: 582746 Modelling and Analysis in Bioinformatics

Node degree

I Degree ki of vertex ni is thenumber of edges adjacent to avertex

I In a network without self-loopsand without multiple edgesbetween any pair of edges: degree= number of neighbours

I In directed networks: in-degree isthe number of incoming edges andout-degree is the number outgoingedges

Page 20: 582746 Modelling and Analysis in Bioinformatics

Degree distribution

I Given a fixed set of vertices, p(k) denotes the probability thata randomly chosen vertex has degree k .

I (Empirical) degree distribution is the list of probabilities (orrelative frequencies) p(k), k = 0 . . .NV .

I Analysis of the degree distribution is an important means tocharacterize networks

Page 21: 582746 Modelling and Analysis in Bioinformatics

Degree distribution

Fitting the empirical degreedistribution to a theoretical distributiongiven by a mathematical law is animportant tool for network analysis

I Regular lattice: p(k) ≈ 1, where kis a constant

I Scale free network: p(k) ∝ k−γ

I Random network:p(k) ∝

(NV−1k

)pk (1− p)NV−1−k

Page 22: 582746 Modelling and Analysis in Bioinformatics

Degree distribution

The degree distributions of of scale-free network and randomnetwork look markedly different

I Scale free network: p(k) ∝ k−γ (power law, heavy tail)

I Random network: p(k) ∝(NV−1

k

)pk (1− p)NV−1−k (binomial,

light tail)

Page 23: 582746 Modelling and Analysis in Bioinformatics

Fitting degree distributions

I Typically the fitting of the empirical distribution is based onthe histogram of observations for p(k)

I This is prone to errors in the region of high degree nodes dueto low number of observations

I Binning can help: divide the range of k into intervals and putall observations in the interval into a common bin

I Cumulative degree distribution pc (k) =∑∞

l=k p(l), thelikelyhood that a given node has degree at least k , is morereliable and does not require binning

Page 24: 582746 Modelling and Analysis in Bioinformatics

Degree correlations and assortative mixing

Degree correlation is a statistic that reveals additional informationof the connection patterns of the nodes

I Assortative networks: high correlation between the degrees ofadjacent nodes; highly connected nodes mostly connect toother highly connected nodes

I Disassortative networks: highly connected nodes mostlyconnect to low degree nodes

I Assortativity index −1 ≤ r ≤ 1: Pearson correlationcoefficient of degrees of adjacent nodes, r > 0 assortative,r < 0 disassortative

Page 25: 582746 Modelling and Analysis in Bioinformatics

Examples

Social networks are typically assortative, technological andbiological networks tend to be disassortative

(M. Newman. Phys. Rev. E 67, 026126 (2003) )

Page 26: 582746 Modelling and Analysis in Bioinformatics

Clustering coefficient

I Clustering coefficient measures theprobability that two vertices witha common neighbor are connected

I Let Ei denote the number of edgesbetween the neighbors of vi , andEmax = ki (ki − 1)/2 thetheoretical maximum. Clusteringcoefficient for vertex ni is now

Ci =Ei

Emax=

2Ei

ki (ki − 1)

I Clustering coefficient for the wholegraph is obtained by averagingover the vertices

Page 27: 582746 Modelling and Analysis in Bioinformatics

Clustering coefficient in natural networks

I Natural networks often haverelatively high clusteringcoefficient indicating localclustering within the network

I Negative correlation between thedegree and the clusteringcoefficient has also been observed;

I Low degree nodes lie in localclusters, while the neighbors ofhigh degree nodes are less oftenconnected

I Indicates modular networkstructure

Example: PPIs in Mouse andHuman:

(http://bccs.bristol.ac.uk/toProgramme

/ project/2008/Angela Onslow S08/)

Page 28: 582746 Modelling and Analysis in Bioinformatics

Matching index

To be functionally related, twovertices do not need to beconnected, examples:

I Two transcription factorproteins regulating thesame gene

I Two metabolite moleculestaking part in similarreactions

Zamora-Lopez. Frontiers in Neuroinformatics 4, 2010

Page 29: 582746 Modelling and Analysis in Bioinformatics

Matching index

I Matching index measuresthe amount of neighborsthe two nodes share:

MIij =Sharedij

ki + kj − Sharedij

I Similarity in terms ofperceiving theneighborhood similarly

Zamora-Lopez. Frontiers in Neuroinformatics 4, 2010

Page 30: 582746 Modelling and Analysis in Bioinformatics

Outline

Course introduction

Examples of biological networks

Global properties of networksDistance measuresDegree measuresLocal clusters

Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model

More Network Properties

Statistical Testing of Network Properties

Page 31: 582746 Modelling and Analysis in Bioinformatics

Models of complex networks

I Theoretical models of networks areneeded as a basis for comparisonto determine the significance ofglobal properties or non-trivialsubstructures of natural networks.

I We will look at three specificmodels

I Erdos-Renyi ModelI Watts-Strogatz ModelI Barabasi-Albert model

Page 32: 582746 Modelling and Analysis in Bioinformatics

Erdos-Renyi Model

I ER network consists of NV vertices

I Edge is drawn between a pair ofnodes randomly with probability p

I Degree distribution of the ERmodel is binomial:p(k) ∝

(NV−1k

)pk (1− p)NV−1−k

I Degree distribution can beapproximated by Poissondistribution for large graphs

Page 33: 582746 Modelling and Analysis in Bioinformatics

Erdos-Renyi Model

Significant body of theoretical research exists for the ER model,e.g.

I For NV p < 1 the network almost surely has no largeconnected components

I For NV p ≈ 1 the network will almost surely have one largeconnected component

I For NV p > logNV the network will almost surely beconnected

I ER network has the small-world property when p > 1/NV

with average path length scaling as l ∼ logNV

I No local clustering, expected clustering coefficientC = p = 〈k〉 /NV for all nodes

Page 34: 582746 Modelling and Analysis in Bioinformatics

Watts-Strogatz model

1. Arrange vertices in a ring structure

2. Connect each vertex to K closestneighbours

3. With probability prew , rewire eacheach edge by detaching from oneend and attaching to a randomlychosen vertex.

After steps (1-2) there is localclustering, step (3) lowers average pathlength by creating shortcuts

http://en.wikipedia.org/wiki/File:Watts strogatz.svg

Page 35: 582746 Modelling and Analysis in Bioinformatics

Watts-Strogatz model

I Even for low rewiring probability(prew << 1) the average pathlength goes down rapidly

I Small average path length andlocal clustering is retained forintermediate prew

I When prew → 1, we get ER model,i.e. local clustering is destroyed

I Degree distribution is similar toER graph: homogeneuos andpeaked around k = K

http://en.wikipedia.org/wiki/File:Watts strogatz.svg

Page 36: 582746 Modelling and Analysis in Bioinformatics

Barabasi-Albert model

I Start with an initial smallconnected network of N0 vertices

I Iteratively add new vertices andconnect the new vertex to m ≤ N0

vertices

I Draw the nodes that will beconnected the the new vertex withprobability proportional to theirdegree (preferential attachment):

ρ(ni ) = ki/∑

j

kj (BA graph from http://melihsozdinler.blogspot.com/)

Page 37: 582746 Modelling and Analysis in Bioinformatics

Barabasi-Albert model

I Unlike ER or WS model,Barabasi-Albert model explain theinhomogeneuos degree distributionobserved in natural graphs

I With enough iterations, the degreedistribution of the BA model isscale-free, with p(k) ∼ k−3

I Average path length in BAnetworks has been found to besmaller than in ER and WS models

Page 38: 582746 Modelling and Analysis in Bioinformatics

Outline

Course introduction

Examples of biological networks

Global properties of networksDistance measuresDegree measuresLocal clusters

Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model

More Network Properties

Statistical Testing of Network Properties

Page 39: 582746 Modelling and Analysis in Bioinformatics

Robustness and Attack tolerance

I Robustness against pertubations(mutations, environment changes)is a preferable property forbiological networks

I Networks analysis is interested inpreservation of network topologyunder perturbations (usually:removals of vertices or edges)

Page 40: 582746 Modelling and Analysis in Bioinformatics

Robustness and Attack tolerance

I Both ER networks and scale-freenetworks (such as BA model) arerobust towards random deletionsof nodes and connections

I A random mutation is likely tohit a low degree node in BAmodel

I Scale-free networks are not robusttowards intentioanl attacks

I Removal a set of highlyconnected nodes may collapsethe global structure

I ”Robust, yet fragile”

Page 41: 582746 Modelling and Analysis in Bioinformatics

Modularity and hierarchical organization

I Many natural networks areobserved to possesmodular structure withdensely connectedfunctional clusters ofnodes that are sparselyconnected to other nodes.

I Also, hierarchicalorganization of networkstructure can be observed

I The random networkmodels discussed above,do not directly explainthese phenomena

(Zhao et al. BMC Bioinformatics 2006, 7:386)

Page 42: 582746 Modelling and Analysis in Bioinformatics

Modularity and hierarchical organization

I Barabasi and Albert model hasbeen later extended to thatdirection

I Based on replicating basicmodules and wiring them to thecentral module of rest of thenetwork

I Recursive application leads tohierarchical organization

I Deterministic rather thanrandom procedure

Page 43: 582746 Modelling and Analysis in Bioinformatics

Outline

Course introduction

Examples of biological networks

Global properties of networksDistance measuresDegree measuresLocal clusters

Network ModelsErdos-Renyi ModelWatts-Strogatz ModelBarabasi-Albert Model

More Network Properties

Statistical Testing of Network Properties

Page 44: 582746 Modelling and Analysis in Bioinformatics

Statistical testing of network properties

I How to determine if an observed property of the network issignificant or if it occured just by chance?

I Set up a null hypothesis

I Test if the observed property is consistent with the nullhypothesis

Page 45: 582746 Modelling and Analysis in Bioinformatics

Statistical testing of network properties: Example

I Suppose that we have observed aclustering coefficient C for a givennetwork. Is the network highly clustered?

I Null hypothesis: The clustering coefficientis consistent with a network of the samesize and degree distribution.

I Create an ensemble of random networkswith same size and degree distribution andcompute the clustering coefficient of eachnetwork

I Reject the null hypothesis if theprobability of a network with clusteringcoefficient of at least C is low enough

I If the null hypothesis can be rejected, wecan conclude that the network is highlyclustered as compared to the null model.

Page 46: 582746 Modelling and Analysis in Bioinformatics

What next?

I Thursday 10-12: Study group on analytical properties of ERnetworks

I Thursday 12-14: Exercise session

Page 47: 582746 Modelling and Analysis in Bioinformatics

Moodle Enrolment

I https:

//moodle.helsinki.fi/course/view.php?id=18471

I Enrolment key: BIOMODELS