Biological Networks – Graph Theory and Matrix Theory. Ka-Lok Ng, Department of Bioinformatics, Asia University.


Page 1: Biological Networks –  Graph Theory and Matrix Theory

Biological Networks – Graph Theory and Matrix Theory

Ka-Lok Ng, Department of Bioinformatics, Asia University

Page 2: Biological Networks –  Graph Theory and Matrix Theory

2

Content

Topological Statistics of the protein interaction networks

How to characterize a network?
– Graph theory, topological parameters (node degree, average path length, clustering coefficient, and node degree correlation)
– Random graph, scale-free network, hierarchical network

– Evolution of Biological Networks

Page 3: Biological Networks –  Graph Theory and Matrix Theory

3

Biological Networks - metabolic networks

Metabolism is the most basic network of biochemical reactions: these reactions generate energy for driving various cell processes and degrade and synthesize many different biomolecules.

Page 4: Biological Networks –  Graph Theory and Matrix Theory

4

Biological Networks - Protein-protein interaction network (PIN)

Proteins perform distinct and well-defined functions, but little is known about how interactions among them are structured at the cellular level. Protein-protein interactions account for binding interactions and the formation of protein complexes.
- Experiment – yeast two-hybrid method, or co-immunoprecipitation

www.utoronto.ca/boonelab/proteomics.htm

Limitation: no subcellular localization or temporal information.

Cliques – protein complexes ?

Page 5: Biological Networks –  Graph Theory and Matrix Theory

5

Biological Networks - PIN

Yeast protein-protein interaction network
- Protein-protein interactions are not random
- Highly connected proteins are unlikely to interact with each other

Not a random network
- Data from the high-throughput two-hybrid experiment (T. Ito et al., PNAS (2001))
- The full set contains 4549 interactions among 3278 yeast proteins; 87% of the nodes are in the largest component
- k_max ~ 285!
- Figure shows nuclear proteins only

Page 6: Biological Networks –  Graph Theory and Matrix Theory

6

Biological Networks – Gene regulation networks

Example of a genetic regulatory network of two genes (a and b), each coding for a regulatory protein (A and B).

In a gene regulatory network, the protein encoded by a gene can regulate the expression of other genes, for instance, by activating or inhibiting DNA transcription. These genes in turn produce new regulatory proteins that control other genes.

Page 7: Biological Networks –  Graph Theory and Matrix Theory

7

Biological Networks – Gene regulation networks

Transcription regulatory network in yeast
- From the YPD database: 1276 regulations among 682 proteins by 125 transcription factors (~10 regulated genes per TF)
- Part of a bigger genetic regulatory network of 1772 regulations among 908 proteins

Transcription regulatory network in H. sapiens
- Data, courtesy of Ariadne Genomics, obtained from literature search: 1449 regulations among 689 proteins

Transcription regulatory network in E. coli
- Data (courtesy of Uri Alon) curated from the RegulonDB database: 606 interactions between 424 operons (by 116 TFs)

Page 8: Biological Networks –  Graph Theory and Matrix Theory

8

Biological Networks – Signal transduction networks

Nuclear transcription factor NF-kB
- control of apoptosis (cell suicide)
- development of B and T cells
- anti-viral and anti-bacterial responses

Oxidant-induced activation of NF-kB signal transduction

[Diagram: hormones (first messenger) → receptor → cAMP, Ca++ (second messenger) → phosphorylation]

Page 9: Biological Networks –  Graph Theory and Matrix Theory

9

Biological Networks – Cell cycle regulation networks

Page 10: Biological Networks –  Graph Theory and Matrix Theory

10

Biological Networks

Biological networks are not randomly connected

Underlying architecture → clustering

How to characterize it? Are there universal features across different species?

Page 11: Biological Networks –  Graph Theory and Matrix Theory

11

Graph theory- The Bridge Obsession Problem

Bridges of Königsberg

Find a tour crossing every bridge just once. Leonhard Euler (Switzerland), 1735, showed that it is impossible.

L. Euler, 1707–1783

Page 12: Biological Networks –  Graph Theory and Matrix Theory

12

Eulerian Cycle Problem

Find a cycle that visits every edge exactly once

Linear time

More complicated Königsberg

Page 13: Biological Networks –  Graph Theory and Matrix Theory

13

Hamiltonian Cycle Problem

Find a cycle that visits every vertex exactly once

Around the 20 famous cities in the world

NP-complete. Game invented by Sir William Hamilton in 1857.

Sir W. Hamilton (Irish mathematician), 1805–1865

Page 14: Biological Networks –  Graph Theory and Matrix Theory

14

Mapping Problems to Graphs

Arthur Cayley (English mathematician) studied chemical structures of hydrocarbons in the mid-1800s

He used trees (acyclic connected graphs) to enumerate structural isomers.

Page 15: Biological Networks –  Graph Theory and Matrix Theory

15

Real networks – many networks show scale-free behavior:
- World-Wide Web
- Internet
- Ecological networks (food webs)
- Science collaboration network
- Movie actor collaboration network
- Cellular networks
- Linguistic networks
- Power grids and neural networks
- Sexual contacts within a population (important for disease prevention!)
- etc.

Power-law behavior

Page 16: Biological Networks –  Graph Theory and Matrix Theory

16

Graph Theory

[Figure: two elements A and B joined by a binary relation; example representations – pathway, cluster, hierarchical tree]

Network representation. A network (graph) consists of a set of elements (vertices) and a set of binary relations (edges).

Page 17: Biological Networks –  Graph Theory and Matrix Theory

17

Graph Theory – Basic concepts

Graphs: G = (N, E), N = {n1, n2, ..., nN}, E = {e1, e2, ..., eM}, ek = {ni, nj}. Nodes: proteins; edges: protein interactions.

Multigraph: ek = {ni, nj} plus duplicate edges, i.e. em = {ni, nj}. Nodes: proteins; edges: interactions of different sorts, e.g. binding and similarity.

Hypergraphs: hyperedge ex = {ni, nj, nk, ...}. Nodes: proteins; edges: protein complexes.

Directed hypergraph: hyperedge ex = {ni, nj, ... | nk, nl, ...}. Nodes: substances; edges: chemical reactions, e.g. A + B → C + D corresponds to ex = {A, B | C, D}.

Directed graph: ek = (ni, nj). Nodes: genes and their products; edge from A to B: gene regulation, i.e. gene A regulates the expression of gene B.

Different systems → different graphs

Average node degree: d = 2|E| / |N|

Page 18: Biological Networks –  Graph Theory and Matrix Theory

18

Graph Theory – Basic concepts

- Node degree
- Components
- Complete graph (clique)
- Shortest path length
- Clustering coefficient Ci: if A–B and B–C, then it is highly probable that A–C

Ci = 2Ei / [ki(ki − 1)]

Example: CA = 2(1) / [5(5 − 1)] = 0.1

Two ways to compute Ci:
- Ei, the actual number of connections among the neighbors of i, out of C(ki, 2) possible connections
- twice the number of triangles that include i, divided by ki(ki − 1)

Average clustering coefficient:

C = (1/N) Σi Ci
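As an illustration (not from the original slides), here is a minimal Python sketch of the two quantities defined above, computed from an adjacency matrix; the example graph is hypothetical.

```python
import numpy as np

# Hypothetical undirected graph given as a 0/1 symmetric adjacency matrix.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
])

def clustering_coefficient(A, i):
    """C_i = 2*E_i / (k_i*(k_i-1)), where E_i counts edges among i's neighbors."""
    neighbors = np.flatnonzero(A[i])
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Edges among the neighbors of i (each counted once).
    E_i = A[np.ix_(neighbors, neighbors)].sum() / 2
    return 2 * E_i / (k * (k - 1))

degrees = A.sum(axis=1)                     # node degrees k_i
C = [clustering_coefficient(A, i) for i in range(len(A))]
C_avg = sum(C) / len(C)                     # average clustering coefficient
print(degrees, C, C_avg)
```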

Page 19: Biological Networks –  Graph Theory and Matrix Theory

19

Graph Theory – Vertex adjacency matrix

For an undirected graph with nodes 1, 2, 3, 4 in which node 2 is linked to nodes 1, 3 and 4, the vertex adjacency matrix is

A =
0 1 0 0
1 0 1 1
0 1 0 0
0 1 0 0

- ∞ means not directly connected
- node i connectivity: ki = count_j(mij = 1), giving ki = 1, 3, 1, 1

Undirected graph: A is symmetric.

Bipartite graph: with the two vertex sets listed first and second, the adjacency matrix has the block form

A =
0    B
B^T  0

which is symmetric.

Page 20: Biological Networks –  Graph Theory and Matrix Theory

20

Graph Theory – Edge adjacency matrix

For the graph G with vertices 1, 2, 3, 4 and edges a, b, c, d, the edge adjacency matrix is

E(G) =
    a b c d
a   0 1 1 1
b   1 0 1 0
c   1 1 0 1
d   1 0 1 0

(symmetric)

The edge adjacency matrix E(G) of a graph G is identical to the vertex adjacency matrix A of the line graph of G, L(G). That is, the edges of G are replaced by vertices in L(G), and two vertices in L(G) are connected whenever the corresponding edges in G are adjacent:

A(L(G)) = E(G)

Different labelings of the same graph G are related by a similarity transformation, P^-1 A(G1) P = A(G2), where P is a permutation matrix.

Page 21: Biological Networks –  Graph Theory and Matrix Theory

21

Graph Theory – average network distance

Interaction path length or average network distance, d
- the average of the distances between all pairs of nodes
- f(j) is the frequency of the shortest interaction path length j
- determined by using Floyd's algorithm

The average network distance d is given by

d = Σ_j j f(j) / Σ_j f(j)

where j is the shortest path length between two nodes.

Network diameter (global); average network distance (local).

Page 22: Biological Networks –  Graph Theory and Matrix Theory

22

Graph Theory – the shortest path

The shortest path – Floyd's algorithm, an O(N^3) algorithm.

At iteration n, given three nodes i, j and k, check whether it is shorter to reach j from i by passing through k:

M^n_ij = min{ M^(n-1)_ij, M^(n-1)_ik + M^(n-1)_kj }

- search over all possible paths, e.g. 1-2, 1-2-3, 1-2-4, 2-3, 2-4

[Figure: example 4-node graph; intermediate node k on a path from i to j]
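A small Python sketch of the Floyd–Warshall recursion above, applied to a hypothetical 4-node graph, followed by the average network distance d computed from the resulting shortest-path lengths (illustrative, not the original implementation).

```python
import numpy as np

INF = float("inf")

def floyd_warshall(A):
    """All-pairs shortest path lengths from a 0/1 adjacency matrix."""
    n = len(A)
    M = np.where(A == 1, 1.0, INF)
    np.fill_diagonal(M, 0.0)
    for k in range(n):                      # iterate over intermediate nodes
        for i in range(n):
            for j in range(n):
                M[i, j] = min(M[i, j], M[i, k] + M[k, j])
    return M

# Hypothetical star graph: the second node linked to the other three.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]])
M = floyd_warshall(A)
finite = M[np.isfinite(M) & (M > 0)]        # distances between distinct, connected pairs
print(M)
print("average network distance d =", finite.mean())
```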

Page 23: Biological Networks –  Graph Theory and Matrix Theory

23

Graph Theory – number of the shortest path in a graph

A nonvanishing element of A(G), Aij = 1, represents a walk of length one between the vertices i and j. Therefore, in general,

Aij = 1 if there is a walk of length one between vertices i and j, and 0 otherwise.

Walks of various lengths can be found in a given graph. Thus

Aik Akj = 1 if there is a walk of length two between vertices i and j passing through the vertex k, and 0 otherwise.

Therefore, the expression

(A^2)_ij = Σ_{k=1..N} Aik Akj

represents the total number of walks of length 2 in G between the vertices i and j.

For a walk of length L we have

Air Ars ... Azj = 1 if there is a walk of length L between vertices i and j passing through the vertices r, s, ..., z, and 0 otherwise,

so (A^L)_ij counts the walks of length L between i and j.
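A quick numpy check of the statement above on a hypothetical small graph (a triangle plus a pendant vertex): entry (i, j) of A^L counts the walks of length L (sketch, not from the slides).

```python
import numpy as np
from numpy.linalg import matrix_power

# Hypothetical graph: triangle 1-2-3 plus the extra edge 3-4 (1-based labels).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

A2 = matrix_power(A, 2)   # (A^2)[i, j] = number of walks of length 2 from i to j
A3 = matrix_power(A, 3)   # (A^3)[i, i] = closed walks of length 3 at i = 2 x triangles through i
print(A2)
print(A3)
print("closed 3-walks at each vertex:", np.diag(A3))
```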

Page 24: Biological Networks –  Graph Theory and Matrix Theory

24

Graph Theory – Trace of a matrix

Trace of the N×N matrix A:

Tr A = Σ_{i=1..N} Aii

In the case of the adjacency matrix of a graph without loops, Tr A = 0.

The trace of powers of A is a graph invariant:

Tr(A^2) = Σ_i (A^2)_ii = Σ_i Di = 2M
Tr(A^3) = 6 C3

where Di is the degree of vertex i, M is the number of edges, and C3 is the number of three-membered cycles (triangles).

In the case of a graph with loops,

Tr A = Σ_{i=1..N} hi

where hi is the number of loops at vertex i.
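A short numpy illustration of the two trace identities above on a hypothetical graph (Tr A^2 = 2M, Tr A^3 = 6 C3).

```python
import numpy as np
from numpy.linalg import matrix_power

# Hypothetical graph: a triangle 1-2-3 plus the edge 3-4 (M = 4 edges, C3 = 1 triangle).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

M = A.sum() // 2                              # number of edges
print(np.trace(matrix_power(A, 2)), 2 * M)    # Tr A^2 = 2M  -> 8, 8
print(np.trace(matrix_power(A, 3)), 6 * 1)    # Tr A^3 = 6*C3 -> 6, 6
```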

Page 25: Biological Networks –  Graph Theory and Matrix Theory

25

Random Graph Theory = Graph Theory + Probability

Page 26: Biological Networks –  Graph Theory and Matrix Theory

26

Random Graph Theory = Graph Theory + Probability

Page 27: Biological Networks –  Graph Theory and Matrix Theory

27

Random Graph Theory = Graph Theory + Probability

Random graph (Erdős and Rényi, 1960)

N labeled nodes connected by n edges; there are C(N, 2) = N(N − 1)/2 possible edges, and therefore

C( N(N − 1)/2, n )

possible graphs with N nodes and n edges.

Example, N = 4: there are C(4, 2) = 6 possible edges, so the number of possible graphs with n edges is C(6, n):

n    number of possible graphs, C(6, n)
1    6
2    15
3    20
4    15
5    6
6    1

[Figure: example graphs on N = 4 nodes with n = 3, 3, 4, 4, 5 and 6 edges]

The degree of a node i follows a binomial distribution:

P(ki = k) = C(N − 1, k) p^k (1 − p)^(N − 1 − k)
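An illustrative Python sketch (not from the slides): sample an Erdős–Rényi random graph G(N, p) and compare its empirical degree distribution with the binomial formula above.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def erdos_renyi(N, p):
    """Sample a G(N, p) adjacency matrix: each possible edge appears with probability p."""
    upper = rng.random((N, N)) < p
    A = np.triu(upper, k=1)
    return (A | A.T).astype(int)

N, p = 200, 0.02
degrees = erdos_renyi(N, p).sum(axis=1)

# Empirical vs. theoretical P(k) = C(N-1, k) p^k (1-p)^(N-1-k)
for k in range(6):
    empirical = np.mean(degrees == k)
    theoretical = comb(N - 1, k) * p**k * (1 - p)**(N - 1 - k)
    print(k, round(empirical, 3), round(theoretical, 3))
```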

Page 28: Biological Networks –  Graph Theory and Matrix Theory

28

Random Graph Theory – Random network, Scale free network

Connectivity distribution P(k): in a random network the links are placed at random and most of the nodes have degrees close to <k> = 2E/N; the degree distribution P(k) vs. k is a Poisson distribution, i.e. P(k) ~ <k>^k e^(−<k>) / k!. In many real-life networks the degree distribution has no well-defined peak but follows a power law, P(k) ~ k^(−γ), where γ is a constant. Such networks are known as scale-free networks.

Random network: the Log[P(k)] vs Log[k] plot has a peak; homogeneous nodes; d ~ log N.
Scale-free network: the Log[P(k)] vs Log[k] plot is a line with negative slope; inhomogeneous nodes; d ~ log(log N).

Albert R. and Barabasi A.L. (2002) Rev. Mod. Phys. 74, 47

[Figure: a random network vs a scale-free network]

http://physicsweb.org/box/world/

Page 29: Biological Networks –  Graph Theory and Matrix Theory

29

Example – metabolic pathways

WIT database (43 organisms); node = substrate, edge = reaction → scale-free network, P(k) ~ k^(−γ), with γ_in = 2.2 and γ_out = 2.2; similar scaling behavior of the connectivity distribution across organisms (Fig. 2d shows the connectivity distribution averaged over the 43 organisms). This suggested that metabolic networks belong to the class of scale-free networks.

It is interesting to note that most real networks have 1 < γ < 3.

http://ergo.integratedgenomics.com/IGwit/

Page 30: Biological Networks –  Graph Theory and Matrix Theory

30

Random Graph, Scale-free network, Hierarchical network

Clustering coefficient; node degree distribution; scaling C_ave(k) ~ k^(−1) for the deterministic hierarchical network model.

Hierarchical network - coexistence of (1) modularity, (2) local clustering, and (3) scale-free behavior

Page 31: Biological Networks –  Graph Theory and Matrix Theory

31

Graph Theory – Network motifs

Compared the abundance of small loops in the E. coli transcription regulatory network to its randomized counterpart.
- Treat the transcription network as a directed graph: node = operon (a group of contiguous genes); edge = from an operon that encodes a TF to an operon regulated by that TF
- The frequencies of occurrence of three types of motifs (feed-forward loops, single input modules, and dense overlapping regulons) are much higher than in the randomized network
- There are 13 types of 3-node connected, directed subgraphs
- Feed-forward loops (FFL) were significantly over-represented (40 in the real network vs 7 ± 5 in random networks)

Reference : S.S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Nature Genetics, 31(1):64-68 (2002)

Page 32: Biological Networks –  Graph Theory and Matrix Theory

32

Graph Theory – Network motifs

Feed-forward loop (FFL)
- A TF X regulates a second TF Y, and both jointly regulate one or more operons Z1, ..., Zn (a counting sketch for this motif follows below).

Single input module (SIM)
- A single TF X regulates a set of operons Z1, ..., Zn; X usually also regulates itself.

Dense overlapping regulons (DOR)
- A set of operons Z1, ..., Zm are each regulated by a combination of TFs from a set X1, ..., Xn.
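A minimal Python sketch (not part of the original slides) of how one might count feed-forward loops in a directed regulatory network given as an edge list; the names and toy edges are hypothetical.

```python
from itertools import permutations

# Hypothetical directed regulation edges: (regulator, target).
edges = {("X", "Y"), ("X", "Z1"), ("Y", "Z1"), ("X", "Z2"), ("Y", "Z2"), ("A", "X")}
nodes = {n for e in edges for n in e}

def count_ffl(nodes, edges):
    """Count feed-forward loops: X -> Y, X -> Z and Y -> Z for distinct X, Y, Z."""
    count = 0
    for x, y, z in permutations(nodes, 3):
        if (x, y) in edges and (x, z) in edges and (y, z) in edges:
            count += 1
    return count

print(count_ffl(nodes, edges))   # 2 FFLs in this toy network: (X, Y, Z1) and (X, Y, Z2)
```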

Page 33: Biological Networks –  Graph Theory and Matrix Theory

33

Graph Theory – Node degree correlation

- In random graph models, node degrees are uncorrelated.
- Count the frequency P(K1, K2) that two proteins with connectivities K1 and K2 are connected to each other by a link.
- Compare it to the same quantity P_R(K1, K2) measured in a randomized version of the same network.
- The average node connectivity for a fixed K1 is given by

<K2(K1)> = Σ_{K2} K2 P(K1, K2) / <P_R(K1, K2)>

where <> denotes the average over multiple samplings of the randomized network, and the summation runs over all K2 for a fixed K1.
- In the randomized version, the node degrees of each protein are kept the same as in the original network, whereas the linking partners are chosen totally at random.

[Figure: example link between a node of connectivity K1 = 2 and a node of connectivity K2 = 5, with the adjacency matrix M of a small example graph on nodes 1-4.]

Page 34: Biological Networks –  Graph Theory and Matrix Theory

34

Input - Database of Interacting Proteins (DIP)

DIP: http://dip.doe-mbi.ucla.edu. DIP is a database that documents experimentally determined protein-protein interactions. We analyze the protein-protein interactions for seven different species: C. elegans, D. melanogaster, E. coli, H. pylori, H. sapiens, M. musculus and S. cerevisiae.

- Look for features of the PINs that are general across species and features that differ between species

Page 35: Biological Networks –  Graph Theory and Matrix Theory

35

Input - Database of Interacting Proteins (DIP)

Page 36: Biological Networks –  Graph Theory and Matrix Theory

36

Results – scale-free network study

• Large standard deviation of k
• Coefficient of determination, r^2 = SSR/SST > 0.90
• To account for the flat-plateau and long-tail behaviors, assume a short-length-scale correction k0 and an exponential cut-off at kc:

P(k) ~ (k + k0)^(−γ) e^(−k/kc)

with γ_yeast ~ 2.1 and γ_fly ~ 1.9.

[Figure: P(k) fits for C. elegans, D. melanogaster (fly), E. coli, H. pylori, S. cerevisiae (yeast), H. sapiens and M. musculus.]
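A hedged sketch of how such a fit could be done in Python with scipy (the data array here is fabricated purely for illustration; the original analysis may have used different tools).

```python
import numpy as np
from scipy.optimize import curve_fit

def degree_dist(k, gamma, k0, kc, a):
    """Power law with a short-length-scale correction k0 and exponential cut-off kc."""
    return a * (k + k0) ** (-gamma) * np.exp(-k / kc)

# Toy data standing in for an observed degree distribution P(k).
k = np.arange(1, 50, dtype=float)
noise = 1 + 0.05 * np.random.default_rng(1).standard_normal(k.size)
pk = degree_dist(k, 2.1, 1.0, 30.0, 0.5) * noise

params, _ = curve_fit(degree_dist, k, pk, p0=[2.0, 1.0, 20.0, 1.0])
print("gamma, k0, kc, a =", params)
```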

Page 37: Biological Networks –  Graph Theory and Matrix Theory

37

Results

Highly connected proteins (k ≥ 25) – yeast (39 sequences) and fly (317 sequences). Most of the sequences do not have high sequence similarity (E-value ≤ 0.01) → different functions.

Page 38: Biological Networks –  Graph Theory and Matrix Theory

38

Results

These highly connected proteins were compared pairwise in an all-against-all manner using gapped BLAST (16), and none of the sequences show significant sequence similarity (E-value < 0.001) except the tryptophan protein and SEC27 protein, the nuclear pore protein, the 26S proteasome regulatory particle chain and the DNA-directed RNA polymerase.

Page 39: Biological Networks –  Graph Theory and Matrix Theory

39

Results

[Fig. 4: Log f(L) vs L – the logarithm of the normalized frequency distribution of connected paths against their length, for S. cerevisiae (CORE), H. pylori, E. coli, H. sapiens, M. musculus, D. melanogaster and C. elegans.]

Page 40: Biological Networks –  Graph Theory and Matrix Theory

40

Results – node degrees correlation

[Figure: example network annotated with node-degree-correlation values.]

Highly connected proteins are unlikely to interact.

Page 41: Biological Networks –  Graph Theory and Matrix Theory

41

Results – Hierarchical structures

The plots of Log C_ave(k) vs Log k for the seven species.

All the species exhibit a rather flat plateau for small values of k and fall rapidly for large k:

C_ave(k) ~ k^(−β) for some exponent β

[Figure: Log C_ave(k) vs Log k, panels for yeast and E. coli.]

Page 42: Biological Networks –  Graph Theory and Matrix Theory

42

Results – identification of cliques

To identify protein complexes: compute the clustering coefficients and find the cliques or pseudo-cliques.

Page 43: Biological Networks –  Graph Theory and Matrix Theory

43

Identification of cliques

Theorem. Let (A^3)_ij be the (i, j)-th element of A^3. Then a vertex Pi belongs to some clique if and only if (A^3)_ii ≠ 0.

Recall that Aij = 1 if there is a walk of length one between vertices i and j, and 0 otherwise.

Example

A =
0 1 0 1 1
1 0 0 1 0
0 0 0 0 0
1 1 0 0 0
1 0 0 0 0

and

A^3 =
2 4 0 4 3
4 2 0 3 1
0 0 0 0 0
4 3 0 2 1
3 1 0 1 0

The non-zero diagonal entries of A^3 are (A^3)_11, (A^3)_22 and (A^3)_44. Consequently, nodes 1, 2 and 4 belong to cliques. Since a clique must contain at least three vertices, the graph has only one clique, {1, 2, 4}.
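A numpy check of the criterion above on the example adjacency matrix (illustrative sketch).

```python
import numpy as np
from numpy.linalg import matrix_power

# Adjacency matrix from the example above (vertices 1..5 -> indices 0..4).
A = np.array([[0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 0]])

diag_A3 = np.diag(matrix_power(A, 3))
members = np.flatnonzero(diag_A3) + 1    # back to 1-based vertex labels
print(diag_A3)                           # [2 2 0 2 0]
print("vertices belonging to a clique (triangle):", members)   # [1 2 4]
```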

Page 44: Biological Networks –  Graph Theory and Matrix Theory

44

Results - protein complexes

Identification of the highest clique degree with protein complexes

We identified all possible cliques within the seven PINs. To identify the relation between cliques and protein complexes, we considered only the cliques with the largest number of connected proteins in our preliminary study, and found that some of these cliques did correspond to protein complexes (by comparison with data from the BIND database).

Page 45: Biological Networks –  Graph Theory and Matrix Theory

45

Evolution of Biological Networks

Databases – DIP and MIPS

Motif identification
- detecting all n-node subgraphs, i.e. all 2-, 3-, 4- and some 5-node motifs (a set of 28 five-node motifs) in the yeast PIN
- the network of 3183 yeast proteins contains between 1,000 and 1,000,000 copies of the specific motif types

Page 46: Biological Networks –  Graph Theory and Matrix Theory

46

Evolution of Biological Networks

- Studied the conservation of 678 (47% of 1443) yeast proteins with an ortholog in each of five higher eukaryotes (A. thaliana, C. elegans, D. melanogaster, M. musculus and H. sapiens) deposited in the InParanoid database

- 47% of the 1443 fully connected pentagons (motif #11) in yeast have each of their five protein components conserved in each of the five higher eukaryotes

- This result suggests that blocks of cohesive motifs tend to be evolutionarily conserved

Page 47: Biological Networks –  Graph Theory and Matrix Theory

47

Evolution of Biological Networks

Redundant links are lost (in an asymmetric fashion)

Growth model of a scale-free network (PIN)
- New protein nodes are added (gene duplication)
- Preferential attachment

Page 48: Biological Networks –  Graph Theory and Matrix Theory

48

Evolution of Biological Networks

Growth
1. start with m0 nodes
2. add a node with m edges
3. connect these edges to existing nodes
At time step t: t + m0 nodes, tm edges.

Preferential attachment
The probability q of connecting to node i depends on the degree ki of that node:

q(ki) = ki / Σ_j kj

Example: m0 = 3, m = 2.

This model leads to the power-law distribution

P(k) = 2 m^2 k^(−3) ~ k^(−3)

(A simulation sketch of this growth model is given below.)
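A compact Python sketch of the growth + preferential-attachment model described above (a simplified Barabási–Albert-style simulation; parameter names are illustrative).

```python
import random

random.seed(0)

def grow_network(m0=3, m=2, steps=1000):
    """Start with m0 nodes, then repeatedly add a node with m edges attached preferentially."""
    # Initial fully connected seed of m0 nodes.
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Endpoint list: node i appears k_i times, so sampling from it is preferential attachment.
    endpoints = [n for e in edges for n in e]
    for new in range(m0, m0 + steps):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))   # P(choose i) = k_i / sum_j k_j
        for t in targets:
            edges.append((new, t))
            endpoints.extend([new, t])
    return edges

edges = grow_network()
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
# For large k the degree distribution should roughly follow P(k) ~ k^-3.
print(sorted(degree.values())[-5:])   # the few highly connected hubs
```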

Page 49: Biological Networks –  Graph Theory and Matrix Theory

49

Summary – Protein-protein interaction networks

• PINs are not random networks; they have rather heterogeneous structures. For the highly connected proteins, blastp shows that they do not share sequence similarity.

• The plots of Log[Pcum(k)] vs Log[k] indicate that PINs are well approximated by scale-free networks.

• γ ~ 2 → a general biological evolution mechanism across species: the growth + preferential attachment model.

• The plots of Log[Pcum(k)] vs Log[k] for fly and yeast seem to deviate at small and large k values → modification of the growth + preferential attachment model.

• Highly connected proteins are unlikely to interact.

• The hierarchical network model is a better description for certain species' PINs.

[Figure: Log C_ave(k) vs Log k for S. cerevisiae (CORE), with the regression line.]

Page 50: Biological Networks –  Graph Theory and Matrix Theory

50

Matrix – Permutations

A one-to-one mapping of the set {1, 2, 3, ..., n} onto itself is called a permutation. We denote the permutation by σ = j1 j2 ... jn.

The number of possible permutations is n!, and the set of them is denoted by Sn. For example, S2 = {12, 21}, S3 = {123, 132, 213, 231, 312, 321}.

Consider an arbitrary permutation σ in Sn: σ = j1 j2 ... jn. We say σ is even or odd according to whether there is an even or odd number of pairs (i, k) for which i > k but i precedes k in σ. We then define the sign of σ, written sgn σ, by

sgn(σ) = +1 if σ is even, −1 if σ is odd.
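A small Python helper (illustrative, not from the slides) that computes sgn(σ) by counting the inversion pairs defined above.

```python
def sign(perm):
    """sgn(sigma): +1 if the number of inversions (i > k but i precedes k) is even, else -1."""
    inversions = sum(
        1
        for a in range(len(perm))
        for b in range(a + 1, len(perm))
        if perm[a] > perm[b]
    )
    return 1 if inversions % 2 == 0 else -1

print(sign([3, 5, 1, 4, 2]))   # 6 inversions -> +1, matching the worked example sigma = 35142
```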

Page 51: Biological Networks –  Graph Theory and Matrix Theory

51

Matrix

Example

Consider the permutation σ = 35142 in S5.
- 3 and 5 precede and are greater than 1; hence the pairs (3, 1) and (5, 1)
- 3, 5 and 4 precede and are greater than 2; hence (3, 2), (5, 2), (4, 2)
- 5 precedes and is greater than 4; hence (5, 4)
Since exactly 6 such pairs occur, σ is even and sgn(σ) = 1.

Page 52: Biological Networks –  Graph Theory and Matrix Theory

52

Matrix – Determinant

The determinant of a matrix A is defined by

det(A) = Σ_σ sgn(σ) a_{1 j1} a_{2 j2} ... a_{n jn}

where the factors come from successive rows, so the first subscripts are in the natural order 1, 2, ..., n. The sequence of second subscripts forms a permutation σ = j1 j2 ... jn in Sn, and the sum runs over all permutations in Sn.

Example. In S2, the permutation 12 is even and the permutation 21 is odd:

det(A) = a11 a22 − a12 a21

In S3, the permutations 123, 231 and 312 are even, and the permutations 132, 213 and 321 are odd:

det(A) = a11 a22 a33 + a12 a23 a31 + a13 a21 a32 − a13 a22 a31 − a12 a21 a33 − a11 a23 a32

Permanent of A: per(A) = Σ_σ a_{1 j1} a_{2 j2} ... a_{n jn}, i.e. the same sum but with the sign factor always equal to 1.
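An illustrative Python sketch of these two definitions, summing over all permutations with itertools (fine for small n; not how determinants are computed in practice).

```python
from itertools import permutations

def sign(perm):
    """+1 for an even permutation, -1 for an odd one (inversion count)."""
    inv = sum(1 for a in range(len(perm)) for b in range(a + 1, len(perm)) if perm[a] > perm[b])
    return 1 if inv % 2 == 0 else -1

def det_and_per(A):
    """Determinant and permanent via the permutation expansions above."""
    n = len(A)
    det = per = 0
    for perm in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][perm[i]]
        det += sign(perm) * prod
        per += prod
    return det, per

A = [[1, 2], [3, 2]]
print(det_and_per(A))   # det = 1*2 - 2*3 = -4, per = 1*2 + 2*3 = 8
```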

Page 53: Biological Networks –  Graph Theory and Matrix Theory

53

Matrix – The incidence matrix

The incidence matrix T(G) of a graph G with N vertices and M edges is the N×M matrix whose rows and columns correspond to the vertices and edges, respectively, of G. It is defined as

Tij = 1 if the j-th edge is incident with the i-th vertex, and 0 otherwise.

Example. For the graph G with vertices 1, 2, 3, 4 and edges a, b, c, d:

T(G) =
    a b c d
1   1 1 0 0
2   0 1 1 0
3   1 0 1 1
4   0 0 0 1

Each column contains exactly two 1s, so summing all entries of T gives 2M = 2(4) = 8, where M is the number of edges. Moreover, (T T^T)_ii = di, the degree of vertex i, and (T T^T)_ij for i ≠ j counts the edges joining vertices i and j.

Page 54: Biological Networks –  Graph Theory and Matrix Theory

54

Matrix – The circuit matrix

The circuit matrix C(G) of a graph G, whose cycles (circuits) c and edges e are labeled, is a c×e matrix; the rows and columns of the matrix correspond to the circuits and edges, respectively, of G. It is defined as

Cij = 1 if the i-th cycle contains the j-th edge, and 0 otherwise.

Example. For the graph G with vertices 1, 2, 3, 4, edges a, b, c, d, e, and cycles
C1 = {a, d, e}, C2 = {b, c, e}, C3 = {a, b, c, d}:

C(G) =
     a b c d e
C1   1 0 0 1 1
C2   0 1 1 0 1
C3   1 1 1 1 0

Page 55: Biological Networks –  Graph Theory and Matrix Theory

55

Principal Component Analysis (PCA) – General Outline

Suppose you have a microarray dataset composed of 1000 genes, each of which has an expression value over 10 experiments. The dimensionality of that dataset is therefore 10 or 1000, depending on the view taken.

The data, though clumped around several central points in that hyperspace, will generally tend towards one direction. If one were to draw a solid line that best describes that direction, then that line is the first principal component (PC). The result is a space in which the axes are the eigenvectors of the covariance matrix of the experiments (in this space each point is a gene) or of the genes (in this space each point is an experiment).

Any variation that is not captured by the first PC is captured by subsequent orthogonal PCs, each of which captures the maximum amount of variation left in the data.

The first 3 PCs could themselves act as Cartesian axes, and the data they capture can therefore be plotted in terms of these axes. Hence there is a reduction of dimensionality. When the data are plotted in this manner, they are said to be plotted in PC-space.

References
1. http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm
2. Draghici S., 2003. Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC

Page 56: Biological Networks –  Graph Theory and Matrix Theory

56

Principal Component Analysis (PCA)
- PCA is commonly used in microarray research as a cluster analysis tool.
- It captures the variance in a dataset in terms of principal components.
- It tries to reduce the dimensionality of the data to summarize the most important (i.e. defining) parts whilst simultaneously filtering out noise.
- Normalization, however, can sometimes remove this noise and make the data less variant, which could affect the ability of PCA to capture data structure.
- PCA can be applied to datasets to capture the cluster structure (using just the first few PCs) prior to cluster analysis (e.g. before performing k-means clustering to determine a good value for K).

Coordinate transformation (translation + rotation): most of the variance lies along the first eigenvector; variance along the second eigenvector is probably due to noise. (A small numerical sketch of this procedure follows below.)

Page 57: Biological Networks –  Graph Theory and Matrix Theory

57

Principal Component Analysis (PCA)
- PCA pays attention to those dimensions that account for a large variance in the data and ignores the dimensions in which the data do not vary very much.
- The direction of each PC is determined by calculating the eigenvectors of the covariance matrix of the pattern.
- An eigenvector of a matrix A is defined as a vector z such that Az = λz, where λ is the eigenvalue.
- For example,

A =
-1  1
 0 -2

has eigenvalues λ1 = −1 and λ2 = −2, and eigenvectors z1 = (1, 0) and z2 = (1, −1):

A z1 = (−1)(1, 0) = λ1 z1

and similarly for λ2 = −2.
- In other words, the covariance matrix captures the shape of the set of data.
- The eigenvalue with the largest absolute value implies that the data have the largest variance along its eigenvector.
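A quick numpy confirmation of the worked example (using the 2×2 matrix as reconstructed above; illustrative only).

```python
import numpy as np

A = np.array([[-1.0,  1.0],
              [ 0.0, -2.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                       # [-1. -2.]
print(eigvecs)                       # columns proportional to (1, 0) and (1, -1)
print(A @ np.array([1.0, -1.0]))     # (-2, 2) = -2 * (1, -1), confirming the second eigenpair
```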

Page 58: Biological Networks –  Graph Theory and Matrix Theory

58

Principle Components Analysis (PCA)

How to calculate the eigenvalues of a matrix? Solve det(A − λI) = 0:

det( −1−λ    1
      0    −2−λ ) = (λ + 1)(λ + 2) = 0

so λ1 = −1 and λ2 = −2.

How to find the eigenvectors? For λ1 = −1, solve (A − λ1 I) z = 0:

( 0  1 ) (x)   (0)
( 0 −1 ) (y) = (0)   =>  y = 0, x = anything; choose x = 1, y = 0, so z1 = (1, 0).

For λ2 = −2, solve (A − λ2 I) z = 0:

( 1  1 ) (x)   (0)
( 0  0 ) (y) = (0)   =>  x = −y; choose x = 1, y = −1, so z2 = (1, −1).

Normalization of eigenvectors: require |z| = 1, i.e. x^2 + y^2 = 1; eigenvectors are determined only up to a multiplicative constant.

Page 59: Biological Networks –  Graph Theory and Matrix Theory

59

Eigenvalues and eigenvectors

If A is a square matrix, then λ is an eigenvalue of A if Av = λv for some nonzero vector v,

where v is called an eigenvector of A belonging to λ.

Linear dependence
The vectors v1, ..., vm are said to be linearly dependent if there exist scalars a1, ..., am, not all of them 0, such that

a1 v1 + ... + am vm = 0. Otherwise, the vectors are said to be linearly independent.

Example: The vectors u = (1, −1, 0), v = (1, 3, −1) and w = (5, 3, −2) are dependent since 3u + 2v − w = 0.

Theorem: Nonzero eigenvectors belonging to distinct eigenvalues are linearly independent.

Page 60: Biological Networks –  Graph Theory and Matrix Theory

60

Eigenvalues and eigenvectors

Theorem. An n-square matrix A is similar to a diagonal matrix B if and only if A has n linearly independent eigenvectors. In this case the diagonal elements of B are the corresponding eigenvalues.

In the above theorem, if we let P be the matrix whose columns are the n independent eigenvectors of A, then B = P^-1 A P.

Example. Consider the matrix

A =
1 2
3 2

A has two independent eigenvectors (2, 3) and (1, −1). Set

P =
2  1
3 −1

so that

P^-1 =
1/5  1/5
3/5 −2/5

Then A is similar to the diagonal matrix

B = P^-1 A P =
4  0
0 −1

As expected, the diagonal elements 4 and -1 of the diagonal matrix B are eigenvalues corresponding to the given eigenvectors.

Page 61: Biological Networks –  Graph Theory and Matrix Theory

61

Characteristic Polynomial – Cayley-Hamilton Theorem

The matrix tI − A, where I is the n-square identity matrix and t is a scalar variable, is called the characteristic matrix of A. Its determinant, D_A(t) = det(tI − A), which is a polynomial in t, is called the characteristic polynomial of A.

Cayley-Hamilton Theorem

Every matrix is a zero of its characteristic polynomial.

Example:

The characteristic polynomial of the matrix

A =
1 2
3 2

is D(t) = (t − 1)(t − 2) − (−2)(−3) = t^2 − 3t − 4. As expected, A is a zero of D(t):

D(A) = A^2 − 3A − 4I =
( 7  6 )     ( 1 2 )     ( 1 0 )   ( 0 0 )
( 9 10 ) − 3 ( 3 2 ) − 4 ( 0 1 ) = ( 0 0 )

Page 62: Biological Networks –  Graph Theory and Matrix Theory

62

Singular Value Decomposition

Performing PCA is equivalent to performing Singular Value Decomposition (SVD) on the covariance matrix of the data.

Applications: image processing and compression, information retrieval, immunology, molecular dynamics, least-squares problems.

What is SVD? (See the MIT page referenced below for more depth.) For the sake of example, let A be an n×p matrix that represents gene expression experiments such that the rows are genes and the columns are experiments (i.e. arrays). The SVD of A is the factorization

A = U S V^T   (Eq. 1)

(Note that V^T means the transpose of matrix V.)

How to determine U, V and S? In equation 1, the matrices U and V are orthogonal (their columns are orthonormal: U^T U = I, V^T V = I). The columns of U are called the left singular vectors (gene coefficients) and the rows of V^T are called the right singular vectors (expression level vectors).

To calculate the matrices U and V, one must calculate the eigenvectors and eigenvalues of A A^T and A^T A. These products of A with its transpose are square matrices (the number of columns equals the number of rows).

Page 63: Biological Networks –  Graph Theory and Matrix Theory

63

Singular Value Decomposition The eigenvectors of AAT (a nxn matrix) columns of U (a nxn matri

x)The eigenvectors of ATA (a pxp matrix) columns of V (a pxp matrix)The eigenvalues of AAT or ATA, when square-rooted (s1≧ s2 ≧ …) the columns of S (nxp matrix)The diagonal of S is said to be the singular values (min(n,p)) of the original matrix, A.

Each eigenvector described above represents a principle component. PC1 (Principle Component 1), which is defined as the eigenvector with the highest corresponding eigenvalue. The individual eigenvalues are numerically related to the variance they capture via PC's - the higher the value, the more variance they have captured.

Please also see Yeung & Ruzzo (2001), where it is shown that lower PC's can capture data structure.

http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

http://www.netlib.org/lapack/lug/node53.html or http://kwon3d.com/theory/jkinem/svd.html

Page 64: Biological Networks –  Graph Theory and Matrix Theory

64

Singular Value Decomposition

eigenvalues of AAT or ATA

eigenvectors of AAT

eigenvectors of ATA

Example

0.34
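An illustrative numpy sketch relating the SVD of a matrix to the eigen-decompositions described above (the matrix here is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # toy n x p matrix (n = 5 genes, p = 3 experiments)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values are the square roots of the eigenvalues of A^T A (or A A^T).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # descending order
print(np.allclose(s, np.sqrt(eigvals)))       # True

# Columns of V (rows of Vt) are eigenvectors of A^T A, up to sign.
_, eigvecs = np.linalg.eigh(A.T @ A)
print(np.allclose(np.abs(eigvecs[:, ::-1]), np.abs(Vt.T)))   # True (compared up to sign)

# The factorization reconstructs A.
print(np.allclose((U * s) @ Vt, A))           # True
```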

Page 65: Biological Networks –  Graph Theory and Matrix Theory

65

Yeast Two-hybrid System

pubs.acs.org/cen/coverstory/ 7831/7831scit1.html

http://cmbi.bjmu.edu.cn/

There are two plasmids:
- a fusion of the target to the activation domain (AD)
- a fusion of the bait to the DNA-binding domain (BD)
If the target protein binds to the bait protein, it brings the AD close to the BD. The BD binds to the GAL4 promoter and brings the AD close so that it can activate transcription. LacZ is expressed from a GAL4 promoter when the proteins are interacting; LacZ will turn the yeast blue.

Page 66: Biological Networks –  Graph Theory and Matrix Theory

66

Immunoprecipitation

(1) Antibody added to a mixture of radiolabeled (*) or unlabeled proteins binds specifically to its antigen (A) (left tube). (2) The antibody-antigen complex is absorbed from solution through the addition of an immobilized antibody-binding protein such as Protein A-Sepharose beads (middle panel). (3) Upon centrifugation, the antibody-antigen complex is brought down in the pellet (right panel). Subsequent liberation of the antigen can be achieved by boiling the sample in the presence of SDS.

Page 67: Biological Networks –  Graph Theory and Matrix Theory

67

Co-Immunoprecipitation

Co-IP vs. IP
Co-immunoprecipitation (Co-IP) is a popular technique for protein interaction discovery. Co-IP is conducted in essentially the same manner as an IP. However, in a Co-IP the target antigen precipitated by the antibody "co-precipitates" a binding partner/protein complex from a lysate (細胞溶解液), i.e., the interacting protein is bound to the target antigen, which is bound by the antibody, which in turn is captured on the Protein A or G gel support.