Mutual Information Mathematical Biology Seminar 23.5.2005


1. Information Theory

Information and uncertainty are terms which describe any process that selects one or more objects from a set of objects.


Information Theory


Symbols A, B, C: uncertainty = 3 symbols

Symbols 1, 2: uncertainty = 2 symbols

Combined symbols A1, A2, B1, B2, C1, C2: uncertainty = 6 symbols

Uncertainty = log(M), where M is the number of symbols.
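Taking the logarithm makes uncertainty additive when two independent sources are combined: log(6) = log(3) + log(2). A minimal sketch of that check, assuming base-2 logarithms (bits); the function name `uncertainty_bits` is hypothetical:

```python
import math

def uncertainty_bits(num_symbols: int) -> float:
    """Uncertainty, in bits, of choosing among equally likely symbols."""
    return math.log2(num_symbols)

letters = uncertainty_bits(3)   # symbols A, B, C
digits = uncertainty_bits(2)    # symbols 1, 2
combined = uncertainty_bits(6)  # symbols A1, A2, B1, B2, C1, C2

# The combined uncertainty is the sum of the two separate uncertainties.
print(letters + digits, combined)
```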

Information Theory

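For M equally likely symbols, each symbol has probability $P = 1/M$:

$$\log(M) = -\log(M^{-1}) = -\log(1/M) = -\log(P)$$

Surprisal: $u_i = -\log_2(P_i)$

$P_i \to 0$: $u_i \to \infty$ (very surprised)

$P_i \to 1$: $u_i \to 0$ (not surprised)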

Entropy (self information)

$X$ – a discrete random variable

$p(x)$ – its probability distribution

Entropy is a measure of the uncertainty (information) of a discrete random variable. It quantifies how uncertain we are of the outcome.

$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$

Entropy – properties:

Maximum entropy – a uniform distribution.

$$H(X) \ge 0$$

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)} = E\left[\log_2 \frac{1}{p(x)}\right]$$
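The two properties can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms and the usual convention that terms with p(x) = 0 contribute nothing; the function name `entropy` is chosen here for illustration:

```python
import math
from typing import Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits: H(X) = -sum p(x) log2 p(x)."""
    # Terms with p = 0 are skipped (convention: 0 * log 0 = 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximum for 4 outcomes
skewed = entropy([0.7, 0.1, 0.1, 0.1])       # less uncertain
certain = entropy([1.0, 0.0, 0.0, 0.0])      # no uncertainty at all
print(uniform, skewed, certain)
```

For four outcomes the uniform distribution reaches log2(4) = 2 bits, and any other distribution over the same outcomes gives less.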


Joint Entropy

Joint entropy measures the uncertainty of the pair (X, Y).

$$H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y)$$

$$H(X,Y) \le H(X) + H(Y)$$

Conditional Entropy

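Conditional entropy measures the uncertainty remaining in Y when X is known.

$$\begin{aligned}
H(Y|X) &= \sum_{x \in X} p(x)\, H(Y|X=x) \\
&= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log p(y|x) \\
&= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(y|x) \\
&= -E\left[\log p(Y|X)\right]
\end{aligned}$$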
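A small worked example may help here. The sketch below computes H(Y|X) from a joint probability table and checks the chain rule H(X,Y) = H(X) + H(Y|X). The table values and the function names are illustrative assumptions, and logarithms are base 2:

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X) in bits, where joint[x][y] = p(x, y)."""
    h = 0.0
    for row in joint:
        px = sum(row)  # marginal p(x)
        for pxy in row:
            if pxy > 0:
                h -= pxy * math.log2(pxy / px)  # p(x,y) log p(y|x)
    return h

# Illustrative joint distribution: Y is fixed when X = 0, a fair coin when X = 1.
joint = [[0.5, 0.0],
         [0.25, 0.25]]

h_x = entropy([sum(row) for row in joint])
h_xy = entropy([p for row in joint for p in row])
h_y_given_x = conditional_entropy(joint)
print(h_x, h_y_given_x, h_xy)
```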

Mutual Information

It is the reduction of uncertainty of one variable due to knowing about the other, or the amount of information one variable contains about the other.

$$MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Normalized MI:

$$\overline{MI}(X,Y) = \frac{MI(X,Y)}{\max\{H(X), H(Y)\}}$$

Mutual Information – properties:

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$

$$MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$

$$MI(X,Y) \ge 0$$

MI(X,Y) = 0 only when X and Y are independent: H(X|Y) = H(X).

MI(X,X) = H(X) - H(X|X) = H(X): entropy is the self-information.
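These identities are easy to verify on small joint tables. The sketch below uses MI(X,Y) = H(X) + H(Y) - H(X,Y) and the normalization by max{H(X), H(Y)}; the function names and the zero-entropy guard are assumptions made for this illustration:

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y), with joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [p for row in joint for p in row]
    return entropy(px) + entropy(py) - entropy(pxy)

def normalized_mi(joint):
    """MI divided by max{H(X), H(Y)}; returns 0.0 if both entropies are 0."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    denom = max(entropy(px), entropy(py))
    return mutual_information(joint) / denom if denom > 0 else 0.0

independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x,y) = p(x) p(y)
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X
print(mutual_information(independent), mutual_information(identical))
```

The independent table gives MI = 0, and the identical table gives MI(X,X) = H(X) = 1 bit, matching the properties listed above.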

2. Applications:

• Clustering algorithms

• Clustering quality


Clustering algorithms

Motivation: MI’s capability to measure a general dependence among random variables. Use MI as a similarity measure.

Minimize the statistical correlation among clusters, in contrast to distance-based algorithms, which minimize the total variance within the clusters.


Clustering algorithms


Two methods:

1. Mutual-information based – MI, PMI
2. Combined mutual-information and distance-based – MIK, MIF

MI – mutual information minimization

Grouping property:

$$MI(X, Y, Z) = MI(X, Y) + MI((X, Y), Z)$$

1. Compute a proximity matrix based on pairwise mutual information; assign n clusters such that each cluster contains exactly one object.
2. Find the two closest clusters i and j.
3. Create a new cluster (ij) by combining i and j.
4. Delete the rows/columns with indices i and j from the proximity matrix, and add one row/column containing the proximities between cluster (ij) and all other clusters.
5. If the number of clusters is still > 2, go to step 2; otherwise join the two clusters and stop.
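The merge loop can be sketched in a few lines. This is an illustration under stated assumptions, not the seminar's implementation: the data are discrete, MI is the simple plug-in estimate from observed frequencies, "closest" means largest MI, a merged cluster is treated as the composite variable (X, Y) suggested by the grouping property, and proximities are recomputed on every pass instead of being kept in a matrix. The names `mi_agglomerate`, `mi`, and `entropy` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(values):
    """Plug-in entropy (bits) of a sequence of hashable observations."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mi(xs, ys):
    """Plug-in MI estimate: H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mi_agglomerate(profiles):
    """Return the merge order as a list of (members_a, members_b) pairs.

    profiles: dict mapping object name -> list of discrete observations.
    """
    # Step 1: one cluster per object, keyed by a tuple of member names.
    clusters = {(name,): list(vals) for name, vals in profiles.items()}
    merges = []
    while len(clusters) > 1:
        # Step 2: the closest pair is the pair with the largest MI.
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: mi(clusters[pair[0]], clusters[pair[1]]))
        # Steps 3-4: replace a and b by the composite variable (a, b).
        clusters[a + b] = list(zip(clusters.pop(a), clusters.pop(b)))
        merges.append((a, b))
    return merges

# Illustrative data: g1 and g2 are identical, g3 is unrelated to both.
profiles = {
    "g1": [0, 0, 1, 1, 0, 1, 0, 1],
    "g2": [0, 0, 1, 1, 0, 1, 0, 1],
    "g3": [0, 1, 0, 1, 0, 1, 1, 0],
}
merges = mi_agglomerate(profiles)
print(merges)
```

With few samples the plug-in estimate for composite variables is biased upward, so a real analysis would need a better MI estimator.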

PMI – threshold based on pairwise mutual information

1. Start with the first gene and group with it the genes that have the smallest mutual-information-based distance to it. Repeat until no gene can be added without exceeding the threshold. Then start with the second gene and repeat the same procedure (all genes are available). Repeat for all genes.

2. Select the largest candidate cluster.

3. Repeat steps 1 and 2 until K clusters are obtained.

Distance based on the normalized MI:

$$d(X,Y) = 1 - \overline{MI}(X,Y)$$

PMI

Threshold:

1. Mean of the distances of all gene pairs
2. Chosen empirically
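The first threshold choice can be sketched as follows, assuming discrete expression profiles, a plug-in MI estimate, and the distance 1 minus normalized MI. The function names and the handling of constant profiles are assumptions made for this illustration:

```python
import math
from collections import Counter
from itertools import combinations

def entropy(values):
    """Plug-in entropy (bits) of a sequence of hashable observations."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mi_distance(xs, ys):
    """d(X,Y) = 1 - MI(X,Y) / max{H(X), H(Y)}."""
    hx, hy = entropy(xs), entropy(ys)
    denom = max(hx, hy)
    if denom == 0:
        return 1.0  # assumption: two constant profiles share no information
    return 1.0 - (hx + hy - entropy(list(zip(xs, ys)))) / denom

def mean_pairwise_threshold(profiles):
    """Mean of the MI-based distances over all gene pairs."""
    dists = [mi_distance(profiles[a], profiles[b])
             for a, b in combinations(profiles, 2)]
    return sum(dists) / len(dists)

# Illustrative data: g1 and g2 are identical, g3 is unrelated to both.
profiles = {
    "g1": [0, 0, 1, 1, 0, 1, 0, 1],
    "g2": [0, 0, 1, 1, 0, 1, 0, 1],
    "g3": [0, 1, 0, 1, 0, 1, 1, 0],
}
threshold = mean_pairwise_threshold(profiles)
print(threshold)
```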

Optimal solution – simulated annealing algorithm (optimization).

Cost function:

$$f(s) = \sum_{i<j} MI(X_i, X_j)$$

$$s^* = \min_{s \in S} f(s)$$

Combined methods

Euclidean distance – positive correlation.

Mutual information – nonlinear correlation.

A small data sample size → combined algorithms


MIF - combined metric of MI and fuzzy membership distance

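The objective function:

$$h(s) = \lambda \, \frac{1}{K_1} \, \frac{1}{M} \sum_{i=1}^{N} \sum_{k=1}^{K} u_{k,i}^2 \, \lVert y_i - c_k \rVert^2 + (1 - \lambda) \, \frac{1}{K_2} \, f(s)$$

$\lambda$ – a weight factor, $0 \le \lambda \le 1$

$K_1$, $K_2$ – normalization constants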


Performance on simulated data

8 clustering algorithms.

Measure of performance: percentage of points placed into the correct clusters.

1. 4 variables:

$$x_1, x_2, x_3, x_4 \sim \mathrm{Ber}(p), \quad p = 0.5$$

$$p(x_1, x_2, x_3, x_4) = p(x_1, x_2)\, p(x_3, x_4)$$

The sample size (M) is varied.



Result (1):

1. The MI method outperforms Fuzzy, K-means, linkage, biclustering, and PMI.
2. MIF – best clustering accuracy.
3. MIK has similar performance to MI.
4. MI-based clustering methods – more accurate as the sample size increases.



2. Different numbers of genes (N), M = 30

The data are generated according to:

$$p(X_1, X_2, \ldots, X_k) = p(X_1)\, p(X_2) \cdots p(X_k)$$

Results (2):

In addition to the previous results:

1. Performance degrades as the number of genes increases.
2. The degree of degradation depends on the distributions governing the data.

Experimental Analysis

Clustering genes based on the similarity of their expression patterns in a limited set of experiments. Genes with similar expression patterns are more likely to have similar biological function (although this does not necessarily provide the best possible grouping).

Higher entropy for a gene means that its expression data are more randomly distributed.

The higher the MI between two genes, the more likely it is that they have a biological relationship.




579 genes from 26 human glioma surgical tissue samples.

526 genes after filtering out genes with insufficient variability.

Glioma

Gliomas are tumors that can be found in various parts of the brain. They arise from the support cells of the brain, the glial cells.



[Figure: binary profiles of the clusters obtained with Fuzzy K-means and with MIF]


Results (Fuzzy vs. MIF):

Two small clusters were broken out from the Fuzzy clusters.

Although the number of genes that changed clusters is small, the decrease in error is significant (from 2.013 to 1.084).



Results:

The results are the same for MIK and Fuzzy.

Compared with MIF and MIK, MI and PMI give different results.




Clustering quality

What choice of number of clusters generally yields the most information about gene function (where function is known)?

9 different algorithms, 2 databases, 4 data sets.

A table of 6300 genes × 2000 attributes. A contingency table for each cluster-attribute pair.


Clustering quality

Calculate entropies:

$$H(C), \quad H(A_i), \quad H(C, A_i)$$

and the total MI between the cluster result C and all the attributes as:

$$MI(C, A_1, A_2, \ldots, A_{N_A}) = \sum_i MI(C, A_i) = N_A\, H(C) + \sum_i H(A_i) - \sum_i H(A_i, C)$$
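The total MI follows directly from the three kinds of entropies. The sketch below assumes each attribute is a discrete (for example 0/1) annotation per gene and uses plug-in entropy estimates from observed counts; `total_mi` and the toy data are hypothetical:

```python
import math
from collections import Counter

def entropy(values):
    """Plug-in entropy (bits) of a sequence of hashable observations."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def total_mi(clusters, attributes):
    """N_A * H(C) + sum_i H(A_i) - sum_i H(A_i, C).

    clusters: cluster label of each gene.
    attributes: list of columns, one per attribute, each with one value per gene.
    """
    n_attr = len(attributes)
    return (n_attr * entropy(clusters)
            + sum(entropy(a) for a in attributes)
            - sum(entropy(list(zip(a, clusters))) for a in attributes))

# Illustrative data: 4 genes in 2 clusters, 2 attributes.
clusters = [0, 0, 1, 1]
attributes = [
    [1, 1, 0, 0],  # perfectly aligned with the clusters: contributes 1 bit
    [0, 1, 0, 1],  # independent of the clusters: contributes 0 bits
]
score = total_mi(clusters, attributes)
print(score)
```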

1. How does MI change?

Given: 3000 genes, 30 clusters.

Perform random swaps: the cluster sizes are held fixed, but the degree of correlation within the clusters is slowly destroyed.


Results:

1. MI decreases

2. MI converges to a non-zero value


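2. Score the partition

1. Compute MI for the clustered data.
2. Compute MI for clusterings obtained by random swaps, repeating until a distribution of values is obtained.
3. Compute the z-score:

$$z = \frac{MI_{real} - MI_{random}}{s_{random}}$$

where $MI_{real}$ is the MI of the clustered data, and $MI_{random}$ and $s_{random}$ are the mean and standard deviation of the MI values from the randomized clusterings.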
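The scoring procedure can be sketched as follows. As a simplifying assumption, each randomized clustering is produced by fully shuffling the cluster labels (which keeps the cluster sizes fixed) instead of performing individual swaps; the function names, the number of randomizations, and the seed are hypothetical:

```python
import math
import random
import statistics
from collections import Counter

def entropy(values):
    """Plug-in entropy (bits) of a sequence of hashable observations."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def total_mi(clusters, attributes):
    """Sum over attributes of MI(C, A_i), from plug-in entropies."""
    return sum(entropy(clusters) + entropy(a) - entropy(list(zip(a, clusters)))
               for a in attributes)

def mi_z_score(clusters, attributes, n_random=200, seed=0):
    """z = (MI_real - mean(MI_random)) / stdev(MI_random)."""
    rng = random.Random(seed)
    mi_real = total_mi(clusters, attributes)
    shuffled = list(clusters)
    mi_random = []
    for _ in range(n_random):
        rng.shuffle(shuffled)  # keeps cluster sizes, destroys structure
        mi_random.append(total_mi(shuffled, attributes))
    s_random = statistics.stdev(mi_random)
    if s_random == 0:
        raise ValueError("randomized MI values show no spread")
    return (mi_real - statistics.mean(mi_random)) / s_random

# Illustrative data: 12 genes, 2 clusters, one attribute aligned with them.
clusters = [0] * 6 + [1] * 6
attributes = [list(clusters)]
z = mi_z_score(clusters, attributes)
print(z)
```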


Large z-score → greater distance from the random clusterings → clustering results more significantly related to gene function.

Results:

1. Low cluster numbers.
2. Clustering algorithms which produce nonuniform cluster size distributions perform better.


Conclusion – Advantages(1):

Very simple and natural hierarchical clustering algorithm (as MI estimates become better, the results should improve too).

Optimal results when the sample size is large.

MI is a proximity measure that also recognizes negatively and nonlinearly correlated data, so it is a more general way to model relationships between genes.

MI is not biased by outliers. Euclidean distance is more easily distorted when variables are not uniformly distributed.


Conclusion – Advantages(2):

Expression levels can be modeled to include measurement noise.


Conclusion - Disadvantages:

In general, MI is not easy to estimate (for example, for continuous random variables).

Performance degrades substantially as the number of genes increases.


Conclusion

It is not so accurate to look at each condition as an independent observation. Each point is significant.

There are analyses on datasets which do not miss any non-linear correlations.

It is more accurate as a validation method.




