Bioinformatics - Lecture 07
Bioinformatics: Clusters and networks
Martin Saturka
http://www.bioplexity.org/lectures/
EBI version 0.4
Creative Commons Attribution-Share Alike 2.5 License
Learning on profiles
Supervised and unsupervised methods on expression data: approximation, clustering, classification, inference; crisp and fuzzy relation models; graphical models.

Main topics
- vector approaches: SVM, ANN, kernels; classification
- data organization: clustering techniques, map generation
- regulatory systems: Bayesian networks, algebraic description
Inference types
reasoning
- logical: reasoning based on standard logical rules
- statistical: reasoning based on frequent co-occurrence

deduction (logic, recursion): $A,\ A \to B \vdash B$
induction (frequentist statistics): many $A, B \vdash A \sim B$; few $A, \neg B \vdash A \to B$
abduction (Bayesian statistics): $A_1 \to B, \dots, A_n \to B,\ B \vdash A_i$
Machine learning methods
supervised learning methods: with known correct outputs on the training data
- approximation: measured data to a (continuous) output function; growth rate → nutrition supply regulation
- classification: measured data to a discrete output function; expression profiles → illness diagnosis
- regression: (cor)relations of continuous measured data; a gene expression magnitude → growth rate

unsupervised learning methods: without known desired outputs on the used data
- data granulation: internal data organization and distribution
- data visualization: an overall outer view onto the internal data
MLE
maximum likelihood estimation: $L(y \mid X) = \Pr(X \mid Y = y)$
the conditional probability viewed as a function of the unknown condition; with a known outcome, a reverse view on probability
Bernoulli trials example
$$L(\theta \mid H{=}11, T{=}10) \equiv \Pr(H{=}11, T{=}10 \mid p{=}\theta) = \binom{21}{11}\,\theta^{11}(1-\theta)^{10}$$
$$0 = \frac{\partial}{\partial\theta}\, L(\theta \mid H, T) = \binom{21}{11}\,\theta^{10}(1-\theta)^{9}\,(11 - 21\theta) \;\Rightarrow\; \theta = 11/21$$
- used when no better model is available; maximizations inside dynamic programming techniques ✓
- variance estimation leads to the biased sample variance ✗
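A minimal sketch of the coin example above, using only the standard library: maximizing the Bernoulli likelihood on a grid recovers the closed-form estimate θ = 11/21.

from math import comb

H, T = 11, 10  # observed heads and tails

def likelihood(theta):
    # L(theta | H, T) = C(21, 11) * theta^11 * (1 - theta)^10
    return comb(H + T, H) * theta**H * (1 - theta)**T

thetas = [i / 1000 for i in range(1, 1000)]   # grid of candidate values
theta_hat = max(thetas, key=likelihood)
print(theta_hat, 11 / 21)                     # ~0.524 in both cases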
Regression
linear regression
- least squares: for homoskedastic distributions; the sample mean is the best estimate
- least absolute deviations: a robust version; the sample median is a safe estimate
$$\min_{\bar x \in \mathbb{R}} \sum_i |x_i - \bar x|^2 \qquad\qquad \min_{\bar x \in \mathbb{R}} \sum_i |x_i - \bar x|$$
Setting each derivative to zero, $\bigl(\sum_i |x_i - \bar x|^2\bigr)' = 0$ gives the arithmetic mean $\bar x = \sum_i x_i / n$, while $\bigl(\sum_i |x_i - \bar x|\bigr)' = 0$ gives the median, $\#\{x_i : x_i < \bar x\} = \#\{x_i : x_i > \bar x\}$.
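A small numpy illustration of the two criteria, assuming an outlier-contaminated sample: the squared loss is minimized by the arithmetic mean, the absolute loss by the median.

import numpy as np

x = np.array([1.0, 1.2, 0.9, 1.1, 8.0])    # the 8.0 is an outlier

sq_loss = lambda c: np.sum((x - c) ** 2)    # least squares criterion
abs_loss = lambda c: np.sum(np.abs(x - c))  # least absolute deviations

mean, median = x.mean(), np.median(x)
print(mean, median)                         # 2.44 vs 1.1
print(sq_loss(mean) < sq_loss(median))      # True: the mean wins on squares
print(abs_loss(median) < abs_loss(mean))    # True: the median is robust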
Parametrization
parametrized curve crossing
- assumption: equal amounts of points above and below the curve
- the same probabilities to cross / not to cross the curve
- the crossing count distribution approaches the normal distribution
- over/under-fitting if the count is not in $(N-1)/2 \pm \sqrt{(N-1)/2}$
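A sketch of the crossing check above, assuming numpy and reading "crossing" as a sign change between consecutive fit residuals; the function name is illustrative.

import numpy as np

def crossing_ok(residuals):
    # count sign changes of consecutive residuals (curve crossings)
    signs = np.sign(residuals)
    crossings = np.sum(signs[1:] != signs[:-1])
    n = len(residuals)
    center, spread = (n - 1) / 2, np.sqrt((n - 1) / 2)
    # a count outside this range suggests over- or under-fitting
    return center - spread <= crossings <= center + spread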
Distinctions
empirical risk minimization for discrimination
metamethodology
- boosting: increasing the weights of training cases with wrong results
- probably approximately correct learning: to achieve a high probability of making convenient predictions

particular methods
- support vector machines
- artificial neural networks
- case-based reasoning
- nearest neighbour algorithm
- (naive) Bayes classifier
- decision trees
- random forests
SVM
Support vector machines
linear classifier
- maximal margin linear separation
- minimal distances for the misclassified points
Kernel methods
non-linear separations made linear in a higher-dimensional space via a mapping $F(\cdot)$
- the linear discriminant is given by the dot product $\langle F(x_i), F(x_j)\rangle$
- brought back into the low-dimensional space by a kernel $K(x_i, x_j)$
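A small check of the kernel identity, assuming numpy: for the quadratic feature map F below, the dot product in the 3D feature space equals the polynomial kernel K(x, y) = ⟨x, y⟩² evaluated in the original 2D space.

import numpy as np

def F(x):
    # explicit quadratic feature map R^2 -> R^3
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, y):
    # the matching polynomial kernel, evaluated in the low-dim space
    return np.dot(x, y) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(F(xi), F(xj)), K(xi, xj))   # both 16.0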
VC dimension
Vapnik-Chervonenkis dimension
classification error estimation
- the more power a method has, the more prone it is to overfitting
- misclassifications for a binary classifier $f(\alpha)$, with i.i.d. samples drawn from an unknown distribution
- $R(\alpha)$: the probability of misclassification in real usage
- $R_{emp}(\alpha)$: the fraction of misclassified cases in a training set

Then, with probability $1-\eta$, for a training set of size $N$:
$$R(\alpha) < R_{emp}(\alpha) + \sqrt{\frac{h\,(1+\ln(2N/h)) - \ln(\eta/4)}{N}}$$
where $h$ is the VC dimension: the size of maximal sets that $f(\alpha)$ can shatter; 3 for a line classifier in a 2D space.
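Plugging illustrative numbers into the bound (h = 3 for a line in 2D, N = 1000, η = 0.05; the training error is made up):

import math

h, N, eta = 3, 1000, 0.05   # VC dimension, training set size, 1 - confidence
R_emp = 0.10                # illustrative training-set error

bound = R_emp + math.sqrt((h * (1 + math.log(2 * N / h)) - math.log(eta / 4)) / N)
print(bound)                # holds for R(alpha) with probability 1 - eta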
ANN
artificial neural networks
[diagram: a feed-forward network with inputs $X_i$, $n$ hidden neurons, one output, and weights $w$]
neuron activation function: $f_2(e) = f_2(i_A w_{A2} + i_B w_{B2} - c_2)$
$f_i$ is non-linear, usually a sigmoid, with the $c_i$ given constants
ANN learning
error backpropagation
iterative weight adjusting
- compute the error for each training case: $\delta = \text{desired} - \text{computed}$
- propagate $\delta$ backward: $\delta_5 = \delta \cdot w_{5o}$, $\delta_1 = \delta_5 \cdot w_{15} + \delta_6 \cdot w_{16}, \dots$
- adjust the weights to new values:
$$w^{new}_{A1} = w_{A1} + \eta \cdot \delta_1 \cdot \frac{df_1(e)}{de} \cdot i_A \qquad w^{new}_{15} = w_{15} + \eta \cdot \delta_5 \cdot \frac{df_5(e)}{de} \cdot f_1(e), \dots$$

a kind of gradient descent method
- other parameters (of the sigmoid functions) can be adjusted as well
- converges to a local minimum of the errors
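A minimal numpy sketch of these update rules for one hidden sigmoid layer and a linear output node (with a linear output the slide formulas are the exact gradient); the sizes and learning rate are illustrative.

import numpy as np

def f(e):                       # sigmoid activation
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))    # input-to-hidden weights
W2 = rng.normal(size=(3,))      # hidden-to-output weights
eta = 0.1                       # learning rate

x, desired = np.array([0.5, 1.0]), 0.8
for _ in range(200):
    hidden = f(x @ W1)          # hidden activations f_i(e)
    out = hidden @ W2           # linear output node
    delta = desired - out       # delta = desired - computed
    delta_h = delta * W2        # propagate delta backward through the weights
    W2 += eta * delta * hidden                                # output-weight update
    W1 += eta * np.outer(x, delta_h * hidden * (1 - hidden))  # eta*delta_i*f'(e)*input
print(out)                      # converges toward 0.8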
SOM
self-organizing maps
a kind of unsupervised version of ANNs
- the map is commonly a 2D array
- the array nodes exhibit a simple property
- each input is connected to each output
- used to visualize multidimensional data
- similar parts should behave similarly

competitive learning of the network
- nodes compete to represent particular data objects
- each node of the array has its vector of weights, initially either random or along the two principal components
- iterative adjusting of the node weights / property:
  - take a random data object
  - find its best matching node according to the nodes' weights
  - adjust the node's weights / property to be more similar to the data
  - adjust the neighboring nodes somewhat as well
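A compact numpy sketch of this competitive loop; the grid size, learning rate, and neighborhood radius are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))       # 200 data objects with 3 features
grid = rng.normal(size=(10, 10, 3))    # 10x10 map, one weight vector per node
steps = 2000

for t in range(steps):
    eta = 0.5 * (1 - t / steps)            # decaying learning rate
    radius = 0.5 + 3.0 * (1 - t / steps)   # decaying neighborhood radius
    x = data[rng.integers(len(data))]      # take a random data object
    # best matching node according to the nodes' weights
    d = np.linalg.norm(grid - x, axis=2)
    bi, bj = np.unravel_index(d.argmin(), d.shape)
    # move the winner, and its neighbors somewhat, toward the object
    ii, jj = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
    neigh = np.exp(-((ii - bi)**2 + (jj - bj)**2) / (2 * radius**2))
    grid += eta * neigh[..., None] * (x - grid)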
GTM
generative topographic map
GTM characteristics
- a non-linear latent variable model
- the probabilistic counterpart to the SOM model
- a generative model

- the actual data are modeled as created by mappings from a low-dimensional latent space into the actual high-dimensional space
- the data visualization is gained according to Bayes' theorem
- the latent-to-data space mappings are Gaussian distributions
- the created densities are iteratively fitted to approximate the real data distribution
- uses the known algorithms for Gaussian mixtures and radial basis functions
Nearest neighbor
case-based reasoning classification
- the diagnosis is set to the most similar determined case
- how to measure distances between particular cases?

k-NN
- take the k most similar cases, each of them has a vote (see the sketch below)
- simple, but frequently works for (binary) classification

common problems
- which properties are significant and which are just noise
- suitable sizes of similar cases; how to avoid outliers
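A minimal k-NN vote, assuming numpy and a Euclidean case distance; the labels are illustrative.

import numpy as np
from collections import Counter

def knn_classify(x, cases, labels, k=3):
    # vote among the k most similar stored cases
    d = np.linalg.norm(cases - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

cases = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
labels = ["healthy", "healthy", "ill", "ill"]
print(knn_classify(np.array([0.8, 0.9]), cases, labels))   # "ill"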
Relations
the right descriptive features → the right similar cases
search for important gene expressions and patient cases

unsupervised methods
- data clustering: for the similar data cases
- data mining: for the important features

supervised methods
- Bayesian network inference: informatics and statistics
- minimum message length: informatics and algebra
- inductive logic programming: informatics and logic
Cliques
graph approach to clustering
- transform the given data table to a graph: vertices for genes, edges for gene pairs with similarity above a threshold
- find the least graph alteration that results in a clique graph

CAST algorithm: iterative heuristic clique generation (sketched below)
- construct a clique from the available vertices:
  - initiate with a vertex of maximal degree
  - while a distant vertex is in the clique or a close vertex is free:
    - add the closest free vertex to the clique
    - remove the farthest vertex from the clique
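An illustrative sketch of the CAST loop under stated assumptions: a precomputed similarity matrix S and an affinity threshold t; real CAST implementations differ in details.

import numpy as np

def cast(S, t):
    # heuristic clique generation on a similarity matrix S, threshold t
    free = set(range(len(S)))
    clusters = []
    while free:
        seed = max(free, key=lambda v: S[v, list(free)].sum())  # maximal degree
        clique = {seed}
        free.discard(seed)
        changed = True
        while changed:
            changed = False
            if free:   # add the closest free vertex if it is close enough
                v = max(free, key=lambda u: S[u, list(clique)].mean())
                if S[v, list(clique)].mean() >= t:
                    clique.add(v); free.discard(v); changed = True
            if len(clique) > 1:   # drop the farthest vertex if too distant
                w = min(clique, key=lambda u: S[u, list(clique - {u})].mean())
                if S[w, list(clique - {w})].mean() < t:
                    clique.discard(w); free.add(w); changed = True
        clusters.append(clique)
    return clusters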
Clustering
standard clustering techniques

to make separated homogeneous groups

center based methods
- k-means as the standard
- c-means, QT clustering

hierarchy methods
- agglomerative: bottom-up
- divisive: top-down

combinations
- two-step approaches
Cluster structures
[diagram: single vs. complete linkage; hierarchical clustering / neighbor joining; k-means clustering; QT clustering]
Distances
how to measure object-object (dis)similarity
- Euclidean distance: $\bigl[\sum_i (x_i - y_i)^2\bigr]^{1/2}$
- Manhattan distance: $\sum_i |x_i - y_i|$
- power distance: $\bigl[\sum_i (x_i - y_i)^p\bigr]^{1/p}$
- maximum distance: $\max_i |x_i - y_i|$
- Pearson's correlation: the dot product for normalized data
- percentage disagreement: the fraction of $x_i \neq y_i$
metric significance
- different powers usually do not significantly alter the results
- more different distance measures may change the cluster compositions
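The measures above, written out in numpy for two equal-length profiles:

import numpy as np

def euclidean(x, y):    return np.sqrt(np.sum((x - y) ** 2))
def manhattan(x, y):    return np.sum(np.abs(x - y))
def power(x, y, p):     return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
def maximum(x, y):      return np.max(np.abs(x - y))
def pearson(x, y):      # dot product of normalized (z-scored) data
    return np.dot((x - x.mean()) / x.std(), (y - y.mean()) / y.std()) / len(x)
def disagreement(x, y): # fraction of coordinates where the profiles differ
    return np.mean(x != y)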
Hierarchical clustering
neighbor joining - common agglomerative method
hierarchical tree creation
joining the most similar clusters
- single linkage (nearest neighbor): distances according to the most similar cluster objects
- complete linkage (furthest neighbor): distances according to the most distant cluster objects
- average linkage: cluster distances as the mean distances of the respective elements
- Ward's method (information loss minimization): takes the minimal variance increase over the possible cluster pairs
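A short example of these linkage choices, assuming scipy is available; the linkage criterion is the method argument.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).normal(size=(20, 4))   # 20 profiles, 4 features

# 'single', 'complete', 'average', or 'ward' pick the linkage above
Z = linkage(X, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
print(labels)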
The k-means
the most frequently used kind of clustering
k-means clustering algorithm
- start: choose the initial k centers
- iterate over the objects (e.g. genes) being clustered:
  - compute the new distances to the centers; choose the nearest one
  - for each cluster, compute a new center
- end when no cluster changes
- to put less weight on similar microarrays

pros: usually fast; does not compute all the object-object distances
cons: the amount and the initial positions of the centers highly affect the results
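A direct numpy transcription of the loop above; k and the initial centers are exactly the choices the cons warn about.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # initial k centers
    for _ in range(iters):
        # choose the nearest center for every object
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # compute a new center for each cluster (keep empty clusters in place)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                   # no cluster changes
            break
        centers = new
    return assign, centers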
Center count
do not add more clusters when doing so does not sufficiently increase the gained information

center selection
- random: run the k-means clustering several times
- PCA: principal components lie in the data clouds
- data objects: choose distant objects, with weights
- two steps: take a larger amount of clusters, then run hierarchical clustering on the resulting centers
Cluster fit
how good are the obtained clusters?

k-means clustering
- intra-cluster vs. out-of-cluster distances of the objects
- ratios of inter-cluster to intra-cluster distances
- Dunn's index: inter- / intra-cluster variances

hierarchical clustering
- the variance of each cluster
- similarity of the cluster means
- bootstrapping for objects with suitable inner structures
Alternative clustering
QT (quality threshold) clustering (sketched below)
- choose a maximal cluster diameter instead of a center count
- try to build a maximal cluster around each data object
- take the one with the greatest amount of objects inside
- call it recursively on the rest of the data
- more computation intensive, yet more plausible than k-means
- can be done alike for maximal cluster sizes

soft c-means
- each object (gene) belongs to more clusters (gene families)
- object belonging degrees, with sums equal to one
- similar to k-means; stop when the cluster changes are small
- suitable for lower amounts of clusters

spectral clustering
- object segmentation according to the Laplacian matrix of similarities
- the eigenvector of the second smallest eigenvalue
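A sketch of the QT loop mentioned above, assuming numpy; it is deliberately naive (quadratic diameter checks) to stay close to the description.

import numpy as np

def qt_cluster(X, diameter):
    X = [np.asarray(p) for p in X]
    idx = list(range(len(X)))
    clusters = []
    while idx:
        best = []
        for seed in idx:                      # a maximal cluster around each object
            cand, rest = [seed], [i for i in idx if i != seed]
            while rest:
                # greedily add the point that keeps the diameter smallest
                diam = lambda i: max(np.linalg.norm(X[a] - X[b])
                                     for a in cand + [i] for b in cand + [i])
                nxt = min(rest, key=diam)
                if diam(nxt) > diameter:
                    break
                cand.append(nxt); rest.remove(nxt)
            if len(cand) > len(best):
                best = cand                   # keep the most populous candidate
        clusters.append(best)
        idx = [i for i in idx if i not in best]   # recurse on the rest
    return clusters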
Dependency description
used for characterization, classification and compression
BN: Bayesian networks
- what depends on what; which variables are independent
- then fast, suitable inference computing

MML: minimum message length
- the shortest form of an object / feature description
- the real amount of information an object contains

ILP: inductive logic programming
- to prove most of the positive / least of the negative cases
- accurate characterization of the given objects vs. the background
Graphical models
joint distributions
$$\Pr(x_1, x_2, x_3) = \Pr(x_1 \mid x_2, x_3) \cdot \Pr(x_2 \mid x_3) \cdot \Pr(x_3)$$
- intractable for slightly larger amounts of variables
- used to compute important probabilities themselves
- used for conditional probabilities → for Bayesian inference

conditional independence
- A and B independent under C: $A \perp\!\!\!\perp B \mid C$
- $A \perp\!\!\!\perp B \mid C$: $\Pr(A, B \mid C) = \Pr(A \mid C) \cdot \Pr(B \mid C)$
- after a few rearrangements: $\Pr(A \mid B, C) = \Pr(A \mid C)$
- Markov processes are just one example of conditional independence

graphs and relations
- simplify the structure by stating just some of the valid dependencies; no edges → (conditional) independence
- the edges state the only dependencies
Bayesian systems
[diagram: a DAG of dependencies over the nodes $x_1, \dots, x_9$]

e.g. $x_5$ depends on $x_1$ and $x_2$:
$$\Pr(x_5 \mid x_1, x_2, x_3, x_4) = \Pr(x_5 \mid x_1, x_2)$$
any node is, given all its parents, (conditionally) independent of all the nodes that are not its descendants (e.g. $x_3$ and $x_5$ are independent outright)
Bayesian network
inference
- inference along the DAG is computed according to the decomposition of conditional probabilities
- inference along the opposite directions is computed according to the Bayesian approach
[diagram: a three-node DAG over A, B, C]

$$\Pr(A, B, C) = \Pr(A \mid B, C) \cdot \Pr(B \mid C) \cdot \Pr(C)$$
How to compute $\Pr(A{=}\text{True} \mid C{=}\text{True})$:
$$\Pr(A{=}T \mid C{=}T) = \frac{\Pr(C{=}T, A{=}T)}{\Pr(C{=}T)} = \frac{\sum_B \Pr(C{=}T, B, A{=}T)}{\sum_{A,B} \Pr(C{=}T, B, A)}$$
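The same computation by brute-force enumeration, with made-up probability tables for the three binary variables:

from itertools import product

# illustrative tables for Pr(A,B,C) = Pr(A|B,C) * Pr(B|C) * Pr(C)
p_c = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.2}                        # Pr(B=True | C)
p_a = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.1}      # Pr(A=True | B, C)

def joint(a, b, c):
    pa = p_a[(b, c)] if a else 1 - p_a[(b, c)]
    pb = p_b[c] if b else 1 - p_b[c]
    return pa * pb * p_c[c]

# Pr(A=T | C=T) = sum_B Pr(C=T, B, A=T) / sum_{A,B} Pr(C=T, B, A)
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True) for a, b in product((True, False), repeat=2))
print(num / den)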
Naive Bayes
[diagram: naive Bayes structure with the class node S (the mail/spam state) pointing to the word nodes $w_1, w_2, \dots, w_n$]
assumption of independent outcomes; used e.g. for spam classification
$$S = \arg\max_s \Pr(S = s) \prod_j \Pr(O_j = w_j \mid S = s)$$
possible to compute fast, online, with $10^4$ items
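A toy version of the argmax above, with illustrative priors and word likelihoods; logs keep the product of many small factors stable.

import math

prior = {"spam": 0.4, "mail": 0.6}                   # Pr(S = s)
p_word = {"spam": {"free": 0.05, "meeting": 0.001},  # Pr(O_j = w | S = s)
          "mail": {"free": 0.005, "meeting": 0.03}}

def classify(words):
    # argmax_s of log Pr(S=s) + sum_j log Pr(O_j = w_j | S=s)
    score = lambda s: math.log(prior[s]) + sum(
        math.log(p_word[s].get(w, 1e-6)) for w in words)
    return max(prior, key=score)

print(classify(["free", "free", "meeting"]))         # "spam"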
Items to remember
Nota bene:
classification methods
learning techniques
- vectors, kernels, SVM, ANN
- conditional independence

clustering methods
- hierarchical clustering
- k-means, QT clustering