Bioinformatics - Lecture 07
Bioinformatics: Clusters and networks
Martin Saturka
http://www.bioplexity.org/lectures/
EBI version 0.4
Creative Commons Attribution-Share Alike 2.5 License
Learning on profiles
Supervised and unsupervised methods on expression data: approximation, clustering, classification, inference; crisp and fuzzy relation models; graphical models.

Main topics
- vector approaches: SVM, ANN, kernels; classification
- data organization: clustering techniques, map generation
- regulatory systems: Bayesian networks, algebraic description
Inference types
reasoning
- logical: reasoning based on standard logical rules
- statistical: reasoning based on frequent co-occurrence

deduction (logic, recursion): $A,\ A \to B \vdash B$
induction (frequentist statistics): many $A, B \vdash A \sim B$; few $A, \neg B \vdash A \to B$
abduction (Bayesian statistics): $A_1 \to B, \dots, A_n \to B,\ B \vdash A_i$
Machine learning methods
supervised learning methods: with known correct outputs on the training data
- approximation: measured data to a (continuous) output function; growth rate → nutrition supply regulation
- classification: measured data to a discrete output function; expression profiles → illness diagnosis
- regression: (cor)relations of continuous measured data; a gene expression magnitude → growth rate

unsupervised learning methods: without known desired outputs on the used data
- data granulation: internal data organization and distribution
- data visualization: an overall outer view onto the internal data
MLE
maximum likelihood estimation: $L(y \mid X) = \Pr(X \mid Y = y)$
the conditional probability viewed as a function of the unknown condition; with a known outcome, a reverse view on probability
Bernoulli trials example
$$L(\theta \mid H{=}11, T{=}10) \equiv \Pr(H{=}11, T{=}10 \mid p{=}\theta) = \binom{21}{11}\,\theta^{11}(1-\theta)^{10}$$
$$0 = \frac{\partial}{\partial\theta}\, L(\theta \mid H, T) = \binom{21}{11}\,\theta^{10}(1-\theta)^{9}\,(11 - 21\theta) \;\Rightarrow\; \theta = 11/21$$
- used when no better model is available; maximizations inside dynamic programming techniques ✓
- variance estimation leads to the biased sample variance ✗
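A minimal sketch of the coin example above, using only the standard library: maximizing the Bernoulli likelihood on a grid recovers the closed-form estimate θ = 11/21.

from math import comb

H, T = 11, 10  # observed heads and tails

def likelihood(theta):
    # L(theta | H, T) = C(21, 11) * theta^11 * (1 - theta)^10
    return comb(H + T, H) * theta**H * (1 - theta)**T

thetas = [i / 1000 for i in range(1, 1000)]   # grid of candidate values
theta_hat = max(thetas, key=likelihood)
print(theta_hat, 11 / 21)                     # ~0.524 in both cases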
Regression
linear regression
- least squares: for homoskedastic distributions; the sample mean is the best estimate
- least absolute deviations: a robust version; the sample median is a safe estimate
$$\min_{\bar x \in \mathbb{R}} \sum_i |x_i - \bar x|^2 \qquad\qquad \min_{\bar x \in \mathbb{R}} \sum_i |x_i - \bar x|$$
Setting each derivative to zero, $\bigl(\sum_i |x_i - \bar x|^2\bigr)' = 0$ gives the arithmetic mean $\bar x = \sum_i x_i / n$, while $\bigl(\sum_i |x_i - \bar x|\bigr)' = 0$ gives the median, $\#\{x_i : x_i < \bar x\} = \#\{x_i : x_i > \bar x\}$.
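A small numpy illustration of the two criteria, assuming an outlier-contaminated sample: the squared loss is minimized by the arithmetic mean, the absolute loss by the median.

import numpy as np

x = np.array([1.0, 1.2, 0.9, 1.1, 8.0])    # the 8.0 is an outlier

sq_loss = lambda c: np.sum((x - c) ** 2)    # least squares criterion
abs_loss = lambda c: np.sum(np.abs(x - c))  # least absolute deviations

mean, median = x.mean(), np.median(x)
print(mean, median)                         # 2.44 vs 1.1
print(sq_loss(mean) < sq_loss(median))      # True: the mean wins on squares
print(abs_loss(median) < abs_loss(mean))    # True: the median is robust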
Parametrization
parametrized curve crossing
- assumption: equal amounts of points above and below the curve
- the same probabilities to cross / not to cross the curve
- the crossing count distribution approaches the normal distribution
- over/under-fitting if the count is not in $(N-1)/2 \pm \sqrt{(N-1)/2}$
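A sketch of the crossing check above, assuming numpy and reading "crossing" as a sign change between consecutive fit residuals; the function name is illustrative.

import numpy as np

def crossing_ok(residuals):
    # count sign changes of consecutive residuals (curve crossings)
    signs = np.sign(residuals)
    crossings = np.sum(signs[1:] != signs[:-1])
    n = len(residuals)
    center, spread = (n - 1) / 2, np.sqrt((n - 1) / 2)
    # a count outside this range suggests over- or under-fitting
    return center - spread <= crossings <= center + spread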
Distinctions
empirical risk minimization for discrimination
metamethodology
- boosting: increasing the weights of training cases with wrong results
- probably approximately correct learning: to achieve a high probability of making convenient predictions

particular methods
- support vector machines
- artificial neural networks
- case-based reasoning
- nearest neighbour algorithm
- (naive) Bayes classifier
- decision trees
- random forests
SVM
Support vector machines
linear classifier
- maximal margin linear separation
- minimal distances for the misclassified points
Kernel methods
non-linear separations made linear in a higher-dimensional space via a mapping $F(\cdot)$
- the linear discriminant is given by the dot product $\langle F(x_i), F(x_j)\rangle$
- brought back into the low-dimensional space by a kernel $K(x_i, x_j)$
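A small check of the kernel identity, assuming numpy: for the quadratic feature map F below, the dot product in the 3D feature space equals the polynomial kernel K(x, y) = ⟨x, y⟩² evaluated in the original 2D space.

import numpy as np

def F(x):
    # explicit quadratic feature map R^2 -> R^3
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, y):
    # the matching polynomial kernel, evaluated in the low-dim space
    return np.dot(x, y) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(F(xi), F(xj)), K(xi, xj))   # both 16.0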
VC dimension
Vapnik-Chervonenkis dimension
classification error estimation
- the more power a method has, the more prone it is to overfitting
- misclassifications for a binary classifier $f(\alpha)$, with i.i.d. samples drawn from an unknown distribution
- $R(\alpha)$: the probability of misclassification in real usage
- $R_{emp}(\alpha)$: the fraction of misclassified cases in a training set

Then, with probability $1-\eta$, for a training set of size $N$:
$$R(\alpha) < R_{emp}(\alpha) + \sqrt{\frac{h\,(1+\ln(2N/h)) - \ln(\eta/4)}{N}}$$
where $h$ is the VC dimension: the size of maximal sets that $f(\alpha)$ can shatter; 3 for a line classifier in a 2D space.
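Plugging illustrative numbers into the bound (h = 3 for a line in 2D, N = 1000, η = 0.05; the training error is made up):

import math

h, N, eta = 3, 1000, 0.05   # VC dimension, training set size, 1 - confidence
R_emp = 0.10                # illustrative training-set error

bound = R_emp + math.sqrt((h * (1 + math.log(2 * N / h)) - math.log(eta / 4)) / N)
print(bound)                # holds for R(alpha) with probability 1 - eta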
ANN
artificial neural networks
[diagram: a feed-forward network with inputs $X_i$, $n$ hidden neurons, one output, and weights $w$]
neuron activation function: $f_2(e) = f_2(i_A w_{A2} + i_B w_{B2} - c_2)$
$f_i$ is non-linear, usually a sigmoid, with the $c_i$ given constants
ANN learning
error backpropagation
iterative weight adjusting
- compute the error for each training case: $\delta = \text{desired} - \text{computed}$
- propagate $\delta$ backward: $\delta_5 = \delta \cdot w_{5o}$, $\delta_1 = \delta_5 \cdot w_{15} + \delta_6 \cdot w_{16}, \dots$
- adjust the weights to new values:
$$w^{new}_{A1} = w_{A1} + \eta \cdot \delta_1 \cdot \frac{df_1(e)}{de} \cdot i_A \qquad w^{new}_{15} = w_{15} + \eta \cdot \delta_5 \cdot \frac{df_5(e)}{de} \cdot f_1(e), \dots$$

a kind of gradient descent method
- other parameters (of the sigmoid functions) can be adjusted as well
- converges to a local minimum of the errors
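A minimal numpy sketch of these update rules for one hidden sigmoid layer and a linear output node (with a linear output the slide formulas are the exact gradient); the sizes and learning rate are illustrative.

import numpy as np

def f(e):                       # sigmoid activation
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))    # input-to-hidden weights
W2 = rng.normal(size=(3,))      # hidden-to-output weights
eta = 0.1                       # learning rate

x, desired = np.array([0.5, 1.0]), 0.8
for _ in range(200):
    hidden = f(x @ W1)          # hidden activations f_i(e)
    out = hidden @ W2           # linear output node
    delta = desired - out       # delta = desired - computed
    delta_h = delta * W2        # propagate delta backward through the weights
    W2 += eta * delta * hidden                                # output-weight update
    W1 += eta * np.outer(x, delta_h * hidden * (1 - hidden))  # eta*delta_i*f'(e)*input
print(out)                      # converges toward 0.8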
SOM
self-organizing maps
a kind of unsupervised version of ANNs
- the map is commonly a 2D array
- the array nodes exhibit a simple property
- each input is connected to each output
- used to visualize multidimensional data
- similar parts should behave similarly

competitive learning of the network
- nodes compete to represent particular data objects
- each node of the array has its vector of weights, initially either random or along the two principal components
- iterative adjusting of the node weights / property:
  - take a random data object
  - find its best matching node according to the nodes' weights
  - adjust the node's weights / property to be more similar to the data
  - adjust the neighboring nodes somewhat as well
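A compact numpy sketch of this competitive loop; the grid size, learning rate, and neighborhood radius are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))       # 200 data objects with 3 features
grid = rng.normal(size=(10, 10, 3))    # 10x10 map, one weight vector per node
steps = 2000

for t in range(steps):
    eta = 0.5 * (1 - t / steps)            # decaying learning rate
    radius = 0.5 + 3.0 * (1 - t / steps)   # decaying neighborhood radius
    x = data[rng.integers(len(data))]      # take a random data object
    # best matching node according to the nodes' weights
    d = np.linalg.norm(grid - x, axis=2)
    bi, bj = np.unravel_index(d.argmin(), d.shape)
    # move the winner, and its neighbors somewhat, toward the object
    ii, jj = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
    neigh = np.exp(-((ii - bi)**2 + (jj - bj)**2) / (2 * radius**2))
    grid += eta * neigh[..., None] * (x - grid)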
GTM
generative topographic map
GTM characteristics
- a non-linear latent variable model
- the probabilistic counterpart to the SOM model
- a generative model

- the actual data are modeled as created by mappings from a low-dimensional latent space into the actual high-dimensional space
- the data visualization is gained according to Bayes' theorem
- the latent-to-data space mappings are Gaussian distributions
- the created densities are iteratively fitted to approximate the real data distribution
- uses the known algorithms for Gaussian mixtures and radial basis functions
Nearest neighbor
case-based reasoning classification
- the diagnosis is set to the most similar determined case
- how to measure distances between particular cases?

k-NN
- take the k most similar cases, each of them has a vote (see the sketch below)
- simple, but frequently works for (binary) classification

common problems
- which properties are significant and which are just noise
- suitable sizes of similar cases; how to avoid outliers
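A minimal k-NN vote, assuming numpy and a Euclidean case distance; the labels are illustrative.

import numpy as np
from collections import Counter

def knn_classify(x, cases, labels, k=3):
    # vote among the k most similar stored cases
    d = np.linalg.norm(cases - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

cases = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
labels = ["healthy", "healthy", "ill", "ill"]
print(knn_classify(np.array([0.8, 0.9]), cases, labels))   # "ill"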
Relations
the right descriptive features → the right similar cases
search for important gene expressions and patient cases

unsupervised methods
- data clustering: for the similar data cases
- data mining: for the important features

supervised methods
- Bayesian network inference: informatics and statistics
- minimum message length: informatics and algebra
- inductive logic programming: informatics and logic
Cliques
graph approach to clustering
- transform the given data table to a graph: vertices for genes, edges for gene pairs with similarity above a threshold
- find the least graph alteration that results in a clique graph

CAST algorithm: iterative heuristic clique generation (sketched below)
- construct a clique from the available vertices:
  - initiate with a vertex of maximal degree
  - while a distant vertex is in the clique or a close vertex is free:
    - add the closest free vertex to the clique
    - remove the farthest vertex from the clique
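An illustrative sketch of the CAST loop under stated assumptions: a precomputed similarity matrix S and an affinity threshold t; real CAST implementations differ in details.

import numpy as np

def cast(S, t):
    # heuristic clique generation on a similarity matrix S, threshold t
    free = set(range(len(S)))
    clusters = []
    while free:
        seed = max(free, key=lambda v: S[v, list(free)].sum())  # maximal degree
        clique = {seed}
        free.discard(seed)
        changed = True
        while changed:
            changed = False
            if free:   # add the closest free vertex if it is close enough
                v = max(free, key=lambda u: S[u, list(clique)].mean())
                if S[v, list(clique)].mean() >= t:
                    clique.add(v); free.discard(v); changed = True
            if len(clique) > 1:   # drop the farthest vertex if too distant
                w = min(clique, key=lambda u: S[u, list(clique - {u})].mean())
                if S[w, list(clique - {w})].mean() < t:
                    clique.discard(w); free.add(w); changed = True
        clusters.append(clique)
    return clusters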
Clustering
standard clustering techniques

to make separated homogeneous groups

center based methods
- k-means as the standard
- c-means, QT clustering

hierarchy methods
- agglomerative: bottom-up
- divisive: top-down

combinations
- two-step approaches
Cluster structures
[diagram: single vs. complete linkage; hierarchical clustering / neighbor joining; k-means clustering; QT clustering]
Distances
how to measure object-object (dis)similarity
- Euclidean distance: $\bigl[\sum_i (x_i - y_i)^2\bigr]^{1/2}$
- Manhattan distance: $\sum_i |x_i - y_i|$
- power distance: $\bigl[\sum_i (x_i - y_i)^p\bigr]^{1/p}$
- maximum distance: $\max_i |x_i - y_i|$
- Pearson's correlation: the dot product for normalized data
- percentage disagreement: the fraction of $x_i \neq y_i$
metric significance
- different powers usually do not significantly alter the results
- more different distance measures may change the cluster compositions
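The measures above, written out in numpy for two equal-length profiles:

import numpy as np

def euclidean(x, y):    return np.sqrt(np.sum((x - y) ** 2))
def manhattan(x, y):    return np.sum(np.abs(x - y))
def power(x, y, p):     return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
def maximum(x, y):      return np.max(np.abs(x - y))
def pearson(x, y):      # dot product of normalized (z-scored) data
    return np.dot((x - x.mean()) / x.std(), (y - y.mean()) / y.std()) / len(x)
def disagreement(x, y): # fraction of coordinates where the profiles differ
    return np.mean(x != y)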
Hierarchical clustering
neighbor joining - common agglomerative method
hierarchical tree creation
joining the most similar clusters
- single linkage (nearest neighbor): distances according to the most similar cluster objects
- complete linkage (furthest neighbor): distances according to the most distant cluster objects
- average linkage: cluster distances as the mean distances of the respective elements
- Ward's method (information loss minimization): takes the minimal variance increase over the possible cluster pairs
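A short example of these linkage choices, assuming scipy is available; the linkage criterion is the method argument.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).normal(size=(20, 4))   # 20 profiles, 4 features

# 'single', 'complete', 'average', or 'ward' pick the linkage above
Z = linkage(X, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
print(labels)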
The k-means
the most frequently used kind of clustering
k-means clustering algorithm
- start: choose the initial k centers
- iterate over the objects (e.g. genes) being clustered:
  - compute the new distances to the centers; choose the nearest one
  - for each cluster, compute a new center
- end when no cluster changes
- to put less weight on similar microarrays

pros: usually fast; does not compute all the object-object distances
cons: the amount and the initial positions of the centers highly affect the results
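A direct numpy transcription of the loop above; k and the initial centers are exactly the choices the cons warn about.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # initial k centers
    for _ in range(iters):
        # choose the nearest center for every object
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # compute a new center for each cluster (keep empty clusters in place)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                   # no cluster changes
            break
        centers = new
    return assign, centers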
Center count
do not add more clusters when doing so does not sufficiently increase the gained information

center selection
- random: run the k-means clustering several times
- PCA: principal components lie in the data clouds
- data objects: choose distant objects, with weights
- two steps: take a larger amount of clusters, then run hierarchical clustering on the resulting centers
Cluster fit
how good are the obtained clusters?

k-means clustering
- intra-cluster vs. out-of-cluster distances of the objects
- ratios of inter-cluster to intra-cluster distances
- Dunn's index: inter- / intra-cluster variances

hierarchical clustering
- the variance of each cluster
- similarity of the cluster means
- bootstrapping for objects with suitable inner structures
Alternative clustering
QT (quality threshold) clustering (sketched below)
- choose a maximal cluster diameter instead of a center count
- try to build a maximal cluster around each data object
- take the one with the greatest amount of objects inside
- call it recursively on the rest of the data
- more computation intensive, yet more plausible than k-means
- can be done alike for maximal cluster sizes

soft c-means
- each object (gene) belongs to more clusters (gene families)
- object belonging degrees, with sums equal to one
- similar to k-means; stop when the cluster changes are small
- suitable for lower amounts of clusters

spectral clustering
- object segmentation according to the Laplacian matrix of similarities
- the eigenvector of the second smallest eigenvalue
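A sketch of the QT loop mentioned above, assuming numpy; it is deliberately naive (quadratic diameter checks) to stay close to the description.

import numpy as np

def qt_cluster(X, diameter):
    X = [np.asarray(p) for p in X]
    idx = list(range(len(X)))
    clusters = []
    while idx:
        best = []
        for seed in idx:                      # a maximal cluster around each object
            cand, rest = [seed], [i for i in idx if i != seed]
            while rest:
                # greedily add the point that keeps the diameter smallest
                diam = lambda i: max(np.linalg.norm(X[a] - X[b])
                                     for a in cand + [i] for b in cand + [i])
                nxt = min(rest, key=diam)
                if diam(nxt) > diameter:
                    break
                cand.append(nxt); rest.remove(nxt)
            if len(cand) > len(best):
                best = cand                   # keep the most populous candidate
        clusters.append(best)
        idx = [i for i in idx if i not in best]   # recurse on the rest
    return clusters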
Dependency description
used for characterization, classification and compression
BN: Bayesian networks
- what depends on what; which variables are independent
- then fast, suitable inference computing

MML: minimum message length
- the shortest form of an object / feature description
- the real amount of information an object contains

ILP: inductive logic programming
- to prove most of the positive / least of the negative cases
- accurate characterization of the given objects vs. the background
Graphical models
joint distributions
$$\Pr(x_1, x_2, x_3) = \Pr(x_1 \mid x_2, x_3) \cdot \Pr(x_2 \mid x_3) \cdot \Pr(x_3)$$
- intractable for slightly larger amounts of variables
- used to compute important probabilities themselves
- used for conditional probabilities → for Bayesian inference

conditional independence
- A and B independent under C: $A \perp\!\!\!\perp B \mid C$
- $A \perp\!\!\!\perp B \mid C$: $\Pr(A, B \mid C) = \Pr(A \mid C) \cdot \Pr(B \mid C)$
- after a few rearrangements: $\Pr(A \mid B, C) = \Pr(A \mid C)$
- Markov processes are just one example of conditional independence

graphs and relations
- simplify the structure by stating just some of the valid dependencies; no edges → (conditional) independence
- the edges state the only dependencies
Bayesian systems
[diagram: a DAG of dependencies over the nodes $x_1, \dots, x_9$]

e.g. $x_5$ depends on $x_1$ and $x_2$:
$$\Pr(x_5 \mid x_1, x_2, x_3, x_4) = \Pr(x_5 \mid x_1, x_2)$$
any node is, given all its parents, (conditionally) independent of all the nodes that are not its descendants (e.g. $x_3$ and $x_5$ are independent outright)
Bayesian network
inference
- inference along the DAG is computed according to the decomposition of conditional probabilities
- inference along the opposite directions is computed according to the Bayesian approach
[diagram: a three-node DAG over A, B, C]

$$\Pr(A, B, C) = \Pr(A \mid B, C) \cdot \Pr(B \mid C) \cdot \Pr(C)$$
How to compute $\Pr(A{=}\text{True} \mid C{=}\text{True})$:
$$\Pr(A{=}T \mid C{=}T) = \frac{\Pr(C{=}T, A{=}T)}{\Pr(C{=}T)} = \frac{\sum_B \Pr(C{=}T, B, A{=}T)}{\sum_{A,B} \Pr(C{=}T, B, A)}$$
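The same computation by brute-force enumeration, with made-up probability tables for the three binary variables:

from itertools import product

# illustrative tables for Pr(A,B,C) = Pr(A|B,C) * Pr(B|C) * Pr(C)
p_c = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.2}                        # Pr(B=True | C)
p_a = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.1}      # Pr(A=True | B, C)

def joint(a, b, c):
    pa = p_a[(b, c)] if a else 1 - p_a[(b, c)]
    pb = p_b[c] if b else 1 - p_b[c]
    return pa * pb * p_c[c]

# Pr(A=T | C=T) = sum_B Pr(C=T, B, A=T) / sum_{A,B} Pr(C=T, B, A)
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True) for a, b in product((True, False), repeat=2))
print(num / den)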
Naive Bayes
[diagram: naive Bayes structure with the class node S (the mail/spam state) pointing to the word nodes $w_1, w_2, \dots, w_n$]
assumption of independent outcomes; used e.g. for spam classification
$$S = \arg\max_s \Pr(S = s) \prod_j \Pr(O_j = w_j \mid S = s)$$
possible to compute fast, online, with $10^4$ items
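A toy version of the argmax above, with illustrative priors and word likelihoods; logs keep the product of many small factors stable.

import math

prior = {"spam": 0.4, "mail": 0.6}                   # Pr(S = s)
p_word = {"spam": {"free": 0.05, "meeting": 0.001},  # Pr(O_j = w | S = s)
          "mail": {"free": 0.005, "meeting": 0.03}}

def classify(words):
    # argmax_s of log Pr(S=s) + sum_j log Pr(O_j = w_j | S=s)
    score = lambda s: math.log(prior[s]) + sum(
        math.log(p_word[s].get(w, 1e-6)) for w in words)
    return max(prior, key=score)

print(classify(["free", "free", "meeting"]))         # "spam"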
Items to remember
Nota bene:
classification methods
learning techniques
- vectors, kernels, SVM, ANN
- conditional independence

clustering methods
- hierarchical clustering
- k-means, QT clustering