
Page 1: Machine Learning and its Applications in Bioinformatics

Machine Learning and its Applications in Bioinformatics

Yen-Jen Oyang

Dept. of Computer Science and Information Engineering

Page 2: Machine Learning and its Applications in Bioinformatics

Observations and Challenges in the Information Age

• A huge volume of information has been and is being digitized and stored in the computer.

• Due to the volume of digitized information, effective exploitation of this information is beyond the capability of human beings without the aid of intelligent computer software.

Page 3: Machine Learning and its Applications in Bioinformatics

An Example of Supervised Machine Learning (Data Classification)

• Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of the objects?

Page 4: Machine Learning and its Applications in Bioinformatics

Data Set

Data      Class   Data      Class   Data      Class
(15,33)   O       (18,28)   ×       (16,31)   O
( 9,23)   ×       (15,35)   O       ( 9,32)   ×
( 8,15)   ×       (17,34)   O       (11,38)   ×
(11,31)   O       (18,39)   ×       (13,34)   O
(13,37)   ×       (14,32)   O       (19,36)   ×
(18,32)   O       (25,18)   ×       (10,34)   ×
(16,38)   ×       (23,33)   ×       (15,30)   O
(12,33)   O       (21,28)   ×       (13,22)   ×

Page 5: Machine Learning and its Applications in Bioinformatics

Distribution of the Data Set

[Scatter plot of the data set: the "O" samples form a cluster around (15, 33) and the "×" samples spread around it; x axis marked 10, 15, 20; y axis marked 30.]

Page 6: Machine Learning and its Applications in Bioinformatics

Rule Based on Observation

If $11 \le x \le 18$ and $30 \le y \le 35$, then class $=$ O; else class $=$ ×.

Page 7: Machine Learning and its Applications in Bioinformatics

Rule Generated by a Kernel Density Estimation Based Algorithm

Let
$$f_o(v)=\frac{1}{10}\sum_{i=1}^{10}\frac{1}{2\pi\sigma_{o_i}^{2}}\exp\!\left(-\frac{\|v-c_{o_i}\|^{2}}{2\sigma_{o_i}^{2}}\right)
\quad\text{and}\quad
f_x(v)=\frac{1}{14}\sum_{j=1}^{14}\frac{1}{2\pi\sigma_{x_j}^{2}}\exp\!\left(-\frac{\|v-c_{x_j}\|^{2}}{2\sigma_{x_j}^{2}}\right).$$

If $f_o(v) > f_x(v)$, then prediction = "O". Otherwise, prediction = "X".
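A minimal Python sketch of this rule (numpy is assumed; the kernel centers c and widths sigma are those listed on the next slide):

    import numpy as np

    def kde_density(v, centers, sigmas):
        # f(v) = (1/n) * sum_i 1/(2*pi*sigma_i^2) * exp(-||v - c_i||^2 / (2*sigma_i^2))
        v = np.asarray(v, dtype=float)
        d2 = np.sum((np.asarray(centers, dtype=float) - v) ** 2, axis=1)
        s2 = np.asarray(sigmas, dtype=float) ** 2
        return np.mean(np.exp(-d2 / (2 * s2)) / (2 * np.pi * s2))

    def predict(v, o_centers, o_sigmas, x_centers, x_sigmas):
        # Prediction is "O" when f_o(v) > f_x(v), and "X" otherwise.
        f_o = kde_density(v, o_centers, o_sigmas)
        f_x = kde_density(v, x_centers, x_sigmas)
        return "O" if f_o > f_x else "X"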

Page 8: Machine Learning and its Applications in Bioinformatics

The kernel centers and widths used above, one kernel per training sample:

$c_{o_i}$:      (15,33)  (11,31)  (18,32)  (12,33)  (15,35)  (17,34)  (14,32)  (16,31)  (13,34)  (15,30)
$\sigma_{o_i}$: 1.723    2.745    2.327    1.794    1.973    2.045    1.794    1.794    1.794    2.027

$c_{x_j}$:      (9,23)  (8,15)  (13,37)  (16,38)  (18,28)  (18,39)  (25,18)  (23,33)  (21,28)  (9,32)  (11,38)  (19,36)  (10,34)  (13,22)
$\sigma_{x_j}$: 6.458   10.08   2.939    2.745    5.451    3.287    10.86    5.322    5.070    4.562   3.463    3.587    3.232    6.260

Page 9: Machine Learning and its Applications in Bioinformatics

Identifying Boundary of Different Classes of Objects

Page 10: Machine Learning and its Applications in Bioinformatics

Boundary Identified

Page 11: Machine Learning and its Applications in Bioinformatics

Problem Definition of Data Classification

• In a data classification problem, each object is described by a set of attribute values and each object belongs to one of the predefined classes.

• The goal is to derive a set of rules that predicts which class a new object should belong to, based on a given set of training samples. Data classification is also called supervised learning.

Page 12: Machine Learning and its Applications in Bioinformatics

The Vector Space Model

• In the vector space model, each object is described by a number of numerical attributes/features.

• For example, the outward appearance of a person can be described by his or her height, weight, and age.

• It is typical that the objects are described by a large number of attributes/features.

Page 13: Machine Learning and its Applications in Bioinformatics

Transformation of Categorical Attributes into Numerical Attributes

• Represent the attribute values of the objects in a binary table form, as exemplified in the following:

Objects   Male   Female   High School Education   College Education   Graduate School Education
1         1      0        0                       1                   0
2         0      1        1                       0                   0
3         1      0        0                       0                   1

Page 14: Machine Learning and its Applications in Bioinformatics

• Assign an appropriate weight $w_j$ to each column $j$.

• Treat the weighted vector of each row as the feature vector of the corresponding object:

Objects   Male    Female   High School Education   College Education   Graduate School Education
1         $w_1$   0        0                       $w_4$               0
2         0       $w_2$    $w_3$                   0                   0
3         $w_1$   0        0                       0                   $w_5$
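A small Python sketch of the two steps, with hypothetical category names and weights:

    # Columns of the binary table; the weights are assumed values for illustration.
    COLUMNS = ["Male", "Female", "High School", "College", "Graduate School"]
    WEIGHTS = {"Male": 1.0, "Female": 1.0,
               "High School": 0.5, "College": 1.0, "Graduate School": 2.0}

    def feature_vector(sex, education):
        # A 1 in column j of the binary table becomes the column weight w_j.
        active = {sex, education}
        return [WEIGHTS[c] if c in active else 0.0 for c in COLUMNS]

    print(feature_vector("Male", "College"))  # object 1 in the table above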

Page 15: Machine Learning and its Applications in Bioinformatics

The Similarity/Dissimilarity Matrix Model

• In this model, a matrix records the similarity/dissimilarity scores between every pair of objects:

      P1    P2    P3     P4     P5     P6
P1    -     53    137    862    35     180
P2          -     46     72     816    606
P3                -      447    751    201
P4                       -      291    156
P5                              -      494
P6                                     -

Page 16: Machine Learning and its Applications in Bioinformatics

• We may select P2, P5, and P6 as representatives and use the reciprocals of the similarity scores to these representatives to describe an object.

• For example, the feature vectors of P1 and P2 are <1/53, 1/35, 1/180> and <0, 1/816, 1/606>, respectively.
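A short Python sketch of this construction, with the matrix above stored as a dict of pairwise scores:

    SIM = {("P1", "P2"): 53, ("P1", "P5"): 35, ("P1", "P6"): 180,
           ("P2", "P5"): 816, ("P2", "P6"): 606, ("P5", "P6"): 494}

    def sim(a, b):
        # Symmetric lookup; an object's similarity to itself is treated as
        # infinite, so its reciprocal becomes 0 (as for P2 above).
        if a == b:
            return float("inf")
        return SIM.get((a, b)) or SIM.get((b, a))

    def features(obj, representatives=("P2", "P5", "P6")):
        # Describe obj by the reciprocals of its similarities to the representatives.
        return [1.0 / sim(obj, r) for r in representatives]

    print(features("P1"))  # [1/53, 1/35, 1/180]
    print(features("P2"))  # [0.0, 1/816, 1/606]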

Page 17: Machine Learning and its Applications in Bioinformatics

Applications of Data Classification in Bioinformatics

• In microarray data analysis, data classification is employed to predict the class of a new sample based on the existing samples with known class.

• Data classification has also been widely employed in prediction of protein family, protein fold, and protein secondary structure.

Page 18: Machine Learning and its Applications in Bioinformatics

• For example, in the Leukemia data set, there are 72 samples and 7129 genes:

• 25 Acute Myeloid Leukemia (AML) samples;

• 38 B-cell Acute Lymphoblastic Leukemia samples;

• 9 T-cell Acute Lymphoblastic Leukemia samples.

Page 19: Machine Learning and its Applications in Bioinformatics

Model of Microarray Data Sets

[Table: rows Sample_1, Sample_2, …, Sample_m; columns Gene_1, Gene_2, …, Gene_n; the $(i,j)$ entry is the expression value $M(i,j)\in\mathbb{R}$.]

Page 20: Machine Learning and its Applications in Bioinformatics

Alternative Data Classification Algorithms

• Decision tree (C4.5 and C5.0);

• Instance-based learning (KNN);

• Naïve Bayesian classifier;

• RBF network;

• Support vector machine (SVM);

• Kernel Density Estimation (KDE) based classifier.

Page 21: Machine Learning and its Applications in Bioinformatics

Instance-Based Learning

• In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, …, vm) and assign the new instance to the class that has the most instances among those k samples.

• Classifiers that adopt instance-based learning are commonly called the KNN classifiers.

Page 22: Machine Learning and its Applications in Bioinformatics

Example of the KNN Classifiers

• If a 1NN classifier is employed, then the prediction for the query instance is "X".

• If a 3NN classifier is employed, then the prediction for the query instance is "O".

Page 23: Machine Learning and its Applications in Bioinformatics

Decision Function of the KNN Classifier

• Assume that there are two classes of samples, positive and negative.

• The decision function of a KNN classifier is:

$$\hat f(v)=\operatorname{sgn}\!\left(\sum_{i=1}^{k}\operatorname{sgn}(s_i)\right),$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest samples of the query vector $v$ and $\operatorname{sgn}(s_i)$ is $+1$ for a positive sample and $-1$ for a negative one.

Page 24: Machine Learning and its Applications in Bioinformatics

Extension of the KNN Classifier

• We may extend the KNN classifier by weighting the contribution of each neighbor with a term related to its distance to the query vector:

$$\hat f(v)=\operatorname{sgn}\!\left(\sum_{i=1}^{k}w\big(\|v-s_i\|\big)\operatorname{sgn}(s_i)\right),$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest samples of the query vector $v$.
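A Python sketch covering both decision functions; plain KNN is the special case $w \equiv 1$, and the inverse-distance weight in the example is an assumed choice, not one the slides specify:

    import numpy as np

    def knn_decision(v, samples, labels, k=3, weight=lambda d: 1.0):
        # sgn( sum_i w(||v - s_i||) * sgn(s_i) ) over the k nearest samples;
        # labels hold sgn(s_i): +1 for positive samples and -1 for negative ones.
        v = np.asarray(v, dtype=float)
        dists = np.linalg.norm(np.asarray(samples, dtype=float) - v, axis=1)
        nearest = np.argsort(dists)[:k]
        total = sum(weight(dists[i]) * labels[i] for i in nearest)
        return 1 if total >= 0 else -1

    # Distance-weighted 3NN with an (assumed) inverse-distance weight.
    pred = knn_decision([15, 33],
                        samples=[[16, 31], [9, 23], [8, 15], [17, 34]],
                        labels=[+1, -1, -1, +1],
                        k=3, weight=lambda d: 1.0 / (d + 1e-9))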

Page 25: Machine Learning and its Applications in Bioinformatics

An RBF Network Based Classifier with Gaussian Kernels

• It is typical that all the kernel functions $\phi_1, \phi_2, \ldots, \phi_k$ are radial basis functions of the same form.

• With the popular Gaussian function, the decision function is of the following form:
$$\hat f(v)=\operatorname{sgn}\!\left(\sum_{i}w_i\,e^{-\frac{\|v-s_i\|^{2}}{2\sigma_i^{2}}}-\sum_{j}w_j\,e^{-\frac{\|v-s_j\|^{2}}{2\sigma_j^{2}}}\right),$$
where the $s_i$ are the positive training samples and the $s_j$ the negative ones.

Page 26: Machine Learning and its Applications in Bioinformatics

The Common Structure of the RBF Network Based Data Classifier

[Network diagram: the input vector $v$ feeds the kernel functions $\phi_1(v), \phi_2(v), \ldots, \phi_k(v)$; their weighted sum, with weights $w_i$, produces the output $f(v)$.]

Page 27: Machine Learning and its Applications in Bioinformatics

The decision functions above share a common frame,
$$f(v)=\sum_{i}w_i\exp\!\left(-\frac{\|v-s_i\|^{2}}{2\sigma_i^{2}}\right)-\sum_{j}w_j\exp\!\left(-\frac{\|v-s_j\|^{2}}{2\sigma_j^{2}}\right),$$
which alternative RBF network based data classifiers instantiate differently:

• SVM with the RBF kernel: places a kernel function at each sample and employs a global $\sigma$, i.e. $\sigma_i=\sigma_j=\sigma$.

• Regularization networks: the widths $\sigma_i$ and $\sigma_j$ are heuristically set, and the weights $w_h$ are learned.

• Fixed KDE: places a kernel function at each sample and sets $w_h=\frac{1}{m}$ and $\sigma_h=\sigma$.

• Adaptive KDE: places a kernel function at each sample with $w_h=\frac{1}{m}$ and variable, adaptive $\sigma_h$.

Page 28: Machine Learning and its Applications in Bioinformatics

Regularization of an RBF Network Based Classifier

• The conventional approaches proceed by either employing a constant $\sigma$ for all kernel functions or employing a heuristic mechanism to set each $\sigma_i$ individually, e.g. a multiple of the average distance among samples, and attempt to minimize
$$E=\sum_{i=1}^{m}\big(\hat f(s_i)-f(s_i)\big)^{2}+\gamma\sum_{j=1}^{m}\sum_{h=1}^{k}w_{jh}^{2},$$
where $s_i$ is a learning sample and $f(s_i)$ is its target value.

Page 29: Machine Learning and its Applications in Bioinformatics

• The term $\gamma\sum_{j=1}^{m}\sum_{h=1}^{k}w_{jh}^{2}$ is included to avoid overfitting, and $\gamma$ is to be set through cross validation.

Page 30: Machine Learning and its Applications in Bioinformatics

Setting the partial derivative of $E$ with respect to each weight to zero, e.g.
$$\frac{\partial E}{\partial w_{11}}=2\sum_{i=1}^{m}\big(\hat f(s_i)-f(s_i)\big)\,\phi_1(s_i)+2\gamma\,w_{11}=0,$$
and similarly
$$\frac{\partial E}{\partial w_{jh}}=2\sum_{i=1}^{m}\big(\hat f(s_i)-f(s_i)\big)\,\phi_h(s_i)+2\gamma\,w_{jh}=0$$
for every other weight, yields a system of linear equations in the weights.

Page 31: Machine Learning and its Applications in Bioinformatics

Let $Z$ be the matrix of kernel values, with $Z_{ih}=\phi_h(s_i)$, let $F$ collect the target values $f(s_1), f(s_2), \ldots, f(s_m)$, and let $W$ collect the weights. Then the zero-gradient conditions take the matrix form $Z^{T}ZW+\gamma W=Z^{T}F$, and we have
$$W=\big(Z^{T}Z+\gamma I\big)^{-1}Z^{T}F.$$

Page 32: Machine Learning and its Applications in Bioinformatics

Decision Function of an SVM

• A prediction of the class of a new sample located at $v$ in the vector space is based on the following rule:
$$\hat f(v)=\sum_{i}w_i\,e^{-\frac{\|v-s_i\|^{2}}{2\sigma^{2}}}-\sum_{j}w_j\,e^{-\frac{\|v-s_j\|^{2}}{2\sigma^{2}}}+b,\qquad 0\le w_i,\,w_j\le C,$$
where the $s_i$ are the positive training instances and the $s_j$ the negative ones.

If $\hat f(v)\ge 0$, then the predicted class is "+". Otherwise, the predicted class is "−".

Those instances $s_i$ and $s_j$ with non-zero weights are called the support vectors.
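A Python sketch that evaluates this decision function; the support vectors, signed weights, kernel width, and bias are assumed to be supplied by some SVM training procedure:

    import numpy as np

    def svm_decision(v, support_vectors, signed_weights, sigma, b):
        # f_hat(v) = sum_i w_i K(v, s_i) - sum_j w_j K(v, s_j) + b; the two sums
        # are folded together by giving positive instances weight +w_i and
        # negative instances weight -w_j.
        v = np.asarray(v, dtype=float)
        d2 = np.sum((np.asarray(support_vectors, dtype=float) - v) ** 2, axis=1)
        kernel = np.exp(-d2 / (2 * sigma ** 2))
        return float(np.dot(signed_weights, kernel) + b)

    def svm_predict(v, support_vectors, signed_weights, sigma, b):
        return "+" if svm_decision(v, support_vectors, signed_weights, sigma, b) >= 0 else "-"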

Page 33: Machine Learning and its Applications in Bioinformatics

The Kernel Density Estimation (KDE) Based Classifier

• The KDE based learning algorithm constructs one approximate probability density function for each class of objects.

• Classification of a new object is conducted based on the likelihood function
$$L_m(v)=\frac{S_m}{S}\,\hat f_m(v),$$
where $S_m$ and $S$ are the number of training samples of class $m$ and the total number of training samples of all classes, respectively, and $\hat f_m(\cdot)$ is the approximate probability density function of class-$m$ objects.

Page 34: Machine Learning and its Applications in Bioinformatics

Identifying Boundary of Different Classes of Objects

Page 35: Machine Learning and its Applications in Bioinformatics

Boundary Identified

Page 36: Machine Learning and its Applications in Bioinformatics

Problem Definition of Kernel Density Estimation

• Given a set of samples $S=\{s_1, s_2, \ldots, s_n\}$ randomly taken from a probability distribution, we want to find a set of symmetric kernel functions $K(v; s_i, \sigma_i)$ and corresponding weights $w_i$ such that
$$\hat f(v)=\sum_{i}w_i\,K(v; s_i, \sigma_i)\cong f(v).$$

Page 37: Machine Learning and its Applications in Bioinformatics

The Proposed KDE Based Classifier

• We determined to employ the Gaussian function and to set the width of each Gaussian function to a multiple of the average distance among neighboring samples:
$$\hat f(v)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\big(2\pi\sigma_i^{2}\big)^{d/2}}\exp\!\left(-\frac{\|v-s_i\|^{2}}{2\sigma_i^{2}}\right),\qquad \sigma_i=\beta\,\bar d_i,$$
where $\beta$ is a parameter and $\bar d_i$ is the average distance among the neighboring samples of $s_i$.

Page 38: Machine Learning and its Applications in Bioinformatics

• $\bar d_i$ can be estimated as follows:
$$\bar d_i\cong\frac{2\,R_k(s_i)}{\sqrt{\pi}}\left(\frac{\Gamma\!\left(\frac{d}{2}+1\right)}{k+1}\right)^{\!1/d},$$
where $\frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}\,R_k(s_i)^{d}$ is the volume of a sphere with radius $R_k(s_i)$ in a $d$-dimensional vector space and $R_k(s_i)$ is the distance from sample $s_i$ to its $k$-th nearest sample.
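A Python sketch of the width assignment, following the reconstruction above; scipy's cKDTree supplies the k-th-nearest-sample distances, and k and the multiplier beta are tuning parameters:

    import numpy as np
    from math import gamma, pi, sqrt
    from scipy.spatial import cKDTree

    def kernel_widths(X, k=10, beta=1.0):
        # sigma_i = beta * d_bar_i, with d_bar_i estimated from R_k(s_i).
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        dists, _ = cKDTree(X).query(X, k=k + 1)   # column 0 is the sample itself
        R_k = dists[:, -1]                        # distance to the k-th nearest sample
        d_bar = 2.0 * R_k / sqrt(pi) * (gamma(d / 2.0 + 1.0) / (k + 1.0)) ** (1.0 / d)
        return beta * d_bar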

Page 39: Machine Learning and its Applications in Bioinformatics

Accuracy of Different Classification Algorithms

Data set (#training, #test)   KDE     SVM     1NN     3NN
Satimage (4335, 2000)         92.30   91.30   89.35   90.6
Letter (15000, 5000)          97.12   97.98   95.26   95.46
Shuttle (43500, 14500)        99.94   99.92   99.91   99.92
Average                       96.45   96.40   94.84   95.33

Page 40: Machine Learning and its Applications in Bioinformatics

Comparison of Execution Time (in seconds)

                    KDE without       KDE with
                    data reduction    data reduction    SVM
Cross validation
  Satimage          670               265               64622
  Letter            2825              1724              386814
  Shuttle           96795             59.9              467825
Make classifier
  Satimage          5.91              0.85              21.66
  Letter            17.05             6.48              282.05
  Shuttle           1745              0.69              129.84
Test
  Satimage          21.3              7.4               11.53
  Letter            128.6             51.74             94.91
  Shuttle           996.1             5.85              2.13

Page 41: Machine Learning and its Applications in Bioinformatics

Parameter Setting through Cross Validation

• When carrying out data classification, we normally need to set one or more parameters associated with the data classification algorithm.

• For example, we need to set the value of k with the KNN classifier.

• The typical approach is to conduct cross validation to find the optimal value.

Page 42: Machine Learning and its Applications in Bioinformatics

• In the cross validation process, we set the parameters of the classifier to a particular combination of values that we are interested in and then evaluate how good the combination is under one of the following schemes.

• With the leave-one-out cross validation scheme, we attempt to predict the class of each sample using the remaining samples as the training data set.

Page 43: Machine Learning and its Applications in Bioinformatics

• With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy of one of the 10 subsets using the other 9 subsets as the training set.
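A Python sketch of the k-fold scheme; train_and_test is a hypothetical stand-in for whatever classifier is being tuned, and setting n_folds = len(X) gives leave-one-out:

    import numpy as np

    def cross_validate(X, y, train_and_test, n_folds=10, seed=0):
        # Returns the accuracy averaged over the n_folds held-out subsets.
        X, y = np.asarray(X), np.asarray(y)
        idx = np.random.RandomState(seed).permutation(len(X))
        accuracies = []
        for test_idx in np.array_split(idx, n_folds):
            train_idx = np.setdiff1d(idx, test_idx)
            predictions = train_and_test(X[train_idx], y[train_idx], X[test_idx])
            accuracies.append(np.mean(np.asarray(predictions) == y[test_idx]))
        return float(np.mean(accuracies))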

Page 44: Machine Learning and its Applications in Bioinformatics

Overfitting

• Overfitting occurs when we construct a classifier based on an insufficient number of samples.

• As a result, the classifier may work well on the training dataset but fail to deliver acceptable accuracy in the real world.

Page 45: Machine Learning and its Applications in Bioinformatics

• For example, if we toss a fair coin two times, there is a 50% chance that we will observe either side up in both tosses.

• Therefore, if we draw a conclusion about how fair the coin is from just two tosses, we may end up overfitting the dataset.

• Overfitting is a serious problem in analyzing high-dimensional datasets, e.g. the microarray datasets.

Page 46: Machine Learning and its Applications in Bioinformatics

Alternative Similarity Functions

• Let $\langle v_{r,1}, v_{r,2}, \ldots, v_{r,n}\rangle$ and $\langle v_{t,1}, v_{t,2}, \ldots, v_{t,n}\rangle$ be the gene expression vectors, i.e. the feature vectors, of samples $S_r$ and $S_t$, respectively. Then the following alternative similarity functions can be employed:

• Euclidean distance:
$$dissimilarity(S_r,S_t)=\sqrt{\sum_{h=1}^{n}\big(v_{r,h}-v_{t,h}\big)^{2}}$$

Page 47: Machine Learning and its Applications in Bioinformatics

• Cosine:
$$Similarity(S_r,S_t)=\frac{\sum_{h=1}^{n}v_{r,h}\,v_{t,h}}{\sqrt{\sum_{h=1}^{n}v_{r,h}^{2}}\,\sqrt{\sum_{h=1}^{n}v_{t,h}^{2}}}$$

• Correlation coefficient:
$$Similarity(S_r,S_t)=\frac{\frac{1}{n}\sum_{h=1}^{n}v_{r,h}\,v_{t,h}-\hat\mu_r\hat\mu_t}{\hat\sigma_r\,\hat\sigma_t},$$
where $\hat\mu_r=\frac{1}{n}\sum_{h=1}^{n}v_{r,h}$, $\hat\sigma_r^{2}=\frac{1}{n}\sum_{h=1}^{n}v_{r,h}^{2}-\hat\mu_r^{2}$, and $\hat\mu_t$, $\hat\sigma_t$ are defined analogously.
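The three functions in a short numpy sketch:

    import numpy as np

    def euclidean_dissimilarity(vr, vt):
        vr, vt = np.asarray(vr, dtype=float), np.asarray(vt, dtype=float)
        return float(np.sqrt(np.sum((vr - vt) ** 2)))

    def cosine_similarity(vr, vt):
        vr, vt = np.asarray(vr, dtype=float), np.asarray(vt, dtype=float)
        return float(np.dot(vr, vt) / (np.linalg.norm(vr) * np.linalg.norm(vt)))

    def correlation_similarity(vr, vt):
        vr, vt = np.asarray(vr, dtype=float), np.asarray(vt, dtype=float)
        zr = (vr - vr.mean()) / vr.std()   # std() divides by n, matching the formulas
        zt = (vt - vt.mean()) / vt.std()
        return float(np.mean(zr * zt))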

Page 48: Machine Learning and its Applications in Bioinformatics

Importance of Feature Selection

• Inclusion of features that are not correlated to the classification decision may make the problem even more complicated.

• For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes incorrect prediction of the marked test instance if a 3NN classifier is employed.

Page 49: Machine Learning and its Applications in Bioinformatics

• It is apparent that the "o"s and "×"s are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.

[Figure: "o"s and "×"s separated by the vertical line x = 10.]

Page 50: Machine Learning and its Applications in Bioinformatics

Linearly Separable and Non-Linearly Separable

• Some datasets are linearly separable.

• However, there are more datasets that are non-linearly separable.

Page 51: Machine Learning and its Applications in Bioinformatics

An Example of a Linearly Separable Dataset

[Figure: a linearly separable dataset in the x-y plane.]

Page 52: Machine Learning and its Applications in Bioinformatics

An Example of a Non-Linearly Separable Dataset

Page 53: Machine Learning and its Applications in Bioinformatics

The Simplest Case of Linearly Separable Data

[Figure: "o"s and "×"s separated by the vertical line x = 10.]

Page 54: Machine Learning and its Applications in Bioinformatics

Feature Selection Based on Univariate Analysis

[Table: the values of attribute $A_i$, grouped by class. Class 1: $S_{11}\to v_{11}$, $S_{12}\to v_{12}$, $S_{13}\to v_{13}$, …; Class 2: $S_{21}\to v_{21}$, $S_{22}\to v_{22}$, $S_{23}\to v_{23}$, …; Class 3: $S_{31}\to v_{31}$, $S_{32}\to v_{32}$, $S_{33}\to v_{33}$, ….]

Page 55: Machine Learning and its Applications in Bioinformatics

Let
$$\bar v_{i.}=\frac{1}{n_i}\sum_{j=1}^{n_i}v_{ij};\qquad
s_i^{2}=\frac{1}{n_i-1}\sum_{j=1}^{n_i}\big(v_{ij}-\bar v_{i.}\big)^{2};\qquad
\bar v_{..}=\frac{1}{n}\sum_{i=1}^{k}n_i\,\bar v_{i.},$$
where $n=n_1+n_2+\cdots+n_k$. Then, based on the F-test, attribute $A$ is included if and only if
$$\frac{\sum_{i=1}^{k}n_i\big(\bar v_{i.}-\bar v_{..}\big)^{2}\,/\,(k-1)}{\sum_{i=1}^{k}(n_i-1)\,s_i^{2}\,/\,(n-k)}\;\ge\;F_{\alpha}(k-1,\,n-k)$$
(e.g. $F_{0.05}(2,120)=3.07$ for $n=123$ and $k=3$).
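A Python sketch of the F statistic defined above; groups holds one array of attribute values per class:

    import numpy as np

    def f_score(groups):
        # One-way ANOVA F statistic for a single attribute.
        groups = [np.asarray(g, dtype=float) for g in groups]
        k = len(groups)
        n = sum(len(g) for g in groups)
        grand_mean = np.concatenate(groups).mean()
        between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
        within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
        return float(between / within)

    # For the example on the next slide, the X attribute yields a large F
    # while the Y attribute yields F = 0.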

Page 56: Machine Learning and its Applications in Bioinformatics

An Example of Univariate Analysis

Sample    X     Y      Class     Sample     X      Y      Class
1         7.1   9.1    1         11         10.9   8.8    2
2         6.7   10.2   1         12         10.8   10.3   2
3         7.5   10.6   1         13         11.1   11     2
4         7.6   8.8    1         14         12.3   9.1    2
5         8.1   10.3   1         15         12.1   9.7    2
6         8.0   11.0   1         16         12     10.9   2
7         8.6   8.9    1         17         13.1   8.9    2
8         8.7   9.8    1         18         12.8   10.1   2
9         9.2   11.2   1         19         13.2   11.3   2
10        6.5   10.1   1         20         13.7   9.9    2
Average   7.8   10.0   -         Average    12.2   10.0   -

Page 57: Machine Learning and its Applications in Bioinformatics

Joint p.m.f. of X, Y, and C

[Scatter plot of the 20 samples labeled by class: the class-1 points lie around X ≈ 7.8 and the class-2 points around X ≈ 12.2, while the two classes overlap completely along Y; x axis from 6 to 14, y axis from 8 to 12.]

Page 58: Machine Learning and its Applications in Bioinformatics

$$F_X=\frac{10\,(7.8-10)^{2}+10\,(12.2-10)^{2}}{\frac{1}{18}\Big(\sum_{i}(x_{1i}-7.8)^{2}+\sum_{i}(x_{2i}-12.2)^{2}\Big)}=106.24\;\ge\;4.35\quad(\text{threshold for the }F\text{ statistic with }\alpha=0.05),$$

$$F_Y=0.$$

Page 59: Machine Learning and its Applications in Bioinformatics

Blind Spot of the Univariate Analysis

• The univariate analysis is not able to identify crucial features in the following example:

[Figure: a dataset in which neither the x nor the y attribute alone separates the two classes.]

Page 60: Machine Learning and its Applications in Bioinformatics

The Demonstrative Data Set

Page 61: Machine Learning and its Applications in Bioinformatics

Joint p.m.f. of X, Y, and C

[Scatter plot of the demonstrative data set, classes 1 and 2; x axis from 0 to 6, y axis from 0 to 12.]

Page 62: Machine Learning and its Applications in Bioinformatics

• For Gene X,
$$\bar x_{..}=\frac{10\,(3.47)+10\,(2.84)}{20}=3.155,\qquad
F_X=\frac{10\,(3.47-3.155)^{2}+10\,(2.84-3.155)^{2}}{1.0785}=\frac{1.9845}{1.0785}=1.84<4.35\ (\text{threshold}).$$

• For Gene Y,
$$\bar y_{..}=\frac{10\,(5.51)+10\,(6.57)}{20}=6.04,\qquad
F_Y=\frac{10\,(5.51-6.04)^{2}+10\,(6.57-6.04)^{2}}{1.054}=\frac{5.618}{1.054}=5.33.$$

Page 63: Machine Learning and its Applications in Bioinformatics

• However, if we apply the following linear transformation, then we will be able to identify the significance of these two genes:
$$2\,(\text{Gene X})-(\text{Gene Y}).$$

Page 64: Machine Learning and its Applications in Bioinformatics

• For "2·Gene X − Gene Y",
$$\bar w_{..}=\frac{10\,(1.43)+10\,(-0.89)}{20}=0.27,\qquad
F=\frac{10\,(1.43-0.27)^{2}+10\,(-0.89-0.27)^{2}}{0.60}=\frac{26.912}{0.60}=44.853\;\ge\;4.35.$$

Page 65: Machine Learning and its Applications in Bioinformatics

• On the other hand, if we employ the linear operator $x+2y$, then we obtain
$$F=\frac{10\,\big(\bar w_{1.}-\bar w_{..}\big)^{2}+10\,\big(\bar w_{2.}-\bar w_{..}\big)^{2}}{s_w^{2}}=0.3267,$$
where $w_i=x_i+2y_i$.

Page 66: Machine Learning and its Applications in Bioinformatics

• Accordingly, the issue now is how we can figure out the optimal linear operator of the form $\alpha x+\beta y$ for the projection.

• In the 2-D case, given a set of samples $\{(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\}$,
$$v_i=\cos\theta\,x_i+\sin\theta\,y_i$$
is the value obtained by projecting $(x_i,y_i)$ onto the line $\sin\theta\,x-\cos\theta\,y=0$, i.e. onto the component along the vector $(\cos\theta,\sin\theta)$, as shown on the following page.

Page 67: Machine Learning and its Applications in Bioinformatics
Page 68: Machine Learning and its Applications in Bioinformatics

Feature Selection with Independent Component Analysis (ICA)

• In recent years, ICA has emerged as a promising approach for carrying out multivariate analysis.

Page 69: Machine Learning and its Applications in Bioinformatics

Basic Idea

• The ICA algorithm attempts to identify a plane such that, when we project the data set onto the plane, the distribution is most non-Gaussian.

Page 70: Machine Learning and its Applications in Bioinformatics

A Measure of Non-Gaussianity

• The kurtosis is commonly employed to measure the non-Gaussianity of a data set.

• The kurtosis of a dataset $\{v_1, v_2, \ldots, v_n\}$ is
$$kurt=\frac{\sum_{i=1}^{n}\big(v_i-\bar v\big)^{4}}{(n-1)\,s^{4}}-3,\qquad\text{where }\bar v=\frac{1}{n}\sum_{i=1}^{n}v_i\ \text{ and }\ s^{2}=\frac{1}{n}\sum_{i=1}^{n}\big(v_i-\bar v\big)^{2}.$$
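A numpy sketch of this measure, together with the projected version kurt(θ) introduced two slides below:

    import numpy as np

    def kurtosis(v):
        # kurt = sum_i (v_i - mean)^4 / ((n - 1) * s^4) - 3, with s^2 the mean
        # squared deviation, as defined above.
        v = np.asarray(v, dtype=float)
        n, mean = len(v), v.mean()
        s2 = np.mean((v - mean) ** 2)
        return float(np.sum((v - mean) ** 4) / ((n - 1) * s2 ** 2) - 3.0)

    def kurt_theta(theta, x, y):
        # Kurtosis of the data projected onto the direction (cos(theta), sin(theta)).
        return kurtosis(np.cos(theta) * np.asarray(x) + np.sin(theta) * np.asarray(y))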

Page 71: Machine Learning and its Applications in Bioinformatics

• The expected value of the kurtosis of a set of random samples taken from a standard normal distribution is 0.

• If the kurtosis of a set of random samples is larger than 0, then the p.d.f. of the distribution is sharper than that of the standard normal distribution.

• If the kurtosis of a set of random samples is smaller than 0, then the p.d.f. of the distribution is flatter than that of the standard normal distribution.

Page 72: Machine Learning and its Applications in Bioinformatics

• Let $kurt(\theta)$ denote the kurtosis of $\{v_1, v_2, \ldots, v_n\}$, where $v_i=\cos\theta\,x_i+\sin\theta\,y_i$:
$$kurt(\theta)=\frac{\sum_{i=1}^{n}\big(\cos\theta\,x_i+\sin\theta\,y_i-\bar v\big)^{4}}{(n-1)\,s^{4}}-3,$$
where $\bar v=\frac{1}{n}\sum_{i=1}^{n}\big(\cos\theta\,x_i+\sin\theta\,y_i\big)$ and $s^{2}=\frac{1}{n}\sum_{i=1}^{n}\big(\cos\theta\,x_i+\sin\theta\,y_i-\bar v\big)^{2}$.

Page 73: Machine Learning and its Applications in Bioinformatics

• The issue now is to find the value of θ that minimizes kurt(θ).

• This is an optimization problem.

Page 74: Machine Learning and its Applications in Bioinformatics

The Optimization Problem

• The optimization problem is to find the global maximum/minimum of a function.

• There are several heuristic algorithms designed for solving the optimization problem, e.g.:

• gradient descent;

• genetic algorithms;

• simulated annealing.

Page 75: Machine Learning and its Applications in Bioinformatics

The Gradient Descent Algorithm

• In the gradient descent algorithm, a number of random samples are taken as the starting points.

• Then, we compute the gradient at each point and make a move in the direction in which the slope is steepest.

• This process is repeated until the convergence criterion is met.

Page 76: Machine Learning and its Applications in Bioinformatics

A 1-D Example

[Plot of $kurt(\theta)$ with three starting points:]

• at $\theta_1$, the slope $\frac{d\,kurt(\theta)}{d\theta}$ indicates a move to the right;

• at $\theta_2$, a move to the left;

• at $\theta_3$, a move to the right;

• $\delta$ is a parameter that controls the step size.

Page 77: Machine Learning and its Applications in Bioinformatics

• The gradient descent algorithm can be applied to multidimensional functions. In such cases, partial differentiation is involved.

• If the gradient descent algorithm is to be employed, then we must be able to compute the gradient of the function at any point in the vector space.
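A Python sketch of the 1-D case; the numerical derivative, the step size delta, and the convergence tolerance are all assumed choices:

    def gradient_descent_1d(kurt, theta0, delta=0.01, eps=1e-5, iterations=1000):
        # Repeatedly step against the slope d kurt(theta) / d theta:
        # a positive slope moves theta to the left, a negative slope to the right.
        theta = theta0
        for _ in range(iterations):
            slope = (kurt(theta + eps) - kurt(theta - eps)) / (2 * eps)
            if abs(slope) < 1e-8:          # convergence criterion
                break
            theta -= delta * slope
        return theta

    # As the slides describe, several random starting points theta0 are tried
    # and the best converged value is kept.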

Page 78: Machine Learning and its Applications in Bioinformatics

Blind Spot of ICA

• However, ICA may fail on the following non-linearly separable dataset:

Page 79: Machine Learning and its Applications in Bioinformatics

Data Clustering

• Data clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space.

Page 80: Machine Learning and its Applications in Bioinformatics

Model of Microarray Data Sets

[Table: rows are the samples S11, S12, S13, … (Class 1), S21, S22, S23, … (Class 2), and S31, S32, S33 (Class 3); columns are Gene 1, Gene 2, …, Gene n; the $(i,j)$ entry is the expression value $v_{i,j}$.]

Page 81: Machine Learning and its Applications in Bioinformatics

Applications of Data Clustering in Microarray Data Analysis

• Data clustering has been employed in microarray data analysis for

• identifying the genes with similar expressions;

• identifying the subtypes of samples.

Page 82: Machine Learning and its Applications in Bioinformatics

• For cluster analysis of samples, we can employ the feature selection mechanism developed for classification of samples.

• For cluster analysis of genes, each column of gene expression data is regarded as the feature vector of one gene.

Page 83: Machine Learning and its Applications in Bioinformatics

The Agglomerative Hierarchical Clustering Algorithms

• The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances/similarities.

• Initially, each data instance forms a cluster.

• The clustering algorithm repetitively merges the two clusters with the minimum inter-cluster distance or the maximum inter-cluster similarity.

Page 84: Machine Learning and its Applications in Bioinformatics

• Upon merging two clusters, the clustering algorithm computes the distances between the newly-formed cluster and the remaining clusters and maintains the sorted list of inter-cluster distances accordingly.

• There are a number of ways to define the inter-cluster distance:

• minimum distance/maximum similarity (single-link);

• maximum distance/minimum similarity (complete-link);

• average distance/average similarity;

• mean distance (applicable only with the vector space model).
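A sketch using scipy, which implements the single-link and complete-link rules directly; the 4-object dissimilarity matrix is hypothetical, and a similarity matrix would first be converted to dissimilarities (e.g. d = 1 - s):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    D = np.array([[0.0, 0.2, 0.9, 0.8],     # hypothetical pairwise dissimilarities
                  [0.2, 0.0, 0.7, 0.6],
                  [0.9, 0.7, 0.0, 0.1],
                  [0.8, 0.6, 0.1, 0.0]])

    Z = linkage(squareform(D), method="complete")      # or method="single"
    labels = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram at 0.5
    print(labels)                                      # two clusters, e.g. [1 1 2 2]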

Page 85: Machine Learning and its Applications in Bioinformatics

• Given the following similarity matrix, we can apply the complete-link algorithm to obtain the dendrogram shown on the next slide.

      g1    g2      g3      g4      g5      g6
g1    -     0.053   0.137   0.862   0.035   0.018
g2          -       0.046   0.072   0.816   0.606
g3                  -       0.447   0.751   0.201
g4                          -       0.291   0.156
g5                                  -       0.494
g6                                          -

Page 86: Machine Learning and its Applications in Bioinformatics

• Assume that the complete-link algorithm is employed.

• If those similarity scores that are less than 0.3 are excluded, then we obtain 3 clusters: {g1, g4}, {g2, g5, g6}, {g3}.

[Dendrogram (complete-link): g1 and g4 merge at 0.862; g2 and g5 merge at 0.816; g6 joins {g2, g5} at 0.494; g3 joins {g1, g4} at 0.137; the final merge occurs at 0.018.]

Page 87: Machine Learning and its Applications in Bioinformatics

• If the single-link algorithm is employed, then we obtain the following result.

[Dendrogram (single-link): g1 and g4 merge at 0.862; g2 and g5 merge at 0.816; g3 joins at 0.751; g6 joins at 0.606; the final merge occurs at 0.447.]

Page 88: Machine Learning and its Applications in Bioinformatics

Example of the Chaining Effect

Single-link (10 clusters)

Complete-link (2 clusters)

Page 89: Machine Learning and its Applications in Bioinformatics

Effect of Bias towards Spherical Clusters

Single-link (2 clusters) Complete-link (2 clusters)

Page 90: Machine Learning and its Applications in Bioinformatics

K-Means: A Partitional Data Clustering Algorithm

• The k-means algorithm is probably the most commonly used partitional clustering algorithm.

• The k-means algorithm begins with selecting k data instances as the means or centers of k clusters.

Page 91: Machine Learning and its Applications in Bioinformatics

• The k-means algorithm then executes the following loop iteratively until the convergence criterion is met:

repeat {
    assign every data instance to the closest cluster, based on the distance between the data instance and the center of the cluster;
    compute the new centers of the k clusters;
} until (the convergence criterion is met);

Page 92: Machine Learning and its Applications in Bioinformatics

• A commonly-used convergence criterion is
$$E=\sum_{i}\sum_{p\in C_i}\|p-m_i\|^{2},$$
where $m_i$ is the center of cluster $C_i$. If the difference between the $E$ values of two consecutive iterations is smaller than a threshold, then the algorithm terminates.
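A compact Python sketch of the loop, using the E criterion above as the convergence test (empty clusters are not handled):

    import numpy as np

    def k_means(X, k, threshold=1e-6, seed=0):
        X = np.asarray(X, dtype=float)
        rng = np.random.RandomState(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # k instances as initial centers
        prev_E = np.inf
        while True:
            # Assign every instance to the closest center.
            assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
            # Compute the new centers of the k clusters.
            centers = np.array([X[assign == i].mean(axis=0) for i in range(k)])
            E = sum(((X[assign == i] - centers[i]) ** 2).sum() for i in range(k))
            if prev_E - E < threshold:                          # convergence criterion
                return centers, assign
            prev_E = E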

Page 93: Machine Learning and its Applications in Bioinformatics

Illustration of the K-Means Algorithm---(I)

[Figure: three groups of points with the three initial centers marked.]

Page 94: Machine Learning and its Applications in Bioinformatics

Illustration of the K-Means Algorithm---(II)

[Figure: the three centers (marked ×) have moved to the new positions computed after the 1st iteration.]

Page 95: Machine Learning and its Applications in Bioinformatics

Illustration of the K-Means Algorithm---(III)

[Figure: the three centers after the 2nd iteration.]

Page 96: Machine Learning and its Applications in Bioinformatics

A Case in which the K-Means Algorithm Fails

• The k-means algorithm may converge to a locally optimal state, as the following example demonstrates:

[Figure: a poor initial selection of centers from which k-means cannot recover.]

Page 97: Machine Learning and its Applications in Bioinformatics

Conclusions

• Machine learning algorithms have been widely exploited to tackle many important bioinformatics problems.