JM - http://folding.chmcc.org 1
Introduction to Bioinformatics: Lecture VIII: Classification and Supervised Learning
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
Outline of the lecture
Motivating story: correlating inputs and outputs
Learning with a teacher
Regression and classification problems
Model selection, feature selection and generalization
k-nearest neighbors and some other classification algorithms
Phenotype fingerprints and their applications in medicine
Web watch: an on-line biology textbook by JW Kimball
Dr. J. W. Kimball's Biology Pages
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story #1: B-cells and DNA editing, Apolipoprotein B and RNA editing
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story #2: ApoB, cholesterol uptake, LDL and its endocytosis
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.
Correlations and fingerprints
Instead of an often difficult-to-decipher underlying molecular model, one may simply try to find correlations between inputs and outputs. If measurements on certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states etc., one can use such attributes as indicators of these “hidden” states and make predictions for new cases.
Consider, for example, elevated levels of low density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.
Correlations and fingerprints: LDL example
[3D scatter plot: x – LDL; y – HDL; z – age. Healthy cases in blue; heart attack or stroke within 5 years from the exam in red (simulated data). See the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549.]
LDL example: 2D projection
[2D scatter plot: LDL (x axis, 0–300) vs. HDL (y axis, 20–100).]
LDL example: regression with binary output and 1D projection for classification
[Plot: LDL (x axis, 0–300) vs. class label (y axis); regression line on the binary output with a 1D decision threshold for classification.]
Unsupervised vs. supervised learning
In unsupervised learning the goal is to “discover” structure in the data and to group (cluster) similar objects, given a similarity measure. In supervised learning (learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases.
[Illustration: three clusters of points labeled Class 1, Class 2 and Class 3.]
Choice of the model, problem representation and feature selection: another simple example
[Illustration: separating females (F) and males (M) among adults vs. children, using features such as height, weight, estrogen and testosterone levels.]
Gene expression example again: JRA clinical classes
[Heatmap: gene expression across individuals (33 JRA patients + 12 controls), with controls and poly-articular JRA course groups marked. Columns are labeled with sample ID, JRA subtype (Pauci, Poly, Spond, Syst, Control), disease course, MTX status (0/1) and X-ray findings (normal, erosions, space narrowing, sclerosis, unknown, na). Rows: 242 genes, of which 105 show significantly lower and 137 significantly higher expression in poly-articular JRA. Picture: courtesy of B. Aronow.]
Advantages of prior knowledge and, on the other hand, problems with class assignment (e.g. in clinical practice)
[Structure diagrams: FixL, PYP and GLOBINS show no sequence similarity, yet belong to the same structural class (??).]
Prior knowledge: the same class despite low sequence similarity. This suggests that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the “good model” question again).
Three phases in supervised learning protocols
Training data: examples with class assignments are given.
Learning:
i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type;
ii) adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g. minimizing the number of misclassified training vectors).
Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding the trade-off between accuracy and generalization is not trivial).
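The validation phase can be sketched with a simple k-fold cross-validation split. The helper below is an illustration only (the function name and seeding are my own, not from the lecture):

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Partition sample indices into k folds; yield (train, test) index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

# Each sample lands in exactly one test fold; the model is trained k times,
# each time on the remaining folds, and accuracy is averaged over test folds.
splits = k_fold_splits(10, 5)
```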
Training set: LDL example again
A set of objects (here patients) xi, i=1, …, N is given. For each patient a set of features (attributes and the corresponding measurements on these attributes) is given as well. Finally, for each patient we are given the class Ck, k=1, …, K, to which he/she belongs.
Age  LDL  HDL  Sex  Class
41   230  60   F    healthy (0)
32   120  50   M    stroke within 5 years (1)
45   90   70   M    heart attack within 5 years (1)

{ xi , Ck }, i=1, …, N
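A minimal way to represent such a training set { xi , Ck } in code; the field names are illustrative, not from the slide:

```python
from dataclasses import dataclass

@dataclass
class Patient:
    """One training example: feature measurements plus a class label."""
    age: int
    ldl: int
    hdl: int
    sex: str
    label: int  # 0 = healthy, 1 = stroke/heart attack within 5 years

training_set = [
    Patient(41, 230, 60, "F", 0),
    Patient(32, 120, 50, "M", 1),
    Patient(45, 90, 70, "M", 1),
]
```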
Optimizing adaptable parameters in the model
Find a model y(x;w) that describes the objects of each class as a function of the features and adaptive parameters (weights) w.
Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C (e.g. if y(x;w) > 0.5 then C=1, i.e. the patient is likely to suffer a stroke or heart attack within the next 5 years).
y(x;w)
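As one concrete (hypothetical) choice of y(x;w), a logistic model on the features with a 0.5 decision threshold; the weights used below are made up for illustration:

```python
import math

def y(x, w, b=0.0):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, w, b=0.0, threshold=0.5):
    # Assign class 1 (event within 5 years) when y(x;w) exceeds the threshold.
    return 1 if y(x, w, b) > threshold else 0
```

In practice the weights w (and bias b) would be fitted on the training set; here they are hand-picked, e.g. w=[0.02], b=-3.0 on LDL alone.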
Examples of machine learning algorithms for classification and regression problems
Linear perceptron, least squares
LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts; kernel non-linear generalizations)
SVM (Support Vector Machines) (optimal, wide-margin linear cuts; kernel non-linear generalizations)
Decision trees (logical rules)
k-NN (k-Nearest Neighbors) (simple, non-parametric)
Neural networks (general non-linear models, adaptivity, “artificial brain”)
Training accuracy vs. generalization
Model complexity, training set size and generalization
[Plot: polynomial fits to noisy data points; legend: data, linear, cubic, 7th degree.]
Similarity measures
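Two common distance (dissimilarity) measures between feature vectors, sketched as plain functions for illustration:

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))
```

The choice of measure matters: features on very different scales (e.g. LDL in mg/dL vs. age in years) usually need normalization before either distance is meaningful.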
k-nearest neighbors as a simple algorithm for classification
Given a training set of N objects with known class assignments and k < N, assign a new object (not included in the training set) to one of the classes based on the classes of its k nearest neighbors
A simple, non-parametric method that works surprisingly well, especially for low-dimensional problems
Note however that the choice of the distance measure may again have a profound effect on the results
The optimal k is found by trial and error
k-nearest neighbor algorithm
Step 1: Compute pairwise distances and take the k closest neighbors.
Step 2: Assign the class by simple majority voting: the new point belongs to the class to which most of its k neighbors belong.
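The two steps above can be sketched as a small k-NN classifier. Euclidean distance and the toy data are assumptions for illustration, not part of the lecture:

```python
import math
from collections import Counter

def knn_classify(query, training_set, k):
    """Classify `query` by majority vote among its k nearest training points.

    training_set: list of (feature_vector, class_label) pairs.
    """
    # Step 1: sort training points by distance to the query, keep the k closest.
    by_distance = sorted(training_set, key=lambda item: math.dist(query, item[0]))
    neighbors = by_distance[:k]
    # Step 2: simple majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [
    ((1.0, 1.0), "healthy"),
    ((1.2, 0.8), "healthy"),
    ((5.0, 5.0), "diseased"),
]
```

Ties and the choice of k are handled here in the simplest possible way (Counter breaks ties by insertion order); as the slides note, the optimal k is usually found by trial and error on a validation set.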