JM - http://folding.chmcc.org 1
Introduction to Bioinformatics: Lecture VIII: Classification and Supervised Learning
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
Outline of the lecture
Motivating story: correlating inputs and outputs
Learning with a teacher
Regression and classification problems
Model selection, feature selection and generalization
k-nearest neighbors and some other classification algorithms
Phenotype fingerprints and their applications in medicine
Web watch: an on-line biology textbook by JW Kimball
Dr. J. W. Kimball's Biology Pages
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story #1: B-cells and DNA editing, Apolipoprotein B and RNA editing
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story #2: ApoB, cholesterol uptake, LDL and its endocytosis
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.
Correlations and fingerprints
Instead of an often difficult-to-decipher underlying molecular model, one may simply try to find correlations between inputs and outputs. If measurements on certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states etc., one can use such attributes as indicators of these “hidden” states and make predictions for new cases.
Consider, for example, elevated levels of low density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.
Correlations and fingerprints: LDL example
[3D scatter plot: x – LDL; y – HDL; z – age. Healthy cases in blue; heart attack or stroke within 5 years from the exam in red (simulated data). See the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549.]
LDL example: 2D projection
[2D scatter plot: LDL (x axis, 0–300) vs. HDL (y axis, 20–100).]
LDL example: regression with binary output and 1D projection for classification
[Plot: LDL (x axis, 0–300) vs. class label (y axis); regression line on the binary output with a 1D decision threshold for classification.]
Unsupervised vs. supervised learning
In unsupervised learning the goal is to “discover” structure in the data and to group (cluster) similar objects, given a similarity measure. In supervised learning (learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases.
[Illustration: three clusters of points labeled Class 1, Class 2 and Class 3.]
Choice of the model, problem representation and feature selection: another simple example
[Illustration: separating females (F) and males (M) among adults vs. children, using features such as height, weight, estrogen and testosterone levels.]
Gene expression example again: JRA clinical classes
[Heatmap: gene expression across individuals (33 JRA patients + 12 controls), with controls and poly-articular JRA course groups marked. Columns are labeled with sample ID, JRA subtype (Pauci, Poly, Spond, Syst, Control), disease course, MTX status (0/1) and X-ray findings (normal, erosions, space narrowing, sclerosis, unknown, na). Rows: 242 genes, of which 105 show significantly lower and 137 significantly higher expression in poly-articular JRA. Picture: courtesy of B. Aronow.]
Advantages of prior knowledge and, on the other hand, problems with class assignment (e.g. in clinical practice)
[Structure diagrams: FixL, PYP and GLOBINS show no sequence similarity, yet belong to the same structural class (??).]
Prior knowledge: the same class despite low sequence similarity. This suggests that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the “good model” question again).
Three phases in supervised learning protocols
Training data: examples with class assignments are given.
Learning:
i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type;
ii) adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g. minimizing the number of misclassified training vectors).
Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding the trade-off between accuracy and generalization is not trivial).
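The validation phase can be sketched with a simple k-fold cross-validation split. The helper below is an illustration only (the function name and seeding are my own, not from the lecture):

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Partition sample indices into k folds; yield (train, test) index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

# Each sample lands in exactly one test fold; the model is trained k times,
# each time on the remaining folds, and accuracy is averaged over test folds.
splits = k_fold_splits(10, 5)
```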
Training set: LDL example again
A set of objects (here patients) xi, i=1, …, N is given. For each patient a set of features (attributes and the corresponding measurements on these attributes) is given as well. Finally, for each patient we are given the class Ck, k=1, …, K, to which he/she belongs.
Age  LDL  HDL  Sex  Class
41   230  60   F    healthy (0)
32   120  50   M    stroke within 5 years (1)
45   90   70   M    heart attack within 5 years (1)

{ xi , Ck }, i=1, …, N
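A minimal way to represent such a training set { xi , Ck } in code; the field names are illustrative, not from the slide:

```python
from dataclasses import dataclass

@dataclass
class Patient:
    """One training example: feature measurements plus a class label."""
    age: int
    ldl: int
    hdl: int
    sex: str
    label: int  # 0 = healthy, 1 = stroke/heart attack within 5 years

training_set = [
    Patient(41, 230, 60, "F", 0),
    Patient(32, 120, 50, "M", 1),
    Patient(45, 90, 70, "M", 1),
]
```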
Optimizing adaptable parameters in the model
Find a model y(x;w) that describes the objects of each class as a function of the features and adaptive parameters (weights) w.
Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C (e.g. if y(x;w) > 0.5 then C=1, i.e. the patient is likely to suffer a stroke or heart attack within the next 5 years).
y(x;w)
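As one concrete (hypothetical) choice of y(x;w), a logistic model on the features with a 0.5 decision threshold; the weights used below are made up for illustration:

```python
import math

def y(x, w, b=0.0):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, w, b=0.0, threshold=0.5):
    # Assign class 1 (event within 5 years) when y(x;w) exceeds the threshold.
    return 1 if y(x, w, b) > threshold else 0
```

In practice the weights w (and bias b) would be fitted on the training set; here they are hand-picked, e.g. w=[0.02], b=-3.0 on LDL alone.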
Examples of machine learning algorithms for classification and regression problems
Linear perceptron, least squares
LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts; kernel non-linear generalizations)
SVM (Support Vector Machines) (optimal, wide-margin linear cuts; kernel non-linear generalizations)
Decision trees (logical rules)
k-NN (k-Nearest Neighbors) (simple, non-parametric)
Neural networks (general non-linear models, adaptivity, “artificial brain”)
Training accuracy vs. generalization
Model complexity, training set size and generalization
[Plot: polynomial fits to noisy data points; legend: data, linear, cubic, 7th degree.]
Similarity measures
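Two common distance (dissimilarity) measures between feature vectors, sketched as plain functions for illustration:

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))
```

The choice of measure matters: features on very different scales (e.g. LDL in mg/dL vs. age in years) usually need normalization before either distance is meaningful.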
k-nearest neighbors as a simple algorithm for classification
Given a training set of N objects with known class assignments and k < N, assign a new object (not included in the training set) to one of the classes based on the classes of its k nearest neighbors
A simple, non-parametric method that works surprisingly well, especially for low-dimensional problems
Note however that the choice of the distance measure may again have a profound effect on the results
The optimal k is found by trial and error
k-nearest neighbor algorithm
Step 1: Compute pairwise distances and take the k closest neighbors.
Step 2: Assign the class by simple majority voting: the new point belongs to the class to which most of its k neighbors belong.
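The two steps above can be sketched as a small k-NN classifier. Euclidean distance and the toy data are assumptions for illustration, not part of the lecture:

```python
import math
from collections import Counter

def knn_classify(query, training_set, k):
    """Classify `query` by majority vote among its k nearest training points.

    training_set: list of (feature_vector, class_label) pairs.
    """
    # Step 1: sort training points by distance to the query, keep the k closest.
    by_distance = sorted(training_set, key=lambda item: math.dist(query, item[0]))
    neighbors = by_distance[:k]
    # Step 2: simple majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [
    ((1.0, 1.0), "healthy"),
    ((1.2, 0.8), "healthy"),
    ((5.0, 5.0), "diseased"),
]
```

Ties and the choice of k are handled here in the simplest possible way (Counter breaks ties by insertion order); as the slides note, the optimal k is usually found by trial and error on a validation set.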