Outline

• Basic concept of SVM
• SVC formulations
• Kernel function
• Model selection (tuning SVM hyperparameters)
• SVM application: breast cancer diagnosis
• Prediction of protein secondary structure
• SVM application in protein fold assignment
Introduction

• Data classification
  – training
  – testing
• Learning
  – supervised learning (classification)
  – unsupervised learning (clustering)
Basic Concept of SVM

• Consider the linearly separable case
• Training data in two classes
• Separating hyperplane: $w^T x + b = 0$
Equation of a hyperplane, example:

$D: w^T x + b = 0$
  $w$: coefficient (normal) vector
  $b$: constant

In two dimensions: $D: w_1 x_1 + w_2 x_2 + b = 0$

Example: $2x + 3y - 6 = 0$, i.e., $y = 2 - \frac{2}{3}x$
Decision Function

$f(x) = w^T x + b$

• $f(x) > 0$: class 1
• $f(x) < 0$: class 2
• How to find good $w$ and $b$? There are many possible $(w, b)$.
Support Vector Machines

• A promising technique for data classification
• Statistical learning theory: maximize the distance between the two classes
• Linear separating hyperplane
• Maximal margin = distance between the two hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$
• distance $= \dfrac{2}{\|w\|}$

Find $w, b$ by solving:

$\max_{w,b} \ \dfrac{2}{\|w\|}$
s.t. $w^T x_i + b \ge 1$ if $y_i = 1$
     $w^T x_i + b \le -1$ if $y_i = -1$
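As a minimal sketch of this formulation (not from the slides; scikit-learn stands in for the LIBSVM tool discussed later, and the data points and large C value are illustrative assumptions), fitting a linear SVM on toy separable data and recovering $w$, $b$, and the margin $2/\|w\|$:

# A minimal sketch: hard-margin linear SVM on toy separable data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],   # class +1
              [4.0, 4.5], [5.0, 4.0], [4.5, 6.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))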
Questions

1. How to solve for $w, b$?
2. The linearly non-separable case
3. Is this $(w, b)$ good?
4. The multiple-class case
Method to Handle the Non-separable Case

• Non-linear case
• Map the input data into a higher-dimensional feature space: $x \mapsto \phi(x) = (\phi_1(x), \phi_2(x), \ldots)$
Example:

$x = (x_1, x_2, x_3) \in R^3$ (non-separable in $R^3$)

Mapping to a higher dimension: $\phi(x) \in R^{10}$; the data are now separable.
• Find a linear separating hyperplane in the feature space:

$x_1, x_2, \ldots, x_l \in R^n \ \rightarrow\ \phi(x_1), \phi(x_2), \ldots, \phi(x_l) \in R^m$

$\max_{w,b} \ \dfrac{2}{\|w\|}$
s.t. $w^T \phi(x_i) + b \ge 1$ if $y_i = 1$
     $w^T \phi(x_i) + b \le -1$ if $y_i = -1$
$\max_{w,b} \ \dfrac{2}{\|w\|} \ \Longleftrightarrow\ \min_{w,b} \ \dfrac{\|w\|}{2} \ \Longleftrightarrow\ \min_{w,b} \ \dfrac{1}{2} w^T w$

subject to $y_i (w^T \phi(x_i) + b) \ge 1, \ i = 1, \ldots, l$
• Questions:
  1. How to choose $\phi$?
  2. Is it really better? Yes.
• Sometimes, even in high-dimensional spaces, the data may still not be separable: allow training error.
• Non-linear curves in the input space become a linear hyperplane in a high-dimensional space (the feature space).

Example: $x \in R^3$, $\phi(x) \in R^{10}$:

$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1 x_2, \sqrt{2}x_1 x_3, \sqrt{2}x_2 x_3)$
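A small numeric check of this map (my own illustration, not the slides' code): for this $\phi$, the inner product $\phi(x)^T \phi(y)$ equals the degree-2 polynomial kernel $(1 + x^T y)^2$, so the $R^{10}$ computation can be replaced by a single kernel evaluation.

# Verify that the explicit R^3 -> R^10 map pairs with (1 + x^T y)^2.
import numpy as np

def phi(x):
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 3.0])
print(phi(x) @ phi(y))          # explicit inner product in R^10
print((1.0 + x @ y) ** 2)       # kernel value: the two agree (12.25)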
SVC formulations
(the soft margin hyperplane)

$\min_{w,b,\xi} \ \dfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$

subject to $y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, l$

• $C \sum_{i=1}^{l} \xi_i$: penalty term; $C$: a constant
• Expect: $\xi_i = 0$ if separable
Find a necessary condition for optimality:

$\min f(x)$
subject to $g_1(x) \le 0, \ldots, g_m(x) \le 0$
           $h_1(x) = 0, \ldots, h_n(x) = 0$

If $x$ is an optimum and satisfies the regularity conditions, then there exist multipliers $\alpha_i$ and $\beta_j$ such that

$\nabla f(x) + \sum_{i=1}^{m} \alpha_i \nabla g_i(x) + \sum_{j=1}^{n} \beta_j \nabla h_j(x) = 0$

$\alpha_i \ge 0, \quad g_i(x) \le 0, \quad \alpha_i g_i(x) = 0, \quad h_j(x) = 0$

If $f$ is convex, $x$ is optimal if and only if these conditions hold (the KKT conditions).
How to solve an optimization problem with constraints? Using Lagrange multipliers.

Given an optimization problem

$\min f(w)$
subject to $g_i(w) \le 0, \ i = 1, \ldots, l$
           $h_i(w) = 0, \ i = 1, \ldots, m$

we define the generalized Lagrangian function as

$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{l} \alpha_i g_i(w) + \sum_{i=1}^{m} \beta_i h_i(w) = f(w) + \alpha^T g(w) + \beta^T h(w)$
• Consider the following primal problem:

(P)  $\min_{w,b,\xi} \ \dfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$
     subject to $y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ i = 1, \ldots, l$
                $\xi_i \ge 0, \ i = 1, \ldots, l$

• (P) # variables: $w$ has the dimension of $\phi(x)$ (a very big number), $b$ is 1, $\xi$ is $l$
• (D) # variables: $l$
• Derive its dual. What is better about the dual than the primal?
Derive the Dual

The primal Lagrangian for the problem is:

$L(w, b, \xi, \alpha, \beta) = \dfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i (w^T \phi(x_i) + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} \beta_i \xi_i$

The corresponding dual is found by differentiating with respect to $w$, $\xi$, and $b$:

$\dfrac{\partial L(w, b, \xi, \alpha, \beta)}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$

$\dfrac{\partial L(w, b, \xi, \alpha, \beta)}{\partial \xi_i} = 0 \ \Rightarrow\ C - \alpha_i - \beta_i = 0$

$\dfrac{\partial L(w, b, \xi, \alpha, \beta)}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{l} \alpha_i y_i = 0$
Resubstituting the relations obtained into the primal yields the following adaptation of the dual objective function:

$L(w, b, \xi, \alpha, \beta) = \sum_{i=1}^{l} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \phi(x_i)^T \phi(x_j)$

Let $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$; then

$L(w, b, \xi, \alpha, \beta) = e^T \alpha - \dfrac{1}{2} \alpha^T Q \alpha$

Hence, maximizing the above objective over $\alpha$ gives the dual problem:
$\max_{\alpha} \ e^T \alpha - \dfrac{1}{2} \alpha^T Q \alpha$

subject to $y^T \alpha = 0$
           $0 \le \alpha_i \le C$

The constraint $\alpha_i \ge 0$, together with $\beta_i \ge 0$ and $C - \alpha_i - \beta_i = 0$, enforces $0 \le \alpha_i \le C$.
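To make the dual concrete, here is a sketch (illustrative toy data; SciPy's general-purpose SLSQP solver rather than a dedicated SVM optimizer) that solves the dual above for a linear kernel and recovers $w$ and $b$ from the relations on the previous slides:

# Solve: max_a e^T a - (1/2) a^T Q a  s.t.  y^T a = 0, 0 <= a_i <= C.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [4.0, 4.5], [5.0, 4.0]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

K = X @ X.T                          # linear kernel matrix
Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j K(x_i, x_j)

res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),  # minimize the negative
               x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})

alpha = res.x
w = (alpha * y) @ X                  # w = sum_i alpha_i y_i phi(x_i)
free = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))  # a free support vector
b = y[free] - w @ X[free]
print("alpha =", alpha.round(4), "w =", w, "b =", b)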
• The primal and dual problems have the same KKT conditions
• Primal: # variables very large (shortcoming)
• Dual: # of variables = $l$
• $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$: an inner product in the high-dimensional feature space
• For special $\phi$, this inner product can be efficiently calculated, reducing the computational time.
Kernel function

$K(x, y) \equiv \phi(x)^T \phi(y)$

Commonly used kernels:
• Linear: $K(x, y) = x^T y$
• Polynomial of degree $d$: $K(x, y) = (x^T y + 1)^d$
• Gaussian RBF: $K(x, y) = \exp(-\gamma \|x - y\|^2)$

The solution of an SVM has the form:

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{l} y_i \alpha_i^{*} K(x_i, x) + b^{*} \right)$
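The three kernels, written out in NumPy as a sketch (the default d and γ values here are arbitrary illustrative choices), together with the decision function in the form shown above:

# Commonly used kernels and the SVM decision function, as a sketch.
import numpy as np

def linear(x, y):
    return x @ y

def polynomial(x, y, d=3):
    return (x @ y + 1.0) ** d

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

# f(x) = sign(sum_i y_i alpha_i K(x_i, x) + b)
def predict(x, sv_x, sv_y, alpha, b, kernel=rbf):
    return np.sign(sum(a * yi * kernel(xi, x)
                       for a, yi, xi in zip(alpha, sv_y, sv_x)) + b)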
About RBF

Example: for $x, y \in R^1$, $K(x, y) = e^{-(x-y)^2}$. Then

$e^{-(x-y)^2} = e^{-x^2} e^{-y^2} e^{2xy} = e^{-x^2} e^{-y^2} \left[ 1 + \dfrac{2xy}{1!} + \dfrac{(2xy)^2}{2!} + \dfrac{(2xy)^3}{3!} + \cdots \right]$

$= e^{-x^2} \left[ 1, \sqrt{\tfrac{2}{1!}}\,x, \sqrt{\tfrac{2^2}{2!}}\,x^2, \sqrt{\tfrac{2^3}{3!}}\,x^3, \ldots \right] \cdot e^{-y^2} \left[ 1, \sqrt{\tfrac{2}{1!}}\,y, \sqrt{\tfrac{2^2}{2!}}\,y^2, \sqrt{\tfrac{2^3}{3!}}\,y^3, \ldots \right]^T = \phi(x)^T \phi(y)$

with $\phi(x) = e^{-x^2} \left( 1, \sqrt{\tfrac{2}{1!}}\,x, \sqrt{\tfrac{2^2}{2!}}\,x^2, \sqrt{\tfrac{2^3}{3!}}\,x^3, \ldots \right)$

i.e., the RBF kernel implicitly maps $x \in R^1$ into an infinite-dimensional feature space.
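A numeric sketch of this fact (my own check, not from the slides): truncating the expansion at n terms gives a finite-dimensional φ whose inner product already matches the 1-D RBF kernel to high precision.

# Truncated Taylor feature map approximates the 1-D RBF kernel.
import numpy as np
from math import factorial

def phi(x, n=20):
    # phi_k(x) = e^{-x^2} sqrt(2^k / k!) x^k,  k = 0 .. n-1
    return np.array([np.exp(-x**2) * np.sqrt(2.0**k / factorial(k)) * x**k
                     for k in range(n)])

x, y = 0.8, -0.3
print(phi(x) @ phi(y))          # truncated feature-space inner product
print(np.exp(-(x - y) ** 2))    # exact kernel value: nearly identical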
Model selection
(Tuning SVM hyperparameters)

• Cross-validation: can help avoid overfitting
• Example: 10-fold cross-validation. The $l$ data points are separated into 10 groups; each time, 9 groups serve as training data and 1 group as test data.
• LOO (leave-one-out): cross-validation with $l$ groups; each time, $l - 1$ points are used for training and 1 for testing.
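A sketch of 10-fold cross-validation in scikit-learn (the breast-cancer data set here is the library's built-in version, standing in for the slides' data; the C and gamma values are illustrative):

# 10-fold cross-validation for an RBF-kernel SVM.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma=0.002), X, y, cv=10)
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))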
Model Selection of SVMs Using GA Approach

• Peng-Wei Chen, Jung-Ying Wang and Hahn-Ming Lee; 2004 IJCNN International Joint Conference on Neural Networks, 26-29 July 2004.
• Abstract: A new automatic search methodology for model selection of support vector machines, based on a GA-based tuning algorithm, is proposed to search for the adequate hyperparameters of SVMs.
Model Selection of SVMs Using GA Approach

Procedure: GA-based Model Selection Algorithm
Begin
  Read in dataset;
  Initialize hyperparameters;
  While (not termination condition) do
    Train SVMs;
    Estimate generalization error;
    Create hyperparameters by tuning algorithm;
  End
  Output the best hyperparameters;
End
Experiment Setup

• The initial population is selected at random, and each chromosome consists of one bit string of fixed length 20.
• Each bit can have the value 0 or 1.
• The first 10 bits encode the integer value of C, and the remaining 10 bits encode the decimal value of σ (see the decoding sketch below).
• The suggested population size N = 20 is used.
• A crossover rate of 0.8 and a mutation rate of 1/20 = 0.05 are chosen.
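The slides fix the encoding but not the exact decoding rule, so this sketch assumes both halves are read as plain binary integers, with σ scaled into [0, 1]:

# Decode a 20-bit GA chromosome into (C, sigma); the scaling is an assumption.
import random

def decode(bits):
    c_int = int("".join(map(str, bits[:10])), 2)           # 0 .. 1023
    sigma = int("".join(map(str, bits[10:])), 2) / 1023.0  # 0 .. 1
    return max(c_int, 1), sigma

random.seed(0)
chromosome = [random.randint(0, 1) for _ in range(20)]
C, sigma = decode(chromosome)
print("C =", C, "sigma =", sigma)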
Coding for Weka

@relation breast_training
@attribute a1 real
@attribute a2 real
@attribute a3 real
@attribute a4 real
@attribute a5 real
@attribute a6 real
@attribute a7 real
@attribute a8 real
@attribute a9 real
@attribute class {2,4}
Coding for Weka

@data
5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
8,10,10,7,10,10,7,3,8,4
8,10,5,3,8,4,4,10,3,4
10,3,5,4,3,7,3,5,3,4
6,10,10,10,10,10,8,10,10,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
2,1,1,1,2,1,1,1,5,2
Running Results: Using Weka 3.3.6

Predictor: Support Vector Machines (in Weka: the Sequential Minimal Optimization (SMO) algorithm)
Weka SMO result for 400 training data:
Software and Model Selection

• Software: LIBSVM
• Mapping function: the Radial Basis Function (RBF) kernel
• Find the best penalty parameter C and kernel parameter γ
• Use cross-validation to do the model selection
LIBSVM Model Selection Using the Grid Method

-c 1000 -g 10      3-fold accuracy = 69.8389
-c 1000 -g 1000    3-fold accuracy = 69.8389
-c 1    -g 0.002   3-fold accuracy = 97.0717   <- winner
-c 1    -g 0.004   3-fold accuracy = 96.9253
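The same kind of grid search is easy to reproduce as a sketch (scikit-learn's GridSearchCV on its built-in breast-cancer data, standing in for the original LIBSVM runs; the grid values mirror the ones above):

# Grid search over (C, gamma) with 3-fold cross-validation, as above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [1, 10, 100, 1000],
                     "gamma": [0.002, 0.004, 1, 10, 1000]},
                    cv=3)
grid.fit(X, y)
print("winner:", grid.best_params_, "accuracy:", grid.best_score_)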
Coding for LIBSVM

2 1:2 2:3 3:1 4:1 5:5 6:1 7:1 8:1 9:1
2 1:3 2:2 3:2 4:3 5:2 6:3 7:3 8:1 9:1
4 1:10 2:10 3:10 4:7 5:10 6:10 7:8 8:2 9:1
2 1:4 2:3 3:3 4:1 5:2 6:1 7:3 8:3 9:1
2 1:5 2:1 3:3 4:1 5:2 6:1 7:2 8:1 9:1
2 1:3 2:1 3:1 4:1 5:2 6:1 7:1 8:1 9:1
4 1:9 2:10 3:10 4:10 5:10 6:10 7:10 8:10 9:1
2 1:5 2:3 3:6 4:1 5:2 6:1 7:1 8:1 9:1
4 1:8 2:7 3:8 4:2 5:4 6:2 7:5 8:10 9:1
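Each line is a label followed by sparse index:value pairs. As a sketch, scikit-learn can write and read this format directly (LIBSVM's own svm-train reads the same files; the file name here is only illustrative):

# Round-trip the LIBSVM sparse format.
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[2, 3, 1], [10, 10, 10]])
y = np.array([2, 4])
dump_svmlight_file(X, y, "breast.libsvm", zero_based=False)  # label index:value
X2, y2 = load_svmlight_file("breast.libsvm")
print(X2.toarray(), y2)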
Multi-class SVM

• One-against-all method
  – k SVM models (k: the number of classes)
  – the i-th SVM is trained with all examples in the i-th class as positive and all others as negative
• One-against-one method
  – k(k-1)/2 classifiers, where each one is trained on data from two classes
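A sketch of both strategies (scikit-learn used for illustration: its SVC implements one-against-one internally, and OneVsRestClassifier provides one-against-all):

# One-against-one vs. one-against-all on a k = 3 class problem.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # k = 3 classes
ovo = SVC(kernel="rbf").fit(X, y)                 # k(k-1)/2 = 3 classifiers
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # k = 3 models
print(ovo.predict(X[:3]), ova.predict(X[:3]))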
SVM Application in Bioinformatics
• Prediction of protein secondary structure
• SVM application in protein fold assignment
Introduction to Secondary Structure
• The prediction of protein secondary structure is an important step to determine structural properties of proteins.
• The secondary structure consists of local folding regularities maintained by hydrogen bonds and is traditionally subdivided into three classes: alpha-helices, beta-sheets, and coil.
Coding Example: Protein Secondary Structure Prediction

• Given an amino-acid sequence
• Predict a secondary-structure state (α, β, or coil) for each residue in the sequence
• Coding: consider a moving window of n (typically 13-21) neighboring residues (see the sketch below)

FGWYALVLAMFFYOYQEKSVMKKGD
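A sketch of such a window coding (an illustrative one-hot variant; the slides' actual coding uses alignment profiles, described later): a window of n residues centered on each position becomes an n x 20 input vector.

# Window-based coding of a residue position, one-hot over 20 amino acids.
import numpy as np

AMINO = "ARNDCQEGHILKMFPSTWYV"
IDX = {a: i for i, a in enumerate(AMINO)}

def window_features(seq, pos, n=13):
    half = n // 2
    feat = np.zeros((n, 20))
    for k in range(-half, half + 1):
        i = pos + k
        # out-of-range positions and non-standard letters stay all-zero
        if 0 <= i < len(seq) and seq[i] in IDX:
            feat[k + half, IDX[seq[i]]] = 1.0
    return feat.ravel()            # length n * 20 input vector for the SVM

seq = "FGWYALVLAMFFYOYQEKSVMKKGD"   # the example sequence above
print(window_features(seq, 5).shape)  # (260,)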
Methods

1. Statistical information (Figureau et al., 2003; Yan et al., 2004)
2. Neural networks (Qian and Sejnowski, 1988; Rost and Sander, 1993; Pollastri et al., 2002; Cai et al., 2003; Kaur and Raghava, 2004; Wood and Hirst, 2004; Lin et al., 2005)
3. Nearest-neighbor algorithms
4. Hidden Markov models
5. Support vector machines (Hua and Sun, 2001; Hyunsoo and Haesun, 2003; Ward et al., 2003; Guo et al., 2004)
Milestone

• In 1988, neural networks first achieved about 62% accuracy (Qian and Sejnowski, 1988; Holley and Karplus, 1989).
• In 1993, using evolutionary information, a neural network system improved the prediction accuracy to over 70% (Rost and Sander, 1993).
• More recently, several neural-network approaches (e.g. Baldi et al., 1999; Petersen et al., 2000; Pollastri and McLysaght, 2005) have achieved even higher accuracy (> 78%).
Benchmark (Data Sets Used in Protein Secondary Structure Prediction)

• Rost and Sander data set (Rost and Sander, 1993), referred to as RS126
• The RS126 data set consists of 25,184 data points in three classes: 47% coil, 32% helix, and 21% strand.
• Cuff and Barton data set (Cuff and Barton, 1999), referred to as CB513
• Prediction accuracy is verified by 7-fold cross-validation.
Secondary Structure Assignment

• Assignments follow the DSSP (Dictionary of Secondary Structures of Proteins) algorithm (Kabsch and Sander, 1983), which distinguishes eight secondary-structure classes.
• We converted the eight types into three classes in the following way: H (α-helix), I (π-helix), and G (3₁₀-helix) as helix (α); E (extended strand) as β-strand (β); and all others as coil (c).
• Different conversion methods influence the prediction accuracy to some extent, as discussed by Cuff and Barton (Cuff and Barton, 1999).
Assessment of Prediction Accuracy

• Overall three-state accuracy Q3 (Qian and Sejnowski, 1988; Rost and Sander, 1993). Q3 is calculated by:

$Q_3 = \dfrac{q_\alpha + q_\beta + q_{coil}}{N} \times 100$

• $N$ is the total number of residues in the test data sets, and $q_s$ is the number of residues of secondary-structure type $s$ that are predicted correctly.
• Similarly, the per-state accuracy is $Q_i = \dfrac{c_i}{n_i} \times 100$, where $c_i$ is the number of correctly predicted residues in state $i$ and $n_i$ the number of residues observed in state $i$.
Assessment of Prediction Accuracy

• A more complicated measure of accuracy is Matthews' correlation coefficient (MCC), introduced in (Matthews, 1975):

$C_i = \dfrac{TP_i \cdot TN_i - FP_i \cdot FN_i}{\sqrt{(TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i)}}$

• where $TP_i$, $TN_i$, $FP_i$ and $FN_i$ are the numbers of true positives, true negatives, false positives, and false negatives for class $i$, respectively. A higher MCC is better.
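A sketch computing both measures from the two formulas above (the three-state labels here are toy values for illustration):

# Q3 and per-class Matthews correlation coefficient.
import numpy as np

def q3(true, pred):
    return 100.0 * np.mean(np.array(true) == np.array(pred))

def mcc(true, pred, cls):
    t = np.array(true) == cls
    p = np.array(pred) == cls
    tp = np.sum(t & p); tn = np.sum(~t & ~p)
    fp = np.sum(~t & p); fn = np.sum(t & ~p)
    denom = np.sqrt(float((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0

true = list("HHHEECCHHE")   # H: helix, E: strand, C: coil
pred = list("HHEEECCHHC")
print(q3(true, pred), mcc(true, pred, "H"))   # 80.0, ~0.816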
Support Vector Machines Predictor

• We use the software LIBSVM (Chang and Lin, 2005) as our SVM predictor
• The RBF kernel is used for all experiments
• Optimal parameters are chosen for the SVM: the pair C = 10 and γ = 0.01 achieves the best prediction rate
Coding Scheme

SH3 sequence to process:   N S T N K D W W K

multiple alignment:
  a1  N K S N P D W W E
  a2  E E H . G E W W K
  a3  R S T . G D W W L
  a4  F S . . . . F F G

numeric profile (rows: the 20 amino acids; columns: the nine sequence positions):
   1 V   0   0   0   0   0   0   0   0   0
   2 L   0   0   0   0   0   0   0   0  20
   3 I   0   0   0   0   0   0   0   0   0
   4 M   0   0   0   0   0   0   0   0   0
   5 F  20   0   0   0   0   0  20  20   0
   6 W   0   0   0   0   0   0  80  80   0
   7 Y   0   0   0   0   0   0   0   0   0
   8 G   0   0   0   0  50   0   0   0  20
   9 A   0   0   0   0   0   0   0   0   0
  10 P   0   0   0   0  25   0   0   0   0
  11 S   0  60  25   0   0   0   0   0   0
  12 T   0   0  50   0   0   0   0   0   0
  13 C   0   0   0   0   0   0   0   0   0
  14 H   0   0  25   0   0   0   0   0   0
  15 R  20   0   0   0   0   0   0   0   0
  16 K   0  20   0   0  25   0   0   0  40
  17 Q   0   0   0   0   0   0   0   0   0
  18 E  20  20   0   0   0  25   0   0  20
  19 N  40   0   0 100   0   0   0   0   0
  20 D   0   0   0   0   0  75   0   0   0

Coding output (e.g. for the fifth position): G = 0.50, P = 0.25, K = 0.25
Coding Scheme for Support Vector Machines

• We use BLASTP to generate the alignments (profiles) of the proteins in our database
• An expectation value (E-value) threshold of 10.0 is used, searching against the non-redundant protein sequence database (NCBI nr)
• The profile data obtained from BLASTPGP are normalized to the range 0-1 and then used as inputs to our SVM predictor
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts:

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V |  A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V |
 1 R  -1  5 -1 -1 -4  3  0 -2 -1 -3 -3  3 -2 -3 -2 -1 -1 -3 -2 -3 |  0 50  0  0  0 17  0  0  0  0  0 33  0  0  0  0  0  0  0  0 | 0.63 0.09
 2 T   0 -2 -1 -2  5 -1 -2 -2 -2 -2 -2 -1 -2 -3 -2  3  4 -3 -2 -1 |  0  0  0  0 22  0  0  0  0  0  0  0  0  0  0 30 48  0  0  0 | 0.54 0.15
 3 D  -1 -3  4  1 -4 -2 -2  6 -2 -5 -5 -2 -4 -4 -3 -1 -2 -4 -4 -4 |  0  0 24  8  0  0  0 68  0  0  0  0  0  0  0  0  0  0  0  0 | 1.04 0.34
 4 C  -2  0  1 -3  9 -3 -3 -3 -3 -2 -2 -3 -2  0 -3 -1  2 -3 -3 -2 |  0  6 10  0 61  0  0  0  0  0  0  0  0  6  0  0 17  0  0  0 | 1.14 0.38
 5 Y  -3 -3 -4 -5 -4 -3 -4 -5  0 -3 -2 -3 -2  2 -4 -3 -3  1  9 -3 |  0  0  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  0 97  0 | 1.84 0.53
 6 G   1 -3 -2 -2 -4 -2  0  6 -3 -4 -5 -3 -4 -4 -3 -1  1 -4 -4 -4 | 11  0  0  0  0  0  9 72  0  0  0  0  0  0  0  0  9  0  0  0 | 1.10 0.56
 7 N  -3 -2  4  4  4  0  1 -3  2 -3 -3 -1  2 -4 -3 -1 -2 -5 -3 -2 |  0  0 30 26 11  3  9  0  5  0  0  3  9  0  0  2  0  0  0  2 | 0.64 0.56
 8 V  -2 -4 -4 -4 -3 -4 -4 -5 -5  6  0 -4  0 -2 -4 -3  1 -4 -3  3 |  0  0  0  0  0  0  0  0  0 68  0  0  0  0  0  0  9  0  0 23 | 0.95 0.56
 9 N   1  1  3 -2 -3  1  1 -3 -2 -2 -2 -1  5 -3 -3  2  2 -4 -3 -2 | 11  8 14  0  0  6  8  0  0  0  0  0 23  0  0 17 12  0  0  0 | 0.40 0.57
10 R   0  2  1 -1  2  1  0 -3 -2 -3 -2  1 -3 -4 -3  2  3 -4 -3 -3 |  5 11  5  3  5  7  5  0  0  0  3 11  0  0  0 20 24  0  0  0 | 0.38 0.57
11 I   0 -4 -4 -5 -3 -3 -4 -4 -4  4  3 -4  3 -2 -4 -3 -2 -4 -3  3 |  9  0  0  0  0  0  0  0  0 29 33  0 10  0  0  0  0  0  0 20 | 0.65 0.57
12 D  -2 -2 -1  4 -4  3  3 -3 -2 -4 -4  1 -3 -5 -3  2  1 -5 -4 -4 |  0  0  0 26  0 16 26  0  0  0  0  7  0  0  0 17  7  0  0  0 | 0.67 0.57
SVM Application in Protein Fold Assignment

"Fine-grained Protein Fold Assignment by Support Vector Machines using generalized n-peptide Coding Schemes and jury voting from multiple parameter sets", Chin-Sheng Yu, Jung-Ying Wang, Jin-Moon Young, P.-C. Lyu, Chih-Jen Lin, Jenn-Kang Hwang, Proteins: Structure, Function, and Genetics, 50, 531-536 (2003).
Data Sets

• The data set of Ding and Dubchak, which consists of 386 proteins from the 27 most populated SCOP folds, in which protein pairs have sequence identity below 35% for aligned subsequences longer than 80 residues.
• These 27 protein folds cover most major structural classes, and each contains at least seven proteins.
Coding Scheme

• We denote the coding schemes by X if all 20 amino acids are used
• X' if the amino acids are classified into four groups: charged, polar, aromatic, and nonpolar
• X'' if predicted secondary structures are used
• We assign the symbol X the values of D, T, Q, and P, denoting the distributions of dipeptides, 3-peptides, and 4-peptides, respectively (a dipeptide-composition sketch follows below)
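As referenced above, a sketch of the dipeptide-composition coding (scheme D with all 20 amino acids, under the assumption that counts are normalized to frequencies): each protein becomes a 400-dimensional vector.

# Dipeptide composition: a 400-dimensional frequency vector per protein.
import numpy as np
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"
PAIRS = {"".join(p): i for i, p in enumerate(product(AMINO, repeat=2))}

def dipeptide_composition(seq):
    v = np.zeros(len(PAIRS))                 # 20 * 20 = 400 entries
    for a, b in zip(seq, seq[1:]):
        if a + b in PAIRS:
            v[PAIRS[a + b]] += 1.0
    return v / max(len(seq) - 1, 1)          # normalize to frequencies

print(dipeptide_composition("NSTNKDWWK").sum())  # 1.0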
Structure Example: Jury SVM Predictor

[Figure: individual 1-vs-1 SVM predictors built from parameter sets C, S, H, V, P, Z and their combinations (CS, HZ, SV, CSH, VPZ, HVP, CSHV, CSHVPZ); a jury combines their outputs by voting to produce the prediction result.]
Structure Example: SVM Combiner

[Figure: four SVMs, each trained on a different coding scheme (1-4), feed an SVM combiner that produces the final prediction.]