Introduction
Support Vector Regression
QSAR Problems and Data
SVMs for QSAR
Linear Program Feature Selection
Model Selection and Bagging
Computational Results
Discussion
Support Vector Regression
ε-insensitive loss function

$$
|y - (w \cdot x + b)|_{\varepsilon} = \max\{0,\ |y - (w \cdot x + b)| - \varepsilon\}
$$
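The ε-insensitive loss can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function names `eps_insensitive_loss` and `linear_predict` and the sample values are hypothetical and not taken from the slides.

```python
import numpy as np

def linear_predict(X, w, b):
    """Linear model w.x + b for every row of X."""
    return np.asarray(X, dtype=float) @ np.asarray(w, dtype=float) + b

def eps_insensitive_loss(y, y_pred, eps=0.1):
    """Elementwise max(0, |y - y_pred| - eps): errors inside the eps tube cost nothing."""
    residual = np.abs(np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - eps)

# Illustrative values only: the first prediction lies inside the tube, the others outside.
losses = eps_insensitive_loss([1.0, 1.0, 1.0], [1.05, 1.3, 0.5], eps=0.1)
print(losses)
```

Only residuals larger than ε contribute, which is what makes the SVR solution depend on a subset of the training points.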
Quadratic SVMs with L2-norm
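$$
\begin{aligned}
\min_{\alpha,\alpha^*}\quad & \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\,K(x_i,x_j) + \varepsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*) - \sum_{i=1}^{l} y_i(\alpha_i-\alpha_i^*) \\
\text{s.t.}\quad & \sum_{i=1}^{l}(\alpha_i-\alpha_i^*) = 0, \\
& 0 \le \alpha_i,\ \alpha_i^* \le C, \quad i = 1,\dots,l.
\end{aligned}
$$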
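To make the quadratic dual concrete, the sketch below evaluates its objective and checks its constraints for a given pair (α, α*). It assumes NumPy; the helper names and the two-point toy problem with a linear kernel are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def svr_dual_objective(alpha, alpha_star, K, y, eps):
    """0.5 * d'Kd + eps * sum(alpha + alpha*) - y'd, with d = alpha - alpha*."""
    d = alpha - alpha_star
    return 0.5 * d @ K @ d + eps * np.sum(alpha + alpha_star) - y @ d

def is_dual_feasible(alpha, alpha_star, C, tol=1e-9):
    """Check the box constraints 0 <= alpha, alpha* <= C and sum(alpha - alpha*) = 0."""
    both = np.concatenate([alpha, alpha_star])
    in_box = np.all((both >= -tol) & (both <= C + tol))
    balanced = abs(np.sum(alpha - alpha_star)) <= tol
    return bool(in_box and balanced)

# Toy problem (assumed for illustration): two 1-D points, linear kernel K = X X'.
X = np.array([[0.0], [1.0]])
K = X @ X.T
y = np.array([0.0, 1.0])
alpha = np.array([0.0, 0.5])
alpha_star = np.array([0.5, 0.0])

value = svr_dual_objective(alpha, alpha_star, K, y, eps=0.1)
feasible = is_dual_feasible(alpha, alpha_star, C=1.0)
print(value, feasible)
```

A quadratic programming solver would minimize this objective over the feasible set; the sketch only evaluates it at one point.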
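Linear SVMs with L1-norm (ν-SVR)

$$
\begin{aligned}
\min_{\alpha,\alpha^*,\xi,\xi^*,\varepsilon,b}\quad & \sum_{j=1}^{l}(\alpha_j+\alpha_j^*) + C\,\frac{1}{l}\sum_{i=1}^{l}(\xi_i+\xi_i^*) + C\nu\varepsilon \\
\text{s.t.}\quad & y_i - \sum_{j=1}^{l}(\alpha_j-\alpha_j^*)\,K(x_i,x_j) - b \le \varepsilon + \xi_i, \\
& \sum_{j=1}^{l}(\alpha_j-\alpha_j^*)\,K(x_i,x_j) + b - y_i \le \varepsilon + \xi_i^*, \\
& \alpha_j,\ \alpha_j^*,\ \xi_i,\ \xi_i^*,\ \varepsilon \ge 0, \quad i,j = 1,\dots,l.
\end{aligned}
$$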
QSAR Problems and Data
QSAR workflow:
• Preparation of input data (bioactivity values, structures)
• 3D geometry optimization
• Calculation of descriptors
• Statistical analysis and QSAR model building
• SVMs for QSAR
Data Sets
• HIV dataset: five classes of anti-HIV molecules; 64 molecules, 620 descriptors
• Lombardo benchmark dataset: blood-brain barrier partitioning; 62 molecules, 649 descriptors
Data Matrix

              descriptor 1   descriptor 2   descriptor 3   ...   descriptor m   Activity
Molecule 1    x_11           x_12           x_13           ...   x_1m           ln BB
Molecule 2    x_21           x_22           x_23           ...   x_2m           ln BB
...
Molecule n    x_n1           x_n2           x_n3           ...   x_nm           ln BB
SVMs for QSAR
• Construct datasets
• Feature selection
• Model selection (C, ε, ν, and the kernel parameter)
• Optimize model
• Bagging models
• Final model
Linear Program Feature Selection
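$$
\begin{aligned}
\min_{\alpha,\alpha^*,\xi,\xi^*,\varepsilon,b}\quad & \sum_{j=1}^{n}(\alpha_j+\alpha_j^*) + C\,\frac{1}{l}\sum_{i=1}^{l}(\xi_i+\xi_i^*) + C\nu\varepsilon \\
\text{s.t.}\quad & y_i - \sum_{j=1}^{n}(\alpha_j-\alpha_j^*)\,x_{ij} - b \le \varepsilon + \xi_i, \\
& \sum_{j=1}^{n}(\alpha_j-\alpha_j^*)\,x_{ij} + b - y_i \le \varepsilon + \xi_i^*, \\
& \alpha_j,\ \alpha_j^*,\ \xi_i,\ \xi_i^*,\ \varepsilon \ge 0, \quad i = 1,\dots,l,\ j = 1,\dots,n.
\end{aligned}
$$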
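A linear program of this shape can be handed to an off-the-shelf LP solver. The sketch below is one possible encoding, assuming NumPy and SciPy's `linprog` with the HiGHS backend. It treats w_j = α_j − α_j* as the weight of descriptor j and keeps the descriptors with nonzero weight. The function name, the default values of C and ν, the tolerance, and the noise-free synthetic data are all assumptions for illustration, not the authors' settings.

```python
import numpy as np
from scipy.optimize import linprog

def lp_feature_selection(X, y, C=1000.0, nu=0.2, tol=1e-6):
    """Solve the L1-norm linear-SVR LP and return (selected features, weights w, offset b)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    l, n = X.shape
    I, Z, ones = np.eye(l), np.zeros((l, l)), np.ones((l, 1))

    # Variable order: alpha (n), alpha* (n), xi (l), xi* (l), eps (1), b (1).
    cost = np.concatenate([np.ones(2 * n), np.full(2 * l, C / l), [C * nu], [0.0]])

    # y_i - sum_j (alpha_j - alpha*_j) x_ij - b <= eps + xi_i
    upper = np.hstack([-X, X, -I, Z, -ones, -ones])
    # sum_j (alpha_j - alpha*_j) x_ij + b - y_i <= eps + xi*_i
    lower = np.hstack([X, -X, Z, -I, -ones, ones])
    A_ub = np.vstack([upper, lower])
    b_ub = np.concatenate([-y, y])

    # Every variable is nonnegative except the offset b, which is free.
    bounds = [(0, None)] * (2 * n + 2 * l + 1) + [(None, None)]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    w = res.x[:n] - res.x[n:2 * n]
    selected = np.flatnonzero(np.abs(w) > tol)
    return selected, w, res.x[-1]

# Synthetic, noise-free demo (assumed data): only descriptors 0 and 2 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 10))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2]
selected, w, b = lp_feature_selection(X_demo, y_demo)
print(selected, w[selected])
```

The L1 penalty on the weights is what drives most of them to exactly zero, so the surviving descriptors form the reduced feature set passed on to the later modelling steps.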
Bagging
• Different validation sets give different models
• Many local minima in SVM parameter search
• Average models
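Averaging models can be sketched as follows. This is a minimal illustration assuming NumPy: each model is fit on a bootstrap resample of the training data and the predictions are averaged. A least-squares line stands in for the SVM regressor, and all names and data here are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in base learner: least squares with an intercept (an SVM would go here)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def bagged_predict(X_train, y_train, X_new, n_models=10, seed=0):
    """Fit n_models on bootstrap resamples and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        coef = fit_linear(X_train[idx], y_train[idx])
        predictions.append(predict_linear(coef, X_new))
    return np.mean(predictions, axis=0)

# Assumed toy data: an exactly linear response, so every resampled model agrees.
X_train = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y_train = 3.0 * X_train[:, 0] + 1.0
X_new = np.array([[0.25], [0.75]])
bagged = bagged_predict(X_train, y_train, X_new)
print(bagged)
```

With noisy data the individual models differ, and the average is typically more stable than any single one of them.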
Model Selection
• Choose the SVM model parameters: C, ε or ν, and the kernel parameter
• Select evaluation function Q2
• Evaluate on testing data
• Adjust using cross validation
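The evaluation step can be sketched as below, assuming NumPy. The slides do not define Q2 and q2, so the definitions used here are assumptions: Q2 is taken as the held-out squared error divided by the total sum of squares, and q2 as one minus the squared correlation between observed and predicted responses, so that lower is better for both. The least-squares stand-in model and all names are hypothetical.

```python
import numpy as np

def Q2(y, y_pred):
    """Assumed definition: sum of squared errors / total sum of squares (lower is better)."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

def q2(y, y_pred):
    """Assumed definition: 1 - r^2, with r the observed/predicted correlation."""
    r = np.corrcoef(y, y_pred)[0, 1]
    return 1.0 - r ** 2

def cross_val_predict(fit, predict, X, y, k=10, seed=0):
    """k-fold cross validation: each point is predicted by a model that never saw it."""
    rng = np.random.default_rng(seed)
    y_pred = np.empty(len(y))
    for test_idx in np.array_split(rng.permutation(len(y)), k):
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        model = fit(X[train], y[train])
        y_pred[test_idx] = predict(model, X[test_idx])
    return y_pred

# Stand-in learner (assumption): least squares with intercept instead of an SVM.
def fit(X, y):
    return np.linalg.lstsq(np.column_stack([X, np.ones(len(X))]), y, rcond=None)[0]

def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

Q2_value = Q2([0, 1, 2, 3], [0, 1, 2, 4])
q2_value = q2([0, 1, 2, 3], [0, 1, 2, 4])

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = X_demo @ np.array([1.0, -2.0]) + 0.5
cv_Q2 = Q2(y_demo, cross_val_predict(fit, predict, X_demo, y_demo, k=10))
print(Q2_value, q2_value, cv_Q2)
```

In a parameter search, the cross-validated score would be recomputed for each candidate setting and the setting with the lowest value kept.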
Computational Results
Methods (10-fold CV)    Full Data (649)    LP FS (21)       NN SA (9)
                        Q2      q2         Q2      q2       Q2      q2
L1-SVM                  .384    .382       .157    .153     .219    .217
L2-SVM                  .310    .292       .171    .160     .247    .245
NN                      .320    .301       .222    .193     .247    .238
[Figure: scatterplot SVM1LOMBFULL, predicted response vs. observed response, points labeled 1 to 62. Q2 = 0.384, q2 = 0.382, RMSE = 0.500]
[Figure: scatterplot SVM1LOMBLPFS, predicted response vs. observed response, points labeled 1 to 62. Q2 = 0.157, q2 = 0.153, RMSE = 0.316]
[Figure: scatterplot SVM1LOMBNNSA, predicted response vs. observed response, points labeled 1 to 62. Q2 = 0.219, q2 = 0.217, RMSE = 0.117]
[Figure: scatterplot SVM2LOMBLPFS, predicted response vs. observed response, points labeled 1 to 62. Q2 = 0.171, q2 = 0.160, RMSE = 0.104]
This work is supported by NSF (IIS-9979860 and 970923)
Discussion
• Robust optimization methods
• LPFS outperforms NNSA
• L1-SVM can run faster than L2-SVM
• May improve the LPFS method
• May improve the performance of L1-SVM