SVM and a Novel POOL Method Coupled with THEMATICS for
Protein Active Site Prediction
A DISSERTATION
SUBMITTED TO THE COLLEGE OF COMPUTER AND INFORMATION SCIENCE
OF NORTHEASTERN UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
By
Wenxu Tong
April 2008
©Wenxu Tong, 2008
ALL RIGHTS RESERVED
Acknowledgements
I have many people to thank. I am most indebted to my advisor, Dr. Ron Williams. He generously
allowed me to work on a problem using an idea he developed decades ago, and guided me through the
research to turn it from a mere idea into a useful system for solving important problems. Without his
kindness, wisdom and persistence, this dissertation would not exist.
Another person I have been fortunate to meet and work with is Dr. Mary Jo Ondrechen, my co-advisor. It was
she who developed the THEMATICS method, on which this dissertation builds. Her guidance and help
during my research were critical to me.
I am grateful to all the committee members for reading and commenting on my dissertation. In particular, I am
thankful to Dr. Jay Aslam, who provided a great deal of advice for my research, and Dr. Bob Futrelle, who
brought me into the field and provided much help in my writing of this dissertation. I also thank Dr. Budil
for the time he spent serving on my committee, especially given the difficulties and
inconvenience he unfortunately experienced during that time.
I am fortunate to have worked in the THEMATICS group with Dr. Leo Murga, Dr. Ying Wei and my fellow
graduate student, soon to be Dr., Heather Brodkin. I also thank Dr. Jun Gong, Dr. Emine Yilmaz and
soon-to-be Dr. Virgiliu Pavlu. Without their generous help, my journey through the tunnel towards my degree
would have been much darker and harder.
I would not be who I am without the love and support of my parents, Yunkun Tong and Xiazhu
Wang. Thanks to their raising us and giving us the best education possible through all the hardship they
endured, both my sister, Wenyi Tong, and I have received Ph.D. degrees, the highest degree one can
expect. I am so grateful to them and proud of them.
Last but definitely not least, I thank Ying Yang, my beloved wife. Without her patience and
confidence in me, I could not imagine having done what I have done, and I humbly dedicate this
dissertation to her.
Table of Contents
Abstract
1. Introduction
1.1 THEMATICS and protein active site prediction
1.2 Machine Learning
2. Background and Related Work
2.1 Protein Active Site Prediction
2.2 Machine Learning
2.2.1 Commonly used supervised learning methods
2.2.2 Probability based approach
2.2.3 Performance measure for classification problems
2.3 THEMATICS
2.3.1 The THEMATICS method and its features
2.3.2 Statistical analysis with THEMATICS
2.3.3 Challenges of the site prediction problem using THEMATICS data
3. Applying SVM to THEMATICS
3.1 Introduction
3.2 THEMATICS curve features used in the SVM
3.3 Training
3.4 Results
3.4.1 Success in site prediction
3.4.2 Success in catalytic residue prediction
3.4.3 Incorporation of non-ionizable residues
3.4.4 Comparison with other methods
3.5 Discussion
3.5.1 Cluster number and size
3.5.2 Failure analysis
3.5.3 Analysis of high filtration ratio cases
3.5.4 Some specific examples
3.6 Conclusions
3.7 Next step
4. New Method: Partial Order Optimal Likelihood (POOL)
4.1 Ways to estimate class probabilities
4.1.1 Simple joint probability table look-up
4.1.2 Naïve Bayes method
4.1.3 The K-nearest-neighbor method
4.1.4 POOL method
4.1.5 Combining CPE's
4.2 POOL method in detail
4.2.1 Maximum likelihood problem with monotonicity assumption
4.2.2 Convex optimization and K.K.T. conditions
4.2.3 Finding Minimum Sum of Squared Error (SSE)
4.2.4 POOL algorithm
4.2.5 Proof that the POOL algorithm finds the minimum SSE
4.2.6 Maximum likelihood vs. minimum SSE
4.3 Additional computational steps
4.3.1 Preprocessing
4.3.2 Interpolation
5. Applying the POOL Method with THEMATICS in Protein Active Site Prediction
5.1 Introduction
5.2 THEMATICS curves and other features used in the POOL method
5.3 Performance measurement
5.4 Computational procedure
5.5 Results
5.5.1 Ionizable residues using only THEMATICS features
5.5.2 Ionizable residues using THEMATICS plus cleft information
5.5.3 All residues using THEMATICS plus cleft information
5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if applicable
5.5.5 Recall-filtration ratio curves
5.5.6 Comparison with other methods
5.5.7 Rank of the first positive
5.6 Discussion
6. Summary and Conclusions
6.1 Contributions
6.2 Future research
Appendices
Appendix A. The training set used in THEMATICS-SVM
Appendix B. The test set used in THEMATICS-SVM
Appendix C. The 64 protein testing set used in THEMATICS-POOL
Appendix D. The 160 protein testing set used in THEMATICS-POOL
Bibliography
List of Figures
Figure 2.1 Titration curves.
Figure 3.1 The success rate for site prediction on a per-protein basis.
Figure 3.2 Distribution of the 64 proteins across different values for the filtration ratio.
Figure 3.3 Recall-false positive rate plot (ROC curves) of SVM versus other methods.
Figure 3.4 SVM prediction for protein 1QFE.
Figure 3.5 The SVM prediction for 2PLC.
Figure 4.1 Three cases of $\vec{G}$ in relation to the convex cone of constraints.
Figure 5.1 Averaged ROC curve comparing POOL(T4), Wei’s statistical analysis and Tong’s SVM using THEMATICS features.
Figure 5.2 Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only.
Figure 5.3 Averaged ROC curve comparing POOL methods applied to ionizable residues only, CHAIN(TION, G), and to all residues, CHAIN(TALL, G).
Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues.
Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160 protein test set.
Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method.
Figure 5.7 Histogram of the first annotated active site residue.
List of Tables
Table 2.1 Confusion matrix of classification labeling.
Table 3.1 Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue.
Table 3.2 Comparison of THEMATICS-SVM and other methods.
Table 5.1 Wilcoxon signed-rank tests between methods shown in figure 5.2.
Table 5.2 Wilcoxon signed-rank tests between methods shown in figure 5.4.
Table 5.3 Comparison of sensitivity, precision, and AUC of CHAIN(T, G, C) with Youn’s reported results for proteins in the same family, super family, and fold.
Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method.
Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method.
List of Abbreviations
ANN Artificial Neural Network
ASA Area of Solvent Accessibility
AUC Area Under the Curve
AveS Averaged Specificity
CPE Class Probability Estimator
CSA Catalytic Site Atlas
E.C. number Enzyme Commission number
H-H equation Henderson-Hasselbalch equation
K.K.T. conditions Karush-Kuhn-Tucker conditions
k-NN k-Nearest Neighbor method
MAP Maximum a posteriori
MCC Matthews Correlation Coefficient
MAS Mean Average Specificity
ML Maximum Likelihood
PDB Protein Data Bank
POOL Partial Order Optimal Likelihood
RFR curve Recall-Filtration Ratio curve
ROC curve Receiver Operating Characteristic curve
SSE Sum of Squared Errors
SVM Support Vector Machine
THEMATICS Theoretical Microscopic Titration Curves
VC dimension Vapnik-Chervonenkis dimension
Abstract
Protein active site prediction is a very important problem in bioinformatics. THEMATICS is a simple and
effective method based on the special electrostatic properties of ionizable residues to predict such sites
from protein three-dimensional structure alone. The process involves distinguishing computed titration
curves with perturbed shape from normal ones; the differences are subtle in many cases. In this
dissertation, I develop and apply special machine learning techniques to automate the process and achieve
higher sensitivity than results from other methods while maintaining high specificity. I first present
the application of support vector machines (SVM) to automate active site prediction using THEMATICS;
at the time this work was developed, it achieved better performance than any other 3D-structure-based
method. I then present the more recently developed Partial Order Optimal Likelihood (POOL) method,
which estimates the probabilities of residues being active under certain natural monotonicity assumptions.
The dissertation shows that applying the POOL method just on THEMATICS features outperforms the
SVM results. Furthermore, since the overall approach is based on estimating certain probabilities from
labeled training data, it provides a principled way to combine the use of THEMATICS features with other
non-electrostatic features proposed by others. In particular, I consider the use of geometric features as
well, and the resulting classifiers are the best structure-only predictors yet found. Finally, I show that
adding in sequence-based conservation scores where applicable yields a method that outperforms all
existing methods while using only whatever combination of structure-based or sequence-based features is
available.
Chapter 1
Introduction
This dissertation employs both standard and novel machine learning techniques to automate one aspect of
the problem of protein function prediction from the three-dimensional structure. In addition to applying
established techniques, particularly the support vector machine (SVM), I introduce a novel method, called
partial order optimal likelihood (POOL), to select functionally important sites
in protein structures. In my approach to the protein function prediction problem, I start with just the 3D
structure of proteins and use THEMATICS, one of the most effective methods, which focuses on
electrostatic features of residues. Later, I also add some geometric features, and then the conservation of
residues among homologous sequences, when available, to achieve better results.
1.1 THEMATICS and protein active site prediction.
Function prediction (predicting protein function from protein structure) is an important and challenging
task in genomics and proteomics 1. More and more protein structures have been deposited in the Protein
Data Bank (PDB), many with unknown functions. As of this writing, there are over 3600
protein structures in the PDB of unknown or uncertain function. The recent development of high-throughput
methods 2-6 for generating structures from proteins expressed from gene sequences makes
effective and efficient function prediction even more important, as most of these Structural Genomics
proteins are of unknown function.
Determination of active sites, including enzyme catalytic sites, ligand binding sites, recognition epitopes,
and other functionally important sites is one of the keys to protein function prediction.
In addition, the importance of site prediction goes beyond predicting active sites for proteins with
unknown function. Even for a protein with known function, it is not necessarily true that the active site of
that protein is fully or partially characterized. Correctly finding the active site of a protein is always a
prerequisite to understanding the protein’s catalytic mechanism. It also opens the door to the design of
ligands to inhibit, activate, or otherwise modify the protein’s function. Protein engineering applications that
design a protein with a particular function 7-9 also require knowledge of the features needed to create a
functioning active site.
Because of the importance of active site prediction in genomics and proteomics, many different methods
have been developed to predict the active site of a protein 10-21. I survey some of these methods in a
later section. Among them, one particular method, THEMATICS (Theoretical Microscopic Titration
Curves), is powerful, accurate and precise 16, 22-24. Based on protein 3D structure alone, it can
predict correct active sites that are highly localized in small regions of the proteins’ structures.
The details of the THEMATICS method will be given later, but the key point is that it
takes advantage of the special chemical and electrostatic properties of active site residues: such
residues tend to have anomalous titration behavior. THEMATICS computes the titration curves of the
ionizable residues of a protein. In its original formulation, the presence of two or more residues with
perturbed titration curves in physical proximity is considered a reliable predictor of the active site of
a protein.
For the THEMATICS method to work well, one needs a criterion to distinguish the perturbed titration
curves from normal, unperturbed ones, which is not a trivial task. My work uses machine learning
to automate this process. With the SVM, I cast the problem as a
classification task, predicting each residue as either an active site residue or not. Later, I developed the
POOL method to solve this problem by rank-ordering the residues in a protein according to their
probability of being in an active site, based on how perturbed their titration curves are in addition to some
other 3D-structure-based information. Later still, sequence conservation information, if available, was
added as well.
1.2 Machine Learning.
Machine learning is a well-developed field in computer science. There are many types of tasks, ranging
from reinforcement learning 25 to more passive forms of learning, like supervised learning and
unsupervised learning 26. A typical supervised learning task is to learn a function from a
set of training data. If the output of the function is a label from some finite set of classes, it is called a
classification problem. If the output is a continuous value, it is called a regression problem. A training set
consists of training examples, i.e. pairs of input and output vectors. The machine-learning
problem is typically an optimization problem: find a function that maps any valid
input to an output and generalizes from the seen training data in a “reasonable” way. The quantity to be
minimized is typically the generalization error, which is the error the trained machine makes on unseen data
drawn from the same distribution as the population. For an unsupervised learning task, there is no labeled
training data. Usually, the task in unsupervised learning is to cluster the observed data with some criteria,
or fit a model to represent the observed data. I will focus on supervised learning; here my learning task is
essentially a classification task. As will be described below, in one part of this work the goal will be to
estimate actual class probabilities, which in some respects is like a regression problem.
Chapter 2
Background and Related Work
2.1 Protein Active Site Prediction.
Since the main focus of this dissertation is in Computer Science, I will just briefly survey some of the
methods used for this application, to serve as a background for the method comparison later in my
dissertation.
There are two major classes of methodology used to predict protein active sites. Almost all current
methods in active site prediction use one of them or a mix of both approaches.
The first methodology is based on sequence comparison, or evolutionary information derived from
sequence alignments. The rationale is that active sites of a protein are important regions in the proteins,
and that the amino acids, termed residues, in active sites therefore should be more conserved throughout
evolution than some other regions of the protein. If we can find highly conserved regions among
sequences in similar proteins from different sources (species/tissues), or even in different proteins but
with similar functions, most likely active sites should consist of subsets within these regions. This is a
valid assumption and indeed, many methods have been developed based on this approach, such as
ConSurf 27, Rate4Site 28, and others 29-32.
However, there are two drawbacks to this approach.
First, to obtain reliable results, this method needs at least 10, and preferably 50, different protein
sequences with a suitable degree of similarity. The method does not work
well if the similarities between sequences are either too high or too low. Studies have shown that
sequence-based methods can reliably transfer the extracted functional information only when applied to
sequences with at least 40% sequence identity 33, 34. This drawback makes the method unsuitable for
many proteins, particularly Structural Genomics proteins, since they often do not have enough similar
sequences within a suitable range of similarity.
Second, although most active site residues tend to be conserved through evolution, it is certainly not true
that all conserved regions of a protein are active sites. Residues in protein sequences can be conserved
for a variety of reasons, not just because of involvement in active sites. One well-known counterexample
is the set of residues that stabilize the structure of the protein; they are so important to the protein that
once mutated, the protein will not have the proper structure to perform its function. These residues will
be conserved among different protein homologues, even if they are not in active sites. Therefore, sites
predicted by sequence-based methods are typically non-local, spanning a much larger area than the true
active site. Another difficulty arises for cases where an active site region in a protein is less conserved
than other regions of the protein, especially when the function and/or substrate of the proteins in the class
are somewhat versatile.
The second methodology is structure-based active site prediction. There are different properties that have
been studied and used in different methods, such as electrostatic properties as in THEMATICS 16,
residue interaction as in the graph theoretic method SARIG 35, van der Waals binding energy of a probe
molecule as in Q-site Finder 20, geometric cleft location as in surfNet 36 and castP 37, and a geometric
shape descriptor termed geometric potential 38.
There are also studies that combine results from different methods, employing either statistical or
machine-learning techniques. Among such studies, I list a few examples that use either residue
properties or machine learning methods similar to those in my earlier and current studies. P-cats 21 uses a
k-nearest neighbor method to smooth the joint probability lookup table; a study by Gutteridge uses a
neural network and spatial clustering to predict the location of active sites 18; Petrova’s work uses a
support vector machine (SVM) to predict catalytic residues 39; and Youn’s work uses a support vector
machine to predict catalytic residues in proteins 40. All these methods use both sequence conservation and
3D structural information.
Depending on the properties on which these methods are based, their computational cost and accuracy vary.
Among all these methods, THEMATICS is the most accurate to date. The computational cost is
acceptable. Analyzing a typical protein using THEMATICS takes less than an hour on a desktop PC,
although actual CPU times depend on protein size. The details of this method will be explained in a later
section.
Although THEMATICS is the most effective and accurate method among these when used on its own, it
is natural to consider whether the predictions can be improved by using additional information. I examine
this using both geometric and conservation information, and find that this is indeed the case.
2.2 Machine Learning.
Machine learning is a very broad subfield of artificial intelligence. It is almost impossible to survey the
whole area in this dissertation. Here, I focus on just supervised learning, mostly classification. Even this
area is too broad to cover fully, so I only briefly introduce the framework and some of the most commonly
used methods with their basic principles.
2.2.1 Commonly used supervised learning methods.
The first method is called artificial neural network (ANN) or just neural network (NN) 26, 41. It is based on
a computational model of a group of interconnected nodes (neurons) akin to the nervous system in
humans. Each neuron has a certain number of inputs and typically one output. The input to a neuron can
be either the features of input data, or the outputs of other neurons. The output of one neuron can serve as
input to multiple neurons. Typically, there is a weight associated with each input of each neuron. At
processing time (classifying a query instance), each neuron in the network computes
the weighted sum of its inputs using the associated weights, and generates its output by applying some
nonlinear function f to that sum. There are different flavors of ANN structure, such as feedforward versus
recurrent networks. During training, a cost function is defined to estimate the accuracy of the ANN
with respect to the data; it is essentially a measurement of how much error the current ANN makes on the
training data. The learning process is to find the optimal setting of the structure and/or weights of the
ANN to minimize the cost function on the training data. ANN in general is a very powerful method, and
it has been used in numerous applications, including protein active site prediction 18. One drawback is
that ANN is a somewhat black-box method: although one may find a very good classifier, the
structure of the network and the weights associated with each input may not reveal much useful
information about why it works.
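As a concrete illustration of the weighted-sum-and-nonlinearity computation described above, the following is a minimal sketch of a feedforward pass in Python. The logistic function used for f, the bias term, and all weight values are assumptions chosen for illustration; they do not come from any network used in this work.

```python
import math


def sigmoid(z):
    """Logistic nonlinearity, one common (assumed) choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0, f=sigmoid):
    """Weighted sum of the inputs, passed through a nonlinear function f.

    The bias term is an assumption; the text only mentions input weights.
    """
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return f(total)


def feedforward(inputs, layers):
    """Propagate inputs through a list of layers.

    Each layer is a list of (weights, bias) pairs, one pair per neuron.
    The outputs of one layer serve as the inputs of the next layer.
    """
    activations = list(inputs)
    for layer in layers:
        activations = [neuron_output(activations, w, b) for w, b in layer]
    return activations


# Hypothetical network: two inputs, two hidden neurons, one output neuron.
hidden = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output = [([1.2, -0.7], 0.05)]
prediction = feedforward([1.0, 2.0], [hidden, output])
```

Training would then adjust the weights to reduce a cost function on the training data; that step is omitted here.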
Another popular and intuitively appealing method is the nearest neighbor method, or its more general
form, the k-nearest-neighbor method (k-NN) 42. The principal idea is to classify the query instance based
on its k nearest neighbors among the training set, which represent the most “similar” training instances.
The success of this method relies on several choices; k and the distance function that defines what
“similar” means are among the most important ones. The training or learning process of this method is
somewhat different from most other machine learning techniques. In most cases, instead of solving an
optimization problem, it uses cross validation directly to select the best k and best distance function.
However, this method and the naïve Bayes method, which will be discussed later, are both susceptible to
the presence of correlated features. The method is also susceptible to the presence of noisy and irrelevant
features.
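The idea can be sketched in a few lines of Python. This is a minimal illustration only: the Euclidean distance, the majority vote, the value k = 3, and the toy training set are all assumptions, and in practice k and the distance function would be selected by cross validation as noted above.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, training_set, k=3, distance=euclidean):
    """Label a query by majority vote among its k nearest training instances.

    training_set is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(training_set, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two classes.
training = [
    ((0.0, 0.0), "negative"),
    ((0.1, 0.3), "negative"),
    ((0.2, 0.1), "negative"),
    ((1.0, 1.0), "positive"),
    ((0.9, 1.2), "positive"),
    ((1.1, 0.8), "positive"),
]
label = knn_classify((0.95, 0.9), training, k=3)
```

Because every feature contributes to the distance, an irrelevant or noisy feature can change which neighbors are nearest, which is the susceptibility mentioned above.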
The support vector machine (SVM) 43, 44 is a relatively recent method. A one-sentence
description of this method is that it uses the “kernel trick” to find the best linear separator (hyperplane) in
kernel space for instances that are not linearly separable in feature space. There are two major
advantages of SVM. First, unlike ANN or some other classification techniques, the linear separator SVM
finds is not just one that successfully separates the two classes (in the hard margin case) or that makes the
“fewest” errors (in the soft margin case), but the one that is best among all such separators. It
finds the “best” among all possible good separators because it maximizes the margin, which
measures how “far” the separator can move without making more mistakes on the training data, and
statistical learning theory provides bounds indicating that a maximum-margin classifier tends to make few
errors on the testing data. Second, the “kernel trick” maps instances in the original feature space to
instances in the kernel space. In certain cases, instances that are linearly inseparable in feature space
become linearly separable in kernel space. The kernel function is easy to compute, and the explicit
form of the mapping between the two spaces need not be known. To use this method successfully,
one needs to select the right kernel. There are commonly used kernels, but taking full advantage of
kernels, including developing new ones, is not a trivial task. I have applied this method to protein active
site prediction with success 45.
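The kernel trick can be illustrated with a small numerical check. The sketch below assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs, chosen only because its explicit mapping is easy to write down; it is not claimed to be the kernel used in this work, and the function names are hypothetical. It shows that the kernel value computed in the original space equals a dot product in the mapped space, and that XOR-like points become separable by a linear rule after the mapping.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map matching poly2_kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, z)      # computed in the original space
explicit_value = dot(phi(x), phi(z))   # computed in the mapped space

# XOR-like points are not linearly separable in (x1, x2), but the sign of
# the middle coordinate of phi(x) separates them with a linear rule.
xor_points = [((1, 1), +1), ((-1, -1), +1), ((1, -1), -1), ((-1, 1), -1)]
separable = all((phi(p)[1] > 0) == (label > 0) for p, label in xor_points)
```

An SVM never needs to evaluate phi directly; it only needs kernel values between pairs of instances.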
Last, I will mention boosting, another method developed quite recently which has achieved a lot of
success 46, 47. Boosting is a meta-learning algorithm. Boosting occurs in stages, by incrementally adding
to the current learned function. At every stage, a weak learner (i.e., one that has accuracy greater than
chance) is trained with the data. The output of the weak learner is then added to the learned function, with
some strength (proportional to how accurate the weak learner is). Then, the data is re-weighted: examples
that the current learned function gets wrong are "boosted" in importance, so that future weak learners will
attempt to fix the errors. If every weak learner is guaranteed to perform better than random guessing,
boosting can very quickly drive the error on the training data below any preset
threshold. It is a very powerful method for combining different learners into one “super system”.
There is also a well-developed mathematical theory showing that boosting can lower generalization error.
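The staged procedure described above can be illustrated with a minimal sketch of AdaBoost, one widely used boosting algorithm. The text does not commit to a particular variant, so the choice of AdaBoost, the one-dimensional threshold "stumps" used as weak learners, the toy data, and all function names are assumptions made purely for illustration. Note that AdaBoost sets the strength of each weak learner to half the log-odds of its weighted accuracy, rather than to the accuracy itself.

```python
import math

def train_stump(xs, ys, weights):
    """Weak learner: the threshold rule with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds):
    """Return a list of (strength, threshold, sign) weak learners."""
    n = len(xs)
    weights = [1.0 / n] * n                      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Re-weight: misclassified examples gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * (sign if x >= thr else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the strength-weighted vote of all weak learners."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single threshold separates these labels.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
training_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

On this toy set a single stump necessarily misclassifies two points, whereas the three-stump ensemble fits all six, which illustrates how re-weighting steers later weak learners toward the earlier mistakes.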
2.2.2 Probability based approach.
A lot of machine learning work overlaps with statistics, where probability is used to classify instances
directly. The first idea introduced here is Bayesian inference, which is based on Bayes’ theorem 26. Bayes’
theorem may be written as:
\[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \]
The probability of an event A conditional on another event B is generally different from the probability of
B conditional on A. However, there is a definite relationship between the two, and Bayes' theorem is the
statement of that relationship. P(A) and P(B) are called prior probabilities, and P(A|B) is called the conditional
probability of A given B.
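As a small worked example of this relationship, the sketch below computes P(A|B) from P(A), P(B|A) and P(B|not A), obtaining P(B) from the law of total probability. The function name and the numerical values are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) via Bayes' theorem."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b

# Hypothetical numbers: a rare event A (prior 3%) and an observation B that is
# much more likely when A holds (80%) than when it does not (5%).
p_a_given_b = posterior(prior_a=0.03, p_b_given_a=0.8, p_b_given_not_a=0.05)
```

With these illustrative values the posterior is roughly one third: observing B raises the probability of A about tenfold, yet A remains less likely than not because its prior is so small.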
Bayes’ theorem is important to a number of applications, including several different places in the present
problem. One way to formulate our problem is to generate the hypothesis H that gives the probability that
a residue with certain features is in the active site, based on the observed data (D), or training examples,
i.e., P(H|D). Take H here as a look-up table for probability of positive with different feature values x. To
use such H at query time, just go to the entry that has the same reading as x and read the corresponding
probability. Usually the number of ways to fill out the look-up table H is infinite, so how should one
choose? One simple answer is to choose the H that gives the largest P(D|H); this is called the maximum
likelihood (ML) hypothesis. Taking P(H), the prior probabilities of H into consideration, one could pick
the H that gives the largest P(D|H)P(H); this gives the maximum a posteriori (MAP) hypothesis. Notice
that ML is equivalent to MAP with a “flat prior”, i.e., when the prior probabilities of all hypotheses under
consideration are the same. The POOL method I introduce below for active site prediction is MAP with all
hypotheses satisfying the monotonicity constraints having a flat prior, while all other hypotheses have a
prior probability of 0. Both the ML and MAP methods pick the “most favorable” hypothesis H* out of all
possible ones based on the data, and use it to predict the “most probable class” of new query data;
that is, given query data q, the probability that q is in class c is taken to be P(C=c| q, H*), and the
objective is to find the particular c that gives the largest P(C=c| q, H*). There is another way to do
prediction at query time, namely consider the prediction from all possible hypotheses and sum the result
with the probability of each hypothesis as weights: P(C=c|q, D) = ∑P(C=c|q, Hi, D)*P(Hi|D). This is
called the Bayes classifier, which gives an optimal result, but is often difficult to compute in practice.
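The difference between the three strategies can be seen in a minimal sketch with a finite hypothesis space. Here each hypothesis is simply a candidate value for the probability of the positive class, and the data are counts of positive and negative observations; the hypothesis values, priors, counts and function names are all hypothetical and serve only to illustrate the definitions above.

```python
def likelihood(h, positives, negatives):
    """P(D|H) when H says each observation is positive with probability h."""
    return h ** positives * (1.0 - h) ** negatives

def ml_map_bayes(hypotheses, priors, positives, negatives):
    """Return (ML hypothesis, MAP hypothesis, Bayes-averaged P(positive))."""
    like = [likelihood(h, positives, negatives) for h in hypotheses]
    joint = [l * p for l, p in zip(like, priors)]        # P(D|H) P(H)
    evidence = sum(joint)                                # P(D)
    post = [j / evidence for j in joint]                 # P(H|D)
    idx = range(len(hypotheses))
    h_ml = hypotheses[max(idx, key=lambda i: like[i])]
    h_map = hypotheses[max(idx, key=lambda i: joint[i])]
    # Bayes classifier: average the predictions, weighted by P(H|D).
    p_bayes = sum(h * w for h, w in zip(hypotheses, post))
    return h_ml, h_map, p_bayes

hypotheses = [0.2, 0.5, 0.8]
h_ml, h_map, p_bayes = ml_map_bayes(hypotheses, [0.1, 0.8, 0.1],
                                    positives=3, negatives=1)
flat_ml, flat_map, _ = ml_map_bayes(hypotheses, [1 / 3] * 3,
                                    positives=3, negatives=1)
```

With the non-flat prior the ML and MAP choices differ (0.8 versus 0.5), and the Bayes-averaged prediction falls between them; with the flat prior ML and MAP coincide, as noted above.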
2.2.3 Performance measure for classification problems.
In the context of binary classification tasks, the terms true positives, true negatives, false positives and
false negatives are used to compare the given classification of an item (the class label assigned to the item
by a classifier) with the desired correct classification (the class the item actually belongs to). This is
illustrated by the confusion matrix below:
                              Predicted classification
                              Positive               Negative
Actual          Positive      True Positive (TP)     False Negative (FN)
classification  Negative      False Positive (FP)    True Negative (TN)
Table 2.1. Confusion matrix of classification labeling.
In the confusion matrix above, TP, FP, FN and TN are the numbers of true positive, false positive, false
negative and true negative instances, respectively. Although all the information about the classifier’s
performance is contained in the confusion matrix, it is common to compare classifiers using other measures derived
from it. Some commonly used ones are:
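\[ \text{recall} = \text{sensitivity} = \frac{TP}{TP + FN} \]
\[ \text{specificity} = \frac{TN}{FP + TN} \]
\[ \text{precision} = \text{positive predictive value} = \frac{TP}{TP + FP} \]
\[ \text{negative predictive value} = \frac{TN}{TN + FN} \]
\[ \text{false positive rate} = 1 - \text{specificity} = \frac{FP}{FP + TN} \]
\[ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{error} = 1 - \text{accuracy} = \frac{FP + FN}{TP + TN + FP + FN} \]
\[ \text{filtration ratio} = \frac{TP + FP}{TP + TN + FP + FN} \]
\[ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]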
Most of the measurements above are very straightforward and can be treated as mere definitions with no
need for further explanation, with the exception of the filtration ratio and the Matthews correlation
coefficient (MCC). I invented the term filtration ratio, to be used in place of precision and false positive
rate in the present problem. One of the advantages of the filtration ratio is that it is the only measurement
listed above that can be determined with information from predicted classification, without the
requirement of knowing the actual classification, which is always unknown in practice. Another reason I
introduced and use the filtration ratio is that in the present problem, information available in the literature about
actual positives is incomplete, even in our training and testing dataset. Thus a certain fraction of the
nominal false positives probably are not false. In situations like ours where one expects true positives to
represent a fairly small proportion of all the instances, and the measured false positive rate is suspect,
using filtration ratio is more appropriate than some other measurements, such as precision. Both of these
issues will be discussed in more detail later in the dissertation. On the other hand, MCC is widely used in
machine learning as a measure of the quality of binary classifications. It takes into account both
sensitivity and selectivity and is generally regarded as a balanced measure which can be used even if the
classes are of very different sizes. It returns a value between -1 and +1. A coefficient of +1 represents a
perfect prediction, 0 an average random prediction and -1 the worst possible prediction. While there is no
perfect way of describing the confusion matrix of true and false positives and negatives by a single
number, the MCC is generally regarded as being one of the best such measures. In the denominator of
MCC, if any of the four sums is zero, the denominator can be arbitrarily set to one; this results in an MCC
of zero, which can be shown to be the correct limiting value 48.
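The measures defined in this section can be computed directly from the four confusion-matrix counts, as in the following minimal sketch. The function name, the dictionary keys, the choice to report undefined ratios as NaN, and the example counts are my own assumptions for illustration; the treatment of a zero MCC denominator follows the convention described above.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Derive the performance measures from the confusion-matrix counts."""
    total = tp + fp + fn + tn

    def ratio(num, den):
        # Assumption: a ratio with an empty denominator is reported as NaN.
        return num / den if den else float("nan")

    # A zero MCC denominator is replaced by one, which yields an MCC of zero.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) or 1.0
    return {
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, total),
        "error": ratio(fp + fn, total),
        "filtration_ratio": ratio(tp + fp, total),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }

# Hypothetical counts: 10 actual positives among 100 instances.
m = classification_metrics(tp=8, fp=2, fn=2, tn=88)
# Degenerate case: nothing is predicted positive, so (TP + FP) is zero.
d = classification_metrics(tp=0, fp=0, fn=5, tn=95)
```

Note that the filtration ratio depends only on TP + FP and the total, i.e., on the number of predicted positives, which is why it can be evaluated without knowing the actual classification.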
2.3 THEMATICS
I am going to discuss the THEMATICS method in more detail because this is the basis for most of the
input data in the work of this dissertation.
2.3.1 The THEMATICS method and its features.
In the application of THEMATICS, one begins with the 3D structure of the query protein, solves the
Poisson-Boltzmann (P-B) equations using well-established methods 49-52, and then performs a Monte
Carlo procedure to compute the proton occupations of each ionizable amino acid as a function of the pH.
Each such function is called a titration curve, as shown in figure 2.1 (a).
From the theoretical titration curves computed from the 3D structure of a query protein, THEMATICS
identifies residues (amino acids) that exhibit significant deviation from Henderson-Hasselbalch (H-H)
behavior, which I now describe.
A typical ionizable residue in a protein obeys the H-H equation, which may be expressed as a proton
occupation O as a function of pH as:
\[ O(pH) = \left(10^{\,pH - pK_a} + 1\right)^{-1} \qquad (1) \]
For the residues that form a cation upon protonation (Arg, His, Lys, and the N-terminus), the mean net
charge C on a particular residue is equal to O, whereas for the residues that form an anion upon
deprotonation (Asp, Cys, Glu, Tyr, and the C-terminus), the mean net charge is given by (O − 1):
\[ C(pH) = O(pH) \quad \text{(cationic)} \qquad (2) \]
\[ C(pH) = O(pH) - 1 \quad \text{(anionic)} \qquad (3) \]
Note that C represents the average net charge on a particular residue for a large ensemble of protein
molecules. Equations (1) - (3) have the sigmoid shape that is typical of a weak acid or base that obeys the
H-H equation. Thus, as pH increases, the predicted average charge falls sharply in a pH range close to the
pKa, which is defined as the pH at which that residue is protonated in exactly half of the protein
molecules in the ensemble.
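Equations (1) - (3) translate directly into code. The sketch below is only an illustration of the ideal H-H behavior; the function names and the pKa value used in the example are hypothetical, and the computed THEMATICS titration curves are of course obtained from the P-B and Monte Carlo calculations rather than from these closed forms.

```python
def hh_occupation(ph, pka):
    """Proton occupation O(pH) of an ideal H-H residue, equation (1)."""
    return 1.0 / (10.0 ** (ph - pka) + 1.0)

def mean_net_charge(ph, pka, cationic):
    """Mean net charge C(pH): O for cationic residues, O - 1 for anionic ones."""
    occupation = hh_occupation(ph, pka)
    return occupation if cationic else occupation - 1.0

# Illustrative anionic residue with a hypothetical pKa of 4.0.
charges = [mean_net_charge(ph, 4.0, cationic=False) for ph in (2.0, 4.0, 6.0)]
```

At pH equal to the pKa the occupation is exactly one half, so the anionic residue in the example carries a mean charge of -0.5 there, moving toward 0 two pH units below the pKa and toward -1 two units above it.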
Underlying the THEMATICS approach is the observation that the computed titration curves tend to
deviate more from this H-H shape for ionizable residues belonging to active sites than for ionizable
residues not belonging to such sites. The key step in the application of the THEMATICS approach is thus
recognizing significant deviation from H-H behavior in the shape of these predicted titration curves.
When THEMATICS was first developed, visual inspection of the computed curves was used to identify
THEMATICS positive residues. Although simple, it is inefficient, vulnerable to bias, and in some cases
ineffective, since some deviations of the curves are subtle and not easily recognized visually, as indicated
by figure 2.1(b).
Figure 2.1. Titration curves. (a) A standard H-H curve (black), a typical perturbed curve (red), and a typical unperturbed curve from residues not in the active site (blue). (b) Titration curves from active site residues (red) versus non-active-site residues (green) from a set of 20 proteins (Appendix A); only glutamate residues are shown.
Before introducing the methods for automation of the classification of residues using THEMATICS, I
will first present the feature extraction process.
In order to perform any machine-learning or statistical analysis on titration curves, one needs to find
features that are easy to compute and effective in distinguishing positive from negative instances.
I have defined features that may be used to measure the deviation of a particular titration curve from H-H
behavior. In particular, four features extracted from the titration curves are most useful in separating
THEMATICS positive residues from the others.
These four features are based on the first four moments of the derivatives of the titration curves, as I now
describe briefly. A more detailed description can be found in Ko’s study 24.
Define the variable x to be the offset of the pH from the pKa:
\[ x = pH - pK_a \qquad (4) \]
Then equation (1) for Henderson-Hasselbalch titration curves becomes:
\[ O(x) = \left(10^{x} + 1\right)^{-1} \qquad (5) \]
The key observation on which the moment analysis is based is that, for any titration curve O(x), whether
of Henderson-Hasselbalch form or not, the corresponding derivative
\[ f(pH) = -\frac{dO}{d(pH)} = -\frac{dO}{dx} \qquad (6) \]
is effectively a probability density function (ignoring those rare cases when the titration curve fails to be a
non-increasing function of x, in which case this derivative function takes on negative values) 53.
The nth moment of f is defined as
\[ M_n = \int (pH)^n \, f(pH) \, d(pH) \qquad (7) \]
and the corresponding nth central moment µn is
\[ \mu_n = \int (pH - M_1)^n \, f(pH) \, d(pH) \qquad (8) \]
where these integrals are over all space (−∞ to +∞).
The features I use are based on the first moment M1 and the second, third, and fourth central moments µ2,
µ3, and µ4, respectively, of the derivatives f. For a pure H-H titration curve these moments are
M1 = pKa , µ2 = 0.620, µ3 = 0, and µ4 = 1.62. (9)
It is interesting to note that, for an arbitrary probability density function, M1 is its mean and µ2 is its
variance, while µ3 and µ4 are related to the skewness and kurtosis, respectively, standard quantities used in
statistics to measure departure from normality.
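The H-H values quoted in equation (9) can be checked numerically. For the H-H curve, f is a logistic density with scale 1/ln 10, for which the second and fourth central moments have the closed forms π²/(3 ln²10) ≈ 0.6205 and 7π⁴/(15 ln⁴10) ≈ 1.617. The sketch below evaluates equations (7) and (8) with a simple midpoint rule; the integration range, step count, function names and the pKa used in the example are assumptions made for illustration.

```python
import math

def hh_density(x):
    """f(x) = -dO/dx for the H-H curve O(x) = 1 / (10**x + 1), x = pH - pKa."""
    t = 10.0 ** x
    return math.log(10.0) * t / (1.0 + t) ** 2

def hh_moments(pka, half_width=20.0, steps=20000):
    """Midpoint-rule estimates of M1 and the central moments mu2, mu3, mu4."""
    dx = 2.0 * half_width / steps
    xs = [-half_width + (i + 0.5) * dx for i in range(steps)]
    weights = [hh_density(x) * dx for x in xs]          # f(pH) d(pH)
    m1 = sum((pka + x) * w for x, w in zip(xs, weights))
    central = [sum((pka + x - m1) ** n * w for x, w in zip(xs, weights))
               for n in (2, 3, 4)]
    return (m1, *central)

# The pKa value is hypothetical; only M1 depends on it.
m1, mu2, mu3, mu4 = hh_moments(pka=4.0)
```

The numerical estimates agree with the closed forms and with the values in equation (9) to the precision quoted there, and µ3 vanishes because the H-H derivative is symmetric about the pKa.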
When applied to a general titration curve of ionizable residues in a protein, the pKa shift is closely related
to how much M1 differs from the free-solution pKa. Those residues that interact strongly with other
ionizing residues in such a way that the predicted titration functions O(pH) are elongated will have
broader first derivative functions f and thus have generally higher values for µ2 and especially µ4. The
moment µ3 measures the asymmetry of the function f and has a nonzero value for any residue that
interacts with other ionizing groups in such a way that the strength of this interaction in the range pH <
pKa is different from that in the range pH > pKa.
Thus it is clear that the first moment and the second, third, and fourth central moments are useful
measures for determining deviation from H-H behavior.
The methods introduced below all use some of the four features described above, with some additional
features in some specific methods.
2.3.2 Statistical analysis with THEMATICS.
One automated analysis was proposed and studied by Ko 24. Ko introduced simple statistical metrics to
automatically evaluate the degree of perturbation of a titration curve from H-H behavior. The method is
simple: it looks at only two of the above features, namely µ3 and µ4. A statistical Z-score was computed on
these features; i.e., for every curve analyzed, the deviations of the µ3 and µ4 values from their means over
all curves from the same protein, expressed in units of the standard deviation, were computed. Any
residue with a titration curve for which either the absolute value of the Z-score of µ3 or the Z-score of µ4 is greater than
1.0 was classified as a THEMATICS positive residue, and any such residue with at least one other
THEMATICS positive residue located within 9Å was reported as an active site candidate.
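The selection rule just described can be sketched as follows. This is not Ko's implementation: the function names are hypothetical, each residue is represented by a single coordinate (the text does not say how inter-residue distance is measured), the population standard deviation is used, and the absolute value is applied to both Z-scores, all of which are assumptions on my part.

```python
import math
import statistics

def z_scores(values):
    """Ordinary Z-scores of one feature over all curves in a protein."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # assumption: population standard deviation
    return [(v - mean) / sd if sd else 0.0 for v in values]

def ko_style_candidates(mu3, mu4, coords, z_cut=1.0, dist_cut=9.0):
    """Indices of residues that pass the Z-score cut and have a positive neighbor."""
    z3, z4 = z_scores(mu3), z_scores(mu4)
    # Assumption: absolute values are taken for both mu3 and mu4 Z-scores.
    positive = [i for i in range(len(mu3))
                if abs(z3[i]) > z_cut or abs(z4[i]) > z_cut]
    # Keep only positives with at least one other positive within dist_cut.
    return [i for i in positive
            if any(j != i and math.dist(coords[i], coords[j]) <= dist_cut
                   for j in positive)]
```

For example, two residues with outlying µ3 values that lie 5Å apart would both be reported, whereas the same two residues 50Å apart would both be discarded as isolated.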
Good results were obtained for the identification of active sites in a set of 44 proteins with Ko’s method.
Although this method has excellent recall of catalytic sites, identifying the correct catalytic site for 90%
of enzymes, the recall rate is lower (about 50%) for the identification of catalytic residues. It is desirable
to improve the catalytic residue recall rate and also to expand the method to include predictions of non-
ionizable residues.
Wei studied Ko’s analysis and modified it 54, introducing a new fractional parameter, α, which typically
runs between 0.95 and 1.0. In this method, the mean and the standard deviation of µ3 and µ4 are
calculated from the titration curves from a portion of the residues in a protein, in contrast to the whole
population in Ko’s method. The portion of the residues excluded from the sample are the residues with
titration curves with µ3 and µ4 values in the highest (1-α) fraction. This modification did yield better
recall of annotated catalytic residues than Ko’s analysis, but the optimal α is different for different
proteins and was finally fixed at 0.99, the value that yields the best overall performance when averaged
over the set of annotated proteins. The purpose of this α is to exclude residues with titration curves with
the most extreme µ3 and µ4 values from influencing the mean and standard deviation of the population too
much and thus yielding better statistics and slightly more reliable predictions.
Meanwhile, Yang developed another rule-based statistical analysis identifying THEMATICS positive
residues 55. In addition to the four features described earlier, her method uses an additional feature, a
value called the buffer range R, which measures the width of the pH range over which the residue is
partially ionized. Also, outliers are selected within each amino acid type, when possible, instead of within the
entire set of ionizable residues. The performance of this method is a little better than that of Ko’s method.
The three statistics-based analyses listed above all employ handcrafted cutoff values to differentiate the
positive from the negative instances. The study described in this dissertation begins with the hypothesis
that a machine learning method can utilize similar sets of features, define a threshold in a systematic way,
and achieve better performance in practice.
2.3.3 Challenges of the site prediction problem using THEMATICS data.
One of the challenges of the task to classify THEMATICS results arises from special characteristics of the
training data set.
First, the vast majority of the residues in the training data are negative examples. Literature-confirmed
active site residues typically constitute less than 3% of total residues. At the same time, the negative
examples, which comprise most of the data in the training set, share some “common” property, while the
relatively few positive examples have “abnormal” behaviors in a varied way. This is one of the key
reasons that a simple outlier detection process like Ko’s analysis is quite successful in solving this
problem. But it is not clear how this method can incorporate additional non-THEMATICS features to
possibly improve the active-site prediction.
Secondly, the nature of this problem limits the quality of the training data. The ultimate goal of the
project is to predict the active sites of proteins using THEMATICS data; however, the absolute criterion
for labeling a residue in a protein as active is that someone has done the experiment in the lab and published
a result supporting the claim. There are databases collecting such annotations, but these annotations are by no
means complete. There is another subtlety in that, although THEMATICS positive residues
have been shown to be a very reliable indicator of active sites, THEMATICS sometimes predicts
additional nearby residues that are not annotated as active, including second shell residues. There is some
evidence to support the hypothesis that these second shell residues may be important. Alternatively, they
may be affected by the special electrostatic field created by the nearby active site residues. The
THEMATICS positive residues in the second case may not be shown experimentally to be active site
residues. Because residue activity is often measured in a kinetics experiment and a number of factors can
sometimes cause large errors in these experiments, the training set inevitably contains some positive
instances that are misclassified in the first place, or some instances that cannot be correctly distinguished
by the model. In particular, there are most probably instances of true positives that are improperly
annotated as negatives, simply because no experiments have been tried on the vast majority of residues.
In order to overcome these two obstacles, in my earlier work with neural networks and the
SVM method, I “cleaned” the training set. Instead of using just literature-confirmed positive instances, I
also labeled as positive the “apparent” THEMATICS positives located near a known active site, although they are not experimentally
identified as active site residues. I also removed some of the isolated THEMATICS positive instances
from the training set. Although this data cleaning did improve the results, it is ad hoc and lacks a
systematic justification.
For any machine learning problem, if there is some prior belief, or bias, which turns out to be true,
applying it should always help the performance.
After studying THEMATICS and its application in protein active site prediction, I believe it is fair to
adopt the following THEMATICS principles as prior beliefs:
THEMATICS Principle 1: The more perturbed the titration curve is (relative to other titration curves in
the same protein), the greater the probability that residue is in the active site.
THEMATICS Principle 2: The more perturbed the neighboring titration curves are (relative to other
titration curves in the same protein), the greater the probability that residue is in the active site.
The ad hoc method used before implicitly cleaned the data based on the THEMATICS principles.
In addition to THEMATICS features, to which we can apply THEMATICS principles, there are some
non-THEMATICS features having either positive or negative correlation to the probability that a residue
is located in the active site. Those features may not be reliable indicators by themselves, but combined
with THEMATICS methods, they may improve the overall prediction accuracy.
While there may be ways to enforce inductive bias in classifiers like neural networks and SVMs, I believe
the most straightforward approach is instead to try to estimate P(class | attributes) nonparametrically,
while enforcing these principles as constraints, as explained in Chapters 4 and 5.
Chapter 3
Applying SVM to THEMATICS
3.1 Introduction.
As discussed in earlier chapters, THEMATICS is a technique for the prediction of local interaction sites
in a protein from its three-dimensional structure alone. Various approaches have been taken to automate
and standardize the process with various sensitivity and specificity. Here, I will present my first work on
this project, using a support vector machine, with four extracted features from THEMATICS alone to
predict the active sites of enzymes.
In this chapter it is shown how support vector machines (SVMs) may be combined with THEMATICS to
achieve a substantially higher recall rate for catalytic residues with only a small sacrifice in specificity
when compared to Ko’s statistical analysis of THEMATICS 24. It is argued that clusters predicted by
THEMATICS-SVM are small, local networks of ionizable residues with strong coupling between their
protonation events; these characteristics appear to be very common, perhaps nearly universal, in enzyme
active sites. Performance of THEMATICS-SVM in active site prediction is compared with that of other
3D-structure-based methods, including THEMATICS combined with previous analyses, and is shown to return
equal or better recall with generally higher specificity and lower filtration ratio. The high specificity and
low filtration ratio translate to better-quality, more localized predictions.
This work builds on the prior work of Ko using variants of some of the same features that were found to
be successful there, plus some additional features. Results of our method are presented for 60 different
proteins. In this chapter, I also present a way to extend the method’s capabilities to the prediction of non-
ionizable residues.
3.2 THEMATICS curve features used in the SVM
To use an SVM to classify residues as either likely or not likely to be in the active site, I represent the
computed titration curves as points in a four-dimensional space. These four features are based on the first
four moments of these curves, as described in section 2.3.1.
The four features, namely M1, µ2, µ3 and µ4 are conceptually similar to those described in Ko’s analysis 24,
except I slightly modified the normalization process to prevent both the sample mean and sample standard
deviation from being too strongly influenced by extreme values. A more robust estimator is used to
distinguish “typical” from “atypical” titration curves within a single protein than the standard Z-score. In
my normalization, each of the four moments was normalized to its corresponding robust Z-score Z′,
which is defined as its deviation from the median, divided by the normalized interquartile distance (the
difference between the 75th percentile and 25th percentile values, divided by 1.349) for the corresponding feature
across all ionizable residues in that protein. The normalization factor of 1.349 is the interquartile distance of the normal
distribution with a mean of zero and a standard deviation of one, so that Z′ agrees with the standard Z-score
for normally distributed data. Thus for a given feature Φ, I define Z′
as:
\[ Z'(\Phi) = \frac{1.349 \cdot \left(\Phi - \mathrm{MEDIAN}(\Phi)\right)}{\mathrm{PERCENTILE}(\Phi, 0.75) - \mathrm{PERCENTILE}(\Phi, 0.25)} \qquad (10) \]
where the median and corresponding percentiles are based on the value of that feature for all ionizable
residues in a given protein. Thus this method achieves the same effect as Wei’s method 54 without
introducing an extra parameter to be fine-tuned.
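Equation (10) can be sketched in a few lines. The function name is hypothetical, and the percentile convention (here the "inclusive" method of Python's statistics.quantiles) is an assumption, since the text does not specify how percentiles are interpolated.

```python
import statistics

def robust_z(values):
    """Robust Z-score of equation (10) for one feature across a protein."""
    median = statistics.median(values)
    # Assumption: quartiles use the "inclusive" interpolation convention.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25
    if iqr == 0:
        raise ValueError("interquartile distance is zero; Z' is undefined")
    return [1.349 * (v - median) / iqr for v in values]

typical = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
with_outlier = typical[:-1] + [1000.0]       # replace the largest value
z_typical = robust_z(typical)
z_outlier = robust_z(with_outlier)
```

In this toy example the median and the quartiles are unchanged by the extreme value, so the scores of the other eight entries are identical in both lists, while the outlier itself receives a very large score. An ordinary Z-score would instead shrink all scores because the outlier inflates the standard deviation.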
For the even-numbered moments, Z′n, the robust Z-score for the nth central moment is defined as:
Z′n = Z′(µn ) (n even) (11)
The only even-numbered moments used in the present study are the second and fourth, so the
corresponding robust Z-scores Z′ are Z′2 and Z′4.
Likewise, the only odd-numbered moments used are the first and third. Their corresponding robust Z-scores are
the deviations of the absolute values from the median. In particular, we define Z′3 as
Z′3 = Z′(|µ3|) (12)
The population over which the median and percentiles are computed includes residues of different types
with different free pKa’s. In order to compare the computed first moments across all residue types, the
offset first moment for a given residue is defined as:
M1offset = M1 - pKa(free) (13)
where pKa(free) is the pKa for that residue in free solution. Note that by equation (9), an H-H residue has
M1offset = 0.
This offset may be compared across all residues in the protein. Thus Z′1 is defined as:
Z′1 = Z′ (|M1offset|). (14)
Note that only the first moment requires this modification to make all residue types in the protein
comparable since the H-H equation has only one free parameter, the residue type-dependent translation
parameter pKa.
To summarize, the result of all these computations is to create, for each ionizable residue in any given
protein, a 4-tuple of descriptors (Z′1, Z′2, Z′3, Z′4) of the theoretical titration curve. Z′2, Z′3, and Z′4
describe the shape of the curve and Z′1 measures its displacement along the horizontal axis.
3.3 Training.
A set of 20 proteins was used as the training set. The protein names, the E.C. numbers and the PDB ID for
each of the 20 proteins in the training set are listed in Appendix A.
The labeling of the titration curves for training purposes was performed as follows: All residues listed in
CatRes/CSA as active were labeled positive. Also labeled positive were ionizable residues located near
such annotated active residues whose titration curves appeared perturbed on visual
inspection. All other residues were labeled negative, with the exception of a few residues with visually
perturbed titration curves and no literature annotation that are not near any other perturbed residues;
these were removed from the training data set entirely. (Note that such residues with perturbed titration
curves that are not in spatial proximity with other perturbed residues are not considered predictive in
THEMATICS.)
From 1575 ionizable residues in the 20-protein training set, I removed 46 isolated residues with perturbed
titration curves. This left a training set of 1529 ionizable residues, among which 140 were labeled as
positive training examples. For each ionizable residue in the training set, the four moment-based features
and the corresponding label were fed into the SVM using SVMLight 56. For both training and
classification, the quadratic kernel K(x,z) = (1 + <x,z>)^2 was used.
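To make the role of the quadratic kernel concrete, the sketch below evaluates K(x,z) directly and also through one explicit feature map whose ordinary inner product reproduces it. The feature map shown (a constant, the scaled inputs, and all pairwise products) is one standard choice and is given only to illustrate the "kernel trick"; the function names and the example vectors are hypothetical, and SVMLight itself never constructs these features.

```python
import math

def quadratic_kernel(x, z):
    """K(x, z) = (1 + <x, z>)**2, computed in the original feature space."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2

def explicit_features(x):
    """One feature map phi with <phi(x), phi(z)> = (1 + <x, z>)**2."""
    linear = [math.sqrt(2.0) * a for a in x]
    pairwise = [a * b for a in x for b in x]      # includes squares and both orders
    return [1.0] + linear + pairwise

# Two hypothetical 4-tuples of robust Z-scores.
x = [0.5, -1.2, 2.0, 0.3]
z = [1.0, 0.4, -0.7, 2.5]
k_direct = quadratic_kernel(x, z)
k_mapped = sum(a * b for a, b in zip(explicit_features(x), explicit_features(z)))
```

For the four-dimensional inputs used here the explicit map has 21 components, yet the kernel needs only a single four-term inner product; a linear separator in the mapped space corresponds to a quadratic decision surface in the original space.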
The relative cost of misclassification of positive and negative training examples was set such that false
negatives were penalized 10 times as much as false positives. This was done for three reasons: there are many more
negative examples than positive examples in the training set, the aim is to increase the residue
recall rate, and I have much more confidence in the labeling of the positive examples than of the negative
examples (see sections 2.2.3 and 3.4). In addition, a linear kernel and several other choices of parameters
were tried, but these resulted in either similar or slightly more training errors.
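The effect of the asymmetric cost can be illustrated with the slack (hinge) penalty of a soft-margin SVM, in which each training error on a positive example is weighted ten times as heavily as one on a negative example. This is a sketch of the idea only: the function name and the example scores are hypothetical, and the exact form of the objective minimized by SVMLight is not reproduced here.

```python
def weighted_hinge_penalty(scores, labels, positive_cost=10.0):
    """Sum of hinge slacks, with errors on positives weighted by positive_cost."""
    total = 0.0
    for score, label in zip(scores, labels):
        slack = max(0.0, 1.0 - label * score)     # zero if outside the margin
        total += (positive_cost if label == 1 else 1.0) * slack
    return total

# Hypothetical decision values: one missed positive and one false alarm,
# each on the wrong side of the separator by the same amount.
penalty = weighted_hinge_penalty(scores=[-0.5, 0.5], labels=[1, -1])
```

In this example both mistakes have the same slack of 1.5, but the missed positive contributes 15 to the penalty and the false alarm only 1.5, so the optimizer is pushed toward separators that recover more of the positive examples.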
3.4 Results
Typical criteria used to measure classifier performance are recall (also called sensitivity), the number of
correctly predicted positives divided by the number of actual positives, and precision (related to specificity),
the number of correctly predicted positives divided by the total number of predicted positives. Ideally,
both measures are 100%, which means all and only the true positives are identified as such by the
classifier. In the present case, one can be more confident of the true positive data, because for every
labeled active residue there is experimental evidence supporting that labeling. On the other hand, true
negative data are not as reliable because the experiments are incomplete; some important residues may
not have been tested experimentally. Furthermore some of the experimental literature has not been
included in the CatRes/CSA database, because of the difficulty of exhaustive literature searching. A better
indicator of the selectivity of the method for present purposes is the filtration ratio, the fraction of total
residues that are reported as positive. Now the goal of the system is to achieve a high recall with a low
filtration ratio.
A set of 64 test proteins was selected randomly from the CatRes database 57. There is no overlap between
this test set of 64 proteins and the set of 20 proteins used to train the SVM. The trained SVM was applied
to the test set to measure the overall accuracy of the method, assuming that the CatRes annotations define
the true positive residues. Results are summarized here, while a detailed list of all proteins studied with all
predicted residues and clusters can be found in Appendix B.
3.4.1 Success in site prediction.
First I examine the degree of matching between our predictions and the CatRes list for each protein.
Overall, the SVM identified an average of 2.7 clusters per subunit. Based on the overlap of the predicted
active site and the CatRes listed set, the prediction for a protein is assigned to one of three categories. If
50% or more CatRes listed active residues were found by the system, we consider this a correct site
prediction. If some, but fewer than half, of the CatRes listed active residues were found by our system, we
consider it partially correct. If none of the CatRes listed active residues were found by our system, we
consider the site prediction incorrect. This type of categorization has been used previously 18. Measuring
this degree of overlap of predicted clusters with just the ionizable CatRes listed active-site residues, the
percentages of proteins for which the predictions are correct, partially correct, and incorrect are 86%, 5%
and 9% respectively, as shown in Figure 3.1(a).
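The three-way categorization described above amounts to the following rule. The function name and the residue labels in the example are hypothetical; the thresholds are those stated in the text.

```python
def categorize_site_prediction(annotated, predicted):
    """Classify one protein's prediction as correct, partially correct or incorrect."""
    annotated = set(annotated)
    if not annotated:
        raise ValueError("no annotated active-site residues to compare against")
    found = len(annotated & set(predicted))
    if found == 0:
        return "incorrect"
    if 2 * found >= len(annotated):          # 50% or more of the listed residues
        return "correct"
    return "partially correct"

# Hypothetical residue labels for a protein with four annotated residues.
annotated = ["H57", "D102", "S195", "G193"]
```

With these labels, a prediction containing any two of the four annotated residues is counted as correct, a prediction containing exactly one is partially correct, and a prediction containing none is incorrect, regardless of how many additional residues are predicted.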
3.4.2 Success in catalytic residue prediction.
Out of the 9303 ionizable residues from the 64 proteins, 1338 were predicted as active site candidates by
the SVM, forming 244 clusters. There are 233 ionizable residues labeled as active site residues in the
CatRes database and 182 of them were found by our SVM, corresponding to a global residue recall of
78%. The average residue recall rate, averaged over all 64 proteins, is 76%.
For these 64 proteins, the average filtration ratio, defined as residues predicted over all residues (a total of
32016, including both ionizable and non-ionizable residues), is only 3.9%. This ratio is less than 8% for each
of the 64 proteins. The average precision, or fraction of predicted residues that are known true positives,
is 20% over the 64 protein set, using only the CatRes/CSA annotations to define the true positives.
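The global (pooled) figures follow directly from the counts given in the text, as the short calculation below shows. It also illustrates, with a hypothetical two-protein example, why a pooled ratio generally differs from the per-protein average; the function names and the two-protein counts are assumptions introduced only for this illustration.

```python
def pooled_ratio(pairs):
    """Ratio of summed numerators to summed denominators (global value)."""
    return sum(n for n, _ in pairs) / sum(d for _, d in pairs)

def mean_of_ratios(pairs):
    """Average of the per-protein ratios."""
    return sum(n / d for n, d in pairs) / len(pairs)

# Counts reported in the text for the ionizable-residue predictions.
global_recall = 182 / 233          # annotated ionizable residues found
pooled_precision = 182 / 1338      # found residues over predicted residues
pooled_filtration = 1338 / 32016   # predicted residues over all residues

# Hypothetical per-protein (found, annotated) counts for two proteins.
toy = [(1, 2), (9, 10)]
toy_pooled, toy_mean = pooled_ratio(toy), mean_of_ratios(toy)
```

The pooled values are about 78% recall, 14% precision and 4.2% filtration ratio. These need not equal the per-protein averages quoted in the text, because averaging over proteins gives small proteins the same weight as large ones, just as the toy example gives a pooled value of about 0.83 but a mean of 0.70.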
3.4.3 Incorporation of non-ionizable residues.
Since not all active site residues are ionizable, it is also of interest to see how well the SVM-reported
residues serve as predictors of activity in their spatial vicinity. Therefore I also define a THEMATICS
positive region to be the spatial region within 6Å of any residue that belongs to a THEMATICS positive
cluster. This may allow the method to find some catalytically important residues that do not have a
perturbed titration curve (including non-ionizable residues). The total number of residues found by this
criterion across the 64 test proteins is 4795, out of 32016 total residues. Among 366 residues that are
labeled as active site residues in CatRes, 263 were found by the system, corresponding to a global recall
of 72%, while the average recall per protein is 81%. The average precision, or fraction of predicted
residues that are known true positives, is 8% over the 64-protein set (Table 3.1), using only the CatRes/CSA
annotations to define the true positives.
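The construction of a THEMATICS positive region can be sketched as a simple distance filter. The code below is an illustration only: it assumes each residue is represented by a single point (the text does not specify how the distance between residues is measured), it treats "within 6Å" as less than or equal to 6Å, and the function name and toy coordinates are hypothetical.

```python
import math

def positive_region(coords, predicted, radius=6.0):
    """Return ids of residues lying within `radius` of any predicted residue.

    coords    : dict mapping residue id -> (x, y, z) in angstroms
                (ASSUMPTION: one representative point per residue)
    predicted : iterable of residue ids in THEMATICS-positive clusters
    """
    predicted = set(predicted)
    region = set(predicted)  # predicted residues belong to their own region
    for res, xyz in coords.items():
        if res in region:
            continue
        # Keep the residue if it is close enough to at least one predicted residue.
        if any(math.dist(xyz, coords[p]) <= radius for p in predicted):
            region.add(res)
    return region

# Hypothetical toy coordinates, not taken from any PDB entry.
toy = {"E86": (0.0, 0.0, 0.0), "S21": (4.0, 0.0, 0.0), "A99": (10.0, 0.0, 0.0)}
region = positive_region(toy, ["E86"])
```

In this toy case the non-ionizable residue S21 is added to the region because it lies 4Å from the predicted residue, while A99 at 10Å is left out.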
Table 3.1 compares the performance of the straight SVM predictions versus the SVM+Region predictions.
While the expansion to include the neighborhood surrounding the predicted residue leads to a somewhat
higher recall rate, there is considerable sacrifice in the precision and increase in the filtration ratio.
Method             Recall   Precision   Filtration Ratio
SVM only           61%      21%         4%
SVM + 6Å region    81%      8%          13%
Table 3.1. Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue. Shown are average values of recall (true positive residues over all known positive residues), precision (true positive residues over all predicted residues), and filtration ratio (residues predicted over total residues in the protein), where averaging is performed over the set of 64 test proteins.
Using the same criteria described above for judging correctness of the predictions, but this time counting
all residues in THEMATICS positive regions and comparing with all the CatRes listed active-site
residues, the percentages of correct, partially correct, and incorrect site predictions are 88%, 4% and 8%
respectively (Figure 3.1(b)).
[Figure 3.1 pie charts. Panel (a): Correct 86%, Partially Correct 5%, Incorrect 9%. Panel (b): Correct 88%, Partially Correct 4%, Incorrect 8%.]
Figure 3.1. The success rate for site prediction on a per-protein basis: (a) ionizable residues only; (b) all residues, extending the SVM’s predictions by including all residues within 6Å of each predicted residue.
Figure 3.2 shows histograms of the filtration ratios achieved using the trained SVM on a per protein basis.
These filtration ratios are expressed in three different ways: 1) All predicted residues over all residues; 2)
Ionizable residues predicted over all ionizable residues; and 3) Ionizable residues predicted over all
residues. Here, 1) is obtained using the 6Å neighborhood criterion and 2) and 3) are obtained from the
straight SVM prediction of ionizable residues only. Using all residues as the basis, the filtration ratio for
the SVM (ionizable) predictions is less than 10% for all 64 proteins. There was only one protein out of
the 64 for which the filtration ratio for all residues predicted (using the 6Å neighborhood criterion) was
higher than 25%. For this protein, human Glyoxalase I, the method identified about 18% of its ionizable
residues as candidates and about 27% of all its residues as candidates (using the 6Å neighborhood
criterion). For well over 90% of the proteins studied the filtration ratio was better than 20% in both the
ionizable/ionizable and the all-residues cases. Even better, in 70% of the proteins tested, the method
reported less than 15% of the ionizable residues as candidates and in 61% of the proteins in the test set,
less than 15% of all residues were identified as candidates using the 6Å neighborhood criterion. It is
important to note that for most of the examples with large filtration ratios, there is a sound functional
basis for this high ratio, e.g. the active site binds multiple substrate or cofactor molecules and thus has an
unusually large interaction site. In other cases with high filtration ratios, the protein has an interaction site
of typical size but an unusually small number of total ionizable residues.
[Figure 3.2 histogram. Horizontal axis: Filtration Ratio Range (0-5%, 5-10%, 10-15%, 15-20%, 20-25%, 25-30%). Vertical axis: Number of Proteins (0 to 60). Series: All/All, Ionizable/Ionizable, Ionizable/All.]
Figure 3.2. Distribution of the 64 proteins across different values for the filtration ratio. Filtration ratios are expressed as: 1) All predicted residues over all residues; 2) Ionizable residues predicted over all ionizable residues; and 3) Ionizable residues predicted over all residues. All residues are predicted using the 6Å neighborhood criterion.
3.4.4 Comparison with other methods.
It is useful to compare the results of the present method with some other catalytic site prediction methods
that are based on 3D structure alone. The other methods used for this comparison are QSiteFinder 20 and
SARIG 35, both of which have publicly available servers, and Ko’s statistical analysis of THEMATICS 24.
Of the 64 proteins used for the THEMATICS-SVM test set, one was too large for both SARIG and
QSiteFinder and three others were too large for QSiteFinder. These four were deleted from the test set
and thus the comparison results reported here are for the remaining 60 proteins.
The average (per protein) values for recall, precision, filtration ratio, false positive rate and MCC of each
method are listed in Table 3.2. Two sets of results are given for QSiteFinder, one using only the top site
and the other using a combination of the top three sites. This combination of the top three sites is the
basis for the success rate reported for this method in the original article 20. The values in Table 3.2 use all
annotated residues, including non-ionizable residues, as the basis for the recall and all residues in the
protein as the basis for the filtration ratio for all of the methods. Thus the theoretical maximum recall rate
for THEMATICS-SVM (without the 6Å region) is less than 100%, because some known catalytic
residues are non-ionizable.
Figure 3.3 plots recall as a function of false positive rate. The solid line represents Wei’s THEMATICS
analysis with variable parameter α. Performance for Ko’s analysis, QSiteFinder, SARIG, and the present
SVM are depicted as points. The recall and the false positive rate of all the methods in this plot were
measured by ionizable residues only.
Method                       Recall   Precision   Filtration Ratio   False Positive Rate   MCC
THEMATICS-SVM                61%      20.0%       3.8%               3.1%                  0.31
THEMATICS-SVM-Region         80%      8.1%        13.2%              12.3%                 0.22
THEMATICS-Statistical (Ko)   44%      23.5%       2.6%               2.0%                  0.29
QSiteFinder (top one)        33%      4.6%        7.5%               7.2%                  0.094
QSiteFinder (top three)      61%      4.2%        16.2%              15.8%                 0.12
SARIG                        61%      8.0%        11.2%              10.6%                 0.18
Table 3.2. Comparison of THEMATICS-SVM and other methods. Shown in the table are the recall, precision, filtration ratio, false positive rate, and Matthews correlation coefficient (MCC) of THEMATICS as well as of other site prediction methods including THEMATICS-Statistical (Ko’s method), QSiteFinder, and SARIG. These quantities are per-protein averages over a comparison set consisting of the 60 proteins from the 64-protein test set for which results could be obtained with all methods.
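For reference, the false positive rate and the Matthews correlation coefficient reported in Table 3.2 can be computed from the four confusion-matrix counts of a single protein. The sketch below uses the standard textbook definitions; the function names are illustrative, and returning 0 when the MCC denominator vanishes is a common convention that is assumed here rather than taken from the text.

```python
import math

def false_positive_rate(fp, tn):
    """False positives over all true negatives (fp + tn)."""
    return fp / (fp + tn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # ASSUMED convention when a marginal count is zero
    return (tp * tn - fp * fn) / denom
```

The per-protein values would then be averaged over the comparison set, as described in the table caption.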
[Figure 3.3 plot. Horizontal axis: False positive rate (0 to 0.6). Vertical axis: Recall (0 to 1). Series: Wei's Method, SVM, Q-site Finder (largest one), Q-site Finder (largest three), Sarig, Ko's Method.]
Figure 3.3. Recall versus false positive rate (ROC plot) for the SVM and other methods. The curve for Wei’s method, a generalization of Ko’s method, is obtained by varying the parameter α.
The comparisons in Table 3.2 and Figure 3.3 show that THEMATICS-SVM can achieve catalytic
residue recall that is as good as or better than other methods, while simultaneously achieving substantially
better precision with lower filtration ratios and false positive rates. This translates to more localized, more
precise predictions of generally better quality, as reflected in the higher MCC. Compared with
Ko’s statistical analysis of THEMATICS, Table 3.2 shows that the SVM analysis gives significantly better
sensitivity to catalytic residues with only a small concomitant drop in precision. Figure 3.3 and the MCC
values in Table 3.2 further show that the tradeoff between recall and filtration ratio or false positive rate
is better with the SVM analysis: in Figure 3.3 it outperforms Wei’s method, which is itself a generalization
and extension of Ko’s method.
3.5 Discussion
3.5.1 Cluster number and size.
The SVM study (without the neighboring region) found 244 clusters as active site candidates, with an
average of 2.7 clusters per chain in the 64-protein test set. While the number of true positive clusters per
chain is probably somewhat higher than 1.0, 2.7 is too high. Ko’s statistical criteria reported 1.7 clusters
per chain, although I note that the present method was designed to increase the recall rate of true positive
residues. Among the 244 clusters reported by the present method, 90 of them are pairs and it appears that
many of these pairs are false positives resulting from the chance proximity of two residues with similar
pKa’s. Simple geometry-based rules have been investigated that do eliminate many of these pairs 58. The
average size of a cluster is 5.5 ionizable residues. These values include catalytic residues, binding
residues, and also some “second shell” residues, residues that are nearest neighbors of the known catalytic
residues and second-nearest neighbors of the reacting substrate. It is indeed possible that these “second
shell” residues play some supporting role in catalysis, or they may simply be subject to the same strongly
pH-dependent field as the catalytic residues. Experimental site-directed mutagenesis studies are in
progress to elucidate this.
3.5.2 Failure analysis.
Although this method effectively found the active sites in most proteins, there remain a few cases
where it failed to find the correct active sites. There are a variety of possible causes for this. In some cases,
the “failure” may not be a failure at all. An example is cytosolic ascorbate peroxidase (1APX), in which
the active site listed in CatRes is at the distal side of the heme ring. My method did not find that particular
site, but it did find a cluster at the proximal side of the heme ring, which has been identified as an active
site by several studies 59, 60. This suggests that some of the discrepancy between CatRes labeling and the
SVM results may result from incomplete or incorrect information in the reference database. Failure can
also occur in cases where the active site environment is very hydrophobic. In the case of DNA photolyase
(1DNP), the listed active site consists of three tryptophan residues, which are not ionizable. Even in that
case, the SVM method did find a cluster that lies between those active tryptophan residues and the
cofactors FAD and MTF. Indeed the predicted residues may actually be involved in the electron transfer
process, which proceeds from the tryptophan residues to the cofactors. In some other cases, the SVM
method found a cluster of residues that bind to a substrate or cofactors. These binding residues may
exhibit such strong perturbations in the titration curves that the actual catalytic residues are missed by the
classifier.
3.5.3 Analysis of high filtration ratio cases.
There are a few proteins for which the SVM method gives a high filtration ratio, meaning the active site
candidates our method produces constitute an unusually high fraction of the total residues in the protein.
Most of these cases are cofactor-dependent enzymes and/or systems with larger substrate molecules,
where the site truly is a larger fraction of the protein’s surface. The protein with the highest filtration ratio
is human Glyoxalase I (1FRO), for which the SVM selects 18% of the ionizable residues and
SVM+Region selects 27% of all residues. Glyoxalase I catalyzes the glutathione dependent conversion of
2-oxoaldehydes to 2-hydroxycarboxylic acids. Each of its subunits binds glutathione, substrate, and zinc.
In addition to the catalytic residue E172, the SVM method also finds the glutathione-binding residues
R37, E99, and R122, the zinc-binding residue H126, plus some interfacial residues. Another example is
arginine kinase (1BG0), which catalyzes phosphate transfer between arginine and ATP and thus its site
binds arginine, phosphate and MgADP. The predicted residues surround the site of interaction between
the arginine, the ATP, and the reacting phosphate group.
Still other cases of high filtration ratio result from small enzyme size. Enzyme IIA-lactose (1E2A), a part
of the lactose/cellobiose-specific family of enzymes II in the sugar phosphotransferase system, is such an
example. It has only 36 ionizable residues. For this protein the SVM method found two clusters of three
ionizable residues each, one of which is the known active site, giving a filtration ratio of 17% of ionizable
residues. The all-residue analysis yielded a 21% filtration ratio. In this case, the site of interaction truly
constitutes a high fraction of the total residues in the structure.
3.5.4 Some specific examples.
Type I 3-Dehydroquinate dehydratase from Salmonella typhi (PDB ID 1QFE) is an important enzyme
found in plants and microorganisms. It functions as part of the shikimate pathway, which is essential for
the biosynthesis of aromatic compounds including folate, ubiquinone and the aromatic amino acids. The
absence of the shikimate pathway in animals makes it an attractive target for antimicrobial agents. The
study of the structure, active sites and the reaction mechanism opens the way for the design of highly
specific enzyme inhibitors with potential importance as selective therapeutic agents 61. For this protein,
the SVM predicts a total of eight residues in two clusters, [E46, E86, D114, E116, H143, K170] and [D50,
H51]. The known catalytic residues are E86, H143, and K170 and are shown as red sticks in Figure 3.4.
The other five predicted residues are shown as yellow sticks. The three predicted catalytic residues, plus
one additional predicted residue (E46), are nearest neighbors of the reacting substrate molecule, as
determined by the LPC server 62. The remaining four residues are “second shell” residues, each of which
is in contact with at least two first shell residues. The elongated theoretical titration curves obtained for
the eight predicted residues reflect their membership in subsets of ionizable residues with similar pKa’s in
close proximity.
Phosphatidylinositol-specific phospholipase C exists in both eukaryotic and prokaryotic cells.
Catalyzing the hydrolysis of 1-phosphatidyl-D-myo-inositol-3,4,5-triphosphate into the second messenger
molecules diacylglycerol and inositol-1,4,5-triphosphate, it plays an important role in signal transduction
and other important biological processes 63. This catalytic process is tightly regulated by
reversible phosphorylation and binding of regulatory proteins. The SVM prediction for
Phosphatidylinositol phospholipase C from Listeria monocytogenes (PDB ID 2PLC) is shown in Figure
3.5. The correctly predicted active site residues are shown as red sticks and the additional predicted
residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region
selection, are shown in purple. Note how the additional (yellow) residues occupy a second layer
immediately surrounding the known true positive residues (red and purple).
Type I 3-Dehydroquinate dehydratase
Figure 3.4. SVM prediction for protein 1QFE. The SVM predicts a total of eight residues in two clusters, among which three known catalytic residues are shown as red sticks; the other five residues are shown as yellow sticks.
Phosphatidylinositol phospholipase C
Figure 3.5. The SVM prediction for 2PLC. The correctly predicted active site residues are shown as red sticks and the additional predicted residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region selection, are shown in purple.
3.6 Conclusions.
The THEMATICS-SVM method is a relatively straightforward method. Combining SVM with
THEMATICS achieves a higher recall rate for catalytic residues than earlier THEMATICS analyses with
only a small sacrifice in precision. Precision rates for THEMATICS predictions tend to be considerably
better than for other methods. The more localized, more specific predictions offer enhanced usefulness for
applications such as functional classification and specific ligand design.
The set of residues with perturbed titration behavior is a small subset of the set of ionizable residues with
strong electrostatic interactions or shifted pKa’s. Thus THEMATICS is more selective than simple
identification of the strongly interacting residues. A previous study 64 showing that titration-based
methods give a large number of false positives was based on the less selective electrostatic properties.
The present study confirms the comparatively low false positive rate for THEMATICS.
SVM+Region, the extended version of THEMATICS-SVM that incorporates residues within a 6Å sphere
of each predicted residue, does deliver improved recall, but with a sacrifice in precision. The major
advantage of THEMATICS over other 3D-structure-based methods is the superior precision;
SVM+Region loses most of this advantage.
THEMATICS selects ionizable residues with strong coupling between their protonation events. These
localized networks of interacting residues are good predictors of active sites. This feature may be a
fundamental property of enzyme active sites and may be a factor to consider in protein engineering.
3.7 Next step.
Although applying SVM with THEMATICS gave us a system predicting protein active sites with better
sensitivity and specificity, there is still room to improve. One possible approach would be using a larger
and better annotated training set and fine-tuning the SVM with different kernels and parameters. Most
likely, this approach would improve the performance of the THEMATICS-SVM method, but there are
limitations. The fundamental basis for this approach, as well as other methods using THEMATICS, is the
use of a binary classification system to select the THEMATICS-positive residues and cluster them based
on their physical proximity. If we could develop a new system that can rank order the residues based on
their likelihood of being in the active site, it would be much easier for a user to decide how far down the
list to go. Furthermore, if these rankings are based on probability estimates of each residue being in the
active site, we have principled ways to further combine the results from different systems to obtain a more
powerful system. So in the next chapter, I take a brand new approach and develop the Partial Order
Optimal Likelihood (POOL) method, which yields probability estimates rather than binary classification
decisions.
Chapter 4
New Method: Partial Order Optimal Likelihood (POOL).
4.1 Ways to estimate class probabilities.
For convenience, I use class probability to refer to $P(C = c \mid X_1 = x_1, \ldots, X_n = x_n)$, which is the
probability that an instance whose attributes $X_1$ to $X_n$ have values $x_1$ to $x_n$, respectively, belongs to class
c. Notice that it is different from another commonly used term, class-conditional probability, meaning
the probability $P(X_1 = x_1, \ldots, X_n = x_n \mid C = c)$. I refer to any method that estimates class probabilities
as a class probability estimator (CPE). I distinguish two classes of CPEs: One is parametric, which
involves selecting a function to model the class probabilities of training data, and then applying a learning
process to select the parameters of such a function giving the best fit to the training data. Examples are the
logistic regression method 65 and neural networks. The other class is nonparametric and includes the novel
POOL method we describe below. There is no pre-defined function in such nonparametric CPEs, and the
learning process estimates the class probabilities directly from the training data. I believe the
nonparametric CPEs are more general and I will focus my discussion on this class. There are different
nonparametric CPEs, with different ways of computing and representing the probability estimates. I will
introduce four of them as follows.
4.1.1 Simple joint probability table look-up.
The most conceptually straightforward way is to use a lookup table, having the same number of
dimensions as the number of attributes used to represent the data. The values of each feature are
quantized into small intervals. The probability estimates are calculated by taking the ratio of the
number of instances in each class to the total number of instances with each feature value combination.
The complete table has to be stored as the representation of the probability estimates. Clearly, both
computation and representation will be exponential in terms of the number of features. The quantization
is clearly necessary if the attributes are distributed over a continuum.
If we had an unlimited set of training data and it had the exact same distribution as that of the actual test
data, this would be the best learning system, because there is no information loss caused by any model
restriction. But in reality, the training set is always limited and there will be attribute combinations in the
test data that have never appeared in the training set. Because of this limitation, such a lookup table has
seldom been a realistic choice in any real problems. Some model abstraction has to be done to make
generalization from observed training instances to unknown test instances possible.
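The table look-up estimator and its main weakness can be illustrated in a few lines. This is a minimal sketch for the two-class case, assuming equal-width bins of a single hypothetical width for every feature; the function names and the toy data are not from the dissertation.

```python
from collections import defaultdict

def quantize(x, width):
    """Map a feature vector to the tuple of bin indices of its components."""
    return tuple(int(v // width) for v in x)

def build_table(instances, labels, width):
    """Count positives and totals per cell; labels are 0 or 1."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [positives, total]
    for x, c in zip(instances, labels):
        cell = counts[quantize(x, width)]
        cell[0] += c
        cell[1] += 1
    return counts

def lookup(table, x, width):
    """Return the estimate of P(C=1 | x), or None for an unseen cell."""
    cell = table.get(quantize(x, width))
    if cell is None:
        return None  # no training instance fell in this cell
    return cell[0] / cell[1]

# Hypothetical toy data: four training instances with two features each.
X = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (1.5, 1.7)]
y = [1, 0, 1, 0]
table = build_table(X, y, width=0.5)
```

A query falling in a cell that contains training data receives a ratio of counts, whereas a query in an empty cell receives no estimate at all, which is exactly the generalization problem described above. The number of cells also grows exponentially with the number of features.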
4.1.2 Naïve Bayes method.
Assuming conditional independence, i.e.,
\[ P(X_1 = x_1, \ldots, X_d = x_d \mid C = c) = \prod_{i=1}^{d} P(X_i = x_i \mid C = c), \tag{15} \]
one can use Bayes’ theorem to estimate the class probability P(class | attributes) by:
\[ P(C = c \mid X_1, \ldots, X_d) = \frac{P(C = c) \prod_{i=1}^{d} P(X_i \mid C = c)}{P(X_1, \ldots, X_d)} \propto P(C = c) \prod_{i=1}^{d} P(X_i \mid C = c). \tag{16} \]
P(C=c) is the probability of an instance being in one particular class c, and this probability can be
estimated by the ratio of the number of instances in c to the total number of instances in the training
examples. $P(X_i \mid C = c)$ can be estimated by counting: it is the ratio of the number of instances belonging
to class c with the corresponding attribute value to the number of instances in c. $P(X_1, \ldots, X_d)$ is
independent of the class C, and it can be treated as a constant. Since the class probabilities sum to 1 over
all classes, one can easily solve for the probability of each class, knowing their relative ratios. This method
is linear in both computation and representation of the probability estimates in terms of the number of
attributes d, since it only needs to compute and store the $P(X_i \mid C)$ and $P(C)$ estimates.
This is an indirect method in the sense that it estimates the class probability P(class | attributes) indirectly:
it first estimates the class-conditional probability P(attributes | class), using conditional independence, and
then converts it with Bayes’ theorem. This method also requires quantization, unless some parametric form of
the probability density function is used.
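The counting procedure behind equations (15) and (16) can be sketched as follows. This is an illustration under stated assumptions: attribute values are already quantized to discrete symbols, no smoothing of zero counts is applied, and the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, labels):
    """Estimate P(C) and P(X_i | C) by counting (no smoothing)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for x, c in zip(instances, labels):
        for i, v in enumerate(x):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts

def class_probabilities(model, x):
    """Return P(C = c | x) for every class c via equations (15) and (16)."""
    class_counts, value_counts = model
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n  # estimate of P(C = c)
        for i, v in enumerate(x):
            score *= value_counts[(i, c)][v] / n_c  # estimate of P(X_i = v | C = c)
        scores[c] = score
    total = sum(scores.values())
    # Normalizing over classes replaces the division by P(X_1, ..., X_d).
    return {c: s / total for c, s in scores.items()} if total > 0 else scores

# Hypothetical toy data: two binary attributes, three instances per class.
X = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0), (0, 1)]
y = [1, 1, 1, 0, 0, 0]
probs = class_probabilities(train_naive_bayes(X, y), (1, 1))
```

Only one table of counts per attribute and class is stored, which is why the cost grows linearly with the number of attributes.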
Although I do use a naïve Bayes method to combine probabilities from different types of features in the
application, it is not enough because it is not apparent how one can apply the THEMATICS principles in
this method, since the constraints themselves are on class probabilities instead of the class-conditional
probabilities. Furthermore, because some of the features used in THEMATICS are clearly correlated, the
conditional independence assumption is violated, which makes the naïve Bayes probability estimates
suspect.
4.1.3 The K-nearest-neighbor method.
The third method is the k-nearest-neighbor method. It estimates the class probability P(class
| attributes) by taking the k training instances with the “closest” attributes to the query point and then
computing the fraction of them that are in that particular class. This is a direct method since it gives the direct
estimate of class probability and it does this estimation without direct quantization. In some sense, it
implicitly quantizes the attributes by taking the range of the values from the nearest neighbors. This
method has a lazy evaluation feature since the estimates of a certain feature combination are calculated at
query time. All the training instances have to be stored, and unless some indexing scheme is used, the
computation and representation cost of this method is linear in the number of training instances.
This method can estimate the class probabilities when there are no matches between the attributes of the
query instance and the training instances, and with properly selected k, it may even reduce some of the
negative effect caused by noise in the training data, but there is no apparent way to apply THEMATICS
principles with this method. Furthermore, correlation in the THEMATICS features is a potential problem,
although a correctly weighted distance function can compensate for this.
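A minimal sketch of the k-nearest-neighbor class probability estimate is given below. It assumes an unweighted Euclidean distance and a linear scan over the stored training instances; the function name and the toy data are hypothetical.

```python
import math

def knn_class_probability(train_x, train_c, query, k, target=1):
    """Estimate P(C = target | query) as the fraction of the k nearest
    training instances (Euclidean distance) that belong to `target`."""
    # Lazy evaluation: all work happens at query time, on the stored instances.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(1 for i in nearest if train_c[i] == target) / len(nearest)

# Hypothetical toy data: three instances near the origin, two far away.
train_x = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0), (1.1, 0.9)]
train_c = [1, 1, 0, 0, 0]
estimate = knn_class_probability(train_x, train_c, (0.05, 0.05), k=3)
```

The query point itself never occurs in the training set, yet it still receives an estimate from its three nearest neighbors. Replacing the Euclidean distance by a weighted one is where a correction for correlated features would enter.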
4.1.4 POOL method.
The fourth method is the one proposed here, the Partial Order Optimal Likelihood (POOL) method, the
details of which will be introduced in the next section.
It finds the “best”, in the sense of maximizing the likelihood, among all possible estimates of the class
probabilities P(class | attributes) that satisfy given monotonicity assumptions. It uses convex optimization
to update the probability estimates of the training instances. It stores these probability estimates as
reference points and computes the probability estimates of query instances at query time, although in
principle one can compute the whole table and store it. With lazy evaluation, this method has a linear
computation and representation cost in terms of the number of training instances.
It is a direct method, since it estimates P(class | attributes) directly. In principle it does not need
quantization, although in our application, we have used quantization for convenience. It is a learning
system that makes full use of training data and enforces some prior belief, like the THEMATICS
principles, to minimize the effect caused by noise in the training set. The POOL method is conceptually
simple, computationally efficient, and appears to be effective in practice, as will be shown below.
4.1.5 Combining CPE’s.
Although one can put all the features together and use one CPE to estimate class probability in one step, it
may not be the best choice in practice. In most cases, it is better to group features into different groups
and use several, say l, CPE’s, each with one group of features $\vec{X}_i$. Each of these smaller CPE’s can be
obtained by any of the methods described above, as well as any parametric methods. At query time, the
class probability $P(C = c \mid \vec{X}_1, \ldots, \vec{X}_l)$ can be estimated based on the class probability estimates
$P(C = c \mid \vec{X}_i)$ from each of the smaller CPE’s according to the naïve Bayes combination rule:
\[ P(C = c \mid \vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l) \propto P(C = c) \prod_{i=1}^{l} \frac{P(C = c \mid \vec{X}_i = \vec{x}_i)}{P(C = c)}, \tag{17} \]
which is easily derived from Bayes’ Theorem under the assumption that $\{(\vec{X}_i \mid C)\}_{i=1}^{l}$ are mutually
independent random variables, i.e.,
\[ P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l \mid C = c) = \prod_{i=1}^{l} P(\vec{X}_i = \vec{x}_i \mid C = c). \tag{18} \]
We use the term chaining to mean the application of this naïve Bayes combination rule.
Although strictly speaking, one can only use the chaining to estimate the class probability when the
conditional independence assumption holds, in practice, even if the conditional independence assumption
is not strictly true one might still be able to get some useful results, as with other applications of naïve
Bayes, especially when it comes to relative rankings 66.
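For the two-class case, chaining can be sketched in a few lines. The function below applies the combination rule (17) to the positive class and to its complement and then normalizes; the name chain is illustrative, and the prior is assumed to lie strictly between 0 and 1.

```python
def chain(prior, estimates):
    """Combine per-group estimates of P(C=1 | X_i) by equation (17).

    prior     : P(C = 1), assumed strictly between 0 and 1
    estimates : list of P(C = 1 | X_i = x_i), one per feature group
    """
    pos, neg = prior, 1.0 - prior
    for p in estimates:
        pos *= p / prior                   # factor for class 1
        neg *= (1.0 - p) / (1.0 - prior)   # factor for class 0
    return pos / (pos + neg)  # normalize over the two classes
```

With a single estimate the rule returns that estimate unchanged, and two groups that each report a probability above the prior reinforce each other, as expected of a product rule.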
4.2 POOL (Partial Order Optimal Likelihood) method in detail.
4.2.1 Maximum likelihood problem with monotonicity assumption.
Before introducing POOL, I first introduce the notion of a class probability monotonicity assumption:
Definition 1: Class Probability Monotonicity Assumption:
Class Probability Monotonicity Assumption refers to a property that for each class c, given a partial order
$\prec_c$ on the attribute space $\mathcal{X}$, for any $x_i, x_j \in \mathcal{X}$ with $x_i \prec_c x_j$, it must be
that $P(C = c \mid X = x_i) \le P(C = c \mid X = x_j)$. For 2-class classification, $\prec_-$ is the opposite of $\prec_+$, i.e.,
$x_i \prec_+ x_j$ iff $x_j \prec_- x_i$.
An important special case that inspired my development of this approach is when the attribute space has
the form $\mathbb{R}^n$. In this case, we can define a partial order on $\mathcal{X}$ as follows:
Given $x = (x_1, \ldots, x_n) \in \mathcal{X}$ and $y = (y_1, \ldots, y_n) \in \mathcal{X}$, define $x \prec y$ if $x_i \le y_i$, $\forall i$.
We call this the coordinatewise partial order on $\mathcal{X}$ induced by the ordering on $\mathbb{R}$.
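The coordinatewise partial order translates directly into code. The sketch below tests whether one point precedes another and enumerates the ordered pairs among a set of training points; each such pair corresponds to one inequality between class probabilities under the monotonicity assumption. The function names and the toy points are hypothetical.

```python
def precedes(x, y):
    """True if x precedes y in the coordinatewise partial order."""
    return len(x) == len(y) and all(a <= b for a, b in zip(x, y))

def constraint_pairs(points):
    """List index pairs (i, j), i != j, with points[i] preceding points[j].
    Under the monotonicity assumption each pair implies p_i <= p_j."""
    return [(i, j)
            for i, x in enumerate(points)
            for j, y in enumerate(points)
            if i != j and precedes(x, y)]

# Hypothetical toy points in the plane.
pts = [(1, 1), (2, 3), (3, 2)]
pairs = constraint_pairs(pts)
```

Here (1, 1) precedes both other points, while (2, 3) and (3, 2) are incomparable, so no constraint relates their class probabilities.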
The basic idea of POOL is to find a CPE for which the training data likelihood is highest among all CPEs
that conform to the monotonicity assumptions. The effect of the monotonicity assumptions is to create a
set of inequality constraints that relate the class probabilities of certain pairs of points.
Definition 2: Likelihood function L(H) of hypothesis H.
Assume given a hypothesis space containing a family of hypotheses H, i.e., probability density functions
(for continuous distributions) or probability mass functions (for discrete distributions), and $\vec{X}_1, \ldots, \vec{X}_n$ as
n random draws with an actual sample $\vec{x}_1, \ldots, \vec{x}_n$. Since they are iid, by definition:
\[ P(\vec{X}_i = \vec{x} \mid H) = P(\vec{X}_j = \vec{x} \mid H) \quad \text{for any } i, j, \vec{x}, H, \]
and for each hypothesis H, we may compute the probability density that we observed $\vec{X}_1, \ldots, \vec{X}_n$ as a
function of H,
\[ L(H) = P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_n = \vec{x}_n \mid H). \tag{19} \]
Now, restricting attention to 2-class problems, where the class labels are 0 and 1, I
define $p_i = P(C_i = 1 \mid H, \vec{X}_i = \vec{x}_i)$, and we can write
\[ P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i) = p_i^{c_i} (1 - p_i)^{1 - c_i}. \tag{20} \]
Then,
\[ L(H) = \prod_{i=1}^{n} P(\vec{X}_i = \vec{x}_i, C_i = c_i \mid H) = \prod_{i=1}^{n} P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i)\, P(\vec{X}_i = \vec{x}_i). \tag{21} \]
After substitution of (20), the likelihood function in our problem becomes:
\[ L(H) = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}\, P(\vec{X}_i = \vec{x}_i) \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}. \tag{22} \]
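In practice it is more convenient to work with the logarithm of the right-hand side of (22), which turns the product into a sum. The following sketch evaluates that log-likelihood for given probability estimates and 0/1 labels; the function name is illustrative, and returning negative infinity when a hypothesis assigns probability zero to an observed label is a convention assumed here.

```python
import math

def log_likelihood(p, c):
    """Log of the right-hand side of (22):
    sum over i of c_i*log(p_i) + (1 - c_i)*log(1 - p_i), labels c_i in {0, 1}."""
    total = 0.0
    for p_i, c_i in zip(p, c):
        # Only the factor selected by the label contributes to the product.
        factor = p_i if c_i == 1 else 1.0 - p_i
        if factor == 0.0:
            return float("-inf")  # the hypothesis rules out an observed label
        total += math.log(factor)
    return total
```

Because the logarithm is monotone, maximizing this quantity and maximizing L select the same probability estimates.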
Given a monotonicity assumption, finding maximum likelihood probability estimates becomes a
constrained optimization problem: Maximize L subject to a set of inequality constraints of the
form $p_i \le p_j$, one for each $(\vec{x}_i, \vec{x}_j)$ pair in the training data generating the partial order via transitive
closure. The solution of this problem assigns probability estimates to only the training data. For attribute
combinations not observed in the training data, the monotonicity constraints only determine upper and
lower bounds on their class probabilities; one can then use some form of interpolation to assign actual
estimates.
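The bounds that the monotonicity constraints place on an unobserved attribute combination can be sketched as follows, assuming the coordinatewise partial order and training estimates that already satisfy the constraints. The midpoint rule in interpolate is only one possible choice and is an assumption of this sketch, as are the function names and the toy values.

```python
def precedes(x, y):
    """Coordinatewise partial order: every component of x is <= that of y."""
    return all(a <= b for a, b in zip(x, y))

def probability_bounds(train_x, train_p, query):
    """Bounds on the class probability at `query` implied by monotonicity.

    lower: largest estimate among training points that precede the query
    upper: smallest estimate among training points that the query precedes
    """
    lower = max((p for x, p in zip(train_x, train_p) if precedes(x, query)),
                default=0.0)
    upper = min((p for x, p in zip(train_x, train_p) if precedes(query, x)),
                default=1.0)
    return lower, upper

def interpolate(lower, upper):
    """One possible choice (an ASSUMPTION here): the midpoint of the bounds."""
    return 0.5 * (lower + upper)

# Hypothetical training points with estimates that respect the partial order.
train_x = [(1, 1), (3, 3), (2, 5)]
train_p = [0.1, 0.6, 0.4]
bounds = probability_bounds(train_x, train_p, (2, 2))
```

For the query (2, 2) the point (1, 1) supplies the lower bound and the smaller of the estimates at (3, 3) and (2, 5) supplies the upper bound. A query that is incomparable to every training point receives only the trivial bounds 0 and 1.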
4.2.2 Convex optimization and K.K.T. conditions.
Given a real vector space X (see footnote 1) together with a convex, real-valued function
\[ f : A \to \mathbb{R}, \]
defined on a convex subset A of X, convex optimization is the problem of finding the point x in A for which
the number f(x) is smallest.
Convex optimization has been studied for a long time. It has many good properties; for example, any local
minimum is also a global minimum, so methods like gradient descent can
solve the problem without the danger of being stuck in a local minimum that is not the global
minimum. Many methods have been developed to solve it efficiently. There are standard methods like
gradient projection methods, line-searching methods, interior-point methods and some specialized
versions dedicated to some specific problems of this form. How to solve convex optimization problems
in general is still an active research area.
One way to find the constrained optimal point in a convex optimization problem, and to prove its optimality, is to use the
Karush-Kuhn-Tucker conditions (K.K.T. conditions), a generalization of the method of Lagrange multipliers.
Footnote 1: In order to describe the convex optimization problem in its general form, I use X to denote the vector space, and x to denote a vector (or point) in that space, until further notice. As a general rule, whenever I discuss general optimization problems, I use x, while p is used in specializing to solve for probabilities, as in later sections.
K.K.T. conditions:
Consider the problem:
minimize f(x)
subject to gi(x) ≤ 0 and hj(x) = 0
where f(x) is the function to be minimized, gi(x) (i = 1,…,m) are the inequality constraints and hj(x) (j =
1,…,l) are the equality constraints.
Necessary conditions:
Suppose that the objective function ℜ→ℜnf : and the constraint functions ℜ→ℜnig : and
ℜ→ℜnjh : are continuously differentiable at a point x* ∈S. If x* is a local minimum, then there exist
constants 0,0 ≥> iµλ (i = 1,…,m) and jν (j = 1,…,l) such that
∑ ∑= =
=∇+∇+∇m
i
l
jjjii xhxgxf
1 1
*** 0)(*)(*)(* νµλ (23)
and
0)( =∗ xgiiµ for all i = 1,…,m. (24)
Sufficient conditions:
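If the objective function $f: \mathbb{R}^n \to \mathbb{R}$ and the constraint functions $g_i: \mathbb{R}^n \to \mathbb{R}$ and $h_j: \mathbb{R}^n \to \mathbb{R}$ are convex functions, the point $x^* \in S$ is a feasible point, and there exist $\mu_i \ge 0$ (i = 1,…,m) and $\nu_j$ (j = 1,…,l) such that
$$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) + \sum_{j=1}^{l} \nu_j \nabla h_j(x^*) = 0 \tag{25}$$
and
$$\mu_i\, g_i(x^*) = 0 \quad \text{for all } i = 1,\ldots,m, \tag{26}$$
then $x^*$ is the global minimum.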
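As an illustration of how conditions (25) and (26) can be checked numerically, the following Python sketch tests a candidate point and multipliers for a problem with inequality constraints only. The function name `kkt_holds`, its argument layout and the small quadratic example are assumptions introduced here for illustration; they are not taken from the original program.

```python
def kkt_holds(x, mus, grad_f, constraints, tol=1e-9):
    """Check feasibility, mu >= 0, stationarity (25) and complementary slackness (26).

    constraints is a list of (g, grad_g) pairs, one per inequality g(x) <= 0.
    """
    residual = list(grad_f(x))            # starts as grad f(x)
    for mu, (g, grad_g) in zip(mus, constraints):
        if mu < 0 or g(x) > tol:          # multiplier sign or feasibility violated
            return False
        if abs(mu * g(x)) > tol:          # complementary slackness violated
            return False
        for k, d in enumerate(grad_g(x)):
            residual[k] += mu * d         # add mu_i * grad g_i(x)
    return all(abs(r) <= tol for r in residual)


# Hypothetical example: minimize (x0 - 2)^2 + (x1 - 1)^2 subject to x0 - x1 <= 0.
grad_f = lambda x: [2 * (x[0] - 2), 2 * (x[1] - 1)]
constraints = [(lambda x: x[0] - x[1], lambda x: [1.0, -1.0])]

at_optimum = kkt_holds([1.5, 1.5], [1.0], grad_f, constraints)
at_unconstrained = kkt_holds([2.0, 1.0], [0.0], grad_f, constraints)
```

In this example the constrained minimum lies on the boundary x0 = x1, so the multiplier is positive there, while the unconstrained minimum (2, 1) is rejected because it is infeasible.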
4.2.3 Finding Minimum Sum of Squared Error (SSE).
In addition to likelihood, sum of squared error is another commonly used measurement of how well the
model fits the data, and it is easier to work with than the likelihood L. In this section, I will present an
approach to compute minimum SSE under the monotonicity assumption, and in the next section, I will
prove that we can find the maximum of likelihood L by finding the minimum of SSE.
Definition 3: Sum of squared error (SSE). To estimate how close the estimated function, in our case the class probability estimate $p(\vec{x})$ (see footnote 2), is to the n observed $(\vec{x}, c)$ pairs, we compute
$$SSE = \sum_{i=1}^{n} (c_i - p_i)^2, \quad \text{where } p_i = p(\vec{x}_i). \tag{27}$$
Let $\vec{p} = (p_1, \ldots, p_n)$. SSE is a quadratic function of $\vec{p}$, and the class probability monotonicity assumptions form a set of linear inequality constraints. This problem is therefore a special case of convex optimization called quadratic programming, which is itself a well-studied class of convex optimization problems. In fact, training an SVM, a relatively recent machine learning method, is itself a quadratic programming problem.
I developed the POOL algorithm, described below, to find the solution generating minimum SSE (and therefore maximum likelihood) under the monotonicity constraints. Very recently, I discovered some existing literature describing an approach called isotonic or monotonic regression [67], but most of this work is focused on one-dimensional problems. There is also earlier literature focusing on the total order case; the pool adjacent violators algorithm (PAVA) [68] and monotonic smoothing [69] are such examples. Although some of these reports point out that the extension of this problem to multiple dimensions could be framed as convex optimization, the emphasis of the literature in this field seems to be primarily on one-dimensional problems.
Footnote 2: In this specific case of estimating class probabilities, I use $\vec{x}$ to denote the instances and $\vec{p}$ to denote the vector to which I assign values to minimize SSE.
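For comparison with the partial-order setting treated in this chapter, the total-order case mentioned above can be sketched in a few lines of Python. This is a generic weighted pool-adjacent-violators routine, not code from the cited references; the function name `pava` and the count-based inputs `n_pos` and `n_tot` (positives and totals per instance, listed in the required order) are assumptions made for illustration.

```python
def pava(n_pos, n_tot):
    """Weighted pool-adjacent-violators for the total order p[0] <= p[1] <= ...

    Each block stores [sum of positives, sum of totals, number of instances].
    """
    blocks = []
    for pos, tot in zip(n_pos, n_tot):
        blocks.append([pos, tot, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        # Means are compared by cross-multiplication (all totals are positive).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            pos2, tot2, cnt2 = blocks.pop()
            blocks[-1][0] += pos2
            blocks[-1][1] += tot2
            blocks[-1][2] += cnt2
    fitted = []
    for pos, tot, cnt in blocks:
        fitted.extend([pos / tot] * cnt)   # every instance gets its pool average
    return fitted


example_fit = pava([1, 3, 1], [4, 4, 4])
```

Here the raw ratios 0.25, 0.75, 0.25 violate the order between the second and third instances, so those two are pooled to 4/8.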
Compared with convex optimization in its general form, the present case is very special. First, I want to optimize a summation of terms, each of which involves only one component of the vector; i.e., there are no cross products between different components of the vector. This feature leads to a very simple gradient of the target function S (the SSE): after rescaling coordinates, the direction toward the global optimum of S can be determined locally, by choosing each component variable $p_i$ to optimize the ith term in S, subject to the constraints still being met.
The constraints are special, too. Each constraint is a linear inequality constraint containing two terms, in
addition to the implicit constraints of 0≤pi ≤1 implied by pi being probabilities. Or, in formal terms, this
is a sparse problem.
Another special feature of the present problem is that by a simple scaling of coordinates in space, the
negative gradient vector at any point points directly to the unconstrained global optimal point, i.e. if one
knows the “best improving” direction at one point, following that direction (in a straight line) will lead to
the optimal point.
We start from a point where all constraints are active, where active means that an inequality constraint is met with equality, i.e., the constraint $p_i \le p_j$ is actually satisfied as $p_i = p_j$. Because all the constraints are linear and the gradient never changes direction along the path from a point x to the unconstrained global optimal point O, we have a very special property: if the “best improving” direction moves “away” from an active constraint, that constraint will not become active again in the optimized solution. It is also true that if the “best improving” direction “runs into” an active constraint, the constraint will still be active in the final optimum solution. This makes it possible to determine, at the starting point, the active constraints in the final solution; once these are determined, it is easy to calculate the exact $\vec{p}$ that gives the constrained optimal value of the objective function $f(\vec{p})$. Since it is known that the optimum solution is achieved with these constraints active, meaning the corresponding probabilities are equal, all one needs to do is partition the data into pools that share the same (average) probability, as required by the active constraints, to get the exact solution.
These special conditions make it possible to develop some special algorithms, like the one that will be
presented in the next subsection, to efficiently and accurately solve this subclass of convex optimization
problem. In a more general form of convex optimization problem without these special conditions, a
solution can only be approximated by numeric methods to a certain degree of accuracy, and typically
involves re-computing active constraint sets as the algorithm iterates toward a solution.
4.2.4 POOL algorithm.
The input to the algorithm is the training set D and the constraint matrix $C_{n \times m}$. Each column in this matrix corresponds to a single constraint of the form $x_i \le x_j$ and contains two non-zero entries.
This algorithm consists of three steps:
• Determine the active set A, which consists of all the active constraints, at the starting point.
• Given A, compute the corresponding partitioning (pools) of data.
• For each pool, compute the average across all data in the pool and assign this common average
value to each instance in the pool.
POOL(D, $C_{n \times m}$)
• Initialize starting point S as origin.
• Compute $\vec{G} \leftarrow \alpha * \nabla f(S)$. (In our re-scaled case: $G_i \leftarrow n_{+,i} / \sqrt{n_{\mathrm{total},i}}$, where $n_{+,i}$ and $n_{\mathrm{total},i}$ are the number of positive samples and the number of total samples in the ith instance.)
• Initialize $\vec{\mu} \leftarrow \vec{0}$.
• Until termination condition* has been met:
    • Compute $\vec{H} \leftarrow C_{n \times m} \times \vec{\mu}$
    • Compute $\vec{F} \leftarrow \vec{G} - \vec{H}$
    • Compute $\Delta\vec{\mu} \leftarrow C_{n \times m}^{T} \times \vec{F}$
    • $\vec{\mu} \leftarrow \vec{\mu} + \alpha * \Delta\vec{\mu}$
    • If $\mu_i < 0$ then $\mu_i \leftarrow 0$ (i = 1,…,m)
• Build transitive closure of x:
    • Let each $x_i$ (i = 1,…,n) be in its own set.
    • For each $\mu_i$ (i = 1,…,m), if $\mu_i > 0$, look in the constraint matrix $C_{n \times m}$ to find the two non-zero entries $C_{ai}$ and $C_{bi}$ in the ith column and, if $x_a$ and $x_b$ are not already in the same set, union the sets containing $x_a$ and $x_b$ into a new set that replaces the two.
• Go through all the sets built in the above step. Set $p_i$ (i = 1,…,n) to be the sum of $n_{+}$ of all the x in the same set divided by the sum of all the $n_{\mathrm{total}}$ in that set.
* The termination condition is the threshold set by the user to determine convergence of $\vec{F}$; in our program, the size of $\vec{F}$ is computed until the difference between two iterations is less than $10^{-9}$.
α is used as the step size to control the rate of change in moving toward the improving direction; in our case, it is set as 0.05.
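The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original program: the training set D is summarized by per-instance counts `n_pos` and `n_tot` (every `n_tot` must be positive), each constraint is passed as an index pair `(a, b)` meaning $p_a \le p_b$ instead of a matrix column, and the column entries are assumed to be $+1/\sqrt{n_{\mathrm{total},a}}$ and $-1/\sqrt{n_{\mathrm{total},b}}$, which is what the constraint becomes in the rescaled coordinates. The names `pool`, `alpha`, `tol` and `max_iter` are hypothetical.

```python
import math


def pool(n_pos, n_tot, constraints, alpha=0.05, tol=1e-9, max_iter=100000):
    """Sketch of POOL. constraints is a list of (a, b) pairs meaning p[a] <= p[b]."""
    n = len(n_tot)
    root = [math.sqrt(t) for t in n_tot]
    # Rescaled unconstrained optimum: G_i = n_pos_i / sqrt(n_tot_i).
    G = [n_pos[i] / root[i] for i in range(n)]
    mu = [0.0] * len(constraints)
    prev_size = None
    for _ in range(max_iter):
        # H = C x mu, where column (a, b) holds +1/root[a] and -1/root[b].
        H = [0.0] * n
        for m_l, (a, b) in zip(mu, constraints):
            H[a] += m_l / root[a]
            H[b] -= m_l / root[b]
        F = [g - h for g, h in zip(G, H)]
        size = math.sqrt(sum(f * f for f in F))
        if prev_size is not None and abs(prev_size - size) < tol:
            break
        prev_size = size
        # delta_mu = C^T x F, followed by a projected step (clip at zero).
        for l, (a, b) in enumerate(constraints):
            delta = F[a] / root[a] - F[b] / root[b]
            mu[l] = max(0.0, mu[l] + alpha * delta)

    # Union-find over instances joined by active constraints (mu > 0).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for m_l, (a, b) in zip(mu, constraints):
        if m_l > 0:
            parent[find(a)] = find(b)

    # Pool average: total positives over total samples within each set.
    pos_sum, tot_sum = {}, {}
    for i in range(n):
        r = find(i)
        pos_sum[r] = pos_sum.get(r, 0) + n_pos[i]
        tot_sum[r] = tot_sum.get(r, 0) + n_tot[i]
    return [pos_sum[find(i)] / tot_sum[find(i)] for i in range(n)]


chain_fit = pool([1, 3, 1], [4, 4, 4], [(0, 1), (1, 2)])
```

For the chain example, only the second constraint ends with a positive multiplier, so the second and third instances are pooled. With the fixed step size 0.05 the iteration is only expected to be stable for small, sparse constraint sets like this one; a larger problem may need a smaller step.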
In the above algorithm, the first four steps compute the active set A by solving the dual problem of determining the $\mu_i$ that minimize $\left\| \vec{G} - \sum_{i=1}^{m} \mu_i \vec{C}_i \right\|^2$, subject to $\mu_i \ge 0$ for all i. Constrained gradient descent such as gradient projection is used to solve this. The constraint $g_i$ is active iff $\mu_i > 0$.
The reason that I can apply the active set A determined at the starting point at the solution point is the fact
that I rescale the coordinate system, thus preventing the distortion of spherical gradient by different
weights of each term in the objective function.
The original objective function is
$$SSE = \sum_{i=1}^{n} \left( n_{\mathrm{total},i}\, p_i^2 - 2\, n_{+,i}\, p_i + n_{+,i} \right). \tag{28}$$
After applying the appropriate transformation
$$p_i' = \sqrt{n_{\mathrm{total},i}}\; p_i, \tag{29}$$
the objective function becomes
$$SSE = \sum_{i=1}^{n} \left( p_i'^{\,2} - 2\, \frac{n_{+,i}}{\sqrt{n_{\mathrm{total},i}}}\, p_i' + n_{+,i} \right), \tag{30}$$
which has spherical level surfaces, since all quadratic terms have the same coefficient.
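As a quick numerical check of the rescaling, the sketch below evaluates the SSE in three ways: sample by sample, from per-instance counts, and in coordinates rescaled by the square root of the per-instance total. The function names and the toy labels are assumptions made for this illustration only.

```python
import math


def sse_per_sample(labels_by_instance, p):
    """Sum of (c - p_i)^2 over every individual 0/1 label c of every instance i."""
    return sum((c - p_i) ** 2
               for labels, p_i in zip(labels_by_instance, p)
               for c in labels)


def sse_from_counts(n_pos, n_tot, p):
    """Count-based form: n_tot * p^2 - 2 * n_pos * p + n_pos per instance."""
    return sum(t * q * q - 2 * k * q + k for k, t, q in zip(n_pos, n_tot, p))


def sse_rescaled(n_pos, n_tot, p):
    """Same value in rescaled coordinates p' = sqrt(n_tot) * p."""
    total = 0.0
    for k, t, q in zip(n_pos, n_tot, p):
        q_prime = math.sqrt(t) * q
        total += q_prime ** 2 - 2 * (k / math.sqrt(t)) * q_prime + k
    return total


labels = [[1, 0, 0, 1], [1, 1, 0]]          # hypothetical observations
n_pos = [sum(cell) for cell in labels]      # [2, 2]
n_tot = [len(cell) for cell in labels]      # [4, 3]
p = [0.3, 0.6]

direct = sse_per_sample(labels, p)
by_counts = sse_from_counts(n_pos, n_tot, p)
by_rescaling = sse_rescaled(n_pos, n_tot, p)
```

All three values agree up to floating-point rounding, and in the rescaled form every squared term has coefficient 1.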
Usually, n and m are large numbers but $C_{n \times m}$ is sparse because most of its entries are 0. To improve efficiency in storage and computing, I use the indices of the non-zero terms of $C_{n \times m}$ instead of the matrix itself.
In practice, my implementation of the POOL algorithm runs very fast.
4.2.5 Proof that the POOL algorithm finds the minimum SSE.
In this subsection, I will prove that the POOL algorithm gives the minimum SSE solution under the
monotonicity constraints.
In the present case, the objective function is a quadratic function of x (see footnote 3), the constraint functions $g_i$ are linear functions and there are no equality constraints $h_j$, so both the necessary and sufficient KKT conditions hold. This also implies that, if one can find $\mu_i \ge 0$ (i = 1,…,m) satisfying the above two equations, one finds the global minimum x*.
Since there are no equality constraints in the present problem, one does not have to consider $\nu_j$ and the last term in the K.K.T. equations (23) and (25) above. Here I show my approach to find $\mu_i$.
Footnote 3: In order to make the proof consistent with the form stated in section 4.2.2, I use x instead of p as in section 4.2.3; the function value we are minimizing is still SSE.
First, notice that all of the constraints can be put into two categories, one is 0≤xi ≤1 for all i=1,…,n; the
other is xi≤xj for some i and j. The first category is automatically satisfied from the starting point to the
end during the optimization process. There is an interesting feature for the m constraints of the second
category: all the bounding hyperplanes of the feasible region intersect at a line where the xi are equal for
all i=1,…,n. Our algorithm takes one point on this line, where all xi are equal to 0, as the starting point.
The outward pointing normals of the m constraints at the starting point form a convex cone. There is also a vector $\vec{G}$ at that point, pointing to the global minimum of the objective function when there are no constraints. $\vec{G}$ is the negative gradient of the objective function at x.
As mentioned earlier in 4.2.3, the present problem has a special feature that, by a simple scaling of coordinates in space, the negative gradient vector $\vec{G}$ becomes a constant vector $\vec{O}$ (pointing to where the unconstrained minimum of f(x) is located) minus the vector $\vec{X}$, where $\vec{X}$ is the vector from the origin to the point x. With the starting point at the origin, $\vec{O}$ is the value of $\vec{G}$ at the starting point, so
$$\nabla f(x) = \vec{X} - \vec{G} \tag{31}$$
So, the key point is to find the active constraints in the final solution at the starting point S. Without loss of generality, for the sake of convenience in stating the proof, we assume the starting point S is located at the origin. As Figure 4.1 shows, there are three cases based on where $\vec{G}$ points.
[Figure 4.1, panels (a), (b) and (c): diagrams of the starting point S, the unconstrained optimum O*, the constraint vectors $\vec{C}_1$ and $\vec{C}_2$, and the vector $\vec{G}$. In (a), O* coincides with x*; in (b), S coincides with x*; in (c), $\vec{H}$ and x* are also shown.]
Figure 4.1. Three cases of $\vec{G}$ in relation to the convex cone of constraints. Define the convex cone as $\{\vec{V} \mid \vec{V} = \sum_i w_i \vec{C}_i \text{ with } w_i \ge 0\ \forall i\}$. $\vec{G}$ is the negative gradient of the objective function at the starting point S, pointing to the global optimum O*. $\vec{C}_1$ and $\vec{C}_2$ are the constraint vectors forming the convex cone. $\vec{H}$ is the projection of $\vec{G}$ on the convex cone formed by the constraint vectors and x* is the solution that gives the constrained optimum for the objective function. 4.1(a) shows the case where the unconstrained global minimum is located in the feasible region; 4.1(b) shows the case where the unconstrained global minimum is located inside the convex cone formed by the constraint normals; 4.1(c) shows the case where the unconstrained global minimum is located outside both the feasible region and the convex cone formed by the constraint normals.
Figure 4.1(a) shows the special case where the unconstrained optimum O is located in the feasible region, meaning $\vec{O}^*$ points at O. This is the simplest scenario. All one needs to do is compute $\vec{G}$; x* is then obtained by adding $\vec{G}$ to $\vec{S}$.
It is trivial to show that x* satisfies the KKT sufficient condition by letting $\mu_i = 0$ (i = 1,…,m), since x* is where the unconstrained global optimum O is, so $\nabla f(x^*)$ is 0. Another way to look at it is that a global optimum is always a local optimum.
Figure 4.1(b) shows another special case where all constraints have to be active to get the local minimum, and the starting point S happens to be the optimal point x*; in other words, there is a non-negative assignment of $\mu_i$ (i = 1,…,m) that makes $\vec{G} = \sum_{i=1}^{m} \mu_i \vec{C}_i$, where $\vec{C}_i$ is the outward normal of the half-space defined by the ith inequality constraint $g_i(x) \le 0$, or $g_i(x) = \vec{C}_i \cdot \vec{x} \le 0$, and $\vec{x}^* = \vec{X}^* + \vec{S}$. Clearly, x* is feasible because it is the origin and $\vec{C}_i \cdot \vec{x}^* = 0$. The proof that x* satisfies the sufficient K.K.T. condition is the same as the more general proof for Figure 4.1(c), just with the fact that x* = S and $\vec{X}^* = 0$.
Figure 4.1(c) is the general case, in which the negative gradient $\vec{G}$ cannot be expressed as a linear combination of the $\vec{C}_i$ with non-negative coefficients $\mu_i$ (i = 1,…,m). In this case, among all possible non-negative assignments of $\mu_i$ giving $\vec{H} = \sum_{i=1}^{m} \mu_i \vec{C}_i$, one finds the specific one that gives the minimum distance between the tip of this $\vec{H}$ vector and O. That is, we seek $\vec{H}$ of this form minimizing the length of the vector $\vec{G} - \vec{H}$. We then set $\vec{x}^* = \vec{S} + \vec{G} - \vec{H}$.
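First, we show that x* is feasible, i.e., $\vec{C}_i \cdot \vec{X}^* = g_i(x^*) \le 0$ (i = 1,…,m), where $\vec{X}^*$ is the vector from the origin to x*.
Since $\vec{x}^*$ is located at $\vec{S} + \vec{G} - \vec{H}$ and $\vec{S}$ is the origin, we get $\vec{X}^* = \vec{G} - \vec{H}$. Note that the assignment of $\mu_i$ is special in that it is non-negative and it gives the minimum length of the vector $\vec{G} - \vec{H}$, which is the same as the length of $\vec{X}^*$.
Because this assignment minimizes $\vec{X}^* \cdot \vec{X}^*$ over $\mu_i \ge 0$, we have
$$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 0 \quad \text{when } \mu_i > 0 \tag{32}$$
and
$$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} \ge 0 \quad \text{when } \mu_i = 0 \quad (i = 1,\ldots,m). \tag{33}$$
Since
$$\vec{X}^* = \vec{G} - \vec{H} = \vec{G} - \sum_{i=1}^{m} \mu_i \vec{C}_i, \tag{34}$$
we have
$$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 2 \left( \sum_{j=1}^{m} \mu_j \vec{C}_j - \vec{G} \right) \cdot \vec{C}_i = -2\, \vec{C}_i \cdot \vec{X}^*. \tag{35}$$
Substituting (35) into (32) and (33), we get
$$\vec{C}_i \cdot \vec{X}^* = 0 \quad \text{when } \mu_i > 0 \tag{36}$$
and
$$\vec{C}_i \cdot \vec{X}^* \le 0 \quad \text{when } \mu_i = 0 \quad (i = 1,\ldots,m). \tag{37}$$
Combining (36) and (37), we get $g_i(x^*) = \vec{C}_i \cdot \vec{X}^* \le 0$ (i = 1,…,m), i.e., $\vec{x}^*$ is feasible.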
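Next, we show $\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = 0$. The following has already been shown in (31):
$$\nabla f(x^*) = \vec{X}^* - \vec{G} = -\vec{H} \tag{31}$$
and
$$\sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = \sum_{i=1}^{m} \mu_i \nabla (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i \vec{C}_i = \vec{H}. \tag{38}$$
Thus, we have
$$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = 0. \tag{39}$$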
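Last, we show that $\mu_i\, g_i(x^*) = 0$ (i = 1,…,m). Geometrically, since $\vec{X}^* = \vec{G} - \vec{H}$ has the minimum size, $\vec{X}^*$ must be orthogonal to $\vec{H}$, i.e., $\vec{H} \cdot \vec{X}^* = 0$. Substituting $\vec{H}$ with $\sum_{i=1}^{m} \mu_i \vec{C}_i$, we have
$$\left( \sum_{i=1}^{m} \mu_i \vec{C}_i \right) \cdot \vec{X}^* = \sum_{i=1}^{m} \mu_i (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i\, g_i(x^*) = 0. \tag{40}$$
Given that $g_i(x^*) \le 0$ and $\mu_i \ge 0$ (i = 1,…,m), every term in (40) is non-positive, so (40) holds only when $\mu_i\, g_i(x^*) = 0$ for all i from 1 to m.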
This completes the proof that x* satisfies the sufficient part of the K.K.T. Condition, and verifies that x* is
the minimum of the objective function under the monotonicity assumptions.
I could have used the x* described above to find the probability assignment of every data point in the table that gives the minimum sum of squared error, but in the present program, I actually use $\mu_i$ to find the active constraints that the optimal solution should satisfy and compute x* based on that, by combining all instances in the groups under equality constraints and assigning the corresponding p values to be the total positive instances observed divided by the total instances in the group. This procedure is more accurate than computing from $\vec{S} + \vec{G} - \vec{H}$. This new way of computing x* generates the same x* as before, since it gives the minimum SSE assignments of p for each group of active constraints.
4.2.6 Maximum likelihood vs. minimum SSE.
Theorem 1. In this problem of finding estimated probability under monotonicity constraints, a value of $\vec{p}$ that minimizes SSE will also maximize likelihood L subject to the same monotonicity assumption (see footnote 4).
While this is well-known when there are no constraints, it is not true under arbitrary constraints; what I show is that it also holds under the particular constraints used here.
As defined earlier in (22), the likelihood function L is defined as follows for binary problems:
$$L(\vec{P}) = \prod_{i=1}^{n} P(\vec{X}_i = \vec{x}_i)\, p_i^{c_i} (1 - p_i)^{1 - c_i} \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i} \tag{22}$$
Since the present problem only involves assigning $p_i$, given $\vec{X}_i = \vec{x}_i$, I will only use the $\prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$ part from now on for $L(\vec{P})$.
Adopting the convention that $0^0 = 1$, each factor of $L(\vec{P})$ can be rewritten as:
$$L_i(p_i, c_i) = p_i^{c_i} * (1 - p_i)^{1 - c_i} \quad (0 \le p_i \le 1)$$
Before proving that the solution from minimizing SSE also maximizes $L(\vec{P})$, I prove the following lemma.
Footnote 4: Since this is to prove that the same solution minimizing SSE will also maximize L, and I use this in the special probability estimate problem presented here, I once again switch to p and $\vec{P}$ to denote the vector variable and the vector space respectively, instead of using x and $\vec{X}$ as earlier.
Lemma 1. Suppose we wish to maximize a function having the form:
$$F(\vec{P}) = F(p_1, \ldots, p_m, p_{m+1}, \ldots, p_n) = F_1(p_1, \ldots, p_m) * F_2(p_{m+1}, \ldots, p_n),$$
where $\vec{P}$ is partitioned into two groups, $p_1, \ldots, p_m$ and $p_{m+1}, \ldots, p_n$, subject to a given set of constraints.
If an assignment $\vec{P}^*$, namely $(p_1 = p_1^*, \ldots, p_m = p_m^*, p_{m+1} = p_{m+1}^*, \ldots, p_n = p_n^*)$, maximizes $F_1$ and $F_2$ under the same constraints, then $\vec{P}^*$ maximizes $F(\vec{P})$ under the same constraints.
This lemma is readily proved by contradiction. Suppose there is another assignment $\vec{P}'$ under the same constraints that gives $F(\vec{P}') > F(\vec{P}^*)$, and give the same assignments of $\vec{P}'$ to $F_1$ and $F_2$. Since $F_1$ and $F_2$ are non-negative in the present setting (they are products of probabilities), at least one of the following has to be true: $F_1(\vec{P}') > F_1(\vec{P}^*)$ or $F_2(\vec{P}') > F_2(\vec{P}^*)$. But if either one of them is true, the assumption that $\vec{P}^*$ maximizes $F_1$ and $F_2$ under those constraints is violated.
With Lemma 1 proved, I can prove Theorem 1. Based on the value of $p_i$ from the minimum SSE solution $\vec{P}^*$, I break equation (22) into two parts after moving the factors with $p_i = 0$ or $p_i = 1$ to the end of the equation:
$$L(\vec{P}) = L_1(\vec{P}) * L_2(\vec{P}),$$
where
$$L_1(\vec{P}) = \prod_{i=n'+1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i} \quad \text{where } p_i = 0 \text{ or } p_i = 1$$
and
$$L_2(\vec{P}) = \prod_{i=1}^{n'} p_i^{c_i} (1 - p_i)^{1 - c_i} \quad \text{where } 0 < p_i < 1.$$
Since $\vec{P}^*$ is the minimum SSE solution under the same constraint set as the maximum-likelihood problem, if I can show that $\vec{P}^*$ maximizes both $L_1(\vec{P})$ and $L_2(\vec{P})$, Theorem 1 follows from Lemma 1.
Showing that $\vec{P}^*$ maximizes $L_1(\vec{P})$ is easy. Notice that in $L_1(\vec{P})$, $p_i$ can only be 1 or 0, and the only way $p_i$ can be 1 in the minimum SSE solution is when $c_i$ has the same value, 1; the same is true when $p_i$ is 0. Substituting $p_i$ and $c_i$ with either both 0 or both 1 gives an $L_1(\vec{P})$ value of 1, which is already the unconstrained maximum and is therefore also the constrained maximum.
I will show in the remainder of this section that $\vec{P}^*$ also maximizes $L_2(\vec{P})$ under the same constraint set.
Since all $p_i$'s in $L_2(\vec{P})$ have values strictly larger than 0 and less than 1, a negative-log-likelihood function may be defined as:
$$G(\vec{P}) = -\log L_2 = -\sum_{i=1}^{n} c_i * \log p_i - \sum_{i=1}^{n} (1 - c_i) * \log(1 - p_i) \tag{41}$$
Since, with $0 < x, y \le 1$, $-\tfrac{1}{2}(\log x + \log y) \ge -\log\left(\tfrac{x + y}{2}\right)$, $-\log p_i$ is a convex function, as is $-\log(1 - p_i)$. Since $G(\vec{P})$ is a weighted sum of convex functions with non-negative weights, it is a convex function, too.
After taking the derivative of G with respect to $p_i$, we get:
$$\frac{\partial G}{\partial p_i} = -c_i * \frac{1}{p_i} + (1 - c_i) * \frac{1}{1 - p_i} = \frac{p_i - c_i}{p_i * (1 - p_i)} \tag{42}$$
Notice that the derivative of the SSE function:
$$F = \sum_{i=1}^{n} (c_i - p_i)^2 \tag{43}$$
is
$$\frac{\partial F}{\partial p_i} = 2 * (p_i - c_i) \tag{44}$$
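Comparing (44) with (42), (42) can be rewritten as the following, with $\vec{P}$ and $\vec{C}$ representing the vectors $(p_1, \ldots, p_n)$ and $(c_1, \ldots, c_n)$, respectively:
$$\frac{\partial G}{\partial \vec{P}} = \begin{pmatrix} \frac{1}{2 * p_1 * (1 - p_1)} & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \frac{1}{2 * p_i * (1 - p_i)} & 0 \\ 0 & \cdots & 0 & \frac{1}{2 * p_n * (1 - p_n)} \end{pmatrix} \times \left( 2 (\vec{P} - \vec{C}) \right)^T = \begin{pmatrix} \frac{1}{2 * p_1 * (1 - p_1)} & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \frac{1}{2 * p_i * (1 - p_i)} & 0 \\ 0 & \cdots & 0 & \frac{1}{2 * p_n * (1 - p_n)} \end{pmatrix} \times \frac{\partial F}{\partial \vec{P}} \tag{45}$$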
If we can find $\mu_{G,l}$ and a specific $\vec{P}^*$ to satisfy the following K.K.T. conditions for G:
$$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G,l} \nabla g_l(\vec{P}^*) = 0 \tag{46}$$
and
$$\mu_{G,l}\, g_l(\vec{P}^*) = 0, \tag{47}$$
then we know that $\vec{P}^*$ is the optimal solution corresponding to the minimum of the negative-log-likelihood function $G(\vec{P}^*)$, and also to the maximum of the likelihood function $L_2(\vec{P}^*)$.
Now, we will show that the $\vec{P}^*$ obtained by solving the minimum SSE problem with the same set of constraints $g(\vec{P})$, along with the $\mu_G$ constructed from the $\mu_F$ corresponding to the minimum SSE solution, does satisfy the K.K.T. conditions (46) and (47).
First, since $\vec{P}^*$ is the solution minimizing the SSE function, based on the K.K.T. necessary condition, there is a corresponding $\mu_F$ satisfying
$$\nabla F(\vec{P}^*) + \sum_{l=1}^{m} \mu_{F,l} \nabla g_l(\vec{P}^*) = 0 \tag{48}$$
with
$$\mu_{F,l}\, g_l(\vec{P}^*) = 0. \tag{49}$$
From (49), we have either
$$\mu_{F,l} = 0$$
or
$$g_l(\vec{P}^*) = 0,$$
this latter case meaning the inequality constraint is actually met with equality. Without loss of generality, let the lth constraint function $g_l$ be $g_l(\vec{P}) = p_{a(l)} - p_{b(l)}$, where $a(l)$ and $b(l)$ correspond to the indices of the variables p appearing in the constraint according to the given partial order. When $g_l(\vec{P}^*) = 0$, we let $p_{a(l)} = p_{b(l)} = p_l$. Now we can construct
$$\mu_{G,l} = \frac{1}{2 * p_l * (1 - p_l)} * \mu_{F,l} \quad (l = 1, \ldots, m) \tag{50}$$
Since F and G have the same constraint set, $\nabla g_{G,l}(\vec{P}^*) = \nabla g_{F,l}(\vec{P}^*)$, so
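$$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G,l} \nabla g_{G,l}(\vec{P}^*) = \mathrm{diag}\!\left( \frac{1}{2 * p_1 * (1 - p_1)}, \ldots, \frac{1}{2 * p_n * (1 - p_n)} \right) \times \frac{\partial F}{\partial \vec{P}} + \sum_{l=1}^{m} \frac{\mu_{F,l}}{2 * p_l * (1 - p_l)} \nabla g_{F,l}(\vec{P}^*),$$
whose ith component (i = 1,…,n) is
$$\frac{1}{2 * p_i * (1 - p_i)} * \frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \frac{\mu_{F,l}}{2 * p_l * (1 - p_l)} \left[ \nabla g_{F,l}(\vec{P}^*) \right]_i .$$
As mentioned earlier, $p_l$ is the same as the two p's appearing in $g_{F,l}$ where $\mu_{F,l} \ne 0$, so the above equation can be rewritten as:
$$\left[ \nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G,l} \nabla g_{G,l}(\vec{P}^*) \right]_i = \frac{1}{2 * p_i * (1 - p_i)} * \frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \frac{\mu_{F,l}}{2 * p_l * (1 - p_l)} \left[ \nabla g_{F,l}(\vec{P}^*) \right]_i = \frac{1}{2 * p_i * (1 - p_i)} * \left( \frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F,l} \left[ \nabla g_{F,l}(\vec{P}^*) \right]_i \right) \tag{51}$$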
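Since $\vec{P}^*$ and the corresponding $p_i$ are the assignments giving the minimum of the SSE function F, from (39), we have
$$\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F,l} \left[ \nabla g_{F,l}(\vec{P}^*) \right]_i = 0 \tag{52}$$
Substituting (52) into (51), we have:
$$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G,l} \nabla g_{G,l}(\vec{P}^*) = \frac{1}{2 * p_i * (1 - p_i)} * \left( \frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F,l} \left[ \nabla g_{F,l}(\vec{P}^*) \right]_i \right) = 0 \tag{53}$$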
Since the functions G and F share the same constraint set $g(\vec{P})$,
$$\mu_{G,l}\, g_l(\vec{P}^*) = \frac{1}{2 * p_l * (1 - p_l)} * \mu_{F,l}\, g_l(\vec{P}^*) = 0 \tag{54}$$
(53) and (54) show that the $\vec{P}^*$ minimizing the SSE function F also minimizes the negative-log-likelihood function G and therefore maximizes the likelihood function $L_2(\vec{P})$.
Notice the subtlety that the constraint set we used to show that $\vec{P}^*$ maximizes $L_2(\vec{P})$ is actually a subset of the original constraint set, because some of the constraints in the original problem do not play a role in maximizing $L_2(\vec{P})$: they involve variables $p_i$ with value 1 or 0, which are not in $L_2(\vec{P})$ at all. This is not an issue: because we show that $\vec{P}^*$ already maximizes $L_2(\vec{P})$ in the less constrained problem, $L_2(\vec{P})$ cannot attain a better value under a tighter constraint set. On the other hand, since $\vec{P}^*$ is a solution of the minimum SSE problem under the original constraints, it does not violate any of the constraints that are in the original constraint set but not in the one we used to optimize $L_2(\vec{P})$.
Combining all the results above and using Lemma 1 completes the proof of Theorem 1.
With Theorem 1, the answer achieved by least sum of squared error using the POOL algorithm is the
same as the maximum likelihood solution.
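Theorem 1 can be illustrated, though of course not proved, by brute force on a tiny example. The sketch below enumerates a grid of feasible probability vectors and reports the grid point with the highest log-likelihood and the one with the lowest SSE. The data (counts of positives and totals per instance), the constraint pairs and all function names are hypothetical and chosen only so that the optimum lies on the grid.

```python
import math
from itertools import product


def log_likelihood(n_pos, n_tot, p):
    """Log of the product of p^c (1 - p)^(1 - c) over all samples, from counts."""
    return sum(k * math.log(q) + (t - k) * math.log(1 - q)
               for k, t, q in zip(n_pos, n_tot, p))


def sse(n_pos, n_tot, p):
    """Sum of squared errors over all samples, from counts."""
    return sum(t * q * q - 2 * k * q + k for k, t, q in zip(n_pos, n_tot, p))


def grid_optima(n_pos, n_tot, constraints, steps=20):
    """Return (argmax log-likelihood, argmin SSE) as tuples of grid indices.

    Grid values are index / steps; indices 1..steps-1 avoid log(0).
    constraints is a list of (a, b) pairs meaning p[a] <= p[b].
    """
    best_ll = best_sse = None
    for idx in product(range(1, steps), repeat=len(n_tot)):
        if any(idx[a] > idx[b] for a, b in constraints):
            continue                      # violates monotonicity
        p = [i / steps for i in idx]
        ll, s = log_likelihood(n_pos, n_tot, p), sse(n_pos, n_tot, p)
        if best_ll is None or ll > best_ll[0]:
            best_ll = (ll, idx)
        if best_sse is None or s < best_sse[0]:
            best_sse = (s, idx)
    return best_ll[1], best_sse[1]


ml_idx, sse_idx = grid_optima([1, 3, 1], [4, 4, 4], [(0, 1), (1, 2)])
```

For this example both searches select the same grid point, corresponding to the probabilities 0.25, 0.5 and 0.5.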
4.3 Additional computational steps.
4.3.1 Preprocessing.
In general, $C_{n \times m}$ is given by specific problem requirements. In the present study it is derived from the THEMATICS principles. Since the data in this problem are sparse and n is large, instead of writing out all the constraints, we introduce the idea of immediate successor and immediate predecessor:
Definition 4: Immediate successor.
y is an immediate successor of x if and only if $x \prec y$ and there is no z such that $x \prec z$ and $z \prec y$.
There is a trivial linear time algorithm that, given monotonicity assumptions, can find all the immediate successors of a particular instance with particular attributes in a single scan of all the instances and their attributes. With the immediate successors of each instance known, the transitive closure will contain the whole monotonicity assumption. We use this fact to build $C_{n \times m}$ in quadratic time. For storage efficiency, we could even store $C_{n \times m}$ in one m×2 array, if all scaling factors are 1, by just storing the indices of the instances with 1 and -1 as their coefficients respectively; in the present case, with a different scaling factor for each cell, we use three m×2 arrays, one storing the indices, and the other two for scaling factors.
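To make the notion of immediate successor concrete, here is a naive Python sketch that derives the constraint pairs from a set of distinct feature vectors. It assumes the partial order is componentwise dominance (x precedes y when every component of x is no larger than the corresponding component of y and the vectors differ), and it uses a simple nested scan rather than the single-scan algorithm mentioned above. The function names are hypothetical.

```python
def precedes(x, y):
    """Strict componentwise order: every x_k <= y_k and x differs from y."""
    return all(a <= b for a, b in zip(x, y)) and tuple(x) != tuple(y)


def immediate_successors(points):
    """Map each index i to the indices j that cover it in the partial order."""
    n = len(points)
    succ = {i: [] for i in range(n)}
    for i in range(n):
        above = [j for j in range(n) if precedes(points[i], points[j])]
        for j in above:
            # j is immediate if no other point above i lies strictly below j.
            if not any(precedes(points[z], points[j]) for z in above):
                succ[i].append(j)
    return succ


def constraint_pairs(points):
    """List of (a, b) index pairs, each standing for one constraint p[a] <= p[b]."""
    return [(i, j) for i, js in immediate_successors(points).items() for j in js]


grid = [(0, 0), (1, 0), (0, 1), (1, 1)]
pairs = constraint_pairs(grid)
```

On the 2×2 grid, the pair from (0, 0) to (1, 1) is omitted because it is implied by transitivity through either of the intermediate points.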
4.3.2 Interpolation.
If the training set does not have an instance with a specific attribute value, an interpolation scheme is
needed to estimate the class probability with this attribute value. There are some required properties this
interpolation scheme should have. One is that all the interpolated class probabilities should conform to the
monotonicity assumption over the whole virtual table. In addition to that, there are also some desirable
properties, such as, whenever possible, the interpolation should reflect the monotonicity assumption
strictly. In other words, if the original monotonicity assumption specifies that P(C=c|X=x)≤P(C=c|Y=y)
when x≤y, if we have x<y, we prefer P(C=c|X=x)<P(C=c|Y=y) as long as it does not violate the
monotonicity assumption as a whole. Another desirable feature may be, whenever possible, we prefer
P(C=c|X) continuous over X.
After testing some interpolation schemes in the present application, we found a linear interpolation
between the maximum and minimum allowed P based on the Manhattan distance of instance X to its
limiting predecessor and successor gives good results in practice.
The result of applying this new POOL method to the protein active site prediction problem will be given
in the next chapter.
Chapter 5
Applying the POOL Method with THEMATICS in Protein Active Site Prediction.
5.1 Introduction.
In chapter 3, I reported the use of SVM and THEMATICS to predict the protein active sites based on
protein 3D structure alone and introduced one way to expand the original THEMATICS method and
include the non-ionizable residues into the prediction. Although the SVM method outperforms all prior
structure-based methods, including other approaches using THEMATICS, and achieves similar or slightly
better performance than methods using both structural and sequence comparison information, there is still
room for further improvements of the method, as we briefly discussed in section 3.7.
One straightforward improvement can be the addition of more information about a residue into the
learning system. In the study reported in this chapter, in addition to THEMATICS features, I add
different features, such as the size of the cleft in which a surface residue resides, and the conservation of
the residue among proteins of similar sequence into our system, and examine how helpful they are in
terms of improving the sensitivity and the specificity of the prediction.
Another improvement comes from changing a classification problem into a ranking problem. In a
classification problem, the results are binary labels of either positive or negative, with nothing in between.
One of the disadvantages of this approach is that it is less convenient or even impossible in some cases for
the users to fine tune their result if they want to improve the sensitivity at the cost of lowering the
specificity or vice versa. In this study, the result is a ranked list of all residues in a protein based on their
likelihood of being in active sites. Users can choose a cut-off best suited to their needs in different
situations. One previous method, called PCats [21], generates such a rank-ordered list of probabilities.
Because my method actually estimates probabilities, I can easily combine probability estimates from different methods, using the chaining method I introduced in section 4.1.5. This overcomes one of the common hurdles one encounters when including more features in the system, as described in the paragraph above. This study shows that the combined probability estimates with additional features work better than a single probability estimate.
In the SVM approach, there is a 9 Å cut-off used to group SVM-positive residues into clusters as active site residue candidates. This threshold seems to give good predictions, but it is arbitrary and could be optimized in a more systematic way. In this study, I eliminated this step and used features containing the degree of perturbation of titration curves of nearby residues. This approach is more systematic and is optimized over the whole process.
Combining THEMATICS and the power of the POOL method by enforcing the hypothesis into the
learning system, I achieve a substantially higher sensitivity and specificity at the same time than the SVM
method, one I had already shown to be the best among all other 3D-structure based methods, as compared
in Chapter 3. Performance can be improved further with other 3D-structure-based features, including the
size of the cleft in which surface residues reside. Performance can also be improved using sequence
conservation scores for individual residues, obtained from a sequence alignment of proteins of similar
sequence, provided there are enough such proteins. Note that this latter enhancement turns a purely 3D
structure based method into a sequence and structure based method.
A set of 64 different proteins from the CatRes (CSA) database is used to compare the performance of the
different methods for functional site prediction. A more complete selection of 160 proteins from the
CatRes (CSA) database is used to further confirm the advantage I gain by adding extra features in
addition to THEMATICS into the system to form an improved system for prediction. In this chapter, I
also improve the way I extend the method to predict non-ionizable active site residues. In addition, we use
ROC curves to compare the performance between different methods and RFR (Recall-Filtration Ratio)
curves to guide potential users in setting the actual cut-off in practice.
5.2 THEMATICS curves and other features used in the POOL method
In the work presented in Chapter 3, I used moments of the first derivative curves of the titration curves. These were defined analogously to the moments of density functions, as these first derivative curves are essentially probability distribution functions [53]. One aspect of prior approaches such as [24, 45, 54] is the use of spatial clustering as a way of reducing the number of apparent false positives. That is, residues are reported as positive by the method if and only if they are in sufficiently close spatial proximity to at least one other residue identified as a candidate by the outlier detector in Ko’s and Wei’s approach or the SVM in Tong’s approach. The overall identification process in these prior approaches thus involves two stages, where the first stage makes a binary yes/no decision on each residue. In this new approach I do not begin with such a binary decision because it is my goal to assign to every (ionizable) residue a probability that it is an active-site residue. Thus, as an alternative to this clustering approach, I instead consider what I call environment features. For a given scalar feature x, I define the value of the environment feature $x^{env}(r)$ for a given residue r to be
$$x^{env}(r) = \frac{\sum_{r' \ne r} w(r')\, x(r')}{\sum_{r' \ne r} w(r')} \tag{55}$$
where r' is an ionizable residue whose distance d(r',r) to residue r is less than 9 Å, and the weight w(r') is given by $1/d(r',r)^2$.
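Equation (55) can be sketched in Python as below. Several details are assumptions made for the sake of a runnable example: each residue is represented by a single 3D coordinate, distances are Euclidean, and `None` is returned when a residue has no ionizable neighbor within the cut-off. The function name and argument names are hypothetical.

```python
import math


def environment_feature(coords, values, ionizable, r, cutoff=9.0):
    """Inverse-square-distance weighted mean of a feature over ionizable neighbors.

    coords[i] is the assumed representative (x, y, z) position of residue i,
    values[i] its scalar feature, ionizable[i] a boolean flag.
    """
    numerator = denominator = 0.0
    for other in range(len(coords)):
        if other == r or not ionizable[other]:
            continue
        d = math.dist(coords[r], coords[other])
        if 0.0 < d < cutoff:
            w = 1.0 / (d * d)             # weight w(r') = 1 / d(r', r)^2
            numerator += w * values[other]
            denominator += w
    return numerator / denominator if denominator > 0 else None


coords = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (10, 0, 0), (1, 0, 0)]
values = [0.0, 2.0, 5.0, 100.0, 50.0]
ionizable = [True, True, True, True, False]
x_env_0 = environment_feature(coords, values, ionizable, 0)
```

In this toy layout only the residues at distances 3 and 4 contribute to residue 0: the one at distance 10 lies outside the cut-off and the one at distance 1 is not ionizable.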
In this study, I use the same features $\mu_3$ and $\mu_4$ used in the Ko approach, along with the additional features $\mu_3^{env}$ and $\mu_4^{env}$ as an alternative to the clustering stage. Thus every ionizable residue in any protein is assigned the 4-dimensional feature vector $(\mu_3, \mu_4, \mu_3^{env}, \mu_4^{env})$, which is the THEMATICS feature for ionizable residues.
Although $\mu_3$ and $\mu_4$ themselves are only defined for ionizable residues, the environment features, such as $\mu_3^{env}$ and $\mu_4^{env}$, are well-defined for non-ionizable residues. For non-ionizable residues, the THEMATICS features I use are the 2-dimensional feature vectors $(\mu_3^{env}, \mu_4^{env})$.
There is one additional subtlety that all THEMATICS-based methods have had to address, and the current
approach is no exception: the need for some kind of normalization across proteins. In Ko’s and Wei’s
approach, the raw features are individually transformed into Z-scores by subtracting the within-protein
mean and dividing by the within-protein standard deviation. In my SVM approach, the raw
features are likewise transformed into robust Z-scores by subtracting the within-protein median and
dividing by the within-protein interquartile range. Here, I apply yet another within-protein feature
transformation to each feature, which I call rank normalization. Within each protein, each feature value is
ranked from lowest to highest in that protein, and each data point is then assigned a number spaced uniformly
across the interval [0,1] based on the rank of that feature in that protein. The highest value for that feature
is thus transformed to 1, and the lowest value is transformed to 0. Note that unlike the use of Z-scores or
robust Z-scores, this is a nonlinear transformation of the raw feature values. For each scalar feature x,
denote its within-protein rank-normalized value as $\tilde{x}$, which by definition lies in [0,1]. I extend the use of
this notation to feature vectors in the obvious way. That is, $\tilde{\mathbf{x}} = (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4)$.
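A minimal Python sketch of within-protein rank normalization is given here for concreteness. The text does not say how tied raw values are handled, so assigning tied values their average rank is an assumption of this example, as are the function name and the treatment of a single-value list.

```python
def rank_normalize(values):
    """Map the raw values of one feature in one protein onto [0, 1] by rank.

    The lowest value maps to 0 and the highest to 1. Tied values share
    their average rank (an assumption; the text does not specify this).
    """
    n = len(values)
    if n == 1:
        return [0.0]  # degenerate case; the choice of 0 is arbitrary
    order = sorted(range(n), key=lambda i: values[i])
    normalized = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0  # 0-based average rank of the run
        for k in range(i, j + 1):
            normalized[order[k]] = average_rank / (n - 1)
        i = j + 1
    return normalized


# The transformation is nonlinear: only the ordering of raw values matters.
print(rank_normalize([1.0, 2.0, 3.0]))
print(rank_normalize([1.0, 8.0, 27.0]))
```

Both calls give the same result, which illustrates why this normalization, unlike a Z-score, discards the magnitudes of the raw features and keeps only their within-protein ordering.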
Note that the use of within-protein rank normalization does not affect the within-protein partial order used
in the THEMATICS Principles, which I introduced in Section 2.3.3. That is, $\mathbf{x} \prec \mathbf{y}$ is true for raw feature
vectors x and y in the same protein if and only if $\tilde{\mathbf{x}} \prec \tilde{\mathbf{y}}$. However, when I combine data from multiple
proteins for training and use the results to make predictions for new proteins, as I describe in more detail
later, this actually amounts to making an even stronger monotonicity assumption across proteins in which
the within-protein rankings replace the raw feature values. This is obviously a more controversial
assumption, but some such approach is required to be able to train on multiple proteins and make
predictions for novel proteins, and, as I show below, this approach appears to give good results.
As discussed in Chapter 2, in addition to THEMATICS, there are other methods for predicting active site
residues. They use features such as geometric position of residues, amino acid type information and
sequence conservation, which are very different from the electrostatic information I use in THEMATICS.
It is reasonable to speculate that if I combine these features with THEMATICS features, I may get better
performance. I tested this hypothesis in this study.
In addition to THEMATICS features, I try the cleft feature, which is a number I assign to every
residue in a given protein based on the rank of the size of the cleft to which the residue belongs. One
special value is assigned to every residue not on the protein surface, and another is assigned to every
residue on the surface but not within any cleft. Ignoring these special values, it is easy to construct the
monotonicity assumption that the larger the cleft to which a residue belongs, the more likely that residue
is to belong to the active site. I can apply POOL on the cleft feature based on this monotonicity
assumption.
ConSurf 27 is a sequence comparison based method that identifies functionally important regions on the
surface of a protein of known three-dimensional (3D) structure, based on the phylogenetic relations
between its close sequence homologues. If there are more than five homologues (the method is
considered reliable if the number of homologues is greater than 10) to the query protein, it can assign a
score between 1 and 9 to each residue in the query sequence based on how conserved this residue is
among those homologues. The larger the score is, the more conserved the residues are. With some
exceptions discussed in Chapter 2, it is commonly believed that the more conserved a residue is, the more
likely it is functionally important. This gives me another monotonicity assumption to which I can apply
POOL. I call this the ConSurf feature. In this study, if the protein has more than 10 homologues, I used
the scores ConSurf assigns to each residue as their ConSurf feature values. For the proteins with 10 or
fewer homologues, I assign 0 as the ConSurf feature values for all their residues. Since I am only
interested in the rank list of residues within a protein, rather than across proteins, this special treatment
will not affect the final results.
In addition to these features, I also tried features such as residue type and ASA (area of solvent
accessibility) of residues in our study but found that including these did not improve the performance. No
further details will be given here about those features that did not improve overall performance.
5.3 Performance measurement.
Before presenting the results, I must first decide how to measure the performance of the system. For a
standard classification problem, performance is typically measured by recall, false-positive rate and
Matthews correlation coefficient (MCC). Within a specific system, recall and false-positive rate usually
affect each other: lowering the false-positive rate will most likely lower recall at the same time. So it only
makes sense to report both metrics together. Although MCC is a single metric that
measures the overall performance of a classification system, it only measures the performance at a specific
setting. To measure the performance of my system at different thresholds, I use ROC (receiver
operating characteristic) curves, which plot recall against the false-positive rate. One can
compare two systems by comparing their ROC curves. In general, one can say a system giving a higher
recall and a lower false-positive rate at the same time out-performs a system giving a lower recall and a
higher false-positive rate at that specific setting. If the ROC curve from system A always lies above and to
the left of the ROC curve from system B, one can conclude that system A dominates system B and
always out-performs system B. Studies have also shown that the area under the ROC curve (AUC) is a very
reliable single-value assessment for the evaluation of different machine learning algorithms 70.
In order to generate ROC curves, I need to be able to calculate recall and false-positive rate values, which
come from classification problems. In the POOL system, the result for each protein is a ranked list based
on the probability of a residue being in the active site. A natural way to draw a ROC curve for every
protein is to move the cutoff one residue at a time from the top to the end of the list. The resulting ROC
curve has a staircase shape: only recall increases when an active site residue is encountered, and only the
false-positive rate increases when a non-active-site residue is encountered.
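The cutoff-sweeping procedure just described can be sketched in a few lines of Python. The function name and the representation of a ranked list as a sequence of 1/0 annotations (best-ranked residue first) are assumptions for this example, and the sketch assumes each protein has at least one positive and one negative residue.

```python
def roc_points(labels):
    """Return the staircase ROC points for one ranked list.

    labels: 1 for an annotated active-site residue, 0 otherwise,
            ordered from the top of the ranked list to the bottom.
    Each point is (false-positive rate, recall) after one more residue
    has been moved above the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = 0
    fp = 0
    points = [(0.0, 0.0)]  # cutoff above the first residue
    for y in labels:
        if y:
            tp += 1  # vertical step: only recall increases
        else:
            fp += 1  # horizontal step: only the false-positive rate increases
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical ranked list with positives at ranks 1 and 3.
staircase = roc_points([1, 0, 1, 0, 0])
```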
We define average specificity (AveS) for each protein in the set:
$$\mathrm{AveS} = \frac{\sum_{r=1}^{N} S(r) \cdot pos(r)}{\text{Number of positive examples}} \qquad (56)$$
where r is the rank, N is the number of residues in a protein, pos(r) is a binary function that indicates
whether the residue of a given rank r is annotated in the reference database in the active site (pos(r)= 1) or
not (pos(r)= 0), and S(r) is the specificity at a given cut-off rank r.
It is not hard to see that AveS represents the area under the ROC curve (AUC). This is analogous to AveP,
the area under the Recall-Precision curve, used in the information retrieval field. Unlike MCC, AveS is a
single-number measurement of the performance of a classification system over the whole range of
cutoff settings, rather than at a single setting.
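The equivalence between AveS and the area under the ROC curve can be checked numerically. The sketch below computes AveS as in Equation (56) and compares it with the AUC computed as the fraction of (positive, negative) pairs ranked in the correct order. It assumes that S(r) is the specificity obtained when the residues at ranks 1 through r are predicted positive, and that the ranking contains no ties; the function names are illustrative.

```python
def average_specificity(labels):
    """AveS of Eq. (56) for one ranked list of 1/0 annotations."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    false_positives = 0
    total = 0.0
    for y in labels:
        if y:
            # S(r) = 1 - FP(r) / (number of negatives), taken at a positive
            total += 1.0 - false_positives / n_neg
        else:
            false_positives += 1
    return total / n_pos


def auc_by_pairs(labels):
    """AUC as the fraction of (positive, negative) pairs ordered correctly."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    negatives_below = n_neg
    correct_pairs = 0
    for y in labels:
        if y:
            correct_pairs += negatives_below  # negatives ranked below it
        else:
            negatives_below -= 1
    return correct_pairs / (n_pos * n_neg)


ranked = [1, 0, 1, 0, 0]  # hypothetical list: positives at ranks 1 and 3
print(average_specificity(ranked), auc_by_pairs(ranked))
```

For this list, S(1) = 1 and S(3) = 2/3, so AveS = 5/6, which matches the pair-counting AUC of 5/6.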
Since the AveS is a measurement on a ROC curve for predicting active site residues from only one
protein, I need a measurement for the performance on a set of proteins. For this, I use Mean Average
Specificity (MAS), which is the mean of AveS of all the proteins in the set. For all methods that generate a
ranked list as in this study, including the POOL method, and one SVM method, I report the corresponding
Mean Average Specificity (MAS) from all the proteins in the test set. As in all statistical analysis, a
difference between the means does not mean too much without further analysis about the statistical
significance of the difference observed. In order to test the significance of the observed difference, I
perform the Wilcoxon signed-rank test 71 on AveS from different methods to estimate the probability of
observing such a difference under the null hypothesis that the observed better-performing method is
actually not better than the other.
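To make the significance test concrete, here is a self-contained Python sketch of a one-sided exact Wilcoxon signed-rank test on paired per-protein AveS values. It is not necessarily the implementation used to produce the reported p-values: dropping zero differences, using average ranks for tied absolute differences, and computing the exact permutation p-value are all assumptions of this example. Ties are detected by exact floating-point equality, which is adequate for a sketch only.

```python
def signed_rank_p_value(a, b):
    """One-sided exact p-value for paired scores a and b.

    Null hypothesis: method a does not out-perform method b.
    Returns P(W+ >= observed W+) over all equally likely sign assignments.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n  # average ranks times 2, so they stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_ranks[order[k]] = i + j + 2  # 2 * mean of ranks i+1..j+1
        i = j + 1
    w_plus = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    # Subset-sum dynamic program: counts[s] is the number of sign
    # assignments whose positive ranks sum to s.
    total = sum(doubled_ranks)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled_ranks:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_plus:]) / 2 ** n


def wins(a, b):
    """Number of proteins on which method a strictly out-performs method b."""
    return sum(1 for x, y in zip(a, b) if x > y)
```

With paired lists of AveS values for two methods, `signed_rank_p_value` gives the test probability and `wins` gives the per-protein count of cases where the first method is strictly better.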
To visually compare the performances from different methods, I generate the averaged ROC curve for
each POOL method by computing the recall and false-positive rate after truncating the list after each of
the positive residues in turn, followed by linearly interpolating the value at each recall value and
computing the mean of the interpolated false-positive rate value from all proteins in the dataset.
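The averaging procedure can be sketched as follows. For each protein, the (recall, false-positive rate) pairs are recorded after each positive residue, the false-positive rate is linearly interpolated at a common grid of recall values, and the interpolated values are averaged over proteins. Starting each per-protein curve at the point (0, 0), the function names, and the choice of recall grid are assumptions of this example.

```python
def fpr_at_positives(labels):
    """(recall, FPR) after truncating the ranked list at each positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # assumed starting point of every curve
    tp = 0
    fp = 0
    for y in labels:
        if y:
            tp += 1
            points.append((tp / n_pos, fp / n_neg))
        else:
            fp += 1
    return points


def interpolate_fpr(points, recall):
    """Linearly interpolate the false-positive rate at a given recall."""
    for (r0, f0), (r1, f1) in zip(points, points[1:]):
        if r0 <= recall <= r1:
            t = (recall - r0) / (r1 - r0)
            return f0 + t * (f1 - f0)
    raise ValueError("recall must lie in [0, 1]")


def averaged_roc(proteins, recall_grid):
    """Mean interpolated FPR over all proteins at each recall value."""
    curve = []
    for recall in recall_grid:
        values = [interpolate_fpr(fpr_at_positives(p), recall) for p in proteins]
        curve.append((recall, sum(values) / len(values)))
    return curve


# Two hypothetical proteins, each given as a ranked list of annotations.
curve = averaged_roc([[1, 0, 1, 0, 0], [0, 1, 0, 0]], [0.5, 1.0])
```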
Although ROC curves and their associated AveS values are good ways to compare performance between
different methods, they do not directly provide a guide for the user to select the cut-off values, because
both recall and false-positive rate are not known to users unless they happen to know the true positives of
their proteins up front. I use another plot that I call the RFR curve; this is a plot of recall against filtration
ratio. Its purpose is to help the user select a cut-off. RFR curves are almost the same as
ROC curves except that filtration ratios are used in place of false-positive rates.
Since for every protein in the dataset POOL generates a ranked list of residues based on their probabilities
of being in the active site, and from this list one generates a corresponding ROC curve and a
corresponding RFR curve, I need to average these curves into a single ROC curve and a single RFR curve
for the whole dataset for comparison purposes. Since it is more natural to ask for any given method what
its expected false-positive rate is for given values of recall, this is what I use for the averaged ROC curve.
Another important fact about ROC curves is that there need be no prior commitment to how specific
classifiers are created. They express the tradeoff no matter how the classifiers are parameterized. On the
other hand, for the user who wants to use a fixed-proportion cutoff scheme, I provide averaged RFR
(recall-filtration ratio) curves; these curves give the expected recall for given filtration ratio values.
5.4 Computational procedure.
The three-dimensional coordinate files for the protein structures were downloaded from the Protein Data
Bank (http://www.rcsb.org/pdb/). In order to predict the theoretical titration curve of each ionizable
residue in the structure, finite-difference Poisson-Boltzmann calculations were performed using UHBD 72
on each protein followed by the program HYBRID 73, which calculates average net charge as a function
of pH. These titration curves were obtained for each ionizable residue: Arg, Asp, Cys, Glu, His, Lys, Tyr,
and the N- and C- termini. The pH range we simulated for all curves is from -15.0 to 30.0, in increments
of 0.2 pH units. This wide theoretical pH range is necessary for proper numerical integration of the first
derivative functions. The structures were processed and analyzed to obtain the central moments, as
described in Chapters 2 and 3.
These individual features, the central moments µ3 and µ4, were then rank-normalized within each protein,
and thus assigned values in the interval [0,1], as described earlier. This four-dimensional representation of
each curve was used for training and for testing. The results given in the remaining sections were based
on eight-fold cross-validation on a set of 64 proteins or 10-fold cross-validation on a set of 160 proteins,
both taken from the Catalytic Site Atlas (CSA) database 57, 74. The labels were taken directly from the
CSA database; if a residue is identified there as active in catalysis, it was labeled as positive in my dataset.
If not so identified in the CSA, we labeled it as negative. The CSA annotations, although incomplete,
constitute the best source of active residue labels for enzymes. In anticipation that the POOL method
would not be overly sensitive to mislabeled data, I performed no hand tuning of the labels and omitted no
residues during training, in contrast to the SVM work reported in Chapter 3.
For the eight-fold cross-validation procedure, I randomly divided the 64-protein set into eight folds of
eight proteins each, training on seven of the eight folds (56 proteins) and testing on the remaining fold (8
proteins). For the ten-fold cross-validation procedure, I randomly divided the 160-protein set into ten
folds of sixteen proteins each, training on nine of the ten folds (144 proteins) and testing on the remaining
fold (16 proteins). Training was performed by applying the POOL method to obtain a function $\hat{P}(1 \mid \tilde{\mathbf{x}})$ for
each rank-normalized feature vector $\tilde{\mathbf{x}}$ in the appropriate feature space [0,1]^k (where k = 4 for the POOL
method applied to the four THEMATICS features of ionizable residues as stated earlier, denoted as
POOL(T4); k = 5 for the POOL method applied to the four THEMATICS features of ionizable residues
plus the geometric feature of the cleft size, denoted as POOL(T4G); k = 1 for the POOL method applied
just to the geometric feature of cleft size, denoted as POOL(G); and k = 2 for POOL applied to
non-ionizable residues, denoted as POOL(T2)). An additional detail is that for training we quantized the
multi-dimensional data points. For example, for POOL(T4), each rank-normalized feature fell into one of
20 bins whose sizes varied depending on their distance from 0.0. In particular, the lowest ranked bins
covered the half-open intervals [0.0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.7), and there were 16 more bins of
width 0.02 above that, with one special bin for 1.0. Thus the lowest-ranking data were quantized more
coarsely than the remaining data. This is appropriate since these data tend to have very low average
probability of being in the active site anyway, because the vast majority of residues are negatives. Thus
the inability to make fine distinctions among these low-probability candidates does not degrade the
overall quality of the results. It does, however, improve the efficiency of the training procedure
significantly, so this is an important component of the analysis. This is especially helpful in the 10-fold
cross-validation on the 160-protein set. The typical training set of 144 proteins contained about 14500
ionizable residues, which fell into more than 6000 quantized bins in the 4-dimensional space used for
POOL(T4). The corresponding number of inequality constraints was about 35000-40000.
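The quantization scheme can be illustrated with a short Python sketch. One reading of the description is adopted here as an explicit assumption: four coarse bins below 0.7, fifteen bins of width 0.02 covering [0.7, 1.0), and a final bin holding only the value 1.0, which gives 20 bins in total. The names `BIN_EDGES`, `quantize` and `quantize_vector` are illustrative.

```python
from bisect import bisect_right

# Lower edges of bins 1..19; bin 0 starts at 0.0. Rounding to two decimals
# keeps the edges equal to the literals 0.70, 0.72, ..., 1.00.
BIN_EDGES = [0.2, 0.4, 0.6] + [round(0.70 + 0.02 * i, 2) for i in range(16)]


def quantize(value):
    """Return the bin index (0..19) of a rank-normalized value in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("rank-normalized values lie in [0, 1]")
    # Half-open bins [edge_i, edge_{i+1}); the last bin holds only 1.0.
    return bisect_right(BIN_EDGES, value)


def quantize_vector(features):
    """Quantize each coordinate of a rank-normalized feature vector."""
    return tuple(quantize(v) for v in features)


# A hypothetical POOL(T4) feature vector mapped to its 4-dimensional cell.
cell = quantize_vector((0.15, 0.65, 0.99, 1.0))
```

Residues whose feature vectors fall into the same cell can then be pooled into a single data point with a count of positives and negatives, which is what reduces the number of distinct points and constraints during training.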
One final detail is that the probability estimates generated by the POOL method as I have applied it tend
to have numerous ties as well as some places where there is no well-defined value. The latter places
occur because the method only assigns values to existing data points (or bins containing data in the case
of our use of quantization). The locally constant regions occur both because of the quantization applied to
the training data at the outset and because the data pools created by the algorithm acquire a single value.
In cells where no value is defined, the interpolation scheme I use is to simply assign a value linearly
interpolated based on the Manhattan distance between the least upper bound and the greatest lower bound
for that cell based on the monotonicity constraint. Finally, since both the data pooling performed by the
algorithm and this interpolation scheme tend to lead to ties, I use the Manhattan distance from the origin
of the four THEMATICS features as a tie-breaker for any residues whose probability estimates are
identical. This simply imposes a slight bias toward strict monotonicity even though the mathematical
formulation I use to determine these probabilities is based on a non-strict monotonicity assumption,
making it possible to obtain well-defined rankings for all the residues in a protein.
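The tie-breaking rule can be sketched as a two-key sort. Since the rank-normalized features are non-negative, the Manhattan distance from the origin is simply the sum of the feature values. Ranking the residue with the larger distance first among tied residues is an assumption of this example, chosen to be consistent with the monotonicity assumption; the tuple layout and function name are illustrative.

```python
def rank_residues(residues):
    """Order residues by probability estimate, breaking ties by L1 norm.

    residues: list of (name, probability, features) tuples, where features
              are rank-normalized values in [0, 1].
    Returns residue names from most to least likely active-site member.
    """
    ordered = sorted(
        residues,
        key=lambda item: (item[1], sum(item[2])),  # (probability, L1 norm)
        reverse=True,
    )
    return [name for name, _, _ in ordered]


# Hypothetical residues: A and B share one pooled probability estimate.
example = [
    ("A", 0.30, (0.9, 0.9, 0.5, 0.5)),
    ("B", 0.30, (0.9, 0.9, 0.6, 0.5)),
    ("C", 0.70, (1.0, 1.0, 1.0, 1.0)),
    ("D", 0.10, (0.2, 0.1, 0.3, 0.2)),
]
ranking = rank_residues(example)
```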
I use CASTp 37, which uses the weighted Delaunay triangulation and the alpha complex for shape
measurements to calculate the cleft information for each residue in the protein. The clefts were ranked
based on their sizes in decreasing order and each residue having atoms located in any cleft is assigned the
rank number of the largest of the clefts where its atoms are located. One special value is assigned to
every residue not on the protein surface, and another is assigned to every residue on the surface but not
within any cleft. Ignoring these special values, the monotonicity assumption is that the larger the cleft to
which a residue belongs, the more likely that residue is to belong to the active site.
I use ConSurf 27 to calculate the sequence conservation information for residues in each protein. ConSurf
takes a protein sequence and finds its closest sequence homologues using MUSCLE 75, a
multiple-sequence alignment algorithm. Two sequences with similarity higher than a preset threshold are treated as
homologues. ConSurf analyzes the homologues of the query sequence and determines how conserved
each residue in the query protein is among these homologues. In order to normalize the result and make it
comparable between different proteins with different numbers of homologues and with different degrees
of overall conservation, the program labels each residue with a conservation score between 1 and 9, with
9 being the most conserved and 1 being the most variable. If there exist more than 50 homologues for the
query sequence, the 50 homologues closest to the query sequence are analyzed. If there are fewer than six
homologues, the method will not work. For proteins with 6-10 homologues, ConSurf does report
conservation scores, but these scores are less reliable. In this study, I only use conservation scores from
ConSurf when there are at least 11 homologues for a protein. Under the assumption that active site
residues tend to be more conserved than others, we apply the POOL method on the conservation score
with the monotonicity assumption that the larger the conservation score a residue has, the more likely that
residue is to belong to the active site.
5.5 Results
The results presented in this section are based on two sets of proteins, a set of 64 test proteins selected
randomly from the CSA database 57, 74, and a 160-protein set covering most of the CSA database. A
detailed list of the proteins in both sets and the CSA-labeled positive residues within each protein can be
found in Appendices C and D. The results are based on eight-fold cross-validation for the
64-protein set and ten-fold cross-validation for the 160-protein set. The ROC curves and RFR curves I
display show average performance over all proteins in all of the test sets, using the averaging methods
described in Section 5.3.
5.5.1 Ionizable residues using only THEMATICS features.
First I evaluate the ability of POOL with the four THEMATICS features, POOL(T4), to predict ionizable
residues in the active site. For the purposes of Figures 5.1 and 5.2, only the CSA-annotated ionizable active
site residues are taken as the labeled positives. Thus if a method successfully predicts all of the labeled
ionizable active residues, the true positive rate is 100%. The prediction of all active residues, including
the non-ionizable ones, is addressed below.
Figure 5.1 shows the ROC curve, true positive fraction (TP) as a function of false positive fraction (FP),
obtained using POOL(T4), with just the four-dimensional THEMATICS feature vectors described earlier
(solid curve). Recall that the POOL method computes maximum-likelihood probability estimates, but for
these ROC curves, only the rankings of all residues within a single protein matter. For comparison, I also
show in Figure 5.1 a corresponding ROC curve for the earlier THEMATICS statistical approach
introduced by Ko et al. 24 and refined by Wei et al. 54 (dashed curve), plus the single point (X)
corresponding to the THEMATICS SVM-based approach 45. The data set used for the statistical curve
consists of the same 64 proteins used here. Note that the POOL(T4) curve always lies above and to the
left of the statistical curve for all non-zero values of the true positive fraction. For any given non-zero
value of the FP fraction, the true positive fraction is always higher for POOL(T4) than for the statistical
selector. The point representing the particular SVM classifier is based on a separate set of data, trained
and tested on data sets somewhat different from the present data set, so the results are not strictly
comparable. Nevertheless, this point lies well below the POOL(T4) curve and strongly suggests that
POOL(T4) is superior to the SVM approach 45. Below I present further evidence that POOL outperforms
an SVM on this active-site classification task. Thus POOL(T4) represents our best method yet for
identifying ionizable active-site residues using THEMATICS features alone.
[Figure 5.1: plot of Recall (vertical axis, 0 to 1) against False Positive Rate (horizontal axis, 0 to 0.25). Legend: POOL(T4), Wei's Method, SVM.]
Figure 5.1 Averaged ROC curve comparing POOL(T4), Wei’s statistical analysis and Tong’s SVM using THEMATICS features. Shown in the plot are the averaged ROC curves for POOL(T4) (solid curve) and Wei’s statistical analysis (dashed curve), and the single point for Tong’s SVM (point X), using THEMATICS features on ionizable residues only for the prediction of annotated active site ionizable residues. POOL(T4) outperforms both the SVM and Wei’s method.
5.5.2 Ionizable residues using THEMATICS plus cleft information.
Next I evaluate the three different ways of combining THEMATICS features with cleft size information.
Figure 5.2 shows averaged ROC curves for these three different methods, along with the best-performing
THEMATICS-only method, POOL(T4). The three methods are: (i) POOL(T4G), which uses the POOL
method with the 5-dimensional concatenated feature vectors of THEMATICS and cleft size rank (G
stands for geometric feature); (ii) SVM(T4G), which uses a support vector machine trained using the
same 5-dimensional feature vectors, with varying threshold; and (iii) CHAIN(POOL(T4), POOL(G)), the
result of chaining POOL(T4) estimates with POOL(G) estimates.
[Figure 5.2: plot of Recall (vertical axis, 0 to 1) against False Positive Rate (horizontal axis, 0 to 0.25). Legend: POOL(T4), CHAIN(POOL(T4), POOL(G)), SVM(T4G), POOL(T4G).]
Figure 5.2. Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only. The method that uses chaining to combine THEMATICS features and geometric information has the best performance.
To compare the averaged ROC curves from Figure 5.2 quantitatively, I compute the area under the curve
for each ROC curve in the figure using the mean average specificity (MAS). The MAS values for
CHAIN(POOL(T4), POOL(G)), POOL(T4), POOL(T4G) and SVM(T4G) are 0.939, 0.921, 0.909 and
0.903, respectively. Figure 5.2 and the MAS values compare the averaged performance
of the different methods. To estimate the statistical significance of the performance differences
on a per-protein basis, I perform the Wilcoxon signed-rank test on every pair of methods.
The first number in each cell of Table 5.1 is the p-value of this test: the probability of observing
such AveS measurements under the null hypothesis that the method listed in the corresponding row does
not out-perform the method listed in the corresponding column. The
number N in parentheses indicates the number of proteins, out of the 64, for which the method in that row
outperforms the method in that column. For the remaining (64-N) proteins in the set, the two methods
either give equal performance or the method in the column outperforms the method in the row.
                              SVM(T4G)        POOL(T4G)       POOL(T4)
CHAIN(POOL(T4), POOL(G))      <0.0001 (53)    <0.0001 (59)    <0.0001 (46)
POOL(T4)                      0.0002 (40)     0.0006 (41)     -
POOL(T4G)                     0.038 (37)      -               -
Table 5.1 Wilcoxon signed-rank tests between methods shown in Figure 5.2. The first number in each cell
is the Wilcoxon p-value for the null hypothesis that the method in the corresponding row does not outperform the
method in the corresponding column. The number in parentheses is the number of proteins out of 64 for
which the method in the row outperforms the method in the column.
The figure and the table above clearly show that chaining the POOL(T4) and POOL(G) probability
estimates is the method that gives the best performance. It is interesting to note that this method,
CHAIN(POOL(T4), POOL(G)), is the only one that outperforms POOL(T4) alone. It is also interesting
to note that POOL(T4) is consistently at least as good as SVM(T4G), and is significantly better than
SVM(T4G) in the upper recall range, even though the latter has the advantage of the additional cleft
information. In general, there is little difference between POOL(T4), SVM(T4G), and POOL(T4G) in the
lower recall range, but for recall above about 0.6, POOL(T4) has a significantly lower false positive rate,
on average, than the other two, given equal recall. So these ROC curves provide strong evidence that
CHAIN(POOL(T4), POOL(G)) is the only one of the methods reported to date that is capable of taking
good advantage of additional geometric information that is not contained in THEMATICS features alone
and thereby outperforms any purely THEMATICS-based method so far.
The better performance of this chained method CHAIN(POOL(T4), POOL(G)) over POOL(T4) alone is
consistent throughout the ROC curve. For recall rates greater than 0.50, the TP fraction for the chained
method is better than that of POOL(T4) by roughly 10% for a given FP fraction. This qualitative trend is
apparent from visual inspection of the ranked lists from the two methods. For a typical protein, these two
ranked lists tend to be very similar, with annotated positive residues generally ranking a little higher, on
average, in the list resulting from chaining.
I believe that the observation that chaining the four- and one-dimensional estimators gives better
results than applying POOL directly to the single, five-dimensional concatenated feature vector
probably reflects an overfitting issue. There may be too much flexibility when POOL is used with a
high-dimensional input space (see footnote 5), especially when the data are sparse.
Footnote 5: As a side note, as far as possible worst-case performance is concerned, it is easy to show that applying coordinate-wise monotonicity with even a 2-dimensional input space has infinite VC dimension.
Since I established that chaining the POOL probabilities gives better results, from the next section on, I
omit POOL from the method reference, to make the change of features more distinguishable.
CHAIN(POOL(T4), POOL(G)) will be abbreviated as CHAIN(T4, G) instead.
5.5.3 All residues using THEMATICS plus cleft information.
So far only predictions for ionizable residues have been described. The THEMATICS environment
variables are now used to incorporate predictions for non-ionizable residues in the active site.
Figure 5.3 shows the ROC curve for a combined method, CHAIN(T_ALL, G), which generates a single merged list ranking all
residues in a protein, both ionizable and non-ionizable. The method assigns probability
estimates to ionizable residues using the best of the previous ionizables-only estimators, the chained
estimator corresponding to the best ROC curve, CHAIN(POOL(T4), POOL(G)), in
Figure 5.2. It also assigns probability estimates to non-ionizable residues using POOL with the two
THEMATICS environment features chained with POOL(G), and then rank-orders all the residues based
on their probability estimates. Also included in Figure 5.3 for comparison is a ROC curve, CHAIN(T_ION, G),
based on the same estimates for the ionizable residues but assigning probability estimates of zero to
all non-ionizable residues. Note that the data for this latter method are essentially the same as those of the
chained CHAIN(POOL(T4), POOL(G)) curve of Figure 5.2, except that the denominator for the recall
values is now the number of total active-site residues in the protein, whether ionizable or not, and the
denominator for the false positive rate is now the total number of non-active-site residues in the protein,
ionizable or not. The improved ROC curve for the merged estimate method CHAIN(T_ALL, G) compared
to the curve for the ionizables-only method CHAIN(T_ION, G) indicates that taking into account both
THEMATICS environment variables and cleft information does indeed help identify the non-ionizable
active-site residues. When the lists are merged, the rankings of some annotated positive ionizable residues
may be lowered, but it is apparent that this effect is more than offset, on average, by the rise in the
ranking of some annotated positive non-ionizable residues that are obviously missed by excluding them
altogether. If this were not the case, then one would expect the merged curve to cross below (and to the
right of) the comparison curve in the lower recall (and lower false positive) range.
[Figure 5.3: plot of Recall (vertical axis, 0 to 1) against False Positive Rate (horizontal axis, 0 to 0.25). Legend: CHAIN(T_ION, G), CHAIN(T_ALL, G).]
Figure 5.3 Averaged ROC curve comparing POOL methods applied to ionizable residues only, CHAIN(T_ION, G), and to all residues, CHAIN(T_ALL, G).
The MAS values for CHAIN(T_ALL, G) and CHAIN(T_ION, G) are 0.933 and 0.833, respectively. The
p-value of the Wilcoxon signed-rank test for observing such AveS values under the null hypothesis that
CHAIN(T_ALL, G) does not out-perform CHAIN(T_ION, G) is <0.0001, and CHAIN(T_ALL, G) outperforms
CHAIN(T_ION, G) in 31 of the 64 proteins. The number of proteins for which CHAIN(T_ALL, G)
outperforms CHAIN(T_ION, G) in this case may seem low, but both methods perform the same in 25 out of
the 64 proteins. For many of these latter cases, the protein does not have any non-ionizables in the active
site.
I have shown that the extension of the POOL method to non-ionizable residues gives a satisfactory result.
From now on, all residues are included in the study and I will just use T to indicate the way I apply
THEMATICS in T_ALL: for ionizable residues, I estimate the probability of being in active sites by
chaining the result of the POOL method on four THEMATICS features; for non-ionizable residues, I
estimate the probability of being in active sites by chaining the result of POOL method on two
THEMATICS features; I then combine the results and rank order the list of all residues based on their
probability of being in active sites.
5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if applicable.
So far, all the information I have used in protein active site prediction is derived solely from the protein 3D
structures; in other words, no sequence comparison information is used. As discussed in Chapter 2, this
makes the method applicable to those proteins with no or very few sequence homologues; indeed many of
the newly discovered protein structures from Structural Genomics projects have few or no sequence
homologues. It is generally true that most active site residues tend to be more conserved than others, with
only a few exceptions. Based on this observation, I believe, if I can include the sequence conservation
into our system when the information is available, I may get better performance. I put this hypothesis to
the test in this section, and the result is presented in Figure 5.4.
Figure 5.4 shows the ROC curves obtained with different features on the 160-protein set, with all residues
included as in Section 5.5.3. I used the 160-protein set instead of the 64-protein set because not all
proteins have reliable sequence conservation information, as explained below. With
the 64-protein set, the number of proteins with reliable sequence conservation information might not be
large enough for significance testing. Using the 160-protein set also shows that the performance
improvement from different feature sets is consistent between different test sets. As demonstrated in
Figure 5.2, chaining the POOL results together works better than applying POOL directly to
high-dimensional features, so I use chaining to combine the THEMATICS, cleft and sequence
conservation features.
As pointed out earlier, not all proteins have enough homologues for reliable sequence
conservation analysis. In this study, I use ConSurf for the conservation analysis. ConSurf requires
more than five homologues, and its results are claimed to be more reliable
when the number of homologues is larger than 10. I therefore use the conservation information only
when the protein has more than 10 homologues. For proteins without enough homologues (28 out of 160
in this case), I assign 1 as the probability estimate from the conservation POOL table. Since the ranking is
performed over residues within the same protein, this treatment is valid and does not affect the results for
other proteins in the set.
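The effect of this fallback can be illustrated with a minimal sketch. It assumes, purely for illustration, that the per-residue estimates are combined by a product, which stands in for any combiner that is strictly increasing in each argument; it is not the chaining step defined earlier in this chapter. The names `MIN_HOMOLOGUES`, `conservation_estimates` and `rank_residues`, and all numeric values, are hypothetical.

```python
MIN_HOMOLOGUES = 10  # conservation is used only with more than 10 homologues


def conservation_estimates(raw_estimates, n_homologues):
    """Return per-residue conservation estimates, or a constant 1.0 fallback."""
    if n_homologues > MIN_HOMOLOGUES:
        return list(raw_estimates)
    return [1.0] * len(raw_estimates)


def rank_residues(structure_scores, conservation_scores):
    """Rank residue indices by a combined score, best first.

    The product is an illustrative stand-in, not the dissertation's chaining.
    """
    combined = [s * c for s, c in zip(structure_scores, conservation_scores)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


structure = [0.10, 0.70, 0.30, 0.55]   # hypothetical structure-based estimates
unreliable = [0.9, 0.1, 0.8, 0.2]      # hypothetical values from too few homologues

fallback = conservation_estimates(unreliable, n_homologues=4)
ranking_with_fallback = rank_residues(structure, fallback)
ranking_structure_only = rank_residues(structure, [1.0] * len(structure))
```

Because the fallback value is the same for every residue of the protein, the two rankings coincide: the within-protein order is determined by the structure-based estimates alone.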
There are four curves for comparison. CHAIN(T) uses the four THEMATICS features for ionizable
residues and the two THEMATICS features for non-ionizable residues (I use the notation CHAIN(T) for
consistency; this could also have been notated more simply as “T”); CHAIN(T, G) uses both
THEMATICS and the cleft feature; CHAIN(T, C) uses both THEMATICS and the sequence conservation
information; and CHAIN(T, G, C) uses all three features by chaining.
Figure 5.4 shows that, among the four curves, CHAIN(T) is dominated by the other three, suggesting
that including the cleft feature, the sequence conservation feature, or both improves performance. Both
CHAIN(T, C) and CHAIN(T, G, C) dominate CHAIN(T, G), suggesting that incorporating sequence
conservation information improves performance more than incorporating cleft information alone.
Surprisingly, CHAIN(T, C) and CHAIN(T, G, C) have very similar performance, although in the recall
range below 80%, CHAIN(T, G, C) performs slightly better.
The MAS values for CHAIN(T, G, C), CHAIN(T, C), CHAIN(T, G) and CHAIN(T) are 0.925, 0.923, 0.907 and
0.899, respectively. Table 5.2 lists the p-values of Wilcoxon signed-rank tests on the per-protein AveS
measurements, under the null hypothesis that the method in the row does not outperform the method in the
column. The first number in each cell is the p-value; the number in parentheses is the number of
proteins for which the method in that row outperforms the method in that column.
Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues. The method using chaining to combine THEMATICS, geometric and sequence conservation features has the best performance. (Axes: false positive rate, 0 to 0.25, versus recall, 0 to 1. Curves: CHAIN(T), CHAIN(T, G), CHAIN(T, C), CHAIN(T, G, C).)
Method            CHAIN(T)         CHAIN(T, G)      CHAIN(T, C)
CHAIN(T, G, C)    <0.0001 (115)    <0.0001 (95)     <0.0001 (103)
CHAIN(T, C)       <0.0001 (101)    0.0008 (89)
CHAIN(T, G)       <0.0001 (101)
Table 5.2 Wilcoxon signed-rank tests between methods shown in Figure 5.4.
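For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a one-sided Wilcoxon signed-rank test on paired per-protein scores. It uses the normal approximation without tie or continuity corrections, which is an assumption of this sketch and not necessarily the procedure used to produce Table 5.2; the function name and the input values in the usage line are hypothetical.

```python
import math


def signed_rank_test(row_scores, col_scores):
    """One-sided Wilcoxon signed-rank test (normal approximation).

    Null hypothesis: the row method does not outperform the column method.
    Returns (wins, w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [r - c for r, c in zip(row_scores, col_scores) if r != c]
    n = len(diffs)
    if n == 0:
        return 0, 0.0, 1.0
    wins = sum(1 for d in diffs if d > 0)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z >= z)
    return wins, w_plus, p_value


# Hypothetical paired scores for five proteins (one tie, which is dropped).
wins, w_plus, p_value = signed_rank_test([5, 3, 8, 6, 4], [4, 5, 5, 6, 0])
```

The first returned value corresponds to the count in parentheses in each table cell, and the last to the p-value. For small samples an exact test would be preferable to the normal approximation.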
5.5.5 Recall-filtration ratio curves.
The results reported so far are all in the form of ROC curves. As discussed earlier, my analysis is not
committed to any particular cutoff or rule to select the active site residues from the top of the list. For
instance, users can select the top k residues in the ranked list of residues ordered by the estimated
probability of being in the active site, or they can select the residues with an estimated probability of
being in the active site greater than a certain cutoff value, or they can select the top p percent of the
residues in the ranked list. Of these three methods, I think the third would probably be
preferred in general, since it is less sensitive to variation in protein size and in the availability of
sequence conservation information. In this case, recall-filtration ratio (RFR) curves may be more useful
than ROC curves.
Since the main purpose of the RFR curve is to guide users in selecting an appropriate cutoff,
I report results only for CHAIN(T, G, C), which performs best among all the methods. The test
was performed on all residues ranked by their probability of being in the active site, and I averaged the
recall over proteins at each filtration ratio value to get the averaged RFR curve. For the curve shown in
Figure 5.5, for example, choosing the top 10% of the residues from the ranked list gives an average recall
of 90%, while choosing the top 5% gives an average recall of 79%.
Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160-protein test set. (Axes: filtration ratio, 0 to 1, versus recall, 0 to 1.)
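The averaging behind an RFR curve can be sketched as follows. The sketch assumes each protein is represented by its residue labels (1 for an annotated active site residue, 0 otherwise) ordered from highest to lowest estimated probability, and that the number of selected residues is rounded up; the rounding rule, the function names and the toy data are assumptions made for illustration.

```python
import math


def recall_at_filtration(ranked_labels, ratio):
    """Recall when the top `ratio` fraction of one protein's ranked residues is selected."""
    # round() guards against floating-point noise such as 0.3 * 10 = 3.0000000000000004
    n_selected = math.ceil(round(ratio * len(ranked_labels), 9))
    return sum(ranked_labels[:n_selected]) / sum(ranked_labels)


def averaged_rfr_curve(proteins, ratios):
    """Average the per-protein recall at each filtration ratio."""
    return [
        sum(recall_at_filtration(labels, r) for labels in proteins) / len(proteins)
        for r in ratios
    ]


# Two hypothetical proteins with 10 and 20 residues.
protein_a = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
protein_b = [1, 1] + [0] * 9 + [1] + [0] * 8
curve = averaged_rfr_curve([protein_a, protein_b], [0.1, 0.2, 0.5, 1.0])
```

Each protein contributes equally to the average regardless of its size, which is why a percentage cutoff is less sensitive to protein size than a fixed top-k cutoff.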
5.5.6 Comparison with other methods.
I also compare the CHAIN(T, G) and CHAIN(T, G, C) results with results from some
other top performing active site prediction methods, in particular Petrova’s method 39, Youn’s method 40,
and Xie’s geometric potential method 38. All these methods use SVMs. The first two use both sequence
conservation and 3D structural information, while Xie’s method uses structural information only.
The authors of the three methods report results on their datasets using measures different from those I
used in my studies. I therefore evaluate CHAIN(T, G) and CHAIN(T, G, C) on
the 160-protein test set and express the results in each of their forms of analysis. Because the performance
measures are not obtained from the same dataset, the results are not strictly comparable, but qualitatively the
comparisons below give a good idea of the relative performance.
In order to compare my results with theirs at a similar recall level, I used a 4% filtration ratio cutoff in the
POOL method to compare with Youn’s method, and a variable filtration ratio cutoff to compare with
Petrova’s method. Note that while our test set consists of proteins with a wide variety of different folds
and functions, Youn’s results are reported for sets of proteins with common fold or with similar structure
and function. Performance on the more varied set is a much more realistic test of predictive capability on
proteins of unknown function, particularly novel folds. Performance on a set of structurally or
functionally related proteins is also substantially better than performance on a diverse set, as one would
expect and as has been demonstrated by Petrova and Wu 39.
Youn’s method 40 achieved about 57% recall at 18.5% precision with an MAS (AUC) of 0.929, using both
sequence conservation and structural information, when trained and tested on proteins from the same
family; however, the performance dropped when training and testing were performed on proteins at the
superfamily and fold levels. Our CHAIN(T, G, C), with a preset 4% filtration ratio cutoff,
achieves an averaged recall of 64.68% with an averaged precision of 19.07% and an MAS (AUC) of 0.925
for all 160 proteins in the test set, which consists of proteins from completely different folds and classes.
Without sequence conservation, CHAIN(T, G) achieves an averaged recall, averaged
precision and AUC of 61.74%, 18.06% and 0.907, respectively. Our chained POOL method thus does
about as well as Youn’s method, even when we exclude conservation information, and a little better with
conservation information included, even though our diverse test set is one for which good performance is
most difficult to achieve. The complete results are shown in Table 5.3.
Petrova and Wu 39 measured the performance of their method globally, using all residues in all proteins,
instead of computing the recall, accuracy, false positive rate and MCC values for each protein and then
averaging them. Like Youn’s method, their method uses both sequence conservation information and 3D
structural properties as input to the SVM. They use a dataset, which they call the benchmarking dataset, that
contains a wide variety of proteins that are dissimilar in sequence, are structurally diverse, and span the
full range of E.C. classes of chemical functions. This dataset constitutes a fair test of how a method will
perform on structural genomics proteins of unknown function for which sequence conservation information is
available. Their method achieves a global residue-level recall of 89.8% with an overall predictive accuracy
of 86%, an MCC of 0.23 and a 13% false positive rate on a subset of 79 proteins from the CatRes
database. Tested on the 72 proteins from their set that also appear in my 160-protein set, CHAIN(T, G,
C) with a 10% filtration ratio cutoff achieves a residue-level recall of 88.6% at an overall predictive
accuracy of 91.0%, with an MCC of 0.28 and a 9% false positive rate. The residue-level recall,
overall predictive accuracy and MCC for CHAIN(T, G) are 85.2%, 91.0% and 0.27, respectively.
The results for Petrova’s method and for the present CHAIN methods with different filtration ratio cutoffs
are shown in Table 5.4 and the ROC curves in Figure 5.6. CHAIN(T, G, C) achieves comparable recall
with somewhat better accuracy and a lower false positive rate. CHAIN(T, G) performs almost as well,
even without conservation information.
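The difference between global (pooled) residue-level measures and per-protein averages can be made concrete with a small sketch. Confusion counts are summed over all proteins first, and recall, accuracy, false positive rate and MCC are then computed once from the pooled counts. The function names, the rounding rule for the cutoff and the toy data are assumptions made for illustration.

```python
import math


def confusion_counts(ranked_labels, ratio):
    """TP, FP, FN, TN when the top `ratio` fraction of one protein is predicted positive."""
    n_selected = math.ceil(round(ratio * len(ranked_labels), 9))
    tp = sum(ranked_labels[:n_selected])
    fp = n_selected - tp
    fn = sum(ranked_labels[n_selected:])
    tn = len(ranked_labels) - n_selected - fn
    return tp, fp, fn, tn


def global_metrics(proteins, ratio):
    """Pool confusion counts over all proteins, then compute residue-level measures."""
    tp = fp = fn = tn = 0
    for labels in proteins:
        a, b, c, d = confusion_counts(labels, ratio)
        tp, fp, fn, tn = tp + a, fp + b, fn + c, tn + d
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return recall, accuracy, fpr, mcc


# Two hypothetical proteins, labels ordered from highest to lowest estimate.
protein_a = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
protein_b = [1, 1] + [0] * 9 + [1] + [0] * 8
recall, accuracy, fpr, mcc = global_metrics([protein_a, protein_b], 0.2)
```

With pooled counts, large proteins contribute more residues and therefore weigh more heavily than small ones, which is one reason pooled and per-protein averaged figures are not directly interchangeable.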
Xie’s method 38, which is purely 3D structure based, reported performance in the following fashion: the
method achieves at least 50% recall with a false positive rate of 20% or less for 85% of the proteins
analyzed. The performance of the CHAIN(T, G) and CHAIN(T, G, C) methods measured in the same
way is listed in Table 5.5. Xie’s method should be compared against CHAIN(T, G), because neither method
uses conservation data. CHAIN(T, G) achieves at least 50% recall with a false positive rate of
20% or less for 96% of all proteins.
The results in the tables show that CHAIN(T, G), which uses only 3D structural information of
proteins, performs about as well as, or even better than, these best-performing current
active site prediction methods. When additional sequence conservation information is available,
CHAIN(T, G, C) performs still better.
Method / Data set                Sensitivity (%)   Precision (%)   AUC
Youn / Family                    57.02             18.51           0.9290
Youn / Superfamily               53.93             16.90           0.9135
Youn / Fold                      51.11             17.13           0.9144
CHAIN(T, G, C) / all proteins    64.68             19.07           0.925
CHAIN(T, G) / all proteins       61.74             18.06           0.907
Table 5.3 Comparison of sensitivity, precision, and AUC of CHAIN(T, G, C) with Youn’s reported results for proteins in the same family, superfamily, and fold.
Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method. All measures are residue level; the cutoff is the filtration ratio cutoff.
Method             Cutoff   Recall   Accuracy   False positive rate   MCC
Petrova’s method   -        89.8%    86%        13%                   0.23
CHAIN(T, G, C)     7%       81.4%    90.9%      6.2%                  0.31
CHAIN(T, G, C)     8%       85.2%    91.0%      7.1%                  0.31
CHAIN(T, G, C)     9%       85.6%    91.0%      8.0%                  0.29
CHAIN(T, G, C)     10%      88.6%    91.0%      9.0%                  0.28
CHAIN(T, G, C)     12%      90.2%    89.1%      11%                   0.26
CHAIN(T, G, C)     15%      91.9%    86.1%      14%                   0.23
CHAIN(T, G)        7%       73.7%    93.7%      6.1%                  0.28
CHAIN(T, G)        8%       77.1%    92.8%      7.0%                  0.28
CHAIN(T, G)        9%       81.8%    91.9%      8.0%                  0.27
CHAIN(T, G)        10%      85.2%    91.0%      9.0%                  0.27
CHAIN(T, G)        12%      86.9%    89.0%      11%                   0.25
CHAIN(T, G)        15%      89.4%    86.0%      14%                   0.22
Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method. (Axes: false positive rate, 0.06 to 0.14, versus recall, 0.7 to 0.95.)
Method Recall ≥ False positive rate < Achieved for
Xie 50% 20% 85%
CHAIN(T, G, C) 50% 20% 97%
CHAIN(T, G, C) 80% 20% 84%
CHAIN(T, G, C) 60% 10% 85%
CHAIN(T, G) 50% 20% 96%
CHAIN(T, G) 80% 20% 77%
CHAIN(T, G) 60% 10% 81%
Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method. Each method achieves at least the specified recall rate with a false positive rate less than specified for the percentage of proteins in the last column.
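A measure of this form can be computed from ranked lists as sketched below. The sketch assumes that a protein counts as achieved when some cutoff on its ranked list reaches the required recall while the false positive rate stays below the bound; the exact rule used in the cited work is not spelled out here, so this rule, the function names and the toy data are all assumptions.

```python
def achieves(ranked_labels, min_recall, max_fpr):
    """True if some top-k cutoff reaches recall >= min_recall with FPR < max_fpr."""
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    tp = fp = 0
    for label in ranked_labels:  # grow the cutoff one residue at a time
        if label:
            tp += 1
        else:
            fp += 1
        if tp / positives >= min_recall and fp / negatives < max_fpr:
            return True
    return False


def fraction_achieving(proteins, min_recall, max_fpr):
    """Fraction of proteins for which the recall and FPR conditions can both be met."""
    return sum(achieves(p, min_recall, max_fpr) for p in proteins) / len(proteins)


# Two hypothetical proteins, labels ordered from highest to lowest estimate.
well_ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
poorly_ranked = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
fraction = fraction_achieving([well_ranked, poorly_ranked], 0.5, 0.2)
```

Each protein is assumed to contain at least one annotated positive and one negative residue; otherwise the ratios are undefined.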
5.5.7 Rank of the first positive.
The last result I will present is one that is only applicable to methods that generate a ranked list: the rank
of the first true positive in the list. This metric is useful for users who are interested in finding a few of the
active site residue candidates and who do not necessarily need to know all of the active site residues.
Users could use the list from the POOL method to guide their site-directed mutagenesis experiments by
going down the ranked list one residue at a time; once the first active site residue is found, it should be
easier to find the remaining active site residues by examining its neighbors. A histogram giving the rank of the first
active site residue found by CHAIN(T, G, C) on the 160 protein set is shown in Figure 5.7. The median
rank of the first true positive active site residue in the 160 protein set with CHAIN(T, G, C) method is
two. For 46 out of 160 proteins, the first residue in the resulting ranked list is an annotated active site
residue. 65.0%, 81.3% and 90.0% of the 160 proteins have the first annotated active site residue located
within the top 3, 5 and 10 residues of the ranked list, respectively. Such measurements are not easily
made for binary classification methods.
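These summary statistics are straightforward to compute from ranked lists, as the following sketch shows. It assumes each protein is given as its residue labels ordered from highest to lowest estimated probability and contains at least one annotated residue; the function names and toy data are hypothetical.

```python
import statistics


def first_positive_rank(ranked_labels):
    """1-based rank of the first annotated active site residue in a ranked list."""
    return ranked_labels.index(1) + 1


def summarize_first_ranks(proteins, top_ks=(1, 3, 5, 10)):
    """Median first-positive rank and the fraction of proteins with a hit in the top k."""
    ranks = [first_positive_rank(labels) for labels in proteins]
    within = {k: sum(r <= k for r in ranks) / len(ranks) for k in top_ks}
    return statistics.median(ranks), within


# Four hypothetical proteins whose first positives sit at ranks 1, 2, 5 and 3.
toy_proteins = [[1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0, 1], [0, 0, 1]]
median_rank, within_top_k = summarize_first_ranks(toy_proteins)
```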
Figure 5.7. Histogram of the rank of the first annotated active site residue. Top: rank of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160-protein set (x-axis: rank of the first true positive, 1 to 10 and 10+; y-axis: percentage of proteins). Bottom: the cumulative distribution of the rank of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160-protein set (x-axis: rank, 1 to 10; y-axis: percentage of proteins).
5.6 Discussion.
In this chapter, I presented the application of the POOL method using THEMATICS plus some other
features for protein active site prediction.
I started with the application of the POOL method just on THEMATICS features, with features similar to
those I used before in the SVM method, as well as those used in Ko and Wei’s statistical analysis 24, 54.
My results show that the POOL method outperforms all of the earlier THEMATICS methods with no
training data cleaning and no clustering after the classification. This suggests that by emphasizing the
underlying THEMATICS principles, the POOL method makes better use of the training data and
automatically limits the adverse effect that noise in the training data set might have caused in the other
methods. The results also supply further evidence that the THEMATICS principles are valid in reality. In
some sense, this opens another possible application for the POOL method, to verify the underlying
monotonicity hypothesis, which could be worth further investigation in the future.
I also tested different ways of incorporating additional features into the learning system. Not surprisingly,
the results show that in order to improve performance, I have to incorporate the right features in the right
way. In addition to using CastP to get the size rank of the cleft in which a residue resides, I also tried area
of solvent accessibility, residue type and some other features. Unfortunately, these extra features did not
help the performance of the POOL method. One possible reason is overfitting, or correlation
between these features and other features already present in the system. Even for
features that were found to improve performance, how they are incorporated matters.
The results show that chaining the results from separate POOL estimates is better than simply combining
all the available features into a big POOL table with very high dimension. As mentioned earlier, the
reason behind this might be overfitting, since combining features into a POOL table with high dimension
causes the number of probabilities needed for estimation to grow exponentially, while the training data
can only increase linearly in most cases. In other words, the high dimensionality makes the table too
sparse and less accurate for probability estimates.
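The sparsity argument can be illustrated with simple arithmetic. The sketch below assumes, hypothetically, that every feature is quantized into the same number of bins and that a fixed number of training residues is available; neither number is taken from this study.

```python
def cells_and_density(bins_per_feature, n_features, n_training_residues):
    """Cells in a joint table over all features, and average training residues per cell."""
    cells = bins_per_feature ** n_features
    return cells, n_training_residues / cells


# Hypothetical setting: 10 bins per feature and 50,000 training residues.
for n_features in (2, 4, 6):
    cells, density = cells_and_density(10, n_features, 50_000)
    print(f"{n_features} features: {cells} cells, {density:g} residues per cell")
```

Under these assumed numbers the average occupancy falls from hundreds of residues per cell with two features to well below one residue per cell with six, which is the regime in which per-cell probability estimates become unreliable.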
I also extended the application of THEMATICS to all residues, not just ionizable residues, in a natural
way and showed that it is effective. Although the performance for predicting non-ionizable residues is
not as good as the performance for predicting ionizable ones, this extension does provide a way to
combine features from THEMATICS, which by itself can only be applied to ionizable residues directly,
with some other features, making the comparison with the performance of other methods more accurate
and fair.
The incorporation of sequence conservation information does improve the prediction when there are
enough homologues with appropriate similarities. The POOL method gives us a means for easily utilizing
this information when it is available, while not affecting the training and classification when it is not.
When comparing with other methods, especially if the other methods use binary classification instead of a
ranked list, I have to commit to a specific cutoff value and turn my system into a binary classification
system. The results in Section 5.5.6 clearly show that the POOL method using THEMATICS and
geometric features achieves equivalent or better performance than the other methods in comparison, even
in cases where their methods are tested on very special groups of proteins. This makes my method more
widely applicable to proteins with few or no sequence homologues, such as some Structural Genomics
proteins, while both Youn’s and Petrova’s methods need sequence alignments from homologues.
The performance of Youn’s and Petrova’s methods will degrade significantly when sequence conservation
information is not available. In contrast, the performance of my CHAIN(T, G) method does not degrade in the
absence of sequence conservation information. The results also show that with additional sequence
conservation information, when available, the performance can be further improved.
Interestingly enough, when I compare the performance of CHAIN(T, G) and CHAIN(T, G, C) in Figure
5.4, it is apparent that the addition of the conservation information does improve the performance a little,
but not to the extent observed previously for sequence-structure methods. Typically the conservation
information is the most important input feature, and without it performance is substantially worse 18. This
suggests that the 3D structure based THEMATICS features are quite powerful compared with other 3D
structure based features.
When looking at the recall and false positive rates of the results from all the protein active site prediction
methods, one must keep in mind that the annotation of the catalytic residues in the protein dataset is never
perfect. Since most of the labeling comes from experimental evidence, some active site residues are not
labeled as positive simply because there is no experiment designed and carried out to verify the role of
that specific residue. Since I use the CatRes/CSA annotations as the sole criterion for evaluating
performance, in order to keep the comparisons consistent, as mentioned several times in this dissertation,
the reported false positive rate may be higher than the true rate. There is evidence available to support the
functional importance of some residues that are not labeled as active site in the CatRes/CSA database 24, 58,
but they have high ranks in the list from the POOL method and are classified as positive by
THEMATICS-SVM and THEMATICS-statistical analysis as well.
Although I evaluated the POOL method performance by using filtration ratio values as a cutoff, it is just
for the purpose of comparing with other protein active site prediction methods that use a binary
classification scheme. The ranked list of residues based on their probability of being in the active site
contains much more information than traditional binary classification labeling. The rank of the first
annotated positive residue analysis in Section 5.5.7 shows just one application of the extra information
contained in a ranked list rather than a traditional binary label. There are many possible measurements of
performance depending on the actual application by users, and in turn many possible applications that can
benefit from using a ranked list form. It is noteworthy that P-Cats 21 uses a k-nearest neighbor method to
estimate the probability of a residue of being in an active site of a protein, and in principle can be a basis
of creating a ranked list as results, but their method just uses the probability estimates as the basis to
assign binary labels; residues with probability larger than 0.50 are labeled as positive and the others as
negative. Although the online server for their method 76 does report the probability estimates along with
the corresponding binary labels, the potential benefits of using a ranked list instead of binary labeling have
not been fully addressed either in the paper or online.
In conclusion, I have shown that applying the POOL method with THEMATICS and other features
appears to yield the best protein active site prediction system yet found, and that it provides more
information than other active site prediction methods.
Chapter 6
Summary and Conclusions.
Here, I summarize the work I have presented in this dissertation.
This dissertation starts with an introduction to the central problem I try to solve: using machine
learning methods to build an automated system that predicts active sites from protein structure alone
and that can further improve its predictions by using sequence conservation when that information is
available.
In the second chapter, the dissertation briefly surveys the background of protein active site prediction and
of machine learning techniques, especially the probability-based learning techniques that form the
foundation of the POOL method. This chapter also introduces THEMATICS, an effective and accurate
protein active site predictor using only structural information of proteins, which forms another foundation
of this work.
The third chapter reports my work on using SVM with structure-only information, which
achieved better performance than both competing methods using all kinds of features and
THEMATICS with statistical analysis methods. At the end of that chapter, I explain the limitations of
using traditional machine learning techniques in the form of classification to solve the protein active site
prediction problem.
Chapter Four of my dissertation proposes a novel POOL method as an approach to solve a class of
problems involving estimating class probabilities under multi-dimensional monotonicity constraints. In
this chapter, I describe the properties of such problems and a framework under which we can describe and
solve such problems. I present an algorithm for solving such problems and a mathematical proof that
the solution is indeed optimal under both the sum-of-squared-error and the maximum-likelihood criteria.
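To give a feel for monotonicity-constrained probability estimation, the sketch below implements the classical pool-adjacent-violators procedure for a single, totally ordered feature. This is not the POOL algorithm, which handles multi-dimensional partial orders; it is only the familiar one-dimensional special case, in which the least-squares nondecreasing fit to 0/1 labels is also the maximum-likelihood monotone estimate. The function name and the input are hypothetical.

```python
def pava(labels):
    """Least-squares nondecreasing fit to 0/1 labels ordered by increasing feature value.

    Each block stores [sum, count]; adjacent blocks are pooled while their
    means violate the nondecreasing order.
    """
    blocks = []
    for y in labels:
        blocks.append([float(y), 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# The second and third residues violate monotonicity and are pooled.
estimates = pava([0, 1, 0, 1, 1])
```

Each pooled block is assigned the fraction of positives it contains, so the fitted values can be read directly as class probability estimates that respect the ordering.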
In Chapter Five, the protein active site prediction problem is reframed into a ranked list problem from a
standard binary classification problem. The dissertation presents the application of using the POOL
method to estimate the class probabilities of residues being in the active site under class probability
monotonicity assumptions. Using the POOL method and THEMATICS, I achieved better performance
than with the THEMATICS-SVM system. After incorporating more features, including sequence
conservation information, and extending the method from ionizable residues only to all residues, the
POOL method achieved the best performance so far, compared with both the earlier THEMATICS
methods and other existing methods that use SVMs and all kinds of structural and sequence information
from proteins.
My work has established the following two claims:
THEMATICS is an effective and accurate protein active site predictor and can be automated by different
machine learning techniques. Incorporating more features makes it even more effective and accurate in
protein active site prediction.
The POOL method is an efficient way to estimate probability with maximum likelihood under the multi-
dimensional monotonicity constraints. It provides a platform allowing probability estimates to be easily
combined in a simple manner. It can be used in protein active site prediction and potentially many other
applications where monotonicity constraints play a role, such as disease detection from markers in the
blood, risk assessment and many more.
6.1 Contributions.
Listed below are the novel contributions to using machine learning techniques with THEMATICS in
protein active site prediction that have been presented in this dissertation.
Use SVM with THEMATICS in active site prediction. The work of using SVM with THEMATICS in
protein active site prediction presented in this dissertation is the first successful approach that uses the
machine learning techniques to automate the THEMATICS method. It outperforms both the
THEMATICS-statistical method as well as other 3D structure based protein active site prediction
methods.
Turn the protein active site prediction problem into a probability based ranked list problem. This
dissertation frames the protein active site prediction in the form of a ranked list problem, instead of a
traditional binary classification problem. Although the P-Cats method 21 also estimates the probability of
residues being in the active site using a k-nearest neighbor method, it is still framed as a binary
classification problem. This dissertation emphasizes the benefits of using the ranked list scheme, which
gives users more control in setting their own cutoff thresholds. The probability estimates behind the
ranked list make possible the next contribution, combining the results from different methods.
Combine probability estimates by applying the POOL method on different features. Since all the
results in the POOL method are essentially probability estimates, it becomes possible to utilize more
features using the chaining technique. This makes the method less susceptible to the problem of
analyzing sparse data in high dimensionality.
Introduce monotonicity constraints into machine learning. This is a successful approach that enforces
a prior belief of the data in the machine learning task, and the results indicate that if the prior belief is
indeed correct, the performance of the learning system can improve by the incorporation and the
enforcement of this knowledge.
Develop a novel POOL method. This dissertation frames the problem of assigning probabilities under
multi-dimensional monotonicity constraints with minimum sum of squared error (SSE) in the form of a
special form of convex optimization problem and develops the POOL algorithm to solve it more
efficiently and accurately than solving a general convex optimization problem.
Prove that minimizing SSE maximizes likelihood in the present problem. This dissertation proves,
using the K.K.T. conditions, that the probability assignment minimizing SSE also maximizes the
likelihood under the multi-dimensional monotonicity constraints.
Use the POOL method with THEMATICS in protein active site prediction. This dissertation presents
a practical application of the POOL method by applying the method with THEMATICS in protein active
site prediction. It outperforms all other protein active site prediction methods to date.
Use the environment feature to incorporate influences of nearby residues in the protein active site
prediction. This dissertation introduces an environment feature (see 5.2) as a new way to incorporate the
influences from nearby residues in the active site prediction. This gives two benefits: first, it makes the
THEMATICS method applicable to non-ionizable residues; second, it avoids the extra step of clustering
after the classification process as used in THEMATICS-statistical analysis and the THEMATICS-SVM
study.
6.2 Future research.
There are many possible ways to extend the research described in this dissertation. I have mentioned
some of them in earlier sections. Here I outline some directions in which the application of the POOL
method with THEMATICS can be extended to further improve the performance of protein active site
prediction.
One place for further research is the use of the ranked list from the POOL method. As mentioned earlier,
the POOL method does not make any commitment to any rule or value for the cutoff. But users may still
want to know “exactly” which residues are in the active sites. Although I suggested using filtration ratio
as a possible cutoff and used this for comparison with other methods, it is still coarse and I believe it can
be further improved. One possibility is to look at the actual probability estimates the POOL method
gives: their scale and the differences between adjacent residues in the ranked list may give some
clue about where to choose the cutoff value. Another approach is to use machine learning as the first
screening step in the process and get more human involvement in refining the predictions. Human experts
can look at the 3D structure of the proteins to see where the residues near the top of the ranked list are
located. Residues in some areas, such as in clefts near the surface and close to each other are more likely
to be in the active site than others such as those deeply buried or isolated from other residues near the top
of the ranked list. Of course one can also feed the result from the POOL method, either the rank, or the
raw probability estimates into another machine learning system to further improve the performance. Most
likely, some normalization needs to be performed on these features if cross protein training and testing is
used.
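As one illustration of the gap-based idea mentioned above, the sketch below places the cutoff at the largest drop between adjacent probability estimates in the ranked list. This is only one conceivable heuristic, it has not been evaluated in this dissertation, and the function name and input values are hypothetical.

```python
def largest_gap_cutoff(probabilities):
    """Number of top residues to keep, cutting at the largest drop between neighbors.

    `probabilities` must be sorted in decreasing order and contain at least two values.
    """
    gaps = [probabilities[i] - probabilities[i + 1]
            for i in range(len(probabilities) - 1)]
    return gaps.index(max(gaps)) + 1


# Hypothetical estimates: the drop from 0.8 to 0.3 is the largest gap.
ranked_estimates = [0.9, 0.85, 0.8, 0.3, 0.25, 0.1]
n_selected = largest_gap_cutoff(ranked_estimates)
```

In practice such a rule would probably need an upper bound on the number of selected residues, since a large gap deep in the list would otherwise select far too many.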
Another possible area for improving performance in protein active site prediction is feature selection.
Some methods I compared against in Section 5.5.6 use many more features. Although it is not true that
the more features one uses, the better the performance one gets, it is worth exploring the use of more
features, including both the simple ones extracted from the 3-D structures or sequences directly, and more
sophisticated ones such as the four THEMATICS features I used, or even results from other machine
learning systems. These features can either be fed to the POOL method, or to another learning system
along with the result of the POOL method.
I will mention one more area where further research is needed, although it is beyond the scope of protein
active site prediction. Once one knows the active site of a protein, the next natural question would be
what function the active site performs. It is the next step in function prediction after active site prediction.
I believe the exact shape of THEMATICS titration curves of a certain residue and the shapes of
THEMATICS titration curves of residues within a certain region may give us some clues about what class
of reaction it catalyses. In principle, this might also be solved by machine learning. It is clearly a
very challenging and rewarding task.
Appendices
Appendix A. The training set used in THEMATICS-SVM
Name EC Classification PDB ID
Acetylcholinesterase (E.C. 3.1.1.7) 1ACE
3-Ketoacetyl-Coa Thiolase (E.C. 2.3.1.16) 1AFW
Ornithine carbamoyltransferase (E.C. 2.1.3.3) 1AKM
Glutamate racemase (E.C. 5.1.1.3) 1B74
Alanine racemase (E.C. 5.1.1.1) 1BD0
Adenosine kinase (E.C. 2.7.1.20) 1BX4
Subtilisin Carlsberg (E.C. 3.4.21.62) 1CSE
Micrococcal nuclease (E.C. 3.1.31.1) 1EY0
Oxalate oxidase (E.C. 1.2.3.4) 1FI2
DNA- Lyase (E.C. 4.2.99.18) 1HD7
2-amino-4-hydroxy-6-hydroxymethyl-dihydropteridine Pyrophospho-kinase (E.C. 2.7.6.3) 1HKA
Colicin E3 Immunity Protein (E.C. 3.1.21.-) 1JCH
L-lactate dehydrogenase (E.C. 1.1.1.27) 1LDG
Papain (E.C. 3.4.22.2) 1PIP
Mannose-6-phosphate isomerase (E.C. 5.3.1.8) 1PMI
Pepsin (E.C. 3.4.23.1) 1PSO
Triosephosphate Isomerase (E.C. 5.3.1.1) 1TPH
Aldose Reductase (E.C. 1.1.1.21) 2ACS
HIV-1 Protease (E.C. 3.4.23.16) 2AID
Mandelate Racemase (E.C. 5.1.2.2) 2MNR
Appendix B. The test set used in THEMATICS-SVM
The following table gives the testing results for 64 CatRes/CSA proteins. Bold indicates residues that are
CatRes/CSA active, ionizable, and correctly predicted by the SVM in a cluster of ionizable residues only. Bold
italic indicates residues that are CatRes/CSA active and missed in the SVM result, but found in our cluster when neighbors within
6Å are included. Underline indicates CatRes/CSA active residues missed by both criteria. []ab means that
symmetric clusters appear in both chain a and chain b. [XXXab] means that residue XXX of chain a and of
chain b are in the same cluster. In the recall and filtration ratio columns, the number to the left of “;”
is the percentage obtained from SVM and clustering on ionizable residues only, and the number to the
right is the result when neighbors within 6Å are included.
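To make the recall column concrete, the following sketch recomputes the SVM-only recall for one row of the table (1BOL, Ribonuclease T2) from its listed CatRes residues and reported clusters. It assumes that recall is the percentage of annotated residues appearing in any reported cluster, and it ignores chain suffixes such as "ab" for simplicity; the function names are hypothetical.

```python
import re


def residues_in_clusters(cluster_text):
    """Collect residue labels such as 'H109' from bracketed clusters.

    Lower-case chain suffixes (a, b, ab, ...) are ignored; this is a
    simplifying assumption that treats all chains alike.
    """
    return set(re.findall(r"[A-Z]\d+", cluster_text))


def recall_percent(annotated, cluster_text):
    """Percentage of annotated residues found in any reported cluster."""
    predicted = residues_in_clusters(cluster_text)
    hits = sum(1 for residue in annotated if residue in predicted)
    return round(100 * hits / len(annotated))


# Row 1BOL from the table: CatRes positives and SVM-reported clusters.
annotated_1bol = ["H46", "E105", "H109"]
clusters_1bol = "[Y116 Y121 E128 D129 D132 Y202] [E105 H109]"
print(recall_percent(annotated_1bol, clusters_1bol))  # 67
```

Two of the three annotated residues (E105 and H109) fall in a reported cluster, which gives the 67 shown to the left of the ";" for this protein.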
PDB Code Protein Name CatRes Positive
SVM Reported Positive
Recall (%) Filtration ratio (%) Result (SVM only / SVM-region)
1AL6 Citrate (si)-synthase H274 H320 D375
[H246a H249a R413a E420a H246b H249b R413b E420b] [Y231b H235b H238a H274a R329b D375b R401b R421a] [Y231a H235a H238b H274b R329a D375a R401a R421b] [D174 D257]a,b [Y158 Y167]a, b [Y318 Y330]a,b
67;100 5;17 Correct/Correct
1APX Heme peroxidase R38 H42 N71
[E65 H68] [H163 D208] [H116 E244]
0;33 2;10 Incorrect/Partial correct
1AQ2 Phosphoenol pyruvate carboxykinase
H232 K254 R333
[R65 K70 C125 Y207 E210 K212 K213 H232 C233 K254 D268 D269 E270 H271 E282 C285 Y286 E311 R333 Y336] [C408 Y421 Y524] [E36 H146]
100;100 5;16 Correct/Correct
1B3R Adenosylhomocysteinase D130 K185 D189 N190 C194
[C52 H54 D129 D130 E154 E155 D189 C194 C227] [Y220 Y256]
60;100 3;20 Correct/Correct
1B6B Aralkylamine N-acetyltransferase
S97 L111 H122 L124 Y168
[E54 H127]a [H120 H122]ab [H145 H174]a [E50 E52]b [E190 E192]b [C160 Y168]b
30;70 5;17 Partial correct/Correct
1B93 Methylglyoxyl synthase H19 G66 D71 D91 H98 D101
[H19 D71 H98 D101]
67;100 3;11 Correct/Correct
1BG0 Arginine Kinase R126 E225 R229 R280 R309
[Y68 Y89 H90 H99 R124 R126 C127 D192 E224 E225 D226 R229 C271 R280 R309 E314 H315 R330 E335] [H185 H284] [Y145 R208] [Y134 K151]
100;100 7;20 Correct/Correct
1BOL Ribonuclease T2 H46 E105 H109
[Y116 Y121 E128 D129 D132 Y202] [E105 H109]
67;100 4;13 Correct/Correct
1BWD L-arginine:Inosamine-phosphate amidinotransferase
D108 R127 D179 H227 D229 H331 C332
[E9 E37 Y53 H87 H102 D103 C105 R107 D108 E130 D179 H227 D229 H278 C279 H331 C332]ab [D30 Y161]b
86;100 5;15 Correct/Correct
1BZY Hypoxanthine-guanine phosphoribosyltransferase
E133 D134 D137 K165 R169
[R100 K102 D193] [Y104 D137 K165] [E133 D134]
80;100 4;12 Correct/Correct
1CD5 Glucosamine-6-phosphate isomerase
D72 D141 H143 E148
[D72 E73 Y74 Y85] [H19 E198] [K124 Y128]
25;25 3;9 Partial correct/ Partial correct
1CHD Protein-glutamate methylesterase
S164 T165 H190 M283 D286
[H233 E235 H248] [H190 H256 D286]
40;80 3;10 Partial correct / Correct
1COY Cholesterol Oxidase E361 H447 N485
[R44 C57 R65 Y92 Y107 Y219 Y446] [Y21 K225] [R328 Y376]
0;33 2;10 Incorrect / Partial correct
1CQQ Picornain 3C H40 E71 G145 C147
[Y97 C100 K153]
0;0 2;5 Incorrect / Incorrect
1CTT Cytidine Deaminase E104
[H102 E104 C129 H131 C132]ab [E138 H203 E229]ab [C217 Y252]b
100;100 3;11 Correct/Correct
1D0S Nicotinate-nucleotide-dimethylbenzimidazole phosphoribosyltransferase
E174 E317 G176 K213
[D69 H70 E174 D242 D263]
25;75 1;8 Partial Correct / Correct
1DAA Aminotransferase class-IV
E177 K145 L201
[R22a R98a H100a Y31b E32b H47b R50b Y88b Y114b K145b] [Y31a E32a H47a R50a Y88a K145a R98b] [C142 E177]ab [R22 R93]b
67;100 4;14 Correct / Correct
1DAE Dethiobiotin synthase T11 K15 K37 S41
[D10ab C151ab H154ab] [K15 K37]ab
50;75 2;11 Correct/Correct
1DB3 GDP-mannose 4,6-dehydratase
T132 E134 Y156 K160
[D13 Y128 Y177 C179 H186] [D105 E134 Y156 K160] [H228 D231 E344] [K2 Y26] [E315 E317]
75;100 5;16 Correct/Correct
1DL2 Mannosyl-oligosaccharide 1,2-alpha-mannosidase
E132 R136 D275 E435
[D86 E132 E207 D275 E279 K283 D336 H337 Y389 E399 E435 E438 E503 Y507 E526 H528] [K216 Y235 E290 Y293] [D52 H68]
75;100 5;16 Correct/Correct
1DNK Deoxyribonuclease I E78 H134 D212 H252
[E39 Y76 E78 H134 D168 D212 H252] [R31 Y32]
100;100 4;10 Correct/Correct
1DNP Deoxyribodipyrimidine photolyase
W306 W359 W382
[H44 E106 E109 C251 R278 R282 D372 D374]ab [R8 D10 D130]a [E318 Y365]ab [D409 D431]ab [D327 D331]b [D354 H453]ab [K353 Y464]a
0;0 7;15 Incorrect / Incorrect
1DZR dTDP-4-dehydrorhamnose 3,5-epimerase
H63 D170
[R60 H63 D84 H120 Y133 Y139 E144]
50;100 4;14 Correct/Correct
1E2A Histidine Kinase IIAlac H78 Q80 D81 H82
[H78abc D81abc H82abc] [E32 H94 H95]b [E32 H94 H95]c [E32 H95]a
75;75 5;21 Correct/Correct
1EBF Homoserine dehydrogenase D219 K223
[K117 E208 D210 D213 D214 D219 K223]
100;100 2;10 Correct/Correct
1FRO Lactoylglutathione lyase E172
[R37a E99a Y114a D165a D167a R122b H126b E172b] [R37c E99c Y114c D165c D167c H126d E172d] [R37b E99b Y114b D165b D167b H126a E172a] [Y70 H102 E107]abcd [H126c E172c E99d] [C138b K151b] [K150d K158d] [D165d D167d] [H115ab]
100;100 6;27 Correct/Correct
1GOG Galactose Oxidase C228 Y272 W290 Y495
[C228 Y272 H334 C383 Y405 H442 Y495 H496 H581] [H85 D166 H522 D524] [D324 D404]
75;100 2;8 Correct/Correct
1GRC Phosphoribosylglycinamide formyltransferase
N106 H108 S135 D144
[H54ab E70ab H73ab E74ab Y78ab] [H108 H137 D144]a [Y67 R90]ab [E44ab]
50;50 5;14 Correct/Correct
1HXQ UDP-glucose--hexose-1-phosphate uridylyltransferase
C160 H164 H166 Q168
[E182 E232 H281 H296 H298] [H115 E152 H164 H166] [R13 Y201 R211 R324] [E121 D267 H342] [D183 H292]
50;75 5;18 Correct/Correct
1I7D DNA topoisomerase III E7 K8 R330
[E7 K8 H44 D103 D105 E107 E114 D136 Y320 D332 C333 H340 C372 D379 H381 H382 Y410 D520 E525] [H100 D113] [E286 E458]
67;67 4;11 Correct/Correct
1IDJ Pectin lyase R176 R236
[D186 D217 D221 R236 K239 D242] [H247 E272] [H178 H210]
50;50 3;10 Correct/Correct
1KAS 3-oxoacyl-[acyl-carrier protein] synthase
C163 H303 H340 F400
[C163 H303 D311 E314 K335 H340 E349 C395]ab [H168ab H172ab D181ab] [E115ab H118ab]
75;100 3;14 Correct/Correct
1LBA T7 Lysozyme Y46 K128
[H17 C18 Y46 H47 H68 C80 H122 C130]
50;100 5;18 Correct/Correct
1MAS Purine nucleosidase D14 N168 H241
[H157ab E158ab D192ab H195a] [D10 D14 D15 E166 H241 D242]b [D10 D14 D15 H241 D242]a [E265ab] [K44 Y92]a
67;100 4;13 Correct/Correct
1MHY Methane Monooxygenase C151 T213 [D74a K78a H80a E89a R171a D172a C173a R179a K45b Y46b K49b R186b D190b D196b E199b C200b D270b D418b H439b H446b D450b E454b E460b E462b R463b Y464b E465b C466b H467b E471b R45c D51c Y54c E58c E62c H112c R116c K12c D133c] [H39a C57a Y99a H109a H110a D176a E71b D75b E111b E114b D143b E144b H147b H149b C151b D170b R172b R175b E209b D242b E243b H246b] [H166 K189 H252 D256 R264]a [K104 Y162 H168 R360]b [Y112a K116a Y290a K65b] [R98 Y288 Y351]b [D243 E246]a
50;100 8;25 Correct/Correct
1MPP Mucoropepsin D32 S35 Y75 D215
[D9 D11 E13 D32 D215]
50;75 1;4 Correct/Correct
1NID Nitrite Reductase (Copper Containing)
D98 H255
[H95a D98a H100a H135a C136a H145a H255c E279c H306c] [H95c D98c H100c H135c C136c H145c H255b E279b H306b] [H95b D98b H100b H135b C136b H145b H255a E279a H306a] [E180 D182 H245 E310]abc [D251abc] [H260 Y293 R296]abc
100;100 5;19 Correct/Correct
1NSP Nucleoside-diphosphate kinase
K16 N119 H122
[C16 H55 Y56 E58 R109 H122 E133]
67;100 5;15 Correct/Correct
1NZY 4-Chlorobenzoyl Coenzyme A Dehalogenase
F64 H90 G114 W137 D145
[D160abc E163abc E175abc D178abc] [D123b D145b Y150b R154b H218b C228c E232c] [C228a E232a H90c D145c Y150c R154c H218c] [D145a Y150a R154a H218a C228b E232b] [H138 D168]b
40;60 4;16 Partial correct / Correct
1PGS Peptide Aspartylglucosaminidase
D60 E206
[Y62 R80 Y116] [Y161 Y293] [Y183 K302] [H224 Y277]
0;0 3;14 Incorrect / Incorrect
1PJB Alanine dehydrogenase K74 H95 E117 D269
[E8 E13 R15 K72 K74 E75 Y93 H95 Y116 E117]
75;75 3;9 Correct / Correct
1PKN Pyruvate Kinase R72 R119 K269 T327 S361 E363
[R72 D112 E117 K269 E271 D295 E299] [C316 D356 C357 E385 R444 R466] [D224 D227] [K265 Y465]
33;67 3;12 Partial correct / Correct
1PNL Penicillin Acylase S1 A69 N241
[E152a K179a H192a D73b D74b D76b Y180b Y190b Y196b D252b] [D12a H18a Y31a D38a R39a Y96a Y33b H38b H520b] [R145a Y27b Y31b Y52b] [R263 K394]b [D484 D501]b [R479 Y528]b [E80 H123]b [Y33 K106]a
0;0 4;18 Incorrect / Incorrect
1PUD Queuine tRNA-ribosyltransferase (tRNA-guanine transglycosylase)
D102
[K55a D315b C318b C320b C323b E348b H349b] [D315a C318a C320a C323a E348a H349a K55b] [R38 R60 R362]ab [D102 D280]ab
100;100 3;10 Correct / Correct
1QFE 3-dehydroquinate dehydratase
E86 H143 K170
[E46 E86 D114 E116 H143 K170]ab [D50 H51]ab
100;100 3;12 Correct / Correct
1QPR Quinolinate phosphoribosyltransferase (decarboxylating) (Type II)
K140 E201 D222
[R139 R146 K150 H161 R162 K172 D173 E199 E201 D203 D222 E246] [D57 D80]
67;67 5;13 Correct / Correct
1QQ5 2-haloacid dehalogenase D8 T12 R39 N115 K147 S171 N173 F175 D176
[D8 R39 Y89 K147 Y153 D176] [Y95 R192]
56;100 4;13 Correct / Partial Correct
1QUM Deoxyribonuclease IV E261
[H69 D70 E94 E145 C177 D179 C181 H182 H216 E261]
100;100 4;12 Correct / Correct
1RA2 Dihydrofolate reductase I5 M20 D27 L28 F31 L54 I94
[K38 K109 Y111]
0;0 2;6 Incorrect / Incorrect
1UAE UDP-N-acetylglucosamine 1-carboxyvinyltransferase
N23 C115 D305 R397
[K22 R91 C115 R120 E188 E190 D231 E234 H299 R331 D369 R371 H394 R397 Y399 K405]
50;100 4;14 Correct / Correct
1UAG UDP-N-acetylmuramoylalanine--D-glutamate ligase
K115 N138 H183
[K115 H183 Y187 Y194 R302 K319 D346 K348 C413 R425]
67;100 9;8 Correct / Correct
1ULA Purine-nucleoside phosphorylase (type 1)
H86 E89 N243
[D134a H135a Y166a E201c E205c Y249c] [E201a E205a Y249a D134b H135b Y166b] [E201b E205b Y249b D134c H135c Y166c] [H257 E258]ac [H86 E89]abc
67;67 3;11 Correct / Correct
1UOK Oligo-1,6-glucosidase D199 E255 D329
[Y12 Y15 Y39 D60 D64 D98 H103 H161 D169 D199 E255 H283 D285 Y324 H328 D329 R332 R336 H356 Y365 E368 E369 D385 E387 D416 R419 Y464 R471 Y495 R497] [D21 D29]
100;100 6;20 Correct/ Correct
1VAO Vanillyl Alcohol Oxidase Y108 D170 H422 Y503 R504
[Y108 D170 Y187 R312 D317 R398 E410 E464 H466 Y503 R504] [D59 H61 H422 H506] [Y148 D167 H193] [H467 C470] [H313 Y440] [Y276 Y358]
100;100 4;15 Correct / Correct
1WGI Inorganic pyrophosphatase D117
[E48 K56 E58 Y89 E101 H107 D115 D117 D120 D147 D152 Y192]a [E48 K56 E58 E101 D115 D117 D120 D147 D152 Y192]b [E123 D159 D162]b [H87ab]
100;100 5;15 Correct / Correct
1YTW Protein Tyrosine Phosphatase
E290 D356 H402 C403 R409 T410
[C259 Y261 H270 Y301 H350 H402 C403]
33;67 2;11 Partial Correct / Correct
2CPO Heme Chloroperoxidase H105 E183
[E104 H105 D106 H107 D113 E161 D168 E183]
100;100 3;7 Correct / Correct
2HDH 3-hydroxyacyl-CoA dehydrogenase
S137 H158 E170 N208
[R209a Y214ab E217ab R220ab R224a Y242a H275a] [H158a E170b] [H158b E170a] [H266b H275b]
50;100 3;10 Correct / Correct
2HGS Glutathione Synthase R125 S151 G369 R450
[D24a H107b E214b R221b E224b R236b Y265b R267b Y270b E287b K293b C294b D296b Y432b] [H107a E214a R221a E224a R236a Y265a R267a Y270a E287a K293a C294a D296a Y432a D24b] [R125 D127 E144 K305 K364 E368 Y375 E425 R450 K452]a [R125 D127 E144 K305 K364 E368 Y375 E425 R450]b [H163 D469]ab
50;100 5;21 Correct / Correct
2JCW Superoxide dismutase H63 R143
[H46 H48 H63 H71 H80 D83 H120 D124]ab
50;100 5;16 Correct / Correct
2PFL Formate C-acetyltransferase W333 C418 C419 G734
[D74a H84a R141ab K142ab H144ab R174ab D180ab Y181ab R183a R218ab E221ab E222ab E225ab Y240a Y259a Y262ab K267a E368ab D413ab H498ab Y499ab H501ab D502ab D503ab Y504ab Y506ab E507ab H514ab R520ab Y594ab R595ab] [Y172 R176 R319 Y323 E400 C418 C419 R435 Y490 Y612 H704 R731 Y735]ab [H84 Y240 Y259 K267]b [C159 Y444]b [R316 D330]ab
50;100 6;19 Correct / Correct
2PLC 1-phosphatidylinositol phosphodiesterase
H45 D46 R84 H93 D278
[H45 D46 D82 E128 D204 H236 D278] [Y71 K115]
60;100 3;11 Correct / Correct
2THI Thiamine pyridinylase C113 E241
[Y16 Y50 D64 C113 E171 D175 Y222 Y239 E241 D265 Y270 D272]ab [E37ab D84ab E284ab] [Y323 Y333]b [Y180 R349]ab [H282ab]
100;100 5;16 Correct / Correct
8TLN Metalloproteinase M4 E143 H231
[D138 H142 E143 E166 D170 E177 D185 E190] [K18 D72 Y76 K182]
50;100 4;13 Correct / Correct
Appendix C. The 64 protein test set used in THEMATICS-POOL
PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues
1A05 1,4-Diacid decarboxylating dehydrogenase 1.1.1.85 Y140, K190, D222
1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988
1A4I Methylenetetrahydrofolate Dehydrogenase 1.5.1.5 K56
1A4S Aldehyde dehydrogenase (NAD+) / Betaine-aldehyde dehydrogenase
1.2.1.8 N166, E263, C297
1AFW Acetyl-CoA C-acyltransferase 2.3.1.16 C125, H375, C403, G405
1AKM Ornithine Carbamoyltransferase 2.1.3.3 R106, H133, Q136, D231, C273, R319
1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217, C483
1APX Heme peroxidase 1.11.1.11 R38, H42, N71
1B6B Aralkylamine N-acetyltransferase 2.3.1.87 S97, L111, H122, L124, Y168
1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309
1BRM Aspartate-beta-semialdehyde dehydrogenase 1.2.1.11 C135, Q162, H274
1BRW Pyrimidine-nucleoside phosphorylase 2.4.2.2 H82, R168, S183, K187
1BWD L-arginine:Inosamine-phosphate amidinotransferase
2.1.4.2 D108, R127, D179, H227, D229, H331, C332
1BZY Hypoxanthine-guanine phosphoribosyltransferase
2.4.2.8 E133, D134, D137, K165, R169
1C3J DNA beta-glucosyltransferase 2.4.1.27 E22, D100
1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485
1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147
1D0S Nicotinate-nucleotide-dimethylbenzimidazole phosphoribosyltransferase
2.4.2.21 E317
1D4A NAD(P)H dehydrogenase (quinone) 1.6.99.2 G149, Y155, H161
1D4C Succinate dehydrogenase (Fumarate reductase)
1.3.99.1 H364, R401, H503, R544
1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474
1DLI UDP-glucose 6-dehydrogenase 1.1.1.22 T118, E145, K204, N208, C260, D264
1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278
1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82
1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223
1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289
1FUG Methionine adenosyltransferase 2.5.1.6 H14, K165, R244, K245, K265, K269, D271
1G72 Methanol dehydrogenase 1.1.99.8 D297
1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444
1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495
1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85
1GRC Phosphoribosylglycinamide formyltransferase (GARTFase II)
2.1.2.2 N106, H108, S135, D144
1IVH Isovaleryl-CoA dehydrogenase 1.3.99.10 E254
1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407
1KAS 3-oxoacyl-[acyl-carrier protein] synthase 2.3.1.41 C163, H303, H340, F400
1L9F Monomeric sarcosine oxidase 1.5.3.1 H45, R49, H269, C315
1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259
1LXA UDP-N-acetylglucosamine acyltransferase 2.3.1.129 H125
1MBB UDP-N-acetylmuramate dehydrogenase 1.1.1.158 R159, S229, E325
1MHL Mammalian Myeloperoxidase 1.11.1.7 Q91, H95, R239
1MLA [Acyl-carrier protein] S-malonyltransferase
2.3.1.39 S92, H201, Q250
1MOQ Glucosamine--fructose-6-phosphate aminotransferase (isomerising domain)
2.6.1.16 E481, K485, E488, H504, K603
1MPY Extradiol Catecholic Dioxygenase 1.13.11.2 H199, H246, Y255
1NID Nitrite Reductase 1.7.99.3 D98, H255
1NSP Nucleoside-diphosphate kinase 2.7.4.6 K16, N119, H122
1OFG Glucose-fructose oxidoreductase 1.1.99.28 K129, Y217
1PFK Phosphofructokinase 2.7.1.11 G11, R72, T125, D127, R171
1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269
1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363
1PUD Queuine tRNA-ribosyltransferase (tRNA-guanine transglycosylase)
2.4.2.29 D102
1R51 Urate Oxidase 1.7.3.3 R176, Q228
1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94
1UAE UDP-N-acetylglucosamine 1-carboxyvinyltransferase
2.5.1.7 N23, C115, D305, R397
1ULA Purine-nucleoside phosphorylase (type 1) 2.4.2.1 H86, E89, N243
1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504
1VNC Chloride peroxidase 1.11.1.10 K353, H404
1XVA Glycine N-methyltransferase 2.1.1.20 E15
1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171
2ALR Mammalian Aldehyde Reductase 1.1.1.2 Y49, K79
2BBK Methylamine dehydrogenase 1.4.99.3 D32, W57, D76, W108, Y119, T122
2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183
2HDH 3-hydroxyacyl-CoA dehydrogenase 1.1.1.35 S137, H158, E170, N208
2JCW Superoxide dismutase 1.15.1.1 H63, R143
3PCA Protocatechuate dioxygenase 1.13.11.3 Y447, R457
Appendix D. The 160 protein test set used in THEMATICS-POOL
PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues
12AS Aspartate--ammonia ligase 6.3.1.1 D46, R100, Q116
13PK Phosphoglycerate kinase 2.7.2.3 R39, K219, G376, G399
1A05 1,4-Diacid decarboxylating dehydrogenase
1.1.1.85 Y140, K190, D222
1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988
1A4I Methylenetetrahydrofolate Dehydrogenase
1.5.1.5 K56
1A4S Aldehyde dehydrogenase (NAD+) / Betaine-aldehyde dehydrogenase
1.2.1.8 N166, E263, C297
1AE7 Phospholipase A2 (PLA2) 3.1.1.4 G30, H48, D99
1AFW Acetyl-CoA C-acyltransferase
2.3.1.16 C125, H375, C403, G405
1AH7 Phospholipase C 3.1.4.3 D55
1AKM Ornithine Carbamoyltransferase
2.1.3.3 R106, H133, Q136, D231, C273, R319
1ALK Alkaline Phosphatase 3.1.3.1 S102, R166
1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217, C483
1APX Heme peroxidase 1.11.1.11 R38, H42, N71
1APY Aspartylglucosylaminidase 3.5.1.26 T183, T201, T234, G235
1AQ2 Phosphoenol pyruvate carboxykinase
4.1.1.49 H232, K254, R333
1AW8 Aspartate 1-decarboxylase 4.1.1.11 Y58
1B3R Adenosylhomocysteinase 3.3.1.1 D130, K185, D189, N190, C194
1B57 Fructose-bisphosphate aldolase (class II)
4.1.2.13 D109, E182, N286
1B66 6-pyruvoyl tetrahydropterin synthase
4.6.1.10 C42, D88, H89, E133
1B6B Aralkylamine N-acetyltransferase
2.3.1.87 S97, L111, H122, L124, Y168
1B73 Glutamate racemase 5.1.1.3 D7, S8, C70, E147, C178, H180
1B93 Methylglyoxyl synthase 4.2.99.11 H19, G66, D71, D91, H98, D101
1BCR Carboxypeptidase D 3.4.16.6 G53, S146, Y147, D338, H397
1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309
1BJP 4-oxalocrotonate tautomerase
5.3.2.0 P1, R39, F50
1BML Plasmin/streptokinase 3.4.21.7 H603, S608, D646
1BOL Ribonuclease T2 3.1.27.1 H46, E105, H109
1BRM Aspartate-beta-semialdehyde dehydrogenase
1.2.1.11 C135, Q162, H274
1BRW Pyrimidine-nucleoside phosphorylase
2.4.2.2 H82, R168, S183, K187
1BS4 Peptide Deformylase 3.5.1.31 G45, Q50, L91, E133
1BTL Beta-Lactamase Class A 3.5.2.6 S70, K73, S130, E166
1BWD L-arginine:Inosamine-phosphate amidinotransferase
2.1.4.2 D108, R127, D179, H227, D229, H331, C332
1BWP 2-acetyl-1-alkylglycerophosphocholine esterase
3.1.1.47 S47, G74, N104, D192, H195
1BZY Hypoxanthine-guanine phosphoribosyltransferase
2.4.2.8 E133, D134, D137, K165, R169
1C3C Adenylosuccinate lyase 4.3.2.2 H68, H141, E275
1C3J DNA beta-glucosyltransferase
2.4.1.27 E22, D100
1CB8 Chondroitin AC lyase 4.2.2.5 H225, Y234, R288
1CD5 Glucosamine-6-phosphate isomerase
5.3.1.10 D72, D141, H143, E148
1CHD Protein-glutamate methylesterase
3.1.1.61 S164, T165, H190, M283, D286
1CHK Chitosanase 3.2.1.132 E22, D40
1CHM Creatinase 3.5.3.3 H232, E262, E358
1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485
1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147
1CTT Cytidine Deaminase 3.5.4.5 E104
1D0S Nicotinate-nucleotide-dimethylbenzimidazole phosphoribosyltransferase
2.4.2.21 E317
1D4A NAD(P)H dehydrogenase (quinone)
1.6.99.2 G149, Y155, H161
1D4C Succinate dehydrogenase (Fumarate reductase)
1.3.99.1 H364, R401, H503, R544
1D8C Malate synthase 4.1.3.2 D270, E272, R338, D631
1D8H Polynucleotide 5'-phosphatase
3.1.3.33 R393, E433, K456, R458
1DAA Aminotransferase class-IV 2.6.1.21 K145, E177, L201
1DAE Dethiobiotin synthase 6.3.3.3 T11, K15, K37, S41
1DB3 GDP-mannose 4,6-dehydratase
4.2.1.47 T132, E134, Y156, K160
1DBT Orotidine-5'-monophosphate decarboxylase
4.1.1.23 D60, K62
1DCO 4a-hydroxytetrahydrobiopterin dehydratase
4.2.1.96 H62, H63, H80, D89
1DGS NAD+ dependent DNA ligase
6.5.1.2 K116, D118, R196, K312
1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474
1DIZ DNA-3-methyl adenine glycosylase II
3.2.2.21 Y222, W272, D238
1DL2 Mannosyl-oligosaccharide 1,2-alpha-mannosidase
3.2.1.113 E132, R136, D275, E435
1DLI UDP-glucose 6-dehydrogenase
1.1.1.22 T118, E145, K204, N208, C260, D264
1DNK Deoxyribonuclease I 3.1.21.1 E78, H134, D212, H252
1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278
1DQS 3-dehydroquinate synthase 4.6.1.3 H275
1DZR dTDP-4-dehydrorhamnose 3,5-epimerase
5.1.3.13 H63, D170
1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82
1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223
1EF8 Methylmalonyl-CoA decarboxylase
4.1.1.41 H66, G110, Y140
1EUG Uridine Nucleosidase (Uracil DNA glycosylase)
3.2.2.3 D64, H187
1EYI Fructose-1,6-bisphosphatase
3.1.3.11 D68, D74, E98
1FGH Aconitase 4.2.1.3 D100, H101, H147, D165, H167, E262, H642
1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289
1FRO Lactoylglutathione lyase 4.4.1.5 E172
1FUA L-fuculose-phosphate aldolase
4.1.2.17 E73
1FUG Methionine adenosyltransferase
2.5.1.6 H14, K165, R244, K245, K265, K269, D271
1FUI Arabinose isomerase 5.3.1.3 E337, D361
1G72 Methanol dehydrogenase 1.1.99.8 D297
1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444
1GIM Adenylosuccinate synthetase
6.3.4.4 D13, H41, Q224
1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495
1GPM GMP synthase 6.3.5.2 G59, C86, Y87, H181, E183, D239
1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85
1GRC Phosphoribosylglycinamide formyltransferase (GARTFase II)
2.1.2.2 N106, H108, S135, D144
1GTP GTP Cyclohydrolase 3.5.4.16 H112, H179
1HFS Stromelysin-1 (hydrolase) 3.4.24.17 E202, M219
1HXQ UDP-glucose--hexose-1-phosphate uridylyltransferase
2.7.7.12 C160, H164, H166, Q168
1I7D DNA topoisomerase III 5.99.1.2 E7, K8, F328, R330
1IVH Isovaleryl-CoA dehydrogenase
1.3.99.10 E254
1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407
1KAS 3-oxoacyl-[acyl-carrier protein] synthase
2.3.1.41 C163, H303, H340, F400
1KFU m-Calpain Form II 3.4.22.17 Q99, C105, H262, N286
1KRA Urease 3.5.1.5 H219, D221, H320, R336
1L9F Monomeric sarcosine oxidase
1.5.3.1 H45, R49, H269, C315
1LBA T7 Lysozyme 3.5.1.28 Y48, K128
1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259
1LXA UDP-N-acetylglucosamine acyltransferase
2.3.1.129 H125
1MAS Purine nucleosidase 3.2.2.1 D14, N168, H241
1MBB UDP-N-acetylmuramate dehydrogenase
1.1.1.158 R159, S229, E325
1MHL Mammalian Myeloperoxidase
1.11.1.7 Q91, H95, R239
1MHY Methane Monooxygenase 1.14.13.25 C151, T213
1MKA 3-hydroxydecanoyl-[acyl-carrier protein] dehydratase
4.2.1.60 H70, V76, G79, C80, D84
1MLA [Acyl-carrier protein] S-malonyltransferase
2.3.1.39 S92, H201, Q250
1MOQ Glucosamine--fructose-6-phosphate aminotransferase (isomerising domain)
2.6.1.16 E481, K485, E488, H504, K603
1MPP Mucoropepsin 3.4.23.23 D32, S35, Y75, D215
1MPY Extradiol Catecholic Dioxygenase
1.13.11.2 H199, H246, Y255
1NBA Carbamoylsarcosine Amidohydrolase
3.5.1.59 D51, K144, A172, T173, C177
1NID Nitrite Reductase 1.7.99.3 D98, H255
1NSP Nucleoside-diphosphate kinase
2.7.4.6 K16, N119, H122
1NZY Chlorobenzoate Dehalogenase
3.8.1.6 F64, H90, G114, W137, D145
1OFG Glucose-fructose oxidoreductase
1.1.99.28 K129, Y217
1PFK Phosphofructokinase 2.7.1.11 G11, R72, T125, D127, R171
1PGS Peptide Aspartylglucosaminidase
3.5.1.52 D60, E206
1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269
1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363
1PS1 Pentalenene Synthase 4.6.1.5 F77, R157, R173, N219, K226, R230, S305, H309
1PUD Queuine tRNA-ribosyltransferase (tRNA-guanine transglycosylase)
2.4.2.29 D102
1PYA Histidine decarboxylase 4.1.1.22 Y62, S81, F195, E197
1PYM Phosphoenolpyruvate mutase
5.4.2.9 G47, L48, D58, K120
1QFE 3-dehydroquinate dehydratase
4.2.1.10 E86, H143, K170
1QPR Quinolinate phosphoribosyltransferase (decarboxylating) (Type II)
2.4.2.19 R105, K140, E201, D222
1QQ5 2-haloacid dehalogenase 3.8.1.2 D8, T12, R39, N115, K147, S171, N173, F175, D176
1QUM Deoxyribonuclease IV 3.1.21.2 E261
1R51 Urate Oxidase 1.7.3.3 R176, Q228
1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94
1RBL Ribulose bisphosphate carboxylase
4.1.1.39 K175, K177, K201, D203, H294, H327
1REQ Methylmalonyl-CoA mutase
5.4.99.2 Y89, H244, K604, D608, H610
1RPT High molecular weight Acid Phosphatase
3.1.3.2 R11, H12, R15, R79, H257, D258
1SMN Serratia marcescens nuclease
3.1.30.2 R87, H89, N119
1TYF CLP Protease (clpP) 3.4.21.92 G68, S97, M98, H122, D171
1UAE UDP-N-acetylglucosamine 1-carboxyvinyltransferase
2.5.1.7 N23, C115, D305, R397
1UAG UDP-N-acetylmuramoylalanine--D-glutamate ligase
6.3.2.9 K115, N138, H183
1ULA Purine-nucleoside phosphorylase (type 1)
2.4.2.1 H86, E89, N243
1UOK Oligo-1,6-glucosidase 3.2.1.10 D199, E255, D329
1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504
1VNC Chloride peroxidase 1.11.1.10 K353, H404
1WGI Inorganic pyrophosphatase 3.6.1.1 D117
1XVA Glycine N-methyltransferase
2.1.1.20 E15
1YTW Protein Tyrosine Phosphatase
3.1.3.48 E290, D356, H402, C403, R409, T410
1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171
2ACY Acylphosphatase 3.6.1.7 R23, N41
2ADM ADENINE-N6-DNA-METHYLTRANSFERASE
2.1.1.72 N105, P106, Y108
2ALR Mammalian Aldehyde Reductase
1.1.1.2 Y49, K79
2BBK Methylamine dehydrogenase
1.4.99.3 D32, W57, D76, W108, Y119, T122
2BMI METALLO-BETA-LACTAMASE
3.5.2.6 D86, N176
2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183
2HDH 3-hydroxyacyl-CoA dehydrogenase
1.1.1.35 S137, H158, E170, N208
2HGS Glutathione Synthase 6.3.2.3 R125, S151, G369, R450
2JCW Superoxide dismutase 1.15.1.1 H63, R143
2PDA Pyruvate synthase 1.2.7.1 E64
2PFL Formate C-acetyltransferase 2.3.1.54 W333, C418, C419, G734
2PHK Protein Serine/threonine kinase
2.7.1.38 D149, K151
2PLC 1-phosphatidylinositol phosphodiesterase
3.1.4.10 H45, D46, R84, H93, D278
2THI Thiamine pyridinylase 2.5.1.2 C113, E241
3CSM Chorismate Mutase 5.4.99.5 R16, R157, K168, E246
3ECA Asparaginase/Glutaminase 3.5.1.1 T12, Y25, T89, D90, K162
3PCA Protocatechuate dioxygenase
1.13.11.3 Y447, R457
4KBP Purple Acid Phosphatase 3.1.3.2 H202, H295, H296
5COX Prostaglandin-Endoperoxide Synthase
1.14.99.1 Q203, H207, Y385
5ENL Enolase 4.2.1.11 E168, E211, K345, H373
5FIT diadenosine P1, P3-triphosphate (ApppA) hydrolase
3.6.1.29 Q83, H94, H96
8TLN Metalloproteinase M4 3.4.24.27 E143, H231
9PAP Thiol-Endopeptidase 3.4.22.2 Q19, C25, H159, N175
161
Bibliography
1. Schmid, M. B., Structural Proteomics: The potential of high throughput structure determination. Trends Microbiol 2002, 10 (Suppl.), S27-S31. 2. Stultz, C. M.; White, J. V.; Smith, T. F., Structural Analysis Based on State-space Modeling. Protein Sci 1993, 2, 305-314. 3. Combet, C.; Jambon, M.; Deleage, G.; Geourjon, C., Geno3D: automatic comparative molecular modeling of protein. Bioinformatics 2002, 18, (1), 213-214. 4. Venclovas, C., Comparative Modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins Suppl 2001, 5, 47-54. 5. Lambert, C.; Leonard, N.; De Bolle, X.; Depiereux, E., ESyPred3D: Prediction of proteins 3D structures. Bioinformatics 2002, 18, (9), 1250-1256. 6. Terwilliger, T. C., Waldo, G., Peat, TS, Newman, JM, Chu, K., Berendzen, J., Class-directed structure determination: Foundation for a protein structure initiative. Protein Sci 1998, 7, 1851-1856. 7. Madern, D.; Pfister, C.; Zaccai, G., Mutation at a Single Acidic Amino Acid Enhances the Halophilic Behaviour of Malate Dehydrogenase from Haloarcula Marismortui in Physiological Salts. European Journal of Biochemistry 1995, 230, 1088. 8. Looger, L. L., M.A. Dwyer, J.J. Smith, and H.W. Hellinga, Computational design of receptor and sensor proteins with novel functions. Nature 2003, 423, 185-190. 9. Oshiro, C. M., I.D. Kuntz and R.M.A. Knegtel, Molecular Docking and Structure-based Design. In Encyclopedia of Computational Chemistry, Schleyer, P. v. R., Ed. Wiley: Chichester, West Sussex, U.K, 1998; pp 1606-1613. 10. Lichtarge, O., Sowa, M.E., A. Philippi, Evolutionary traces of functional surfaces along G protein signaling pathway. Methods in Enzymology 2002, 344, 536-556. 11. Lima, C. D., M.G. Klein, and W.A. Hendrickson, Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science 1997, 278, 286-290. 12. Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. 
O., Eisenberg, D., A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402, (6757), 83-6. 13. Marcotte, E. M., Computational genetics: finding protein function by nonhomology methods. Curr Opin Struct Biol 2000, 10, (3), 359-65. 14. Mattos, C.; Ringe, D., Locating and characterizing binding sites on proteins. Nat Biotechnol 1996, 14, (5), 595-9. 15. Mlinsek, G., Novic, M., Hodoscek, M., Solmajer, T., Prediction of enzyme binding: human thrombin inhibition study by quantum chemical and artificial intelligence methods based on x-ray structures. J Chem Inf Comput Sci 2001, 41, (5), 1286-94. 16. Ondrechen, M. J., J.G. Clifton and D. Ringe, THEMATICS: A simple computational predictor of enzyme function from structure. Proc. Natl. Acad. Sci. (USA) 2001, 98, 12473-12478. 17. Elcock, A. H., Prediction of functionally important residues based solely on the computed energetics of protein structure. J Mol Biol 2001, 312, 885-896.
162
18. Gutteridge, A., G. Bartlett, and J.M. Thornton, Using a neural network and spatial clustering to predict the location of active sites in enzymes. Journal of Molecular Biology 2003, 330, 719-734. 19. Innis, C. A., A.P. Anand, and R. Sowdhamini, Prediction of functional sites in proteins using conserved functional group analysis. Journal of Molecular Biology 2004, 337, 1053-1068. 20. Laurie, A. T. R., and R.M. Jackson, Q-SiteFinder: An energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21, 1908-1916. 21. Ota, M.; K. Kinoshita; Nishikawa, K., Prediction of catalytic residues in enzymes based on known tertiary structure, stability profile, and sequence conservation. Journal of Molecular Biology 2003, 327, 1053-1064. 22. Ondrechen, M. J., THEMATICS as a tool for functional genomics. Genome Informatics 2002, 13, 563-564. 23. Ondrechen, M. J., Identification of functional sites based on prediction of charged group behavior. In Current Protocols in Bioinformatics, Baxevanis, A. D.; Davison, D. B.; Page, R. D. M.; Petsko, G. A.; Stein, L. D.; Stormo, G. D., Eds. John Wiley & Sons: Hoboken, N.J., 2004; pp 8.6.1 - 8.6.10. 24. Ko, J., L.F. Murga, P. Andre, H. Yang, M.J. Ondrechen, R.J. Williams, A. Agunwamba, and D.E. Budil, Statistical Criteria for the Identification of Protein Active Sites Using Theoretical Microscopic Titration Curves. Proteins: Structure Function Bioinformatics 2005, 59, 183-195. 25. Kaelbling, L.; Littman, M.; Moore, A., Reinforcement Learning: A Survey. J. of Artificial Intelligence Research. 1996, 4, 237-285. 26. Mitchell, T. M., Machine Learning. McGraw-Hill: New York, 1997. 27. Landau M.; Mayrose I.; Rosenberg Y.; Glaser F.; Martz E.; T., P.; N., B.-T., ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. . Nucl. Acids Res. 2005, 33, W299-W302. 28. Pupko, T., R.E. Bell, I. Mayrose, F. Glaser, & N. 
Ben-Tal, Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, 18, S71-S77. 29. Fetrow, J. S., Siew, N., Di Gennaro, J. A., Martinez-Yamout, M., Dyson, H. J., Skolnick, J., Genomic-scale comparison of sequence- and structure-based methods of function prediction: does structure provide additional insight? Protein Sci 2001, 10, (5), 1005-14. 30. Lichtarge, O., H. R. Bourne and F. E. Cohen., An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257, (2), 342-58. 31. Yao, H., D.M. Kristensen, I. Mihalek, M.E. Sowa, C. Shaw, M. Kimmel, L. Kavraki, & O. Lichtarge, An accurate, sensitive, and scalable method to identify functional sites in proteins. J Mol Biol 2003, 326, 255-261. 32. Cheng, G.; Qian, B.; Samudrala, R.; Baker, D., Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acid Research 2005, 33, (18), 5861-5867. 33. Devos, D.; Valencia, A., Practical limits of function prediction. Proteins: Structure, Functing and Genetics 2000, 4, 98-107. 34. Wilson, M. A., C.V. St. Amour, J.L. Collins, D. Ringe and G.A. Petsko, The 1.8 A resolution crystal structure of YDR533Cp from Saccharomyces cerevisiae: A member of the DJ-1/ThiJ/PfpI superfamily. Proc Natl Acad Sci U S A 2004, 101, 1531-1536.
35. Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Pietrokovski, S., Network Analysis of Protein Structures Identifies Functional Residues. Journal of Molecular Biology 2004, 344, (4), 1135-1146.
36. Laskowski, R. A., SURFNET: A program for visualizing molecular surfaces, cavities and intermolecular interactions. J Mol Graph 1995, 13, 323-330.
37. Dundas, J.; Ouyang, Z.; Tseng, J.; Binkowski, A.; Turpaz, Y.; Liang, J., CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucl. Acids Res. 2006, 34, W116-W118.
38. Xie, L.; Bourne, P. E., A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand binding sites. BMC Bioinformatics 2007, 8, s4-s9.
39. Petrova, N.; Wu, C., Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics 2006, 7, (1), 312.
40. Youn, E.; Peters, B.; Radivojac, P.; Mooney, S. D., Evaluation of features for catalytic residue prediction in novel folds. Protein Sci. 2007, 16, 216-226.
41. Duda, R. O.; Hart, P. E.; Stork, D. G., Pattern Classification. Wiley: New York, 2001; p 654.
42. Dasarathy, B. V., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. 1991.
43. Vapnik, V., Statistical Learning Theory. Springer-Verlag: New York, 1998.
44. Schölkopf, B.; Smola, A., Learning with Kernels. MIT Press: Cambridge, MA, 2002.
45. Tong, W.; Williams, R. J.; Wei, Y.; Murga, L. F.; Ko, J.; Ondrechen, M. J., Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines. Protein Sci. 2008, 17, 333-341.
46. Freund, Y.; Schapire, R., A short introduction to boosting. J of Japanese Society for Artificial Intelligence 1999, 14, (5), 771-780.
47. Schapire, R. E., The Strength of Weak Learnability. Machine Learning 1990, 5, (2), 197-227.
48. Matthews, B. W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 1975, 405, 442-451.
49. Yang, A. S., Gunner, M. R., Sampogna, R., Sharp, K., Honig, B., On the calculation of pKas in proteins. Proteins 1993, 15, (3), 252-65.
50. Bashford, D.; Karplus, M., Multiple-site Titration Curves of Proteins: An Analysis of Exact and Approximate Methods for Their Calculation. J. Phys. Chem. 1991, 95, 9556-9561.
51. Warwicker, J.; Watson, H. C., Calculation of the electric potential in the active site cleft due to alpha-helix dipoles. J Mol Biol 1982, 157, (4), 671-9.
52. Antosiewicz, J., Briggs, J.M., Elcock, A.H., Gilson, M.K., and McCammon, J.A., Computing the Ionization States of Proteins with a Detailed Charge Model. J. Comp. Chem. 1996, 17, 1633-1644.
53. Di Cera, E., S.J. Gill, and J. Wyman, Binding Capacity: Cooperativity and buffering in biopolymers. Proc Natl Acad Sci U S A 1988, 85, 449-452.
54. Wei, Y.; Ko, J.; Murga, L.; Ondrechen, M. J., Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8, 119.
55. Yang, T. Statistical applications for structure-based protein function prediction. Northeastern University, Boston, 2007.
56. Joachims, T., Making Large-Scale SVM Learning Practical. Advances in Kernel Methods. MIT Press: Cambridge, MA, 1999.
57. Bartlett, G. J., C.T. Porter, N. Borkakoti, and J.M. Thornton, Analysis of Catalytic Residues in Enzyme Active Sites. J Mol Biol 2002, 324, 105-121.
58. Wei, Y. Computed Electrostatic Properties of Protein 3D Structure for Functional Annotation and Biomedical Application. Northeastern University, Boston, 2007.
59. Patterson, W. R.; Poulos, T., Crystal Structure of Recombinant Pea Cytosolic Ascorbate Peroxidase. Biochemistry 1995, 34, (13), 4331-4341.
60. Edwards, S. L.; Xuong, N. H.; Hamlin, R. C.; Kraut, J., Crystal Structure of Cytochrome c Peroxidase Compound I. Biochemistry 1987, 26, 1503-1511.
61. Gourley, D. G.; Shrive, A. K.; Polikarpov, I.; Krell, T.; Coggins, J. R.; Hawkins, A. R.; Isaacs, N. W.; Sawyer, L., The two types of 3-dehydroquinase have distinct structures but catalyze the same overall reaction. Nat Struct Biol. 1999, 6, 521-525.
62. Sobolev, V., A. Sorokine, J. Prilusky, E.E. Abola, and M. Edelman, Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15, 327-332.
63. Moser, J.; Gerstel, B.; Meyer, J. E.; Chakraborty, T.; Wehland, J.; Heinz, D. W., Crystal structure of the phosphatidylinositol-specific phospholipase C from the human pathogen Listeria monocytogenes. J. Mol. Biol. 1997, 273, 269-282.
64. Bate, P., and J. Warwicker, Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J Mol Biol 2004, 340, 263-276.
65. Perlich, C.; Provost, F.; Simonoff, J., Tree induction vs. logistic regression: a learning-curve analysis. The Journal of Machine Learning Research 2003, 4, 211-255.
66. Domingos, P.; Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning 1997, 29, 103-130.
67. Best, M. J.; Chakravarti, N., Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming 1990, 47, 425-439.
68. Barlow, R. E.; Bartholomew, D. J.; Bremner, J. M.; Brunk, H. D., Statistical Inference under Order Restrictions. Wiley: 1972.
69. Ramsay, J. O., Estimating smooth monotone functions. Journal of the Royal Statistical Society, Series B 1998, 60, 365-375.
70. Bradley, A. P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997, 30, (7), 1145-1159.
71. Wilcoxon, F., Individual comparisons by ranking methods. Biometrics 1945, 1, 80-83.
72. Madura, J. D., J.M. Briggs, R.C. Wade, M.E. Davis, B.A. Luty, A. Ilin, J. Antosiewicz, M.K. Gilson, B. Bagheri, L.R. Scott, & J.A. McCammon, Electrostatics and diffusion of molecules in solution - Simulations with the University of Houston Brownian Dynamics program. Comp Phys Commun 1995, 91, 57-95.
73. Gilson, M. K., Multiple-site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins. Proteins 1993, 15, (3), 266-82.
74. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl Acids Res 2004, 32, D129-133.
75. Edgar, R. C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 2004, 32, 1792-1797.
76. Kinoshita, K.; Ota, M., P-cats: prediction of catalytic residues in proteins from their tertiary structures. Bioinformatics 2005, 21, 3570-3571.