
SVM and a Novel POOL Method Coupled with THEMATICS for

Protein Active Site Prediction

A DISSERTATION

SUBMITTED TO THE COLLEGE OF COMPUTER AND INFORMATION SCIENCE

OF NORTHEASTERN UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

By

Wenxu Tong

April 2008

©Wenxu Tong, 2008

ALL RIGHTS RESERVED


Acknowledgements

I have many people to thank. I am most indebted to my advisor, Dr. Ron Williams. He generously allowed me to work on a problem using an idea he developed decades ago, and guided me through the research that turned it from a mere idea into a useful system for solving important problems. Without his kindness, wisdom and persistence, this dissertation would not exist.

Another person I am fortunate to have met and worked with is Dr. Mary Jo Ondrechen, my co-advisor. It was she who developed the THEMATICS method, on which this dissertation builds. Her guidance and help during my research were critical to me.

I am grateful to all the committee members for reading and commenting on my dissertation. I am especially thankful to Dr. Jay Aslam, who provided a lot of advice for my research, and Dr. Bob Futrelle, who brought me into the field and provided much help in the writing of this dissertation. I also thank Dr. Budil for the time he spent serving on my committee, especially given the difficulties and inconvenience he unfortunately experienced during that time.

I am fortunate to have worked in the THEMATICS group with Dr. Leo Murga, Dr. Ying Wei and my fellow graduate student, soon to be Dr., Heather Brodkin. I also thank Dr. Jun Gong, Dr. Emine Yilmaz and, soon to be Dr., Virgiliu Pavlu; without their generous help, my journey through the tunnel towards my degree would have been much darker and harder.

I would not be who I am without the love and support of my parents, Yunkun Tong and Xiazhu Wang. Thanks to their raising us and giving us the best education possible through all the hardship they endured, both my sister, Wenyi Tong, and I have received Ph.D. degrees, the highest degree one can expect. I am so grateful to them and proud of them.


Last but definitely not least, I thank Ying Yang, my beloved wife. Without her patience and confidence in me, I could not imagine having done what I have done, and I humbly dedicate this dissertation to her.


Table of Contents

Abstract

1. Introduction
1.1 THEMATICS and protein active site prediction
1.2 Machine Learning

2. Background and Related Work
2.1 Protein Active Site Prediction
2.2 Machine Learning
2.2.1 Commonly used supervised learning methods
2.2.2 Probability based approach
2.2.3 Performance measure for classification problems
2.3 THEMATICS
2.3.1 The THEMATICS method and its features
2.3.2 Statistical analysis with THEMATICS
2.3.3 Challenges of the site prediction problem using THEMATICS data

3. Applying SVM to THEMATICS
3.1 Introduction
3.2 THEMATICS curve features used in the SVM
3.3 Training
3.4 Results
3.4.1 Success in site prediction
3.4.2 Success in catalytic residue prediction
3.4.3 Incorporation of non-ionizable residues
3.4.4 Comparison with other methods
3.5 Discussion
3.5.1 Cluster number and size
3.5.2 Failure analysis
3.5.3 Analysis of high filtration ratio cases
3.5.4 Some specific examples
3.6 Conclusions
3.7 Next step

4. New Method: Partial Order Optimal Likelihood (POOL)
4.1 Ways to estimate class probabilities
4.1.1 Simple joint probability table look-up
4.1.2 Naïve Bayes method
4.1.3 The K-nearest-neighbor method
4.1.4 POOL method
4.1.5 Combining CPE's
4.2 POOL method in detail
4.2.1 Maximum likelihood problem with monotonicity assumption
4.2.2 Convex optimization and K.K.T. conditions
4.2.3 Finding Minimum Sum of Squared Error (SSE)
4.2.4 POOL algorithm
4.2.5 Proof that the POOL algorithm finds the minimum SSE
4.2.6 Maximum likelihood vs. minimum SSE
4.3 Additional computational steps
4.3.1 Preprocessing
4.3.2 Interpolation

5. Applying the POOL Method with THEMATICS in Protein Active Site Prediction
5.1 Introduction
5.2 THEMATICS curves and other features used in the POOL method
5.3 Performance measurement
5.4 Computational procedure
5.5 Results
5.5.1 Ionizable residues using only THEMATICS features
5.5.2 Ionizable residues using THEMATICS plus cleft information
5.5.3 All residues using THEMATICS plus cleft information
5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if applicable
5.5.5 Recall-filtration ratio curves
5.5.6 Comparison with other methods
5.5.7 Rank of the first positive
5.6 Discussion

6. Summary and Conclusions
6.1 Contributions
6.2 Future research

Appendices
Appendix A. The training set used in THEMATICS-SVM
Appendix B. The test set used in THEMATICS-SVM
Appendix C. The 64 protein testing set used in THEMATICS-POOL
Appendix D. The 160 protein testing set used in THEMATICS-POOL

Bibliography


List of Figures

Figure 2.1 Titration curves
Figure 3.1 The success rate for site prediction on a per-protein basis
Figure 3.2 Distribution of the 64 proteins across different values for the filtration ratio
Figure 3.3 Recall-false positive rate plot (ROC curves) of SVM versus other methods
Figure 3.4 SVM prediction for protein 1QFE
Figure 3.5 The SVM prediction for 2PLC
Figure 4.1 Three cases of $\vec{G}$ in relation to the convex cone of constraints
Figure 5.1 Averaged ROC curve comparing POOL(T4), Wei’s statistical analysis and Tong’s SVM using THEMATICS features
Figure 5.2 Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only
Figure 5.3 Averaged ROC curve comparing POOL methods applied to ionizable residues only CHAIN(TION, G) and to all residues CHAIN(TALL, G)
Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues
Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160 protein test set
Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method
Figure 5.7 Histogram of the first annotated active site residue


List of Tables

Table 2.1 Confusion matrix of classification labeling
Table 3.1 Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue
Table 3.2 Comparison of THEMATICS-SVM and other methods
Table 5.1 Wilcoxon signed-rank tests between methods shown in Figure 5.2
Table 5.2 Wilcoxon signed-rank tests between methods shown in Figure 5.4
Table 5.3 Comparison of sensitivity, precision, and AUC of CHAIN(T, G, C) with Youn’s reported results for proteins in the same family, superfamily, and fold
Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method
Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method


List of Abbreviations

ANN Artificial Neural Network

ASA Area of Solvent Accessibility

AUC Area Under the Curve

AveS Averaged Specificity

CPE Class Probability Estimator

CSA Catalytic Site Atlas

E.C. number Enzyme Commission number

H-H equation Henderson-Hasselbalch equation

K.K.T. conditions Karush-Kuhn-Tucker conditions

k-NN k-Nearest Neighbor method

MAP Maximum a posteriori

MCC Matthews Correlation Coefficient

MAS Mean Average Specificity

ML Maximum Likelihood

PDB Protein Data Bank

POOL Partial Order Optimal Likelihood

RFR curve Recall-Filtration Ratio curve

ROC curve Receiver Operating Characteristic curve

SSE Sum of Squared Errors

SVM Support Vector Machine

THEMATICS Theoretical Microscopic Titration Curves

VC dimension Vapnik-Chervonenkis dimension


Abstract

Protein active site prediction is an important problem in bioinformatics. THEMATICS is a simple and effective method that uses the special electrostatic properties of ionizable residues to predict such sites from protein three-dimensional structure alone. The process involves distinguishing computed titration curves with perturbed shapes from normal ones; in many cases the differences are subtle. In this dissertation, I develop and apply machine learning techniques to automate the process and achieve higher sensitivity than other methods while maintaining high specificity. I first present the application of support vector machines (SVM) to automate active site prediction using THEMATICS; at the time this work was developed, it achieved better performance than any other 3D-structure-based method. I then present the more recently developed Partial Order Optimal Likelihood (POOL) method, which estimates the probabilities of residues being active under certain natural monotonicity assumptions. The dissertation shows that applying the POOL method to THEMATICS features alone outperforms the SVM results. Furthermore, since the overall approach is based on estimating certain probabilities from labeled training data, it provides a principled way to combine THEMATICS features with non-electrostatic features proposed by others. In particular, I also consider geometric features, and the resulting classifiers are the best structure-only predictors yet found. Finally, I show that adding sequence-based conservation scores, where applicable, yields a method that outperforms all existing methods while using only whatever combination of structure-based or sequence-based features is available.


Chapter 1

Introduction


This dissertation employs both standard and novel machine learning techniques to automate one aspect of the problem of predicting protein function from three-dimensional structure. In addition to applying established techniques, particularly the support vector machine (SVM), I introduce a novel method, called partial order optimal likelihood (POOL), for selecting functionally important sites in protein structures. In my approach to the protein function prediction problem, I start with just the 3D structure of proteins and use THEMATICS, one of the most effective methods, which focuses on the electrostatic features of residues. Later, I also add some geometric features, and then the conservation of residues among homologous sequences, when available, to achieve better results.

1.1 THEMATICS and protein active site prediction.

Function prediction (predicting protein function from protein structure) is an important and challenging task in genomics and proteomics 1. More and more protein structures are being deposited in the Protein Data Bank (PDB), many with unknown function. As of this writing, the PDB holds over 3600 protein structures of unknown or uncertain function. The recent development of high-throughput methods 2-6 for generating structures of proteins expressed from gene sequences makes effective and efficient function prediction even more important, as most of these Structural Genomics proteins are of unknown function.

Determination of active sites, including enzyme catalytic sites, ligand binding sites, recognition epitopes,

and other functionally important sites is one of the keys to protein function prediction.

In addition, the importance of site prediction goes beyond predicting active sites for proteins of unknown function. Even for a protein of known function, the active site is not necessarily fully, or even partially, characterized. Correctly finding the active site of a protein is a prerequisite to understanding the protein’s catalytic mechanism. It also opens the door to the design of ligands that inhibit, activate, or otherwise modify the protein’s function. Protein engineering applications that aim to design proteins with particular functions 7-9 also require knowledge of the features needed to create a functioning active site.


Because of its importance in genomics and proteomics, many different methods have been developed to

predict the active site of a protein 10-21. We will survey some of these methods in a later section. But

among them, there is one particular method, namely THEMATICS (Theoretical Microscopic Titration

Curves), which is powerful, accurate and precise 16, 22-24. Based on protein 3D structure alone, it can

predict correct active sites that are highly localized in small regions of the proteins’ structures.

The details of the THEMATICS method will be given later. Its key point is that it takes advantage of the special chemical and electrostatic properties of active site residues, which tend to have anomalous titration behavior. THEMATICS computes the titration curves of the ionizable residues of a protein. In its original formulation, the presence of two or more residues with perturbed titration curves in physical proximity is considered a reliable predictor of the active site of the protein.
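To make "normal" titration behavior concrete, the sketch below computes the ideal Henderson-Hasselbalch curve for a single, isolated ionizable group. It is only an illustration of the unperturbed reference shape, not the THEMATICS calculation itself; the function name `hh_protonated_fraction` and the pKa value of 6.5 are assumptions chosen for the example.

```python
import numpy as np

def hh_protonated_fraction(ph, pka):
    """Fraction of a single, isolated ionizable group that is protonated.

    Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA]),
    hence [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa)).
    """
    ph = np.asarray(ph, dtype=float)
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# Ideal curve on a pH grid from 0 to 14 in steps of 0.5.
ph_grid = np.linspace(0.0, 14.0, 29)
curve = hh_protonated_fraction(ph_grid, pka=6.5)  # hypothetical pKa
```

An ideal curve of this kind is a smooth sigmoid that falls sharply around the pKa; a perturbed curve departs from that shape.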

For the THEMATICS method to work well, one needs a criterion to distinguish perturbed titration curves from normal, unperturbed ones, which is not a trivial task. My work uses machine learning to automate this process. With the SVM, I treated the problem as classification, predicting each residue as either an active site residue or not. Later, I developed the POOL method, which rank-orders the residues in a protein according to their probability of being in an active site, based on how perturbed their titration curves are together with some other 3D-structure-based information. Later still, sequence conservation information, where available, was added as well.

1.2 Machine Learning.

Machine learning is a well-developed field of computer science. It covers many types of tasks, ranging from reinforcement learning 25 to more passive forms of learning, such as supervised learning and unsupervised learning 26. A typical supervised learning task is to learn a function from a set of training data. If the output of the function is a label from some finite set of classes, the task is called a classification problem. If the output is a continuous value, it is called a regression problem. A training set consists of a set of training examples, i.e. pairs of input and output vectors. The machine-learning problem is typically an optimization problem: generate a function that maps any valid input to an output and generalizes from the seen training data in a “reasonable” way. The quantity to be minimized is typically the generalization error, which is the error a trained machine makes on unseen data drawn from the same distribution as the population. In an unsupervised learning task, there is no labeled training data. Usually the task is to cluster the observed data according to some criterion, or to fit a model that represents the observed data. I focus on supervised learning; my learning task here is essentially a classification task. As will be described below, in one part of this work the goal is to estimate actual class probabilities, which in some respects resembles a regression problem.


Chapter 2

Background and Related Work


2.1 Protein Active Site Prediction.

Since the main focus of this dissertation is in Computer Science, I will just briefly survey some of the

methods used for this application, to serve as a background for the method comparison later in my

dissertation.

There are two major classes of methodology used to predict protein active sites. Almost all current

methods in active site prediction use one of them or a mix of both approaches.

The first methodology is based on sequence comparison, or evolutionary information derived from

sequence alignments. The rationale is that active sites of a protein are important regions in the proteins,

and that the amino acids, termed residues, in active sites therefore should be more conserved throughout

evolution than some other regions of the protein. If we can find highly conserved regions among

sequences in similar proteins from different sources (species/tissues), or even in different proteins but

with similar functions, most likely active sites should consist of subsets within these regions. This is a

valid assumption and indeed, many methods have been developed based on this approach, such as

ConSurf 27, Rate4Site 28, and others 29-32.

However there are two drawbacks to this approach.

First, this method needs at least 10, and preferably 50, different protein sequences with a suitable degree of similarity in order to give reliable results. It does not work well if the similarities between sequences are either too high or too low. Studies have shown that sequence-based methods can reliably transfer the extracted functional information only between sequences with at least 40% sequence identity 33, 34. This drawback makes the method unsuitable for many proteins, particularly Structural Genomics proteins, since they often do not have enough similar sequences within a suitable range of similarity.


Second, although most active site residues tend to be conserved through evolution, it is certainly not true

that all conserved regions of a protein are active sites. Residues in protein sequences can be conserved

for a variety of reasons, not just because of involvement in active sites. One well-known counterexample

is the set of residues that stabilize the structure of the protein; they are so important to the protein that

once mutated, the protein will not have the proper structure to perform its function. These residues will

be conserved among different protein homologues, even if they are not active sites. Therefore typically

sites predicted from sequence based methods are non-local, spanning a much larger area than the true

active site. Another difficulty arises for cases where an active site region in a protein is less conserved

than other regions of the protein, especially when the function and/or substrate of the proteins in the class

are somewhat versatile.

The second methodology is structure-based active site prediction. Different methods have studied and used different properties, such as electrostatic properties as in THEMATICS 16, residue interactions as in the graph-theoretic method SARIG 35, van der Waals binding energy of a probe molecule as in Q-SiteFinder 20, geometric cleft location as in SURFNET 36 and CASTp 37, and a geometric shape descriptor termed geometric potential 38.

There are also studies that combine results from different methods, employing either statistical or machine-learning techniques. Among them, I list a few examples that use residue properties or machine learning methods similar to those in my earlier and current studies. P-cats 21 uses a k-nearest-neighbor method to smooth the joint probability lookup table; a study by Gutteridge uses a neural network and spatial clustering to predict the location of active sites 18; Petrova’s work uses a support vector machine (SVM) to predict catalytic residues 39; and Youn’s work also uses a support vector machine to predict catalytic residues in proteins 40. All these methods use both sequence conservation and 3D structural information.

Depending on the properties on which these methods are based, their computational cost and accuracy vary.


Among all these methods, THEMATICS is the most accurate to date. The computational cost is

acceptable. To analyze a typical protein using THEMATICS takes less than an hour on a desktop PC,

although actual CPU times depend on protein size. The details of this method will be explained in a later

section.

Although THEMATICS is the most effective and accurate method among these when used on its own, it

is natural to consider whether the predictions can be improved by using additional information. I examine

this using both geometric and conservation information, and find that this is indeed the case.

2.2 Machine Learning.

Machine learning is a very broad subfield of artificial intelligence. It is almost impossible to survey the

whole area in this dissertation. Here, I focus on just supervised learning, mostly classification. Even this

area is still too broad, and I will briefly introduce the framework and some of the most commonly used

methods with their basic principles.

2.2.1 Commonly used supervised learning methods.

The first method is the artificial neural network (ANN), or simply neural network (NN) 26, 41. It is based on a computational model of a group of interconnected nodes (neurons), akin to the nervous system in humans. Each neuron has a certain number of inputs and typically one output. The inputs to a neuron can be either features of the input data or the outputs of other neurons, and the output of one neuron can serve as input to multiple neurons. Typically, a weight is associated with each input of each neuron. At processing time (when classifying a query instance), each neuron computes the weighted sum of its inputs using the associated weights and generates its output by applying some nonlinear function f to that sum. ANNs come in different structural flavors, such as feedforward versus recurrent networks. At training time, a cost function is defined to measure how much error the current ANN makes on the training data. The learning process finds the setting of the structure and/or weights of the ANN that minimizes the cost function on the training data. ANN is in general a very powerful method, and it has been used in numerous applications, including protein active site prediction 18. One drawback is that ANN is something of a black-box method: although one may find a very good classifier, the structure of the network and the weights associated with each input may not reveal much about why it works.
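The computation performed by a single neuron, and by a small feedforward network built from such neurons, can be sketched as follows. This is a minimal illustration rather than any particular published network: the logistic sigmoid chosen for f, the bias term, and the names `neuron_output` and `feedforward` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, used here as the nonlinear function f (an assumption)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias=0.0, f=sigmoid):
    """Weighted sum of the inputs, passed through the nonlinearity f.

    The bias term is a common addition that the text does not mention.
    """
    return f(np.dot(weights, inputs) + bias)

def feedforward(x, layers):
    """Feedforward pass: each layer is a (weight matrix, bias vector) pair,
    and the outputs of one layer are the inputs of the next."""
    a = np.asarray(x, dtype=float)
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a
```

Training would then adjust the entries of each weight matrix to reduce a cost function on the training data; that step is omitted here.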

Another popular and intuitively appealing method is the nearest neighbor method, or its more general

form, the k-nearest-neighbor method (k-NN) 42. The principal idea is to classify the query instance based

on its k nearest neighbors among the training set, which represent the most “similar” training instances.

The success of this method relies on several choices; k and the distance function that defines what

“similar” means are among the most important ones. The training or learning process of this method is

somewhat different from most other machine learning techniques. In most cases, instead of solving an

optimization problem, it uses cross validation directly to select the best k and best distance function.

However, this method and the naïve Bayes method, which will be discussed later, are both susceptible to

the presence of correlated features. The method is also susceptible to the presence of noisy and irrelevant

features.
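A minimal sketch of k-NN classification is given below, assuming Euclidean distance and a simple majority vote; both choices, and the function name `knn_predict`, are assumptions for illustration, since the text leaves k and the distance function open.

```python
from collections import Counter
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances."""
    train_x = np.asarray(train_x, dtype=float)
    # Euclidean distance from the query to every training instance.
    dists = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups in the plane.
toy_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = [0, 0, 0, 1, 1, 1]
label = knn_predict(toy_x, toy_y, [0.2, 0.1], k=3)
```

In practice k and the distance function would be chosen by cross validation, as described above.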

The support vector machine (SVM) 43, 44 is a relatively recent method. In one sentence, it uses the “kernel trick” to find the best linear separator (hyperplane) in kernel space for instances that are not linearly separable in feature space. SVM has two major advantages. First, unlike ANN and some other classification techniques, the linear separator that SVM finds is not merely one that separates the two classes (in the hard-margin case) or makes the “fewest” errors (in the soft-margin case), but the best among all such separators. It is the “best” in the sense that it maximizes the margin, which measures how “far” the separator can move without making more mistakes on the training data, and there are theoretical results bounding generalization error in terms of the margin, so a maximum-margin classifier tends to make few errors on test data. Second, the “kernel trick” implicitly maps instances in the original feature space to instances in the kernel space, where in certain cases instances that are linearly inseparable in feature space become linearly separable; the kernel is easy to compute, and the explicit form of the mapping does not need to be known. To use this method successfully, one needs to select the right kernel. There are commonly used kernels, but choosing or developing a kernel that takes full advantage of the method is not a trivial task. I have applied this method to protein active site prediction with success 45.

Last, I will mention boosting, another method developed quite recently which has achieved a lot of

success 46, 47. Boosting is a meta-learning algorithm. Boosting occurs in stages, by incrementally adding

to the current learned function. At every stage, a weak learner (i.e., one that has accuracy greater than

chance) is trained with the data. The output of the weak learner is then added to the learned function, with

some strength (proportional to how accurate the weak learner is). Then, the data is re-weighted: examples

that the current learned function gets wrong are "boosted" in importance, so that future weak learners will

attempt to fix the errors. If every weak learner is guaranteed to perform better than random guessing,
boosting can drive the error of the learned function on the training data below any preset threshold
very quickly. It is a very powerful method to combine different learners into one “super system”.

There is also a well-developed mathematical theory showing that boosting can lower generalization error.
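
The staged procedure described above can be sketched as follows. This is a minimal illustration in the style of AdaBoost with one-dimensional decision stumps as weak learners; the choice of AdaBoost, the stump learner, and the toy data are assumptions made for the example, not a description of any system used in this work.

```python
import math

def train_stump(xs, ys, w):
    """Return (weighted error, threshold, sign) of the best decision stump."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds):
    """Build a weighted ensemble of stumps, re-weighting the data each round."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)   # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # strength of this learner
        ensemble.append((alpha, thr, sign))
        # Misclassified examples gain weight; correct ones lose weight.
        w = [wi * math.exp(-alpha * y * (sign if x >= thr else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data (assumed): no single stump separates these labels.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(xs, ys, rounds=3)
```

On this toy set a single stump necessarily misclassifies some points, while the three-stump ensemble fits the training labels.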

2.2.2 Probability based approach.

A lot of machine learning work overlaps with statistics, where probability is used to classify instances

directly. The first idea introduced here is Bayesian inference, which is based on Bayes’ theorem 26. Bayes’

theorem may be written as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

The probability of an event A conditional on another event B is generally different from the probability of

B conditional on A. However, there is a definite relationship between the two, and Bayes' theorem is the
statement of that relationship. P(A) and P(B) are called prior probabilities, and P(A|B) is called the
conditional probability of A given B.

Bayes’ theorem is important to a number of applications, and it appears in several places in the present
problem. One way to formulate our problem is to generate the hypothesis H that gives the probability that
a residue with certain features is in the active site, based on the observed data D (the training
examples), i.e., to consider P(H|D). Take H here to be a look-up table giving the probability of the
positive class for each feature value x. To use such an H at query time, one simply goes to the entry
that matches x and reads off the corresponding probability. Usually the number of ways to fill out the
look-up table H is infinite, so how should one choose? One simple answer is to choose the H that gives
the largest P(D|H); this is called the maximum likelihood (ML) hypothesis. Taking the prior probability
P(H) into consideration, one could instead pick the H that gives the largest P(D|H)P(H); this is the
maximum a posteriori (MAP) hypothesis. Notice that ML is equivalent to MAP with a “flat prior”, in which
the prior probabilities of all hypotheses under consideration are the same. The POOL method I introduce
below for active site prediction is MAP in which all hypotheses satisfying the monotonicity constraints
share a flat prior, while all other hypotheses have prior probability 0. Both the ML and MAP methods pick
the “most favorable” hypothesis H* out of all possible ones based on the data, and use it to predict the
“most probable class” of new query data: given query data q, the probability that q is in class c is
taken to be P(C=c|q, H*), and the objective is to find the particular c that gives the largest
P(C=c|q, H*). There is another way to do prediction at query time, namely to consider the predictions
from all possible hypotheses and sum them, weighted by the probability of each hypothesis:
P(C=c|q, D) = ∑ P(C=c|q, Hi, D) * P(Hi|D). This is called the Bayes classifier, which gives an optimal
result but is often difficult to compute in practice.
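
The difference between the ML hypothesis, the MAP hypothesis, and the Bayes classifier can be made concrete with a small sketch. It assumes a deliberately tiny hypothesis space in which each H is a single-entry look-up table, namely one value for the probability of the positive class; the three candidate values, the prior, and the data counts are all hypothetical.

```python
def posterior(hypotheses, priors, n_pos, n_neg):
    """Return (likelihoods P(D|H), posteriors P(H|D)) for each hypothesis."""
    likelihoods = [p ** n_pos * (1 - p) ** n_neg for p in hypotheses]
    joint = [lk * pr for lk, pr in zip(likelihoods, priors)]
    evidence = sum(joint)                      # P(D)
    return likelihoods, [j / evidence for j in joint]

hypotheses = [0.2, 0.5, 0.8]   # each H is one candidate value of P(positive)
priors = [0.1, 0.8, 0.1]       # assumed prior P(H), favoring H = 0.5
likelihoods, post = posterior(hypotheses, priors, n_pos=3, n_neg=1)

indices = range(len(hypotheses))
h_ml = hypotheses[max(indices, key=lambda i: likelihoods[i])]   # largest P(D|H)
h_map = hypotheses[max(indices, key=lambda i: post[i])]         # largest P(D|H)P(H)
# Bayes classifier: average the predictions of all hypotheses, weighted by P(H|D).
p_bayes = sum(h * w for h, w in zip(hypotheses, post))
```

With these assumed numbers the ML and MAP hypotheses differ, and the Bayes classifier returns a probability between the two; with a flat prior the MAP choice coincides with the ML choice.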

2.2.3 Performance measure for classification problems.

In the context of binary classification tasks, the terms true positives, true negatives, false positives
and false negatives are used to compare the given classification of an item (the class label assigned to
the item by a classifier) with the desired correct classification (the class the item actually belongs
to). This is illustrated by the confusion matrix below:

                              Predicted classification
                              Positive               Negative
Actual          Positive      True Positive (TP)     False Negative (FN)
classification  Negative      False Positive (FP)    True Negative (TN)

Table 2.1. Confusion matrix of classification labeling.

In the confusion matrix above, TP, FP, FN and TN are the numbers of true positive, false positive, false
negative and true negative instances, respectively. Although all the information about the classifier’s
performance is included in the confusion matrix, it is common to compare classifiers using other
measurements derived from it. We list some commonly used ones:

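$$\text{recall} = \text{sensitivity} = \frac{TP}{TP + FN}$$

$$\text{specificity} = \frac{TN}{TN + FP}$$

$$\text{precision} = \text{positive predictive value} = \frac{TP}{TP + FP}$$

$$\text{negative predictive value} = \frac{TN}{TN + FN}$$

$$\text{false positive rate} = 1 - \text{specificity} = \frac{FP}{FP + TN}$$

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{error} = 1 - \text{accuracy} = \frac{FP + FN}{TP + TN + FP + FN}$$

$$\text{filtration ratio} = \frac{TP + FP}{TP + TN + FP + FN}$$

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$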

Most of the measurements above are straightforward and can be treated as mere definitions with no need
for further explanation, with the exception of the filtration ratio and the Matthews correlation
coefficient (MCC). I invented the term filtration ratio, to be used in place of precision and false
positive rate in the present problem. One advantage of the filtration ratio is that it is the only
measurement listed above that can be determined from the predicted classification alone, without knowing
the actual classification, which is always unknown in practice. Another reason I invented and use the
filtration ratio is that, in the present problem, the information available in the literature about
actual positives is incomplete, even in our training and testing datasets. Thus a certain fraction of the
nominal false positives probably are not false. In situations like ours, where one expects true positives
to represent a fairly small proportion of all the instances and the measured false positive rate is
suspect, the filtration ratio is more appropriate than some other measurements, such as precision. Both
of these issues will be discussed in more detail later in the dissertation. On the other hand, MCC is
widely used in

machine learning as a measure of the quality of binary classifications. It takes into account both

sensitivity and selectivity and is generally regarded as a balanced measure which can be used even if the

classes are of very different sizes. It returns a value between -1 and +1. A coefficient of +1 represents a

perfect prediction, 0 an average random prediction and -1 the worst possible prediction. While there is no

perfect way of describing the confusion matrix of true and false positives and negatives by a single

number, the MCC is generally regarded as being one of the best such measures. In the denominator of
MCC, if any of the four sums is zero, the denominator can be arbitrarily set to one; this results in an
MCC of zero, which can be shown to be the correct limiting value 48.
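
The measures defined in this section can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name, the use of NaN for undefined ratios, and the example counts are illustrative assumptions, while the zero-denominator convention for MCC follows the text above.

```python
import math

def ratio(numerator, denominator):
    """Plain division that returns NaN when the ratio is undefined."""
    return numerator / denominator if denominator else math.nan

def classification_metrics(tp, fp, fn, tn):
    """Derive the measures of Section 2.2.3 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    product = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    # If any of the four sums is zero, set the MCC denominator to one.
    mcc_denominator = math.sqrt(product) if product > 0 else 1.0
    return {
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, total),
        "error": ratio(fp + fn, total),
        "filtration_ratio": ratio(tp + fp, total),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
```

Note that the filtration ratio uses only the predicted positives (TP + FP) and the total count, so it can be evaluated without knowing the actual classification.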

2.3 THEMATICS

I am going to discuss the THEMATICS method in more detail because this is the basis for most of the

input data in the work of this dissertation.

2.3.1 The THEMATICS method and its features.

In the application of THEMATICS, one begins with the 3D structure of the query protein, solves the

Poisson-Boltzmann (P-B) equations using well-established methods 49-52, and then performs a Monte

Carlo procedure to compute the proton occupations of each ionizable amino acid as a function of the pH.

Each such function is called a titration curve, as shown in figure 2.1 (a).

From the theoretical titration curves computed from the 3D structure of a query protein, THEMATICS

identifies residues (amino acids) that exhibit significant deviation from Henderson-Hasselbalch (H-H)

behavior, which I now describe.

A typical ionizable residue in a protein obeys the H-H equation, which may be expressed as a proton

occupation O as a function of pH as:

$$O(pH) = \left(10^{\,pH - pK_a} + 1\right)^{-1} \qquad (1)$$

For the residues that form a cation upon protonation (Arg, His, Lys, and the N-terminus), the mean net
charge C on a particular residue is equal to O, whereas for the residues that form an anion upon
deprotonation (Asp, Cys, Glu, Tyr, and the C-terminus), the mean net charge is given by (O − 1):

$$C(pH) = O(pH) \quad \text{(cationic)} \qquad (2)$$

$$C(pH) = O(pH) - 1 \quad \text{(anionic)} \qquad (3)$$

Note that C represents the average net charge on a particular residue for a large ensemble of protein

molecules. Equations (1) - (3) have the sigmoid shape that is typical of a weak acid or base that obeys the

H-H equation. Thus, as pH increases, the predicted average charge falls sharply in a pH range close to the

pKa, which is defined as the pH at which that residue is protonated in exactly half of the protein

molecules in the ensemble.
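
Equations (1) - (3) can be evaluated with a few lines of code. The sketch below is only an illustration; the function names and the pKa value of 6.0 used in the example are assumptions, not values taken from this work.

```python
def hh_occupation(ph, pka):
    """Proton occupation O(pH) of an ideal H-H residue, equation (1)."""
    return 1.0 / (10.0 ** (ph - pka) + 1.0)

def mean_net_charge(ph, pka, cationic):
    """Mean net charge C(pH): equation (2) if cationic, equation (3) if anionic."""
    occupation = hh_occupation(ph, pka)
    return occupation if cationic else occupation - 1.0

# Sample the sigmoid curve for a hypothetical residue with pKa = 6.0.
curve = [(ph, hh_occupation(ph, 6.0)) for ph in range(0, 15)]
```

At pH = pKa the occupation is exactly one half, and it falls by roughly a factor of ten for each further pH unit above the pKa.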

Underlying the THEMATICS approach is the observation that the computed titration curves tend to

deviate more from this H-H shape for ionizable residues belonging to active sites than for ionizable

residues not belonging to such sites. The key step in the application of the THEMATICS approach is thus

recognizing significant deviation from H-H behavior in the shape of these predicted titration curves.

When THEMATICS was first developed, visual inspection of the computed curves was used to identify

THEMATICS positive residues. Although simple, this approach is inefficient, vulnerable to bias, and in some cases

ineffective, since some deviations of the curves are subtle and not easily recognized visually, as indicated

by figure 2.1(b).

Figure 2.1. Titration curves. (a) A standard H-H curve (black), a typical perturbed curve (red), and a typical unperturbed curve from residues not in the active site (blue). (b) Titration curves from active site residues (red) versus non-active-site residues (green) from a set of 20 proteins (Appendix A); only glutamate residues are shown.

Before introducing the methods for automation of the classification of residues using THEMATICS, I

will first present the feature extraction process.

In order to perform any machine-learning or statistical analysis on titration curves, one needs to find
features that are easy to compute and effective at distinguishing positive from negative instances.

I have defined features that may be used to measure the deviation of a particular titration curve from
H-H behavior. In particular, four features extracted from the titration curves are most useful in
separating THEMATICS positive residues from the others.

These four features are based on the first four moments of the derivatives of the titration curves, as I now

describe briefly. A more detailed description can be found in Ko’s study 24.

Define the variable x to be the offset of the pH from the pKa, as:

$$x = pH - pK_a, \qquad (4)$$

Then equation (1) for Henderson-Hasselbalch titration curves becomes:

$$O(x) = \left(10^{x} + 1\right)^{-1}. \qquad (5)$$

The key observation on which the moment analysis is based is that, for any titration curve O(x), whether

of Henderson-Hasselbalch form or not, the corresponding derivative

$$f(pH) = -\frac{dO}{dx} = -\frac{dO}{d(pH)} \qquad (6)$$

is effectively a probability density function (ignoring those rare cases when the titration curve fails to be a
non-increasing function of x, in which case this derivative function takes on negative values) 53.

The nth moment of f is defined as

$$M_n = \int (pH)^n\, f(pH)\, d(pH) \qquad (7)$$

and the corresponding nth central moment µn is

$$\mu_n = \int (pH - M_1)^n\, f(pH)\, d(pH) \qquad (8)$$

where these integrals are over all space (−∞ to +∞).

The features I use are based on the first moment M1 and the second, third, and fourth central moments µ2,

µ3, and µ4, respectively, of the derivatives f. For a pure H-H titration curve these moments are

M1 = pKa , µ2 = 0.620, µ3 = 0, and µ4 = 1.62. (9)

It is interesting to note that, for an arbitrary probability density function, M1 is its mean and µ2 is its

variance, while µ3 and µ4 are related to the skewness and kurtosis, respectively, standard quantities used in

statistics to measure departure from normality.
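
The H-H moment values quoted in equation (9) can be checked numerically. The sketch below works in the offset variable x of equation (4), so that M1 = pKa corresponds to a first moment of zero; the trapezoidal rule, the integration limits, and the step count are choices made for this illustration only.

```python
import math

def hh_density(x):
    """f(x) = -dO/dx for the H-H curve O(x) = 1 / (10**x + 1)."""
    t = 10.0 ** x
    return math.log(10.0) * t / (1.0 + t) ** 2

def moment(n, center=0.0, lo=-20.0, hi=20.0, steps=40000):
    """Trapezoidal estimate of the integral of (x - center)**n * f(x)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0   # half weight at the end points
        total += weight * (x - center) ** n * hh_density(x)
    return total * h

area = moment(0)                      # should be close to 1 for a density
m1 = moment(1)                        # first moment in x, i.e. M1 - pKa
mu2, mu3, mu4 = (moment(n, center=m1) for n in (2, 3, 4))
```

Under these assumptions the computed values agree with equation (9) to the precision quoted there: the density integrates to one, the first and third moments vanish by symmetry, and the second and fourth central moments come out near 0.620 and 1.62.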

When applied to a general titration curve of ionizable residues in a protein, the pKa shift is closely related

to how much M1 differs from the free-solution pKa. Those residues that interact strongly with other

ionizing residues in such a way that the predicted titration functions O(pH) are elongated will have

broader first derivative functions f and thus have generally higher values for µ2 and especially µ4. The

moment µ3 measures the asymmetry of the function f and has a nonzero value for any residue that

interacts with other ionizing groups in such a way that the strength of this interaction in the range pH <

pKa is different from that in the range pH > pKa.

Thus it is clear that the first moment and the second, third, and fourth central moments are useful

measures for determining deviation from H-H behavior.

The methods introduced below all use some of the four features described above, with some additional

features in some specific methods.

2.3.2 Statistical analysis with THEMATICS.

One automated analysis was proposed and studied by Ko 24. Ko introduced simple statistical metrics to

automatically evaluate the degree of perturbation of a titration curve from H-H behavior. The method is

simple, just looking at two of the above features, namely µ3 and µ4. A statistical Z-score was computed on

these features; i.e. for every curve analyzed, the deviation of the µ3 and µ4 values from the mean,

expressed in units of the standard deviation, of all curves from the same protein were computed. Any

residue with a titration curve with either the absolute value of Z-score of µ3 or Z-score of µ4 greater than

1.0 was classified as a THEMATICS positive residue and any such residues with at least one other

THEMATICS positive residue located within 9Å were reported as active site candidates.
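
The rule just described can be sketched in code. This is a simplified reading rather than a reproduction of Ko's implementation: it assumes a population standard deviation, takes the absolute value for µ3 only, and represents each residue by a single coordinate so that the 9Å test becomes a point-to-point distance. All names and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev

def zscores(values):
    """Z-score of each value relative to all curves in the same protein."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s > 0 else 0.0 for v in values]

def active_site_candidates(mu3, mu4, coords, z_cut=1.0, radius=9.0):
    """Indices of positive residues that have another positive residue nearby."""
    z3, z4 = zscores(mu3), zscores(mu4)
    positives = [i for i in range(len(coords))
                 if abs(z3[i]) > z_cut or z4[i] > z_cut]
    return [i for i in positives
            if any(j != i and math.dist(coords[i], coords[j]) <= radius
                   for j in positives)]

# Toy protein (assumed): residues 5, 6 and 7 are perturbed; 7 is isolated.
mu3 = [0.0] * 5 + [2.0] * 3
mu4 = [1.6] * 5 + [3.0] * 3
coords = [(100.0 + 10.0 * i, 0.0, 0.0) for i in range(5)]
coords += [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
candidates = active_site_candidates(mu3, mu4, coords)
```

In the toy data the isolated perturbed residue is classified as positive but is not reported, because no other positive residue lies within the cutoff distance.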

Good results were obtained for the identification of active sites in a set of 44 proteins with Ko’s method.

Although this method has excellent recall of catalytic sites, identifying the correct catalytic site for 90%

of enzymes, the recall rate is lower (about 50%) for the identification of catalytic residues. It is desirable

to improve the catalytic residue recall rate and also to expand the method to include predictions of non-

ionizable residues.

Wei studied Ko’s analysis and modified it 54, introducing a new fractional parameter, α, which typically

runs between 0.95 and 1.0. In this method, the mean and the standard deviation of µ3 and µ4 are

calculated from the titration curves from a portion of the residues in a protein, in contrast to the whole

population in Ko’s method. The portion of the residues excluded from the sample are the residues with

titration curves with µ3 and µ4 values in the highest (1-α) fraction. This modification did yield better

recall of annotated catalytic residues than Ko’s analysis, but the optimal α is different for different

proteins and was finally fixed at 0.99, the value that yields the best overall performance when averaged

over the set of annotated proteins. The purpose of this α is to exclude residues with titration curves with

the most extreme µ3 and µ4 values from influencing the mean and standard deviation of the population too

much and thus yielding better statistics and slightly more reliable predictions.

Meanwhile, Yang developed another rule-based statistical analysis identifying THEMATICS positive

residues 55. In addition to the four features described earlier, her method uses an additional feature, a

value called the buffer range R, which measures the width of the pH range over which the residue is

partially ionized. Also, outliers are selected within each amino acid type, when possible, instead of the

entire set of ionizable residues. The performance of this method is a little better than Ko’s method.

The three statistics-based analyses listed above all employ handcrafted cutoff values to differentiate the

positive from the negative instances. The study described in this dissertation begins with the hypothesis

that a machine learning method can utilize similar sets of features, define a threshold in a systematic way,

and achieve better performance in practice.

2.3.3 Challenges of the site prediction problem using THEMATICS data.

One of the challenges of the task to classify THEMATICS results arises from special characteristics of the

training data set.

First, the vast majority of the residues in the training data are negative examples. Literature-confirmed

active site residues typically constitute less than 3% of total residues. At the same time, the negative

examples, which comprise most of the data in the training set, share some “common” property, while the

relatively few positive examples have “abnormal” behaviors in a varied way. This is one of the key

reasons that a simple outlier detection process like Ko’s analysis is quite successful in solving this

problem. But it is not clear how this method can incorporate additional non-THEMATICS features to

possibly improve the active-site prediction.

Secondly, the nature of this problem limits the quality of the training data. The ultimate goal of the
project is to predict the active sites of proteins using THEMATICS data; however, the absolute criterion
for labeling a residue in a protein as active is that someone has done the experiment in the lab and
published a result supporting the claim. There are databases collecting such annotations, but these
annotations are by no means complete. There is another subtlety: although THEMATICS positive residues
have been shown to be a very reliable indicator of active sites, THEMATICS sometimes predicts
additional nearby residues that are not annotated as active, including second shell residues. There is some

evidence to support the hypothesis that these second shell residues may be important. Alternatively, they

may be affected by the special electrostatic field created by the nearby active site residues. The

THEMATICS positive residues in the second case may not be shown experimentally to be active site

residues. Because residue activity is often measured in a kinetics experiment and a number of factors can

sometimes cause large errors in these experiments, the training set inevitably contains some positive

instances that are misclassified in the first place, or some instances that cannot be correctly distinguished

by the model. In particular, there are most probably instances of true positives that are improperly

annotated as negatives, simply because no experiments have been tried on the vast majority of residues.

In order to overcome these two obstacles, in my earlier work with neural networks and the SVM method, I
“cleaned” the training set. Instead of using just literature-confirmed positive instances, I also labeled
as positive the “apparent” THEMATICS positives that lie near a known active site although they have not
been experimentally identified as active site residues. I also removed some of the isolated THEMATICS positive instances

from the training set. Although this data cleaning did improve the results, it is ad hoc and lacks a

systematic justification.

For any machine learning problem, if there is some prior belief, or bias, which turns out to be true,

applying it should always help the performance.

After studying THEMATICS and its application in protein active site prediction, it is fair to adopt the
following THEMATICS principles as prior beliefs:

THEMATICS Principle 1: The more perturbed the titration curve is (relative to other titration curves in

the same protein), the greater the probability that residue is in the active site.

THEMATICS Principle 2: The more perturbed the neighboring titration curves are (relative to other

titration curves in the same protein), the greater the probability that residue is in the active site.

The ad hoc method used before implicitly cleaned the data based on the THEMATICS principles.

In addition to THEMATICS features, to which we can apply THEMATICS principles, there are some

non-THEMATICS features having either positive or negative correlation to the probability that a residue

is located in the active site. Those features may not be a reliable indicator by themselves, but combined

with THEMATICS methods, they may improve the overall prediction accuracy.

While there may be ways to enforce inductive bias in classifiers like neural networks and SVMs, I believe

the most straightforward approach is instead to try to estimate P(class | attributes) nonparametrically,

while enforcing these principles as constraints, as explained in Chapters 4 and 5.

Chapter 3

Applying SVM to THEMATICS

3.1 Introduction.

As discussed in earlier chapters, THEMATICS is a technique for the prediction of local interaction sites

in a protein from its three-dimensional structure alone. Various approaches have been taken to automate

and standardize the process with various sensitivity and specificity. Here, I will present my first work on

this project, using a support vector machine, with four extracted features from THEMATICS alone to

predict the active sites of enzymes.

In this chapter it is shown how support vector machines (SVM’s) may be combined with THEMATICS to

achieve a substantially higher recall rate for catalytic residues with only a small sacrifice in specificity

when compared to Ko’s statistical analysis of THEMATICS 24. It is argued that clusters predicted by

THEMATICS-SVM are small, local networks of ionizable residues with strong coupling between their

protonation events; these characteristics appear to be very common, perhaps nearly universal, in enzyme

active sites. Performance of THEMATICS-SVM in active site prediction is compared with other
3D-structure-based methods, including THEMATICS combined with previous analyses, and is shown to return
equal or better recall with generally higher specificity and lower filtration ratio. The high specificity and

low filtration ratio translate to better quality, more localized, predictions.

This work builds on the prior work of Ko using variants of some of the same features that were found to

be successful there, plus some additional features. Results of our method are presented for 60 different

proteins. In this chapter, I also present a way to extend the method’s capabilities to the prediction of non-

ionizable residues.

3.2 THEMATICS curve features used in the SVM

To use an SVM to classify residues as either likely or not likely to be in the active site, I represent the

computed titration curves as points in a four-dimensional space. These four features are based on the first

four moments of these curves, as described in section 2.3.1.

The four features, namely M1, µ2, µ3 and µ4 are conceptually similar to those described in Ko’s analysis 24,

except I slightly modified the normalization process to prevent both the sample mean and sample standard

deviation from being too strongly influenced by extreme values. A more robust estimator is used to

distinguish “typical” from “atypical” titration curves within a single protein than the standard Z-score. In

my normalization, each of the four moments was normalized to its corresponding robust Z-score Z′, which
is defined as its deviation from the median, divided by the normalized interquartile distance. The
interquartile distance is the difference between the 75th percentile and 25th percentile values of the
corresponding feature across all ionizable residues in that protein. The normalization factor of 1.349 is
the interquartile distance of the normal distribution with a mean of zero and a standard deviation of
one. Thus for a given feature Φ, I define Z′ as:

$$Z'(\Phi) = \frac{1.349\,\bigl(\Phi - \mathrm{MEDIAN}(\Phi)\bigr)}{\mathrm{PERCENTILE}(\Phi, 0.75) - \mathrm{PERCENTILE}(\Phi, 0.25)} \qquad (10)$$

where the median and corresponding percentiles are based on the value of that feature for all ionizable

residues in a given protein. Thus this method achieves the same effect as Wei’s method 54 without

introducing an extra parameter to be fine-tuned.
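
Equation (10) can be sketched with the Python standard library. The percentile convention (linear interpolation, the "inclusive" method of `statistics.quantiles`) is an assumption, since the text does not specify one; the function name and the sample values are likewise hypothetical.

```python
from statistics import median, quantiles

def robust_z(values):
    """Robust Z-score of equation (10) for one feature within one protein."""
    q25, _, q75 = quantiles(values, n=4, method="inclusive")
    spread = q75 - q25                     # interquartile distance
    if spread == 0:
        raise ValueError("interquartile distance is zero")
    center = median(values)
    return [1.349 * (v - center) / spread for v in values]

# Hypothetical feature values with one extreme outlier.
feature = [1, 2, 3, 4, 5, 6, 7, 8, 100]
scores = robust_z(feature)
```

Because the median and the quartiles are barely affected by the extreme value, the typical residues keep small scores while the outlier receives a very large one, which is the behavior the normalization is designed to provide.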

For the even-numbered moments, Z′n, the robust Z-score for the nth central moment is defined as:

Z′n = Z′(µn ) (n even) (11)

The only even-numbered moments used in the present study are the second and fourth, so the

corresponding robust Z-scores Z′ are Z′2 and Z′4.

Likewise the only odd-numbered moments are the first and third. Their corresponding robust Z-scores are

the deviations of the absolute values from the median. In particular, we define Z′3 as

Z′3 = Z′(|µ3|) (12)

The population over which the median and percentiles are computed includes residues of different types

with different free pKa’s. In order to compare the computed first moments across all residue types, the

offset first moment for a given residue is defined as:

M1offset = M1 - pKa(free) (13)

where pKa(free) is the pKa for that residue in free solution. Note that by equation (9), a H-H residue has

M1offset = 0.

This offset may be compared across all residues in the protein. Thus Z′1 is defined as:

Z′1 = Z′ (|M1offset|). (14)

Note that only the first moment requires this modification to make all residue types in the protein

comparable since the H-H equation has only one free parameter, the residue type-dependent translation

parameter pKa.

To summarize, the result of all these computations is to create, for each ionizable residue in any given

protein, a 4-tuple of descriptors (Z′1, Z′2, Z′3, Z′4) of the theoretical titration curve. Z′2, Z′3, and Z′4

describe the shape of the curve and Z′1 measures its displacement along the horizontal axis.

3.3 Training.

A set of 20 proteins was used as the training set. The protein names, the E.C. numbers and the PDB ID for

each of the 20 proteins in the training set are listed in Appendix A.

The labeling of the titration curves for training purposes was performed as follows: All residues listed in

CatRes/CSA as active were labeled positive. Also labeled positive were ionizable residues located near
such annotated active residues whose titration curves appeared perturbed on visual inspection.
All other residues were labeled negative, with the exception of a few residues with visually

perturbed titration curves and with no literature annotation that are not near any other perturbed residues;

they were removed from the training data set entirely. (Note that such residues with perturbed titration

curves that are not in spatial proximity with other perturbed residues are not considered predictive in

THEMATICS.)

From 1575 ionizable residues in the 20-protein training set, I removed 46 isolated residues with
perturbed titration curves. This leaves a training set of 1529 ionizable residues, among which 140 are
labeled as positive training examples. For each ionizable residue in the training set, the four
moment-based features and the corresponding labels were fed into the SVM using SVMLight 56. For both
training and classification, the quadratic kernel K(x,z) = (1 + <x,z>)^2 was used.
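
The quadratic kernel also gives a concrete view of the "kernel trick". The sketch below checks, for two hypothetical four-dimensional feature vectors, that K(x,z) = (1 + <x,z>)^2 equals an ordinary inner product in an explicit 15-dimensional feature space. The explicit map `phi` is written out only for illustration; an SVM never needs to construct it.

```python
import math

def quadratic_kernel(x, z):
    """K(x, z) = (1 + <x, z>)**2."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2

def phi(x):
    """Explicit feature map with <phi(x), phi(z)> = (1 + <x, z>)**2."""
    root2 = math.sqrt(2.0)
    d = len(x)
    features = [1.0]                                   # constant term
    features += [root2 * a for a in x]                 # linear terms
    features += [a * a for a in x]                     # squared terms
    features += [root2 * x[i] * x[j]                   # cross terms, i < j
                 for i in range(d) for j in range(i + 1, d)]
    return features

# Hypothetical 4-tuples of descriptors, for illustration only.
x = (0.5, -1.0, 2.0, 0.0)
z = (1.0, 0.5, 0.5, 3.0)
k_direct = quadratic_kernel(x, z)
k_mapped = sum(a * b for a, b in zip(phi(x), phi(z)))
```

The direct evaluation needs one four-term inner product, whereas the mapped version works in 15 dimensions; the gap widens quickly for higher-degree kernels.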

The relative cost of misclassification of positive and negative training examples was set such that false

negatives were penalized 10 times as much as false positives. This was done because there are many more

negative examples than positive examples in the training set, because of the aim to increase the residue
recall rate, and because I have much more confidence in the labeling of the positive examples than of the
negative examples (see sections 2.2.3 and 3.4). In addition, a linear kernel and several other choices of parameters

were tried, but these resulted in either similar or slightly more training errors.
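
The effect of the asymmetric misclassification cost can be illustrated with a cost-weighted hinge loss, the penalty term of a soft margin SVM. This is a sketch of the general idea under the assumption of labels +1 and -1; it does not reproduce the internals of any particular SVM package, and the function name is hypothetical.

```python
def weighted_hinge_loss(scores, labels, positive_cost=10.0, negative_cost=1.0):
    """Sum of margin violations, with errors on positives weighted more heavily."""
    total = 0.0
    for score, label in zip(scores, labels):
        slack = max(0.0, 1.0 - label * score)      # zero when the margin is met
        total += (positive_cost if label == +1 else negative_cost) * slack
    return total

# A missed positive (false negative) versus an equally wrong negative (false positive).
missed_positive = weighted_hinge_loss([-1.0], [+1])
wrong_negative = weighted_hinge_loss([+1.0], [-1])
```

With the 10-to-1 weighting, an error on a positive example costs ten times as much as an equally severe error on a negative example, which pushes the separator toward higher recall.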

3.4 Results

Typical criteria used to measure classifier performance are recall (also called sensitivity), the number
of correctly predicted positives divided by the number of actual positives, and precision (related to
specificity), the number of correctly predicted positives divided by the total number of predicted positives. Ideally,

both measures are 100%, which means all and only the true positives are identified as such by the

classifier. In the present case, one can be more confident of the true positive data, because for every

labeled active residue there is experimental evidence supporting that labeling. On the other hand, true

negative data are not as reliable because the experiments are incomplete; some important residues may

not have been tested experimentally. Furthermore some of the experimental literature has not been

included in the CatRes/CSA database, because of the difficulty of exhaustive literature searching. A better

indicator of the selectivity of the method for present purposes is the filtration ratio, the fraction of total

residues that are reported as positive. Now the goal of the system is to achieve a high recall with a low

filtration ratio.

A set of 64 test proteins was selected randomly from the CatRes database 57. There is no overlap between

this test set of 64 proteins and the set of 20 proteins used to train the SVM. The trained SVM was applied

to the test set to measure the overall accuracy of the method, assuming that the CatRes annotations define
the true positive residues. Results are summarized here, while a detailed list of all proteins studied
with all predicted residues and clusters can be found in Appendix B.

3.4.1 Success in site prediction.

First I examine the degree of matching between our predictions and the CatRes list for each protein.

Overall, the SVM identified an average of 2.7 clusters per subunit. Based on the overlap of the predicted

active site and the CatRes listed set, the prediction for a protein is assigned to one of three categories. If

50% or more CatRes listed active residues were found by the system, we consider this a correct site

prediction. If some, but fewer than half, of the CatRes listed active residues were found by our system, we

consider it partially correct. If none of the CatRes listed active residues were found by our system, we

consider the site prediction incorrect. This type of categorization has been used previously 18. Measuring

this degree of overlap of predicted clusters with just the ionizable CatRes listed active-site residues, the

percentages of proteins for which the predictions are correct, partially correct, and incorrect are 86%, 5%

and 9% respectively, as shown in figure 2(a).
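
The three-way categorization used in this section can be written as a short function. The sketch below is only an illustration of the rule as stated in the text; the function name and the residue identifiers are hypothetical.

```python
def categorize_site_prediction(predicted, annotated):
    """Classify a protein's site prediction by its overlap with the annotated residues."""
    annotated = set(annotated)
    if not annotated:
        raise ValueError("no annotated active residues to compare against")
    found = len(annotated & set(predicted))
    if found == 0:
        return "incorrect"
    if 2 * found >= len(annotated):        # 50% or more were found
        return "correct"
    return "partially correct"

# Hypothetical residue identifiers.
example = categorize_site_prediction(
    predicted={"HIS57", "ASP102", "LYS10"},
    annotated={"HIS57", "ASP102", "SER195"},
)
```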

3.4.2 Success in catalytic residue prediction.

Out of the 9303 ionizable residues from the 64 proteins, 1338 were predicted as active site candidates by

the SVM, forming 244 clusters. There are 233 ionizable residues labeled as active site residues in the

CatRes database and 182 of them were found by our SVM, corresponding to a global residue recall of

78%. The average residue recall rate, averaged over all 64 proteins, is 76%.

For these 64 proteins, the filtration ratio, defined as the number of residues predicted divided by the
total of 32016 residues (both ionizable and non-ionizable), averages only 3.9%. This ratio is less than 8% for each

of the 64 proteins. The average precision, or fraction of predicted residues that are known true positives,

is 20% over the 64 protein set, using only the CatRes/CSA annotations to define the true positives.
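
The distinction between a global (pooled) rate and a per-protein average can be made explicit with the counts reported above. The pooled values below use only numbers from the text; the per-protein example uses two hypothetical proteins, since the individual counts are listed in Appendix B rather than here.

```python
def pooled_and_mean_recall(per_protein):
    """per_protein: list of (found, annotated) counts, one pair per protein."""
    pooled = sum(f for f, _ in per_protein) / sum(a for _, a in per_protein)
    mean = sum(f / a for f, a in per_protein) / len(per_protein)
    return pooled, mean

# Pooled values from the counts reported in the text.
global_recall = 182 / 233            # annotated ionizable residues found
pooled_filtration = 1338 / 32016     # predicted residues over all residues

# Hypothetical two-protein example showing that the two averages can differ.
pooled, mean = pooled_and_mean_recall([(1, 2), (9, 10)])
```

Pooling weights every residue equally, whereas the per-protein average weights every protein equally, so the two summaries need not coincide.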

3.4.3 Incorporation of non-ionizable residues.

Since not all active site residues are ionizable, it is also of interest to see how well the SVM-reported

residues serve as predictors of activity in their spatial vicinity. Therefore I also define a THEMATICS

positive region to be the spatial region within 6Å of any residue that belongs to a THEMATICS positive

cluster. This may allow the method to find some catalytically important residues that do not have a

perturbed titration curve (including non-ionizable residues). The total number of residues found by this

criterion across the 64 test proteins is 4795, out of 32016 total residues. Among 366 residues that are

labeled as active site residues in CatRes, 263 were found by the system, corresponding to a global recall

of 72%, while the average recall per protein is 81%. The average precision, or fraction of predicted

residues that are known true positives, is 8% over the 64-protein set (Table 3.1), using only the CatRes/CSA

annotations to define the true positives.
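A minimal sketch of the 6Å expansion follows. The text does not state how the distance between two residues is measured, so this sketch assumes the smallest atom-to-atom distance; the function names, the dictionary layout, and the toy coordinates are all hypothetical.

```python
import math


def min_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of one residue and any atom of another."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def thematics_positive_region(residue_atoms, cluster_residues, cutoff=6.0):
    """Return the cluster residues plus every residue within `cutoff` angstroms of one."""
    region = set(cluster_residues)
    for name, atoms in residue_atoms.items():
        if name in region:
            continue
        if any(min_distance(atoms, residue_atoms[c]) <= cutoff for c in cluster_residues):
            region.add(name)
    return region


# Toy coordinates (hypothetical, one atom per residue for brevity)
toy = {
    "E86": [(0.0, 0.0, 0.0)],
    "S87": [(4.0, 0.0, 0.0)],
    "G200": [(20.0, 0.0, 0.0)],
}
print(sorted(thematics_positive_region(toy, {"E86"})))  # ['E86', 'S87']
```

Because the region grows with every predicted residue, recall can only stay the same or increase, while the number of reported residues, and hence the filtration ratio, rises.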

Table 3.1 compares the performance of the straight SVM predictions versus the SVM+Region predictions.

While the expansion to include the neighborhood surrounding the predicted residue leads to a somewhat

higher recall rate, there is considerable sacrifice in the precision and increase in the filtration ratio.

Method              Recall   Precision   Filtration Ratio
SVM only            61%      21%         4%
SVM + 6Å region     81%      8%          13%

Table 3.1. Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue. Shown are average values of recall (true positive residues over all known positive residues), precision (true positive residues over all predicted residues), and filtration ratio (residues predicted over total residues in the protein), where averaging is performed over the set of 64 test proteins.

Using the same criteria described above for judging correctness of the predictions, but this time counting

all residues in THEMATICS positive regions and comparing with all the CatRes listed active-site

residues, the percentages of correct, partially correct, and incorrect site predictions are 88%, 4% and 8%

respectively (Figure 3.1(b)).

[Figure 3.1 image: two pie charts. (a) Correct 86%, Partially Correct 5%, Incorrect 9%. (b) Correct 88%, Partially Correct 4%, Incorrect 8%.]

Figure 3.1. The success rate for site prediction on a per-protein basis: (a) ionizable residues only; (b) all residues, extending the SVM’s predictions by including all residues within 6Å of each predicted residue.

Figure 3.2 shows histograms of the filtration ratios achieved using the trained SVM on a per protein basis.

These filtration ratios are expressed in three different ways: 1) All predicted residues over all residues; 2)

Ionizable residues predicted over all ionizable residues; and 3) Ionizable residues predicted over all

residues. Here, 1) is obtained using the 6Å neighborhood criterion and 2) and 3) are obtained from the

straight SVM prediction of ionizable residues only. Using all residues as the basis, the filtration ratio for

the SVM (ionizable) predictions is less than 10% for all 64 proteins. There was only one protein out of

the 64 for which the filtration ratio for all residues predicted (using the 6Å neighborhood criterion) was

higher than 25%. For this protein, human Glyoxalase I, the method identified about 18% of its ionizable

residues as candidates and about 27% of all its residues as candidates (using the 6Å neighborhood

criterion). For well over 90% of the proteins studied the filtration ratio was better than 20% in both the

ionizable/ionizable and the all-residues cases. Even better, in 70% of the proteins tested, the method

reported less than 15% of the ionizable residues as candidates and in 61% of the proteins in the test set,

less than 15% of all residues were identified as candidates using the 6Å neighborhood criterion. It is

important to note that for most of the examples with large filtration ratios, there is a sound functional

basis for this high ratio, e.g. the active site binds multiple substrate or cofactor molecules and thus has an

unusually large interaction site. In other cases with high filtration ratios, the protein has an interaction site

of typical size but an unusually small number of total ionizable residues.

[Figure 3.2 image: histogram. x-axis: Filtration Ratio Range (0-5%, 5-10%, 10-15%, 15-20%, 20-25%, 25-30%); y-axis: Number of Proteins (0 to 60); series: All/All, Ionizable/Ionizable, Ionizable/All.]

Figure 3.2. Distribution of the 64 proteins across different values for the filtration ratio. Filtration ratios are expressed as: 1) All predicted residues over all residues; 2) Ionizable residues predicted over all ionizable residues; and 3) Ionizable residues predicted over all residues. All residues are predicted using the 6Å neighborhood criterion.

3.4.4 Comparison with other methods.

It is useful to compare the results of the present method with some other catalytic site prediction methods

that are based on 3D structure alone. The other methods used for this comparison are QSiteFinder 20 and

SARIG 35, both of which have publicly available servers, and Ko’s statistical analysis of THEMATICS 24.

Of the 64 proteins used for the THEMATICS-SVM test set, one was too large for both SARIG and

QSiteFinder and three others were too large for QSiteFinder. These four were deleted from the test set

and thus the comparison results reported here are for the remaining 60 proteins.

The average (per protein) values for recall, precision, filtration ratio, false positive rate, and MCC of each

method are listed in Table 3.2. Two sets of results are given for QSiteFinder, one using only the top site

and the other using a combination of the top three sites. This combination of the top three sites is the

basis for the success rate reported for this method in the original article 20. The values in Table 3.2 use all

annotated residues, including non-ionizable residues, as the basis for the recall and all residues in the

protein as the basis for the filtration ratio for all of the methods. Thus the theoretical maximum recall rate

for THEMATICS-SVM (without the 6Å region) is less than 100%, because some known catalytic

residues are non-ionizable.

Figure 3.3 plots recall as a function of false positive rate. The solid line represents Wei’s THEMATICS

analysis with variable parameter α. Performance for Ko’s analysis, QSiteFinder, SARIG, and the present

SVM are depicted as points. The recall and the false positive rate of all the methods in this plot were

measured by ionizable residues only.

Method                       Recall   Precision   Filtration Ratio   False Positive Rate   MCC
THEMATICS-SVM                61%      20.0%       3.8%               3.1%                  0.31
THEMATICS-SVM-Region         80%      8.1%        13.2%              12.3%                 0.22
THEMATICS-Statistical (Ko)   44%      23.5%       2.6%               2.0%                  0.29
QSiteFinder (top one)        33%      4.6%        7.5%               7.2%                  0.094
QSiteFinder (top three)      61%      4.2%        16.2%              15.8%                 0.12
SARIG                        61%      8.0%        11.2%              10.6%                 0.18

Table 3.2. Comparison of THEMATICS-SVM and other methods. Shown in the table are the recall, precision, filtration ratio, false positive rate, and Matthews correlation coefficient (MCC) of THEMATICS as well as of other site prediction methods including THEMATICS-Statistical (Ko’s method), QSiteFinder, and SARIG. These quantities are per-protein averages over a comparison set consisting of the 60 proteins from the 64-protein test set for which results could be obtained with all methods.
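For readers unfamiliar with the last two columns of Table 3.2, the sketch below computes the false positive rate and the Matthews correlation coefficient from per-protein confusion counts using their standard definitions. The function names and the example counts are hypothetical and are not taken from the test set.

```python
import math


def false_positive_rate(fp, tn):
    """False positives over all true negatives in the protein."""
    return fp / (fp + tn)


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical 200-residue protein: 8 residues predicted, 4 annotated, 3 in common
tp, fp, fn, tn = 3, 5, 1, 191
print(round(false_positive_rate(fp, tn), 4))  # 0.0255
print(round(matthews_cc(tp, fp, tn, fn), 3))  # 0.518
```

Since annotated residues are a small minority of each protein, the false positive rate stays close to the filtration ratio, which is consistent with the neighboring columns in Table 3.2.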

[Figure 3.3 image: recall (y-axis, 0 to 1) plotted against false positive rate (x-axis, 0 to 0.6). Legend: Wei's Method, SVM, Q-site Finder (largest one), Q-site Finder (largest three), Sarig, Ko's Method.]

Figure 3.3. Recall-false positive rate plot (ROC curves) of SVM versus other methods. The curve for Wei’s method is obtained by varying the parameter α, a generalization of Ko’s methods.

The comparisons in Table 3.2 and Figure 3.3 show that THEMATICS-SVM can achieve catalytic residue recall that is as good as or better than that of other methods, while simultaneously achieving substantially better precision with lower filtration ratios and false positive rates. This translates to more localized, more precise predictions of generally better quality, with higher MCC. Compared with Ko’s statistical analysis of THEMATICS, Table 3.2 shows that the SVM analysis gives significantly better sensitivity to catalytic residues with only a small concomitant drop in precision. Figure 3.3 and the MCC values in Table 3.2 further show that the tradeoff between recall and filtration ratio/false positive rate is better with the SVM analysis, since in Figure 3.3 it outperforms Wei’s method, which is a generalization and extension of Ko’s method.

3.5 Discussion

3.5.1 Cluster number and size.

The SVM study (without the neighboring region) found 244 clusters as active site candidates, with an

average of 2.7 clusters per chain in the 64-protein test set. While the number of true positive clusters per

chain is probably somewhat higher than 1.0, 2.7 is too high. Ko’s statistical criteria reported 1.7 clusters

per chain, although I note that the present method was designed to increase the recall rate of true positive

residues. Among the 244 clusters reported by the present method, 90 of them are pairs and it appears that

many of these pairs are false positives resulting from the chance proximity of two residues with similar

pKa’s. Simple geometry-based rules have been investigated that do eliminate many of these pairs 58. The

average size of a cluster is 5.5 ionizable residues. These values include catalytic residues, binding

residues, and also some “second shell” residues, residues that are nearest neighbors of the known catalytic

residues and second-nearest neighbors of the reacting substrate. It is indeed possible that these “second

shell” residues play some supporting role in catalysis, or they may simply be subject to the same strongly

pH-dependent field as the catalytic residues. Experimental site-directed mutagenesis studies are in

progress to elucidate this.

3.5.2 Failure analysis.

Although this method effectively found the active sites in most of the proteins, there still remain a few cases

where it failed to find the correct active sites. There are a variety of possible causes for this. In some cases,

the “failure” may not be a failure at all. An example is cytosolic ascorbate peroxidase (1APX), in which

the active site listed in CatRes is at the distal side of the heme ring. My method did not find that particular

site, but it did find a cluster at the proximal side of the heme ring, which has been identified as an active

site by several studies 59, 60. This suggests that some of the discrepancy between CatRes labeling and the

SVM results may result from incomplete or incorrect information in the reference database. Failure can

also occur in cases where the active site environment is very hydrophobic. In the case of DNA photolyase

(1DNP), the listed active site consists of three tryptophan residues, which are not ionizable. Even in that

case, the SVM method did find a cluster that lies between those active tryptophan residues and the

cofactors FAD and MTF. Indeed the predicted residues may actually be involved in the electron transfer

process, which proceeds from the tryptophan residues to the cofactors. In some other cases, the SVM

method found a cluster of residues that bind to a substrate or cofactors. These binding residues may

exhibit such strong perturbations in the titration curves that the actual catalytic residues are missed by the

classifier.

3.5.3 Analysis of high filtration ratio cases.

There are a few proteins for which the SVM method gives a high filtration ratio, meaning the active site

candidates our method produces constitute an unusually high fraction of the total residues in the protein.

Most of these cases are cofactor-dependent enzymes and/or systems with larger substrate molecules,

where the site truly is a larger fraction of the protein’s surface. The protein with the highest filtration ratio

is human Glyoxalase I (1FRO), for which the SVM selects 18% of the ionizable residues and

SVM+Region selects 27% of all residues. Glyoxalase I catalyzes the glutathione dependent conversion of

2-oxoaldehydes to 2-hydroxycarboxylic acids. Each of its subunits binds glutathione, substrate, and zinc.

In addition to the catalytic residue E172, the SVM method also finds the glutathione-binding residues

R37, E99, and R122, the zinc-binding residue H126, plus some interfacial residues. Another example is

arginine kinase (1BG0), which catalyzes phosphate transfer between arginine and ATP and thus its site

binds arginine, phosphate and MgADP. The predicted residues surround the site of interaction between

the arginine, the ATP, and the reacting phosphate group.

Still other cases of high filtration ratio result from small enzyme size. Enzyme IIA-lactose (1E2A), a part

of the lactose/cellobiose-specific family of enzymes II in the sugar phosphotransferase system, is such an

example. It has only 36 ionizable residues. For this protein the SVM method found two clusters of three

ionizable residues each, one of which is the known active site, giving a filtration ratio of 17% of ionizable

residues. The all-residue analysis yielded a 21% filtration ratio. In this case, the site of interaction truly

constitutes a high fraction of the total residues in the structure.

3.5.4 Some specific examples.

Type I 3-Dehydroquinate dehydratase from Salmonella typhi (PDB ID 1QFE) is an important enzyme

found in plants and microorganisms. It functions as part of the shikimate pathway, which is essential for

the biosynthesis of aromatic compounds including folate, ubiquinone and the aromatic amino acids. The

absence of the shikimate pathway in animals makes it an attractive target for antimicrobial agents. The

study of the structure, active sites and the reaction mechanism opens the way for the design of highly

specific enzyme inhibitors with potential importance as selective therapeutic agents 61. For this protein,

the SVM predicts a total of eight residues in two clusters, [E46, E86, D114, E116, H143, K170] and [D50,

H51]. The known catalytic residues are E86, H143, and K170 and are shown as red sticks in Figure 3.4.

The other five predicted residues are shown as yellow sticks. The three predicted catalytic residues, plus

one additional predicted residue (E46), are nearest neighbors of the reacting substrate molecule, as

determined by the LPC server 62. The remaining four residues are “second shell” residues, each of which

is in contact with at least two first shell residues. The elongated theoretical titration curves obtained for

the eight predicted residues reflect their membership in subsets of ionizable residues with similar pKa’s in

close proximity.

Phosphatidylinositol-specific phospholipase C exists in both eukaryotic and prokaryotic cells.

Catalyzing the hydrolysis of 1-phosphatidyl-D-myo-inositol-3,4,5-triphosphate into the second messenger

molecules diacylglycerol and inositol-1,4,5-triphosphate, it plays an important role in signal transduction

and other important biological processes 63. This catalytic process is tightly regulated by reversible

phosphorylation and binding of regulatory proteins. The SVM prediction for

Phosphatidylinositol phospholipase C from Listeria monocytogenes (PDB ID 2PLC) is shown in Figure

3.5. The correctly predicted active site residues are shown as red sticks and the additional predicted

residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region

selection, are shown in purple. Note how the additional (yellow) residues occupy a second layer

immediately surrounding the known true positive residues (red and purple).

Figure 3.4. SVM prediction for Type I 3-dehydroquinate dehydratase (PDB ID 1QFE). The SVM predicts a total of eight residues in two clusters, among which three known catalytic residues are shown as red sticks; the other five residues are shown as yellow sticks.

Figure 3.5. The SVM prediction for phosphatidylinositol phospholipase C (PDB ID 2PLC). The correctly predicted active site residues are shown as red sticks and the additional predicted residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region selection, are shown in purple.

3.6 Conclusions.

The THEMATICS-SVM method is a relatively straightforward method. Combining SVM with

THEMATICS achieves a higher recall rate for catalytic residues than earlier THEMATICS analyses with

only a small sacrifice in precision. Precision rates for THEMATICS predictions tend to be considerably

better than for other methods. The more localized, more specific predictions offer enhanced usefulness for

applications such as functional classification and specific ligand design.

The set of residues with perturbed titration behavior is a small subset of the set of ionizable residues with

strong electrostatic interactions or shifted pKa’s. Thus THEMATICS is more selective than simple

identification of the strongly interacting residues. A previous study 64 showing that titration-based

methods give a large number of false positives was based on the less selective electrostatic properties.

The present study confirms the comparatively low false positive rate for THEMATICS.

SVM+Region, the extended version of THEMATICS-SVM that incorporates residues within a 6Å sphere

of each predicted residue, does deliver improved recall, but with a sacrifice in precision. The major

advantage of THEMATICS over other 3D-structure-based methods is the superior precision;

SVM+Region loses most of this advantage.

THEMATICS selects ionizable residues with strong coupling between their protonation events. These

localized networks of interacting residues are good predictors of active sites. This feature may be a

fundamental property of enzyme active sites and may be a factor to consider in protein engineering.

3.7 Next step.

Although applying SVM with THEMATICS gave us a system predicting protein active sites with better

sensitivity and specificity, there is still room to improve. One possible approach would be using a larger

and better annotated training set and fine-tuning the SVM with different kernels and parameters. Most

likely, this approach would improve the performance of the THEMATICS-SVM method, but there are

limitations. The fundamental basis for this approach, as well as other methods using THEMATICS, is the

use of a binary classification system to select the THEMATICS-positive residues and cluster them based

on their physical proximity. If we could develop a new system that can rank order the residues based on

their likelihood of being in the active site, it would be much easier for a user to decide how far down the

list to go. Furthermore, if these rankings are based on probability estimates of each residue being in the

active site, we have principled ways to further combine the results from different systems to obtain a more

powerful system. So in the next chapter, I take a brand new approach and develop the Partial Order

Optimal Likelihood (POOL) method, which yields probability estimates rather than binary classification

decisions.

Chapter 4

New Method: Partial Order Optimal Likelihood (POOL).

4.1 Ways to estimate class probabilities.

For convenience, I use class probability to refer to $P(C = c \mid X_1 = x_1, \ldots, X_n = x_n)$, which is the probability of an instance with attributes $X_1$ to $X_n$ having values of $x_1$ to $x_n$, respectively, belonging to class c. Notice that it is different from another commonly used term, class-conditional probability, meaning the probability $P(X_1 = x_1, \ldots, X_n = x_n \mid C = c)$. I refer to any method that estimates class probabilities

as a class probability estimator (CPE). I distinguish two classes of CPEs: One is parametric, which

involves selecting a function to model the class probabilities of training data, and then applying a learning

process to select the parameters of such a function giving the best fit to the training data. Examples are the

logistic regression method 65 and neural networks. The other class is nonparametric and includes the novel

POOL method we describe below. There is no pre-defined function in such nonparametric CPEs, and the

learning process estimates the class probabilities directly from the training data. I believe the

nonparametric CPEs are more general and I will focus my discussion on this class. There are different

nonparametric CPEs, with different ways of computing and representing the probability estimates. I will

introduce four of them as follows.

4.1.1 Simple joint probability table look-up.

The most conceptually straightforward way is to use a lookup table, having the same number of

dimensions as the number of attributes used to represent the data. The values of each feature are

quantized into small intervals. The probability estimates are calculated by taking the ratio between the

numbers of instances in each class to the total number of instances with each feature value combination.

The complete table has to be stored as the representation of the probability estimates. Clearly, both

computation and representation will be exponential in terms of the number of features. The quantization

is clearly necessary if the attributes are distributed over a continuum.
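A minimal sketch of such a lookup table for a two-class problem is given here. The cell width, the function names, and the toy data are assumptions made for illustration; the table stores, for each occupied cell, the fraction of training instances in class 1.

```python
import math
from collections import defaultdict


def quantize(x, width):
    """Map a real-valued attribute vector to the index of its quantization cell."""
    return tuple(math.floor(v / width) for v in x)


def fit_lookup_table(instances, labels, width=0.5):
    """Estimate P(C = 1 | cell) as a ratio of counts, one entry per occupied cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [count of class 0, count of class 1]
    for x, c in zip(instances, labels):
        counts[quantize(x, width)][c] += 1
    return {cell: n1 / (n0 + n1) for cell, (n0, n1) in counts.items()}


def estimate(table, x, width=0.5):
    """Look up the estimate for x; None means the cell never occurred in training."""
    return table.get(quantize(x, width))


# Toy training set (hypothetical): two attributes, labels 0/1
X = [(0.1, 0.2), (0.3, 0.4), (0.9, 0.9)]
y = [1, 0, 1]
table = fit_lookup_table(X, y)
print(estimate(table, (0.2, 0.2)))  # 0.5
print(estimate(table, (0.1, 0.9)))  # None
```

The `None` result for an unseen cell is exactly the weakness of this approach: with d attributes the number of cells grows exponentially in d, so most cells are never populated by a finite training set.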

If we had an unlimited set of training data and it had the exact same distribution as that of the actual test

data, this would be the best learning system, because there is no information loss caused by any model

restriction. But in reality, the training set is always limited and there will be attribute combinations in the

test data that have never appeared in the training set. Because of this limitation, such a lookup table has

seldom been a realistic choice in any real problems. Some model abstraction has to be done to make

generalization from observed training instances to unknown test instances possible.

4.1.2 Naïve Bayes method.

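Assuming conditional independence, i.e.,

$$P(X_1 = x_1, \ldots, X_d = x_d \mid C = c) = \prod_{i=1}^{d} P(X_i = x_i \mid C = c), \qquad (15)$$

one can use Bayes’ theorem to estimate the class probability P(class | attributes) by:

$$P(C = c \mid X_1, \ldots, X_d) = \frac{P(C = c) \prod_{i=1}^{d} P(X_i \mid C = c)}{P(X_1, \ldots, X_d)} \propto P(C = c) \prod_{i=1}^{d} P(X_i \mid C = c). \qquad (16)$$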

P(C=c) is the probability of an instance being in one particular class c, and this probability can be estimated by the ratio of the number of instances in c over the total number of instances in the training examples. $P(X_i \mid C = c)$ can be estimated as the ratio of the number of instances belonging to class c with the corresponding attribute value over the number of instances in c; both numbers are obtained by simple counting. $P(X_1, \ldots, X_d)$ is independent of class C, and it can be treated as a constant. Since the sum of the class probabilities over all classes is 1, one can easily solve for the probability of each class, knowing their relative ratios. This method is linear in both computation and representation of the probability estimates in terms of the number of attributes d, since it only needs to compute and store the $P(X_i \mid C)$ and $P(C)$ estimates.

This is an indirect method in the sense that it indirectly estimates class probability P(class | attributes) by

estimating class-conditional probability P(attributes | class) at first and uses conditional independence to

convert it. This method also requires quantization, unless some parametric form of the probability density

function is used.
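The counting procedure behind equation (16) can be sketched as follows for already-quantized (discrete) attributes. The function names and the toy data are hypothetical, and no smoothing of zero counts is applied, which a practical implementation would probably want.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(instances, labels):
    """Collect the counts needed for P(C) and P(X_i | C)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> counts of values
    for x, c in zip(instances, labels):
        for i, v in enumerate(x):
            value_counts[(c, i)][v] += 1
    return class_counts, value_counts


def class_probabilities(x, class_counts, value_counts):
    """Evaluate P(C = c) * prod_i P(X_i = x_i | C = c) per class, then normalize to sum to 1."""
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n                                # estimate of P(C = c)
        for i, v in enumerate(x):
            score *= value_counts[(c, i)][v] / n_c     # estimate of P(X_i = v | C = c)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total > 0 else scores


# Toy data (hypothetical): two discrete attributes, labels 0/1
X = [("hi", "near"), ("hi", "far"), ("lo", "near"),
     ("lo", "far"), ("lo", "far"), ("hi", "far")]
y = [1, 1, 1, 0, 0, 0]
model = fit_naive_bayes(X, y)
print(round(class_probabilities(("hi", "far"), *model)[1], 3))  # 0.4
```

Only one small table per attribute and class is stored, which is why the cost is linear in d rather than exponential as for the full lookup table.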

Although I do use a naïve Bayes method to combine probabilities from different types of features in the

application, it is not enough because it is not apparent how one can apply the THEMATICS principles in

this method, since the constraints themselves are on class probabilities instead of the class-conditional

probabilities. Furthermore, because some of the features used in THEMATICS are clearly correlated, the

conditional independence assumption is violated, which makes the naïve Bayes probability estimates

suspect.

4.1.3 The K-nearest-neighbor method.

The third method to introduce is the k-nearest-neighbor method. It estimates the class probability P(class

| attributes) by taking the k training instances with the “closest” attributes to the query points and then

counting how many of them are in that particular class. This is a direct method since it gives the direct

estimate of class probability and it does this estimation without direct quantization. In some sense, it

implicitly quantizes the attributes by taking the range of the values from the nearest neighbors. This

method has a lazy evaluation feature since the estimates of a certain feature combination are calculated at

query time. All the training instances have to be stored, and unless some indexing scheme is used, the

computation and representation cost of this method is linear in the number of training instances.
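A short sketch of the k-nearest-neighbor estimate follows. It assumes plain Euclidean distance, which the text does not prescribe (a weighted distance could be substituted), and the function name and toy data are hypothetical.

```python
import math


def knn_class_probability(query, instances, labels, k, target=1):
    """Estimate P(C = target | query) as the fraction of the k nearest training instances in that class."""
    if not 1 <= k <= len(instances):
        raise ValueError("k must be between 1 and the number of training instances")
    # Lazy evaluation: all distances are computed at query time
    order = sorted(range(len(instances)), key=lambda i: math.dist(query, instances[i]))
    nearest = order[:k]
    return sum(1 for i in nearest if labels[i] == target) / k


# Toy data (hypothetical)
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
y = [1, 1, 0, 0, 0]
print(round(knn_class_probability((0.2, 0.1), X, y, k=3), 3))  # 0.667
```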

This method can estimate the class probabilities when there are no matches between the attributes of the

query instance and the training instances, and with properly selected k, it may even reduce some of the

negative effect caused by noise in the training data, but there is no apparent way to apply THEMATICS

principles with this method. Furthermore, correlation in the THEMATICS features is a potential problem,

although a correctly weighted distance function can compensate for this.

4.1.4 POOL method.

The fourth method is the one proposed here, the Partial Order Optimal Likelihood (POOL) method, the

details of which will be introduced in the next section.

It finds the “best”, in the sense of maximizing the likelihood, among all possible estimates of the class

probabilities P(class | attributes) that satisfy given monotonicity assumptions. It uses convex optimization

to update the probability estimates of the training instances. It stores these probability estimates as

reference points and computes the probability estimates of query instances at query time, although in

principle one can compute the whole table and store it. With lazy evaluation, this method has a linear

computation and representation cost in terms of the number of training instances.

It is a direct method, since it estimates P(class | attributes) directly. In principle it does not need

quantization, although in our application, we have used quantization for convenience. It is a learning

system that makes full use of training data and enforces some prior belief, like the THEMATICS

principles, to minimize the effect caused by noise in the training set. The POOL method is conceptually

simple, computationally efficient, and appears to be effective in practice, as will be shown below.

4.1.5 Combining CPE’s.

Although one can put all the features together and use one CPE to estimate class probability in one step, it

may not be the best choice in practice. In most cases, it is better to group features into different groups

and use several, say l, CPE’s, each with one group of features $\vec{X}_i$. Each of these smaller CPE’s can be obtained by any of the methods described above, as well as any parametric methods. At query time, the class probability $P(C = c \mid \vec{X}_1, \ldots, \vec{X}_l)$ can be estimated based on the class probability estimates $P(C = c \mid \vec{X}_i)$ from each of the smaller CPE’s according to the naïve Bayes combination rule:

$$P(C = c \mid \vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l) \propto P(C = c) \prod_{i=1}^{l} \frac{P(C = c \mid \vec{X}_i = \vec{x}_i)}{P(C = c)} \qquad (17)$$

which is easily derived from Bayes’ Theorem under the assumption that $\{(\vec{X}_i \mid C)\}_{i=1}^{l}$ are mutually independent random variables, i.e.,

$$P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l \mid C = c) = \prod_{i=1}^{l} P(\vec{X}_i = \vec{x}_i \mid C = c). \qquad (18)$$

We use the term chaining to mean the application of this naïve Bayes combination rule.

Although strictly speaking, one can only use the chaining to estimate the class probability when the

conditional independence assumption holds, in practice, even if the conditional independence assumption

is not strictly true one might still be able to get some useful results, as with other applications of naïve

Bayes, especially when it comes to relative rankings 66.
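For a two-class problem, the chaining rule of equation (17) can be sketched as below. The function name is an assumption; the rule is evaluated for both classes and the two scores are normalized so that they sum to 1. The prior must lie strictly between 0 and 1.

```python
def chain(prior, estimates):
    """Combine per-group estimates P(C = 1 | X_i = x_i) into one class probability.

    prior: estimate of P(C = 1).
    estimates: one P(C = 1 | X_i = x_i) value per feature group.
    """
    score_pos = prior
    score_neg = 1.0 - prior
    for p in estimates:
        score_pos *= p / prior                   # class 1 factor of equation (17)
        score_neg *= (1.0 - p) / (1.0 - prior)   # class 0 factor of equation (17)
    return score_pos / (score_pos + score_neg)


# Two groups that each report 0.5 for a class whose prior is only 0.1
print(round(chain(0.1, [0.5, 0.5]), 3))  # 0.9
```

In this example each group on its own raises the probability from the 0.1 prior to 0.5, and under the independence assumption the two pieces of evidence reinforce each other. When the groups are in fact correlated the combined value is likely to be overconfident, although the relative ranking of instances may still be useful.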

4.2 POOL (Partial Order Optimal Likelihood) method in detail.

4.2.1 Maximum likelihood problem with monotonicity assumption.

Before introducing POOL, I first introduce the notion of a class probability monotonicity assumption:

Definition 1: Class Probability Monotonicity Assumption:

Class Probability Monotonicity Assumption refers to a property that for each class c, given a partial order $\preceq_c$ on the attribute space Χ, for any $x_i, x_j \in$ Χ with $x_i \preceq_c x_j$, it must be that $P(C = c \mid X = x_i) \le P(C = c \mid X = x_j)$. For 2-class classification, $\preceq_-$ is the opposite of $\preceq_+$, i.e., $x_i \preceq_+ x_j$ iff $x_j \preceq_- x_i$.

An important special case that inspired my development of this approach is when the attribute space has the form $\mathbb{R}^n$. In this case, we can define a partial order on Χ as follows:

Given $x = (x_1, \ldots, x_n) \in$ Χ and $y = (y_1, \ldots, y_n) \in$ Χ, define $x \preceq y$ if $x_i \le y_i$ for all $i$.

We call this the coordinatewise partial order on Χ induced by the ordering on $\mathbb{R}$.
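The coordinatewise partial order is easy to state in code. The sketch below (function names are assumptions) tests whether one point precedes another and lists all ordered pairs of training points that are comparable; each such pair would contribute one inequality constraint under the monotonicity assumption.

```python
def precedes(x, y):
    """True if x comes before or equals y in the coordinatewise partial order."""
    return len(x) == len(y) and all(a <= b for a, b in zip(x, y))


def comparable_pairs(points):
    """Index pairs (i, j), i != j, such that points[i] precedes points[j]."""
    return [(i, j)
            for i, x in enumerate(points)
            for j, y in enumerate(points)
            if i != j and precedes(x, y)]


# Toy points (hypothetical): the third is incomparable with the other two
points = [(1, 2), (2, 3), (3, 1)]
print(comparable_pairs(points))  # [(0, 1)]
```

Points such as (1, 2) and (3, 1) are incomparable, so the order is only partial and no constraint links their class probabilities. Two identical points yield both (i, j) and (j, i), which forces their estimates to be equal.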

The basic idea of POOL is to find a CPE for which the training data likelihood is highest among all CPEs

that conform to the monotonicity assumptions. The effect of the monotonicity assumptions is to create a

set of inequality constraints that relate the class probabilities of certain pairs of points.

Definition 2: Likelihood function L(H) of hypothesis H.

Assume we are given a hypothesis space containing a family of hypotheses H, i.e., probability density functions (for continuous distributions) or probability mass functions (for discrete distributions), and $\vec{X}_1, \ldots, \vec{X}_n$ as n random draws with an actual sample $\vec{x}_1, \ldots, \vec{x}_n$. Since they are iid, by definition:

$$P(\vec{X}_i = \vec{x} \mid H) = P(\vec{X}_j = \vec{x} \mid H), \quad \text{for any } i, j, \vec{x}, H,$$

and for each hypothesis H, we may compute the probability density of observing $\vec{x}_1, \ldots, \vec{x}_n$ as a function of H,

$$L(H) = P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_n = \vec{x}_n \mid H). \qquad (19)$$

Now, restricting attention to 2-class problems, where the class labels are 0 and 1, I define $p_i = P(C_i = 1 \mid H, \vec{X}_i = \vec{x}_i)$, and we can write

$$P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i) = p_i^{c_i} (1 - p_i)^{1 - c_i}. \qquad (20)$$

Then,

$$L(H) = \prod_{i=1}^{n} P(\vec{X}_i = \vec{x}_i, C_i = c_i \mid H) = \prod_{i=1}^{n} P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i)\, P(\vec{X}_i = \vec{x}_i). \qquad (21)$$

After substitution of (20), the likelihood function in our problem becomes:

$$L(H) = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}\, P(\vec{X}_i = \vec{x}_i) \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}. \qquad (22)$$
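In practice one usually works with the logarithm of the product in equation (22), which turns it into a sum. The sketch below (the function name is an assumption) evaluates this log-likelihood for a list of probability estimates and 0/1 labels, and compares two candidate assignments for a toy pair of training points whose estimates must satisfy $p_1 \le p_2$.

```python
import math


def log_likelihood(p, c):
    """Sum over i of c_i * log(p_i) + (1 - c_i) * log(1 - p_i), for labels c_i in {0, 1}."""
    total = 0.0
    for p_i, c_i in zip(p, c):
        q = p_i if c_i == 1 else 1.0 - p_i   # probability assigned to the observed label
        if q == 0.0:
            return float("-inf")             # an observed label was given probability zero
        total += math.log(q)
    return total


# Toy case (hypothetical): labels (1, 0), with the constraint p_1 <= p_2
labels = [1, 0]
print(round(log_likelihood([0.5, 0.5], labels), 3))  # -1.386
print(round(log_likelihood([0.3, 0.7], labels), 3))  # -2.408
```

Without the constraint the likelihood would be maximized by p = (1, 0), which simply copies the labels. With $p_1 \le p_2$ imposed, the product $p_1 (1 - p_2)$ is at most $p_2 (1 - p_2) \le 1/4$, so tying both estimates at 0.5 is the best feasible choice in this toy case.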

Given a monotonicity assumption, finding maximum likelihood probability estimates becomes a

constrained optimization problem: Maximize L subject to a set of inequality constraints of the

form $p_i \le p_j$, one for each $(\vec{x}_i, \vec{x}_j)$ pair in the training data generating the partial order via transitive

closure. The solution of this problem assigns probability estimates to only the training data. For attribute

combinations not observed in the training data, the monotonicity constraints only determine upper and

lower bounds on their class probabilities; one can then use some form of interpolation to assign actual

estimates.
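The bounds for an unobserved attribute combination can be sketched as follows, assuming the coordinatewise partial order and a set of already fitted training estimates. The function names are hypothetical, and the midpoint rule used for interpolation is only one possible choice; the text does not commit to a particular scheme.

```python
def precedes(x, y):
    """True if x comes before or equals y in the coordinatewise partial order."""
    return all(a <= b for a, b in zip(x, y))


def probability_bounds(query, points, probs):
    """Bounds on the class probability of `query` implied by monotonicity.

    Lower bound: largest estimate among training points that precede the query.
    Upper bound: smallest estimate among training points that the query precedes.
    """
    lower = max((p for x, p in zip(points, probs) if precedes(x, query)), default=0.0)
    upper = min((p for x, p in zip(points, probs) if precedes(query, x)), default=1.0)
    return lower, upper


def midpoint_estimate(query, points, probs):
    """One simple interpolation choice (an assumption): the middle of the interval."""
    lower, upper = probability_bounds(query, points, probs)
    return (lower + upper) / 2.0


# Toy fitted estimates (hypothetical)
points = [(1, 1), (3, 3)]
probs = [0.2, 0.8]
print(probability_bounds((2, 2), points, probs))  # (0.2, 0.8)
print(probability_bounds((4, 0), points, probs))  # (0.0, 1.0)
```

A query that is incomparable with every training point, such as (4, 0) here, receives only the trivial bounds 0 and 1, so the constraints alone say nothing about it.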

4.2.2 Convex optimization and K.K.T. conditions.

Given a real vector space X (see footnote 1) together with a convex, real-valued function $f: A \to \mathbb{R}$ defined on a convex subset A of X, convex optimization is the problem of finding the point x in A for which the number f(x) is smallest.

Convex optimization has been studied for a long time. It has many good properties; for example, any local minimum must also be a global minimum, so methods like gradient descent can solve the problem without the danger of being stuck in a local minimum that is not the global minimum. Many methods have been developed to solve it efficiently. There are standard methods like gradient projection methods, line-search methods, and interior-point methods, as well as specialized versions dedicated to specific problems of this form. How to solve convex optimization problems in general is still an active research area.

One way to find and prove the constrained optimal point in a convex optimization problem is to use the Karush-Kuhn-Tucker conditions (K.K.T. conditions), which generalize the method of Lagrange multipliers.

¹ In order to describe the convex optimization problem in its general form, I use X to denote the vector space, and x to denote a vector (or point) in that space, until further notice. As a general rule, whenever I discuss general optimization problems, I use x, while p is used in specializing to solve for probabilities, as in later sections.

K.K.T. conditions:

Consider the problem:

minimize f(x)

subject to $g_i(x) \le 0$ and $h_j(x) = 0$

where f(x) is the function to be minimized, $g_i(x)$ (i = 1,…,m) are the inequality constraints and $h_j(x)$ (j = 1,…,l) are the equality constraints.

Necessary conditions:

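Suppose that the objective function $f: \mathbb{R}^n \to \mathbb{R}$ and the constraint functions $g_i: \mathbb{R}^n \to \mathbb{R}$ and $h_j: \mathbb{R}^n \to \mathbb{R}$ are continuously differentiable at a point $x^* \in S$. If $x^*$ is a local minimum, then there exist constants $\lambda > 0$, $\mu_i \ge 0$ (i = 1,…,m) and $\nu_j$ (j = 1,…,l) such that

$$\lambda * \nabla f(x^*) + \sum_{i=1}^{m} \mu_i * \nabla g_i(x^*) + \sum_{j=1}^{l} \nu_j * \nabla h_j(x^*) = 0 \tag{23}$$

and

$$\mu_i\, g_i(x^*) = 0 \text{ for all } i = 1, \ldots, m. \tag{24}$$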

Sufficient conditions:

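If the objective function $f: \mathbb{R}^n \to \mathbb{R}$ and the constraint functions $g_i: \mathbb{R}^n \to \mathbb{R}$ and $h_j: \mathbb{R}^n \to \mathbb{R}$ are convex functions, the point $x^* \in S$ is a feasible point, and there exist $\mu_i \ge 0$ (i = 1,…,m) and $\nu_j$ (j = 1,…,l) such that

$$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i * \nabla g_i(x^*) + \sum_{j=1}^{l} \nu_j * \nabla h_j(x^*) = 0 \tag{25}$$

and

$$\mu_i\, g_i(x^*) = 0 \text{ for all } i = 1, \ldots, m, \tag{26}$$

then x* is the global minimum.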
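A one-dimensional worked example, chosen here purely for illustration and not drawn from the active site problem, shows how the sufficient conditions are checked. Both functions are convex, the candidate point is feasible, and a non-negative multiplier satisfying both conditions exists, so the candidate is the global minimum.

```latex
\begin{align*}
&\text{minimize } f(x) = (x - 1)^2 \quad \text{subject to } g(x) = x \le 0,\\
&\nabla f(x) = 2(x - 1), \qquad \nabla g(x) = 1,\\
&\text{at } x^* = 0:\quad \nabla f(x^*) + \mu\, \nabla g(x^*) = -2 + \mu = 0
  \;\Rightarrow\; \mu = 2 \ge 0,\\
&\mu\, g(x^*) = 2 \cdot 0 = 0 .
\end{align*}
```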

4.2.3 Finding Minimum Sum of Squared Error (SSE).

In addition to likelihood, sum of squared error is another commonly used measurement of how well the

model fits the data, and it is easier to work with than the likelihood L. In this section, I will present an

approach to compute minimum SSE under the monotonicity assumption, and in the next section, I will

prove that we can find the maximum of likelihood L by finding the minimum of SSE.

Definition 3: Sum of squared error (SSE). To estimate how close the estimated function, in our case the class probability estimate² $p(\vec{x})$, is to the observation of n $(\vec{x}, c)$ pairs, we compute

$$SSE = \sum_{i=1}^{n} (c_i - p_i)^2, \text{ where } p_i = p(\vec{x}_i). \tag{27}$$

Let $\vec{p} = (p_1, \ldots, p_n)$. SSE is a quadratic function of $\vec{p}$, and the class probability monotonicity assumptions form a set of linear inequality constraints. This problem is then a special case of convex optimization called a quadratic programming problem, itself a well-studied class of convex optimization problems. Indeed, training an SVM, a more recently developed machine learning method, also amounts to solving a quadratic programming problem.

I developed the POOL algorithm, described below, to find the solution giving the minimum SSE (and therefore maximum likelihood) under the monotonicity constraints. Very recently, I have discovered some existing literature describing an approach called isotonic or monotonic regression 67, but most of this work is focused on one-dimensional problems. There is also earlier literature focusing on the total order case; the pool adjacent violators algorithm (PAVA) 68 and monotonic smoothing 69 are such examples. Although some of these reports point out that the extension of this problem to multiple dimensions could be framed as convex optimization, the emphasis of the literature in this field seems to be primarily on one-dimensional problems.

² In this specific case of estimating class probabilities, I use $\vec{x}$ to denote the instances and p to denote the vector to which I try to assign values that minimize SSE.
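For comparison, the total order case mentioned above can be sketched as follows. This is a generic textbook-style version of the pool adjacent violators idea, not code from the cited reports; the function name `pava` and the optional `weights` argument (assumed strictly positive) are illustrative choices.

```python
def pava(values, weights=None):
    """Least-squares nondecreasing fit to `values` along a total order."""
    if weights is None:
        weights = [1.0] * len(values)
    # Each block stores [weighted sum, total weight, number of points].
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([v * w, w, 1])
        # Pool backwards while the last two block means violate the order.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, w_last, k = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += w_last
            blocks[-1][2] += k
    fit = []
    for s, w_block, k in blocks:
        fit.extend([s / w_block] * k)
    return fit
```

Each pooled block receives its weighted mean, which is the same averaging step that POOL applies to pools defined over a partial order.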

Compared with convex optimization in its general form, the present case is very special. First, I want to optimize a summation of terms each involving only one component of the vector, i.e., there are no cross products between different components of the vector. This feature leads to a very simple gradient of the objective function SSE: after rescaling coordinates, the direction toward the global optimum of SSE can be determined locally, by choosing each component variable $p_i$ to optimize the ith term of SSE, subject to the constraints being met.

The constraints are special, too. Each constraint is a linear inequality constraint containing two terms, in addition to the implicit constraints $0 \le p_i \le 1$ implied by the $p_i$ being probabilities. In formal terms, this is a sparse problem.

Another special feature of the present problem is that by a simple scaling of coordinates in space, the

negative gradient vector at any point points directly to the unconstrained global optimal point, i.e. if one

knows the “best improving” direction at one point, following that direction (in a straight line) will lead to

the optimal point.

We start from a point where all constraints are active, where active means that an inequality constraint is met with equality, i.e., the constraint $p_i \le p_j$ is actually satisfied by $p_i = p_j$. Because all the constraints are linear and the gradient never changes direction along the path from a point x to the unconstrained global optimal point O, we have a very special property: if the “best improving” direction moves “away” from an active constraint, that constraint will not become active again in the optimized solution. It is also true that if the “best improving” direction “runs into” an active constraint, the constraint will still be active in the final optimal solution. This makes it possible to determine, at the starting point, which constraints are active in the final solution, and once those are determined, it is easy to calculate the exact p that gives the constrained optimal value of the objective function $f(\vec{p})$. Since the optimal solution is achieved with these constraints active, meaning the corresponding probabilities are equal, all one needs to do is to partition the data into pools having the same (average) probability, as required by the active constraints, to get the exact solution.

These special conditions make it possible to develop some special algorithms, like the one that will be

presented in the next subsection, to efficiently and accurately solve this subclass of convex optimization

problem. In a more general form of convex optimization problem without these special conditions, a

solution can only be approximated by numeric methods to a certain degree of accuracy, and typically

involves re-computing active constraint sets as the algorithm iterates toward a solution.

4.2.4 POOL algorithm.

The input to the algorithm is the training set D and the constraint matrix $C_{n \times m}$. Each column in this matrix corresponds to a single constraint of the form $x_i \le x_j$ and contains two non-zero entries.

This algorithm consists of three steps:

• Determine the active set A, which consists of all the active constraints, at the starting point.

• Given A, compute the corresponding partitioning (pools) of data.

• For each pool, compute the average across all data in the pool and assign this common average

value to each instance in the pool.

POOL(D, $C_{n \times m}$)

• Initialize starting point S as origin.

• Compute $\vec{G} \leftarrow \alpha * \nabla f(S)$. (In our re-scaled case: $G_i \leftarrow n_{+\_i} / \sqrt{n_{total\_i}}$, where $n_{+\_i}$ and $n_{total\_i}$ are the number of positive samples and the total number of samples for the ith instance.)

• Initialize $\vec{\mu} \leftarrow \vec{0}$.

• Until termination condition* has been met:

    • Compute $\vec{H} \leftarrow C_{n \times m} \times \vec{\mu}$

    • Compute $\vec{F} \leftarrow \vec{G} - \vec{H}$

    • Compute $\Delta\vec{\mu} \leftarrow C_{n \times m}^{T} \times \vec{F}$

    • $\vec{\mu} \leftarrow \vec{\mu} + \alpha * \Delta\vec{\mu}$

    • If $\mu_i < 0$ then $\mu_i \leftarrow 0$ (i = 1,…,m)

• Build transitive closure of x:

    • Let each $x_i$ (i = 1,…,n) be in its own set.

    • For each $\mu_i$ (i = 1,…,m), if $\mu_i > 0$, look in the constraint matrix $C_{n \times m}$ to find the two non-zero entries $C_{ai}$ and $C_{bi}$ in the ith column and, if $x_a$ and $x_b$ are not already in the same set, union the sets containing $x_a$ and $x_b$ into a new set that replaces the two.

• Go through all the sets built in the above step. Set $p_i$ (i = 1,…,n) to be the sum of $n_+$ over all the x in the same set divided by the sum of all the $n_{total}$ in that set.

* The termination condition is the threshold set by the user to determine convergence of $\vec{F}$; in our program, the size of $\vec{F}$ is computed until the difference between two iterations is less than $10^{-9}$.

α is used as the step size to control the rate of change in moving toward the improving direction; in our case, it is set as 0.05.
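The steps above can be sketched in Python as follows. This is a minimal illustration rather than the program used in this work: the name `pool`, the arguments `n_pos` and `n_total` (per-instance positive and total counts, with every total assumed positive), the representation of each constraint as a pair `(a, b)` meaning $p_a \le p_b$, and the `max_iter` safeguard are all assumptions made for the example. The sketch also leaves $\vec{G}$ unscaled by α, which changes the multipliers by a constant factor but not which of them are positive.

```python
import math

def pool(n_pos, n_total, constraints, alpha=0.05, tol=1e-9, max_iter=100_000):
    """Sketch of POOL; constraints holds (a, b) pairs meaning p[a] <= p[b]."""
    n, m = len(n_total), len(constraints)
    root = [math.sqrt(t) for t in n_total]
    # Rescaled vector G: unconstrained optimum as seen from the origin.
    g = [n_pos[i] / root[i] for i in range(n)]
    mu = [0.0] * m
    prev_size = None
    for _ in range(max_iter):
        # F = G - C @ mu, touching only the two non-zero entries per column.
        f = list(g)
        for k, (a, b) in enumerate(constraints):
            f[a] -= mu[k] / root[a]
            f[b] += mu[k] / root[b]
        size = math.sqrt(sum(v * v for v in f))
        if prev_size is not None and abs(size - prev_size) < tol:
            break
        prev_size = size
        # mu <- max(0, mu + alpha * C^T @ F): a projected gradient step.
        for k, (a, b) in enumerate(constraints):
            step = f[a] / root[a] - f[b] / root[b]
            mu[k] = max(0.0, mu[k] + alpha * step)
    # Union-find over the constraints whose multipliers stayed positive.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for k, (a, b) in enumerate(constraints):
        if mu[k] > 0.0:
            parent[find(a)] = find(b)
    # Each pool gets (total positives) / (total samples).
    pos, tot = {}, {}
    for i in range(n):
        r = find(i)
        pos[r] = pos.get(r, 0) + n_pos[i]
        tot[r] = tot.get(r, 0) + n_total[i]
    return [pos[find(i)] / tot[find(i)] for i in range(n)]
```

For example, three singleton instances with labels 1, 0, 1 chained as $p_0 \le p_1 \le p_2$ would have the first two pooled, while the third keeps its own estimate. A fixed step size of 0.05 is only safe when the constraint graph is not too dense, so a real implementation may need a smaller step or a line search.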

In the above algorithm, the first four steps compute the active set A by solving the dual problem of determining the $\mu_i$ minimizing $\|\vec{G} - \sum_{i=1}^{m} \mu_i * \vec{C}_i\|^2$, subject to $\mu_i \ge 0$ for all i. Constrained gradient descent such as gradient projection is used to solve this. The constraint $g_i$ is active iff $\mu_i > 0$.

The reason that I can apply the active set A determined at the starting point at the solution point is the fact

that I rescale the coordinate system, thus preventing the distortion of spherical gradient by different

weights of each term in the objective function.

The original objective function is

$$SSE = \sum_{i=1}^{n} \left( n_{total\_i} * p_i^2 - 2\, n_{+\_i}\, p_i + n_{+\_i} \right). \tag{28}$$

After applying the appropriate transformation

$$p_i' = \sqrt{n_{total\_i}} * p_i, \tag{29}$$

the objective function becomes

$$SSE = \sum_{i=1}^{n} \left( p_i'^2 - 2\, \frac{n_{+\_i}}{\sqrt{n_{total\_i}}}\, p_i' + n_{+\_i} \right), \tag{30}$$

which has spherical level surfaces, since all quadratic terms have the same coefficient.

Usually, n and m are large numbers, but $C_{n \times m}$ is sparse because most of its entries are 0. To improve efficiency in storage and computing, I use the indices of the non-zero terms of $C_{n \times m}$ instead of the matrix itself. In practice, my implementation of the POOL algorithm runs very quickly.

4.2.5 Proof that the POOL algorithm finds the minimum SSE.

In this subsection, I will prove that the POOL algorithm gives the minimum SSE solution under the

monotonicity constraints.

In the present case, the objective function is a quadratic function of x,³ the constraint functions $g_i$ are linear functions and there are no equality constraints $h_j$, so both the necessary and sufficient KKT conditions hold. This also implies that, if one can find $\mu_i \ge 0$ (i = 1,…,m) satisfying the above two equations, one finds the global minimum x*.

Since there are no equality constraints in the present problem, one does not have to consider $\nu_j$ and the last term in the K.K.T. equations (23) and (25) above. Here I show my approach to finding $\mu_i$.

³ In order to make the proof consistent with the form stated in section 4.2.2, I use x instead of p as in section 4.2.3; the function value we are minimizing is still SSE.

First, notice that all of the constraints can be put into two categories: one is $0 \le x_i \le 1$ for all i = 1,…,n; the other is $x_i \le x_j$ for some i and j. The first category is automatically satisfied from the starting point to the end during the optimization process. There is an interesting feature for the m constraints of the second category: all the bounding hyperplanes of the feasible region intersect at a line where the $x_i$ are equal for all i = 1,…,n. Our algorithm takes one point on this line, where all $x_i$ are equal to 0, as the starting point.

The outward pointing normals of the m constraints at the starting point form a convex cone. There is also a vector $\vec{G}$ at that point, pointing to the global minimum of the objective function when there are no constraints. $\vec{G}$ is the negative gradient of the objective function at x.

As mentioned earlier in 4.2.3, the present problem has a special feature that by a simple scaling of coordinates in space, the negative gradient vector $\vec{G}$ becomes a constant vector $\vec{O}$ (pointing to where the unconstrained minimum of f(x) is located) minus the vector $\vec{X}$, where $\vec{X}$ is the vector from the origin to the point x,

$$\nabla f(x) = \vec{X} - \vec{G}. \tag{31}$$

So, the key point is to find, at the starting point S, the active constraints in the final solution. Without loss of generality, and for convenience in stating the proof, we assume the starting point S is located at the origin. As Figure 4.1 shows, there are three cases based on where $\vec{G}$ points.

[Figure 4.1, panels (a), (b) and (c): each panel shows the starting point S, the constraint vectors $\vec{C}_1$ and $\vec{C}_2$, the vector $\vec{G}$ and the unconstrained optimum O*. In panel (a) x* coincides with O*; in panel (b) x* coincides with S; panel (c) also shows $\vec{H}$ and x*.]

Figure 4.1. Three cases of $\vec{G}$ in relation to the convex cone of constraints. Define the convex cone as $\{\vec{V} \mid \vec{V} = \sum_i w_i \vec{C}_i \text{ with } w_i \ge 0\ \forall i\}$. $\vec{G}$ is the negative gradient of the objective function at starting point S, pointing to the global optimum O*. $\vec{C}_1$ and $\vec{C}_2$ are the constraint vectors forming the convex cone. $\vec{H}$ is the projection of $\vec{G}$ on the convex cone formed by the constraint vectors and x* is the solution that gives the constrained optimum for the objective function. 4.1(a) shows the case where the unconstrained global minimum is located in the feasible region; 4.1(b) shows the case where the unconstrained global minimum is located inside the convex cone formed by the constraint normals; 4.1(c) shows the case where the unconstrained global minimum is located outside both the feasible region and the convex cone formed by the constraint normals.

Figure 4.1(a) shows the special case where the unconstrained optimum O is located in the feasible region, meaning $\vec{O}^*$ points at O. This is the simplest scenario. All one needs to do is to compute $\vec{G}$ and obtain x* by adding $\vec{G}$ to $\vec{S}$.

It is trivial to show that x* satisfies the KKT sufficient condition by letting $\mu_i = 0$ (i = 1,…,m): since x* is where the unconstrained global optimum O is, $\nabla f(x^*)$ is 0. Another way to look at it is that a global optimum is always a local optimum.

Figure 4.1(b) shows another special case where all constraints have to be active to get the local minimum, and the starting point S happens to be the optimal point x*; in other words, there is a non-negative assignment of $\mu_i$ (i = 1,…,m) that makes $\vec{G} = \sum_{i=1}^{m} \mu_i * \vec{C}_i$, where $\vec{C}_i$ is the outward normal of the half-space defined by the ith inequality constraint $g_i(x) \le 0$, i.e., $g_i(x) = \vec{C}_i \cdot \vec{x} \le 0$, and $\vec{x}^* = \vec{X}^* + \vec{S}$. Clearly, x* is feasible because it is the origin and $\vec{C}_i \cdot \vec{x}^* = 0$. The proof that x* satisfies the sufficient K.K.T. condition is the same as the more general proof for Figure 4.1(c), just with the fact that x* = S and $\vec{X}^* = 0$.

Figure 4.1(c) is the general case, in which the negative gradient $\vec{G}$ cannot be expressed as a linear combination of the $\vec{C}_i$ with non-negative coefficients $\mu_i$ (i = 1,…,m). In this case, among all possible non-negative assignments of $\mu_i$ giving $\vec{H} = \sum_{i=1}^{m} \mu_i * \vec{C}_i$, one finds the specific one that gives the minimum distance between the tip of this $\vec{H}$ vector and O. That is, we seek $\vec{H}$ of this form minimizing the length of the vector $\vec{G} - \vec{H}$. We then set $\vec{x}^* = \vec{S} + \vec{G} - \vec{H}$.

First, we show that x* is feasible, i.e., $\vec{C}_i \cdot \vec{X}^* = g_i(x^*) \le 0$ (i = 1,…,m), where vector $\vec{G}$ is the same as the vector $\vec{X}^*$.

Since $\vec{x}^*$ is located at $\vec{S} + \vec{G} - \vec{H}$ and $\vec{S}$ is the origin, we get $\vec{X}^* = \vec{G} - \vec{H}$. Note that $\mu_i$ is special in that it is non-negative and it gives the minimum length of the vector $\vec{G} - \vec{H}$, which is the same as the length of $\vec{X}^*$.

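We have

$$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 0 \text{ when } \mu_i > 0 \tag{32}$$

and

$$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} \ge 0 \text{ when } \mu_i = 0 \quad (i = 1, \ldots, m). \tag{33}$$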

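Since

$$\vec{X}^* = (\vec{G} - \vec{H}) = \Big(\vec{G} - \sum_{i=1}^{m} \mu_i * \vec{C}_i\Big), \tag{34}$$

we have

$$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 2 * \vec{C}_i \cdot \Big(\sum_{k=1}^{m} \mu_k * \vec{C}_k - \vec{G}\Big) = -2 * \vec{C}_i \cdot \vec{X}^*. \tag{35}$$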

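Substituting (35) in (32) and (33), we get

$$\vec{C}_i \cdot \vec{X}^* = 0 \text{ when } \mu_i > 0 \tag{36}$$

and

$$\vec{C}_i \cdot \vec{X}^* \le 0 \text{ when } \mu_i = 0 \quad (i = 1, \ldots, m). \tag{37}$$

Combining (36) and (37), we get $g_i(x^*) = \vec{C}_i \cdot \vec{X}^* \le 0$ (i = 1,…,m), i.e., $\vec{x}^*$ is feasible.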

Next, we show $\nabla f(x^*) + \sum_{i=1}^{m} \mu_i * \nabla g_i(x^*) = 0$. The following has already been shown in (31):

$$\nabla f(x^*) = \vec{X}^* - \vec{G} = -\vec{H} \tag{31}$$

and

$$\sum_{i=1}^{m} \mu_i * \nabla g_i(x^*) = \sum_{i=1}^{m} \mu_i * \nabla (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i * \vec{C}_i = \vec{H}. \tag{38}$$

Thus, we have

$$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i * \nabla g_i(x^*) = 0. \tag{39}$$

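Last, we show that $\mu_i * g_i(x^*) = 0$ (i = 1,…,m). Geometrically, since $\vec{X}^* = \vec{G} - \vec{H}$ has the minimum size, $\vec{X}^*$ must be orthogonal to $\vec{H}$, i.e., $\vec{H} \cdot \vec{X}^* = 0$. Substituting $\vec{H}$ with $\sum_{i=1}^{m} \mu_i * \vec{C}_i$, we have

$$\Big(\sum_{i=1}^{m} \mu_i * \vec{C}_i\Big) \cdot \vec{X}^* = \sum_{i=1}^{m} \mu_i * (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i * g_i(x^*) = 0. \tag{40}$$

Given that $g_i(x^*) \le 0$ and $\mu_i \ge 0$ (i = 1,…,m), every term in the sum is non-positive, so (40) holds only when $\mu_i * g_i(x^*) = 0$ for all i from 1 to m.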

This completes the proof that x* satisfies the sufficient part of the K.K.T. Condition, and verifies that x* is

the minimum of the objective function under the monotonicity assumptions.

I could have used x* described above to find the probability assignment of every data point in the table that gives the minimum sum of squared error, but in the present program I actually use $\mu_i$ to find the active constraints that the optimal solution should satisfy and compute x* based on that, by combining all instances in the groups under equality constraints and assigning the corresponding p values to be the total positive instances observed divided by the total instances in the group. This procedure is more accurate than computing from $\vec{S} + \vec{G} - \vec{H}$. This new way of computing x* generates the same x* as before, since it gives the minimum SSE assignments of p for each group of active constraints.

4.2.6 Maximum likelihood vs. minimum SSE.

Theorem 1. In this problem of finding estimated probabilities under monotonicity constraints, a value of $\vec{p}$ that minimizes SSE will also maximize the likelihood L subject to the same monotonicity assumption.⁴

While this is well-known when there are no constraints, it is not true under arbitrary constraints; what I

show is that it also holds under the particular constraints used here.

As defined earlier in (22), the likelihood function L is defined as follows for binary problems:

$$L(\vec{P}) = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}\, P(\vec{X}_i = \vec{x}_i) \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i} \tag{22}$$

Since the present problem only involves assigning $p_i$, given $\vec{X}_i = \vec{x}_i$, I will only use the $\prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$ part from now on for $L(\vec{P})$.

Adopting the convention that $0^0 = 1$, each factor of $L(\vec{P})$ can be rewritten as:

$$L_i(p_i, c_i) = p_i^{c_i} * (1 - p_i)^{1 - c_i} \quad (0 \le p_i \le 1)$$

Before proving that the solution from minimizing SSE also maximizes $L(\vec{P})$, I prove the following lemma.

Lemma 1. Suppose we wish to maximize a function having the form:

$$F(\vec{P}) = F(p_1, \ldots, p_m, p_{m+1}, \ldots, p_n) = F_1(p_1, \ldots, p_m) * F_2(p_{m+1}, \ldots, p_n),$$

where $\vec{P}$ is partitioned into two groups, $p_1, \ldots, p_m$ and $p_{m+1}, \ldots, p_n$, subject to a given set of constraints.

If an assignment $\vec{P}^*$, namely $(p_1 = p_1^*, \ldots, p_m = p_m^*, p_{m+1} = p_{m+1}^*, \ldots, p_n = p_n^*)$, maximizes $F_1$ and $F_2$ under the same constraints, then $\vec{P}^*$ maximizes $F(\vec{P})$ under the same constraints.

⁴ Since this is to prove the same solution minimizing SSE will also maximize L, and I use this in the special probability estimate problem presented here, I once again switch to p and $\vec{P}$ to denote the vector variable and the vector space respectively instead of using x and $\vec{X}$ as earlier.

This lemma is readily proved by contradiction. If there is another assignment $\vec{P}'$ under the same constraints that gives $F(\vec{P}') > F(\vec{P}^*)$, then give the same assignments of $\vec{P}'$ to $F_1$ and $F_2$. Then at least one of the following has to be true: $F_1(\vec{P}') > F_1(\vec{P}^*)$ or $F_2(\vec{P}') > F_2(\vec{P}^*)$. But if either one of them is true, the assumption that $\vec{P}^*$ maximizes $F_1$ and $F_2$ under those constraints is violated.

With Lemma 1 proved, I can prove Theorem 1. Based on the values of $p_i$ from the minimum SSE solution $\vec{P}^*$, I break equation (22) into two parts, after moving the factors with $p_i = 0$ or $p_i = 1$ to the end of the equation:

$$L(\vec{P}) = L_1(\vec{P}) * L_2(\vec{P}),$$

where

$$L_1(\vec{P}) = \prod_{i=n'+1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}, \text{ where } p_i = 0 \text{ or } p_i = 1,$$

and

$$L_2(\vec{P}) = \prod_{i=1}^{n'} p_i^{c_i} (1 - p_i)^{1 - c_i}, \text{ where } 0 < p_i < 1.$$

Clearly, since $\vec{P}^*$ is the minimum SSE solution under the same constraint set as the maximum-likelihood problem, if I can show that $\vec{P}^*$ maximizes both $L_1(\vec{P})$ and $L_2(\vec{P})$, I will have proved Theorem 1 based on Lemma 1.

Showing that $\vec{P}^*$ maximizes $L_1(\vec{P})$ is easy. Notice that in $L_1(\vec{P})$, $p_i$ can only be 1 or 0, and the only way $p_i$ can be 1 in the minimum SSE solution is when $c_i$ has the same value, 1; the same is true when $p_i$ is 0. Substituting $p_i$ and $c_i$ with either both 0 or both 1 gives an $L_1(\vec{P})$ value of 1, which is already the unconstrained maximum, and therefore also the constrained maximum.

I will show in the remainder of this section that $\vec{P}^*$ also maximizes $L_2(\vec{P})$ under the same constraint set.

Since all $p_i$'s in $L_2(\vec{P})$ have values strictly larger than 0 and less than 1, a negative-log-likelihood function may be defined as:

$$G(\vec{P}) = -\log L_2 = -\sum_{i=1}^{n} c_i * \log p_i - \sum_{i=1}^{n} (1 - c_i) * \log(1 - p_i). \tag{41}$$

Since, with $0 < x, y \le 1$, $-\frac{1}{2}(\log x + \log y) \ge -\log\left(\frac{x + y}{2}\right)$, $-\log p_i$ is a convex function, as is $-\log(1 - p_i)$. Since $G(\vec{P})$ is a weighted sum of convex functions with non-negative weights, it is a convex function, too.

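After taking the derivative of G over $p_i$, we get:

$$\frac{\partial G}{\partial p_i} = -c_i * \frac{1}{p_i} + (1 - c_i) * \frac{1}{1 - p_i} = \frac{p_i - c_i}{p_i * (1 - p_i)}. \tag{42}$$

Notice that the derivative of the SSE function

$$F = \sum_{i=1}^{n} (c_i - p_i)^2 \tag{43}$$

is

$$\frac{\partial F}{\partial p_i} = 2 * (p_i - c_i). \tag{44}$$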

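Comparing (44) with (42), (42) can be rewritten as the following, with $\vec{P}$ and $\vec{C}$ representing the vectors $(p_1, \ldots, p_n)$ and $(c_1, \ldots, c_n)$, respectively:

$$\frac{\partial G}{\partial \vec{P}} = \operatorname{diag}\!\left(\frac{1}{2 * p_1 * (1 - p_1)}, \ldots, \frac{1}{2 * p_i * (1 - p_i)}, \ldots, \frac{1}{2 * p_n * (1 - p_n)}\right) \times \big(2 (\vec{P} - \vec{C})\big)^{T}$$

$$= \operatorname{diag}\!\left(\frac{1}{2 * p_1 * (1 - p_1)}, \ldots, \frac{1}{2 * p_i * (1 - p_i)}, \ldots, \frac{1}{2 * p_n * (1 - p_n)}\right) \times \frac{\partial F}{\partial \vec{P}}. \tag{45}$$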

If we can find $\mu_{G\_i}$ and a specific $\vec{P}^*$ to satisfy the following K.K.T. condition for G:

$$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} * \nabla g_l(\vec{P}^*) = 0 \tag{46}$$

and

$$\mu_{G\_l} * g_l(\vec{P}^*) = 0, \tag{47}$$

then we know that $\vec{P}^*$ is the optimal solution corresponding to the minimum of the negative-log-likelihood function $G(\vec{P}^*)$, and also to the maximum of the likelihood function $L_2(\vec{P}^*)$.

Now, we will show that the $\vec{P}^*$ obtained by solving the minimum SSE problem with the same set of constraints $g(\vec{P})$, along with the $\mu_G$ constructed from the $\mu_F$ corresponding to the minimum SSE solution, does satisfy the K.K.T. condition of (46) and (47).

First, since $\vec{P}^*$ is the solution minimizing the SSE function, based on the K.K.T. necessary condition, there is a corresponding $\mu_F$ satisfying

$$\nabla F(\vec{P}^*) + \sum_{l=1}^{m} \mu_{F\_l} * \nabla g_l(\vec{P}^*) = 0 \tag{48}$$

with

$$\mu_{F\_l} * g_l(\vec{P}^*) = 0. \tag{49}$$

From (49), we have either

$$\mu_{F\_l} = 0$$

or

$$g_l(\vec{P}^*) = 0,$$

this latter case meaning the inequality constraints are actually met by equality. Without loss of generality, let the lth constraint function $g_l$ be $g_l(\vec{P}) = p_{a(l)} - p_{b(l)}$, where $a(l)$ and $b(l)$ correspond to the indices of the variables p appearing in the constraint according to the given partial order. Since $g_l(\vec{P}^*) = 0$, we let $p_{a(l)} = p_{b(l)} = p_l$. Now we can construct

$$\mu_{G\_l} = \frac{1}{2 * p_l * (1 - p_l)} * \mu_{F\_l} \quad (l = 1, \ldots, m). \tag{50}$$

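Since F and G have the same constraint set, $\nabla g_{G\_l}(\vec{P}^*) = \nabla g_{F\_l}(\vec{P}^*)$, so

$$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} * \nabla g_{G\_l}(\vec{P}^*)$$

$$= \operatorname{diag}\!\left(\frac{1}{2 * p_1 * (1 - p_1)}, \ldots, \frac{1}{2 * p_n * (1 - p_n)}\right) \times \frac{\partial F}{\partial \vec{P}} + \sum_{l=1}^{m} \frac{1}{2 * p_l * (1 - p_l)} * \mu_{F\_l} * \nabla g_{F\_l}(\vec{P}^*)$$

$$= \frac{1}{2 * p_i * (1 - p_i)} * \frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \frac{1}{2 * p_l * (1 - p_l)} * \mu_{F\_l} * \nabla g_{F\_l}(\vec{P}^*), \quad (i = 1, \ldots, n).$$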

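As mentioned earlier, $p_l$ is the same as the two p's appearing in $g_{F\_l}$ where $\mu_{F\_l} \ne 0$, so the above equation can be rewritten as:

$$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} * \nabla g_{G\_l}(\vec{P}^*)$$

$$= \frac{1}{2 * p_i * (1 - p_i)} * \frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \frac{1}{2 * p_l * (1 - p_l)} * \mu_{F\_l} * \nabla g_{F\_l}(\vec{P}^*)$$

$$= \frac{1}{2 * p_i * (1 - p_i)} * \Big(\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F\_l} * \nabla g_{F\_l}(\vec{P}^*)\Big). \tag{51}$$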

Since $\vec{P}^*$ and the corresponding $p_i$ are the assignments giving the minimum of the SSE function F, from (39), we have

$$\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F\_l} * \nabla g_{F\_l}(\vec{P}^*) = 0. \tag{52}$$

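Substituting (52) into (51), we have:

$$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} * \nabla g_{G\_l}(\vec{P}^*) = \frac{1}{2 * p_i * (1 - p_i)} * \Big(\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F\_l} * \nabla g_{F\_l}(\vec{P}^*)\Big) = 0. \tag{53}$$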

Since the functions G and F share the same constraint set $g(\vec{P})$,

$$\mu_{G\_l} * g_l(\vec{P}^*) = \frac{1}{2 * p_l * (1 - p_l)} * \mu_{F\_l} * g_l(\vec{P}^*) = 0. \tag{54}$$

(53) and (54) show that the $\vec{P}^*$ minimizing the SSE function F also minimizes the negative-log-likelihood function G and therefore maximizes the likelihood function $L_2(\vec{P})$.

Notice the subtlety that the constraint set we used to show that $\vec{P}^*$ maximizes $L_2(\vec{P})$ is actually a subset of the original constraint set, because some of the constraints in the original problem play no role in maximizing $L_2(\vec{P})$, since they involve variables $p_i$ with value 1 or 0, which do not appear in $L_2(\vec{P})$ at all. This is not an issue: because we have shown that $\vec{P}^*$ already maximizes $L_2(\vec{P})$ in a less constrained setting, $L_2(\vec{P})$ cannot attain a better value under tighter constraints. On the other hand, since $\vec{P}^*$ is actually a solution from minimum SSE under the original constraints, it does not violate any of the constraints that are in the original constraint set but not in the one we used to optimize $L_2(\vec{P})$.

Combining all the results above, using lemma 1, this completes the proof of Theorem 1.

With Theorem 1, the answer achieved by least sum of squared error using the POOL algorithm is the

same as the maximum likelihood solution.
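The smallest non-trivial case makes the agreement concrete. The following worked example is an illustration constructed here, with two instances whose observed labels conflict with the assumed order $p_1 \le p_2$:

```latex
\begin{align*}
&c_1 = 1,\; c_2 = 0, \qquad \text{constraint } p_1 \le p_2 .\\
&\text{SSE: } (1 - p_1)^2 + p_2^2 \text{ has its unconstrained minimum at } (1, 0),
  \text{ which is infeasible;}\\
&\text{with the constraint active, } p_1 = p_2 = p:\;
  (1 - p)^2 + p^2 \text{ is minimized at } p = \tfrac{1}{2}.\\
&\text{Likelihood: } p_1 (1 - p_2) \le p_2 (1 - p_2) \le \tfrac{1}{4},
  \text{ with equality only at } p_1 = p_2 = \tfrac{1}{2}.
\end{align*}
```

Both criteria select the pooled estimate 1/2, which is the total number of positives divided by the total number of instances in the pool.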

4.3 Additional computational steps.

4.3.1 Preprocessing.

In general, $C_{n \times m}$ is given by specific problem requirements. In the present study it is derived from the THEMATICS principles. Since the data in this problem is sparse and n is large, instead of writing out all the constraints, we introduce the idea of immediate successor and immediate predecessor:

Definition 4: Immediate successor.


y is an immediate successor of x if and only if zyzxtsz pp ,..∀ .

There is a trivial linear time algorithm that, given monotonicity assumptions, can find all the immediate successors of a particular instance with particular attributes in a single scan of all the instances and their attributes. With immediate successors of each instance known, the transitive closure will contain the whole monotonicity assumption. We use this fact to build $C_{n \times m}$ in quadratic time. For storage efficiency, we could even store $C_{n \times m}$ in one m×2 array, if all scaling factors are 1, by just storing the indices of instances with 1 and -1 as their coefficients respectively; in the present case, with a different scaling factor for each cell, we use three m×2 arrays, one storing the indices, and the other two for scaling factors.
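The idea of keeping only immediate-successor pairs can be sketched as below. Several things here are assumptions made for illustration: instances are taken to be distinct tuples of attribute values ordered componentwise, an immediate successor is read in the usual order-theoretic sense (no other instance lies strictly between the two), and the names `precedes` and `immediate_successors` are hypothetical. The sketch favors clarity and uses a brute-force scan, so it does not attain the running times stated above.

```python
def precedes(x, y):
    """True if x != y and every attribute of x is <= the matching one of y."""
    return x != y and all(a <= b for a, b in zip(x, y))

def immediate_successors(points):
    """Return (i, j) pairs where points[j] is an immediate successor of points[i]."""
    pairs = []
    for i, x in enumerate(points):
        successors = [j for j, y in enumerate(points) if precedes(x, y)]
        for j in successors:
            # j is immediate if no other successor of x lies strictly below it.
            if not any(
                precedes(points[z], points[j]) for z in successors if z != j
            ):
                pairs.append((i, j))
    return pairs
```

Each returned pair `(i, j)` would correspond to one column of the constraint matrix, stored as a pair of indices rather than as a dense column.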

4.3.2 Interpolation.

If the training set does not have an instance with a specific attribute value, an interpolation scheme is

needed to estimate the class probability with this attribute value. There are some required properties this

interpolation scheme should have. One is that all the interpolated class probabilities should conform to the

monotonicity assumption over the whole virtual table. In addition to that, there are also some desirable

properties, such as, whenever possible, the interpolation should reflect the monotonicity assumption

strictly. In other words, if the original monotonicity assumption specifies that P(C=c|X=x)≤P(C=c|Y=y)

when x≤y, if we have x<y, we prefer P(C=c|X=x)<P(C=c|Y=y) as long as it does not violate the

monotonicity assumption as a whole. Another desirable feature may be, whenever possible, we prefer

P(C=c|X) continuous over X.

After testing some interpolation schemes in the present application, we found a linear interpolation

between the maximum and minimum allowed P based on the Manhattan distance of instance X to its

limiting predecessor and successor gives good results in practice.

The result of applying this new POOL method to the protein active site prediction problem will be given

in the next chapter.


Chapter 5

Applying the POOL Method with THEMATICS in Protein

Active Site Prediction.


5.1 Introduction.

In chapter 3, I reported the use of SVM and THEMATICS to predict the protein active sites based on

protein 3D structure alone and introduced one way to expand the original THEMATICS method and

include the non-ionizable residues into the prediction. Although the SVM method outperforms all prior

structure-based methods, including other approaches using THEMATICS, and achieves similar or slightly

better performance than methods using both structural and sequence comparison information, there is still

room for further improvements of the method, as we briefly discussed in section 3.7.

One straightforward improvement can be the addition of more information about a residue into the

learning system. In the study reported in this chapter, in addition to THEMATICS features, I add

different features, such as the size of the cleft in which a surface residue resides, and the conservation of

the residue among proteins of similar sequence into our system, and examine how helpful they are in

terms of improving the sensitivity and the specificity of the prediction.

Another improvement comes from changing a classification problem into a ranking problem. In a

classification problem, the results are binary labels of either positive or negative, with nothing in between.

One of the disadvantages of this approach is that it is less convenient or even impossible in some cases for

the users to fine tune their result if they want to improve the sensitivity at the cost of lowering the

specificity or vice versa. In this study, the result is a ranked list of all residues in a protein based on their

likelihood of being in active sites. Users can choose a cut-off best suited to their needs in different

situations. One previous method, called PCats 21 generates such a rank-ordered list of probabilities.

Because my method actually estimates probabilities, I can easily combine probability estimates from different methods, using the chaining method I introduced in section 4.1.5. This overcomes one of the common hurdles one encounters when including more features in the system, as described in the paragraph above. This study shows that the combined probability estimates with additional features work better than a single probability estimate.

In the SVM approach, there is a 9 Å cut-off used to form SVM positive residues into clusters as active site residue candidates. This threshold seems to give good predictions, but it is arbitrary and could be optimized in a more systematic way. In this study, I eliminated this step and used features containing the degree of perturbation of titration curves of nearby residues. This approach is more systematic and is optimized over the whole process.

Combining THEMATICS and the power of the POOL method by enforcing the hypothesis into the

learning system, I achieve a substantially higher sensitivity and specificity at the same time than the SVM

method, one I had already shown to be the best among all other 3D-structure based methods, as compared

in Chapter 3. Performance can be improved further with other 3D-structure-based features, including the

size of the cleft in which surface residues reside. Performance can also be improved using sequence

conservation scores for individual residues, obtained from a sequence alignment of proteins of similar

sequence, provided there are enough such proteins. Note that this latter enhancement turns a purely 3D

structure based method into a sequence and structure based method.

A set of 64 different proteins from the CatRes (CSA) database is used to compare the performance of the

different methods for functional site prediction. A more complete selection of 160 proteins from the

CatRes (CSA) database is used to further confirm the advantage I gain by adding extra features in

addition to THEMATICS into the system to form an improved system for prediction. In this chapter, I

also improve the way I extend the method to predict non-ionizable active site residues. In addition, we use

ROC curves to compare the performance between different methods and RFR (Recall-Filtration Ratio)

curves to guide potential users in setting the actual cut-off in practice.

5.2 THEMATICS curves and other features used in the POOL method


In the work presented in Chapter 3, I used moments of the first derivative curves of the titration curves.

These were defined analogously to the moments of density functions, as these first derivative curves are

essentially probability distribution functions 53. One aspect of these prior approaches such as 24, 45, 54 is the

use of spatial clustering as a way of reducing the number of apparent false positives. That is, residues are

reported as positive by the method if and only if they are in sufficiently close spatial proximity to at least

one other residue identified as a candidate by the outlier detector in Ko’s and Wei’s approach or the SVM

in Tong’s approach. The overall identification process in these prior approaches thus involves two stages,

where the first stage makes a binary yes/no decision on each residue. In this new approach I do not begin

with such a binary decision because it is my goal to assign to every (ionizable) residue a probability that it

is an active-site residue. Thus, as an alternative to this clustering approach, I instead consider what I call

environment features. For a given scalar feature x, I define the value of the environment feature $x^{env}(r)$ for a given residue r to be

$$x^{env}(r) = \frac{\sum_{r' \neq r} w(r')\, x(r')}{\sum_{r' \neq r} w(r')} \qquad (55)$$

where r' is an ionizable residue whose distance d(r',r) to residue r is less than 9 Å, and the weight w(r') is given by $1/d(r',r)^2$.
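The environment feature of Equation 55 can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in this work: the function name, the dictionary inputs, the use of a single representative coordinate per residue to measure d(r', r), and the choice to return None when no ionizable neighbour lies within the cut-off are all assumptions.

```python
import math

def environment_feature(coords, values, r, cutoff=9.0):
    """Inverse-square-distance weighted mean of x over ionizable neighbours of r.

    coords: dict residue id -> (x, y, z); one point per residue (assumption).
    values: dict ionizable residue id -> scalar feature x(r').
    Returns None if no ionizable residue lies within the cutoff (assumption).
    """
    num = 0.0
    den = 0.0
    for rp, x in values.items():
        if rp == r:
            continue  # the sum runs over r' != r
        d = math.dist(coords[rp], coords[r])
        if d == 0.0 or d >= cutoff:
            continue  # outside the 9 A neighbourhood (or coincident points)
        w = 1.0 / d ** 2
        num += w * x
        den += w
    return num / den if den > 0.0 else None
```

Because only the neighbours need to be ionizable, the same function applies whether or not residue r itself is ionizable.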

In this study, I use the same features $\mu_3$ and $\mu_4$ used in the Ko approach, along with the additional features $\mu_3^{env}$ and $\mu_4^{env}$ as an alternative to the clustering stage. Thus every ionizable residue in any protein is assigned the 4-dimensional feature vector $(\mu_3, \mu_4, \mu_3^{env}, \mu_4^{env})$, which is the THEMATICS feature for ionizable residues.

Although $\mu_3$ and $\mu_4$ themselves are only defined for ionizable residues, the environment features, such as $\mu_3^{env}$ and $\mu_4^{env}$, are well-defined for non-ionizable residues. For non-ionizable residues, the THEMATICS features I use are the 2-dimensional feature vectors $(\mu_3^{env}, \mu_4^{env})$.


There is one additional subtlety that all THEMATICS-based methods have had to address, and the current

approach is no exception: the need for some kind of normalization across proteins. In Ko’s and Wei’s

approach, the raw features are individually transformed into Z-scores by subtracting the within-protein

mean and dividing by the within-protein standard deviation. Similarly, in my SVM approach, the raw

features are likewise transformed into robust Z-scores by subtracting the within-protein median and

dividing by the within-protein interquartile distance. Here, I apply yet another within-protein feature

transformation to each feature, which I call rank normalization. Within each protein, each feature value is

ranked from lowest to highest in that protein, and each data point is then assigned a number uniformly

across the interval [0,1] based on the rank of that feature in that protein. The highest value for that feature

is thus transformed to 1, and the lowest value is transformed to 0. Note that unlike the use of Z-scores or

robust Z-scores, this is a nonlinear transformation of the raw feature values. For each scalar feature x,

denote its within-protein rank-normalized value as $\tilde{x}$, which by definition lies in [0,1]. I extend the use of this notation to feature vectors in the obvious way. That is, $\tilde{\mathbf{x}} = (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4)$.
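As an illustration of within-protein rank normalization, the following minimal sketch maps the values of one feature in one protein onto evenly spaced points in [0,1]. The function name is hypothetical, and the handling of ties (broken by input order) and of a protein with a single value (mapped to 0) are assumptions, since the text does not specify them.

```python
def rank_normalize(values):
    """Map feature values from one protein to evenly spaced ranks in [0, 1].

    The lowest value maps to 0 and the highest to 1. Ties are broken by
    input order (assumption; the text does not say how ties are treated).
    """
    n = len(values)
    if n == 1:
        return [0.0]  # assumption: a lone value gets the lowest rank
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = rank / (n - 1)
    return out
```

For example, the raw values 10.0, -3.0, 5.0 and 7.0 become 1, 0, 1/3 and 2/3: the order is preserved, but the spacing between values is not, which is why the transformation is nonlinear.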

Note that the use of within-protein rank normalization does not affect the within-protein partial order used

in the THEMATICS Principles, which I introduced in Section 2.3.3. That is, $\mathbf{x} \prec \mathbf{y}$ is true for raw feature vectors x and y in the same protein if and only if $\tilde{\mathbf{x}} \prec \tilde{\mathbf{y}}$. However, when I combine data from multiple

proteins for training and use the results to make predictions for new proteins, as I describe in more detail

later, this actually amounts to making an even stronger monotonicity assumption across proteins in which

the within-protein rankings replace the raw feature values. This is obviously a more controversial

assumption, but some such approach is required to be able to train on multiple proteins and make

predictions for novel proteins, and, as I show below, this approach appears to give good results.

As discussed in Chapter 2, in addition to THEMATICS, there are other methods for predicting active site

residues. They use features such as geometric position of residues, amino acid type information and

sequence conservation, which are very different from the electrostatic information I use in THEMATICS.


It is reasonable to speculate that if I combine these features with THEMATICS features, I may get better

performance. I tested this hypothesis in this study.

In addition to THEMATICS features, I try the cleft feature, which is a number I assigned for every

residue in a given protein based on the rank of the size of the cleft to which the residue belongs. One

special value is assigned to every residue not on the protein surface, and another is assigned to every

residue on the surface but not within any cleft. Ignoring these special values, it is easy to construct the

monotonicity assumption that the larger the cleft to which a residue belongs, the more likely that residue

is to belong to the active site. I can apply POOL on the cleft feature based on this monotonicity

assumption.

ConSurf 27 is a sequence comparison based method that identifies functionally important regions on the

surface of a protein of known three-dimensional (3D) structure, based on the phylogenetic relations

between its close sequence homologues. If there are more than five homologues (the method is

considered reliable if the number of homologues is greater than 10) to the query protein, it can assign a

score between 1 and 9 to each residue in the query sequence based on how conserved this residue is

among those homologues. The larger the score is, the more conserved the residues are. With some

exceptions discussed in Chapter 2, it is commonly believed that the more conserved a residue is, the more

likely it is functionally important. This gives me another monotonicity assumption to which I can apply

POOL. I call this the ConSurf feature. In this study, if the protein has more than 10 homologues, I used

the scores ConSurf assigns to each residue as their ConSurf feature values. For the proteins with 10 or

fewer homologues, I assign 0 as the ConSurf feature values for all their residues. Since I am only

interested in the rank list of residues within a protein, rather than across proteins, this special treatment

will not affect the final results.

In addition to these features, I also tried features such as residue type and ASA (area of solvent

accessibility) of residues in our study but found that including these did not improve the performance. No

further details will be given here about those features that did not improve overall performance.


5.3 Performance measurement.

Before presenting the results, I must first decide how to measure the performance of our system. For a

standard classification problem, performances are typically measured by recall, false-positive rate and

Matthews correlation coefficient (MCC). Within a specific system, recall and false positive rate usually

affect each other: lowering the false-positive rate will most likely lower recall at the same time. So it makes sense only to report both metrics together. Although MCC is a single metric that measures the overall performance of a classification system, it measures the performance only at a specific setting. To measure the performance of my system at different thresholds, I use ROC (receiver operating characteristic) curves, which plot recall against the false-positive rate. One can

compare two systems by comparing their ROC curves. In general, one can say a system giving a higher

recall and a lower false-positive rate at the same time out-performs a system giving a lower recall and a

higher false-positive rate at that specific setting. If the ROC curve from system A is always at the upper-

left side of the ROC curve from system B, one can conclude that system A dominates system B and

always out-performs system B. Studies also have shown that area under the ROC curve (AUC) is a very

reliable single-value assessment for the evaluation of different machine learning algorithms 70.

In order to generate ROC curves, I need to be able to calculate recall and false-positive rate values, which

come from classification problems. In the POOL system, the result for each protein is a ranked list based

on the probability of a residue being in the active site. A natural way to draw a ROC curve for every

protein is to move the cutoff one residue at a time from the top to the end of the list. The resulting ROC

curve has a staircase shape: only recall increases when an active site residue is encountered, and only the false-positive rate increases when a non-active-site residue is encountered.
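The construction of a per-protein ROC curve from a ranked list can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes the list contains at least one annotated positive and at least one negative.

```python
def roc_points(labels):
    """ROC points for one protein.

    labels: 0/1 annotations in rank order, most probable residue first.
    Returns (false-positive rate, recall) pairs, one per cutoff position,
    starting from (0, 0). Assumes at least one positive and one negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for y in labels:
        if y:
            tp += 1  # step up: recall increases
        else:
            fp += 1  # step right: false-positive rate increases
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Each cutoff moves the curve either straight up or straight to the right, which produces the staircase shape.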

We define average specificity (AveS) for each protein in the set:

$$AveS = \frac{\sum_{r=1}^{N} \big( S(r) \cdot pos(r) \big)}{\text{Number of positive examples}} \qquad (56)$$

where r is the rank, N is the number of residues in a protein, pos(r) is a binary function that indicates

whether the residue of a given rank r is annotated in the reference database in the active site (pos(r)= 1) or

not (pos(r)= 0), and S(r) is the specificity at a given cut-off rank r.

It is not hard to see that AveS represents the area under the ROC curve (AUC). This is analogous to AveP,

the area under the Recall-Precision curve, used in the information retrieval field. Unlike MCC, AveS is a

single-number measurement of the performance of a classification system over the whole range of

different cutoff settings, rather than at a single setting.
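The equivalence between AveS and the area under the staircase ROC curve can be checked numerically. The sketch below uses hypothetical function names and takes the specificity S(r) to be one minus the false-positive rate at cutoff rank r; it compares AveS with the AUC computed directly as the fraction of (positive, negative) pairs in which the positive is ranked higher.

```python
def average_specificity(labels):
    """AveS for one protein; labels are 0/1 in rank order, best first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fp = 0
    total = 0.0
    for y in labels:
        if y:
            total += 1.0 - fp / n_neg  # specificity at this cutoff rank
        else:
            fp += 1
    return total / n_pos

def auc_by_pairs(labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    wins = sum(1 for i, yi in enumerate(labels) if yi
               for yj in labels[i + 1:] if not yj)
    return wins / (n_pos * n_neg)
```

For the ranked list 1, 0, 1, 0, 0 both functions give 5/6, and a perfect ranking gives 1.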

Since the AveS is a measurement on a ROC curve for predicting active site residues from only one

protein, I need a measurement for the performance on a set of proteins. For this, I use Mean Average

Specificity (MAS), which is the mean of AveS of all the proteins in the set. For all methods that generate a

ranked list as in this study, including the POOL method, and one SVM method, I report the corresponding

Mean Average Specificity (MAS) from all the proteins in the test set. As in any statistical analysis, a difference between means says little without further analysis of its statistical significance. To test the significance of the observed difference, I

perform the Wilcoxon signed-rank test 71 on AveS from different methods to estimate the probability of

observing such a difference under the null hypothesis that the observed better-performing method is

actually not better than the other.

To visually compare the performances from different methods, I generate the averaged ROC curve for

each POOL method by computing the recall and false-positive rate after truncating the list after each of

the positive residues in turn, followed by linearly interpolating the value at each recall value and

computing the mean of the interpolated false-positive rate value from all proteins in the dataset.
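The averaging procedure can be sketched as below. This is one possible reading of the description, not the code used in this work: the function names are hypothetical, each per-protein curve is assumed to be anchored at the origin, and the recall grid at which the curves are averaged is left to the caller.

```python
def fpr_at_positives(labels):
    """(recall, false-positive rate) after each positive in a ranked list."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # assumption: curve starts at the origin
    for y in labels:
        if y:
            tp += 1
            points.append((tp / n_pos, fp / n_neg))
        else:
            fp += 1
    return points

def interpolate(points, recall):
    """Linearly interpolate the false-positive rate at a given recall."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= recall <= x1:
            return y0 + (y1 - y0) * (recall - x0) / (x1 - x0)
    raise ValueError("recall must lie in [0, 1]")

def averaged_roc(proteins, recall_grid):
    """Mean interpolated false-positive rate over proteins at each recall."""
    curves = [fpr_at_positives(labels) for labels in proteins]
    return [(sum(interpolate(c, g) for c in curves) / len(curves), g)
            for g in recall_grid]
```

The result is a list of (mean false-positive rate, recall) pairs that can be plotted directly as an averaged ROC curve.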

Although ROC curves and their associated AveS values are good ways to compare performance between

different methods, they do not directly guide the user in selecting cut-off values, because neither recall nor false-positive rate is known to users unless they happen to know the true positives of their proteins up front. I therefore use another plot that I call the RFR curve; this is a plot of recall against filtration ratio. Its purpose is to guide users in selecting their cut-offs. RFR curves are almost the same as ROC curves, except that filtration ratios are used in place of false-positive rates.

Since for every protein in the dataset POOL generates a ranked list of residues based on their probabilities

of being in the active site, and from this list one generates a corresponding ROC curve and a

corresponding RFR curve, I need to average these curves into a single ROC curve and a single RFR curve

for the whole dataset for comparison purposes. Since it is more natural to ask for any given method what

its expected false-positive rate is for given values of recall, this is what I use for the averaged ROC curve.

Another important fact about ROC curves is that there need be no prior commitment to how specific

classifiers are created. They express the tradeoff no matter how the classifiers are parameterized. On the

other hand, for the user who wants to use a fixed-proportion cutoff scheme, I provide averaged RFR

(recall-filtration ratio) curves; these curves give the expected recall for given filtration ratio values.
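A per-protein RFR curve can be sketched in the same way as a ROC curve. The sketch assumes that the filtration ratio at a cutoff is the fraction of all residues in the ranked list that fall at or above the cutoff, which matches the fixed-proportion cutoff scheme mentioned above; this definition and the function name are assumptions made for illustration.

```python
def rfr_points(labels):
    """RFR points for one protein.

    labels: 0/1 annotations in rank order, most probable residue first.
    Returns (filtration ratio, recall) pairs, where the filtration ratio is
    taken to be the fraction of residues selected (assumed definition).
    """
    n = len(labels)
    n_pos = sum(labels)
    tp = 0
    points = [(0.0, 0.0)]
    for k, y in enumerate(labels, start=1):
        tp += y
        points.append((k / n, tp / n_pos))
    return points
```

Unlike the false-positive rate, the filtration ratio can be fixed by a user without knowing the true labels, for example by selecting the top 10% of the ranked residues.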

5.4 Computational procedure.

The three-dimensional coordinate files for the protein structures were downloaded from the Protein Data

Bank (http://www.rcsb.org/pdb/). In order to predict the theoretical titration curve of each ionizable

residue in the structure, finite-difference Poisson-Boltzmann calculations were performed using UHBD 72

on each protein followed by the program HYBRID 73, which calculates average net charge as a function

of pH. These titration curves were obtained for each ionizable residue: Arg, Asp, Cys, Glu, His, Lys, Tyr,

and the N- and C- termini. The pH range we simulated for all curves is from -15.0 to 30.0, in increments

of 0.2 pH units. This wide theoretical pH range is necessary for proper numerical integration of the first

derivative functions. The structures were processed and analyzed to obtain the central moments, as

described in Chapters 2 and 3.

These individual features, the central moments µ3 and µ4, were then rank-normalized within each protein,

and thus assigned values in the interval [0,1], as described earlier. This four-dimensional representation of

each curve was used for training and for testing. The results given in the remaining sections were based


on eight-fold cross-validation on a set of 64 proteins or 10-fold cross-validation on a set of 160 proteins,

both taken from the Catalytic Site Atlas (CSA) database 57, 74. The labels were taken directly from the

CSA database; if a residue is identified there as active in catalysis, it was labeled as positive in my dataset.

If not so identified in the CSA, we labeled it as negative. The CSA annotations, although incomplete,

constitute the best source of active residue labels for enzymes. In anticipation that the POOL method

would not be overly sensitive to mislabeled data, I performed no hand tuning of the labels and omitted no

residues during training, in contrast to the SVM work reported in Chapter 3.

For the eight-fold cross-validation procedure, I randomly divided the 64-protein set into eight folds of

eight proteins each, training on seven of the eight folds (56 proteins) and testing on the remaining fold (8

proteins). For the ten-fold cross-validation procedure, I randomly divided the 160-protein set into ten

folds of sixteen proteins each, training on nine of the ten folds (144 proteins) and testing on the remaining

fold (16 proteins). Training was performed by applying the POOL method to obtain a function $\hat{P}(1 \mid \tilde{\mathbf{x}})$ for each rank-normalized feature vector $\tilde{\mathbf{x}}$ in the appropriate feature space $[0,1]^k$ (where k = 4 for the POOL method applied to the four THEMATICS features of ionizable residues as stated earlier, denoted POOL(T4); k = 5 for the POOL method applied to the four THEMATICS features of ionizable residues plus the geometric feature of cleft size, denoted POOL(T4G); k = 1 for the POOL method applied to the geometric feature of cleft size alone, denoted POOL(G); and k = 2 for POOL applied to non-ionizable residues, denoted POOL(T2)). An additional detail is that for training we quantized the

multi-dimensional data points. For example, for POOL(T4), each rank-normalized feature fell into one of

20 bins whose sizes varied depending on their distance from 0.0. In particular, the lowest ranked bins

covered the half-open intervals [0.0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.7), and there were 16 more bins of

width 0.02 above that, with one special bin for 1.0. Thus the lowest-ranking data were quantized more

coarsely than the remaining data. This is appropriate since these data tend to have very low average

probability of being in the active site anyway, because the vast majority of residues are negatives. Thus

the inability to make fine distinctions among these low-probability candidates does not degrade the


overall quality of the results. It does, however, improve the efficiency of the training procedure

significantly, so this is an important component of the analysis. This is especially helpful in the 10-fold

cross-validation on the 160-protein set. The typical training set of 144 proteins contained about 14500

ionizable residues, which fell into more than 6000 quantized bins in the 4-dimensional space used for

POOL(T4). The corresponding number of inequality constraints was about 35000-40000.
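The quantization described above can be sketched as follows. The sketch reflects one reading of the bin layout, in which the 16 upper bins consist of 15 bins of width 0.02 covering [0.7, 1.0) plus the special bin for the value 1.0, giving 20 bins in total; the names EDGES, quantize and occupied_cells are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges of the half-open bins: three of width 0.2, one of width 0.1,
# then bins of width 0.02 up to 1.0 (assumed reading of the text).
EDGES = ([0.2, 0.4, 0.6, 0.7]
         + [round(0.7 + 0.02 * i, 2) for i in range(1, 15)]
         + [1.0])

def quantize(v):
    """Map a rank-normalized value in [0, 1] to a bin index 0..19.

    Index 19 is the special bin that holds only the value 1.0.
    """
    if not 0.0 <= v <= 1.0:
        raise ValueError("expected a rank-normalized value in [0, 1]")
    return bisect_right(EDGES, v)

def occupied_cells(vectors):
    """Count the data points in each occupied cell of the quantized space."""
    return Counter(tuple(quantize(v) for v in vec) for vec in vectors)
```

Only cells that actually contain data appear in the counter, which is why a 4-dimensional grid with 20 bins per axis yields only a few thousand occupied cells for a typical training set.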

One final detail is that the probability estimates generated by the POOL method as I have applied it tend

to have numerous ties as well as some places where there is no well-defined value. The latter places

occur because the method only assigns values to existing data points (or bins containing data in the case

of our use of quantization). The locally constant regions occur both because of the quantization applied to

the training data at the outset and because the data pools created by the algorithm acquire a single value.

In cells where no value is defined, the interpolation scheme I use is to simply assign a value linearly

interpolated based on the Manhattan distance between the least upper bound and the greatest lower bound

for that cell based on the monotonicity constraint. Finally, since both the data pooling performed by the

algorithm and this interpolation scheme tend to lead to ties, I use the Manhattan distance from the origin

of the four THEMATICS features as a tie-breaker for any residues whose probability estimates are

identical. This simply imposes a slight bias toward strict monotonicity even though the mathematical

formulation I use to determine these probabilities is based on a non-strict monotonicity assumption,

making it possible to obtain well-defined rankings for all the residues in a protein.
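The tie-breaking rule can be sketched as a two-key sort. In this sketch the residue names and values are invented for illustration, and it is assumed that, among residues with identical probability estimates, the one farther from the origin in Manhattan distance ranks higher, consistent with the monotonicity assumption.

```python
def rank_residues(residues):
    """Rank residues by probability estimate, breaking ties by distance.

    residues: list of (name, probability, features), where features are the
    rank-normalized THEMATICS features, all in [0, 1]. Because the features
    are non-negative, the Manhattan distance from the origin is their sum.
    Assumption: among tied residues, a larger distance ranks higher.
    """
    ordered = sorted(residues, key=lambda t: (-t[1], -sum(t[2])))
    return [name for name, _, _ in ordered]
```

For instance, two residues that fall into the same pool and so receive the same estimate are separated by whichever has the larger feature sum.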

I use CASTp 37, which uses the weighted Delaunay triangulation and the alpha complex for shape measurements, to calculate the cleft information for each residue in the protein. The clefts are ranked by size in decreasing order, and each residue having atoms located in any cleft is assigned the rank number of the largest of the clefts where its atoms are located. One special value is assigned to

every residue not on the protein surface, and another is assigned to every residue on the surface but not

within any cleft. Ignoring these special values, the monotonicity assumption is that the larger the cleft to

which a residue belongs, the more likely that residue is to belong to the active site.
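The assignment of cleft ranks to residues can be sketched as follows. The input format is hypothetical and does not correspond to CASTp output, and the two special values for buried residues and for surface residues outside any cleft are not handled, since their actual values are not given here.

```python
def cleft_ranks(clefts):
    """Assign each residue the rank of the largest cleft it belongs to.

    clefts: dict cleft id -> (size, iterable of residue ids); a hypothetical
    input format. Rank 1 is the largest cleft. Residues that appear in no
    cleft are absent from the result.
    """
    ordered = sorted(clefts, key=lambda c: clefts[c][0], reverse=True)
    ranks = {}
    for rank, cleft_id in enumerate(ordered, start=1):
        for residue in clefts[cleft_id][1]:
            # Clefts are visited largest first, so the first rank recorded
            # for a residue is that of its largest cleft.
            ranks.setdefault(residue, rank)
    return ranks
```

A residue with atoms in both the largest and the second-largest cleft therefore receives rank 1.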


I use ConSurf 27 to calculate the sequence conservation information for residues in each protein. ConSurf

takes a protein sequence and finds its closest sequence homologues using MUSCLE 75, a multiple-sequence alignment algorithm. Two sequences with similarity higher than a preset threshold are treated as homologues. ConSurf analyzes the homologues of the query sequence and determines how conserved each residue in the query protein is among these homologues. In order to normalize the result and make it

comparable between different proteins with different numbers of homologues and with different degrees

of overall conservation, the program labels each residue with a conservation score between 1 and 9, with

9 being the most conserved and 1 being the most variable. If there exist more than 50 homologues for the

query sequence, the 50 homologues closest to the query sequence are analyzed. If there are fewer than six homologues, the method will not work. For proteins with 6-10 homologues, ConSurf does report conservation scores, but these scores are less reliable. In this study, I only use conservation scores from

ConSurf when there are at least 11 homologues for a protein. Under the assumption that active site

residues tend to be more conserved than others, we apply the POOL method on the conservation score

with the monotonicity assumption that the larger the conservation score a residue has, the more likely that

residue is to belong to the active site.
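The rule for deriving the ConSurf feature from the number of homologues can be written as a short sketch. The function and parameter names are hypothetical; the threshold of 11 homologues and the constant value 0 for proteins below it follow the description in this chapter.

```python
def consurf_feature(scores, n_homologues, min_homologues=11):
    """Return per-residue ConSurf feature values for one protein.

    scores: conservation scores (1 = most variable, 9 = most conserved).
    With fewer than min_homologues homologues, every residue gets 0, so the
    feature carries no ranking information within that protein.
    """
    if n_homologues >= min_homologues:
        return list(scores)
    return [0] * len(scores)
```

Because all residues of a low-homologue protein share the same value, the feature cannot change their relative order within that protein.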

5.5 Results

The results presented in this section are based on two sets of proteins, a set of 64 test proteins selected

randomly from the CSA database 57, 74, and a 160-protein set covering most of the CSA database. A

detailed list of the proteins in both sets and the CSA-labeled positive residues within each protein can be

found in Appendices C and D. In each case, the results are based on eight-fold cross-validation for the 64

protein set and ten-fold cross-validation for the 160 protein set. The ROC curves and RFR curves I

display show average performance over all proteins in all of the test sets, using the averaging methods

described in Section 5.3.

5.5.1 Ionizable residues using only THEMATICS features.


First I evaluate the ability of POOL with the four THEMATICS features, POOL(T4), to predict ionizable

residues in the active site. For the purposes of Figures 5.1 and 5.2, only the CSA-annotated ionizable active

site residues are taken as the labeled positives. Thus if a method successfully predicts all of the labeled

ionizable active residues, the true positive rate is 100%. The prediction of all active residues, including

the non-ionizable ones, is addressed below.

Figure 5.1 shows the ROC curve, true positive fraction (TP) as a function of false positive fraction (FP),

obtained using POOL(T4), with just the four-dimensional THEMATICS feature vectors described earlier

(solid curve). Recall that the POOL method computes maximum-likelihood probability estimates, but for

these ROC curves, only the rankings of all residues within a single protein matter. For comparison, I also

show in Figure 5.1 a corresponding ROC curve for the earlier THEMATICS statistical approach

introduced by Ko et al. 24 and refined by Wei et al. 54 (dashed curve), plus the single point (X)

corresponding to the THEMATICS SVM-based approach 45. The data set used for the statistical curve

consists of the same 64 proteins used here. Note that the POOL(T4) curve always lies above and to the

left of the statistical curve for all non-zero values of the true positive fraction. For any given non-zero

value of the FP fraction, the true positive fraction is always higher for POOL(T4) than for the statistical

selector. The point representing the particular SVM classifier is based on a separate set of data, trained

and tested on data sets somewhat different from the present data set, so the results are not strictly

comparable. Nevertheless, this point lies well below the POOL(T4) curve and strongly suggests that

POOL(T4) is superior to the SVM approach 45. Below I present further evidence that POOL outperforms

an SVM on this active-site classification task. Thus POOL(T4) represents our best method yet for

identifying ionizable active-site residues using THEMATICS features alone.

[Figure 5.1: plot of Recall (vertical axis, 0 to 1) against False Positive Rate (horizontal axis, 0 to 0.25). Legend: POOL(T4), Wei’s Method, SVM.]

Figure 5.1 Averaged ROC curves comparing POOL(T4), Wei’s statistical analysis and Tong’s SVM using THEMATICS features. Shown in the plot are the averaged ROC curves for POOL(T4) (solid curve) and Wei’s statistical analysis (dashed curve), together with the single point for Tong’s SVM (point X), using THEMATICS features on ionizable residues only for the prediction of annotated active site ionizable residues. POOL(T4) outperforms both the SVM and Wei’s method.

5.5.2 Ionizable residues using THEMATICS plus cleft information.

Next I evaluate the three different ways of combining THEMATICS features with cleft size information.

Figure 5.2 shows averaged ROC curves for these three different methods, along with the best-performing

THEMATICS-only method, POOL(T4). The three methods are: (i) POOL(T4G), which uses the POOL

method with the 5-dimensional concatenated feature vectors of THEMATICS and cleft size rank (G

stands for geometric feature); (ii) SVM(T4G), which uses a support vector machine trained using the

same 5-dimensional feature vectors, with varying threshold; and (iii) CHAIN(POOL(T4), POOL(G)), the

result of chaining POOL(T4) estimates with POOL(G) estimates.

[Figure 5.2: plot of Recall (vertical axis, 0 to 1) against False Positive Rate (horizontal axis, 0 to 0.25). Legend: POOL(T4), CHAIN(POOL(T4), POOL(G)), SVM(T4G), POOL(T4G).]

Figure 5.2. Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only. The method using chaining to combine THEMATICS features and geometric information has the best performance.

To compare the averaged ROC curves from Figure 5.2 quantitatively, I compute the area under the curve

for each ROC curve in the figure using the mean average specificity (MAS). The MAS values for

CHAIN(POOL(T4), POOL(G)), POOL(T4), POOL(T4G) and SVM(T4G) are 0.939, 0.921, 0.909 and

0.903, respectively. Figure 5.2 and the MAS values show the comparison of averaged performance

between different methods. In order to estimate the statistical significance of the performance difference

considering all pair-wise comparison results (i.e., on a per-protein basis), I perform the Wilcoxon signed-rank test. Table 5.1 shows the p-value of the Wilcoxon signed-rank test, the probability of observing the

specified AveS measurement with the null hypothesis that the method listed in the corresponding row does

not out-perform the method listed in the corresponding column, as the first number in each cell. The

number N in parentheses indicates the number of proteins out of the 64, for which the method in that row

outperforms the method in that column. For the remaining (64-N) proteins in the set, the two methods

either give equal performance or the method in the column outperforms the method in the row.

| | SVM(T4G) | POOL(T4G) | POOL(T4) |
|---|---|---|---|
| CHAIN(POOL(T4), POOL(G)) | <0.0001 (53) | <0.0001 (59) | <0.0001 (46) |
| POOL(T4) | 0.0002 (40) | 0.0006 (41) | |
| POOL(T4G) | 0.038 (37) | | |

Table 5.1 Wilcoxon signed-rank tests between methods shown in Figure 5.2. The first number in each cell is the Wilcoxon p value, the probability that the method in the corresponding row does not outperform the method in the corresponding column. The number in parentheses is the number of proteins out of 64 for which the method in the row outperforms the method in the column.

The figure and the table above clearly show that chaining the POOL(T4) and POOL(G) probability

estimates is the method that gives the best performance. It is interesting to note that this method,

CHAIN(POOL(T4), POOL(G)), is the only one that outperforms POOL(T4) alone. It is also interesting

to note that POOL(T4) is consistently at least as good as SVM(T4G), and is significantly better than

SVM(T4G) in the upper recall range, even though the latter has the advantage of the additional cleft

information. In general, there is little difference between POOL(T4), SVM(T4G), and POOL(T4G) in the

lower recall range, but for recall above about 0.6, POOL(T4) has a significantly lower false positive rate,

on average, than the other two, given equal recall. So these ROC curves provide strong evidence that

CHAIN(POOL(T4), POOL(G)) is the only one of the methods reported to date that is capable of taking

good advantage of additional geometric information that is not contained in THEMATICS features alone

and thereby outperforms any purely THEMATICS-based method so far.

The better performance of this chained method CHAIN(POOL(T4), POOL(G)) over POOL(T4) alone is

consistent throughout the ROC curve. For recall rates greater than 0.50, the TP fraction for the chained

method is better than that of POOL(T4) by roughly 10% for a given FP fraction. This qualitative trend is

apparent from visual inspection of the ranked lists from the two methods. For a typical protein, these two

ranked lists tend to be very similar, with annotated positive residues generally ranking a little higher, on

average, in the list resulting from chaining.

I believe that chaining the four-dimensional and one-dimensional estimators gives better results than applying POOL directly to the single five-dimensional concatenated feature vector probably because the latter overfits. There may be too much flexibility when POOL is used with a high-dimensional input space (see footnote 5), especially when the data are sparse.

Footnote 5: As a side note, as far as possible worst-case performance is concerned, it is easy to show that applying coordinate-wise monotonicity with even a 2-dimensional input space has infinite VC dimension.

Since I established that chaining the POOL probabilities gives better results, from the next section on, I

omit POOL from the method reference, to make the change of features more distinguishable.

CHAIN(POOL(T4), POOL(G)) will be abbreviated as CHAIN(T4, G) instead.

5.5.3 All residues using THEMATICS plus cleft information.

So far only predictions for ionizable residues have been described. The THEMATICS environment

variables are now used to incorporate predictions for non-ionizable residues in the active site.

Figure 5.3 shows the ROC curve for a combined method by which a single merged list ranking all residues, both ionizable and non-ionizable, in a protein is generated. The method assigns probability estimates for ionizable residues using the best of the previous ionizables-only estimators, the chained estimator CHAIN(POOL(T4), POOL(G)) that gave the best ROC curve in Figure 5.2. It also assigns probability estimates to non-ionizable residues using POOL with the two THEMATICS environment features chained with POOL(G), and then rank-orders all the residues based on their probability estimates. This merged method is denoted CHAIN(T_ALL, G). Also included in Figure 5.3 for comparison is a ROC curve, CHAIN(T_ION, G), based on the same estimates for the ionizable residues but assigning probability estimates of zero to all non-ionizable residues. Note that the data for this latter method are essentially the same as those of the CHAIN(POOL(T4), POOL(G)) curve of Figure 5.2, except that the denominator for the recall values is now the total number of active-site residues in the protein, whether ionizable or not, and the denominator for the false-positive rate is now the total number of non-active-site residues in the protein, ionizable or not. The improved ROC curve for the merged method CHAIN(T_ALL, G) compared to the curve for the ionizables-only method CHAIN(T_ION, G) indicates that taking into account both THEMATICS environment variables and cleft information does indeed help identify the non-ionizable active-site residues. When the lists are merged, the rankings of some annotated positive ionizable residues may be lowered, but it is apparent that this effect is more than offset, on average, by the rise in the ranking of some annotated positive non-ionizable residues that are necessarily missed when they are excluded altogether. If this were not the case, then one would expect the merged curve to cross below (and to the right of) the comparison curve in the lower recall (and lower false-positive) range.

109

0

0.2

0.4

0.6

0.8

1

0 0.05 0.1 0.15 0.2 0.25

False Positive Rate

Reca

ll

CHAIN(T_ION, G)CHAIN(T_ALL, G)

Figure 5.3 Averaged ROC curve comparing POOL methods applied to ionizable residues only CHAIN(TION, G) and to all residues CHAIN(TALL, G).

The MAS values for CHAIN(T_ALL, G) and CHAIN(T_ION, G) are 0.933 and 0.833, respectively. Under the
null hypothesis that CHAIN(T_ALL, G) does not outperform CHAIN(T_ION, G), the p-value of the Wilcoxon
signed-rank test on the AveS measurements is <0.0001, and CHAIN(T_ALL, G) outperforms
CHAIN(T_ION, G) in 31 of the 64 proteins. This number may seem low, but the two methods perform
the same in 25 of the 64 proteins. In many of these latter cases, the protein does not have any
non-ionizable residues in the active site.
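
The kind of paired comparison reported above can be sketched as follows. This is only an illustration under stated assumptions, not the analysis code used for this dissertation: the function name `paired_comparison` and the example scores are hypothetical, the scores are assumed to be one value per protein for each method, and the p-value uses the normal approximation without tie or continuity correction, which may differ from the procedure actually used.

```python
import math


def paired_comparison(scores_a, scores_b):
    """One-sided Wilcoxon signed-rank comparison of paired scores.

    Returns (wins, ties, losses, w_plus, p_value). The null hypothesis is
    that method A does not outperform method B. The p-value is a normal
    approximation (an assumption of this sketch).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    ties = len(diffs) - wins - losses

    # Zero differences are dropped before ranking.
    nonzero = sorted((d for d in diffs if d != 0), key=abs)
    n = len(nonzero)
    if n == 0:
        return wins, ties, losses, 0.0, 1.0

    # Give equal absolute differences the average of the ranks they span.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[j + 1]) == abs(nonzero[i]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    # W+ is the sum of the ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    # Upper-tail probability: a large W+ favours method A.
    p_value = 0.5 * math.erfc((w_plus - mean) / (std * math.sqrt(2)))
    return wins, ties, losses, w_plus, p_value


# Hypothetical scores (scaled to integers) for two methods on five proteins.
method_a = [95, 90, 88, 97, 91]
method_b = [90, 90, 85, 91, 93]
wins, ties, losses, w_plus, p_value = paired_comparison(method_a, method_b)
```

Counting wins, ties and losses alongside the test statistic mirrors the way the results are reported here: the number of proteins on which one method wins can look modest when many proteins are tied.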

I have shown that extending the POOL method to non-ionizable residues gives a satisfactory result.
From now on, all residues are included in the study, and I will simply use T to denote the way
THEMATICS is applied in T_ALL: for ionizable residues, I estimate the probability of being in the active
site by chaining the result of the POOL method on the four THEMATICS features; for non-ionizable
residues, I estimate this probability by chaining the result of the POOL method on the two THEMATICS
environment features; I then combine the results and rank order the list of all residues based on their
probability of being in the active site.
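
The merging step just described can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the residue identifiers and probability values are made up, the two dictionaries stand in for the chained POOL estimates, and breaking ties by identifier is an assumption made only to keep the ordering deterministic.

```python
def merge_and_rank(ionizable_probs, non_ionizable_probs):
    """Merge two per-residue probability tables and rank all residues.

    Each argument maps a residue identifier to an estimated probability of
    being in the active site. How the estimates are produced (POOL on four
    or two THEMATICS features, then chaining) is outside this sketch.
    """
    merged = {**ionizable_probs, **non_ionizable_probs}
    # Highest probability first; ties broken by identifier (an assumption).
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical estimates for one protein.
ionizable = {"HIS95": 0.62, "GLU165": 0.48, "LYS12": 0.05}
non_ionizable = {"GLY171": 0.21, "ALA30": 0.01}
ranked = merge_and_rank(ionizable, non_ionizable)
```

In this made-up example, the non-ionizable residue GLY171 is ranked above the ionizable residue LYS12, which is the effect that merging the two lists is meant to allow.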

5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if applicable.

So far, all the information I have used in protein active site prediction is derived only from the 3D
structures of the proteins; in other words, no sequence comparison information has been used. As
discussed in Chapter 2, this makes the method applicable to proteins with few or no sequence
homologues; indeed, many of the newly discovered protein structures from Structural Genomics projects
have few or no sequence homologues. However, it is generally true that active site residues tend to be
more conserved than other residues, with only a few exceptions. Based on this observation, I expect that
incorporating sequence conservation into the system, when the information is available, may give better
performance. I put this hypothesis to the test in this section, and the results are presented in Figure 5.4.

Figure 5.4 shows the ROC curves obtained with different features on the 160-protein set, with all residues
included as in Section 5.5.3. I used the 160-protein set instead of the 64-protein set because not all
proteins have reliable sequence conservation information, as explained below. With the 64-protein set,
the number of proteins with reliable sequence conservation information might not be large enough for
significance testing. In addition, using the 160-protein set shows whether the performance improvement
from different feature sets is consistent across test sets. As demonstrated in Figure 5.2, chaining the
POOL results together works better than applying POOL directly to high-dimensional features, so I use
chaining to combine the THEMATICS, cleft and sequence conservation features.

As pointed out earlier, not all proteins have enough homologues for a reliable sequence conservation
analysis. In this study, I use ConSurf for the sequence analysis. ConSurf requires more than five
homologues to perform the conservation analysis, and the result is claimed to be more reliable when the
number of homologues is larger than 10. I therefore use the conservation information only when the
protein has more than 10 homologues. For proteins without enough homologues (28 out of 160 in this
case), I assign 1 as the probability estimate from the conservation POOL table. Since residues are ranked
only against other residues within the same protein, this treatment is valid and does not affect the results
for other proteins in the set.
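
The reason a constant conservation estimate leaves the within-protein ranking untouched can be illustrated with a small sketch. The chaining rule itself is defined earlier in this dissertation and is not reproduced in this section; the sketch below assumes, purely for illustration, that chaining multiplies the per-feature estimates. The function names and all numerical values are hypothetical.

```python
from math import prod

MIN_HOMOLOGUES = 10  # conservation is used only above this count (from the text)


def chain(*estimates):
    """ASSUMPTION: chaining is modelled here as a plain product."""
    return prod(estimates)


def chained_scores(thematics, cleft, conservation, n_homologues):
    """Combine per-residue estimates for one protein."""
    scores = {}
    for residue in thematics:
        if n_homologues > MIN_HOMOLOGUES:
            c = conservation[residue]
        else:
            c = 1.0  # constant for every residue of this protein
        scores[residue] = chain(thematics[residue], cleft[residue], c)
    return scores


def rank(scores):
    """Residue identifiers ordered from highest to lowest score."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]


# Hypothetical estimates for three residues of one protein.
t = {"R1": 0.6, "R2": 0.3, "R3": 0.1}
g = {"R1": 0.5, "R2": 0.9, "R3": 0.2}
c = {"R1": 0.1, "R2": 0.2, "R3": 0.9}

few = rank(chained_scores(t, g, c, n_homologues=4))    # conservation ignored
many = rank(chained_scores(t, g, c, n_homologues=25))  # conservation used
```

Under this assumption, the ranking for a protein with too few homologues is exactly the ranking obtained from THEMATICS and cleft information alone, because every residue of that protein is scaled by the same constant.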

There are four curves for comparison (see footnote 6): CHAIN(T) uses the four THEMATICS features for
ionizable residues and the two THEMATICS features for non-ionizable residues; CHAIN(T, G) uses both
THEMATICS and the cleft feature; CHAIN(T, C) uses both THEMATICS and the sequence conservation
information; and CHAIN(T, G, C) uses all three kinds of features by chaining.

Figure 5.4 shows that CHAIN(T) is dominated by the other three curves, suggesting that including either
the cleft or the sequence conservation features, or both, can improve performance. Both CHAIN(T, C)
and CHAIN(T, G, C) dominate CHAIN(T, G), suggesting that incorporating sequence conservation
information improves performance more than incorporating cleft information alone. Surprisingly,
CHAIN(T, C) and CHAIN(T, G, C) have very similar performance, although in the recall range below
80%, CHAIN(T, G, C) performs slightly better.

Footnote 6: I use the notation CHAIN(T) for consistency. This could also have been notated more simply as “T”.

The MAS values for CHAIN(T, G, C), CHAIN(T, C), CHAIN(T, G) and CHAIN(T) are 0.925, 0.923, 0.907
and 0.899, respectively. Table 5.2 lists, as the first number in each cell, the p-value of the Wilcoxon
signed-rank test on the AveS measurements under the null hypothesis that the method in the row does
not outperform the method in the column. The number in parentheses is the number of proteins for which
the method in that row outperforms the method in that column.

[Figure 5.4: plot not reproduced. x-axis: False Positive Rate (0 to 0.25); y-axis: Recall (0 to 1); curves: CHAIN(T), CHAIN(T, G), CHAIN(T, C) and CHAIN(T, G, C).]

Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues. The method using chaining to combine THEMATICS, geometric and sequence conservation features has the best performance.

                  CHAIN(T)         CHAIN(T, G)      CHAIN(T, C)
CHAIN(T, G, C)    <0.0001 (115)    <0.0001 (95)     <0.0001 (103)
CHAIN(T, C)       <0.0001 (101)    0.0008 (89)
CHAIN(T, G)       <0.0001 (101)

Table 5.2 Wilcoxon signed-rank tests between methods shown in Figure 5.4.

5.5.5 Recall-filtration ratio curves.

The results reported so far are all in the form of ROC curves. As discussed earlier, my analysis is not

committed to any particular cutoff or rule to select the active site residues from the top of the list. For

instance, users can select the top k residues in the ranked list of residues ordered by the estimated

probability of being in the active site, or they can select the residues with an estimated probability of

being in the active site greater than a certain cutoff value, or they can select the top p percent of the

residues in the ranked list. Among the three methods listed above, I think the third probably would be

preferred in general since it is less susceptible to variation of protein size and availability of sequence

conservation information. In this case, RFR (recall-filtration ratio) curves may be more useful

than ROC curves.

Since the main purpose of the RFR curve is to guide users in selecting an appropriate cutoff, I only report
the results for CHAIN(T, G, C), which performs the best among all the methods. The test was performed
on all residues ranked by their probability of being in the active site, and I averaged the recall at each
filtration ratio value to obtain the averaged RFR curve. For the curve shown in Figure 5.5, for example,
choosing the top 10% of the residues from the ranked list gives an average recall of 90%, while choosing
the top 5% gives an average recall of 79%.
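
The averaging behind an RFR curve can be sketched as below. This is an illustrative sketch only: the function names and the example data are hypothetical, each protein is represented simply as a list of annotation labels in ranked order, and the rule for converting a filtration ratio into a number of residues (rounding, with at least one residue selected) is an assumption that is not specified in the text.

```python
def recall_at_filtration(ranked_labels, ratio):
    """Recall when the top `ratio` fraction of a ranked list is selected.

    ranked_labels holds one boolean per residue, ordered from the highest
    to the lowest estimated probability; True marks an annotated
    active-site residue.
    """
    total_positives = sum(ranked_labels)
    if total_positives == 0:
        raise ValueError("protein has no annotated active-site residues")
    # Rounding rule is an assumption of this sketch.
    n_selected = max(1, round(ratio * len(ranked_labels)))
    return sum(ranked_labels[:n_selected]) / total_positives


def averaged_rfr(proteins, ratios):
    """Average the per-protein recall at each filtration ratio."""
    return [
        sum(recall_at_filtration(p, r) for p in proteins) / len(proteins)
        for r in ratios
    ]


# Two hypothetical proteins with 10 and 5 residues.
protein_a = [True, False, True] + [False] * 7
protein_b = [False, True, False, False, False]
curve = averaged_rfr([protein_a, protein_b], [0.2, 0.4, 1.0])
```

Because the recall is computed per protein before averaging, a small protein contributes as much to the curve as a large one, which is one reason a percentage cutoff is less sensitive to protein size than a fixed top-k cutoff.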

[Figure 5.5: plot not reproduced. x-axis: Filtration ratio (0 to 1); y-axis: Recall (0 to 1); curve: CHAIN(T, G, C).]

Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160-protein test set.

5.5.6 Comparison with other methods.

I also compare the CHAIN(T, G) and CHAIN(T, G, C) results with the results from some other
top-performing active site prediction methods, in particular Petrova’s method [39], Youn’s method [40],
and Xie’s geometric potential method [38]. All these methods use SVM. The first two use both sequence
conservation and 3D structural information, while Xie’s method uses structural information only.

The authors of the three methods report their results using measures different from those used in my
studies. Therefore, I evaluate CHAIN(T, G) and CHAIN(T, G, C) on the 160-protein test set and compare
with their results in their form of analysis. Because the performance measures are not obtained from the
same dataset, the results are not strictly comparable, but qualitatively, the comparisons below give a good
idea of the relative performance.

In order to compare my results with theirs at a similar recall level, I used a 4% filtration ratio cutoff in the

POOL method to compare with Youn’s method, and a variable filtration ratio cutoff to compare with

Petrova’s method. Note that while our test set consists of proteins with a wide variety of different folds

and functions, Youn’s results are reported for sets of proteins with common fold or with similar structure

and function. Performance on the more varied set is a much more realistic test of predictive capability on

proteins of unknown function, particularly novel folds. Performance on a set of structurally or

functionally related proteins is also substantially better than performance on a diverse set, as one would

expect and as has been demonstrated by Petrova and Wu [39].

Youn’s method [40] achieved about 57% recall at 18.5% precision with an MAS (AUC) of 0.929, using
both sequence conservation and structural information, when training and testing on proteins from the
same family; however, its performance dropped when training and testing were performed on proteins at
the superfamily and fold levels. In comparison, CHAIN(T, G, C) with a preset 4% filtration ratio cutoff
achieves an averaged recall of 64.68% with an averaged precision of 19.07%, and an MAS (AUC) of
0.925, for all 160 proteins in the test set, which consists of proteins from completely different folds and
classes. Without the use of sequence conservation, CHAIN(T, G) achieves an averaged recall, averaged
precision and AUC of 61.74%, 18.06% and 0.907, respectively. Our chained POOL method thus does
about as well as Youn’s method even when we exclude conservation information, and a little better with
conservation information included, even though our diverse test set is one for which good performance is
most difficult to achieve. The complete results are shown in Table 5.3.

Petrova and Wu [39] measured the performance of their method globally, using all residues in all proteins,
instead of computing the recall, accuracy, false positive rate and MCC values for each protein and then
averaging them. Like Youn’s method, their method uses both sequence conservation information and 3D
structural properties as input to the SVM. They use a dataset, which they call the benchmarking dataset,
that contains a wide variety of proteins that are dissimilar in sequence, are structurally diverse, and span
the full range of E.C. classes of chemical functions. This dataset constitutes a fair test of how a method
will perform on structural genomics proteins of unknown function for which sequence conservation
information is available. Their method achieves a global residue-level recall of 89.8% with an overall
predictive accuracy of 86%, an MCC of 0.23 and a 13% false positive rate on a subset of 79 proteins from
the CatRes database. Tested on the 72 proteins from their set that also appear in my 160-protein set,
CHAIN(T, G, C) with a 10% filtration ratio cutoff achieves a residue-level recall of 88.6% at an overall
predictive accuracy of 91.0%, with an MCC of 0.28 and a 9% false positive rate. The corresponding
residue-level recall, overall predictive accuracy and MCC for CHAIN(T, G) are 85.2%, 91.0% and 0.27,
respectively. The results for Petrova’s method and for the present CHAIN methods with different
filtration ratio cutoffs are shown in Table 5.4, and the ROC curves in Figure 5.6. CHAIN(T, G, C)
achieves comparable recall with somewhat better accuracy and a lower false positive rate. CHAIN(T, G)
performs almost as well, even without conservation information.
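
The global, residue-level style of evaluation described above can be sketched as follows. This is a hedged illustration, not the evaluation code behind Table 5.4: the function names and the example proteins are hypothetical, and the rounding rule for turning a filtration ratio into a number of selected residues is an assumption. The metric definitions themselves are the standard ones for recall, accuracy, false positive rate and the Matthews correlation coefficient (MCC).

```python
import math


def pooled_counts(proteins, ratio):
    """Pool confusion counts over all proteins at one filtration ratio.

    Each protein is a list of booleans in ranked order; True marks an
    annotated active-site residue. Residues in the top `ratio` fraction
    of each list are treated as predicted positives.
    """
    tp = fp = tn = fn = 0
    for ranked_labels in proteins:
        n_selected = max(1, round(ratio * len(ranked_labels)))  # assumption
        selected = ranked_labels[:n_selected]
        rest = ranked_labels[n_selected:]
        tp += sum(selected)
        fp += len(selected) - sum(selected)
        fn += sum(rest)
        tn += len(rest) - sum(rest)
    return tp, fp, tn, fn


def global_metrics(tp, fp, tn, fn):
    """Residue-level recall, accuracy, false positive rate and MCC."""
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return recall, accuracy, fpr, mcc


# Two hypothetical proteins with 10 and 20 residues.
proteins = [
    [True] + [False] * 9,
    [False, True] + [False] * 18,
]
counts = pooled_counts(proteins, 0.10)
recall, accuracy, fpr, mcc = global_metrics(*counts)
```

Pooling the counts before computing the metrics weights every residue equally, so large proteins dominate; averaging per-protein values, as in the comparison with Youn’s method, weights every protein equally instead.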

For Xie’s method [38], a purely 3D-structure-based method, performance was reported in the following
fashion: the method achieves at least a 50% recall with a false positive rate of 20% or less for 85% of the
proteins analyzed. The performance of the CHAIN(T, G) and CHAIN(T, G, C) methods measured in the
same way is listed in Table 5.5. Xie’s method should be compared against CHAIN(T, G), because neither
method uses conservation data. CHAIN(T, G) achieves at least a 50% recall with a false positive rate of
20% or less for 96% of all proteins.
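
One way to read this criterion for a ranked list is sketched below: a protein counts as a success if some cutoff on its ranked list reaches the required recall without exceeding the allowed false positive rate. Whether this matches the exact procedure behind Table 5.5 is an assumption of the sketch, and the function names and example data are hypothetical.

```python
def meets_criterion(ranked_labels, min_recall, max_fpr):
    """True if some cutoff gives recall >= min_recall with FPR <= max_fpr.

    ranked_labels holds one boolean per residue in ranked order; True
    marks an annotated active-site residue.
    """
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    if positives == 0:
        return False
    tp = fp = 0
    for is_active in ranked_labels:
        if is_active:
            tp += 1
        elif negatives and (fp + 1) / negatives <= max_fpr:
            fp += 1
        else:
            break  # one more negative would exceed the allowed rate
    # tp is now the most positives reachable within the FPR budget.
    return tp / positives >= min_recall


def fraction_meeting(proteins, min_recall, max_fpr):
    """Fraction of proteins that satisfy the criterion."""
    hits = sum(meets_criterion(p, min_recall, max_fpr) for p in proteins)
    return hits / len(proteins)


# Two hypothetical proteins, each with 10 non-active-site residues.
good = [True, False, True] + [False] * 9
poor = [False, False, False, True] + [False] * 7
share = fraction_meeting([good, poor], min_recall=0.5, max_fpr=0.2)
```

Walking down the list until the false positive budget is exhausted works because both recall and false positive rate can only grow as the cutoff moves down a ranked list.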

The results in the tables show that CHAIN(T, G), which uses only 3D structural information of proteins,
performs about as well as or better than these best-performing current active site prediction methods.
When additional sequence conservation information is available, CHAIN(T, G, C) performs still better.

Method / Data set                Sensitivity (%)    Precision (%)    AUC
Youn / Family                    57.02              18.51            0.9290
Youn / Superfamily               53.93              16.90            0.9135
Youn / Fold                      51.11              17.13            0.9144
CHAIN(T, G, C) / all proteins    64.68              19.07            0.925
CHAIN(T, G) / all proteins       61.74              18.06            0.907

Table 5.3 Comparison of sensitivity, precision, and AUC of CHAIN(T, G) and CHAIN(T, G, C) with Youn’s reported results for proteins in the same family, superfamily, and fold.

Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method. All measures are at the residue level.

Method              Filtration ratio cutoff    Recall    Accuracy    False positive rate    MCC
Petrova’s method    -                          89.8%     86%         13%                    0.23
CHAIN(T, G, C)      7%                         81.4%     90.9%       6.2%                   0.31
CHAIN(T, G, C)      8%                         85.2%     91.0%       7.1%                   0.31
CHAIN(T, G, C)      9%                         85.6%     91.0%       8.0%                   0.29
CHAIN(T, G, C)      10%                        88.6%     91.0%       9.0%                   0.28
CHAIN(T, G, C)      12%                        90.2%     89.1%       11%                    0.26
CHAIN(T, G, C)      15%                        91.9%     86.1%       14%                    0.23
CHAIN(T, G)         7%                         73.7%     93.7%       6.1%                   0.28
CHAIN(T, G)         8%                         77.1%     92.8%       7.0%                   0.28
CHAIN(T, G)         9%                         81.8%     91.9%       8.0%                   0.27
CHAIN(T, G)         10%                        85.2%     91.0%       9.0%                   0.27
CHAIN(T, G)         12%                        86.9%     89.0%       11%                    0.25
CHAIN(T, G)         15%                        89.4%     86.0%       14%                    0.22

[Figure 5.6: plot not reproduced. x-axis: False positive rate (0.06 to 0.14); y-axis: Recall (0.7 to 0.95); series: Petrova's method, CHAIN(T, G, C) and CHAIN(T, G).]

Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method.

Method            Recall ≥    False positive rate <    Achieved for
Xie               50%         20%                      85%
CHAIN(T, G, C)    50%         20%                      97%
CHAIN(T, G, C)    80%         20%                      84%
CHAIN(T, G, C)    60%         10%                      85%
CHAIN(T, G)       50%         20%                      96%
CHAIN(T, G)       80%         20%                      77%
CHAIN(T, G)       60%         10%                      81%

Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method. Each method achieves at least the specified recall rate with a false positive rate less than specified for the percentage of proteins in the last column.

5.5.7 Rank of the first positive.

The last result I present applies only to methods that generate a ranked list: the rank of the first true
positive in the list. This metric is useful for users who are interested in finding a few active site residue
candidates and who do not necessarily need to know all of the active site residues. Users could use the
list from the POOL method to guide their site-directed mutagenesis experiments by going down the
ranked list one residue at a time; once the first active site residue is found, it should be easier to find the
remaining active site residues by examining its neighbors. A histogram of the rank of the first active site
residue found by CHAIN(T, G, C) on the 160-protein set is shown in Figure 5.7. The median rank of the
first true positive in the 160-protein set with the CHAIN(T, G, C) method is two. For 46 out of 160
proteins, the first residue in the ranked list is an annotated active site residue. For 65.0%, 81.3% and
90.0% of the 160 proteins, the first annotated active site residue is located within the top 3, 5 and 10
residues of the ranked list, respectively. Such measurements are not easily made for binary classification
methods.
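
The statistics in this section can be computed from ranked lists along the following lines. This is a minimal sketch with hypothetical function names and made-up data; each protein is represented as a list of annotation labels in ranked order, and proteins with no annotated residue are simply skipped, which is an assumption of the sketch.

```python
from statistics import median


def first_positive_rank(ranked_labels):
    """1-based rank of the first annotated active-site residue, or None."""
    for rank, is_active in enumerate(ranked_labels, start=1):
        if is_active:
            return rank
    return None


def summarize_first_ranks(proteins, tops=(1, 3, 5, 10)):
    """Median first-positive rank and the share of proteins within top k."""
    ranks = [first_positive_rank(p) for p in proteins]
    ranks = [r for r in ranks if r is not None]  # skip unannotated proteins
    within = {k: sum(r <= k for r in ranks) / len(ranks) for k in tops}
    return median(ranks), within


# Three hypothetical proteins whose first positives sit at ranks 1, 2 and 4.
proteins = [
    [True, False, False, False],
    [False, True, False, True],
    [False, False, False, True],
]
median_rank, within = summarize_first_ranks(proteins)
```

The `within` dictionary corresponds to the cumulative distribution shown in the lower panel of Figure 5.7, evaluated at a few chosen ranks.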

[Figure 5.7: plots not reproduced. Top panel: "Rank of the first annotated active site residue in the list"; x-axis: Rank of the first true positive (1 to 10, and 10+); y-axis: Percentage of proteins (0 to 35). Bottom panel: "Cumulative distribution of the rank of the first annotated active site residue in the list"; x-axis: Rank (1 to 10); y-axis: Percentage of proteins (0 to 100).]

Figure 5.7. Histogram of the first annotated active site residue. Top: rank of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160-protein set. Bottom: the cumulative distribution of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160-protein set.

5.6 Discussion.

In this chapter, I presented the application of the POOL method using THEMATICS plus some other

features for protein active site prediction.

I started with the application of the POOL method just on THEMATICS features, with features similar to

those I used before in the SVM method, as well as those used in Ko and Wei’s statistical analysis [24, 54].

My results show that the POOL method outperforms all of the earlier THEMATICS methods with no

training data cleaning and no clustering after the classification. This suggests that by emphasizing the

underlying THEMATICS principles, the POOL method makes better use of the training data and

automatically limits the adverse effect that noise in the training data set might have caused in the other

methods. The results also supply further evidence that the THEMATICS principles are valid in reality. In

some sense, this opens another possible application for the POOL method, to verify the underlying

monotonicity hypothesis, which could be worth further investigation in the future.

I also tested different ways of incorporating additional features into the learning system. Not surprisingly,
the results show that, to improve performance, the right features have to be incorporated in the right way.
In addition to using CastP to get the size rank of the cleft in which a residue resides, I also tried area of
solvent accessibility, residue type and some other features. Unfortunately, these extra features did not
help the performance of the POOL method. One possible reason is overfitting, or correlation between
these features and other features already present in the system. Even for features that were found to
improve performance, how they are incorporated matters. The results show that chaining the results from
separate POOL estimates is better than simply combining all the available features into one big POOL
table of very high dimension. As mentioned earlier, the reason might be overfitting: combining features
into a high-dimensional POOL table causes the number of probabilities to be estimated to grow
exponentially, while the training data can only increase linearly in most cases. In other words, the high
dimensionality makes the table too sparse, and the probability estimates less accurate.
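
A small arithmetic illustration of this growth is given below. It assumes, purely for illustration, that each feature is discretized into the same number of bins and that a table needs one probability per cell; the bin count, the function names and the treatment of the cleft and conservation information as one-dimensional features are assumptions of the sketch, while the four THEMATICS features come from the text.

```python
def joint_table_cells(dimensions, bins):
    """Cells in one table that spans every feature at once."""
    return bins ** sum(dimensions)


def chained_table_cells(dimensions, bins):
    """Total cells when each feature group gets its own table."""
    return sum(bins ** d for d in dimensions)


# Feature groups: THEMATICS (4 features), cleft (1) and conservation (1).
feature_dims = [4, 1, 1]
BINS = 10  # hypothetical number of bins per feature

joint = joint_table_cells(feature_dims, BINS)
chained = chained_table_cells(feature_dims, BINS)
```

With these illustrative numbers, the single joint table has about a hundred times as many cells as the three separate tables together, while the amount of training data is the same in both cases.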

I also extended the application of THEMATICS to all residues, not just ionizable residues, in a natural

way and showed that it is effective. Although the performance for predicting non-ionizable residues is

not as good as the performance for predicting ionizable ones, this extension does provide a way to

combine features from THEMATICS, which by itself can only be applied to ionizable residues directly,

with some other features, making the comparison with the performance of other methods more accurate

and fair.

The incorporation of sequence conservation information does improve the prediction when there are

enough homologues with appropriate similarities. The POOL method gives us a means for easily utilizing

this information when it is available, while not affecting the training and classification when it is not.

When comparing with other methods, especially if the other methods use binary classification instead of a

ranked list, I have to commit to a specific cutoff value and turn my system into a binary classification

system. The results in Section 5.5.6 clearly show that the POOL method using THEMATICS and

geometric features achieves equivalent or better performance than the other methods in comparison, even

in cases where their methods are tested on very special groups of proteins. This makes my method more

widely applicable to proteins with few or no sequence homologues, such as some Structural Genomics

proteins, while both Youn’s and Petrova’s methods need sequence alignments from homologues.
The performance of Youn’s and Petrova’s methods will degrade significantly when sequence conservation
information is not available. However, the performance of my CHAIN(T, G) method will not degrade in the

absence of sequence conservation information. The results also show that with additional sequence

conservation information, when available, the performance can be further improved.

Interestingly, when I compare the performance of CHAIN(T, G) and CHAIN(T, G, C) in Figure 5.4, it is
apparent that the addition of the conservation information does improve the performance a little, but not
to the extent observed previously for sequence-structure methods. Typically, the conservation
information is the most important input feature, and without it performance is substantially worse [18].
This suggests that the 3D-structure-based THEMATICS features are quite powerful compared with other
3D-structure-based features.

When looking at the recall and false positive rates of the results from all the protein active site prediction

methods, one must keep in mind that the annotation of the catalytic residues in the protein dataset is never

perfect. Since most of the labeling comes from experimental evidence, some active site residues are not

labeled as positive simply because there is no experiment designed and carried out to verify the role of

that specific residue. Since I use the CatRes/CSA annotations as the sole criterion for evaluating
performance, in order to keep the comparisons consistent, as mentioned several times in this dissertation,
the resulting false positive rate may be higher than in reality. There is evidence supporting the
functional importance of some residues that are not labeled as active site residues in the CatRes/CSA
database [24, 58] but that rank high in the list from the POOL method and are classified as positive by
THEMATICS-SVM and THEMATICS-statistical analysis as well.

Although I evaluated the POOL method performance by using filtration ratio values as a cutoff, it is just

for the purpose of comparing with other protein active site prediction methods that use a binary

classification scheme. The ranked list of residues based on their probability of being in the active site

contains much more information than traditional binary classification labeling. The rank of the first

annotated positive residue analysis in Section 5.5.7 shows just one application of the extra information

contained in a ranked list rather than a traditional binary label. There are many possible measurements of

performance depending on the actual application by users, and in turn many possible applications that can

benefit from using a ranked list form. It is noteworthy that P-Cats [21] uses a k-nearest neighbor method
to estimate the probability of a residue being in an active site of a protein, which in principle could serve
as the basis for a ranked list, but the method uses the probability estimates only to assign binary labels:
residues with probability larger than 0.50 are labeled as positive and the others as negative. Although the
online server for the method [76] does report the probability estimates along with the corresponding
binary labels, the potential benefits of using a ranked list instead of binary labeling have not been fully
addressed either in the paper or online.

In conclusion, applying the POOL method with THEMATICS and other features appears to yield the best
protein active site prediction system found to date, and it provides more information than other active
site prediction methods.

Chapter 6

Summary and Conclusions.

Here, I summarize the work I have presented in this dissertation.

This dissertation starts with an introduction to the central problem I try to solve: using machine learning
methods to build an automated system that can predict active sites from protein structure alone, and that
can further improve the prediction by using sequence conservation when that information is available.

In the second chapter, the dissertation briefly surveys the background of protein active site prediction and
the machine learning techniques, especially the probability-based learning techniques, that form the
foundation of the POOL method. This chapter also introduces THEMATICS, an effective and accurate
protein active site predictor using only structural information of proteins, which forms another foundation
of this work.

The third chapter reports my work on using SVM with structure-only information, which achieved better
performance than not only other competing methods using all kinds of features but also THEMATICS
with statistical analysis methods. At the end of this chapter, I explain the limitations of using traditional
machine learning techniques in the form of classification to solve the protein active site prediction
problem.

Chapter Four proposes a novel POOL method as an approach to solving a class of problems that involve
estimating class probabilities under multi-dimensional monotonicity constraints. In this chapter, I
describe the properties of such problems and a framework under which they can be described and solved.
I present an algorithm for solving such problems and a mathematical proof that the solution is indeed
optimal for both the sum-of-squared-error and the maximum-likelihood criteria.

In Chapter Five, the protein active site prediction problem is reframed into a ranked list problem from a

standard binary classification problem. The dissertation presents the application of using the POOL

method to estimate the class probabilities of residues being in the active site under class probability

monotonicity assumptions. Using the POOL method and THEMATICS, I achieved better performance
than with the THEMATICS-SVM system. After incorporating more features, including sequence
conservation information, and extending the method from ionizable residues only to all residues, the
POOL method achieved the best performance so far, compared with both the earlier THEMATICS
methods and other existing methods using SVMs and all kinds of structural and sequence information
from proteins.

My work has established the following two claims:

THEMATICS is an effective and accurate protein active site predictor and can be automated by different

machine learning techniques. Incorporating more features makes it even more effective and accurate in

protein active site prediction.

The POOL method is an efficient way to estimate probability with maximum likelihood under the multi-

dimensional monotonicity constraints. It provides a platform allowing probability estimates to be easily

combined in a simple manner. It can be used in protein active site prediction and potentially many other

applications where monotonicity constraints play a role, such as disease detection from markers in the

blood, risk assessment and many more.

6.1 Contributions.

Listed below are the novel contributions to using machine learning techniques with THEMATICS in

protein active site prediction that have been presented in this dissertation.

Use SVM with THEMATICS in active site prediction. The work of using SVM with THEMATICS in

protein active site prediction presented in this dissertation is the first successful approach that uses
machine learning techniques to automate the THEMATICS method. It outperforms both the
THEMATICS-statistical method and other 3D structure based protein active site prediction
methods.

Turn the protein active site prediction problem into a probability based ranked list problem. This

dissertation frames the protein active site prediction in the form of a ranked list problem, instead of a

traditional binary classification problem. Although the P-Cats method [21] also estimates the probability of

residues being in the active site using a k-nearest neighbor method, it is still framed as a binary

classification problem. This dissertation emphasizes the benefits of using the ranked list scheme, which

gives users more control in setting their own cutoff thresholds. The probability estimates behind the

ranked list make possible the next contribution, combining the results from different methods.

Combine probability estimates by applying the POOL method on different features. Since all the

results in the POOL method are essentially probability estimates, it becomes possible to utilize more

features using the chaining technique. This makes the method less susceptible to the problem of

analyzing sparse data in high dimensionality.

Introduce monotonicity constraints into machine learning. This is a successful approach that enforces

a prior belief of the data in the machine learning task, and the results indicate that if the prior belief is

indeed correct, the performance of the learning system can improve by the incorporation and the

enforcement of this knowledge.

Develop a novel POOL method. This dissertation frames the problem of assigning probabilities under
multi-dimensional monotonicity constraints with minimum sum of squared error (SSE) as a special form
of convex optimization problem and develops the POOL algorithm to solve it more efficiently and
accurately than solving a general convex optimization problem.

Prove that minimizing SSE maximizes likelihood in the present problem. This dissertation proves,
using the K.K.T. conditions, that the probability assignments minimizing SSE also maximize the
likelihood under the multi-dimensional monotonicity constraints.

Use the POOL method with THEMATICS in protein active site prediction. This dissertation presents

a practical application of the POOL method by applying the method with THEMATICS in protein active

site prediction. It outperforms all other protein active site prediction methods to date.

Use the environment feature to incorporate influences of nearby residues in the protein active site

prediction. This dissertation introduces an environment feature (see 5.2) as a new way to incorporate the

influences from nearby residues in the active site prediction. This gives two benefits: first, it makes the

THEMATICS method applicable to non-ionizable residues; second, it avoids the extra step of clustering

after the classification process as used in THEMATICS-statistical analysis and the THEMATICS-SVM

study.

6.2 Future research.

There are many possible ways to extend the research described in this dissertation. I have mentioned

some of them in earlier sections. Here I outline some directions in which the application of the POOL
method with THEMATICS can be extended to further improve the performance of protein active site
prediction.

One place for further research is the use of the ranked list from the POOL method. As mentioned earlier, the POOL method does not commit to any rule or value for the cutoff, but users may still want to know “exactly” which residues are in the active site. Although I suggested the filtration ratio as a possible cutoff and used it for comparison with other methods, it is still coarse, and I believe it can be improved. One possibility is to look at the actual probability estimates the POOL method gives: their scale, and the differences between adjacent residues in the ranked list, may give some clue about where to place the cutoff. Another approach is to use machine learning as the first screening step and to involve human experts in refining the predictions. Experts can look at the 3D structure of the protein to see where the residues near the top of the ranked list are located. Residues in some areas, such as those in clefts near the surface and close to each other, are more likely to be in the active site than others, such as those that are deeply buried or isolated from the other residues near the top of the ranked list. Of course, one can also feed the result from the POOL method, either the rank or the raw probability estimates, into another machine learning system to further improve the performance. Most likely, some normalization of these features will be needed if cross-protein training and testing is used.
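Two of the ideas in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method evaluated in this dissertation: `largest_gap_cutoff` places the cutoff at the largest drop between adjacent probability estimates in the ranked list, and `rank_percentile` is one possible within-protein normalization for cross-protein training. Both function names and the `min_keep` and `max_keep` parameters are hypothetical.

```python
from bisect import bisect_right


def largest_gap_cutoff(probabilities, min_keep=1, max_keep=None):
    """Return how many top-ranked residues to keep.

    The cut is placed at the largest drop between adjacent probability
    estimates in the ranked list (an assumed heuristic; ties go to the
    shorter list).
    """
    ranked = sorted(probabilities, reverse=True)
    n = len(ranked)
    if n <= min_keep:
        return n
    upper = n - 1 if max_keep is None else min(max_keep, n - 1)
    best_k, best_gap = min_keep, -1.0
    for k in range(min_keep, upper + 1):
        gap = ranked[k - 1] - ranked[k]  # drop after the k-th ranked residue
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


def rank_percentile(scores):
    """Map each score to the fraction of residues in the same protein
    whose score is less than or equal to it (a value in (0, 1])."""
    ordered = sorted(scores)
    n = len(ordered)
    return [bisect_right(ordered, s) / n for s in scores]


# Hypothetical probability estimates for six residues of one protein.
example = [0.90, 0.85, 0.80, 0.20, 0.15, 0.10]
keep = largest_gap_cutoff(example)
```

For the hypothetical list above the largest drop lies between 0.80 and 0.20, so three residues are kept. Rank percentiles depend only on the ordering within a protein, which is why they are a natural candidate when the raw estimates of different proteins are not on a comparable scale.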

Another possible area for improving performance in protein active site prediction is feature selection. Some of the methods I compared against in Section 5.5.6 use many more features. Although using more features does not necessarily give better performance, it is worth exploring the use of more features, including both simple ones extracted directly from the 3D structures or sequences and more sophisticated ones, such as the four THEMATICS features I used or even the results from other machine learning systems. These features can be fed either to the POOL method or to another learning system along with the result of the POOL method.

I will mention one more area where further research is needed, although it is beyond the scope of protein active site prediction. Once the active site of a protein is known, the natural next question is what function the active site performs; this is the next step in function prediction after active site prediction. I believe that the exact shape of the THEMATICS titration curve of a given residue, and the shapes of the titration curves of the residues within a given region, may give some clues about what class of reaction the active site catalyses. In principle, this problem might also be solved by machine learning. It is clearly a very challenging and rewarding task.


Appendices

Appendix A. The training set used in THEMATICS-SVM

Name EC Classification PDB ID

Acetylcholinesterase (E.C. 3.1.1.7) 1ACE

3-Ketoacyl-CoA Thiolase (E.C. 2.3.1.16) 1AFW

Ornithine carbamoyltransferase (E.C. 2.1.3.3) 1AKM

Glutamate racemase (E.C. 5.1.1.3) 1B74

Alanine racemase (E.C. 5.1.1.1) 1BD0

Adenosine kinase (E.C. 2.7.1.20) 1BX4

Subtilisin Carlsberg (E.C. 3.4.21.62) 1CSE

Micrococcal nuclease (E.C. 3.1.31.1) 1EY0

Oxalate oxidase (E.C. 1.2.3.4) 1FI2

DNA- Lyase (E.C. 4.2.99.18) 1HD7

2-amino-4-hydroxy-6-hydroxymethyl-dihydropteridine pyrophosphokinase (E.C. 2.7.6.3) 1HKA

Colicin E3 Immunity Protein (E.C. 3.1.21.-) 1JCH

L-lactate dehydrogenase (E.C. 1.1.1.27) 1LDG

Papain (E.C. 3.4.22.2) 1PIP

Mannose-6-phosphate isomerase (E.C. 5.3.1.8) 1PMI

Pepsin (E.C. 3.4.23.1) 1PSO


Triosephosphate Isomerase (E.C. 5.3.1.1) 1TPH

Aldose Reductase (E.C. 1.1.1.21) 2ACS

HIV-1 Protease (E.C. 3.4.23.16) 2AID

Mandelate Racemase (E.C. 5.1.2.2) 2MNR


Appendix B. The test set used in THEMATICS-SVM

The following table gives the testing results for 64 CatRes/CSA proteins. Bold indicates residues that are CatRes/CSA active, ionizable, and correctly predicted by the SVM in a cluster of ionizable residues only. Bold italic indicates residues that are CatRes/CSA active and missed in the SVM result, but found in our cluster when neighbors within 6Å are included. Underline indicates CatRes/CSA active residues missed by both criteria. The notation []ab means that symmetric clusters appear in both chain a and chain b. [XXXab] means that residue XXX of chain a and residue XXX of chain b are in the same cluster. In the recall and filtration ratio columns, the number to the left of “;” is the percentage obtained from the SVM and clustering on ionizable residues only, and the number to the right is the result when neighbors within 6Å are included.
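As a reading aid for the recall and filtration ratio columns, the following Python sketch recomputes the SVM-only recall of the 1AL6 row from its cluster notation. It is an illustration only: the helper names are hypothetical, chain suffixes are simply dropped, and the filtration ratio is taken here to be the percentage of a protein's residues that are reported positive, which is an assumption about the definition used in the body of the dissertation.

```python
import re

# One-letter amino acid code followed by the sequence number, e.g. "H274".
RESIDUE = re.compile(r"[A-Z]\d+")


def residues_in_clusters(cluster_text):
    """Collect residue labels from the bracketed cluster notation,
    ignoring chain suffixes such as 'a', 'b' or ']ab'."""
    return set(RESIDUE.findall(cluster_text))


def recall_percent(annotated, predicted):
    """Percentage of annotated active residues found among the predictions."""
    annotated = set(annotated)
    if not annotated:
        raise ValueError("no annotated residues")
    return 100.0 * len(annotated & set(predicted)) / len(annotated)


def filtration_ratio_percent(n_predicted, n_total):
    """Percentage of all residues reported positive (assumed definition)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_predicted / n_total


# SVM-only clusters reported for 1AL6 (citrate (si)-synthase) in the table.
clusters_1al6 = (
    "[H246a H249a R413a E420a H246b H249b R413b E420b] "
    "[Y231b H235b H238a H274a R329b D375b R401b R421a] "
    "[Y231a H235a H238b H274b R329a D375a R401a R421b] "
    "[D174 D257]a,b [Y158 Y167]a, b [Y318 Y330]a,b"
)
predicted_1al6 = residues_in_clusters(clusters_1al6)
recall_1al6 = recall_percent(["H274", "H320", "D375"], predicted_1al6)
```

Two of the three annotated residues (H274 and D375) appear in the clusters, so the recall rounds to 67, matching the value to the left of “;” in the 1AL6 row.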

PDB Code Protein Name CatRes Positive

SVM Reported Positive

Recall (%)

Filtration ratio (%)

Result SVM only / SVM-region

1AL6 Citrate (si)-synthase H274 H320 D375

[H246a H249a R413a E420a H246b H249b R413b E420b] [Y231b H235b H238a H274a R329b D375b R401b R421a] [Y231a H235a H238b H274b R329a D375a R401a R421b] [D174 D257]a,b [Y158 Y167]a, b [Y318 Y330]a,b

67;100 5;17 Correct/Correct

1APX Heme peroxidase R38 H42 N71

[E65 H68] [H163 D208] [H116 E244]

0;33 2;10 Incorrect/Partial correct

1AQ2 Phosphoenol pyruvate carboxykinase

H232 K254 R333

[R65 K70 C125 Y207 E210 K212 K213 H232 C233 K254 D268 D269 E270 H271 E282 C285 Y286 E311 R333 Y336] [C408 Y421 Y524] [E36 H146]


1B3R Adenosylhomocysteinase D130 K185 D189 N190 C194

[C52 H54 D129 D130 E154 E155 D189 C194 C227]



[Y220 Y256]

1B6B Aralkylamine N-acetyltransferase

S97 L111 H122 L124 Y168

[E54 H127]a [H120 H122]ab [H145 H174]a [E50 E52]b [E190 E192]b [C160 Y168]b

30;70 5;17 Partial correct/Correct

1B93 Methylglyoxal synthase H19 G66 D71 D91 H98 D101

[H19 D71 H98 D101]


1BG0 Arginine Kinase R126 E225 R229 R280 R309

[Y68 Y89 H90 H99 R124 R126 C127 D192 E224 E225 D226 R229 C271 R280 R309 E314 H315 R330 E335] [H185 H284] [Y145 R208] [Y134 K151]


1BOL Ribonuclease T2 H46 E105 H109

[Y116 Y121 E128 D129 D132 Y202] [E105 H109]


1BWD L-arginine:Inosamine-phosphate amidinotransferase

D108 R127 D179 H227 D229 H331 C332

[E9 E37 Y53 H87 H102 D103 C105 R107 D108 E130 D179 H227 D229 H278 C279 H331 C332]ab [D30 Y161]b


1BZY Hypoxanthine-guanine phosphoribosyltransferase

E133 D134 D137 K165 R169

[R100 K102 D193] [Y104 D137 K165] [E133 D134]


1CD5 Glucosamine-6-phosphate isomerase

D72 D141 H143 E148

[D72 E73 Y74 Y85] [H19 E198] [K124 Y128]

25;25 3;9 Partial correct/ Partial correct

1CHD Protein-glutamate methylesterase

S164 T165 H190 M283 D286

[H233 E235 H248] [H190 H256 D286]

40;80 3;10 Partial correct / Correct

1COY Cholesterol Oxidase E361 H447 N485

[R44 C57 R65 Y92 Y107 Y219 Y446] [Y21 K225] [R328 Y376]

0;33 2;10 Incorrect / Partial correct

1CQQ Picornain 3C H40 E71 G145 C147

[Y97 C100 K153]

0;0 2;5 Incorrect / Incorrect

1CTT Cytidine Deaminase

E104 [H102 E104 C129 H131 C132]ab [E138 H203 E229]ab [C217 Y252]b


1D0S Nicotinate-nucleotide-dimethylbenzimidazole phosphoribosyltransferase

E174 G176 K213 E317

[D69 H70 E174 D242 D263]

25;75 1;8 Partial Correct / Correct

1DAA Aminotransferase class-IV

K145 E177 L201

[R22a R98a H100a Y31b E32b H47b R50b Y88b Y114b K145b] [Y31a E32a H47a R50a Y88a K145a R98b] [C142 E177]ab [R22 R93]b

67;100 4;14 Correct / Correct

1DAE Dethiobiotin synthase T11 K15 K37 S41

[D10ab C151ab H154ab] [K15 K37]ab


1DB3 GDP-mannose 4,6-dehydratase

T132 E134 Y156 K160

[D13 Y128 Y177 C179 H186] [D105 E134 Y156 K160] [H228 D231 E344] [K2 Y26] [E315 E317]


1DL2 Mannosyl-oligosaccharide 1,2-alpha-mannosidase

E132 R136 D275 E435

[D86 E132 E207 D275 E279 K283 D336 H337 Y389 E399 E435 E438 E503 Y507 E526 H528] [K216 Y235 E290 Y293] [D52 H68]


1DNK Deoxyribonuclease I E78 H134 D212 H252

[E39 Y76 E78 H134 D168 D212 H252] [R31 Y32]


1DNP Deoxyribodipyrimidine photolyase

W306 W359 W382

[H44 E106 E109 C251 R278 R282 D372 D374]ab [R8 D10 D130]a [E318 Y365]ab [D409 D431]ab [D327 D331]b [D354 H453]ab [K353 Y464]a


1DZR dTDP-4-dehydrorhamnose 3,5-epimerase

H63 D170 [R60 H63 D84 H120 Y133 Y139 E144]


1E2A Histidine Kinase IIAlac H78 Q80 D81 H82

[H78abc D81abc H82abc] [E32 H94 H95]b [E32 H94 H95]c [E32 H95]a


1EBF Homoserine dehydrogenase D219 K223

[K117 E208 D210 D213 D214 D219 K223]


1FRO Lactoylglutathione lyase E172

[R37a E99a Y114a D165a D167a R122b H126b E172b] [R37c E99c Y114c D165c D167c H126d E172d] [R37b E99b Y114b D165b D167b H126a E172a] [Y70 H102 E107]abcd [H126c E172c E99d] [C138b K151b] [K150d K158d] [D165d D167d] [H115ab]

1GOG Galactose Oxidase C228 Y272 W290 Y495

[C228 Y272 H334 C383 Y405 H442 Y495 H496 H581] [H85 D166 H522 D524] [D324 D404]


1GRC Phosphoribosylglycinamide formyltransferase

N106 H108 S135 D144

[H54ab E70ab H73ab E74ab Y78ab] [H108 H137 D144]a [Y67 R90]ab [E44ab]


1HXQ UDP-glucose--hexose-1-phosphate uridylyltransferase

C160 H164 H166 Q168

[E182 E232 H281 H296 H298] [H115 E152 H164 H166] [R13 Y201 R211 R324] [E121 D267 H342] [D183 H292]


1I7D DNA topoisomerase III E7 K8 R330

[E7 K8 H44 D103 D105 E107 E114 D136 Y320 D332 C333 H340 C372 D379 H381 H382 Y410 D520 E525] [H100 D113] [E286 E458]


1IDJ Pectin lyase R176 R236 [D186 D217 D221 R236 K239 D242] [H247 E272] [H178 H210]


1KAS 3-oxoacyl-[acyl-carrier protein] synthase

C163 H303 H340 F400

[C163 H303 D311 E314 K335 H340 E349 C395]ab [H168ab H172ab D181ab] [E115ab H118ab]


1LBA T7 Lysozyme Y46 K128

[H17 C18 Y46 H47 H68 C80 H122 C130]

1MAS Purine nucleosidase D14 N168 H241

[H157ab E158ab D192ab H195a] [D10 D14 D15 E166 H241 D242]b [D10 D14 D15 H241 D242]a [E265ab] [K44 Y92]a


1MHY Methane Monooxygenase C151 T213 [D74a K78a H80a E89a R171a D172a C173a R179a K45b Y46b K49b R186b D190b D196b E199b C200b D270b D418b H439b H446b D450b E454b E460b E462b R463b Y464b E465b C466b H467b E471b R45c D51c Y54c E58c E62c H112c R116c K12c D133c] [H39a C57a Y99a H109a H110a D176a E71b D75b E111b E114b D143b E144b H147b H149b C151b D170b R172b R175b E209b D242b E243b H246b] [H166 K189 H252 D256 R264]a [K104 Y162 H168 R360]b [Y112a K116a Y290a K65b] [R98 Y288 Y351]b [D243 E246]a


1MPP Mucoropepsin D32 S35 Y75 D215

[D9 D11 E13 D32 D215]


1NID Nitrite Reductase (Copper Containing)

D98 H255

[H95a D98a H100a H135a C136a H145a H255c E279c H306c] [H95c D98c H100c H135c C136c H145c H255b E279b H306b] [H95b D98b H100b H135b C136b H145b H255a E279a H306a] [E180 D182 H245 E310]abc [D251abc] [H260 Y293 R296]abc

1NSP Nucleoside-diphosphate kinase

K16 N119 H122

[C16 H55 Y56 E58 R109 H122 E133]


1NZY 4-Chlorobenzoyl Coenzyme A Dehalogenase

F64 H90 G114 W137 D145

[D160abc E163abc E175abc D178abc] [D123b D145b Y150b R154b H218b C228c E232c] [C228a E232a H90c D145c Y150c R154c H218c] [D145a Y150a R154a H218a C228b E232b] [H138 D168]b


1PGS Peptide Aspartylglucosaminidase

D60 E206 [Y62 R80 Y116] [Y161 Y293] [Y183 K302] [H224 Y277]


1PJB Alanine dehydrogenase K74 H95 E117 D269

[E8 E13 R15 K72 K74 E75 Y93 H95 Y116 E117]


1PKN Pyruvate Kinase R72 R119 K269 T327 S361 E363

[R72 D112 E117 K269 E271 D295 E299] [C316 D356 C357 E385 R444 R466] [D224 D227] [K265 Y465]


1PNL Penicillin Acylase S1 A69 N241

[E152a K179a H192a D73b D74b D76b Y180b Y190b Y196b D252b] [D12a H18a Y31a D38a R39a Y96a Y33b H38b H520b] [R145a Y27b Y31b Y52b] [R263 K394]b [D484 D501]b [R479 Y528]b [E80 H123]b [Y33 K106]a


1PUD Queuine tRNA-ribosyltransferase (tRNA-guanine transglycosylase)

D102

[K55a D315b C318b C320b C323b E348b H349b] [D315a C318a C320a C323a E348a H349a K55b] [R38 R60 R362]ab [D102 D280]ab

1QFE 3-dehydroquinate dehydratase

E86 H143 K170

[E46 E86 D114 E116 H143 K170]ab [D50 H51]ab


1QPR Quinolinate phosphoribosyltransferase (decarboxylating) (Type II)

K140 E201 D222

[R139 R146 K150 H161 R162 K172 D173 E199 E201 D203 D222 E246] [D57 D80]


1QQ5 2-haloacid dehalogenase D8 T12 R39 N115 K147 S171 N173 F175 D176

[D8 R39 Y89 K147 Y153 D176] [Y95 R192]

56;100 4;13 Correct / Partial Correct

1QUM Deoxyribonuclease IV E261 [H69 D70 E94 E145 C177 D179 C181 H182 H216 E261]


1RA2 Dihydrofolate reductase I5 M20 D27 L28 F31 L54 I94

[K38 K109 Y111]


1UAE UDP-N-acetylglucosamine 1-carboxyvinyltransferase

N23 C115 D305 R397

[K22 R91 C115 R120 E188 E190 D231 E234 H299 R331 D369 R371 H394 R397 Y399 K405]


1UAG UDP-N-acetylmuramoylalanine--D-glutamate ligase

K115 N138 H183

[K115 H183 Y187 Y194 R302 K319 D346 K348 C413 R425]


1ULA Purine-nucleoside phosphorylase (type 1)

H86 E89 N243

[D134a H135a Y166a E201c E205c Y249c] [E201a E205a Y249a D134b H135b Y166b] [E201b E205b Y249b D134c H135c Y166c] [H257 E258]ac [H86 E89]abc


1UOK Oligo-1,6-glucosidase D199 E255 D329

[Y12 Y15 Y39 D60 D64 D98 H103 H161 D169 D199 E255 H283 D285 Y324 H328 D329 R332 R336 H356 Y365 E368 E369 D385 E387 D416 R419 Y464 R471 Y495 R497] [D21 D29]

100;100 6;20 Correct/ Correct


1VAO Vanillyl Alcohol Oxidase Y108 D170 H422 Y503 R504

[Y108 D170 Y187 R312 D317 R398 E410 E464 H466 Y503 R504] [D59 H61 H422 H506] [Y148 D167 H193] [H467 C470] [H313 Y440] [Y276 Y358]


1WGI Inorganic pyrophosphatase D117 [E48 K56 E58 Y89 E101 H107 D115 D117 D120 D147 D152 Y192]a [E48 K56 E58 E101 D115 D117 D120 D147 D152 Y192]b [E123 D159 D162]b [H87ab]

100;100 5;15 Correct / Correct

1YTW Protein Tyrosine Phosphatase

E290 D356 H402 C403 R409 T410

[C259 Y261 H270 Y301 H350 H402 C403]

33;67 2;11 Partial Correct / Correct

2CPO Heme Chloroperoxidase H105 E183

[E104 H105 D106 H107 D113 E161 D168 E183]


2HDH 3-hydroxyacyl-CoA dehydrogenase

S137 H158 E170 N208

[R209a Y214ab E217ab R220ab R224a Y242a H275a] [H158a E170b] [H158b E170a] [H266b H275b]


2HGS Glutathione Synthase R125 S151 G369 R450

[D24a H107b E214b R221b E224b R236b Y265b R267b Y270b E287b K293b C294b D296b Y432b] [H107a E214a R221a E224a R236a Y265a R267a Y270a E287a K293a C294a D296a Y432a D24b] [R125 D127 E144 K305 K364 E368 Y375 E425 R450 K452]a [R125 D127 E144 K305 K364 E368 Y375 E425 R450]b [H163 D469]ab


2JCW Superoxide dismutase H63 R143

[H46 H48 H63 H71 H80 D83 H120 D124]ab

50;100 5;16 Correct / Correct

2PFL Formate C-acetyltransferase W333 C418 C419 G734

[D74a H84a R141ab K142ab H144ab R174ab D180ab Y181ab R183a R218ab E221ab E222ab E225ab Y240a Y259a Y262ab K267a E368ab D413ab H498ab Y499ab H501ab D502ab D503ab Y504ab Y506ab E507ab H514ab R520ab Y594ab R595ab] [Y172 R176 R319 Y323 E400 C418 C419 R435 Y490 Y612 H704 R731 Y735]ab [H84 Y240 Y259 K267]b [C159 Y444]b [R316 D330]ab


2PLC 1-phosphatidylinositol phosphodiesterase

H45 D46 R84 H93 D278

[H45 D46 D82 E128 D204 H236 D278] [Y71 K115]


2THI Thiamine pyridinylase C113 E241 [Y16 Y50 D64 C113 E171 D175 Y222 Y239 E241 D265 Y270 D272]ab [E37ab D84ab E284ab] [Y323 Y333]b [Y180 R349]ab [H282ab]


8TLN Metalloproteinase M4 E143 H231

[D138 H142 E143 E166 D170 E177 D185 E190] [K18 D72 Y76 K182]



Appendix C. The 64 protein test set used in THEMATICS-POOL

PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues

1A05 1,4-Diacid decarboxylating dehydrogenase 1.1.1.85 Y140, K190, D222

1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988

1A4I Methylenetetrahydrofolate Dehydrogenase 1.5.1.5 K56

1A4S Aldehyde dehydrogenase (NAD+) / Betaine-aldehyde dehydrogenase

1.2.1.8 N166, E263, C297

1AFW Acetyl-CoA C-acyltransferase 2.3.1.16 C125, H375, C403, G405

1AKM Ornithine Carbamoyltransferase 2.1.3.3 R106, H133, Q136, D231, C273, R319

1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217, C483

1APX Heme peroxidase 1.11.1.11 R38, H42, N71

1B6B Aralkylamine N-acetyltransferase 2.3.1.87 S97, L111, H122, L124, Y168

1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309

1BRM Aspartate-beta-semialdehyde dehydrogenase 1.2.1.11 C135, Q162, H274

1BRW Pyrimidine-nucleoside phosphorylase 2.4.2.2 H82, R168, S183, K187


1BWD L-arginine:Inosamine-phosphate amidinotransferase

2.1.4.2 D108, R127, D179, H227, D229, H331, C332


1BZY Hypoxanthine-guanine phosphoribosyltransferase

2.4.2.8 E133, D134, D137, K165, R169

1C3J DNA beta-glucosyltransferase 2.4.1.27 E22, D100

1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485

1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147

1D0S Nicotinate-nucleotide-dimethylbenzimidazole phosphoribosyltransferase

2.4.2.21 E317


1D4A NAD(P)H dehydrogenase (quinone) 1.6.99.2 G149, Y155, H161

1D4C Succinate dehydrogenase (Fumarate reductase)

1.3.99.1 H364, R401, H503, R544

1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474

1DLI UDP-glucose 6-dehydrogenase 1.1.1.22 T118, E145, K204, N208, C260, D264

1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278

1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82

1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223

1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289

1FUG Methionine adenosyltransferase 2.5.1.6 H14, K165, R244, K245, K265, K269, D271

1G72 Methanol dehydrogenase 1.1.99.8 D297

1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444

1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495

1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85

1GRC Phosphoribosylglycinamide formyltransferase (GARTFase II)

2.1.2.2 N106, H108, S135, D144

1IVH Isovaleryl-CoA dehydrogenase 1.3.99.10 E254

1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407

1KAS 3-oxoacyl-[acyl-carrier protein] synthase 2.3.1.41 C163, H303, H340, F400

1L9F Monomeric sarcosine oxidase 1.5.3.1 H45, R49, H269, C315

1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259

1LXA UDP-N-acetylglucosamine acyltransferase 2.3.1.129 H125

1MBB UDP-N-acetylmuramate dehydrogenase 1.1.1.158 R159, S229, E325


1MHL Mammalian Myeloperoxidase 1.11.1.7 Q91, H95, R239

1MLA [Acyl-carrier protein] S-malonyltransferase

2.3.1.39 S92, H201, Q250

1MOQ Glucosamine--fructose-6-phosphate aminotransferase (isomerising domain)

2.6.1.16 E481, K485, E488, H504, K603

1MPY Extradiol Catecholic Dioxygenase 1.13.11.2 H199, H246, Y255

1NID Nitrite Reductase 1.7.99.3 D98, H255

1NSP Nucleoside-diphosphate kinase 2.7.4.6 K16, N119, H122

1OFG Glucose-fructose oxidoreductase 1.1.99.28 K129, Y217

1PFK Phosphofructokinase 2.7.1.11 G11, R72, T125, D127, R171

1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269

1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363


1PUD Queuine tRNA-ribosyltransferase (tRNA-guanine transglycosylase)

2.4.2.29 D102

1R51 Urate Oxidase 1.7.3.3 R176, Q228

1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94

1UAE UDP-N-acetylglucosamine 1-carboxyvinyltransferase

2.5.1.7 N23, C115, D305, R397

1ULA Purine-nucleoside phosphorylase (type 1) 2.4.2.1 H86, E89, N243

1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504

1VNC Chloride peroxidase 1.11.1.10 K353, H404

1XVA Glycine N-methyltransferase 2.1.1.20 E15

1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171

2ALR Mammalian Aldehyde Reductase 1.1.1.2 Y49, K79


2BBK Methylamine dehydrogenase 1.4.99.3 D32, W57, D76, W108, Y119, T122

2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183

2HDH 3-hydroxyacyl-CoA dehydrogenase 1.1.1.35 S137, H158, E170, N208

2JCW Superoxide dismutase 1.15.1.1 H63, R143

3PCA Protocatechuate dioxygenase 1.13.11.3 Y447, R457


Appendix D. The 160 protein test set used in THEMATICS-POOL

PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues

12AS Aspartate--ammonia ligase 6.3.1.1 D46, R100, Q116

13PK Phosphoglycerate kinase 2.7.2.3 R39, K219, G376, G399

1A05 1,4-Diacid decarboxylating dehydrogenase

1.1.1.85 Y140, K190, D222

1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988

1A4I Methylenetetrahydrofolate Dehydrogenase

1.5.1.5 K56

1A4S Aldehyde dehydrogenase (NAD+) / Betaine-aldehyde dehydrogenase

1.2.1.8 N166, E263, C297

1AE7 Phospholipase A2 (PLA2) 3.1.1.4 G30, H48, D99

1AFW Acetyl-CoA C-acyltransferase

2.3.1.16 C125, H375, C403, G405

1AH7 Phospholipase C 3.1.4.3 D55

1AKM Ornithine Carbamoyltransferase

2.1.3.3 R106, H133, Q136, D231, C273, R319

1ALK Alkaline Phosphatase 3.1.3.1 S102, R166

1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217, C483

1APX Heme peroxidase 1.11.1.11 R38, H42, N71

1APY Aspartylglucosylaminidase 3.5.1.26 T183, T201, T234, G235

1AQ2 Phosphoenol pyruvate carboxykinase

4.1.1.49 H232, K254, R333

1AW8 Aspartate 1-decarboxylase 4.1.1.11 Y58

1B3R Adenosylhomocysteinase 3.3.1.1 D130, K185, D189, N190, C194


1B57 Fructose-bisphosphate aldolase (class II)

4.1.2.13 D109, E182, N286

1B66 6-pyruvoyl tetrahydropterin synthase

4.6.1.10 C42, D88, H89, E133

1B6B Aralkylamine N-acetyltransferase

2.3.1.87 S97, L111, H122, L124, Y168

1B73 Glutamate racemase 5.1.1.3 D7, S8, C70, E147, C178, H180

1B93 Methylglyoxal synthase 4.2.99.11 H19, G66, D71, D91, H98, D101

1BCR Carboxypeptidase D 3.4.16.6 G53, S146, Y147, D338, H397

1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309

1BJP 4-oxalocrotonate tautomerase

5.3.2.0 P1, R39, F50

1BML Plasmin/streptokinase 3.4.21.7 H603, S608, D646

1BOL Ribonuclease T2 3.1.27.1 H46, E105, H109

1BRM Aspartate-beta-semialdehyde dehydrogenase

1.2.1.11 C135, Q162, H274

1BRW Pyrimidine-nucleoside phosphorylase

2.4.2.2 H82, R168, S183, K187

1BS4 Peptide Deformylase 3.5.1.31 G45, Q50, L91, E133

1BTL Beta-Lactamase Class A 3.5.2.6 S70, K73, S130, E166


1BWD L-arginine:Inosamine-phosphate amidinotransferase

2.1.4.2 D108, R127, D179, H227, D229, H331, C332

1BWP 2-acetyl-1-alkylglycerophosphocholine esterase

3.1.1.47 S47, G74, N104, D192, H195


1BZY Hypoxanthine-guanine phosphoribosyltransferase

2.4.2.8 E133, D134, D137, K165, R169


1C3C Adenylosuccinate lyase 4.3.2.2 H68, H141, E275

1C3J DNA beta-glucosyltransferase

2.4.1.27 E22, D100

1CB8 Chondroitin AC lyase 4.2.2.5 H225, Y234, R288

1CD5 Glucosamine-6-phosphate isomerase

5.3.1.10 D72, D141, H143, E148

1CHD Protein-glutamate methylesterase

3.1.1.61 S164, T165, H190, M283, D286

1CHK Chitosanase 3.2.1.132 E22, D40

1CHM Creatinase 3.5.3.3 H232, E262, E358

1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485

1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147

1CTT Cytidine Deaminase 3.5.4.5 E104

1D0S Nicotinate-nucleotide-dimethylbenzimidazole phosphoribosyltransferase

2.4.2.21 E317

1D4A NAD(P)H dehydrogenase (quinone)

1.6.99.2 G149, Y155, H161

1D4C Succinate dehydrogenase (Fumarate reductase)

1.3.99.1 H364, R401, H503, R544

1D8C Malate synthase 4.1.3.2 D270, E272, R338, D631

1D8H Polynucleotide 5'-phosphatase

3.1.3.33 R393, E433, K456, R458

1DAA Aminotransferase class-IV 2.6.1.21 K145, E177, L201

1DAE Dethiobiotin synthase 6.3.3.3 T11, K15, K37, S41

1DB3 GDP-mannose 4,6-dehydratase

4.2.1.47 T132, E134, Y156, K160

1DBT Orotidine-5'-monophosphate decarboxylase

4.1.1.23 D60, K62


1DCO 4a-hydroxytetrahydrobiopterin dehydratase

4.2.1.96 H62, H63, H80, D89

1DGS NAD+ dependent DNA ligase

6.5.1.2 K116, D118, R196, K312

1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474

1DIZ DNA-3-methyl adenine glycosylase II

3.2.2.21 Y222, W272, D238

1DL2 Mannosyl-oligosaccharide 1,2-alpha-mannosidase

3.2.1.113 E132, R136, D275, E435

1DLI UDP-glucose 6-dehydrogenase

1.1.1.22 T118, E145, K204, N208, C260, D264

1DNK Deoxyribonuclease I 3.1.21.1 E78, H134, D212, H252

1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278

1DQS 3-dehydroquinate synthase 4.6.1.3 H275

1DZR dTDP-4-dehydrorhamnose 3,5-epimerase

5.1.3.13 H63, D170

1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82

1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223

1EF8 Methylmalonyl-CoA decarboxylase

4.1.1.41 H66, G110, Y140

1EUG Uridine Nucleosidase (Uracil DNA glycosylase)

3.2.2.3 D64, H187

1EYI Fructose-1,6-bisphosphatase

3.1.3.11 D68, D74, E98

1FGH Aconitase 4.2.1.3 D100, H101, H147, D165, H167, E262, H642

1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289

1FRO Lactoylglutathione lyase 4.4.1.5 E172


1FUA L-fuculose-phosphate aldolase

4.1.2.17 E73

1FUG Methionine adenosyltransferase

2.5.1.6 H14, K165, R244, K245, K265, K269, D271

1FUI Arabinose isomerase 5.3.1.3 E337, D361

1G72 Methanol dehydrogenase 1.1.99.8 D297

1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444

1GIM Adenylosuccinate synthetase

6.3.4.4 D13, H41, Q224

1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495

1GPM GMP synthase 6.3.5.2 G59, C86, Y87, H181, E183, D239

1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85

1GRC Phosphoribosylglycinamide formyltransferase (GARTFase II)

2.1.2.2 N106, H108, S135, D144

1GTP GTP Cyclohydrolase 3.5.4.16 H112, H179

1HFS Stromelysin-1 (hydrolase) 3.4.24.17 E202, M219

1HXQ UDP-glucose--hexose-1-phosphate uridylyltransferase

2.7.7.12 C160, H164, H166, Q168

1I7D DNA topoisomerase III 5.99.1.2 E7, K8, F328, R330

1IVH Isovaleryl-CoA dehydrogenase

1.3.99.10 E254

1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407

1KAS 3-oxoacyl-[acyl-carrier protein] synthase

2.3.1.41 C163, H303, H340, F400

1KFU m-Calpain Form II 3.4.22.17 Q99, C105, H262, N286


1KRA Urease 3.5.1.5 H219, D221, H320, R336

1L9F Monomeric sarcosine oxidase

1.5.3.1 H45, R49, H269, C315

1LBA T7 Lysozyme 3.5.1.28 Y48, K128

1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259

1LXA UDP-N-acetylglucosamine acyltransferase

2.3.1.129 H125

1MAS Purine nucleosidase 3.2.2.1 D14, N168, H241

1MBB UDP-N-acetylmuramate dehydrogenase

1.1.1.158 R159, S229, E325

1MHL Mammalian Myeloperoxidase

1.11.1.7 Q91, H95, R239

1MHY Methane Monooxygenase 1.14.13.25 C151, T213

1MKA 3-hydroxydecanoyl-[acyl-carrier protein] dehydratase

4.2.1.60 H70, V76, G79, C80, D84

1MLA [Acyl-carrier protein] S-malonyltransferase

2.3.1.39 S92, H201, Q250

1MOQ Glucosamine--fructose-6-phosphate aminotransferase (isomerising domain)

2.6.1.16 E481, K485, E488, H504, K603

1MPP Mucoropepsin 3.4.23.23 D32, S35, Y75, D215

1MPY Extradiol Catecholic Dioxygenase

1.13.11.2 H199, H246, Y255

1NBA Carbamoylsarcosine Amidohydrolase

3.5.1.59 D51, K144, A172, T173, C177

1NID Nitrite Reductase 1.7.99.3 D98, H255

1NSP Nucleoside-diphosphate kinase

2.7.4.6 K16, N119, H122

1NZY Chlorobenzoate Dehalogenase

3.8.1.6 F64, H90, G114, W137, D145

1OFG Glucose-fructose oxidoreductase

1.1.99.28 K129, Y217

1PFK Phosphofructokinase 2.7.1.11 G11, R72, T125, D127, R171

1PGS Peptide Aspartylglucosaminidase

3.5.1.52 D60, E206

1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269

1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363

1PS1 Pentalenene Synthase 4.6.1.5 F77, R157, R173, N219, K226, R230, S305, H309


1PUD Queuine tRNA-ribosyltransferase (tRNA-guanine transglycosylase)

2.4.2.29 D102

1PYA Histidine decarboxylase 4.1.1.22 Y62, S81, F195, E197

1PYM Phosphoenolpyruvate mutase

5.4.2.9 G47, L48, D58, K120

1QFE 3-dehydroquinate dehydratase

4.2.1.10 E86, H143, K170

1QPR Quinolinate phosphoribosyltransferase (decarboxylating) (Type II)

2.4.2.19 R105, K140, E201, D222

1QQ5 2-haloacid dehalogenase 3.8.1.2 D8, T12, R39, N115, K147, S171, N173, F175, D176

1QUM Deoxyribonuclease IV 3.1.21.2 E261

1R51 Urate Oxidase 1.7.3.3 R176, Q228

1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94

1RBL Ribulose bisphosphate carboxylase

4.1.1.39 K175, K177, K201, D203, H294, H327


1REQ Methylmalonyl-CoA mutase

5.4.99.2 Y89, H244, K604, D608, H610

1RPT High molecular weight Acid Phosphatase

3.1.3.2 R11, H12, R15, R79, H257, D258

1SMN Serratia marcescens nuclease

3.1.30.2 R87, H89, N119

1TYF CLP Protease (clpP) 3.4.21.92 G68, S97, M98, H122, D171

1UAE UDP-N-acetylglucosamine 1-carboxyvinyltransferase

2.5.1.7 N23, C115, D305, R397

1UAG UDP-N-acetylmuramoylalanine--D-glutamate ligase

6.3.2.9 K115, N138, H183

1ULA Purine-nucleoside phosphorylase (type 1)

2.4.2.1 H86, E89, N243

1UOK Oligo-1,6-glucosidase 3.2.1.10 D199, E255, D329

1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504

1VNC Chloride peroxidase 1.11.1.10 K353, H404

1WGI Inorganic pyrophosphatase 3.6.1.1 D117

1XVA Glycine N-methyltransferase

2.1.1.20 E15

1YTW Protein Tyrosine Phosphatase

3.1.3.48 E290, D356, H402, C403, R409, T410

1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171

2ACY Acylphosphatase 3.6.1.7 R23, N41

2ADM ADENINE-N6-DNA-METHYLTRANSFERASE

2.1.1.72 N105, P106, Y108

2ALR Mammalian Aldehyde Reductase

1.1.1.2 Y49, K79

2BBK Methylamine dehydrogenase

1.4.99.3 D32, W57, D76, W108, Y119, T122

2BMI METALLO-BETA-LACTAMASE

3.5.2.6 D86, N176

2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183

2HDH 3-hydroxyacyl-CoA dehydrogenase

1.1.1.35 S137, H158, E170, N208

2HGS Glutathione Synthase 6.3.2.3 R125, S151, G369, R450

2JCW Superoxide dismutase 1.15.1.1 H63, R143

2PDA Pyruvate synthase 1.2.7.1 E64

2PFL Formate C-acetyltransferase 2.3.1.54 W333, C418, C419, G734

2PHK Protein Serine/threonine kinase

2.7.1.38 D149, K151

2PLC 1-phosphatidylinositol phosphodiesterase

3.1.4.10 H45, D46, R84, H93, D278

2THI Thiamine pyridinylase 2.5.1.2 C113, E241

3CSM Chorismate Mutase 5.4.99.5 R16, R157, K168, E246

3ECA Asparaginase/Glutaminase 3.5.1.1 T12, Y25, T89, D90, K162

3PCA Protocatechuate dioxygenase

1.13.11.3 Y447, R457

4KBP Purple Acid Phosphatase 3.1.3.2 H202, H295, H296

5COX Prostaglandin-Endoperoxide Synthase

1.14.99.1 Q203, H207, Y385

5ENL Enolase 4.2.1.11 E168, E211, K345, H373

5FIT diadenosine P1, P3-triphosphate (ApppA) hydrolase

3.6.1.29 Q83, H94, H96

8TLN Metalloproteinase M4 3.4.24.27 E143, H231


9PAP Thiol-Endopeptidase 3.4.22.2 Q19, C25, H159, N175


Bibliography

1. Schmid, M. B., Structural Proteomics: The potential of high throughput structure determination. Trends Microbiol 2002, 10 (Suppl.), S27-S31.
2. Stultz, C. M.; White, J. V.; Smith, T. F., Structural Analysis Based on State-space Modeling. Protein Sci 1993, 2, 305-314.
3. Combet, C.; Jambon, M.; Deleage, G.; Geourjon, C., Geno3D: automatic comparative molecular modeling of protein. Bioinformatics 2002, 18, (1), 213-214.
4. Venclovas, C., Comparative Modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins Suppl 2001, 5, 47-54.
5. Lambert, C.; Leonard, N.; De Bolle, X.; Depiereux, E., ESyPred3D: Prediction of proteins 3D structures. Bioinformatics 2002, 18, (9), 1250-1256.
6. Terwilliger, T. C., Waldo, G., Peat, TS, Newman, JM, Chu, K., Berendzen, J., Class-directed structure determination: Foundation for a protein structure initiative. Protein Sci 1998, 7, 1851-1856.
7. Madern, D.; Pfister, C.; Zaccai, G., Mutation at a Single Acidic Amino Acid Enhances the Halophilic Behaviour of Malate Dehydrogenase from Haloarcula Marismortui in Physiological Salts. European Journal of Biochemistry 1995, 230, 1088.
8. Looger, L. L., M.A. Dwyer, J.J. Smith, and H.W. Hellinga, Computational design of receptor and sensor proteins with novel functions. Nature 2003, 423, 185-190.
9. Oshiro, C. M., I.D. Kuntz and R.M.A. Knegtel, Molecular Docking and Structure-based Design. In Encyclopedia of Computational Chemistry, Schleyer, P. v. R., Ed. Wiley: Chichester, West Sussex, U.K, 1998; pp 1606-1613.
10. Lichtarge, O., Sowa, M.E., A. Philippi, Evolutionary traces of functional surfaces along G protein signaling pathway. Methods in Enzymology 2002, 344, 536-556.
11. Lima, C. D., M.G. Klein, and W.A. Hendrickson, Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science 1997, 278, 286-290.
12. Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., Eisenberg, D., A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402, (6757), 83-6.
13. Marcotte, E. M., Computational genetics: finding protein function by nonhomology methods. Curr Opin Struct Biol 2000, 10, (3), 359-65.
14. Mattos, C.; Ringe, D., Locating and characterizing binding sites on proteins. Nat Biotechnol 1996, 14, (5), 595-9.
15. Mlinsek, G., Novic, M., Hodoscek, M., Solmajer, T., Prediction of enzyme binding: human thrombin inhibition study by quantum chemical and artificial intelligence methods based on x-ray structures. J Chem Inf Comput Sci 2001, 41, (5), 1286-94.
16. Ondrechen, M. J., J.G. Clifton and D. Ringe, THEMATICS: A simple computational predictor of enzyme function from structure. Proc. Natl. Acad. Sci. (USA) 2001, 98, 12473-12478.
17. Elcock, A. H., Prediction of functionally important residues based solely on the computed energetics of protein structure. J Mol Biol 2001, 312, 885-896.


18. Gutteridge, A., G. Bartlett, and J.M. Thornton, Using a neural network and spatial clustering to predict the location of active sites in enzymes. Journal of Molecular Biology 2003, 330, 719-734.
19. Innis, C. A., A.P. Anand, and R. Sowdhamini, Prediction of functional sites in proteins using conserved functional group analysis. Journal of Molecular Biology 2004, 337, 1053-1068.
20. Laurie, A. T. R., and R.M. Jackson, Q-SiteFinder: An energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21, 1908-1916.
21. Ota, M.; K. Kinoshita; Nishikawa, K., Prediction of catalytic residues in enzymes based on known tertiary structure, stability profile, and sequence conservation. Journal of Molecular Biology 2003, 327, 1053-1064.
22. Ondrechen, M. J., THEMATICS as a tool for functional genomics. Genome Informatics 2002, 13, 563-564.
23. Ondrechen, M. J., Identification of functional sites based on prediction of charged group behavior. In Current Protocols in Bioinformatics, Baxevanis, A. D.; Davison, D. B.; Page, R. D. M.; Petsko, G. A.; Stein, L. D.; Stormo, G. D., Eds. John Wiley & Sons: Hoboken, N.J., 2004; pp 8.6.1 - 8.6.10.
24. Ko, J., L.F. Murga, P. Andre, H. Yang, M.J. Ondrechen, R.J. Williams, A. Agunwamba, and D.E. Budil, Statistical Criteria for the Identification of Protein Active Sites Using Theoretical Microscopic Titration Curves. Proteins: Structure Function Bioinformatics 2005, 59, 183-195.
25. Kaelbling, L.; Littman, M.; Moore, A., Reinforcement Learning: A Survey. J. of Artificial Intelligence Research 1996, 4, 237-285.
26. Mitchell, T. M., Machine Learning. McGraw-Hill: New York, 1997.
27. Landau, M.; Mayrose, I.; Rosenberg, Y.; Glaser, F.; Martz, E.; Pupko, T.; Ben-Tal, N., ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Nucl. Acids Res. 2005, 33, W299-W302.
28. Pupko, T., R.E. Bell, I. Mayrose, F. Glaser, & N. Ben-Tal, Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, 18, S71-S77.
29. Fetrow, J. S., Siew, N., Di Gennaro, J. A., Martinez-Yamout, M., Dyson, H. J., Skolnick, J., Genomic-scale comparison of sequence- and structure-based methods of function prediction: does structure provide additional insight? Protein Sci 2001, 10, (5), 1005-14.
30. Lichtarge, O., H. R. Bourne and F. E. Cohen, An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257, (2), 342-58.
31. Yao, H., D.M. Kristensen, I. Mihalek, M.E. Sowa, C. Shaw, M. Kimmel, L. Kavraki, & O. Lichtarge, An accurate, sensitive, and scalable method to identify functional sites in proteins. J Mol Biol 2003, 326, 255-261.
32. Cheng, G.; Qian, B.; Samudrala, R.; Baker, D., Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acids Research 2005, 33, (18), 5861-5867.
33. Devos, D.; Valencia, A., Practical limits of function prediction. Proteins: Structure, Function, and Genetics 2000, 4, 98-107.
34. Wilson, M. A., C.V. St. Amour, J.L. Collins, D. Ringe and G.A. Petsko, The 1.8 A resolution crystal structure of YDR533Cp from Saccharomyces cerevisiae: A member of the DJ-1/ThiJ/PfpI superfamily. Proc Natl Acad Sci U S A 2004, 101, 1531-1536.

35. Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Pietrokovski, S., Network Analysis of Protein Structures Identifies Functional Residues. Journal of Molecular Biology 2004, 344, (4), 1135-1146.
36. Laskowski, R. A., SURFNET: A program for visualizing molecular surfaces, cavities and intermolecular interactions. J Mol Graph 1995, 13, 323-330.
37. Dundas, J.; Ouyang, Z.; Tseng, J.; Binkowski, A.; Turpaz, Y.; Liang, J., CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucl. Acids Res. 2006, 34, W116-W118.
38. Xie, L.; Bourne, P. E., A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand binding sites. BMC Bioinformatics 2007, 8, s4-s9.
39. Petrova, N.; Wu, C., Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics 2006, 7, (1), 312.
40. Youn, E.; Peters, B.; Radivojac, P.; Mooney, S. D., Evaluation of features for catalytic residue prediction in novel folds. Protein Sci. 2007, 16, 216-226.
41. Duda, R. O.; Hart, P. E.; Stork, D. G., Pattern Classification. Wiley: New York, 2001; p 654.
42. Dasarathy, B. V., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. 1991.
43. Vapnik, V., Statistical Learning Theory. Springer-Verlag: New York, 1998.
44. Schölkopf, B.; Smola, A., Learning with Kernels. MIT Press: Cambridge, MA, 2002.
45. Tong, W.; Williams, R. J.; Wei, Y.; Murga, L. F.; Ko, J.; Ondrechen, M. J., Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines. Protein Sci. 2008, 17, 333-341.
46. Freund, Y.; Schapire, R., A short introduction to boosting. J of Japanese Society for Artificial Intelligence 1999, 14, (5), 771-780.
47. Schapire, R. E., The Strength of Weak Learnability. Machine Learning 1990, 5, (2), 197-227.
48. Matthews, B. W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 1975, 405, 442-451.
49. Yang, A. S., Gunner, M. R., Sampogna, R., Sharp, K., Honig, B., On the calculation of pKas in proteins. Proteins 1993, 15, (3), 252-65.
50. Bashford, D.; Karplus, M., Multiple-site Titration Curves of Proteins: An Analysis of Exact and Approximate Methods for Their Calculation. J. Phys. Chem. 1991, 95, 9556-9561.
51. Warwicker, J.; Watson, H. C., Calculation of the electric potential in the active site cleft due to alpha-helix dipoles. J Mol Biol 1982, 157, (4), 671-9.
52. Antosiewicz, J., Briggs, J.M., Elcock, A.H., Gilson, M.K., and McCammon, J.A., Computing the Ionization States of Proteins with a Detailed Charge Model. J. Comp. Chem. 1996, 17, 1633-1644.
53. Di Cera, E., S.J. Gill, and J. Wyman, Binding Capacity: Cooperativity and buffering in biopolymers. Proc Natl Acad Sci U S A 1988, 85, 449-452.
54. Wei, Y.; Ko, J.; Murga, L.; Ondrechen, M. J., Selective prediction of Interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8, 119.
55. Yang, T. Statistical applications for structure-based protein function prediction. Northeastern University, Boston, 2007.
56. Joachims, T., Making Large-Scale SVM Learning Practical. Advances in Kernel Methods. MIT-Press: Cambridge, MA, 1999.

57. Bartlett, G. J., C.T. Porter, N. Borkakoti, and J.M. Thornton, Analysis of Catalytic Residues in Enzyme Active Sites. J Mol Biol 2002, 324, 105-121.
58. Wei, Y. Computed Electrostatic Properties of Protein 3D Structure for Functional Annotation and Biomedical Application. Northeastern University, Boston, 2007.
59. Patterson, W. R.; Poulos, T., Crystal Structure of Recombinant Pea Cytosolic Ascorbate Peroxidase. Biochemistry 1995, 34, (13), 4331-4341.
60. Edwards, S. L.; Xuong, N. H.; Hamlin, R. C.; Kraut, J., Crystal Structure of Cytochrome c Peroxidase Compound I. Biochemistry 1987, 26, 1503-1511.
61. Gourley, D. G.; Shrive, A. K.; Polikarpov, I.; Krell, T.; Coggins, J. R.; Hawkins, A. R.; Isaacs, N. W.; Sawyer, L., The two types of 3-dehydroquinase have distinct structures but catalyze the same overall reaction. Nat Struct Biol. 1999, 6, 521-525.
62. Sobolev, V., A. Sorokine, J. Prilusky, E.E. Abola, and M. Edelman, Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15, 327-332.
63. Moser, J.; Gerstel, B.; Meyer, J. E.; Chakraborty, T.; Wehland, J.; Heinz, D. W., Crystal structure of the phosphatidylinositol-specific phospholipase C from the human pathogen Listeria monocytogenes. J. Mol. Biol. 1997, 273, 269-282.
64. Bate, P., and J. Warwicker, Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J Mol Biol 2004, 340, 263-276.
65. Perlich, C.; Provost, F.; Simonoff, J., Tree induction vs. logistic regression: a learning-curve analysis. The Journal of Machine Learning Research 2003, 4, 211-255.
66. Domingos, P.; Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning 1997, 29, 103-130.
67. Best, M. J.; Chakravarti, N., Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming 1990, 47, 425-439.
68. Barlow, R. E.; Bartholomew, D. J.; Bremner, J. M.; Brunk, H. D., Statistical Inference under Order Restrictions. Wiley: 1972.
69. Ramsay, J. O., Estimating smooth monotone functions. Journal of the Royal Statistical Society, Series B 1998, 60, 365-375.
70. Bradley, A. P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997, 30, (7), 1145-1159.
71. Wilcoxon, F., Individual comparisons by ranking methods. Biometrics 1945, 1, 80-83.
72. Madura, J. D., J.M. Briggs, R.C. Wade, M.E. Davis, B.A. Luty, A. Ilin, J. Antosiewicz, M.K. Gilson, B. Bagheri, L.R. Scott, & J.A. McCammon, Electrostatics and diffusion of molecules in solution - Simulations with the University of Houston Brownian Dynamics program. Comp Phys Commun 1995, 91, 57-95.
73. Gilson, M. K., Multiple-site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins. Proteins 1993, 15, (3), 266-82.
74. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl Acids Res 2004, 32, D129-133.
75. Edgar, R. C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 2004, 32, 1792-1797.
76. Kinoshita, K.; Ota, M., P-cats: prediction of catalytic residues in proteins from their tertiary structures. Bioinformatics 2005, 21, 3570-3571.
