Research Collection
Master Thesis
Exploring the role of amino-acid alphabets in protein-interface classification
Author(s): Somody, Joseph C.
Publication Date: 2015
Permanent Link: https://doi.org/10.3929/ethz-a-010415857
Rights / License: In Copyright - Non-Commercial Use Permitted
This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.
ETH Library
Master’s Thesis
Exploring the Role of Amino-Acid Alphabets in Protein-Interface Classification
Author:
Joseph C. Somody
Supervisors:
Prof. Dr. Amedeo Caflisch [Internal]
Dr. Guido Capitani [Lead, External]
Dr. Jose M. Duarte [External]
A thesis submitted in partial fulfilment of the requirements
for the degree of Master of Science
in
Computational Biology & Bioinformatics
Department of Computer Science
ETH Zurich
with research at
Protein Crystallography and Structural Bioinformatics
Laboratory of Biomolecular Research
Paul Scherrer Institute
http://dx.doi.org/10.3929/ethz-a-010415857
1 April 2015
Declaration of Originality
I, Joseph C. Somody, hereby confirm that I am the sole author of the written work here
enclosed, entitled “Exploring the Role of Amino-Acid Alphabets in Protein-
Interface Classification”, and that I have compiled it in my own words. Parts
excepted are corrections of form and content by the supervisors, Prof. Dr. Amedeo
Caflisch, Dr. Guido Capitani, and Dr. Jose M. Duarte.
With my signature, I confirm that:
- I have committed none of the forms of plagiarism described on the information sheet entitled Citation Etiquette, which are as follows:
  - Using the exact words or ideas from another author’s intellectual property (text, ideas, structure, etc.) without clear citation of the source.
  - Using text from the Internet without citing the uniform resource locator (URL) and the date it was accessed.
  - Reusing one’s own written texts, from different course papers or performance assessments, or parts thereof, without explicitly identifying them as such.
  - Translating and using a foreign-language text without citing its source.
  - Submitting work that has been written by another under one’s own name (“ghostwriting”).
  - Using an extract of another author’s work, paraphrasing it, and citing the source elsewhere than in the context of that extract.
- I have documented all methods, data, and processes truthfully.
- I have not manipulated any data.
- I have mentioned all persons who were significant facilitators of the work.
I am aware that this work may be screened electronically for plagiarism.
Place and Date: Zurich, Switzerland, 1 April 2015
Signature:
Abstract
Exploring the Role of Amino-Acid Alphabets
in Protein-Interface Classification
Joseph C. Somody
It is no trivial task to determine whether a protein–protein interface in a crystal structure
is biologically relevant. A large alignment of homologous sequences can be constructed,
and the evidence provided by evolution can be extracted. Residues involved in biological
interfaces tend to demonstrate less variability than other residues on the surface of the
protein, but how the variability is calculated has an interesting impact on how well the
interfaces can be classified. This thesis begins by optimising the parameters for the
algorithm described above, then continues on to an investigation of reduced alphabets
for the amino acids. Results show that grouping the amino acids into a mere two families
plays a beneficial role in discriminating between biologically-relevant interfaces and the
artefacts of crystal packing; however, grouping similar amino acids together does not
necessarily have a positive effect: placing aspartic acid and glutamic acid in the same
group of the reduced amino-acid alphabet, for example, has a decidedly-harmful impact
on the classification accuracy. Two tangential appendices are attached to the thesis to
document research forays that, although unfruitful, have the potential to succeed given
further investment.
Table of Contents

Declaration of Originality
Abstract
Table of Contents
Abbreviations

1 Introduction
  1.1 Basic Molecular Biology and Biochemistry
  1.2 X-Ray Crystallography
    1.2.1 Protein Crystallisation
    1.2.2 Data Collection
    1.2.3 Structural Modelling, Model Building, and Model Refinement
  1.3 Interface-Classification Problem
    1.3.1 Crystallographic Symmetry
    1.3.2 The Problem
    1.3.3 Early Approaches
    1.3.4 CRK
  1.4 EPPIC
    1.4.1 The Fall of CRK
    1.4.2 The Rise of EPPIC
  1.5 Amino-Acid Alphabets
    1.5.1 Background
    1.5.2 Relevance

2 Parameter Optimisation
  2.1 Parameters in EPPIC
  2.2 Optimisation
    2.2.1 BioMany and XtalMany
    2.2.2 Computation
  2.3 Results
    2.3.1 Treatment of nopred Calls
    2.3.2 Prefiltering Results
    2.3.3 Measures of Classifier Performance
    2.3.4 Optimised Parameters
  2.4 Conclusion

3 Reduced Amino-Acid Alphabets
  3.1 Rationale
  3.2 Published Alphabets
    3.2.1 Methods
    3.2.2 Results
    3.2.3 Receiver Operating Characteristic
    3.2.4 Further Results
  3.3 Random 2-Alphabets
    3.3.1 Methods
    3.3.2 Results
  3.4 Transitioning 2-Alphabets
    3.4.1 Methods
    3.4.2 Results
    3.4.3 Linear Model
  3.5 Conclusion

4 Surface-Residue Sampling
  4.1 Rationale
  4.2 Strategies for Surface-Residue Sampling
    4.2.1 Classic Strategy
    4.2.2 Novel Strategy
  4.3 Methods
  4.4 Results
  4.5 Conclusion

5 2-Alphabets
  5.1 Rationale
  5.2 Methods
    5.2.1 Stirling Numbers of the Second Kind
  5.3 Results
  5.4 Linear Model
  5.5 Conclusion

6 Conclusion
  6.1 Summary

A Hydrogen-Bond Detection
  A.1 Hydrogen Bonding
  A.2 Rationale
  A.3 Definition
  A.4 Implementation
  A.5 Conclusion

B Core-Residue Decay
  B.1 Theory
  B.2 Results
  B.3 Conclusion

List of Figures
List of Tables
Acknowledgements
References
Abbreviations

ETH      Eidgenössische Technische Hochschule
URL      Uniform Resource Locator
DNA      Deoxyribonucleic Acid
RNA      Ribonucleic Acid
A        Adenine
C        Cytosine
G        Guanine
T        Thymine
U        Uracil
PPI      Protein–Protein Interaction
PDB      Protein Data Bank
AU       Asymmetric Unit
PISA     Proteins, Interfaces, Structures and Assemblies
CRK      Core–Rim Ka/Ks Ratio
MSA      Multiple Sequence Alignment
ASA      Accessible Surface Area
BSA      Buried Surface Area
EPPIC    Evolutionary Protein–Protein Interface Classifier
DC       Duarte–Capitani
NMR      Nuclear Magnetic Resonance
HPCC     High-Performance Computing Cluster
TP       True Positive
FP       False Positive
TN       True Negative
FN       False Negative
MCC      Matthews Correlation Coefficient
ROC      Receiver Operating Characteristic
AUROC    Area Under the ROC Curve
API      Application Programming Interface
JPA      Java Persistence API
RDBMS    Relational Database Management System
PDB ID   PDB Identifier
LBR      Laboratory of Biomolecular Research
PSI      Paul Scherrer Institute
Chapter 1
Introduction
This master’s thesis is mainly an investigation into using reduced amino-acid alphabets
to calculate conservation in alignments of homologous amino-acid sequences. A direct
consequence of applying computational methods to biological questions is that following
this work requires familiarity with both basic biology and basic computer science. In order
to give all readers an equal footing, this introductory chapter briefly describes the
required background knowledge.
1.1 Basic Molecular Biology and Biochemistry
As with any scientific report with a focus on molecular biology, this master’s thesis must
necessarily start with an overview of the central dogma of molecular biology. Deoxyri-
bonucleic acid (DNA) is the genetic material that forms the inheritable basis for life.
DNA can be replicated, an essential part of cell division, or transcribed into ribonu-
cleic acid (RNA), the first step of gene expression. DNA and RNA are both chains
of nucleotides, each nucleotide being one of adenine (A), cytosine (C), guanine (G),
and thymine (T)—or uracil (U) for RNA—attached to a sugar–phosphate backbone,
as shown in Figure 1.1. Once RNA has been transcribed from DNA, nascent RNA is
processed by the cellular machinery, thereby regulating its transcription or splice pat-
tern and forming mature RNA. The ribosome, itself composed of protein and RNA,
is responsible for translating the RNA into the protein for which it codes, which then
folds into an energetically-stable conformation as it comes off the ribosomal complex.
A protein is a chain of amino acids, each being one of twenty possibilities that possess
different properties. It is their sequence that determines the structure of the protein
and, thus, the protein’s function [1].
Figure 1.1: The composition of RNA and DNA. RNA is a single-stranded chain of
nucleotides, each being one of adenine (A), cytosine (C), guanine (G), and uracil (U),
attached to a sugar–phosphate backbone. DNA is of similar composition, but it is
double-stranded and contains thymine (T) instead of uracil (U). This figure was
adapted from graphics by Sponk and R. Mattern [2].
As stated by Hartwell et al., “a discrete biological function can only rarely be attributed
to an individual molecule” [3]. Biological functions, which can be anything from signal
transduction to cell-cycle regulation to enzymatic digestion, often require the formation
of protein complexes [4, 5]. These protein–protein interactions (PPIs) can be between
subunits of the same type, forming homooligomers, or of different types, forming het-
erooligomers. The interactions in homooligomers can be further subdivided into isolo-
gous and heterologous [6]. The subunits form an isologous association when they both
contribute the same set of amino-acid residues to the interface, whereas they form a het-
erologous association when they contribute a different subset of amino-acid residues to
the interface. Schematic examples of such oligomers can be seen in Figure 1.2. Further-
more, isologous interfaces have identical interacting surfaces, resulting in a necessarily
dimeric and closed structure with a twofold axis of symmetry, while heterologous inter-
faces have non-identical interacting surfaces, which must be complementary but are not
usually symmetric [7]. Heterologous interfaces can, but need not, give rise to infinite
assemblies like the one shown in Figure 1.2(c). In the case that they do not form infinite
assemblies, heterologous interfaces must display circular symmetry.
Figure 1.2: Different types of protein–protein interactions. Figure 1.2(a) shows a
heterooligomeric interaction between two different protomers, A and B. Figure 1.2(b)
shows two identical protomers coming together to form an isologous homooligomeric
interaction, the black dot denoting an axis of rotational symmetry, while Figure 1.2(c)
shows three identical protomers assembling to form a heterologous homooligomeric
interaction, which can, in this case, but not in general, be extended ad infinitum.
The amino-acid residues that point into the interface between the proteins are the in-
teracting residues and are often required to maintain the PPI. Typically, there are also
residues critical for the PPI that do not interact across the interface: internal support
residues that stabilise the main-chain scaffold of the interface [8].
1.2 X-Ray Crystallography
1.2.1 Protein Crystallisation
The first step in protein X-ray crystallography is to crystallise the protein of interest,
beginning with its purification. The protein is typically obtained by recombinant meth-
ods: the gene for the protein is cloned into an expression vector and transferred into a
production organism, such as Escherichia coli, in order to express the protein in large
quantities. The protein is then purified out of the culture, generally by means of affinity
chromatography targeting a protein tag whose coding sequence had been inserted at
one end of the gene of interest.
Once the protein has been purified, it must be crystallised. To crystallise, protein
molecules must fall out from solution and assemble themselves into a crystal lattice,
typically done by adding a precipitant to a concentrated solution of protein [9]. The
formation of protein crystals is a finicky procedure with many variables, and, as it is not
determinable in advance which solutions and conditions will yield protein crystals, let
alone well-diffracting ones, many trials with varying characteristics must be carried out.
In a popular method known as hanging-drop vapour diffusion, the reservoirs of microtitre
plates are filled with solutions with various acidities, temperatures, ionic strengths,
buffers, precipitants, etc. A small amount of the protein solution is mixed with an equal
amount of the reservoir solution and transferred on to a glass slide, which is then placed
atop the reservoir, thereby sealing it, with the drop facing the interior of the reservoir [9].
Since the drop has a lower precipitant concentration than the reservoir solution, water
evaporates from the drop, leading to a slow increase in protein concentration. Over
time, the drop becomes supersaturated with protein, and a nucleation event may lead
to the formation of protein crystals.
This screening process suffers from the curse of dimensionality: crystallisation is affected
by many variables, which makes sampling the multidimensional solution space an oner-
ous task, especially since the amount of protein available is generally limited. It is for
this reason that sparse-matrix sampling was proposed by Jancarik and Kim in 1991 [10],
which involves strategically screening crystallisation conditions that have been success-
ful in the past and using non-repeating combinations of reagents [9]. This method has
become a very popular technique in protein crystallisation and is now widely applied in
commercial crystallisation kits [11].
1.2.2 Data Collection
After obtaining a satisfactory crystal of the protein of interest, the crystal is most
commonly flash-frozen in liquid nitrogen and mounted to a goniometer head. The crystal,
kept at 100 K by a stream of gaseous nitrogen, is then exposed to an incident beam of
monochromatic X-ray radiation while slowly being rotated about a spindle axis. The
X-rays are diffracted, and diffraction patterns are captured by a detector sensitive to
X-ray photons [9].
1.2.3 Structural Modelling, Model Building, and Model Refinement
There is an inherent difficulty in X-ray crystallography: the phase problem. The infor-
mation regarding the phase of any given spot, known as a reflection, on the diffraction
pattern is lost when captured as an image. This is problematic, because the phases of the
reflections are more crucial to structural modelling than their intensities [12]. One of the
most common solutions to this problem is molecular replacement, where the phases from
a similar, solved structure are grafted on to the intensities of the crystal of interest after
computationally finding the correct orientation and position in the target asymmetric
unit (defined in Subsection 1.3.1) [13]. In Figure 1.3, a schematic example (with artistic
renderings of diffraction patterns, not the real ones) of molecular replacement as a solu-
tion to the phase problem is illustrated: the cat in Figure 1.3(a) represents the crystal
structure to be solved, whose diffraction pattern is shown in Figure 1.3(b), but only
the phaseless (shown as colourless in this analogy) diffraction pattern in Figure 1.3(c)
is available to crystallographers. The phaseless intensities from the reflections in Fig-
ure 1.3(c) are used, but the phases from a similar structure (not shown), appropriately
positioned in the asymmetric unit, are applied on to the intensities. The combination,
shown in Figure 1.3(d), yields the structure in Figure 1.3(e), an approximate solution.
The recorded series of two-dimensional diffraction patterns, with intensities and grafted
phases, must then be converted into a three-dimensional model of the electron density.
The Fourier transform is an operation that decomposes a periodic function into the
frequencies that constitute it. The diffraction pattern of a crystal is composed of peaks
as a result of the crystal’s periodicity, so the inverse of the Fourier transform can be
used to recompose the electron density from the observed pattern [15]. A biochemical
model of the protein can then be built into the electron density and iteratively refined
until the calculated diffraction data from the model agree as much as possible with the
experimental diffraction data. The final structure is then formatted appropriately and
deposited into a database, such as the Protein Data Bank (PDB) [16].
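The recomposition step can be illustrated in one dimension: the inverse discrete Fourier transform turns complex structure factors (amplitudes combined with phases) back into a periodic density. The NumPy sketch below is purely illustrative, using a toy one-dimensional signal rather than real crystallographic data; it demonstrates why both amplitudes and phases are needed.

```python
import numpy as np

# A toy one-dimensional "electron density": a periodic signal.
x = np.linspace(0.0, 1.0, 64, endpoint=False)
density = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * x) + 0.2 * np.cos(2 * np.pi * 7 * x)

# The diffraction experiment records intensities (~|F|^2), so the
# amplitudes survive, but the phases are lost.
structure_factors = np.fft.fft(density)
amplitudes = np.abs(structure_factors)
phases = np.angle(structure_factors)  # unavailable in a real experiment

# Molecular replacement supplies estimated phases; here we "borrow" the
# true ones to show that amplitudes + phases recompose the density.
recombined = amplitudes * np.exp(1j * phases)
recovered = np.fft.ifft(recombined).real

assert np.allclose(recovered, density)
```

With correct phases the density is recovered exactly; with phases grafted from a merely similar structure, the recovered density is only approximate, which is why iterative refinement follows.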
1.3 Interface-Classification Problem
1.3.1 Crystallographic Symmetry
When a protein structure is determined by X-ray crystallography, its space group must
be identified. The space group describes the symmetry of the crystal: it provides suf-
ficient information to mathematically generate a unit cell from the asymmetric unit
(AU) [17]. More specifically, the AU is the smallest unit of the crystal that contains
sufficient information to produce the unit cell through the symmetry operations of the
crystal space group, while the unit cell is the smallest unit of the crystal that contains
sufficient information to produce the (infinite) crystal lattice by translation only [18]. A
schematic example of crystallographic symmetry is shown in Figure 1.4, where the AU
is any one duck, of either colour, and the unit cell is shown in green overlay.
1.3.2 The Problem
The crystal structure of a protein under investigation contains non-biological interac-
tions, a direct consequence of the nature of crystals. (An exception to this is seen in the
case of natural crystals like polyhedrins in insect viruses [19, 20] or like the Cry3A toxin
found naturally crystallised in Bacillus thuringiensis [21], where all interfaces are biolog-
ically relevant since the protein’s crystal form evolved to perform a particular biological
Figure 1.3: An illustration of molecular replacement as an approach to solving the
phase problem with artistic renderings of diffraction patterns, not the real Fourier
transforms. The cat in Figure 1.3(a), having the diffraction pattern in Figure 1.3(b),
where the colours represent phases, needs to be generated from the phaseless (colourless)
diffraction pattern in Figure 1.3(c). Phases from a similar structure are appropriately
aligned and grafted on to Figure 1.3(c), resulting in Figure 1.3(d). Figure 1.3(d) yields
the approximate solution shown in Figure 1.3(e). This figure was inspired by a similar
one by Kevin Cowtan [14].
Figure 1.4: An anatine example of crystallographic symmetry with space group P21,
shown schematically down the b axis. The orange symbols denote twofold screw axes,
while the colours, pink and blue, denote differential depth along b. The AU is one duck,
of either colour, and the unit cell is shown in green overlay. This figure was inspired by
a similar one by David Blow [17].
function.) The matter that constitutes each AU in a crystal interacts not only within
the AU but also between adjacent AUs. The AU is defined without reference to the bi-
ological context of the complex, meaning the complexes in each AU are not necessarily
the true biological complexes [22]. A schematic example is shown in Figure 1.5, where
the AU is clearly any one subunit (of any colour), but the definition of the biological
unit is debatable without any complementary information.
The crystal structure of a protein homooligomeric in solution is not restricted to sym-
metry operations based on the biological oligomeric interactions between amino-acid
chains. For example, a protomer that forms a homodimer in vivo does not necessarily
only have that one isologous interaction in its crystalline form. As a result, when crystal-
lographers investigate the crystal lattice, it is not usually obvious which contacts are due
to biological interactions and which contacts are due simply to crystal packing. In the
past, especially with simple structures, biological interactions have been discriminated
from crystal contacts by mere manual inspection, an endeavour that has become more
and more challenging as the complexity of the proteins being researched has risen over
time [24]. This difficulty has become further exacerbated by a lack of classical biochemi-
cal background on new proteins: it is quite rare to see full biochemical characterisations
for the majority of today’s crystallographic targets [24].
Figure 1.5: A two-dimensional schematic illustration of the interface-classification
problem. Given Figure 1.5(d) as a crystal lattice, any of Figure 1.5(a), Figure 1.5(b), or
Figure 1.5(c) could be equivalently inferred as the biologically-relevant protein complex.
The AU is one subunit (outlined shape) of any colour. Without further information, it
is not clear which arrangement contains the true biological unit, since protein–protein
interfaces in the lattice are often artefacts of crystal formation. This figure was
inspired by a similar one by Levy and Teichmann [23], which, in turn, was inspired by
“Fish (№ 20)” by M. C. Escher.
1.3.3 Early Approaches
Over the past twenty years, attempts to solve the above problem have varied widely
in their approaches. In the 1970s, Janin started investigating protein–protein interac-
tions [25, 26] and, in 1995, began researching crystal contacts [27]. Among the first was
Janin’s method of discriminating between subunit contacts and crystal artefacts based
solely on the area of the interface [28]. A few years later, Valdar and Thornton trained
neural networks with both the area of the interface and the conservation scores of the
amino-acid residues involved, which resulted in a highly-accurate predictor [29]. A couple
more methods arose in the early 2000s, which focussed on analysing residue conservation
in order to tell biological and non-biological interfaces apart: Elcock and McCammon
determined that comparing the conservation of amino-acid residues involved in the in-
terface with the conservation of residues elsewhere on the surface of the protein was an
effective discriminator [30], and Guharoy and Chakrabarti found that subdividing the in-
terface into a “core” region and a “rim” region and comparing the residue conservation
between the two regions worked even better, since binding sites and other conserved
patches on the rest of the protein surface led to anomalous results [31]. Researchers
thereafter began using methods in machine learning in order to combine various param-
eters describing the interfaces of interest and extract statistical power from a trained
classifier [32–34]. Of particular note is NOXclass, a support vector machine–based clas-
sifier that distinguishes between three classes of interactions by analysing six interface
properties, including interface area and amino-acid composition [33]. These classes con-
sist of obligate interactions, where protomers are not independently found as stable
structures in vivo; non-obligate interactions, where protomers are; and crystal-packing
interactions [33].
In 2007, Krissinel and Henrick developed a thermodynamic method for estimating inter-
face stability based on the binding energy of the interface and the entropy change due
to complex formation [35]. This method is implemented in Proteins, Interfaces, Struc-
tures and Assemblies (PISA) [35], which has become the de facto standard for interface
classification in the community. In 2008, Xu et al. performed an analysis of interface
similarity in crystals of homologous proteins to examine whether having interfaces in
common across different crystal forms can be used to calculate the likelihood that an
interface is biological [36]. Three years later, Xu and Dunbrack released ProtCID [37],
a database of homologous interfaces—both homodimeric and heterodimeric—observed
in multiple crystal forms, grouped by Pfam domain [38] and crystal form.
1.3.4 CRK
The core–rim Ka/Ks ratio (CRK) is a measure developed by Schärer et al. in 2010
to distinguish biological interfaces from non-specific ones [39]. It is based on an idea
originally proposed by Bahadur et al. in a 2004 paper [40], which was to designate each
residue involved in the interface as either a core residue or a rim residue, depending
on the extent to which atoms in the residue are buried upon interface formation—core
residues contain at least one fully-buried atom, while rim residues, which surround core
residues, have only partially-buried atoms.
In 2001, Elcock and McCammon published a method based on comparing the informa-
tion entropy at various positions, each designated as either “interface” or “non-interface”,
in multiple sequence alignments (MSAs) [30]. It was found that the average sequence
entropy for interface residues is lower than that of non-interface residues. By restricting
the focus and looking only at interface residues, Guharoy and Chakrabarti refined this
method in 2005 and showed that the Bahadur et al. classifications—core residues and
rim residues—provided a better basis for comparisons of sequence entropy [31].
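The sequence entropy referred to above is, in essence, the Shannon entropy of each alignment column. A minimal sketch follows; the choice of base-2 logarithm and the gap handling are illustrative assumptions, not necessarily the conventions of the cited methods.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of one MSA column, ignoring gap characters."""
    residues = [aa for aa in column if aa != "-"]
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy alignment of three homologous sequences, four columns long.
msa = ["ACDE",
       "ACDQ",
       "ACNE"]
entropies = [column_entropy(col) for col in zip(*msa)]
# Perfectly conserved columns score 0; variable columns score higher,
# so conserved interface positions show lower average entropy.
```

A fully conserved column has entropy 0, while a column split two-to-one between residues scores about 0.92 bits; averaging these values over interface versus non-interface positions yields the comparison described above.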
The CRK algorithm expands on these methods: given a PDB structure, the sequence
of each chain is used as a BLASTP [41, 42] query to search for homologous sequences
in the UniProt database [43], which are then filtered so as to have homology above a
threshold, appropriate taxonomy, and low pairwise redundancy. The protein sequences
for these homologues (as well as for the query) are then aligned with one another using
T-Coffee [44], the resulting protein MSA is backtranslated into an MSA of coding se-
quences using RevTrans [45], and the selection pressure at each site is calculated with
Selecton [46, 47].
The accessible surface area (ASA) of an amino-acid residue is the size of the region
accessible to a solvent molecule, usually water [48]. This is generally calculated using
the Shrake–Rupley algorithm [49], where a probe sphere is “rolled” along the surface
of the region while checking for steric clashes that would imply a particular point on
the surface is not accessible. The buried surface area (BSA) is similar but measures
how accessible the residue is after interface formation [50]. The geometric descriptions
of the interfaces for the input PDB structure are retrieved from PISA, and the residues
are divided into a rim subset and a core subset, depending on what fraction of their
accessible surface area becomes inaccessible after interface formation. This fraction is
known as the burial, and the default burial cutoff for core residues in CRK is 95%.
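The burial-based split can be sketched in a few lines. The data layout below (residue identifiers mapped to ASA/BSA pairs) is a hypothetical convenience, not the actual CRK data structures; only the burial definition and the 95% cutoff come from the text.

```python
def classify_interface_residues(residues, core_burial_cutoff=0.95):
    """Split interface residues into core and rim by burial fraction.

    `residues` maps residue IDs to (asa, bsa) pairs, where burial = bsa / asa
    is the fraction of accessible surface lost on interface formation.
    Hypothetical data layout, not the actual CRK/EPPIC representation.
    """
    core, rim = [], []
    for res_id, (asa, bsa) in residues.items():
        if asa == 0:
            continue  # residues with no accessible surface are not considered
        burial = bsa / asa
        if burial > 0:  # only residues losing some ASA are at the interface
            (core if burial >= core_burial_cutoff else rim).append(res_id)
    return core, rim

# Toy example: residue -> (ASA in isolation, surface area buried on binding).
interface = {"A12": (120.0, 118.0),   # burial ~0.98 -> core
             "A13": (80.0, 30.0),     # burial ~0.38 -> rim
             "A14": (95.0, 0.0)}      # untouched by the interface
core, rim = classify_interface_residues(interface)
```

Here residue A12 is almost entirely buried and lands in the core, A13 is only partially buried and lands in the rim, and A14 is not part of the interface at all.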
The final steps in the CRK algorithm are calculating the ratio ω = Ka/Ks for each
residue, separating the ω values by class (either core or rim), and calculating the statistics
(both the unweighted mean and the mean weighted by the BSA) for both residue classes.
The ratio can then be used to determine whether the interface is biologically relevant:
a biological interface is expected to have more purifying selection for core residues than
rim residues, which would yield a ratio CRK =〈ω〉core〈ω〉rim
< 1.
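Put together, these final steps amount to a small computation. The sketch below assumes per-residue ω values and BSA weights are already available as (ω, BSA) pairs; this input format is an assumption for illustration, not the actual CRK implementation.

```python
def crk_score(core, rim):
    """CRK = <omega>_core / <omega>_rim from per-residue (omega, bsa) pairs.

    Returns both the ratio of unweighted means and the ratio of BSA-weighted
    means, mirroring the two statistics described in the text. The input
    layout is hypothetical, not the actual CRK code.
    """
    def mean(values):
        return sum(values) / len(values)

    def weighted_mean(pairs):
        total_weight = sum(bsa for _, bsa in pairs)
        return sum(omega * bsa for omega, bsa in pairs) / total_weight

    unweighted = mean([w for w, _ in core]) / mean([w for w, _ in rim])
    weighted = weighted_mean(core) / weighted_mean(rim)
    return unweighted, weighted

# Toy values: strong purifying selection (low omega) in the core.
core = [(0.1, 40.0), (0.2, 30.0)]   # (omega, BSA) per core residue
rim = [(0.6, 10.0), (0.8, 12.0)]
unweighted, weighted = crk_score(core, rim)
# Both ratios fall below 1, as expected for a biological interface.
```

For these toy numbers, both statistics come out well below 1, consistent with a biological interface; a crystal-packing contact would show no such asymmetry in selection pressure.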
1.4 EPPIC
1.4.1 The Fall of CRK
CRK, as a proof-of-concept work, was successful but had its limitations. Firstly, the
CRK pipeline was complicated: finding protein homologues, aligning them, backtrans-
lating the alignment, calculating the selection pressure, classifying residues, calculat-
ing ratios, etc. This meant not only that any large-scale surveys of protein structures
would be computationally infeasible but also that there were many welds in the pipeline
where something could go wrong. In fact, mapping protein sequences back to their
coding sequences turned out to be one such problem since there were often issues with
cross-referencing between databases. A protein sequence would often map to multiple
coding sequences or, even worse, to no coding sequences, which led to fewer sufficiently-
significant predictions. In addition, the step in which CRK calculates selection pressures
using Selecton was found to be exceedingly slow, further constraining the method’s us-
ability.
1.4.2 The Rise of EPPIC
In 2012, the research group behind CRK published a new approach, the Evolution-
ary Protein–Protein Interface Classifier (EPPIC) [24]. Although grounded in the same
concepts as CRK, EPPIC overcame CRK’s limitations. Two evolutionary predictors,
core–rim and core–surface, were introduced, as well as a new geometric criterion for
interface classification, which was based simply on the number of core residues in an
interface. This allowed for a call to be made even when insufficient homologues were
available for evolutionary analysis. Next, the use of Ka/Ks ratios was abandoned in
favour of sequence entropies. It was found that better performance was obtainable by
using sequence entropies as long as homologues were stringently selected and had low
redundancy. This change not only did away with the speed-limiting step involving Se-
lecton but also meant that mapping protein sequences back to coding sequences was no
longer necessary.
As mentioned in Subsection 1.3.4, in 2001, Elcock and McCammon pioneered a method
comparing sequence entropy between interface and non-interface positions in MSAs [30].
EPPIC expands on this substantially: for the core–surface call, core residues of the
interface are compared to surface residues outside the interface. More specifically, in a
manner similar to that described by Valdar and Thornton [51], EPPIC randomly samples
the surface residues to be compared from a pool of surface residues in order to maximise
statistical robustness.
When searching for homologues, EPPIC uses a conservative sequence-identity cutoff of
60%, which is lowered to 50% in the event that insufficient homologues are found. These
high cutoffs ensure that homologues have a high structural similarity at both tertiary
and quaternary levels [24].
EPPIC uses the CRK definition for determining burial, as defined in the paper by
Schärer et al. in 2010 [39]. For each of the three predictors contained in EPPIC, a different burial cutoff can be used. EPPIC determines the solvent accessibility of residues
before and after the formation of the protein–protein interface, and the ratio is used to
classify them as either “core” (residues fully buried upon interface formation), “rim”
(residues partially buried upon interface formation), “surface” (surface residues not in-
volved in an interface), or “other” (any non-interface, non-surface residues; for example,
residues in a subunit’s core). The geometry-based predictor simply returns the num-
ber of core residues in the interface as its score, the core–rim evolutionary predictor
returns the ratio of sequence entropies for core interface residues compared to rim inter-
face residues, and, finally, the core–surface evolutionary predictor uses random sampling
of surface residues and returns the “distance” of the average entropy for core residues
to the mean of the samples of surface residues in terms of their standard deviation, a
Z-score–like measure. This will be explained in greater detail in Chapter 4.
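The three scores can be sketched as follows. This is an illustrative reimplementation based only on the description above, not EPPIC's source: the function names, the number of samples, and the use of sample means for the Z-score-like measure are assumptions. A conserved (low-entropy) interface core yields a score below 1 for the core–rim predictor and a negative score for the core–surface predictor.

```python
import random
import statistics

def geometry_score(residue_classes):
    """Geometric predictor: the number of interface-core residues."""
    return sum(1 for c in residue_classes if c == "core")

def core_rim_score(entropy):
    """Core-rim predictor: ratio of mean core to mean rim entropy."""
    return statistics.mean(entropy["core"]) / statistics.mean(entropy["rim"])

def core_surface_score(entropy, n_samples=10000, seed=0):
    """Core-surface predictor: Z-score-like distance of the mean core
    entropy from randomly sampled sets of surface residues."""
    rng = random.Random(seed)
    core_mean = statistics.mean(entropy["core"])
    k = len(entropy["core"])
    sample_means = [statistics.mean(rng.sample(entropy["surface"], k))
                    for _ in range(n_samples)]
    mu = statistics.mean(sample_means)
    sigma = statistics.stdev(sample_means)
    return (core_mean - mu) / sigma

# A conserved core against a variable rim and surface:
entropy = {"core": [0.10, 0.20, 0.15],
           "rim": [0.80, 0.90],
           "surface": [0.50, 0.70, 0.90, 1.00, 0.60,
                       0.80, 1.10, 0.40, 0.95, 0.75]}
print(geometry_score(["core", "core", "rim", "surface"]))  # 2
print(core_rim_score(entropy) < 1)                         # True
print(core_surface_score(entropy, n_samples=500) < 0)      # True
```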
These three scores are converted to calls by comparison to a threshold unique to each
predictor. Each call can be one of bio, meaning that the predictor believes the interface
to be biologically relevant; xtal, meaning that the predictor believes it to be a crystal
contact; or nopred, meaning that the predictor has insufficient data to go on. The three
calls are combined to form a consensus call through a simple-majority voting scheme
with some modifications. Since the geometric predictor always returns a call, if both
evolutionary predictors give no prediction, the geometry call is taken as the consensus.
Also, in certain cases, more weight is given to the calls by the evolutionary predictors:
when there is exactly one each of bio, xtal, and nopred, the non-nopred evolutionary
call is made the consensus call.
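The voting scheme just described can be sketched as follows (a hypothetical helper; the call strings and fall-through logic are assumptions based on the description above):

```python
def consensus_call(geom, core_rim, core_surface):
    """Combine three predictor calls ('bio', 'xtal', or 'nopred')
    into a consensus call."""
    evo = [core_rim, core_surface]
    # Geometry always returns a call; if neither evolutionary predictor
    # does, take the geometry call as the consensus.
    if evo.count("nopred") == 2:
        return geom
    calls = [geom] + evo
    # With exactly one each of bio, xtal, and nopred, trust the
    # evolutionary predictor that did make a call.
    if sorted(calls) == ["bio", "nopred", "xtal"]:
        return next(c for c in evo if c != "nopred")
    # Otherwise: simple majority among the calls actually made.
    made = [c for c in calls if c != "nopred"]
    return max(set(made), key=made.count)

print(consensus_call("bio", "nopred", "nopred"))  # bio
print(consensus_call("bio", "nopred", "xtal"))    # xtal
print(consensus_call("xtal", "bio", "bio"))       # bio
```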
1.5 Amino-Acid Alphabets
1.5.1 Background
There are twenty different naturally-occurring amino acids, complex strings of which
form proteins. Anfinsen’s dogma states that the native structure of a protein is deter-
mined solely by the protein sequence of amino acids [52]. It has been shown, however,
that point mutations in a protein’s sequence are often tolerated [53] and that proteins
with far fewer types of amino acids can still form stable structures [54, 55]. In one case,
thirty-eight residues of a fifty-seven–residue protein domain were each replaced with one
of five possible residues, and the domain retained its fold [54]. This suggests that the
standard twenty-letter amino-acid alphabet could, perhaps, be reduced.
Investigations into the degeneracy of amino-acid sequences and the possibility to con-
dense the amino-acid alphabet were published as early as 1979 [56]. In 1994, Strelets et al.
developed a model of combinatorial sequence space and broke down known protein se-
quences into overlapping subsequences of a fixed length, which led to an unexpectedly-
high clusterisation that “looked as if certain protein sequence variations occurred and
were fixed in the early course of evolution” [57], further propelling the idea that there
were simplifications that could be made in attempts to model the complexity of protein
sequence.
The simplest reduced protein alphabet is a two-letter one where amino acids are grouped
as either hydrophobic or polar. This has been dubbed the HP model, but, as argued by
Wolynes in 1997, two-letter alphabets result in too many local minima and insufficient
funnelling in the energy landscape of protein folding [58]. In 1999, Wang and Wang took
a computational approach to simplifying the protein-folding alphabet using minimised
mismatch between properties of amino acids as a function of the size of the amino-acid
alphabet and determined that, depending on how many properties are to be modelled,
various plateaus exist to describe proteins at different coarse-grained levels [59].
In 2000, Murphy et al. set out to determine whether there is a minimal protein alphabet
with which all proteins could still fold [60]. They used the BLOSUM50 similarity ma-
trix [61], which was derived from gapless blocks in large alignments of protein sequences,
to correlate the most-related amino acids, merge them, and construct a simplified sim-
ilarity matrix. This process, applied iteratively, showed clear tendencies: residues with
similar physicochemical properties first group together, the small amino acids then come
together to form a group, and the alphabet ultimately reduces to the HP model. The
authors tested the effects by applying the reduced alphabets on protein sequences and
aligning them to the SCOP40 clustered database [62] to determine how much of the pro-
tein fold mapping is lost. An amino-acid alphabet of ten or twelve letters only reduced
the percentage coverage retained by approximately 10%; however, a steep loss of fold
recognition was observed with any simpler alphabet [60].
Three years later, in a very similar approach, Li et al. used the BLOSUM62 similarity matrix [61] to guide an iterative condensation of the protein alphabet and compared the resulting alphabets for information loss, just as Murphy et al. did. The authors noted that ten-letter
alphabets allow for nearly the same ability to detect distantly related folds as the twenty-letter alphabet, which may mean that ten amino acids would suffice to characterise the complexity in proteins [63]. Research into reduced protein alphabets has also
continued in recent years with a greater focus on their applications in protein design [64].
1.5.2 Relevance
In EPPIC, as described in Subsection 1.4.2, an MSA of putative homologues is generated
for the two evolutionary predictors. The sequence entropy is then calculated at each
position, and each position is categorised as “core”, “rim”, “surface”, or “other”. The
scores are based on the differences in sequence entropy between these classes. It is for
these calculations that reduced amino-acid alphabets are needed. The signal-to-noise
ratio of a column is a delicate balance in that variation should be detected but not taken
into account if the variations do not matter at a biochemical level. Part of this could be
resolved through cleverly choosing an appropriate amino-acid alphabet. As a result, a
large part of this thesis is an investigation into the effects of various reduced amino-acid
alphabets on EPPIC’s output.
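The core quantity in these calculations is the per-column Shannon entropy after mapping residues to the groups of a reduced alphabet. A minimal sketch follows; the two-letter Murphy alphabet is used as the default, and the base-2 logarithm is an assumption, since the entropy units are not fixed by the description above.

```python
from collections import Counter
from math import log

MURPHY_2 = "ACFGILMPSTVWY:DEHKNQR"  # hydrophobic : polar

def column_entropy(column, alphabet=MURPHY_2):
    """Shannon entropy (bits) of one MSA column after mapping each
    residue to its group in a colon-separated reduced alphabet."""
    group_of = {aa: g for g in alphabet.split(":") for aa in g}
    counts = Counter(group_of[aa] for aa in column if aa in group_of)
    n = sum(counts.values())
    return sum((c / n) * log(n / c, 2) for c in counts.values())

# A column of Ile/Leu/Val is perfectly conserved under the two-letter
# alphabet but variable under the full twenty-letter one:
twenty = ":".join("ACDEFGHIKLMNPQRSTVWY")
print(column_entropy("ILVVIL"))              # 0.0
print(column_entropy("ILVVIL", twenty) > 0)  # True
```

This is exactly the signal-to-noise trade-off discussed above: conservative substitutions among hydrophobic residues contribute no entropy under the two-letter alphabet.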
Chapter 2
Parameter Optimisation
2.1 Parameters in EPPIC
The strategy behind EPPIC is fairly fixed: given a structure, for each possible interface,
the program classifies residues based on solvent accessibility before and after the forma-
tion of the interface; creates a large, non-redundant MSA of putative homologues; and
uses this information to return three different measures by which to classify the inter-
face as either biologically relevant or a crystal contact. This does not, however, mean
that EPPIC does not allow for any customisability. In fact, the gamut of adjustable
parameters that EPPIC offers proves just the opposite.
EPPIC allows the user to choose the size of the amino-acid alphabet used for the sequence
entropy calculations (from the ones proposed by Murphy et al. [60]), the BSA-to-ASA
burial cutoff for core-residue assignment for each of the three classifiers, the calling
thresholds for determining bio or xtal from the raw scores, the hard and soft sequence
identity cutoffs, the maximum number of homologues to use, the number of sphere points
for ASA calculations, as well as several other parameters.
Of the aforementioned parameters, this chapter of the thesis will concern the optimisa-
tion of seven of them: the three burial cutoffs, the three call thresholds, and the reduced
amino-acid alphabet.
Chapter 2 — Parameter Optimisation 16
2.2 Optimisation
2.2.1 BioMany and XtalMany
In order to optimise parameters, a method of evaluating the accuracy of the classifier is
required. Therein lies one of the most challenging obstacles: gold-standard datasets with
experimental backing are key to properly training a classifier, but the manually-curated
ones available to date have taken a long time to compile, been small (on the order of
102 entries), and been prone to human error [65].
In 2012, when EPPIC was originally released, two small datasets were compiled by
the authors. These Duarte–Capitani (DC) datasets, DCbio and DCxtal, were used to
optimise the EPPIC parameters [24]. Since it had long been known that biological
interfaces tend to be much larger than crystal contacts [28], Duarte et al. decided to
focus their curation efforts on the region harder to classify. The DCbio dataset, therefore,
consisted of small interfaces validated as biological, while the DCxtal dataset contained
large crystal contacts.
For a true determination of the optimal parameters, the DC datasets, each having
only eighty-one entries, were too small. To this end, Baskaran et al. generated the
Many datasets, BioMany and XtalMany [65]. BioMany was based on the ProtCID
database [36, 37], which infers that an interface is biological when it is present across
multiple crystal forms. The ProtCID database also clusters interfaces by Pfam architec-
ture [38], which allowed for a selection of interfaces of minimal redundancy to be pulled
from the database. Using conservative thresholds, 2666 such interfaces were added to
the BioMany dataset. Additionally, 171 further biological interfaces that had been val-
idated by nuclear magnetic resonance (NMR), another method of determining protein
structure, were added to the dataset. Any interfaces larger than 2000 Å² were removed
from BioMany in order to keep the dataset focussed on the range of interfaces harder to
classify.
In 1965, Monod et al. first discussed interfaces that lead to infinite assemblies [6]. As
mentioned in Section 1.1, homomeric interfaces can be either isologous or heterologous,
the latter of which has the potential to assemble infinitely. Baskaran et al. assert that,
when considering a protein crystal, any interfaces produced by pure translation or a
screw axis cannot result in closed assemblies, and, therefore, any such observed interfaces
can be assumed to be crystal contacts [65]. The XtalMany dataset was populated using
this assumption, the interfaces were clustered by sequence to decrease redundancy, and
any interfaces smaller than 600 Å² were removed. This resulted in a dataset of 2913
crystal contacts.
2.2.2 Computation
It was determined early on that the optimisation of parameters would have to involve a
full search of the parameter space. Not all parameters are interdependent, which meant
that there could be a certain degree of parallelisation in the optimisation. Additionally,
the call thresholds only come into play during the conversion of scores to calls, allowing
these parameters to be handled a posteriori.
The burial cutoff for the geometric predictor was evaluated from 50% to 100% in in-
crements of 5%. The burial cutoffs for the core–rim and core–surface evolutionary pre-
dictors were each evaluated from 50% to 100% in increments of 10% in combination
with the amino-acid alphabet being varied across the two-, four-, eight-, ten-, and fifteen-letter alphabets proposed by Murphy et al. [60], the six-letter alphabet proposed by Mirny and Shakhnovich [66], and the unreduced twenty-letter alphabet. The amino-acid alphabets used for the optimisations are shown in Table 2.1.
Alphabet Identifier   Source   Amino-Acid Alphabet
2                     [60]     ACFGILMPSTVWY:DEHKNQR
4                     [60]     AGPST:CILMV:DEHKNQR:FWY
6                     [66]     ACILMV:DE:FHWY:GP:KR:NQST
8                     [60]     AG:CILMV:DENQ:FWY:H:KR:P:ST
10                    [60]     A:C:DENQ:FWY:G:H:ILMV:KR:P:ST
15                    [60]     A:C:D:E:FY:G:H:ILMV:KR:N:P:Q:S:T:W
20                    —        A:C:D:E:F:G:H:I:K:L:M:N:P:Q:R:S:T:V:W:Y

Table 2.1: The seven amino-acid alphabets originally released with EPPIC. For each alphabet, the alphabet identifier (simply the number of letters in the alphabet) is shown, as well as the authors who engineered it and the alphabet itself. The twenty-letter alphabet is an unreduced alphabet and, therefore, has no source.
Table 2.1 also introduces a standardised way to describe amino-acid alphabets. Each
group of amino acids represented by one letter in the alphabet is written as a string of
one-letter codes corresponding to the amino acids in that group. The one-letter codes
are written in alphabetical order, and then the full strings are written out from left to
right, also sorted alphabetically, using the colon character (:) as a separator.
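This notation lends itself to a simple parser and normaliser; the helper names below are hypothetical illustrations of the convention just described.

```python
def parse_alphabet(spec):
    """Map each one-letter residue code to its group string."""
    return {aa: group for group in spec.split(":") for aa in group}

def canonical(spec):
    """Rewrite an alphabet string in the standard form: letters sorted
    within each group, then groups sorted alphabetically, colon-joined."""
    return ":".join(sorted("".join(sorted(g)) for g in spec.split(":")))

print(canonical("DEHKNQR:ACFGILMPSTVWY"))  # ACFGILMPSTVWY:DEHKNQR
print(parse_alphabet("AG:CILMV:P")["L"])   # CILMV
```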
For each set of parameters, EPPIC was run on 5607 PDB structures from the BioMany
and XtalMany datasets, which led to a grand total of 5607×(11+6×7+6×7) = 532 665
runs of the program. Clearly, with each run lasting approximately half a minute, this
would have taken somewhere on the order of months to compute on a standard per-
sonal computer. As a result, computations were carried out using two high-performance
computing clusters (HPCCs): Brutus at ETH Zurich and Merlin 4 at the Paul Scherrer
Institute.
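The run count quoted above can be reproduced directly from the grid sizes described in this section:

```python
# Grid sizes as described in the text:
geometric_cutoffs = 11  # burial cutoffs from 50% to 100% in 5% steps
evo_cutoffs = 6         # burial cutoffs from 50% to 100% in 10% steps
alphabets = 7           # the seven alphabets of Table 2.1
structures = 5607       # BioMany + XtalMany PDB entries

# One grid for the geometric predictor, plus one (cutoff x alphabet)
# grid for each of the two evolutionary predictors:
parameter_sets = geometric_cutoffs + 2 * evo_cutoffs * alphabets
runs = structures * parameter_sets
print(parameter_sets, runs)  # 95 532665
```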
2.3 Results
2.3.1 Treatment of nopred Calls
As discussed in Subsection 1.4.2, there are three different calls that EPPIC’s predictors
can return: bio, xtal, or nopred. As a classifier, EPPIC’s output should be analysed in
the typical manner, which involves tallying correct and incorrect classifications for both
the positive and negative datasets (BioMany and XtalMany, respectively). The handling
of nopred calls in this binary scheme, however, is not trivial. This is because, no matter
how they are treated, some bias will necessarily appear in the results. For example,
treating them with the same “contempt” as incorrect classifications discourages their
occurrence, which is contrary to the fundamental idea of having nopred calls. Ideally,
nopred calls should be preferred to misclassification. On the other hand, treating them
favourably encourages EPPIC to return fewer bio and xtal calls, which, clearly, is
flawed behaviour for software whose purpose is to make predictions.
It was decided that the nopred calls could be treated as xtal calls. This converts the
hard-to-handle tertiary classification problem into a binary one. A possible downside
to this approach is that nopred calls are now considered as suitable to be returned for
crystal interfaces. This outcome is easily defended: a typical user of EPPIC is searching
for biological interfaces in a sea of crystal contacts and would likely view any non-bio
call similarly. As of February 2015, EPPIC’s final calls across the entire PDB show that
approximately six times more crystal interfaces exist than biological interfaces.
2.3.2 Prefiltering Results
For the core–rim and core–surface evolutionary predictors, two different stringencies for
the list of putative homologues were applied: a minimum of ten homologues, each having
at least 50% sequence identity to the query, and a minimum of thirty homologues, each
having at least 60% sequence identity. Imposing these homologue stringencies helps
ensure that the MSA used for calculations is of statistical significance, since too few or overly dissimilar homologues in the alignment are likely to undermine the accuracy of the classifications made.
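The two stringency levels amount to a simple count-above-identity filter. The helper below is a hypothetical illustration, with sequence identities expressed as fractions:

```python
def passes_stringency(identities, min_count, min_identity):
    """True if at least min_count homologues reach min_identity
    (fractional sequence identity to the query)."""
    return sum(1 for x in identities if x >= min_identity) >= min_count

homologues = [0.52, 0.55, 0.61, 0.63, 0.70] * 7   # 35 putative homologues
soft = passes_stringency(homologues, 10, 0.50)    # >= 10 at >= 50% identity
hard = passes_stringency(homologues, 30, 0.60)    # >= 30 at >= 60% identity
print(soft, hard)  # True False
```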
2.3.3 Measures of Classifier Performance
As mentioned in Subsection 2.3.1, each of EPPIC’s predictors is considered a binary
classifier for analysis and optimisation, and nopred calls are simply treated as xtal
calls. Since biological interfaces are typically of interest to the user, bio and xtal calls
shall be positives and negatives, respectively. True positives (TPs), false positives (FPs),
true negatives (TNs), and false negatives (FNs) are briefly defined in Table 2.2.
                      Interface
                 Biological   Crystal
Call    bio          TP          FP
        xtal         FN          TN

Table 2.2: The definitions of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) with respect to the factual interface type, either biological or crystal, and the call made by the binary classifier, either bio or xtal.
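Treating nopred as xtal reduces the tally to the four cells of Table 2.2. A minimal sketch, with a hypothetical helper name:

```python
def confusion_counts(calls, truths):
    """Tally TP/FP/TN/FN, mapping 'nopred' calls to 'xtal' as described."""
    tally = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for call, truth in zip(calls, truths):
        call = "xtal" if call == "nopred" else call
        if call == "bio":
            tally["TP" if truth == "bio" else "FP"] += 1
        else:
            tally["FN" if truth == "bio" else "TN"] += 1
    return tally

print(confusion_counts(["bio", "nopred", "xtal", "bio"],
                       ["bio", "bio",    "xtal", "xtal"]))
# {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```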
Various measures exist to assess the performance of a binary classifier given the defini-
tions in Table 2.2 [67]. Equation (2.1) defines the sensitivity of the classifier, which is
the proportion of positives correctly labelled as positive. The denominator is expanded
for clarity: datapoints correctly labelled as positive and datapoints incorrectly labelled
as negative together form the set of all positives.
Sensitivity = TP / (TP + FN)    (2.1)
Similarly, Equation (2.2) defines the specificity of the classifier, which is the proportion
of negatives correctly labelled as negative. Again, the denominator is expanded for
clarity: datapoints correctly labelled as negative and datapoints incorrectly labelled as
positive together form the set of all negatives.
Specificity = TN / (TN + FP)    (2.2)
In Equation (2.3), the accuracy of the classifier is defined as the correct proportion of all
predictions. There is, however, an issue with all hitherto-presented measures of perfor-
mance: maximising the sensitivity or specificity alone has the side-effect of minimising
the other. This can be illustrated by an extreme case wherein the classifier always makes
a bio call: perfect sensitivity is obtained at the expense of any specificity. The accuracy
is also flawed in that maximising it promotes skewing the classifier’s predictions to obtain
as many correct predictions as possible, which is not necessarily a desirable result, es-
pecially for imbalanced datasets. Considering, for example, a hypothetical dataset with
ninety positive cases and ten negative cases, a binary classifier trained by maximising
accuracy would value sensitivity—the ability to identify those ninety positives—much
more highly than specificity, since even a near-zero specificity would only reduce the
accuracy by approximately 10%.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.3)
It follows that any useful measure of performance would have to impartially balance
the sensitivity and the specificity. One such measure is presented in Equation (2.4),
which defines the balanced accuracy as the arithmetic mean of the sensitivity and the
specificity.
Balanced Accuracy = (Sensitivity + Specificity) / 2    (2.4)
Another interesting measure of performance is the Matthews correlation coefficient
(MCC) [68], presented in Equation (2.5). The MCC was introduced in 1975 as a bal-
anced measure of performance that works with datasets of imbalanced character. It is
effectively a correlation coefficient measuring the agreement between the actual and the
predicted classifications.
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (2.5)
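Equations (2.1) to (2.5) translate directly into code. The sketch below is illustrative; the function name and argument order are assumptions.

```python
from math import sqrt

def metrics(tp, fp, tn, fn):
    """The performance measures of Equations (2.1)-(2.5)."""
    sensitivity = tp / (tp + fn)                          # (2.1)
    specificity = tn / (tn + fp)                          # (2.2)
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # (2.3)
    balanced_accuracy = (sensitivity + specificity) / 2   # (2.4)
    mcc = (tp * tn - fp * fn) / sqrt(                     # (2.5)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
            "mcc": mcc}

m = metrics(tp=80, fp=10, tn=90, fn=20)
print(round(m["balanced_accuracy"], 2), round(m["mcc"], 4))  # 0.85 0.7035
```

Note that the degenerate always-bio classifier discussed above would score a balanced accuracy of only 0.5, which is why this measure is a suitable objective function.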
Over the course of the experiments carried out to generate the results presented below,
it was found that the best method of assessing a classifier’s performance was to use the
balanced accuracy. This measure, therefore, is used as the objective function for the
optimisation of EPPIC’s parameters.
2.3.4 Optimised Parameters
After varying and testing sets of parameters for each of EPPIC’s classifiers, the results
were sorted by the balanced accuracy achieved on the BioMany and XtalMany datasets.
The top ten sets of parameters for each classifier are presented in the tables below.
In Table 2.3, the ten best parameter sets for the geometric classifier are presented. Only
two parameters can be varied for this classifier: the burial cutoff, which is the proportion
of solvent-accessible surface that has to be buried upon interface formation for a residue
to be assigned to the interface core, and the call threshold, which is the minimum number
of core residues in the interface that have to be identified for an interface to be deemed
biological and given a bio call. The parameter optimisation that preceded the original
Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
90%              8               0.8635              0.8647     0.7992        0.9279        0.7344
85%             10               0.8612              0.8621     0.8152        0.9072        0.7264
90%              9               0.8555              0.8573     0.7533        0.9577        0.7283
95%              5               0.8544              0.8552     0.8085        0.9004        0.7126
90%              7               0.8525              0.8530     0.8266        0.8784        0.7064
85%              9               0.8524              0.8526     0.8419        0.8629        0.7052
95%              6               0.8501              0.8517     0.7558        0.9443        0.7146
90%             10               0.8484              0.8507     0.7213        0.9756        0.7230
95%              4               0.8469              0.8465     0.8704        0.8234        0.6942
90%              6               0.8445              0.8439     0.8793        0.8097        0.6900

Table 2.3: The top ten parameter sets of burial cutoffs and call thresholds for the geometric classifier in EPPIC. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
release of EPPIC in 2012 showed an optimal burial cutoff of 95% and an optimal call
threshold of 6 for the geometric classifier [24]. From the table, it can be observed that a
burial cutoff of 90% and a call threshold of 8 achieve better performance. The result is
also intuitive in the sense that relaxing the requirement to be considered a core residue
comes alongside an increase in how many core residues must be present to consider an
interface biological.
In Table 2.4, the ten best parameter sets for the core–rim evolutionary classifier are pre-
sented. In addition to the burial cutoff (as described for the geometric classifier), two fur-
ther parameters can be varied: the reduced alphabet, which is an abbreviated way of re-
ferring to the number of letters in the reduced amino-acid alphabet used to calculate the
entropy at sites in the MSA (taken from the possibilities proposed by Murphy et al. [60],
as well as the six-letter alphabet proposed by Mirny and Shakhnovich [66]), and the call
threshold, which is the scoring boundary below which an interface is given a bio call.
These results are for datapoints where at least ten putative homologues, each having at
least 50% sequence identity to the query, were identifiable.
Table 2.5 again shows the ten best parameter sets for the core–rim evolutionary classifier,
much like in Table 2.4, but only considering datapoints for which a minimum of thirty
putative homologues, each having at least 60% sequence identity to the query, was
identifiable. The results in Table 2.5 can be accepted with greater confidence, since
more data are used per considered datapoint to make the prediction. The typical values
for the balanced accuracies also contrast between tables: selecting putative homologues
more stringently allows for higher scores, further strengthening the case for the use of
Table 2.5. EPPIC’s original parameter optimisation showed an optimal burial cutoff of
70% and an optimal call threshold of 0.75 for the core–rim evolutionary predictor [24].
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
2                  80%             0.75             0.8471              0.8465     0.8215        0.8728        0.6945
2                  80%             0.80             0.8457              0.8453     0.8303        0.8610        0.6912
2                  90%             0.90             0.8455              0.8424     0.7184        0.9726        0.7111
2                  80%             0.70             0.8451              0.8443     0.8099        0.8804        0.6910
2                  80%             0.85             0.8449              0.8447     0.8371        0.8526        0.6895
2                  90%             0.85             0.8442              0.8410     0.7140        0.9743        0.7094
2                  80%             0.65             0.8438              0.8426     0.7938        0.8939        0.6899
2                  90%             0.80             0.8432              0.8399     0.7092        0.9773        0.7090
2                  80%             0.90             0.8431              0.8432     0.8468        0.8395        0.6863
2                  80%             0.60             0.8431              0.8416     0.7826        0.9035        0.6897

Table 2.4: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–rim evolutionary classifier in EPPIC. These are the results obtained when using the less-stringent requirements for putative homologues: a minimum of ten, each having at least 50% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
Reduced amino-acid alphabets were not tested, and the ten-letter alphabet proposed
by Murphy et al. [60] was used. From the table, it can be observed that the six-letter
reduced amino-acid alphabet proposed by Mirny and Shakhnovich [66], a burial cutoff
of 80%, and a call threshold of 0.90 produce better results.
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
6                  80%             0.90             0.8745              0.8727     0.8873        0.8617        0.7437
2                  90%             0.90             0.8725              0.8874     0.7684        0.9766        0.7768
2                  90%             0.85             0.8715              0.8867     0.7653        0.9777        0.7758
2                  90%             0.80             0.8705              0.8861     0.7621        0.9789        0.7749
2                  80%             0.85             0.8700              0.8686     0.8795        0.8605        0.7351
2                  80%             0.75             0.8692              0.8706     0.8592        0.8792        0.7366
2                  80%             0.90             0.8688              0.8660     0.8889        0.8488        0.7316
6                  80%             0.85             0.8680              0.8680     0.8685        0.8675        0.7325
2                  80%             0.80             0.8674              0.8673     0.8685        0.8664        0.7312
2                  80%             0.70             0.8670              0.8700     0.8466        0.8875        0.7344

Table 2.5: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–rim evolutionary classifier in EPPIC. These are the results obtained when using the more-stringent requirements for putative homologues: a minimum of thirty, each having at least 60% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
In Table 2.6, the ten best parameter sets for the core–surface evolutionary classifier
are presented. The parameters varied for this analysis are exactly as those varied in
Tables 2.4 and 2.5, albeit for the core–surface predictor. These results are for datapoints
where at least ten putative homologues, each having at least 50% sequence identity to
the query, were identifiable.
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
2                  80%             −0.90            0.8889              0.8878     0.8436        0.9343        0.7796
2                  80%             −0.95            0.8886              0.8874     0.8383        0.9389        0.7796
2                  80%             −0.85            0.8886              0.8876     0.8472        0.9301        0.7786
2                  80%             −0.80            0.8881              0.8872     0.8504        0.9259        0.7773
2                  80%             −1.00            0.8877              0.8864     0.8319        0.9436        0.7786
2                  80%             −1.05            0.8866              0.8851     0.8255        0.9478        0.7772
2                  80%             −0.75            0.8866              0.8858     0.8524        0.9208        0.7739
2                  70%             −1.05            0.8819              0.8814     0.8628        0.9010        0.7637
2                  70%             −1.00            0.8791              0.8788     0.8660        0.8922        0.7580
15                 80%             −0.90            0.8772              0.8759     0.8255        0.9288        0.7568

Table 2.6: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–surface evolutionary classifier in EPPIC. These are the results obtained when using the less-stringent requirements for putative homologues: a minimum of ten, each having at least 50% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
Table 2.7 again shows the ten best parameter sets for the core–surface evolutionary
classifier, much like in Table 2.6, but only considering datapoints for which a minimum
of thirty putative homologues, each having at least 60% sequence identity to the query,
was identifiable. The high-stringency results in Table 2.7, again, can be accepted with
greater confidence and exhibit higher values of balanced accuracy. EPPIC’s original
parameter optimisation showed an optimal burial cutoff of 70% and an optimal call
threshold of −1.00 for the core–surface evolutionary predictor [24]. Reduced amino-acid
alphabets were not tested, and the ten-letter alphabet proposed by Murphy et al. [60]
was used. From the table, it can be observed that the six-letter reduced amino-acid
alphabet proposed by Mirny and Shakhnovich [66], a burial cutoff of 80%, and a call
threshold of −0.90 produce better results.
Perhaps more interesting than the slight differences in optimised burial cutoffs and call
thresholds is the fact that the two-letter alphabet proposed by Murphy et al. appears so
frequently in the top parameter sets. Indeed, if Tables 2.4 and 2.6 were to be extended to
each show the top twenty-five parameter sets, the two-letter alphabet would be present
in fourteen and eleven of them, respectively. This is not just the case for the low-
stringency homologues, either: similarly extending Tables 2.5 and 2.7 to each show
twenty-five parameter sets would result in twelve and nine appearances, respectively, of
the two-letter alphabet.
In Figure 2.1, the effect of different amino-acid alphabets—the two-, four-, eight-, ten-,
and fifteen-letter reduced amino-acid alphabets proposed by Murphy et al. [60]; the six-
letter one proposed by Mirny and Shakhnovich [66]; and the standard, unreduced twenty-
letter alphabet—on the balanced accuracy of the core–surface evolutionary predictor is
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
6                  80%             −0.90            0.9142              0.9169     0.8951        0.9332        0.8300
15                 80%             −0.90            0.9132              0.9155     0.8967        0.9297        0.8274
2                  80%             −0.85            0.9122              0.9149     0.8936        0.9308        0.8259
15                 80%             −0.85            0.9114              0.9135     0.8967        0.9261        0.8233
6                  80%             −0.95            0.9110              0.9142     0.8889        0.9332        0.8245
6                  80%             −0.95            0.9110              0.9142     0.8889        0.9332        0.8245
6                  80%             −1.05            0.9110              0.9155     0.8795        0.9426        0.8272
2                  80%             −0.80            0.9108              0.9129     0.8967        0.9250        0.8220
6                  80%             −0.85            0.9106              0.9129     0.8951        0.9261        0.8219
2                  80%             −0.90            0.9104              0.9142     0.8842        0.9367        0.8244

Table 2.7: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–surface evolutionary classifier in EPPIC. These are the results obtained when using the more-stringent requirements for putative homologues: a minimum of thirty, each having at least 60% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
shown for various call thresholds. No matter the call threshold, the trends remain quite
consistent. The twenty-letter alphabet does quite poorly, which is to be expected, since
this maximal level of detail is likely to bring more noise than signal into the balance
during the entropy calculations. The four-letter alphabet, however, does particularly
badly, especially in comparison to its two- and six-letter neighbours, while the two-,
six-, and fifteen-letter alphabets perform best.
It is puzzling, or at least unintuitive, that discarding arguably 90% of
the information in the MSA—going from twenty amino acids to two—is so beneficial to
the predictions made by the evolutionary classifiers in EPPIC. With a view to elucidate
this observation, the topic of reduced amino-acid alphabets forms much of the remainder
of this thesis. For the sake of conciseness, a terminological adjustment should be made:
henceforth, let an “n-alphabet” refer to an n-letter reduced amino-acid alphabet, where
n can be any natural number from 2 to 20.
2.4 Conclusion
All three of EPPIC’s predictors—geometric, core–rim evolutionary, and core–surface
evolutionary—have a large number of parameters that can be varied. While the default
values for these parameters were chosen based on an optimisation carried out just prior to
the original release of EPPIC in 2012 [24], this optimisation was not an exhaustive search
and was performed using a small dataset. Two new datasets, BioMany and XtalMany,
compiled computationally by Baskaran et al. [65], allowed for a more thorough analysis of
[Figure: balanced accuracy (0.86–0.92) plotted against alphabet identifier (2, 4, 6, 8, 10, 15, 20) for core–surface call thresholds from −1.05 to −0.75.]

Figure 2.1: The effects of various reduced amino-acid alphabets on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance.
the effects of parameter values on EPPIC’s predictions. Through this, it was found that
the geometric predictor performs best with a burial cutoff of 90% and a call threshold
of 8, as shown in Table 2.3; that the core–rim evolutionary predictor performs best with
the 6-alphabet proposed by Mirny and Shakhnovich [66], a burial cutoff of 80%, and
a call threshold of 0.90, as shown in Table 2.5; and that the core–surface evolutionary
predictor performs best with the 6-alphabet proposed by Mirny and Shakhnovich [66], a
burial cutoff of 80%, and a call threshold of −0.90, as shown in Table 2.7. Additionally,
2-alphabets are surprisingly prominent in the highest-ranking parameter sets for the
evolutionary predictors, a fact that drove much of the remainder of this thesis toward
further investigation of the importance of reduced amino-acid alphabets on EPPIC’s
predictive performance.
Chapter 3
Reduced Amino-Acid Alphabets
3.1 Rationale
In Section 2.4, Chapter 2 was concluded with some interesting findings: the 2-alphabet
proposed by Murphy et al. [60] seems to come up surprisingly often in the highest-
ranking parameter sets for the evolutionary predictors in EPPIC, and the 6-alphabet
proposed by Mirny and Shakhnovich [66] is present in the best parameter sets. Clearly,
further investigation into the effects that different amino-acid alphabets have on EPPIC’s
performance was needed. EPPIC, however, had its seven amino-acid alphabets, those
defined in Table 2.1, hardcoded in its source code. This meant that some changes to
EPPIC had to first be made.
3.2 Published Alphabets
The first step was to collect reduced amino-acid alphabets that had already been pro-
posed by the scientific community. The VisCoSe tool, published by Spitzer et al. [69],
allows users to select from a list of compiled amino-acid alphabets and includes descrip-
tions and citations for each. These alphabets, as well as the ones from Chapter 2, are
listed in Table 3.1. Due to the style in which these alphabets were described by the
VisCoSe tool, it was not immediately obvious that Alphabets 25 and 29 were, in fact,
the same.
3.2.1 Methods
The eleven additional alphabets in Table 3.1, when compared to Table 2.1, were manually
added into EPPIC’s source code, and EPPIC was run on the BioMany and XtalMany
Alphabet                                                             Size of
Identifier   Source   Amino-Acid Alphabet                           Alphabet
 2           [60]     ACFGILMPSTVWY:DEHKNQR                             2
 4           [60]     AGPST:CILMV:DEHKNQR:FWY                           4
 6           [66]     ACILMV:DE:FHWY:GP:KR:NQST                         6
 8           [60]     AG:CILMV:DENQ:FWY:H:KR:P:ST                       8
10           [60]     A:C:DENQ:FWY:G:H:ILMV:KR:P:ST                    10
15           [60]     A:C:D:E:FY:G:H:ILMV:KR:N:P:Q:S:T:W               15
20           —        A:C:D:E:F:G:H:I:K:L:M:N:P:Q:R:S:T:V:W:Y          20
21           [63]     AGPST:CFILMVWY:DEHKNQR                            3
22           [63]     AGPST:CFWY:DEHKNQR:ILMV                           4
23           [63]     APST:CFWY:DEHKNQR:G:ILMV                          5
24           [63]     AST:C:DEQ:FWY:G:HN:IV:KR:LM:P                    10
25           [59]     ADEGHKNPQRST:CFILMVWY                             2
26           [59]     AGHPRT:CFILMVWY:DEKNQS                            3
27           [59]     AHT:CFILMVWY:DE:GP:KNQRS                          5
28           [59]     AGST:CFIM:DENQ:HKPR:LVWY                          5
29           [69]     ADEGHKNPQRST:CFILMVWY                             2
30           [69]     ACGS:DEKR:FHWY:ILV:MNPQT                          5
31           [69]     ACGS:DE:FHWY:ILV:KR:MNPQT                         6

Table 3.1: Sixteen amino-acid alphabets proposed by the scientific community. For each alphabet, the alphabet identifier—no longer necessarily the number of letters in the alphabet—is shown, as well as the authors who proposed it, the alphabet itself, and the number of groups in the alphabet. Alphabet 20 is an unreduced alphabet and, therefore, has no source. Note that Alphabets 25 and 29 are identical.
datasets once per alphabet. The burial cutoff used for core-residue assignment was 80%.
When the authors of a PDB entry collect new data or rerefine a structure, the PDB
entry is made obsolete and superseded by a new PDB entry. When this occurs to PDB
entries in the BioMany or XtalMany datasets, the entry must be removed, since it is no
longer considered a valid structure. The new, superseding structure does not necessarily
conform to the standards applied when generating the datasets and, hence, cannot be
added to either dataset. When this experiment was carried out, four PDB entries had
been superseded: 2H7I, 2HCY, 3DMA, and 4M7N. The more-stringent criteria for putative
homologues, a minimum of thirty, each having at least 60% sequence identity to the
query, were imposed. Additionally, at this point, it was decided that, for these and
future experiments, there is little use investigating both the core–rim and core–surface
evolutionary predictors. The core–surface one offers a much more robust measure and
will, thus, represent both predictors. Computations were carried out using the Merlin 4
HPCC at the Paul Scherrer Institute.
3.2.2 Results
Figure 3.1 shows the results of running the EPPIC core–surface evolutionary predictor on
the BioMany and XtalMany datasets for each of the alphabets in Table 3.1. The results
are shown for seven different core–surface call thresholds. No matter the threshold,
Alphabet 2 clearly comes out on top, with Alphabet 15 following close behind, while
Alphabet 25, identical to Alphabet 29, is the worst by a significant margin.
In Figure 3.1, it is also interesting that the alphabets contributing to high performance
tend to have less variation for different call thresholds. Ideally, there would be a measure
that could summarise not only the balanced accuracy, which, it might be helpful to recall,
is the arithmetic mean of the sensitivity and specificity, but also the variation across call
thresholds. Such a measure is described below in Subsection 3.2.3, and the remainder
of the results from this chapter then follow in Subsection 3.2.4.
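As a quick reference, the balanced accuracy recalled above can be computed from confusion-matrix counts in a few lines. This is a generic sketch, not EPPIC's code; the count names are illustrative:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Arithmetic mean of the sensitivity (true-positive rate)
    and the specificity (true-negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Example: 90 of 100 bio interfaces and 80 of 100 xtal interfaces called correctly.
balanced_accuracy(tp=90, fn=10, tn=80, fp=20)  # ≈ 0.85
```

Because it averages the two per-class rates, the measure is insensitive to how many bio versus xtal interfaces the benchmark happens to contain.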
[Figure: balanced accuracy (0.65–0.90) plotted against alphabet identifier (2–31, as in Table 3.1) for core–surface call thresholds from −1.05 to −0.75.]

Figure 3.1: The effects of published reduced amino-acid alphabets on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance and the alphabet identifiers are those defined in Table 3.1.
3.2.3 Receiver Operating Characteristic
The receiver operating characteristic (ROC) curve is a method of plotting particular
aspects of the performance of a binary classifier against one another, all while varying the
call threshold. Specifically, the true-positive rate, which is equivalent to the sensitivity,
is plotted as a function of the false-positive rate, which is equivalent to unity minus the
specificity, at various threshold settings. The ROC curve necessarily passes through two
points: (0, 0), since designating all interfaces as xtal correctly identifies all negatives
but also labels all positives as false negatives, and (1, 1), since designating all interfaces
as bio correctly identifies all positives but also labels all negatives as false positives.
The diagonal of the plot, stretching between the two aforementioned points, is the line
below which the performance of the binary classifier is mathematically worse than that
of random classification.
The best result for a binary classifier would be to have an ROC curve that rises sharply
from (0, 0) to (0 + ε, 1), stays flat across y = 1, and ends at (1, 1). Thus, in general, an
ROC curve that tends to be higher or more left-lying is considered to be demonstrative
of a better classifier. The area under the ROC curve (AUROC) is a measure of such
characteristics: integrating the curve over x ∈ [0, 1] results in a number from 0 to 1, which
represents the performance of the classifier irrespective of the call threshold chosen. A
binary classifier with a higher AUROC value can be said to be a better classifier; however,
when two binary classifiers have similar AUROC values but dissimilar ROC curves, one
must consider the prevalence of positives versus the prevalence of negatives, as well
as the cost of false positives versus the cost of false negatives, before deciding which
classifier is more appropriate [70].
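The ROC construction described above can be sketched directly: sweep the call threshold over the observed scores, record a (false-positive rate, true-positive rate) point at each threshold, and integrate with the trapezoidal rule. This is a generic sketch, not EPPIC's implementation, and it assumes that an interface is called bio when its score lies at or below the threshold (in keeping with the negative core–surface call thresholds used above):

```python
def roc_auc(scores, labels):
    """ROC points and trapezoidal AUROC for a score where lower means
    more bio-like. labels: 1 = bio (positive), 0 = xtal (negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # call everything xtal: no true or false positives
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    points.append((1.0, 1.0))  # call everything bio
    points.sort()
    auroc = sum((x2 - x1) * (y1 + y2) / 2  # trapezoidal rule over x in [0, 1]
                for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auroc

# A perfectly separating score yields an AUROC of 1.0:
_, auroc = roc_auc([-1.8, -1.2, 0.3, 0.9], [1, 1, 0, 0])  # auroc == 1.0
```

The two fixed endpoints (0, 0) and (1, 1) correspond exactly to the all-xtal and all-bio extremes discussed above.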
3.2.4 Further Results
In Figure 3.2, the ROC curves for the eighteen amino-acid alphabets from Table 3.1 are
shown. The integration of each ROC curve has also been calculated, and the AUROC
values are displayed in Table 3.2. With the highest AUROC, 0.9536, Alphabet 2 is
confirmed as the best alphabet for the core–surface evolutionary predictor, and the
identical Alphabets 25 and 29, with AUROCs of 0.7602 and 0.7601, respectively, are
confirmed to be the worst. Note that the AUROC values for the two identical
alphabets are not exactly the same. This is due to some stochasticity in the algorithm of
the core–surface evolutionary predictor and will be addressed in more detail in Chapter 4.
Arguably, the most intriguing result of this section is that two different 2-alphabets
manage to achieve both first- and last-place rankings. It would be promising to delve
[Figure: ROC curves, true-positive rate against false-positive rate, one curve per alphabet identifier from Table 3.1.]

Figure 3.2: The ROC curves for EPPIC's core–surface evolutionary predictor for the published alphabets being evaluated in Section 3.2. The alphabet identifiers are those defined in Table 3.1.
Alphabet
Identifier   AUROC
 2           0.9536
 4           0.9091
 6           0.9360
 8           0.9295
10           0.9217
15           0.9357
20           0.9045
21           0.9262
22           0.9048
23           0.9021
24           0.9126
25           0.7602
26           0.9127
27           0.9280
28           0.8921
29           0.7601
30           0.9167
31           0.9263

Table 3.2: The AUROC values for EPPIC's core–surface evolutionary predictor for the published alphabets being evaluated in Section 3.2. The alphabet identifiers are those defined in Table 3.1.
further into the topic of 2-alphabets by seeing how randomly-generated 2-alphabets fare
in comparison.
3.3 Random 2-Alphabets
In order to test the effects of randomly-generated 2-alphabets, EPPIC had to be re-
designed once again. The hardcoded amino-acid alphabets were kept for convenience,
but the necessity of their use was eliminated by allowing for custom alphabets to be de-
fined in EPPIC’s configuration file. Once this was accomplished, it became much easier
to benchmark the software using random 2-alphabets.
3.3.1 Methods
The process begins by randomly generating ten 2-alphabets. These alphabets are listed
in Table 3.3. Note that the alphabets are all fairly balanced: no fewer than five (and,
hence, no more than fifteen) amino acids are in any one category.
Alphabet
Identifier   2-Alphabet
0 ACDEFGIKLPSTVW:HMNQRY
1 ACDEFGIKMNPRST:HLQVWY
2 ACDEFGKLMPQRTVW:HINSY
3 ACDEFGKNRS:HILMPQTVWY
4 ACDEFHIMNPWY:GKLQRSTV
5 ACDEFHKLNPQRSVW:GIMTY
6 ACDEFHKLPQSWY:GIMNRTV
7 ACDEHIKLMPRSTV:FGNQWY
8 ACDELMNPTWY:FGHIKQRSV
9 ACDFGHILQRSWY:EKMNPTV
Table 3.3: Ten randomly-generated 2-alphabets. For each alphabet, the alphabet identifier—which has been reset and now starts from 0—is shown, as well as the alphabet itself.
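Balanced random splits like those in Table 3.3 can be produced by rejection sampling; whether this is how the alphabets above were actually generated is not stated, so the sketch below is only illustrative:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_2alphabet(rng, min_group_size=5):
    """Draw a random two-group split of the twenty amino acids, rejecting
    splits that leave either group with fewer than min_group_size members."""
    while True:
        first = {aa for aa in AMINO_ACIDS if rng.random() < 0.5}
        if min_group_size <= len(first) <= len(AMINO_ACIDS) - min_group_size:
            second = set(AMINO_ACIDS) - first
            return "".join(sorted(first)) + ":" + "".join(sorted(second))

alphabet = random_2alphabet(random.Random(1))
```

The rejection step enforces the balance property noted above: no fewer than five, and hence no more than fifteen, amino acids in either category.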
For each alphabet in Table 3.3, the EPPIC core–surface evolutionary predictor was run
on the BioMany and XtalMany datasets. The burial cutoff used for core-residue as-
signment was 80%. The more-stringent criteria for putative homologues, a minimum of
thirty, each having at least 60% sequence identity to the query, were imposed. Compu-
tations were carried out using the Merlin 4 HPCC at the Paul Scherrer Institute.
3.3.2 Results
Figure 3.3 shows the results of running the EPPIC core–surface evolutionary predictor on
the BioMany and XtalMany datasets for each of the alphabets in Table 3.3. The results
are shown for seven different core–surface call thresholds. No matter the threshold, it
is fairly clear that there is not much variability in the balanced accuracy as a function
of the random 2-alphabet used. Alphabet 4 ranks best, Alphabet 9 follows by a hair’s
breadth, and Alphabet 7 is worst by a fair amount. Just as for the previous section,
the ROC curves for these alphabets can be generated and their AUROC values used to
better summarise the performance.
[Figure: balanced accuracy (0.72–0.86) plotted against alphabet identifier (0–9) for core–surface call thresholds from −1.05 to −0.75.]

Figure 3.3: The effects of random 2-alphabets on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance and the alphabet identifiers are those defined in Table 3.3.
In Figure 3.4, the ROC curves for the ten randomly-generated amino-acid alphabets
from Table 3.3 are shown. The integration of each ROC curve has also been calculated,
and the AUROC values are displayed in Table 3.4. With the highest AUROC, 0.9070,
Alphabet 4 is confirmed as the best of the alphabets in Table 3.3, and Alphabet 7, with
an AUROC of 0.7861, is confirmed to be the worst alphabet.
Although only ten random 2-alphabets were looked at, they were all well-balanced, and
there does not appear to be much variability amongst them, using either the balanced
accuracy or AUROC value as a measure of performance. The average AUROC value
for the random 2-alphabets in this section is 0.8600 ± 0.0119. Why, then, does the 2-
alphabet proposed by Murphy et al. [60] perform so much better than these (0.9536),
and why does the 2-alphabet proposed by Wang and Wang [59] perform so much worse
(0.7602)?
[Figure: ROC curves, true-positive rate against false-positive rate, one curve per alphabet identifier (0–9).]

Figure 3.4: The ROC curves for EPPIC's core–surface evolutionary predictor for the random 2-alphabets being evaluated in Section 3.3. The alphabet identifiers are those defined in Table 3.3.
Alphabet
Identifier   AUROC
0            0.8541
1            0.8706
2            0.8168
3            0.8818
4            0.9070
5            0.8373
6            0.8741
7            0.7861
8            0.8700
9            0.9026

Table 3.4: The AUROC values for EPPIC's core–surface evolutionary predictor for the random 2-alphabets being evaluated in Section 3.3. The alphabet identifiers are those defined in Table 3.3.
3.4 Transitioning 2-Alphabets
In Sections 3.2 and 3.3, it was found that the 2-alphabet proposed by Murphy et al. [60]
performs about 10% better than random 2-alphabets, whereas the 2-alphabet proposed
by Wang and Wang [59] performs about 10% worse than random 2-alphabets. How is it
that two alphabets published by the scientific community could span such a range? In
fact, there are only five differently-classified amino acids between the two alphabets—
alanine, glycine, proline, serine, and threonine. Thus, to mutate one alphabet into
the other requires five one-letter recategorisations. Every alphabet between these two
proposed 2-alphabets—in other words, the power set of these five changes—can be tested
with EPPIC’s core–surface evolutionary predictor.
3.4.1 Methods
All intermediary 2-alphabets can be thought of as a presence or absence of the shift for
each of the five amino acids whose categorisations differ between the alphabets. This
allows for 2⁵ = 32 possible 2-alphabets, which have been listed in Table 3.5. Note that
Alphabet 0 is defined as the 2-alphabet proposed by Murphy et al. [60], while Alphabet V
is defined as the 2-alphabet proposed by Wang and Wang [59]. Each alphabet between
Alphabets 0 and V corresponds to a systematically-interpolated alphabet. For example,
Alphabets 1 through 5 correspond to taking Alphabet 0 and moving one of alanine,
glycine, proline, serine, or threonine, in that order, to the other group.
Alphabet
Identifier   2-Alphabet
0            ACFGILMPSTVWY:DEHKNQR
1            ADEHKNQR:CFGILMPSTVWY
2            ACFILMPSTVWY:DEGHKNQR
3            ACFGILMSTVWY:DEHKNPQR
4            ACFGILMPTVWY:DEHKNQRS
5            ACFGILMPSVWY:DEHKNQRT
6            ADEGHKNQR:CFILMPSTVWY
7            ADEHKNPQR:CFGILMSTVWY
8            ADEHKNQRS:CFGILMPTVWY
9            ADEHKNQRT:CFGILMPSVWY
A            ACFILMSTVWY:DEGHKNPQR
B            ACFILMPTVWY:DEGHKNQRS
C            ACFILMPSVWY:DEGHKNQRT
D            ACFGILMTVWY:DEHKNPQRS
E            ACFGILMSVWY:DEHKNPQRT
F            ACFGILMPVWY:DEHKNQRST
G            ADEGHKNPQR:CFILMSTVWY
H            ADEGHKNQRS:CFILMPTVWY
I            ADEGHKNQRT:CFILMPSVWY
J            ADEHKNPQRS:CFGILMTVWY
K            ADEHKNPQRT:CFGILMSVWY
L            ADEHKNQRST:CFGILMPVWY
M            ACFILMTVWY:DEGHKNPQRS
N            ACFILMSVWY:DEGHKNPQRT
O            ACFILMPVWY:DEGHKNQRST
P            ACFGILMVWY:DEHKNPQRST
Q            ADEGHKNPQRS:CFILMTVWY
R            ADEGHKNPQRT:CFILMSVWY
S            ADEGHKNQRST:CFILMPVWY
T            ADEHKNPQRST:CFGILMVWY
U            ACFILMVWY:DEGHKNPQRST
V            ADEGHKNPQRST:CFILMVWY

Table 3.5: Thirty-two 2-alphabets, the first and last of which correspond to the 2-alphabets proposed by Murphy et al. [60] and Wang and Wang [59], respectively. Since there are only five differences in the categorisation of amino acids between the two 2-alphabets, alphabets between the first and last represent all possible alphabets "between" the two. For each alphabet, the alphabet identifier—which has, again, been reset—is shown, as well as the alphabet itself.
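The power set of the five recategorisations can be enumerated mechanically. The sketch below regenerates all thirty-two intermediate alphabets; note that the group order within an alphabet may differ from Table 3.5, which lists some alphabets with the two groups swapped:

```python
from itertools import product

MURPHY_2 = "ACFGILMPSTVWY:DEHKNQR"  # Alphabet 0
MOVABLE = "AGPST"  # the five residues categorised differently by Wang and Wang [59]

def interpolate(deltas):
    """Move the residues flagged in deltas (aligned with MOVABLE) from
    the first Murphy group to the second, yielding one intermediate 2-alphabet."""
    first, second = (set(group) for group in MURPHY_2.split(":"))
    for aa, moved in zip(MOVABLE, deltas):
        if moved:
            first.remove(aa)
            second.add(aa)
    return "".join(sorted(first)) + ":" + "".join(sorted(second))

# All 2^5 = 32 presence/absence patterns of the five shifts:
alphabets = [interpolate(d) for d in product((0, 1), repeat=5)]
```

With all five residues moved, `interpolate((1, 1, 1, 1, 1))` yields the Wang and Wang grouping, written here with the hydrophobic group first: CFILMVWY:ADEGHKNPQRST.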
For each alphabet in Table 3.5, the EPPIC core–surface evolutionary predictor was run
on the BioMany and XtalMany datasets. The burial cutoff used for core-residue as-
signment was 80%. The more-stringent criteria for putative homologues, a minimum of
thirty, each having at least 60% sequence identity to the query, were imposed. Compu-
tations were carried out using the Merlin 4 HPCC at the Paul Scherrer Institute.
3.4.2 Results
Figure 3.5 shows the results of running the EPPIC core–surface evolutionary predictor
on the BioMany and XtalMany datasets for each of the alphabets in Table 3.5. The
results are shown for seven different core–surface call thresholds. No matter the thresh-
old, the trend is the same: Alphabet 0 fares best, Alphabet V fares worst, and the
transitioning alphabets between them drop off as more amino acids are recategorised.
Just as for previous sections, the ROC curves for these alphabets can be generated and
their AUROC values used to better summarise the performance.
In Figure 3.6, the ROC curves for the thirty-two transitioning 2-alphabets from Table 3.5
are shown. The integration of each ROC curve has also been calculated, and the AUROC
values are displayed in Table 3.6.
Unlike in the previous sections, however, the noteworthy results here are not immediately
visible from either Figure 3.5 or Figure 3.6. It was obvious that there would be some
sort of decrease in performance as Alphabet 0 transitioned into Alphabet V, but the
real problem was to identify which step in the transitioning was the most detrimental.
[Figure: balanced accuracy (0.70–0.85) plotted against alphabet identifier (0–V) for core–surface call thresholds from −1.05 to −0.75.]

Figure 3.5: All possible 2-alphabets that group amino acids in a manner intermediary to the two 2-alphabets proposed by Murphy et al. [60] and Wang and Wang [59] and their effects on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance and the alphabet identifiers are those defined in Table 3.5.
[Figure: ROC curves, true-positive rate against false-positive rate, one curve per transitioning 2-alphabet.]

Figure 3.6: The ROC curves for EPPIC's core–surface evolutionary predictor for the transitioning 2-alphabets being evaluated in Section 3.4. The alphabet identifiers are those defined in Table 3.5.
Alphabet
Identifier   AUROC
0            0.9535
1            0.8960
2            0.9392
3            0.9465
4            0.9240
5            0.9287
6            0.8902
7            0.8891
8            0.8730
9            0.8643
A            0.9332
B            0.8975
C            0.9069
D            0.9169
E            0.9255
F            0.9002
G            0.8787
H            0.8569
I            0.8505
J            0.8597
K            0.8528
L            0.8430
M            0.8856
N            0.9000
O            0.8620
P            0.8908
Q            0.8324
R            0.8305
S            0.8066
T            0.8176
U            0.8440
V            0.7601

Table 3.6: The AUROC values for EPPIC's core–surface evolutionary predictor for the transitioning 2-alphabets being evaluated in Section 3.4. The alphabet identifiers are those defined in Table 3.5.
Looking at Figure 3.5, one could note that going from Alphabet 0 to Alphabet 1, which
corresponds to recategorising the alanine, brings with it a sizeable drop in performance.
But what effect does a recategorisation of the alanine have on alphabets that have already
had some other amino acids recategorised? For example, the only difference between
Alphabet M and Alphabet Q is the alanine, but both alphabets also already had glycine,
proline, and serine recategorised. Looking at the difference between Alphabet M and
Alphabet Q in Figure 3.5 shows, as expected, a decrease in performance. In fact, the
decrease in the AUROC value from Alphabet 0 to Alphabet 1 is 0.0575, and the decrease
in the AUROC value from Alphabet M to Alphabet Q is 0.0532. The proximity of these
values to one another acted as inspiration to model the AUROC as a function of the
presence or absence of each part of the full transition from Alphabet 0 to Alphabet V.
3.4.3 Linear Model
A linear model was fit to the data, such that the AUROC value for EPPIC’s core–
surface evolutionary predictor could be predicted given only the status of each amino
acid having been recategorised. Equation (3.1) shows the linear model after training,
where δX is the status of X having been recategorised in the amino-acid alphabet being
used. For example, Alphabet 0 has δA = δG = δP = δS = δT = 0, while Alphabet V has
δA = δG = δP = δS = δT = 1.
AUROC ≈ 0.9672 − 0.0596δA − 0.0255δG − 0.0143δP − 0.0385δS − 0.0368δT (3.1)
Figure 3.7 shows the result of using the linear model to predict the AUROC values, in
blue, which, when compared to the true AUROC values, in pink, seems to match quite
well. Indeed, the coefficient of determination, R², was 0.96. This is a very good result
for a linear model, since the effects of recategorising individual amino acids are certainly
not independent of one another: recategorising one amino acid changes the composition
of the group that every other amino acid shares, so the contributions need not be
additive.
Further conclusions can be drawn by examining the coefficients in Equation (3.1). The
coefficient of largest magnitude is that for alanine, which means that its categorisation is
the most important of the five differences between Alphabet 0 and Alphabet V. The next
most important is serine, followed closely by threonine, which makes sense considering
their similar properties and prevalences [71]. The coefficient for glycine places fourth,
and, finally, proline follows in the least-important position.
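Because the thirty-two alphabets form a full 2⁵ factorial design, the ordinary-least-squares main effects can be recovered without any matrix algebra: each coefficient is simply the difference between the mean AUROC with and without that recategorisation, and the intercept follows from the grand mean. The sketch below refits the model from the AUROC values in Table 3.6; the closed-form shortcut is an illustrative choice (the thesis presumably used a standard least-squares routine), and the δ-vector encoding of which of A, G, P, S, and T are recategorised relative to Alphabet 0 is reconstructed from Table 3.5:

```python
# AUROC values from Table 3.6, keyed by (dA, dG, dP, dS, dT); 1 means the
# residue is recategorised relative to Alphabet 0 (Murphy et al. [60]).
auroc = {
    (0, 0, 0, 0, 0): 0.9535, (1, 0, 0, 0, 0): 0.8960,  # Alphabets 0, 1
    (0, 1, 0, 0, 0): 0.9392, (0, 0, 1, 0, 0): 0.9465,  # Alphabets 2, 3
    (0, 0, 0, 1, 0): 0.9240, (0, 0, 0, 0, 1): 0.9287,  # Alphabets 4, 5
    (1, 1, 0, 0, 0): 0.8902, (1, 0, 1, 0, 0): 0.8891,  # Alphabets 6, 7
    (1, 0, 0, 1, 0): 0.8730, (1, 0, 0, 0, 1): 0.8643,  # Alphabets 8, 9
    (0, 1, 1, 0, 0): 0.9332, (0, 1, 0, 1, 0): 0.8975,  # Alphabets A, B
    (0, 1, 0, 0, 1): 0.9069, (0, 0, 1, 1, 0): 0.9169,  # Alphabets C, D
    (0, 0, 1, 0, 1): 0.9255, (0, 0, 0, 1, 1): 0.9002,  # Alphabets E, F
    (1, 1, 1, 0, 0): 0.8787, (1, 1, 0, 1, 0): 0.8569,  # Alphabets G, H
    (1, 1, 0, 0, 1): 0.8505, (1, 0, 1, 1, 0): 0.8597,  # Alphabets I, J
    (1, 0, 1, 0, 1): 0.8528, (1, 0, 0, 1, 1): 0.8430,  # Alphabets K, L
    (0, 1, 1, 1, 0): 0.8856, (0, 1, 1, 0, 1): 0.9000,  # Alphabets M, N
    (0, 1, 0, 1, 1): 0.8620, (0, 0, 1, 1, 1): 0.8908,  # Alphabets O, P
    (1, 1, 1, 1, 0): 0.8324, (1, 1, 1, 0, 1): 0.8305,  # Alphabets Q, R
    (1, 1, 0, 1, 1): 0.8066, (1, 0, 1, 1, 1): 0.8176,  # Alphabets S, T
    (0, 1, 1, 1, 1): 0.8440, (1, 1, 1, 1, 1): 0.7601,  # Alphabets U, V
}

mean_y = sum(auroc.values()) / len(auroc)

# In a balanced full factorial, each OLS main effect equals the difference
# of conditional means, and the intercept follows from the grand mean.
coef = {}
for i, aa in enumerate("AGPST"):
    on = [y for d, y in auroc.items() if d[i] == 1]
    off = [y for d, y in auroc.items() if d[i] == 0]
    coef[aa] = sum(on) / len(on) - sum(off) / len(off)
intercept = mean_y - 0.5 * sum(coef.values())

def predict(deltas):
    return intercept + sum(c * d for c, d in zip(coef.values(), deltas))

ss_res = sum((y - predict(d)) ** 2 for d, y in auroc.items())
ss_tot = sum((y - mean_y) ** 2 for y in auroc.values())
r_squared = 1 - ss_res / ss_tot
```

Rounded to four decimals, this reproduces the fitted coefficient magnitudes (0.0596 for alanine, 0.0255 for glycine, 0.0143 for proline, 0.0385 for serine, and 0.0368 for threonine, all with negative sign) and an R² close to the reported 0.96.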
[Figure: AUROC (0.75–1.00) plotted against alphabet identifier (0–V), showing the data alongside the linear model's predictions; R² = 0.96.]

Figure 3.7: A linear model used to deconvolve the effects of recategorising each of the five amino acids. The original data (in pink) and the predictions of the model (in blue) are shown for the transitioning 2-alphabets being evaluated in Section 3.4. The alphabet identifiers are those defined in Table 3.5. The coefficient of determination, R² = 0.96.
3.5 Conclusion
In this chapter, reduced amino-acid alphabets published by the scientific community
were tested by running EPPIC's core–surface evolutionary predictor on the BioMany
and XtalMany datasets, with the AUROC values used as measures of classifier
performance. The performance of random 2-alphabets was also
tested, and an interesting finding was that, while random 2-alphabets exhibit little
variability (0.8600 ± 0.0119), the 2-alphabet proposed by Murphy et al. [60] performs much
better (0.9536), and the 2-alphabet proposed by Wang and Wang [59] performs much
worse (0.7602). Despite this, the only difference between the two aforementioned al-
phabets is the categorisation of five amino acids: alanine, glycine, proline, serine, and
threonine. All thirty-two intermediate alphabets were tested, the performance
was shown to depend more-or-less linearly on the amino-acid categorisations,
and alanine was identified as the amino acid whose categorisation was of greatest
significance.
Chapter 4
Surface-Residue Sampling
4.1 Rationale
This chapter deviates from the investigative flow of the thesis but is a necessary explana-
tory tangent, since the following chapter will be predicated on the results from this one.
EPPIC’s core–surface evolutionary predictor was introduced in Subsection 1.4.2, but
was not described in detail.
EPPIC determines the solvent accessibility of residues before and after the formation of
the protein–protein interface, and the ratio—the BSA-to-ASA ratio—is used to classify
them as either “core” (residues fully buried upon interface formation), “rim” (residues
partially buried upon interface formation), “surface” (surface residues not involved in an
interface), or “other” (any non-interface, non-surface residues; for example, residues in
a subunit’s core). The burial cutoffs used in classification are variable parameters. Once
the categories for every residue have been assigned, EPPIC proceeds by searching for ho-
mologues. A minimum number of homologues, each having a minimum sequence identity
to the query, are required for the evolutionary predictor to continue. The homologues
are then input into a clustering algorithm for the purpose of reducing redundancy. This
is done to ensure that each cluster of similar homologues is fairly characterised by one
representative cluster member. These representatives are then aligned with one another
in a large MSA. The information entropy for each column in the MSA is calculated,
making sure to consider the (reduced) amino-acid alphabet of choice. The entropies of
the core residues must now be compared to the entropies of the surface residues, and
this forms the topic for this chapter.
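The per-column entropy calculation just described can be sketched as follows. The default grouping here is the 6-alphabet proposed by Mirny and Shakhnovich [66]; the base-2 logarithm and the skipping of gap characters are assumptions for illustration rather than EPPIC's exact choices:

```python
import math
from collections import Counter

def column_entropy(column, alphabet="ACILMV:DE:FHWY:GP:KR:NQST"):
    """Shannon entropy of one MSA column after mapping each residue to its
    group in a (reduced) amino-acid alphabet."""
    group_of = {aa: i for i, group in enumerate(alphabet.split(":"))
                for aa in group}
    groups = [group_of[aa] for aa in column if aa in group_of]  # skip gaps
    counts = Counter(groups)
    total = len(groups)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

column_entropy("IIVLL-")  # 0.0: every residue falls in one hydrophobic group
column_entropy("DDDKKK")  # 1.0: an even split between the DE and KR groups
```

Reducing the alphabet before this step is precisely what lets conservative substitutions (for example, isoleucine for leucine) contribute no entropy at all.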
4.2 Strategies for Surface-Residue Sampling
The first problem faced when comparing core-residue entropies to surface-residue en-
tropies is that there tend to be far fewer core residues than surface residues. To handle
this, the surface residues have to be sampled at random. The exact method by which
this is carried out is not trivial, a fact which has led to the inclusion of two different
strategies in EPPIC. Subsections 4.2.1 and 4.2.2 describe these two strategies, and their
main difference is summarised in Table 4.1.
                      Residues Added to Pool?
Strategy    Core   Rim   Surface   Other   Interfaces
Classic      ✗      ✗       ✓        ✗         ✗
Novel        ✓      ✓       ✓        ✗         ✓

Table 4.1: A comparison of the two strategies for surface-residue sampling. In this table, "core" and "rim" residues are core and rim residues of the interface of interest, "surface" residues are residues on the surface of the protein, "other" residues are residues that do not fit in any other category (for example, residues in the protein's hydrophobic core), and "interface" residues are residues on the surface that are involved in other interfaces of a minimum surface area. For each strategy and residue type, a ✓ or ✗ is shown to represent whether that strategy is permitted to add residues of that type to the surface-residue pool.
4.2.1 Classic Strategy
The classic strategy was the first to be implemented, and its algorithm for calculating the
core–surface score is as follows. First, a non-zero core size is checked for: if the interface
has no core residues, no prediction is returned. The average entropy of all core residues
is computed. Then, surface residues are added to a pool as long as they are not involved
in an interface (of at least some minimum surface area). It is ensured that a sufficient
number of surface residues exist in the structure: if there are fewer than 1.5 times as
many non-interface surface residues as core residues, the algorithm does not proceed. A
number of surface residues equal to the number of existing core residues are then drawn
from the pool, without replacement, and this sampling is carried out 10 000 times. The
arithmetic mean of each sample is recorded, providing a “distribution” of 10 000 values.
The mean and standard deviation of this distribution are used to compare the surface
residues to the core residues through the Z-score test statistic.
Equation (4.1) defines the Z-score test statistic: the population mean of the core-residue
entropies minus the sample mean of the surface-residue entropies, all divided by the
sample standard deviation of the non-interface surface-residue entropies. Note that the
sample mean and sample standard deviation here are not the usual single-sample
statistics: they are taken over the distribution of 10 000 sample means, which
characterises the surface-residue pool far more thoroughly than any single sample would.
Z = (µ_core − x̄_surface) / s_surface (4.1)
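The sampling scheme just described can be sketched as follows. This is a simplified illustration rather than EPPIC's Java implementation; the 10 000-sample default and the 1.5× pool-size check follow the description above:

```python
import random
import statistics

def core_surface_z(core_entropies, surface_pool, n_samples=10_000, rng=None):
    """Z-score comparing the mean core entropy to a sampling distribution
    built by repeatedly drawing |core| surface entropies without replacement."""
    rng = rng or random.Random()
    n_core = len(core_entropies)
    if n_core == 0 or len(surface_pool) < 1.5 * n_core:
        return None  # no core, or too few surface residues: no prediction
    sample_means = [statistics.fmean(rng.sample(surface_pool, n_core))
                    for _ in range(n_samples)]
    mu_core = statistics.fmean(core_entropies)
    return (mu_core - statistics.fmean(sample_means)) / statistics.stdev(sample_means)

# A core markedly more conserved (lower entropy) than the surface gives a
# strongly negative score, i.e. a bio-like signal.
surface = [1.1, 0.9, 1.3, 1.0, 0.8, 1.2, 1.05, 0.95]
core_surface_z([0.2, 0.3, 0.25], surface, n_samples=1000, rng=random.Random(7))
```

Drawing each sample without replacement, as `random.sample` does, mirrors the description above: every sample is a plausible alternative "core-sized" patch of the surface.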
This classic strategy for surface-residue sampling and score calculation has a few down-
sides. Firstly, it places a lot of weight on the parameters that define what constitutes a
non-interface surface residue. While it is true that the surface itself will generally contain
other interfaces or binding sites, it would be presumptuous to exclude surface residues
simply because they lie in a region that may or may not be part of a biological interface.
Secondly, since this strategy needs to refer to what is considered an interface-related
residue for all other interfaces of the protein, all these objects need to have persistent
references in the data model. This is not an issue when running an instance of EPPIC
from start to finish, but it quickly becomes problematic when working directly with the
database, which will be the case in the next chapter.
4.2.2 Novel Strategy
The novel strategy is vastly simpler than the classic one, but arguably contradicts the
name of the predictor. Instead of meddling with the surface residues in the pool, the
novel strategy leaves them as they are. In fact, it also considers the core residues and
the rim residues to be part of the surface, and the only residues not part of the pool are
those labelled as “other” residues. Of course, the definition of core residues is that they
become buried upon interface formation, which means that they have to be exposed prior
to interface formation and, hence, must also be on the surface. This is the interpretation
of “surface” for this strategy, which is a superset of the meaning of “surface” in residue
classification.
Other than this major difference, things are mostly as they are with the classic strategy.
A non-zero core size is checked for, the core-residue entropies are averaged, all non-other
residues are added to a pool, and, since there are no longer any removals from the pool,
it is ensured that at least 2.0 times (rather than 1.5) as many surface residues are in
the pool as there are core residues in the interface. The sampling is then performed
as before, and Equation (4.1) is used to calculate a test statistic. The test statistic is now closer to a true Z-score: it quantifies how likely it is that a subset of values drawn from the population—in this case, the core of the interface compared to the entire surface of the protein—has a mean that differs statistically from that of the population as a whole.
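The novel sampling strategy described above can be sketched as follows. This is a simplified illustration in Python, not EPPIC's actual Java implementation; the function name, the number of samples, and the exact handling of the sampling distribution are assumptions, while the pool-size check (2.0 times as many surface residues as core residues) and the score of Equation (4.1) are taken from the text.

```python
import random
import statistics

def core_surface_z(core_entropies, surface_entropies, n_samples=1000, seed=0):
    """Score an interface by comparing its core entropies against
    repeated samples from the surface pool, as in Equation (4.1).

    Hypothetical sketch: each sample draws as many surface entropies
    (without replacement) as there are core residues; the means of
    these samples form the reference distribution.
    """
    k = len(core_entropies)
    if k == 0:
        raise ValueError("interface has no core residues")
    # The novel strategy requires at least 2.0 times as many surface
    # residues in the pool as there are core residues in the interface.
    if len(surface_entropies) < 2 * k:
        raise ValueError("too few surface residues to sample from")
    rng = random.Random(seed)
    mu_core = statistics.mean(core_entropies)
    sample_means = [statistics.mean(rng.sample(surface_entropies, k))
                    for _ in range(n_samples)]
    x_bar = statistics.mean(sample_means)
    s = statistics.stdev(sample_means)
    return (mu_core - x_bar) / s
```

A strongly negative score indicates a core that is more conserved (lower entropy) than the surface as a whole, consistent with a biological interface.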
4.3 Methods
For both of the strategies detailed in Section 4.2, the EPPIC core–surface evolutionary
predictor was run on the BioMany and XtalMany datasets. The burial cutoff used for
core-residue assignment was 80%, and the reduced amino-acid alphabet used was the
6-alphabet proposed by Mirny and Shakhnovich [66]. The more-stringent criteria for
putative homologues, a minimum of thirty, each having at least 60% sequence identity
to the query, were imposed. Computations were carried out using the Merlin 4 HPCC
at the Paul Scherrer Institute.
4.4 Results
Presenting the performance of the classifier using the balanced accuracy as a measure
no longer makes sense, since different scoring systems cannot be compared using the
same call thresholds. For this reason, the AUROC values must be used as a measure of
performance, which ensures that the potential of each classifier is being tested, without
bias, across all thresholds.
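As a sketch of how an AUROC value can be computed without committing to any particular call threshold, the rank-based (Mann–Whitney) identity is convenient. This is a generic illustration, not the evaluation code used in this thesis; the function name and label convention are assumptions.

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney identity: the probability that a
    randomly chosen positive outscores a randomly chosen negative,
    with ties counting one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the AUROC averages over all possible thresholds, it allows scoring systems with different score scales to be compared fairly.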
In Figure 4.1, the ROC curves for each surface-sampling strategy in Section 4.2 are
shown. The integration of each ROC curve has also been calculated, and these AUROC
values are displayed in the figure’s legend. The classic strategy for sampling surface
residues resulted in an AUROC value of 0.9338, performing slightly better than the
novel strategy, which obtained an AUROC value of 0.9265. This is an acceptable result:
the slight decrease in performance, which may well be within the noise of the data,
is more than made up for by the simplicity of the strategy, its reduced computational
complexity, and the possibility of tackling cases where the classic strategy has failed
due to an insufficient number of surface residues (for example, in coiled-coil structures).
For this reason, the next chapter, which will involve working directly with the database
behind EPPIC, will adopt the novel strategy for the sampling of surface residues.
4.5 Conclusion
Due to both the sheer number and the variability in character of the residues on a
protein’s surface, EPPIC’s core–surface evolutionary predictor requires some statistical
[Figure: ROC curves; x-axis: False-Positive Rate, y-axis: True-Positive Rate; legend (Surface-Sampling Method): Classic (AUROC = 0.9338), Novel (AUROC = 0.9265).]

Figure 4.1: The ROC curves for EPPIC's core–surface evolutionary predictor for the two different strategies of surface-residue sampling.
sampling to take place before the sequence entropies for the surface residues can be
compared to those of the core residues. The classic strategy for this was to exclude
regions of the surface that could be involved in binding sites or interfaces before sampling,
whereas the novel strategy takes the entire surface of the protein, even the core and rim
residues in the interface of interest, into account. A comparison of these two strategies
in this chapter showed that the classic one outperforms the novel one by a very narrow
margin of 0.78%, which is an acceptable loss to incur when weighed against the gains:
increased simplicity, reduced computational complexity, and handling structures with
few surface residues. Thus, the novel method is adopted in the following chapter of this
thesis.
Chapter 5
2-Alphabets
5.1 Rationale
In Chapter 2, the parameters in EPPIC were optimised based on a maximisation of
performance for the BioMany and XtalMany datasets. Of EPPIC’s built-in amino-acid
alphabets, shown in Table 2.1, the 6-alphabet proposed by Mirny and Shakhnovich [66]
performed best. In Chapter 3, more reduced amino-acid alphabets, which had been pro-
posed by the scientific community, were tested and compared to randomised 2-alphabets.
The results showed that the 2-alphabet proposed by Murphy et al. [60] performs far
better than random 2-alphabets, but the 2-alphabet proposed by Wang and Wang [59]
performs much worse.
The linear model from Subsection 3.4.3 set out to explain this by deconvolving the five
differentially-grouped amino acids—alanine, glycine, proline, serine, and threonine—
between the two 2-alphabets, and it was found that alanine was the amino acid whose
categorisation had the greatest effect on the classifier’s performance. However, this lin-
ear model failed to take into consideration the interdependence of amino-acid categori-
sations: it is the other amino acids with which a category is shared that characterises an
alphabet, not the individual categorisation of each amino acid. For this reason, this final
chapter of the thesis focusses on generating and benchmarking all possible 2-alphabets
and fitting a model of greater precision to the data.
5.2 Methods
Unfortunately, even given the modified version of EPPIC that allows for custom alpha-
bets, any brute-force approach that tests all conceivable amino-acid alphabets would be
an insurmountable task. Therefore, the first step was to engineer a faster way of evalu-
ating 2-alphabets. Working directly with the database that contains the computations
for all current PDB entries would make the procedure much faster, since EPPIC’s pre-
liminary calculations (finding interfaces, calculating ASAs, searching for homologues,
creating MSAs, etc.), normally repeated for each run, would be bypassed. Using an
application programming interface (API) called the Java Persistence API (JPA), a con-
nection to the relational database management system (RDBMS) is made, a list of all
Protein Data Bank identifiers (PDB IDs) in the BioMany and XtalMany datasets is
passed to the database as a set of queries, and the referenced objects are returned.
For each PDB ID in the datasets, three stages of computation need to be carried out:
firstly, acquiring the aligned homologues from the database, adding them to an MSA
structure, and calculating the sequence entropy (with the selected amino-acid alphabet)
for each column; secondly, appropriately mapping the PDB ID’s associated interface
from the dataset to the database, classifying the residues geometrically by their burial
ratio; and, finally, pooling and sampling the residues to calculate the final score for the
interface. Once these scores are available, a bio or xtal call is obtained by comparing
the score to the call threshold. Everything but the sequence-entropy calculation and
sampling is done only once: the MSAs and residue classifications for both entire datasets,
as well as any other required data, are serialised and saved to file. Further amino-acid
alphabet evaluations begin by reading in this serialised file to save time.
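The sequence-entropy calculation over a reduced alphabet can be sketched as follows. The helper names are hypothetical and this is not EPPIC's Java code; the colon-separated alphabet notation and the Murphy et al. 2-alphabet string are taken from this chapter's results (rank 98 in Table 5.1).

```python
import math
from collections import Counter

# Murphy et al. 2-alphabet in the colon-separated notation used in
# this thesis (as listed in Table 5.1).
MURPHY_2 = "ACFGILMPSTVWY:DEHKNQR"

def group_map(alphabet):
    """Map each one-letter residue code to its group index in a
    colon-separated reduced alphabet."""
    return {aa: i for i, grp in enumerate(alphabet.split(":")) for aa in grp}

def column_entropy(column, alphabet):
    """Shannon entropy (in nats) of one MSA column after translating
    residues into the reduced alphabet; gaps and unknown characters
    are skipped."""
    mapping = group_map(alphabet)
    groups = [mapping[aa] for aa in column.upper() if aa in mapping]
    counts = Counter(groups)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Under a 2-alphabet, a column whose residues all fall in one group has zero entropy, while an even split between the two groups gives the maximum of ln 2.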
Due to the time that had elapsed between this experiment and previous experiments,
some PDB IDs had to be removed from the BioMany and XtalMany datasets. 2ELF,
2FRM, and 2O37 were removed because they had been superseded, while 2PYE, 2VLM,
3QH3, 3RFN, 3RY9, 3VVU, 4FQ2, 4L4V, and 4MAY were removed because one or more of
the chains referred to by the datasets had no UniProt sequence mapping. For all other
PDB IDs, the computations were carried out using the Euler HPCC at ETH Zurich.
5.2.1 Stirling Numbers of the Second Kind
Prior to beginning computations for this chapter, even listing all of the potential amino-
acid alphabets proved difficult. To illustrate the complexity of this, even once the number
of categories for the alphabet is chosen, the question of how many amino acids to add
to each category remains, not to mention determining which particular amino acids to
add.
James Stirling, an eighteenth-century Scottish mathematician, discovered the so-called
“Stirling numbers of the second kind”, which define the number of ways to partition a
set of n objects into k non-empty subsets. The explicit formula for their calculation is
shown in Equation (5.1) [72]. These numbers, as well as the formula to generate them,
are very useful for the purposes of this experiment, since amino-acid alphabets are, in
fact, partitions of a twenty-element set into k non-empty subsets, where 2 ≤ k ≤ 20 [72].
\[
S(n, k) = \left\{ {n \atop k} \right\} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^{n}
\tag{5.1}
\]
Figure 5.1 shows all possible partitions of a set of four elements. The topmost partition,
in red, is not applicable to amino-acid alphabets, because k = 1. The green partitions
correspond to partitions where k = 2 with three elements in one subset and one element
in the other subset. The orange partitions also correspond to partitions where k = 2,
but with two elements in each subset. The six purple partitions are partitions where
k = 3, which can only be distributed by placing two elements in one subset and one
element in each of the other two subsets. Finally, the grey partition at the bottom of
the figure is a partition with k = n = 4, which has only one possible distribution: one
element in each subset.
It is clear that, for increasing n, these partitions become increasingly complex. Using
the formula in Equation (5.1), it can be shown that just the 2-alphabets, which must
Figure 5.1: The fifteen partitions of a four-element set (n = 4), ordered in a Hasse diagram. The red, green, orange, purple, and grey partitions correspond to partitions where k = 1, k = 2, k = 2, k = 3, and k = 4, respectively. This figure was adapted from a similar one by Tilman Piesk [73].
have n = 20 and k = 2, amount to 524 287 possibilities [74]. Extending the scope to all
possibilities produces an incredible 51 724 158 235 372 amino-acid alphabets [74]. Clearly,
it would be infeasible to test all of these, and, for this reason, the scope of this chapter
is restricted to the 2-alphabets.
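The counting argument above can be verified with a short sketch: Equation (5.1) computed directly, plus a bitmask enumeration of the 2-alphabets. The enumeration scheme is an assumption for illustration (the thesis does not describe how EPPIC listed them), and the function names are hypothetical.

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind via the explicit formula in
    Equation (5.1); the alternating sum is an exact multiple of k!."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n
               for j in range(k + 1)) // factorial(k)

def two_alphabets(residues="ACDEFGHIKLMNPQRSTVWY"):
    """Yield every 2-alphabet as a colon-separated string.

    Fixing the first residue in group 0 avoids counting each partition
    twice, and mask 0 (an empty second group) is skipped, giving
    2**(n-1) - 1 partitions, i.e. S(n, 2).
    """
    n = len(residues)
    for mask in range(1, 2 ** (n - 1)):
        g0, g1 = [residues[0]], []
        for i in range(1, n):
            (g1 if (mask >> (i - 1)) & 1 else g0).append(residues[i])
        yield "".join(g0) + ":" + "".join(g1)
```

For twenty amino acids this reproduces the 524 287 2-alphabets quoted above, and summing S(20, k) over all k reproduces the quoted total of 51 724 158 235 372 partitions.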
5.3 Results
After writing the program described in Section 5.2, it became possible to evaluate the performance of one amino-acid alphabet in approximately one second (on one
core). After about one week of compute time, all 2-alphabets had had their balanced
accuracies calculated, and the results were quite surprising. The 2-alphabet proposed
by Murphy et al. [60] fares relatively well against all possible 2-alphabets, achieving a
rank of 98. This 2-alphabet, as well as the one proposed by Wang and Wang [59], which
achieved a rank of 513 180, are shown alongside the ten best and ten worst 2-alphabets
in Table 5.1.
5.4 Linear Model
Much like with the results from Section 3.4, a linear model can be fit to the results
of this chapter. As mentioned in Section 5.1, amino-acid categorisations in amino-acid
alphabets are not independent of one another: it is the other amino acids with which
a category is shared that characterises an alphabet, not the individual categorisation
of each amino acid. Therefore, a more elaborate form of linear model was used in this
section. Instead of looking for the presence or absence of a particular amino acid, the
presence or absence of all possible duos of amino acids were considered. Let a duo of
amino acids be defined as two amino acids, and let the presence of any such duo indicate
that the two component amino acids share the same categorisation in the amino-acid
alphabet. For example, the presence of the ST duo—always written in alphabetical
order—would mean that serine and threonine are grouped together in the alphabet in
question.
A linear model was fit to the data, such that the balanced accuracy for EPPIC’s core–
surface evolutionary predictor could be predicted given only the status of each amino-
acid duo’s presence. Equation (5.2) shows the general structure of the linear model,
where αXY is the coefficient of the duo XY after training and δXY its status: 0 if absent,
and 1 if present. α0 is the bias of the model, referring to the theoretical balanced accuracy
that would be present if all amino acids were segregated into their own categories in the
alphabet. Terms in the model whose coefficients had magnitudes of less than 2 × 10⁻³ were removed. Even after these simplifications, the coefficient of determination, R², was 0.96.
\[
\text{Balanced Accuracy} \approx \sum_{X<Y} \alpha_{XY} \delta_{XY} + \alpha_{0}
\tag{5.2}
\]
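A minimal sketch of the duo featurisation and the prediction rule of Equation (5.2). The helper names are hypothetical, and the example coefficients in the usage note are invented for illustration only (the fitted values are those shown in Figure 5.2).

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def duo_features(alphabet):
    """Indicator delta_XY for each of the C(20, 2) = 190 amino-acid
    duos: 1 if X and Y share a group in the reduced alphabet, else 0.
    Duo keys are written in alphabetical order, as in the text."""
    groups = alphabet.split(":")
    return {x + y: int(any(x in g and y in g for g in groups))
            for x, y in combinations(sorted(AMINO_ACIDS), 2)}

def predict_balanced_accuracy(alphabet, coeffs, bias):
    """Equation (5.2): the bias alpha_0 plus the sum of fitted
    coefficients alpha_XY over all duos present in the alphabet."""
    feats = duo_features(alphabet)
    return bias + sum(a * feats[duo] for duo, a in coeffs.items())
```

For a 2-alphabet with groups of sizes 13 and 7, exactly C(13, 2) + C(7, 2) = 99 of the 190 duo indicators are 1, so the model effectively scores which pairs of amino acids an alphabet chooses to merge.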
After training, the coefficients of this model were examined. Figure 5.2 shows the ten
coefficients of highest magnitude on each side of the spectrum, with the duos whose
presences are beneficial in green and the duos whose presences are detrimental in red. It
makes sense that duos composed of similar amino acids—such as isoleucine and valine,
isoleucine and leucine, or leucine and valine—would be beneficial to group together, given
Rank      2-Alphabet               Balanced Accuracy
1         ACDFGILMPSTVWY:EHKNQR    0.9014
2         ACDFGHILMPSTVWY:EKNQR    0.9001
3         ACDFGHIKLMPQRSTVWY:EN    0.8995
4         ACFGIKLMPQRSTVWY:DEHN    0.8988
5         ACDFGHIKLMPQRSTVY:ENW    0.8986
6         ACDFGILMPSTVY:EHKNQRW    0.8985
7         ACDFGHILMNPSTVWY:EKQR    0.8985
8         ACDFGHILMNPSTVY:EKQRW    0.8982
9         ACDFGILMNPSTVWY:EHKQR    0.8982
10        ACDFGHILMPSTVY:EKNQRW    0.8979
...
98        ACFGILMPSTVWY:DEHKNQR    0.8938
...
513 180   ADEGHKNPQRST:CFILMVWY    0.7277
...
524 278   ACDEFGHKLMNPQRSTVY:IW    0.5680
524 279   ACDEGHKLNPQRSTVWY:FIM    0.5664
524 280   ACDEFGHKLNPQRSTVY:IMW    0.5648
524 281   ADEFGHKLMNPQRSTVY:CIW    0.5645
524 282   ACDEGHKLMNPQRSTVWY:FI    0.5642
524 283   ACDEFGHKLMNPQRSTVWY:I    0.5632
524 284   ADEFGHKLNPQRSTVY:CIMW    0.5630
524 285   ADEFGHKLMNPQRSTVWY:CI    0.5629
524 286   ADEFGHKLNPQRSTVWY:CIM    0.5618
524 287   ACDEFGHKLNPQRSTVWY:IM    0.5605

Table 5.1: The ten best and ten worst 2-alphabets, as well as two 2-alphabets proposed by the scientific community for comparison. The rank of each alphabet and the balanced accuracy (resulting from its use in EPPIC's core–surface predictor) are shown alongside the 2-alphabet itself.
[Figure: bar chart; x-axis: Amino-Acid Duo, y-axis: Value of Coefficient (−0.03 to 0.04); beneficial duos: IV, IL, LM, LV, AS, IM, FL, AV, AG, FY; detrimental duos: AK, QR, AD, DN, EK, KQ, EQ, AE, KR, DE.]

Figure 5.2: The ten best (in green) and ten worst (in red) duos of amino acids. For each of these duos, the presence (for the ones in green) or absence (for the ones in red) of the duo—in other words, having the duo's component amino acids in the same or in different groups in the 2-alphabet, respectively—is beneficial to the performance of EPPIC's core–surface predictor across the BioMany and XtalMany datasets.
their functional similarity. It also makes sense that largely-dissimilar amino acids—such
as alanine and lysine, glutamine and arginine, or glutamic acid and lysine—when present
as a duo, are detrimental to the classifier.
The two most detrimental duos, however, ring discordant with the rest. The two
negatively-charged amino acids, aspartic acid and glutamic acid, as well as the two
positively-charged amino acids, lysine and arginine, somehow negatively influence the
classifier when considered together. This finding is counterintuitive, since one would
assume that both of these duos would rank among the most beneficial, considering the
similarities between their constituent amino acids.
Although anecdotal, this result can indeed be demonstrated for the 2-alphabet proposed by Murphy et al. [60], which serves as a known baseline for applying the finding. The proposed alphabet, unchanged, achieves a balanced
accuracy of 0.8938. Moving aspartic acid or glutamic acid to the other group increases
the result to 0.9014 or 0.8959, respectively. Moving both amino acids to the other group,
however, results in a balanced accuracy of 0.8910, similar to the original value.
5.5 Conclusion
In this chapter, the algorithm behind EPPIC was reimplemented using a direct channel
to communicate with the precomputed database, all 524 287 possible 2-alphabets were
found and benchmarked using the core–surface predictor across the BioMany and Xtal-
Many datasets, and a linear model that considered duos of amino acids being present
in the same group was trained. As expected, it was found that, when similar amino
acids were grouped together, the predictions were improved. By the same token, it was
found that grouping dissimilar amino acids together caused detriment to the predictions.
Surprisingly, the DE and KR duos were the most harmful to the predictions, despite the
undeniable similarity between the members of each of them. Future work should address
the basis for this observation.
Chapter 6
Conclusion
6.1 Summary
All three of EPPIC’s predictors—geometric, core–rim evolutionary, and core–surface
evolutionary—have a large number of variable parameters. While the default values for
these parameters were chosen based on an optimisation carried out just prior to the
original release of EPPIC in 2012 [24], this optimisation was not an exhaustive search
and was performed using a small dataset. Two new datasets, BioMany and XtalMany,
compiled computationally by Baskaran et al. [65], allowed for a more thorough analysis of
the effects of parameter values on EPPIC’s predictions. Through this, it was found that
the geometric predictor performs best with a burial cutoff of 90% and a call threshold of
8; that the core–rim evolutionary predictor performs best with the 6-alphabet proposed
by Mirny and Shakhnovich [66], a burial cutoff of 80%, and a call threshold of 0.90; and
that the core–surface evolutionary predictor performs best with the 6-alphabet proposed
by Mirny and Shakhnovich [66], a burial cutoff of 80%, and a call threshold of −0.90.
Additionally, 2-alphabets are surprisingly prominent in the highest-ranking parameter
sets for the evolutionary predictors, a fact that drove much of the remainder of this
thesis toward further investigation of the importance of reduced amino-acid alphabets
on EPPIC’s predictive performance.
Next, reduced amino-acid alphabets published by the scientific community were tested
by running EPPIC’s core–surface evolutionary predictor on the BioMany and XtalMany
datasets and using the AUROC values as measures of classifier performance. The perfor-
mance of random 2-alphabets was also tested, and an interesting finding was that, while
random 2-alphabets exhibit little variability (0.8600± 0.0119), the 2-alphabet proposed
by Murphy et al. [60] performs much better (0.9536), and the 2-alphabet proposed by
Wang and Wang [59] performs much worse (0.7602). Despite this, the only difference
between the two aforementioned alphabets is the categorisation of five amino acids: ala-
nine, glycine, proline, serine, and threonine. All possible “transitioning” 2-alphabets
between these alphabets were tested: the performance was shown to be more-or-less lin-
early dependent on each of the five amino-acid categorisations, and alanine was identified
as the amino acid whose categorisation was of the most significance.
As a bit of a tangent, the thesis then sought to focus on simplifying the method by
which surface residues were sampled for the scoring component of EPPIC’s core–surface
predictor. Due to both the sheer number and the variability in character of the residues
on a protein’s surface, statistical sampling is required before the sequence entropies for
the surface residues can be compared to those of the core residues. In the classic strategy,
regions of the surface that could be involved in binding sites or interfaces were excluded
before sampling, whereas the novel strategy takes the entire surface of the protein into
consideration, even the core and rim residues in the interface of interest. A comparison of
these two strategies in this chapter shows that the classic one outperforms the novel one
by a very narrow margin of 0.78%, which is an acceptable loss to incur when weighed
against the upsides: increased simplicity, reduced computational complexity, and the
possibility of tackling cases where the classic strategy has failed due to an insufficient
number of surface residues (for example, in coiled-coil structures). Thus, the novel
method was adopted for Chapter 5.
In the final chapter, the algorithm behind EPPIC was reimplemented using a direct chan-
nel of communication with the precomputed database. All 524 287 possible 2-alphabets
were then found and benchmarked using the EPPIC core–surface predictor across the
BioMany and XtalMany datasets, and a linear model that considered duos of amino
acids being present in the same group was trained. Confirming expectations, it was
found that grouping similar amino acids together benefited predictions, while grouping
dissimilar amino acids together was decidedly harmful to the predictions. Surprisingly,
the DE and KR duos were the most detrimental to the predictions, despite the undeniable
similarity between the members of each of them.
Appendix A
Hydrogen-Bond Detection
A.1 Hydrogen Bonding
A hydrogen bond is "an attractive interaction between a hydrogen atom from a molecule or a molecular fragment D-H, in which D is more electronegative than H, and an atom or a group of atoms in the same or a different molecule, in which there is evidence of bond formation" [75]. Typically, a hydrogen bond is denoted as D-H···A-AA, where the three dots represent the bond [75]. The name "hydrogen bond" is a bit misleading:
hydrogen bonds are dipole–dipole interactions more than anything else and should not
be confused with true bonds such as covalent bonds. The heavy atom to which the
hydrogen involved in the bond is connected is called the donor atom and is represented
by D. The heavy atom to which the hydrogen is paired—via the hydrogen bond—is
called the acceptor atom and is represented by A. The parent heavy atoms of the donor
and acceptor are represented by DD and AA, respectively.
Hydrogen bonding is a very important phenomenon in protein structure, for it is because
of these interactions that a subunit of a protein remains folded. The primary structure
of the protein—its amino-acid sequence—dictates the secondary structures that form [1].
An alpha helix’s formation is stabilised by hydrogen bonding between the N-H group of
an amino acid and the O=C group of the amino acid four residues earlier [76]. Similarly,
parallel and antiparallel beta strands are held together to form beta sheets by hydrogen
bonding between the N-H group of an amino acid and the O=C group of another on
the adjacent strand [77]. Protein tertiary structure and quaternary structure are also
impacted by hydrogen bonding: energetic minimisation drives proteins to their most
stable conformation in space, given the secondary structures that first formed, and
different subunits form protein–protein interfaces due to interaction partly caused by
hydrogen bonding across the interface [78].
A.2 Rationale
Since hydrogen bonding is so key to protein structure, hydrogen-bond detection would
be very useful in the prediction of biological interfaces. Proteins, Interfaces, Structures
and Assemblies (PISA) [35] already, albeit indirectly, takes hydrogen bonding into con-
sideration, since their algorithm to predict whether an interface is stable in solution
considers the estimated free energy of the interaction. A lower energy supports the des-
ignation of the interface as biological, but the lower energy is, at least in part, caused
by the stability that hydrogen bonds across the interface impart. The Evolutionary
Protein–Protein Interface Classifier (EPPIC) [24] could also use hydrogen bonding to
help in its geometric prediction. If one can detect hydrogen bonds across an interface, it
would be a good indication that the interface is biological. For this reason, this appendix
documents the creation of a hydrogen-bond detector for EPPIC.
A.3 Definition
The first step was to determine what defines a hydrogen bond, which proved to be particularly challenging, since there is so much variation in the definitions of hydrogen
bonding. Arunan et al., in 2011, provided a list of specific criteria to define what consti-
tutes and characterises a hydrogen bond [75]. The downside, however, was that many
of these rules, such as ones based on attributes unobtainable from the PDB file format,
were not possible to implement in EPPIC. In 1984, Baker and Hubbard documented
hydrogen bonding in secondary structure and showed that the main chain of a polypep-
tide can form hydrogen bonds through the N-H and O=C groups, with the O=C group able
to accept two hydrogen bonds [79]. They also noted that “all polar sidechains, with the
exception of tryptophan, can form multiple hydrogen bonds” [79]. But these definitions
still did not address the required geometry for hydrogen bonding. In 2010, Hubbard and
Haider provided some such information: the H···O=C angle is usually between 130° and 170°, with a clear preference at approximately 150°; the H···O distance is always between 1.6 Å and 2.5 Å; and the N-H···O angle is nearly always between 120° and 180° [80].
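A purely geometric screen based on the three ranges quoted above can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not EPPIC's Java code, and, as Section A.4 notes, these ranges alone are too permissive in practice; the HBPLUS criteria add further rules.

```python
import math

def _angle_deg(a, b, c):
    """Angle a-b-c in degrees for three 3-D points (vertex at b)."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def is_candidate_hbond(d, h, a, aa):
    """Screen a putative D-H...A-AA hydrogen bond with the geometric
    ranges quoted in Section A.3: H...A distance in [1.6, 2.5] Angstrom,
    D-H...A angle in [120, 180] degrees, and the angle at the acceptor
    (H...A-AA) in [130, 170] degrees."""
    if not 1.6 <= math.dist(h, a) <= 2.5:
        return False
    if not 120.0 <= _angle_deg(d, h, a) <= 180.0:
        return False
    return 130.0 <= _angle_deg(h, a, aa) <= 170.0
```

Running such a screen over all donor–acceptor pairs spanning an interface would count candidate bonds from one side to the other, which is the quantity of interest for the geometric predictor.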
A.4 Implementation
Using only the three ranges of geometric values from Section A.3 to detect hydro-
gen bonds resulted in far too many being detected. At this point, another approach
was taken. HBPLUS is a software package developed by McDonald and Thornton
in 1994 [81]. They undertook an empirical approach to hydrogen-bond detection by
looking at protein structures in the PDB and creating particular criteria specific to hy-
drogen bonding in proteins. The rules listed in the documentation for HBPLUS [82]
were added to EPPIC, such that, for each interface calculation, hydrogen bonds from
one side of the interface to the other are identified and counted. These are attributed
to the residues involved in the hydrogen bond; these data can be used in the future in EPPIC's geometric predictor and possibly incorporated into a model that combines them with the core-residue count for the interface (and, eventually, even the insight from core-residue decay presented in Appendix B).
Upon validation, however, it appeared that the EPPIC implementation of the HBPLUS
rules did not achieve the same results. The EPPIC implementation output some hydro-
gen bonds that HBPLUS did not find, but HBPLUS tended to find quite a few more than
EPPIC. At this point, another direction was pursued, and it was decided that HBPLUS
should simply be called by EPPIC. Some of EPPIC was recoded to allow the program
to run HBPLUS for the PDB file. HBPLUS’s output is parsed by EPPIC’s internal
code, and the hydrogen bonds are added to the component residues appropriately.
An issue with this implementation is how to ensure that the user of EPPIC also has
HBPLUS installed. Due to licensing restrictions, HBPLUS cannot be packaged with
EPPIC; thus, HBPLUS was implemented as follows: EPPIC checks for HBPLUS (whose
path must be added to the EPPIC parameter file), and, if present, it is run and its output
is parsed. If HBPLUS is not available for use, the EPPIC implementation of HBPLUS’s
hydrogen-bonding criteria is used.
A.5 Conclusion
This appendix’s purpose was to document the steps taken to implement a hydrogen-
bonding detector into EPPIC’s geometric predictor. In the end, EPPIC tries to run
HBPLUS, but, if HBPLUS is unavailable, an alternative implementation of HBPLUS’s
hydrogen-bonding criteria is used within EPPIC. The results with the non-HBPLUS
implementation do not overlap completely (data not shown). More investigation would
be required to determine the cause of the differences between the EPPIC implementation
and the HBPLUS implementation. EPPIC could then be trained to use the hydrogen-
bond information as a criterion for classification. This would likely be built alongside
the core-size discriminator in the geometric classifier (and, eventually, even the insight
from core-residue decay presented in Appendix B).
Appendix B
Core-Residue Decay
B.1 Theory
While experiments were being carried out for Chapter 2, a brief investigation into the
potential for core-residue decay to act as a criterion for interface classification in EPPIC
was undertaken. The theory is simple and goes as follows. The geometric classifier in
EPPIC only considers the number of core residues when predicting whether an interface
is bio or xtal. While performing the parameter optimisation in Subsection 2.3.4, in-
formation was available for core-residue assignment across a spectrum of burial cutoffs.
Recall that core residues are assigned based on their BSA-to-ASA ratio, the proportion
of solvent-accessible surface area that becomes buried upon interface formation. Record-
ing the number of core residues as a function of the burial cutoff allows for a gradient
to be calculated, which, it was theorised, could be telling of how likely the interface
is to be biologically relevant. For example, it could be that, since biologically-relevant
interfaces are conserved by evolution, they have more highly-buried residues involved
in the interface and fewer modestly-buried residues, whereas a crystal contact may well
exhibit a more linear or stochastic dropoff.
Every interface in the BioMany and XtalMany datasets had its number of core residues
computed for each of the following burial cutoffs: 50%, 55%, 60%, 65%, 70%, 75%,
80%, 85%, 90%, 95%, and 100%. Linear models, with the number of core residues as
a function of the burial cutoff, were then fit, one per datapoint. If the coefficient of
determination, R2, was above 0.96, the slope and y-intercept of the model were noted
down. Let the slopes and y-intercepts of these models be termed the slopes and intercepts
of core-residue decay.
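The fitting procedure above can be sketched as follows. The cutoff grid and the R² > 0.96 acceptance threshold are taken from the text; the function names are hypothetical and the least-squares routine is a generic stand-in for whatever fitting code was actually used.

```python
# Burial cutoffs at which the core size was recorded (Appendix B).
CUTOFFS = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75,
           0.80, 0.85, 0.90, 0.95, 1.00]

def linear_fit(xs, ys):
    """Ordinary least squares for y = m*x + b, returning (m, b, r2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    m = sxy / sxx
    b = my - m * mx
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return m, b, 1.0 - ss_res / ss_tot

def decay_features(core_counts, r2_min=0.96):
    """Slope and intercept of core size versus burial cutoff, kept
    only when the fit clears the R^2 threshold; None otherwise."""
    m, b, r2 = linear_fit(CUTOFFS, core_counts)
    return (m, b) if r2 > r2_min else None
```

Interfaces whose core-size decay is strongly non-linear are thus excluded, which is why the decay histograms in Section B.2 contain fewer datapoints than the core-size histogram.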
B.2 Results
Figure B.1 is a histogram that has been included in this appendix for the sake of com-
parison. It shows the distribution of the number of core residues in an interface for a
burial cutoff of 95%, which is the criterion used (with a call threshold of 6) in EPPIC’s
geometric classifier. Note that the datapoints from the XtalMany dataset, shown in red,
are from a visibly-different distribution than the datapoints from the BioMany dataset,
shown in green. After choosing a criterion, the less overlap there is between these two
distributions, the more powerful the classifier can be.
Figure B.2 is a histogram showing the distribution of decay slopes, that is, the distri-
bution of the slopes from the models with R2 > 0.96. Figure B.3 is a similar histogram
showing the distribution of decay intercepts. Note that Figures B.2 and B.3 do not have
as many datapoints as Figure B.1, because only datapoints whose linear model was fit
with an R2 of at least 0.96 had their slopes and y-intercepts added to the histogram.
Upon visual inspection of Figures B.2 and B.3, it is plain to see that there is more
overlap between bio and xtal distributions than there is in Figure B.1. This likely
means that a simple linear discriminator based on either of these criteria would not
[Figure: histogram; x-axis: Number of Core Residues (0 to 40), y-axis: Frequency; bio in green, xtal in red.]

Figure B.1: The distribution of the number of core residues at a burial cutoff of 95% for biological interfaces from the BioMany dataset, shown in green, and crystal contacts from the XtalMany dataset, shown in red.
[Figure: histogram; x-axis: Slope of Linear Fit (−160 to 0), y-axis: Frequency; bio in green, xtal in red.]

Figure B.2: The distribution of the slopes from the linear models of the core size as a function of the burial cutoff for biological interfaces from the BioMany dataset, shown in green, and crystal contacts from the XtalMany dataset, shown in red. Note that the slopes are all negative because increasing the burial cutoff results in a more stringent criterion for core-residue assignment, which decreases the core size.
be able to distinguish between biological interfaces and crystal contacts in the range of
values where the result is not clear. Indeed, Table B.1 enumerates the non-overlapping
bios (as a fraction of all bios), the non-overlapping xtals (as a fraction of all
xtals), and the overlapping bios and xtals (as a fraction of all datapoints) in each
of the histograms. The simple count of core residues has a lower overlap than the other
criteria and has more balance between non-overlapping categories. Both of these are
properties that would help with classification. Unfortunately, there was insufficient time
available to continue exploring options, but it is very possible that, using an approach
Criterion          bio      xtal     bio + xtal
Core Size          0.7052   0.7156   0.2895
Decay Slopes       0.5698   0.6650   0.3767
Decay Intercepts   0.6199   0.7040   0.3329

Table B.1: For three different criteria (the core size, the decay slope, and the
decay intercept), the non-overlapping bios (as a fraction of all bios), non-overlapping
xtals (as a fraction of all xtals), and overlapping bios and xtals (as a fraction of all
datapoints) are listed.
[Figure B.3 — histogram; x-axis: y-Intercept of Linear Fit, y-axis: Frequency]

Figure B.3: The distribution of the y-intercepts from the linear models of the core
size as a function of the burial cutoff for biological interfaces from the BioMany dataset,
shown in green, and crystal contacts from the XtalMany dataset, shown in red.
rooted in machine learning of higher complexity, a combination of these characteristics
could achieve predictive power greater than that of any individual one.
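The overlap bookkeeping summarized in Table B.1 can be sketched as below. This is one plausible reading of the criterion (bin both distributions, then call a datapoint overlapping when its bin contains datapoints of both classes); the function name, the binning scheme, and the bin width are assumptions, not the thesis's actual procedure.

```python
from collections import Counter

def overlap_fractions(bio, xtal, bin_width=1.0):
    """Return (non-overlapping bios / all bios,
               non-overlapping xtals / all xtals,
               overlapping datapoints / all datapoints)."""
    def to_bin(v):
        return int(v // bin_width)

    bio_bins = Counter(to_bin(v) for v in bio)
    xtal_bins = Counter(to_bin(v) for v in xtal)
    shared = set(bio_bins) & set(xtal_bins)  # bins containing both classes
    bio_only = sum(c for b, c in bio_bins.items() if b not in shared)
    xtal_only = sum(c for b, c in xtal_bins.items() if b not in shared)
    both = len(bio) + len(xtal) - bio_only - xtal_only
    return (bio_only / len(bio),
            xtal_only / len(xtal),
            both / (len(bio) + len(xtal)))
```

For instance, bio values [8, 9, 10, 11] against xtal values [1, 2, 3, 9] share only the bin at 9, giving non-overlapping fractions of 0.75 for each class and an overlapping fraction of 0.25.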
B.3 Conclusion
Although the histograms of the decay slopes and intercepts do not show clear linear
separability between the groups, the results remain inconclusive. Further experimen-
tation, perhaps with a larger set of characteristics and with better classifiers,
may yet improve the geometric classifier in EPPIC.
List of Figures
1.1 Composition of RNA and DNA . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Types of Protein–Protein Interactions . . . . . . . . . . . . . . . . . . . . 3
1.3 The Phase Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Crystallographic Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 The Interface-Classification Problem . . . . . . . . . . . . . . . . . . . . . 8
2.1 Effects of Reduced Amino-Acid Alphabets . . . . . . . . . . . . . . . . . . 25
3.1 Published Reduced Amino-Acid Alphabets . . . . . . . . . . . . . . . . . 29
3.2 ROC Curves for Published Alphabets . . . . . . . . . . . . . . . . . . . . 31
3.3 Random Two-Letter Amino-Acid Alphabets . . . . . . . . . . . . . . . . . 33
3.4 ROC Curves for Random 2-Alphabets . . . . . . . . . . . . . . . . . . . . 34
3.5 Transitioning Two-Letter Amino-Acid Alphabets . . . . . . . . . . . . . . 36
3.6 ROC Curves for Transitioning 2-Alphabets . . . . . . . . . . . . . . . . . 37
3.7 Linear Model for Transitioning 2-Alphabets . . . . . . . . . . . . . . . . . 39
4.1 ROC Curves for Surface-Residue Sampling . . . . . . . . . . . . . . . . . . 45
5.1 Hasse Diagram of Set Partitioning . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Ten Best and Ten Worst Amino-Acid Duos . . . . . . . . . . . . . . . . . 52
B.1 Distribution of Core Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
B.2 Distribution of Decay Slopes . . . . . . . . . . . . . . . . . . . . . . . . . 63
B.3 Distribution of Decay Intercepts . . . . . . . . . . . . . . . . . . . . . . . 64
List of Tables
2.1 The Seven Amino-Acid Alphabets of EPPIC . . . . . . . . . . . . . . . . 17
2.2 The Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Top Parameters for Geometric Classifier . . . . . . . . . . . . . . . . . . . 21
2.4 Top Low-Stringency Parameters for Core–Rim Evolutionary Classifier . . 22
2.5 Top High-Stringency Parameters for Core–Rim Evolutionary Classifier . . 22
2.6 Top Low-Stringency Parameters for Core–Surface Evolutionary Classifier 23
2.7 Top High-Stringency Parameters for Core–Surface Evolutionary Classifier 24
3.1 The Sixteen Published Amino-Acid Alphabets . . . . . . . . . . . . . . . . 28
3.2 AUROC Values for Published Alphabets . . . . . . . . . . . . . . . . . . . 31
3.3 The Ten Random Two-Letter Amino-Acid Alphabets . . . . . . . . . . . . 32
3.4 AUROC Values for Random 2-Alphabets . . . . . . . . . . . . . . . . . . 34
3.5 The Thirty-Two Transitioning Two-Letter Amino-Acid Alphabets . . . . 35
3.6 AUROC Values for Transitioning 2-Alphabets . . . . . . . . . . . . . . . . 37
4.1 Comparison of Surface-Sampling Strategies . . . . . . . . . . . . . . . . . 42
5.1 Ten Best and Ten Worst 2-Alphabets . . . . . . . . . . . . . . . . . . . . . 51
B.1 Separability by Geometric Criterion . . . . . . . . . . . . . . . . . . . . . 63
Acknowledgements
First and foremost, I would like to thank Guido, for being such a caring and understand-
ing supervisor, and Jose, for always helpfully answering any semblance of a question I
had for him. Thanks also to Spencer, to Kumaran, to Aleix, and to all my other LBR
colleagues, for making my stay at PSI an enjoyable time. Next, I would like to extend
my thanks to Amedeo, for agreeing to be my internal supervisor; to my father, mother,
and brother, for their unwavering support; and to Deena, for her constant words of en-
couragement and help with editing. Finally, I would like to thank Martin, Petra, Dania,
and Aral, for always being around for those much-needed beers. Merci à tous.