Research Collection
Master Thesis
Exploring the role of amino-acid alphabets in protein-interface classification
Author(s): Somody, Joseph C.
Publication Date: 2015
Permanent Link: https://doi.org/10.3929/ethz-a-010415857
Rights / License: In Copyright - Non-Commercial Use Permitted
This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.
ETH Library
Master’s Thesis
Exploring the Role of Amino-Acid Alphabets in Protein-Interface Classification
Author:
Joseph C. Somody
Supervisors:
Prof. Dr. Amedeo Caflisch [Internal]
Dr. Guido Capitani [Lead, External]
Dr. Jose M. Duarte [External]
A thesis submitted in partial fulfilment of the requirements
for the degree of Master of Science
in
Computational Biology & Bioinformatics
Department of Computer Science
ETH Zurich
with research at
Protein Crystallography and Structural Bioinformatics
Laboratory of Biomolecular Research
Paul Scherrer Institute
http://dx.doi.org/10.3929/ethz-a-010415857
1 April 2015
Declaration of Originality
I, Joseph C. Somody, hereby confirm that I am the sole author of the written work here
enclosed, entitled “Exploring the Role of Amino-Acid Alphabets in Protein-
Interface Classification”, and that I have compiled it in my own words. Parts
excepted are corrections of form and content by the supervisors, Prof. Dr. Amedeo
Caflisch, Dr. Guido Capitani, and Dr. Jose M. Duarte.
With my signature, I confirm that:
- I have committed none of the forms of plagiarism described on the information sheet entitled Citation Etiquette, which are as follows:
  - Using the exact words or ideas from another author’s intellectual property (text, ideas, structure, etc.) without clear citation of the source.
  - Using text from the Internet without citing the uniform resource locator (URL) and the date it was accessed.
  - Reusing one’s own written texts, from different course papers or performance assessments, or parts thereof, without explicitly identifying them as such.
  - Translating and using a foreign-language text without citing its source.
  - Submitting work that has been written by another under one’s own name (“ghostwriting”).
  - Using an extract of another author’s work, paraphrasing it, and citing the source elsewhere than in the context of that extract.
- I have documented all methods, data, and processes truthfully.
- I have not manipulated any data.
- I have mentioned all persons who were significant facilitators of the work.
I am aware that this work may be screened electronically for plagiarism.
Place and Date: Zurich, Switzerland, 1 April 2015
Signature:
Abstract
Exploring the Role of Amino-Acid Alphabets
in Protein-Interface Classification
Joseph C. Somody
It is no trivial task to determine whether a protein–protein interface in a crystal structure
is biologically relevant. A large alignment of homologous sequences can be constructed,
and the evidence provided by evolution can be extracted. Residues involved in biological
interfaces tend to demonstrate less variability than other residues on the surface of the
protein, but how the variability is calculated has an interesting impact on how well the
interfaces can be classified. This thesis begins by optimising the parameters for the
algorithm described above, then continues on to an investigation of reduced alphabets
for the amino acids. Results show that grouping the amino acids into a mere two families
plays a beneficial role in discriminating between biologically-relevant interfaces and the
artefacts of crystal packing; however, grouping similar amino acids together does not
necessarily have a positive effect: placing aspartic acid and glutamic acid in the same
group of the reduced amino-acid alphabet, for example, has a decidedly-harmful impact
on the classification accuracy. Two tangential appendices are attached to the thesis to
document research forays that, although unfruitful, have the potential to succeed given
further investment.
Table of Contents

Declaration of Originality
Abstract
Table of Contents
Abbreviations

1 Introduction
  1.1 Basic Molecular Biology and Biochemistry
  1.2 X-Ray Crystallography
    1.2.1 Protein Crystallisation
    1.2.2 Data Collection
    1.2.3 Structural Modelling, Model Building, and Model Refinement
  1.3 Interface-Classification Problem
    1.3.1 Crystallographic Symmetry
    1.3.2 The Problem
    1.3.3 Early Approaches
    1.3.4 CRK
  1.4 EPPIC
    1.4.1 The Fall of CRK
    1.4.2 The Rise of EPPIC
  1.5 Amino-Acid Alphabets
    1.5.1 Background
    1.5.2 Relevance

2 Parameter Optimisation
  2.1 Parameters in EPPIC
  2.2 Optimisation
    2.2.1 BioMany and XtalMany
    2.2.2 Computation
  2.3 Results
    2.3.1 Treatment of nopred Calls
    2.3.2 Prefiltering Results
    2.3.3 Measures of Classifier Performance
    2.3.4 Optimised Parameters
  2.4 Conclusion

3 Reduced Amino-Acid Alphabets
  3.1 Rationale
  3.2 Published Alphabets
    3.2.1 Methods
    3.2.2 Results
    3.2.3 Receiver Operating Characteristic
    3.2.4 Further Results
  3.3 Random 2-Alphabets
    3.3.1 Methods
    3.3.2 Results
  3.4 Transitioning 2-Alphabets
    3.4.1 Methods
    3.4.2 Results
    3.4.3 Linear Model
  3.5 Conclusion

4 Surface-Residue Sampling
  4.1 Rationale
  4.2 Strategies for Surface-Residue Sampling
    4.2.1 Classic Strategy
    4.2.2 Novel Strategy
  4.3 Methods
  4.4 Results
  4.5 Conclusion

5 2-Alphabets
  5.1 Rationale
  5.2 Methods
    5.2.1 Stirling Numbers of the Second Kind
  5.3 Results
  5.4 Linear Model
  5.5 Conclusion

6 Conclusion
  6.1 Summary

A Hydrogen-Bond Detection
  A.1 Hydrogen Bonding
  A.2 Rationale
  A.3 Definition
  A.4 Implementation
  A.5 Conclusion

B Core-Residue Decay
  B.1 Theory
  B.2 Results
  B.3 Conclusion

List of Figures
List of Tables
Acknowledgements
References
Abbreviations

ETH      Eidgenössische Technische Hochschule
URL      Uniform Resource Locator
DNA      Deoxyribonucleic Acid
RNA      Ribonucleic Acid
A        Adenine
C        Cytosine
G        Guanine
T        Thymine
U        Uracil
PPI      Protein–Protein Interaction
PDB      Protein Data Bank
AU       Asymmetric Unit
PISA     Proteins, Interfaces, Structures and Assemblies
CRK      Core–Rim Ka/Ks Ratio
MSA      Multiple Sequence Alignment
ASA      Accessible Surface Area
BSA      Buried Surface Area
EPPIC    Evolutionary Protein–Protein Interface Classifier
DC       Duarte–Capitani
NMR      Nuclear Magnetic Resonance
HPCC     High-Performance Computing Cluster
TP       True Positive
FP       False Positive
TN       True Negative
FN       False Negative
MCC      Matthews Correlation Coefficient
ROC      Receiver Operating Characteristic
AUROC    Area Under the ROC Curve
API      Application Programming Interface
JPA      Java Persistence API
RDBMS    Relational Database Management System
PDB ID   PDB Identifier
LBR      Laboratory of Biomolecular Research
PSI      Paul Scherrer Institute
Chapter 1
Introduction
This master’s thesis is mainly an investigation into using reduced amino-acid alphabets
to calculate conservation in alignments of homologous amino-acid sequences. A direct
consequence of applying computational methods to biological questions is that following
this work requires familiarity with both basic biology and basic computer science. In order
to give all readers an equal footing, this introductory chapter briefly describes the
required background knowledge.
1.1 Basic Molecular Biology and Biochemistry
As with any scientific report with a focus on molecular biology, this master’s thesis must
necessarily start with an overview of the central dogma of molecular biology. Deoxyri-
bonucleic acid (DNA) is the genetic material that forms the inheritable basis for life.
DNA can be replicated, an essential part of cell division, or transcribed into ribonu-
cleic acid (RNA), the first step of gene expression. DNA and RNA are both chains
of nucleotides, each nucleotide being one of adenine (A), cytosine (C), guanine (G),
and thymine (T)—or uracil (U) for RNA—attached to a sugar–phosphate backbone,
as shown in Figure 1.1. Once RNA has been transcribed from DNA, nascent RNA is
processed by the cellular machinery, thereby regulating its transcription or splice pat-
tern and forming mature RNA. The ribosome, itself composed of protein and RNA,
is responsible for translating the RNA into the protein for which it codes, which then
folds into an energetically-stable conformation as it comes off the ribosomal complex.
A protein is a chain of amino acids, each being one of twenty possibilities that possess
different properties. It is their sequence that determines the structure of the protein
and, thus, the protein’s function [1].
Figure 1.1: The composition of RNA and DNA. RNA is a single-stranded chain of
nucleotides, each being one of adenine (A), cytosine (C), guanine (G), and uracil (U),
attached to a sugar–phosphate backbone. DNA is of similar composition, but it is
double-stranded and contains thymine (T) instead of uracil (U). This figure was
adapted from graphics by Sponk and R. Mattern [2].
As stated by Hartwell et al., “a discrete biological function can only rarely be attributed
to an individual molecule” [3]. Biological functions, which can be anything from signal
transduction to cell-cycle regulation to enzymatic digestion, often require the formation
of protein complexes [4, 5]. These protein–protein interactions (PPIs) can be between
subunits of the same type, forming homooligomers, or of different types, forming het-
erooligomers. The interactions in homooligomers can be further subdivided into isolo-
gous and heterologous [6]. The subunits form an isologous association when they both
contribute the same set of amino-acid residues to the interface, whereas they form a het-
erologous association when they contribute a different subset of amino-acid residues to
the interface. Schematic examples of such oligomers can be seen in Figure 1.2. Further-
more, isologous interfaces have identical interacting surfaces, resulting in a necessarily
dimeric and closed structure with a twofold axis of symmetry, while heterologous inter-
faces have non-identical interacting surfaces, which must be complementary but are not
usually symmetric [7]. Heterologous interfaces can, but need not, give rise to infinite
assemblies like the one shown in Figure 1.2(c). In the case that they do not form infinite
assemblies, heterologous interfaces must display circular symmetry.
Figure 1.2: Different types of protein–protein interactions. Figure 1.2(a) shows a
heterooligomeric interaction between two different protomers, A and B. Figure 1.2(b)
shows two identical protomers coming together to form an isologous homooligomeric
interaction, the black dot denoting an axis of rotational symmetry, while Figure 1.2(c)
shows three identical protomers assembling to form a heterologous homooligomeric
interaction, which can, in this case, but not in general, be extended ad infinitum.
The amino-acid residues that point into the interface between the proteins are the in-
teracting residues and are often required to maintain the PPI. Typically, there are also
residues critical for the PPI that do not interact across the interface: internal support
residues that stabilise the main-chain scaffold of the interface [8].
1.2 X-Ray Crystallography
1.2.1 Protein Crystallisation
The first step in protein X-ray crystallography is to crystallise the protein of interest,
beginning with its purification. The protein is typically obtained by recombinant meth-
ods: the gene for the protein is cloned into an expression vector and transferred into a
production organism, such as Escherichia coli, in order to express the protein in large
quantities. The protein is then purified out of the culture, generally by means of affinity
chromatography targeting a protein tag whose coding sequence had been inserted at
one end of the gene of interest.
Once the protein has been purified, it must be crystallised. To crystallise, protein
molecules must fall out from solution and assemble themselves into a crystal lattice,
typically done by adding a precipitant to a concentrated solution of protein [9]. The
formation of protein crystals is a finicky procedure with many variables, and, as it is not
determinable in advance which solutions and conditions will yield protein crystals, let
alone well-diffracting ones, many trials with varying characteristics must be carried out.
In a popular method known as hanging-drop vapour diffusion, the reservoirs of microtitre
plates are filled with solutions with various acidities, temperatures, ionic strengths,
buffers, precipitants, etc. A small amount of the protein solution is mixed with an equal
amount of the reservoir solution and transferred on to a glass slide, which is then placed
atop the reservoir, thereby sealing it, with the drop facing the interior of the reservoir [9].
Since the drop has a lower precipitant concentration than the reservoir solution, water
evaporates from the drop, leading to a slow increase in protein concentration. Over
time, the drop becomes supersaturated with protein, and a nucleation event may lead
to the formation of protein crystals.
This screening process suffers from the curse of dimensionality: crystallisation is affected
by many variables, which makes sampling the multidimensional solution space an oner-
ous task, especially since the amount of protein available is generally limited. It is for
this reason that sparse-matrix sampling was proposed by Jancarik and Kim in 1991 [10],
which involves strategically screening crystallisation conditions that have been success-
ful in the past and using non-repeating combinations of reagents [9]. This method has
become a very popular technique in protein crystallisation and is now widely applied in
commercial crystallisation kits [11].
1.2.2 Data Collection
After obtaining a satisfactory crystal of the protein of interest, the crystal is most
commonly flash-frozen in liquid nitrogen and mounted to a goniometer head. The crystal,
kept at 100 K by a stream of gaseous nitrogen, is then exposed to an incident beam of
monochromatic X-ray radiation while slowly being rotated about a spindle axis. The
X-rays are diffracted, and diffraction patterns are captured by a detector sensitive to
X-ray photons [9].
1.2.3 Structural Modelling, Model Building, and Model Refinement
There is an inherent difficulty in X-ray crystallography: the phase problem. The infor-
mation regarding the phase of any given spot, known as a reflection, on the diffraction
pattern is lost when captured as an image. This is problematic, because the phases of the
reflections are more crucial to structural modelling than their intensities [12]. One of the
most common solutions to this problem is molecular replacement, where the phases from
a similar, solved structure are grafted on to the intensities of the crystal of interest after
computationally finding the correct orientation and position in the target asymmetric
unit (defined in Subsection 1.3.1) [13]. In Figure 1.3, a schematic example (with artistic
renderings of diffraction patterns, not the real ones) of molecular replacement as a solu-
tion to the phase problem is illustrated: the cat in Figure 1.3(a) represents the crystal
structure to be solved, whose diffraction pattern is shown in Figure 1.3(b), but only
the phaseless (shown as colourless in this analogy) diffraction pattern in Figure 1.3(c)
is available to crystallographers. The phaseless intensities from the reflections in Fig-
ure 1.3(c) are used, but the phases from a similar structure (not shown), appropriately
positioned in the asymmetric unit, are applied on to the intensities. The combination,
shown in Figure 1.3(d), yields the structure in Figure 1.3(e), an approximate solution.
The recorded series of two-dimensional diffraction patterns, with intensities and grafted
phases, must then be converted into a three-dimensional model of the electron density.
The Fourier transform is an operation that decomposes a periodic function into the
frequencies that constitute it. The diffraction pattern of a crystal is composed of peaks
as a result of the crystal’s periodicity, so the inverse of the Fourier transform can be
used to recompose the electron density from the observed pattern [15]. A biochemical
model of the protein can then be built into the electron density and iteratively refined
until the calculated diffraction data from the model agree as much as possible with the
experimental diffraction data. The final structure is then formatted appropriately and
deposited into a database, such as the Protein Data Bank (PDB) [16].
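The recomposition step can be illustrated in one dimension: the inverse discrete Fourier transform turns complex structure factors (amplitudes combined with phases) back into a periodic density. The NumPy sketch below is purely illustrative, using a toy one-dimensional signal rather than real crystallographic data; it demonstrates why both amplitudes and phases are needed.

```python
import numpy as np

# A toy one-dimensional "electron density": a periodic signal.
x = np.linspace(0.0, 1.0, 64, endpoint=False)
density = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * x) + 0.2 * np.cos(2 * np.pi * 7 * x)

# The diffraction experiment records intensities (~|F|^2), so the
# amplitudes survive, but the phases are lost.
structure_factors = np.fft.fft(density)
amplitudes = np.abs(structure_factors)
phases = np.angle(structure_factors)  # unavailable in a real experiment

# Molecular replacement supplies estimated phases; here we "borrow" the
# true ones to show that amplitudes + phases recompose the density.
recombined = amplitudes * np.exp(1j * phases)
recovered = np.fft.ifft(recombined).real

assert np.allclose(recovered, density)
```

With correct phases the density is recovered exactly; with phases grafted from a merely similar structure, the recovered density is only approximate, which is why iterative refinement follows.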
1.3 Interface-Classification Problem
1.3.1 Crystallographic Symmetry
When a protein structure is determined by X-ray crystallography, its space group must
be identified. The space group describes the symmetry of the crystal: it provides suf-
ficient information to mathematically generate a unit cell from the asymmetric unit
(AU) [17]. More specifically, the AU is the smallest unit of the crystal that contains
sufficient information to produce the unit cell through the symmetry operations of the
crystal space group, while the unit cell is the smallest unit of the crystal that contains
sufficient information to produce the (infinite) crystal lattice by translation only [18]. A
schematic example of crystallographic symmetry is shown in Figure 1.4, where the AU
is any one duck, of either colour, and the unit cell is shown in green overlay.
1.3.2 The Problem
The crystal structure of a protein under investigation contains non-biological interac-
tions, a direct consequence of the nature of crystals. (An exception to this is seen in the
case of natural crystals like polyhedrins in insect viruses [19, 20] or like the Cry3A toxin
found naturally crystallised in Bacillus thuringiensis [21], where all interfaces are biolog-
ically relevant since the protein’s crystal form evolved to perform a particular biological
Figure 1.3: An illustration of molecular replacement as an approach to solving the
phase problem with artistic renderings of diffraction patterns, not the real Fourier
transforms. The cat in Figure 1.3(a), having the diffraction pattern in Figure 1.3(b),
where the colours represent phases, needs to be generated from the phaseless (colourless)
diffraction pattern in Figure 1.3(c). Phases from a similar structure are appropriately
aligned and grafted on to Figure 1.3(c), resulting in Figure 1.3(d). Figure 1.3(d) yields
the approximate solution shown in Figure 1.3(e). This figure was inspired by a similar
one by Kevin Cowtan [14].
Figure 1.4: An anatine example of crystallographic symmetry with space group P21,
shown schematically down the b axis. The orange symbols denote twofold screw axes,
while the colours, pink and blue, denote differential depth along b. The AU is one duck,
of either colour, and the unit cell is shown in green overlay. This figure was inspired by
a similar one by David Blow [17].
function.) The matter that constitutes each AU in a crystal interacts not only within
the AU but also between adjacent AUs. The AU is defined without reference to the bi-
ological context of the complex, meaning the complexes in each AU are not necessarily
the true biological complexes [22]. A schematic example is shown in Figure 1.5, where
the AU is clearly any one subunit (of any colour), but the definition of the biological
unit is debatable without any complementary information.
The crystal structure of a protein homooligomeric in solution is not restricted to sym-
metry operations based on the biological oligomeric interactions between amino-acid
chains. For example, a protomer that forms a homodimer in vivo does not necessarily
only have that one isologous interaction in its crystalline form. As a result, when crystal-
lographers investigate the crystal lattice, it is not usually obvious which contacts are due
to biological interactions and which contacts are due simply to crystal packing. In the
past, especially with simple structures, biological interactions have been discriminated
from crystal contacts by mere manual inspection, an endeavour that has become more
and more challenging as the complexity of the proteins being researched has risen over
time [24]. This difficulty has become further exacerbated by a lack of classical biochemi-
cal background on new proteins: it is quite rare to see full biochemical characterisations
for the majority of today’s crystallographic targets [24].
Figure 1.5: A two-dimensional schematic illustration of the interface-classification
problem. Given Figure 1.5(d) as a crystal lattice, any of Figure 1.5(a), Figure 1.5(b), or
Figure 1.5(c) could be equivalently inferred as the biologically-relevant protein complex.
The AU is one subunit (outlined shape) of any colour. Without further information, it
is not clear which arrangement contains the true biological unit, since protein–protein
interfaces in the lattice are often artefacts of crystal formation. This figure was
inspired by a similar one by Levy and Teichmann [23], which, in turn, was inspired by
“Fish (№ 20)” by M. C. Escher.
1.3.3 Early Approaches
Over the past twenty years, attempts to solve the above problem have varied widely
in their approaches. In the 1970s, Janin started investigating protein–protein interac-
tions [25, 26] and, in 1995, began researching crystal contacts [27]. Among the first was
Janin’s method of discriminating between subunit contacts and crystal artefacts based
solely on the area of the interface [28]. A few years later, Valdar and Thornton trained
neural networks with both the area of the interface and the conservation scores of the
amino-acid residues involved, which resulted in a highly-accurate predictor [29]. A couple
more methods arose in the early 2000s, which focussed on analysing residue conservation
in order to tell biological and non-biological interfaces apart: Elcock and McCammon
determined that comparing the conservation of amino-acid residues involved in the in-
terface with the conservation of residues elsewhere on the surface of the protein was an
effective discriminator [30], and Guharoy and Chakrabarti found that subdividing the in-
terface into a “core” region and a “rim” region and comparing the residue conservation
between the two regions worked even better, since binding sites and other conserved
patches on the rest of the protein surface led to anomalous results [31]. Researchers
thereafter began using methods in machine learning in order to combine various param-
eters describing the interfaces of interest and extract statistical power from a trained
classifier [32–34]. Of particular note is NOXclass, a support vector machine–based clas-
sifier that distinguishes between three classes of interactions by analysing six interface
properties, including interface area and amino-acid composition [33]. These classes con-
sist of obligate interactions, where protomers are not independently found as stable
structures in vivo; non-obligate interactions, where protomers are; and crystal-packing
interactions [33].
In 2007, Krissinel and Henrick developed a thermodynamic method for estimating inter-
face stability based on the binding energy of the interface and the entropy change due
to complex formation [35]. This method is implemented in Proteins, Interfaces, Struc-
tures and Assemblies (PISA) [35], which has become the de facto standard for interface
classification in the community. In 2008, Xu et al. performed an analysis of interface
similarity in crystals of homologous proteins to examine whether having interfaces in
common across different crystal forms can be used to calculate the likelihood that an
interface is biological [36]. Three years later, Xu and Dunbrack released ProtCID [37],
a database of homologous interfaces—both homodimeric and heterodimeric—observed
in multiple crystal forms, grouped by Pfam domain [38] and crystal form.
1.3.4 CRK
The core–rim Ka/Ks ratio (CRK) is a measure developed by Schärer et al. in 2010
to distinguish biological interfaces from non-specific ones [39]. It is based on an idea
originally proposed by Bahadur et al. in a 2004 paper [40], which was to designate each
residue involved in the interface as either a core residue or a rim residue, depending
on the extent to which atoms in the residue are buried upon interface formation—core
residues contain at least one fully-buried atom, while rim residues, which surround core
residues, have only partially-buried atoms.
In 2001, Elcock and McCammon published a method based on comparing the informa-
tion entropy at various positions, each designated as either “interface” or “non-interface”,
in multiple sequence alignments (MSAs) [30]. It was found that the average sequence
entropy for interface residues is lower than that of non-interface residues. By restricting
the focus and looking only at interface residues, Guharoy and Chakrabarti refined this
method in 2005 and showed that the Bahadur et al. classifications—core residues and
rim residues—provided a better basis for comparisons of sequence entropy [31].
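The sequence entropy referred to above is, in essence, the Shannon entropy of each alignment column. A minimal sketch follows; the choice of base-2 logarithm and the gap handling are illustrative assumptions, not necessarily the conventions of the cited methods.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of one MSA column, ignoring gap characters."""
    residues = [aa for aa in column if aa != "-"]
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy alignment of three homologous sequences, four columns long.
msa = ["ACDE",
       "ACDQ",
       "ACNE"]
entropies = [column_entropy(col) for col in zip(*msa)]
# Perfectly conserved columns score 0; variable columns score higher,
# so conserved interface positions show lower average entropy.
```

A fully conserved column has entropy 0, while a column split two-to-one between residues scores about 0.92 bits; averaging these values over interface versus non-interface positions yields the comparison described above.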
The CRK algorithm expands on these methods: given a PDB structure, the sequence
of each chain is used as a BLASTP [41, 42] query to search for homologous sequences
in the UniProt database [43], which are then filtered so as to have homology above a
threshold, appropriate taxonomy, and low pairwise redundancy. The protein sequences
for these homologues (as well as for the query) are then aligned with one another using
T-Coffee [44], the resulting protein MSA is backtranslated into an MSA of coding se-
quences using RevTrans [45], and the selection pressure at each site is calculated with
Selecton [46, 47].
The accessible surface area (ASA) of an amino-acid residue is the size of the region
accessible to a solvent molecule, usually water [48]. This is generally calculated using
the Shrake–Rupley algorithm [49], where a probe sphere is “rolled” along the surface
of the region while checking for steric clashes that would imply a particular point on
the surface is not accessible. The buried surface area (BSA) is similar but measures
how accessible the residue is after interface formation [50]. The geometric descriptions
of the interfaces for the input PDB structure are retrieved from PISA, and the residues
are divided into a rim subset and a core subset, depending on what fraction of their
accessible surface area becomes inaccessible after interface formation. This fraction is
known as the burial, and the default burial cutoff for core residues in CRK is 95%.
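The burial-based split can be sketched in a few lines. The data layout below (residue identifiers mapped to ASA/BSA pairs) is a hypothetical convenience, not the actual CRK data structures; only the burial definition and the 95% cutoff come from the text.

```python
def classify_interface_residues(residues, core_burial_cutoff=0.95):
    """Split interface residues into core and rim by burial fraction.

    `residues` maps residue IDs to (asa, bsa) pairs, where burial = bsa / asa
    is the fraction of accessible surface lost on interface formation.
    Hypothetical data layout, not the actual CRK/EPPIC representation.
    """
    core, rim = [], []
    for res_id, (asa, bsa) in residues.items():
        if asa == 0:
            continue  # residues with no accessible surface are not considered
        burial = bsa / asa
        if burial > 0:  # only residues losing some ASA are at the interface
            (core if burial >= core_burial_cutoff else rim).append(res_id)
    return core, rim

# Toy example: residue -> (ASA in isolation, surface area buried on binding).
interface = {"A12": (120.0, 118.0),   # burial ~0.98 -> core
             "A13": (80.0, 30.0),     # burial ~0.38 -> rim
             "A14": (95.0, 0.0)}      # untouched by the interface
core, rim = classify_interface_residues(interface)
```

Here residue A12 is almost entirely buried and lands in the core, A13 is only partially buried and lands in the rim, and A14 is not part of the interface at all.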
The final steps in the CRK algorithm are calculating the ratio ω = Ka/Ks for each
residue, separating the ω values by class (either core or rim), and calculating the statistics
(both the unweighted mean and the mean weighted by the BSA) for both residue classes.
The ratio can then be used to determine whether the interface is biologically relevant:
a biological interface is expected to have more purifying selection for core residues than
rim residues, which would yield a ratio CRK =〈ω〉core〈ω〉rim
< 1.
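Put together, these final steps amount to a small computation. The sketch below assumes per-residue ω values and BSA weights are already available as (ω, BSA) pairs; this input format is an assumption for illustration, not the actual CRK implementation.

```python
def crk_score(core, rim):
    """CRK = <omega>_core / <omega>_rim from per-residue (omega, bsa) pairs.

    Returns both the ratio of unweighted means and the ratio of BSA-weighted
    means, mirroring the two statistics described in the text. The input
    layout is hypothetical, not the actual CRK code.
    """
    def mean(values):
        return sum(values) / len(values)

    def weighted_mean(pairs):
        total_weight = sum(bsa for _, bsa in pairs)
        return sum(omega * bsa for omega, bsa in pairs) / total_weight

    unweighted = mean([w for w, _ in core]) / mean([w for w, _ in rim])
    weighted = weighted_mean(core) / weighted_mean(rim)
    return unweighted, weighted

# Toy values: strong purifying selection (low omega) in the core.
core = [(0.1, 40.0), (0.2, 30.0)]   # (omega, BSA) per core residue
rim = [(0.6, 10.0), (0.8, 12.0)]
unweighted, weighted = crk_score(core, rim)
# Both ratios fall below 1, as expected for a biological interface.
```

For these toy numbers, both statistics come out well below 1, consistent with a biological interface; a crystal-packing contact would show no such asymmetry in selection pressure.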
1.4 EPPIC
1.4.1 The Fall of CRK
CRK, as a proof-of-concept work, was successful but had its limitations. Firstly, the
CRK pipeline was complicated: finding protein homologues, aligning them, backtrans-
lating the alignment, calculating the selection pressure, classifying residues, calculat-
ing ratios, etc. This meant not only that any large-scale surveys of protein structures
would be computationally infeasible but also that there were many welds in the pipeline
where something could go wrong. In fact, mapping protein sequences back to their
coding sequences turned out to be one such problem since there were often issues with
cross-referencing between databases. A protein sequence would often map to multiple
coding sequences or, even worse, to no coding sequences, which led to fewer sufficiently-
significant predictions. In addition, the step in which CRK calculates selection pressures
using Selecton was found to be exceedingly slow, further constraining the method’s us-
ability.
1.4.2 The Rise of EPPIC
In 2012, the research group behind CRK published a new approach, the Evolution-
ary Protein–Protein Interface Classifier (EPPIC) [24]. Although grounded in the same
concepts as CRK, EPPIC overcame CRK’s limitations. Two evolutionary predictors,
core–rim and core–surface, were introduced, as well as a new geometric criterion for
interface classification, which was based simply on the number of core residues in an
interface. This allowed for a call to be made even when insufficient homologues were
available for evolutionary analysis. Next, the use of Ka/Ks ratios was abandoned in
favour of sequence entropies. It was found that better performance was obtainable by
using sequence entropies as long as homologues were stringently selected and had low
redundancy. This change not only did away with the speed-limiting step involving Se-
lecton but also meant that mapping protein sequences back to coding sequences was no
longer necessary.
As mentioned in Subsection 1.3.4, in 2001, Elcock and McCammon pioneered a method
comparing sequence entropy between interface and non-interface positions in MSAs [30].
EPPIC expands on this substantially: for the core–surface call, core residues of the
interface are compared to surface residues outside the interface. More specifically, in a
manner similar to that described by Valdar and Thornton [51], EPPIC randomly samples
the surface residues to be compared from a pool of surface residues in order to maximise
statistical robustness.
When searching for homologues, EPPIC uses a conservative sequence-identity cutoff of
60%, which is lowered to 50% in the event that insufficient homologues are found. These
high cutoffs ensure that homologues have a high structural similarity at both tertiary
and quaternary levels [24].
EPPIC uses the CRK definition for determining burial, as defined in the paper by
Schärer et al. in 2010 [39]. For each of the three predictors contained in EPPIC, a different burial cutoff can be used. EPPIC determines the solvent accessibility of residues
before and after the formation of the protein–protein interface, and the ratio is used to
classify them as either “core” (residues fully buried upon interface formation), “rim”
(residues partially buried upon interface formation), “surface” (surface residues not in-
volved in an interface), or “other” (any non-interface, non-surface residues; for example,
residues in a subunit’s core). The geometry-based predictor simply returns the num-
ber of core residues in the interface as its score, the core–rim evolutionary predictor
returns the ratio of sequence entropies for core interface residues compared to rim inter-
face residues, and, finally, the core–surface evolutionary predictor uses random sampling
of surface residues and returns the “distance” of the average entropy for core residues
to the mean of the samples of surface residues in terms of their standard deviation, a
Z-score–like measure. This will be explained in greater detail in Chapter 4.
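The three scores can be sketched as follows. This is an illustrative reimplementation based only on the description above, not EPPIC's source: the function names, the number of samples, and the use of sample means for the Z-score-like measure are assumptions. A conserved (low-entropy) interface core yields a score below 1 for the core–rim predictor and a negative score for the core–surface predictor.

```python
import random
import statistics

def geometry_score(residue_classes):
    """Geometric predictor: the number of interface-core residues."""
    return sum(1 for c in residue_classes if c == "core")

def core_rim_score(entropy):
    """Core-rim predictor: ratio of mean core to mean rim entropy."""
    return statistics.mean(entropy["core"]) / statistics.mean(entropy["rim"])

def core_surface_score(entropy, n_samples=10000, seed=0):
    """Core-surface predictor: Z-score-like distance of the mean core
    entropy from randomly sampled sets of surface residues."""
    rng = random.Random(seed)
    core_mean = statistics.mean(entropy["core"])
    k = len(entropy["core"])
    sample_means = [statistics.mean(rng.sample(entropy["surface"], k))
                    for _ in range(n_samples)]
    mu = statistics.mean(sample_means)
    sigma = statistics.stdev(sample_means)
    return (core_mean - mu) / sigma

# A conserved core against a variable rim and surface:
entropy = {"core": [0.10, 0.20, 0.15],
           "rim": [0.80, 0.90],
           "surface": [0.50, 0.70, 0.90, 1.00, 0.60,
                       0.80, 1.10, 0.40, 0.95, 0.75]}
print(geometry_score(["core", "core", "rim", "surface"]))  # 2
print(core_rim_score(entropy) < 1)                         # True
print(core_surface_score(entropy, n_samples=500) < 0)      # True
```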
These three scores are converted to calls by comparison to a threshold unique to each
predictor. Each call can be one of bio, meaning that the predictor believes the interface
to be biologically relevant; xtal, meaning that the predictor believes it to be a crystal
contact; or nopred, meaning that the predictor has insufficient data to go on. The three
calls are combined to form a consensus call through a simple-majority voting scheme
with some modifications. Since the geometric predictor always returns a call, if both
evolutionary predictors give no prediction, the geometry call is taken as the consensus.
Also, in certain cases, more weight is given to the calls by the evolutionary predictors:
when there is exactly one each of bio, xtal, and nopred, the non-nopred evolutionary
call is made the consensus call.
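The voting scheme just described can be sketched as follows (a hypothetical helper; the call strings and fall-through logic are assumptions based on the description above):

```python
def consensus_call(geom, core_rim, core_surface):
    """Combine three predictor calls ('bio', 'xtal', or 'nopred')
    into a consensus call."""
    evo = [core_rim, core_surface]
    # Geometry always returns a call; if neither evolutionary predictor
    # does, take the geometry call as the consensus.
    if evo.count("nopred") == 2:
        return geom
    calls = [geom] + evo
    # With exactly one each of bio, xtal, and nopred, trust the
    # evolutionary predictor that did make a call.
    if sorted(calls) == ["bio", "nopred", "xtal"]:
        return next(c for c in evo if c != "nopred")
    # Otherwise: simple majority among the calls actually made.
    made = [c for c in calls if c != "nopred"]
    return max(set(made), key=made.count)

print(consensus_call("bio", "nopred", "nopred"))  # bio
print(consensus_call("bio", "nopred", "xtal"))    # xtal
print(consensus_call("xtal", "bio", "bio"))       # bio
```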
1.5 Amino-Acid Alphabets
1.5.1 Background
There are twenty different naturally-occurring amino acids, complex strings of which
form proteins. Anfinsen’s dogma states that the native structure of a protein is deter-
mined solely by the protein sequence of amino acids [52]. It has been shown, however,
that point mutations in a protein’s sequence are often tolerated [53] and that proteins
with far fewer types of amino acids can still form stable structures [54, 55]. In one case,
thirty-eight residues of a fifty-seven–residue protein domain were each replaced with one
of five possible residues, and the domain retained its fold [54]. This suggests that the
standard twenty-letter amino-acid alphabet could, perhaps, be reduced.
Investigations into the degeneracy of amino-acid sequences and the possibility to con-
dense the amino-acid alphabet were published as early as 1979 [56]. In 1994, Strelets et al.
developed a model of combinatorial sequence space and broke down known protein se-
quences into overlapping subsequences of a fixed length, which led to an unexpectedly-
high clusterisation that “looked as if certain protein sequence variations occurred and
were fixed in the early course of evolution” [57], further propelling the idea that there
were simplifications that could be made in attempts to model the complexity of protein
sequence.
The simplest reduced protein alphabet is a two-letter one where amino acids are grouped
as either hydrophobic or polar. This has been dubbed the HP model, but, as argued by
Wolynes in 1997, two-letter alphabets result in too many local minima and insufficient
funnelling in the energy landscape of protein folding [58]. In 1999, Wang and Wang took
a computational approach to simplifying the protein-folding alphabet using minimised
mismatch between properties of amino acids as a function of the size of the amino-acid
alphabet and determined that, depending on how many properties are to be modelled,
various plateaus exist to describe proteins at different coarse-grained levels [59].
In 2000, Murphy et al. set out to determine whether there is a minimal protein alphabet
with which all proteins could still fold [60]. They used the BLOSUM50 similarity ma-
trix [61], which was derived from gapless blocks in large alignments of protein sequences,
to correlate the most-related amino acids, merge them, and construct a simplified sim-
ilarity matrix. This process, applied iteratively, showed clear tendencies: residues with
similar physicochemical properties first group together, the small amino acids then come
together to form a group, and the alphabet ultimately reduces to the HP model. The
authors tested the effects by applying the reduced alphabets on protein sequences and
aligning them to the SCOP40 clustered database [62] to determine how much of the pro-
tein fold mapping is lost. An amino-acid alphabet of ten or twelve letters only reduced
the percentage coverage retained by approximately 10%; however, a steep loss of fold
recognition was observed with any simpler alphabet [60].
Three years later, in a very similar approach, Li et al. used the BLOSUM62 similarity matrix [61] to guide an iterative condensation of the protein alphabet and compared the resulting alphabets for information loss, just as Murphy et al. did. The authors noted that ten-letter
alphabets allow for nearly the same ability to detect distantly related folds as the twenty-letter alphabet, which may mean that ten amino acids would suffice to characterise the complexity in proteins [63]. Research into reduced protein alphabets has also
continued in recent years with a greater focus on their applications in protein design [64].
1.5.2 Relevance
In EPPIC, as described in Subsection 1.4.2, an MSA of putative homologues is generated
for the two evolutionary predictors. The sequence entropy is then calculated at each
position, and each position is categorised as “core”, “rim”, “surface”, or “other”. The
scores are based on the differences in sequence entropy between these classes. It is for
these calculations that reduced amino-acid alphabets are needed. The signal-to-noise
ratio of a column is a delicate balance in that variation should be detected but not taken
into account if the variations do not matter at a biochemical level. Part of this could be
resolved through cleverly choosing an appropriate amino-acid alphabet. As a result, a
large part of this thesis is an investigation into the effects of various reduced amino-acid
alphabets on EPPIC’s output.
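The core quantity in these calculations is the per-column Shannon entropy after mapping residues to the groups of a reduced alphabet. A minimal sketch follows; the two-letter Murphy alphabet is used as the default, and the base-2 logarithm is an assumption, since the entropy units are not fixed by the description above.

```python
from collections import Counter
from math import log

MURPHY_2 = "ACFGILMPSTVWY:DEHKNQR"  # hydrophobic : polar

def column_entropy(column, alphabet=MURPHY_2):
    """Shannon entropy (bits) of one MSA column after mapping each
    residue to its group in a colon-separated reduced alphabet."""
    group_of = {aa: g for g in alphabet.split(":") for aa in g}
    counts = Counter(group_of[aa] for aa in column if aa in group_of)
    n = sum(counts.values())
    return sum((c / n) * log(n / c, 2) for c in counts.values())

# A column of Ile/Leu/Val is perfectly conserved under the two-letter
# alphabet but variable under the full twenty-letter one:
twenty = ":".join("ACDEFGHIKLMNPQRSTVWY")
print(column_entropy("ILVVIL"))              # 0.0
print(column_entropy("ILVVIL", twenty) > 0)  # True
```

This is exactly the signal-to-noise trade-off discussed above: conservative substitutions among hydrophobic residues contribute no entropy under the two-letter alphabet.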
Chapter 2
Parameter Optimisation
2.1 Parameters in EPPIC
The strategy behind EPPIC is fairly fixed: given a structure, for each possible interface,
the program classifies residues based on solvent accessibility before and after the forma-
tion of the interface; creates a large, non-redundant MSA of putative homologues; and
uses this information to return three different measures by which to classify the inter-
face as either biologically relevant or a crystal contact. This does not, however, mean
that EPPIC does not allow for any customisability. In fact, the gamut of adjustable
parameters that EPPIC offers proves just the opposite.
EPPIC allows the user to choose the size of the amino-acid alphabet used for the sequence
entropy calculations (from the ones proposed by Murphy et al. [60]), the BSA-to-ASA
burial cutoff for core-residue assignment for each of the three classifiers, the calling
thresholds for determining bio or xtal from the raw scores, the hard and soft sequence
identity cutoffs, the maximum number of homologues to use, the number of sphere points
for ASA calculations, as well as several other parameters.
Of the aforementioned parameters, this chapter of the thesis will concern the optimisa-
tion of seven of them: the three burial cutoffs, the three call thresholds, and the reduced
amino-acid alphabet.
Chapter 2 — Parameter Optimisation 16
2.2 Optimisation
2.2.1 BioMany and XtalMany
In order to optimise parameters, a method of evaluating the accuracy of the classifier is
required. Therein lies one of the most challenging obstacles: gold-standard datasets with
experimental backing are key to properly training a classifier, but the manually-curated
ones available to date have taken a long time to compile, been small (on the order of
102 entries), and been prone to human error [65].
In 2012, when EPPIC was originally released, two small datasets were compiled by
the authors. These Duarte–Capitani (DC) datasets, DCbio and DCxtal, were used to
optimise the EPPIC parameters [24]. Since it had long been known that biological
interfaces tend to be much larger than crystal contacts [28], Duarte et al. decided to
focus their curation efforts on the region harder to classify. The DCbio dataset, therefore,
consisted of small interfaces validated as biological, while the DCxtal dataset contained
large crystal contacts.
For a true determination of the optimal parameters, the DC datasets, each having
only eighty-one entries, were too small. To this end, Baskaran et al. generated the
Many datasets, BioMany and XtalMany [65]. BioMany was based on the ProtCID
database [36, 37], which infers that an interface is biological when it is present across
multiple crystal forms. The ProtCID database also clusters interfaces by Pfam architec-
ture [38], which allowed for a selection of interfaces of minimal redundancy to be pulled
from the database. Using conservative thresholds, 2666 such interfaces were added to
the BioMany dataset. Additionally, 171 further biological interfaces that had been val-
idated by nuclear magnetic resonance (NMR), another method of determining protein
structure, were added to the dataset. Any interfaces larger than 2000 Å² were removed
from BioMany in order to keep the dataset focussed on the range of interfaces harder to
classify.
In 1965, Monod et al. first discussed interfaces that lead to infinite assemblies [6]. As
mentioned in Section 1.1, homomeric interfaces can be either isologous or heterologous,
the latter of which has the potential to assemble infinitely. Baskaran et al. assert that,
when considering a protein crystal, any interfaces produced by pure translation or a
screw axis cannot result in closed assemblies, and, therefore, any such observed interfaces
can be assumed to be crystal contacts [65]. The XtalMany dataset was populated using
this assumption, the interfaces were clustered by sequence to decrease redundancy, and
any interfaces smaller than 600 Å² were removed. This resulted in a dataset of 2913
crystal contacts.
2.2.2 Computation
It was determined early on that the optimisation of parameters would have to involve a
full search of the parameter space. Not all parameters are interdependent, which meant
that there could be a certain degree of parallelisation in the optimisation. Additionally,
the call thresholds only come into play during the conversion of scores to calls, allowing
these parameters to be handled a posteriori.
The burial cutoff for the geometric predictor was evaluated from 50% to 100% in in-
crements of 5%. The burial cutoffs for the core–rim and core–surface evolutionary pre-
dictors were each evaluated from 50% to 100% in increments of 10% in combination
with the amino-acid alphabet being varied across the two-, four-, eight-, ten-, and fifteen-letter alphabets proposed by Murphy et al. [60], the six-letter alphabet proposed by Mirny and Shakhnovich [66], and the unreduced twenty-letter alphabet. The amino-acid alphabets used for the optimisations are shown in Table 2.1.
Alphabet Identifier   Source   Amino-Acid Alphabet
2                     [60]     ACFGILMPSTVWY:DEHKNQR
4                     [60]     AGPST:CILMV:DEHKNQR:FWY
6                     [66]     ACILMV:DE:FHWY:GP:KR:NQST
8                     [60]     AG:CILMV:DENQ:FWY:H:KR:P:ST
10                    [60]     A:C:DENQ:FWY:G:H:ILMV:KR:P:ST
15                    [60]     A:C:D:E:FY:G:H:ILMV:KR:N:P:Q:S:T:W
20                    —        A:C:D:E:F:G:H:I:K:L:M:N:P:Q:R:S:T:V:W:Y

Table 2.1: The seven amino-acid alphabets originally released with EPPIC. For each alphabet, the alphabet identifier (simply the number of letters in the alphabet) is shown, as well as the authors who engineered it and the alphabet itself. The twenty-letter alphabet is an unreduced alphabet and, therefore, has no source.
Table 2.1 also introduces a standardised way to describe amino-acid alphabets. Each
group of amino acids represented by one letter in the alphabet is written as a string of
one-letter codes corresponding to the amino acids in that group. The one-letter codes
are written in alphabetical order, and then the full strings are written out from left to
right, also sorted alphabetically, using the colon character (:) as a separator.
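This notation lends itself to a simple parser and normaliser; the helper names below are hypothetical illustrations of the convention just described.

```python
def parse_alphabet(spec):
    """Map each one-letter residue code to its group string."""
    return {aa: group for group in spec.split(":") for aa in group}

def canonical(spec):
    """Rewrite an alphabet string in the standard form: letters sorted
    within each group, then groups sorted alphabetically, colon-joined."""
    return ":".join(sorted("".join(sorted(g)) for g in spec.split(":")))

print(canonical("DEHKNQR:ACFGILMPSTVWY"))  # ACFGILMPSTVWY:DEHKNQR
print(parse_alphabet("AG:CILMV:P")["L"])   # CILMV
```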
For each set of parameters, EPPIC was run on 5607 PDB structures from the BioMany
and XtalMany datasets, which led to a grand total of 5607×(11+6×7+6×7) = 532 665
runs of the program. Clearly, with each run lasting approximately half a minute, this
would have taken somewhere on the order of months to compute on a standard per-
sonal computer. As a result, computations were carried out using two high-performance
computing clusters (HPCCs): Brutus at ETH Zurich and Merlin 4 at the Paul Scherrer
Institute.
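The run count quoted above can be reproduced directly from the grid sizes described in this section:

```python
# Grid sizes as described in the text:
geometric_cutoffs = 11  # burial cutoffs from 50% to 100% in 5% steps
evo_cutoffs = 6         # burial cutoffs from 50% to 100% in 10% steps
alphabets = 7           # the seven alphabets of Table 2.1
structures = 5607       # BioMany + XtalMany PDB entries

# One grid for the geometric predictor, plus one (cutoff x alphabet)
# grid for each of the two evolutionary predictors:
parameter_sets = geometric_cutoffs + 2 * evo_cutoffs * alphabets
runs = structures * parameter_sets
print(parameter_sets, runs)  # 95 532665
```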
2.3 Results
2.3.1 Treatment of nopred Calls
As discussed in Subsection 1.4.2, there are three different calls that EPPIC’s predictors
can return: bio, xtal, or nopred. As a classifier, EPPIC’s output should be analysed in
the typical manner, which involves tallying correct and incorrect classifications for both
the positive and negative datasets (BioMany and XtalMany, respectively). The handling
of nopred calls in this binary scheme, however, is not trivial. This is because, no matter
how they are treated, some bias will necessarily appear in the results. For example,
treating them with the same “contempt” as incorrect classifications discourages their
occurrence, which is contrary to the fundamental idea of having nopred calls. Ideally,
nopred calls should be preferred to misclassification. On the other hand, treating them
favourably encourages EPPIC to return fewer bio and xtal calls, which, clearly, is
flawed behaviour for software whose purpose is to make predictions.
It was decided that the nopred calls could be treated as xtal calls. This converts the
hard-to-handle tertiary classification problem into a binary one. A possible downside
to this approach is that nopred calls are now considered as suitable to be returned for
crystal interfaces. This outcome is easily defended: a typical user of EPPIC is searching
for biological interfaces in a sea of crystal contacts and would likely view any non-bio
call similarly. As of February 2015, EPPIC’s final calls across the entire PDB show that
approximately six times more crystal interfaces exist than biological interfaces.
2.3.2 Prefiltering Results
For the core–rim and core–surface evolutionary predictors, two different stringencies for
the list of putative homologues were applied: a minimum of ten homologues, each having
at least 50% sequence identity to the query, and a minimum of thirty homologues, each
having at least 60% sequence identity. Imposing these homologue stringencies helps
ensure that the MSA used for calculations is of statistical significance, since too few or overly dissimilar homologues in the alignment are likely to undermine the accuracy of the classifications made.
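The two stringency levels amount to a simple count-above-identity filter. The helper below is a hypothetical illustration, with sequence identities expressed as fractions:

```python
def passes_stringency(identities, min_count, min_identity):
    """True if at least min_count homologues reach min_identity
    (fractional sequence identity to the query)."""
    return sum(1 for x in identities if x >= min_identity) >= min_count

homologues = [0.52, 0.55, 0.61, 0.63, 0.70] * 7   # 35 putative homologues
soft = passes_stringency(homologues, 10, 0.50)    # >= 10 at >= 50% identity
hard = passes_stringency(homologues, 30, 0.60)    # >= 30 at >= 60% identity
print(soft, hard)  # True False
```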
2.3.3 Measures of Classifier Performance
As mentioned in Subsection 2.3.1, each of EPPIC’s predictors is considered a binary
classifier for analysis and optimisation, and nopred calls are simply treated as xtal
calls. Since biological interfaces are typically of interest to the user, bio and xtal calls
shall be positives and negatives, respectively. True positives (TPs), false positives (FPs),
true negatives (TNs), and false negatives (FNs) are briefly defined in Table 2.2.
                      Interface
                 Biological   Crystal
Call    bio          TP          FP
        xtal         FN          TN

Table 2.2: The definitions of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) with respect to the factual interface type, either biological or crystal, and the call made by the binary classifier, either bio or xtal.
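Treating nopred as xtal reduces the tally to the four cells of Table 2.2. A minimal sketch, with a hypothetical helper name:

```python
def confusion_counts(calls, truths):
    """Tally TP/FP/TN/FN, mapping 'nopred' calls to 'xtal' as described."""
    tally = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for call, truth in zip(calls, truths):
        call = "xtal" if call == "nopred" else call
        if call == "bio":
            tally["TP" if truth == "bio" else "FP"] += 1
        else:
            tally["FN" if truth == "bio" else "TN"] += 1
    return tally

print(confusion_counts(["bio", "nopred", "xtal", "bio"],
                       ["bio", "bio",    "xtal", "xtal"]))
# {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```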
Various measures exist to assess the performance of a binary classifier given the defini-
tions in Table 2.2 [67]. Equation (2.1) defines the sensitivity of the classifier, which is
the proportion of positives correctly labelled as positive. The denominator is expanded
for clarity: datapoints correctly labelled as positive and datapoints incorrectly labelled
as negative together form the set of all positives.
Sensitivity = TP / (TP + FN)    (2.1)
Similarly, Equation (2.2) defines the specificity of the classifier, which is the proportion
of negatives correctly labelled as negative. Again, the denominator is expanded for
clarity: datapoints correctly labelled as negative and datapoints incorrectly labelled as
positive together form the set of all negatives.
Specificity = TN / (TN + FP)    (2.2)
In Equation (2.3), the accuracy of the classifier is defined as the correct proportion of all
predictions. There is, however, an issue with all hitherto-presented measures of perfor-
mance: maximising the sensitivity or specificity alone has the side-effect of minimising
the other. This can be illustrated by an extreme case wherein the classifier always makes
a bio call: perfect sensitivity is obtained at the expense of any specificity. The accuracy
is also flawed in that maximising it promotes skewing the classifier’s predictions to obtain
as many correct predictions as possible, which is not necessarily a desirable result, es-
pecially for imbalanced datasets. Considering, for example, a hypothetical dataset with
ninety positive cases and ten negative cases, a binary classifier trained by maximising
accuracy would value sensitivity—the ability to identify those ninety positives—much
more highly than specificity, since even a near-zero specificity would only reduce the
accuracy by approximately 10%.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.3)
It follows that any useful measure of performance would have to impartially balance
the sensitivity and the specificity. One such measure is presented in Equation (2.4),
which defines the balanced accuracy as the arithmetic mean of the sensitivity and the
specificity.
Balanced Accuracy = (Sensitivity + Specificity) / 2    (2.4)
Another interesting measure of performance is the Matthews correlation coefficient
(MCC) [68], presented in Equation (2.5). The MCC was introduced in 1975 as a bal-
anced measure of performance that works with datasets of imbalanced character. It is
effectively a correlation coefficient measuring the agreement between the actual and the
predicted classifications.
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (2.5)
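Equations (2.1) to (2.5) translate directly into code. The sketch below is illustrative; the function name and argument order are assumptions.

```python
from math import sqrt

def metrics(tp, fp, tn, fn):
    """The performance measures of Equations (2.1)-(2.5)."""
    sensitivity = tp / (tp + fn)                          # (2.1)
    specificity = tn / (tn + fp)                          # (2.2)
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # (2.3)
    balanced_accuracy = (sensitivity + specificity) / 2   # (2.4)
    mcc = (tp * tn - fp * fn) / sqrt(                     # (2.5)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
            "mcc": mcc}

m = metrics(tp=80, fp=10, tn=90, fn=20)
print(round(m["balanced_accuracy"], 2), round(m["mcc"], 4))  # 0.85 0.7035
```

Note that the degenerate always-bio classifier discussed above would score a balanced accuracy of only 0.5, which is why this measure is a suitable objective function.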
Over the course of the experiments carried out to generate the results presented below,
it was found that the best method of assessing a classifier’s performance was to use the
balanced accuracy. This measure, therefore, is used as the objective function for the
optimisation of EPPIC’s parameters.
2.3.4 Optimised Parameters
After varying and testing sets of parameters for each of EPPIC’s classifiers, the results
were sorted by the balanced accuracy achieved on the BioMany and XtalMany datasets.
The top ten sets of parameters for each classifier are presented in the tables below.
In Table 2.3, the ten best parameter sets for the geometric classifier are presented. Only
two parameters can be varied for this classifier: the burial cutoff, which is the proportion
of solvent-accessible surface that has to be buried upon interface formation for a residue
to be assigned to the interface core, and the call threshold, which is the minimum number
of core residues in the interface that have to be identified for an interface to be deemed
biological and given a bio call. The parameter optimisation that preceded the original
Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
90%              8               0.8635              0.8647     0.7992        0.9279        0.7344
85%             10               0.8612              0.8621     0.8152        0.9072        0.7264
90%              9               0.8555              0.8573     0.7533        0.9577        0.7283
95%              5               0.8544              0.8552     0.8085        0.9004        0.7126
90%              7               0.8525              0.8530     0.8266        0.8784        0.7064
85%              9               0.8524              0.8526     0.8419        0.8629        0.7052
95%              6               0.8501              0.8517     0.7558        0.9443        0.7146
90%             10               0.8484              0.8507     0.7213        0.9756        0.7230
95%              4               0.8469              0.8465     0.8704        0.8234        0.6942
90%              6               0.8445              0.8439     0.8793        0.8097        0.6900

Table 2.3: The top ten parameter sets of burial cutoffs and call thresholds for the geometric classifier in EPPIC. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
release of EPPIC in 2012 showed an optimal burial cutoff of 95% and an optimal call
threshold of 6 for the geometric classifier [24]. From the table, it can be observed that a
burial cutoff of 90% and a call threshold of 8 achieve better performance. The result is
also intuitive in the sense that relaxing the requirement to be considered a core residue
comes alongside an increase in how many core residues must be present to consider an
interface biological.
In Table 2.4, the ten best parameter sets for the core–rim evolutionary classifier are pre-
sented. In addition to the burial cutoff (as described for the geometric classifier), two fur-
ther parameters can be varied: the reduced alphabet, which is an abbreviated way of re-
ferring to the number of letters in the reduced amino-acid alphabet used to calculate the
entropy at sites in the MSA (taken from the possibilities proposed by Murphy et al. [60],
as well as the six-letter alphabet proposed by Mirny and Shakhnovich [66]), and the call
threshold, which is the scoring boundary below which an interface is given a bio call.
These results are for datapoints where at least ten putative homologues, each having at
least 50% sequence identity to the query, were identifiable.
Table 2.5 again shows the ten best parameter sets for the core–rim evolutionary classifier,
much like in Table 2.4, but only considering datapoints for which a minimum of thirty
putative homologues, each having at least 60% sequence identity to the query, was
identifiable. The results in Table 2.5 can be accepted with greater confidence, since
more data are used per considered datapoint to make the prediction. The typical values
for the balanced accuracies also contrast between tables: selecting putative homologues
more stringently allows for higher scores, further strengthening the case for the use of
Table 2.5. EPPIC’s original parameter optimisation showed an optimal burial cutoff of
70% and an optimal call threshold of 0.75 for the core–rim evolutionary predictor [24].
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
2                  80%             0.75             0.8471              0.8465     0.8215        0.8728        0.6945
2                  80%             0.80             0.8457              0.8453     0.8303        0.8610        0.6912
2                  90%             0.90             0.8455              0.8424     0.7184        0.9726        0.7111
2                  80%             0.70             0.8451              0.8443     0.8099        0.8804        0.6910
2                  80%             0.85             0.8449              0.8447     0.8371        0.8526        0.6895
2                  90%             0.85             0.8442              0.8410     0.7140        0.9743        0.7094
2                  80%             0.65             0.8438              0.8426     0.7938        0.8939        0.6899
2                  90%             0.80             0.8432              0.8399     0.7092        0.9773        0.7090
2                  80%             0.90             0.8431              0.8432     0.8468        0.8395        0.6863
2                  80%             0.60             0.8431              0.8416     0.7826        0.9035        0.6897

Table 2.4: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–rim evolutionary classifier in EPPIC. These are the results obtained when using the less-stringent requirements for putative homologues: a minimum of ten, each having at least 50% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
Reduced amino-acid alphabets were not tested, and the ten-letter alphabet proposed
by Murphy et al. [60] was used. From the table, it can be observed that the six-letter
reduced amino-acid alphabet proposed by Mirny and Shakhnovich [66], a burial cutoff
of 80%, and a call threshold of 0.90 produce better results.
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
6                  80%             0.90             0.8745              0.8727     0.8873        0.8617        0.7437
2                  90%             0.90             0.8725              0.8874     0.7684        0.9766        0.7768
2                  90%             0.85             0.8715              0.8867     0.7653        0.9777        0.7758
2                  90%             0.80             0.8705              0.8861     0.7621        0.9789        0.7749
2                  80%             0.85             0.8700              0.8686     0.8795        0.8605        0.7351
2                  80%             0.75             0.8692              0.8706     0.8592        0.8792        0.7366
2                  80%             0.90             0.8688              0.8660     0.8889        0.8488        0.7316
6                  80%             0.85             0.8680              0.8680     0.8685        0.8675        0.7325
2                  80%             0.80             0.8674              0.8673     0.8685        0.8664        0.7312
2                  80%             0.70             0.8670              0.8700     0.8466        0.8875        0.7344

Table 2.5: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–rim evolutionary classifier in EPPIC. These are the results obtained when using the more-stringent requirements for putative homologues: a minimum of thirty, each having at least 60% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
In Table 2.6, the ten best parameter sets for the core–surface evolutionary classifier
are presented. The parameters varied for this analysis are exactly as those varied in
Tables 2.4 and 2.5, albeit for the core–surface predictor. These results are for datapoints
where at least ten putative homologues, each having at least 50% sequence identity to
the query, were identifiable.
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
2                  80%             −0.90            0.8889              0.8878     0.8436        0.9343        0.7796
2                  80%             −0.95            0.8886              0.8874     0.8383        0.9389        0.7796
2                  80%             −0.85            0.8886              0.8876     0.8472        0.9301        0.7786
2                  80%             −0.80            0.8881              0.8872     0.8504        0.9259        0.7773
2                  80%             −1.00            0.8877              0.8864     0.8319        0.9436        0.7786
2                  80%             −1.05            0.8866              0.8851     0.8255        0.9478        0.7772
2                  80%             −0.75            0.8866              0.8858     0.8524        0.9208        0.7739
2                  70%             −1.05            0.8819              0.8814     0.8628        0.9010        0.7637
2                  70%             −1.00            0.8791              0.8788     0.8660        0.8922        0.7580
15                 80%             −0.90            0.8772              0.8759     0.8255        0.9288        0.7568

Table 2.6: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–surface evolutionary classifier in EPPIC. These are the results obtained when using the less-stringent requirements for putative homologues: a minimum of ten, each having at least 50% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
Table 2.7 again shows the ten best parameter sets for the core–surface evolutionary
classifier, much like in Table 2.6, but only considering datapoints for which a minimum
of thirty putative homologues, each having at least 60% sequence identity to the query,
was identifiable. The high-stringency results in Table 2.7, again, can be accepted with
greater confidence and exhibit higher values of balanced accuracy. EPPIC’s original
parameter optimisation showed an optimal burial cutoff of 70% and an optimal call
threshold of −1.00 for the core–surface evolutionary predictor [24]. Reduced amino-acid
alphabets were not tested, and the ten-letter alphabet proposed by Murphy et al. [60]
was used. From the table, it can be observed that the six-letter reduced amino-acid
alphabet proposed by Mirny and Shakhnovich [66], a burial cutoff of 80%, and a call
threshold of −0.90 produce better results.
Perhaps more interesting than the slight differences in optimised burial cutoffs and call
thresholds is the fact that the two-letter alphabet proposed by Murphy et al. appears so
frequently in the top parameter sets. Indeed, if Tables 2.4 and 2.6 were to be extended to
each show the top twenty-five parameter sets, the two-letter alphabet would be present
in fourteen and eleven of them, respectively. This is not just the case for the low-
stringency homologues, either: similarly extending Tables 2.5 and 2.7 to each show
twenty-five parameter sets would result in twelve and nine appearances, respectively, of
the two-letter alphabet.
In Figure 2.1, the effect of different amino-acid alphabets—the two-, four-, eight-, ten-,
and fifteen-letter reduced amino-acid alphabets proposed by Murphy et al. [60]; the six-
letter one proposed by Mirny and Shakhnovich [66]; and the standard, unreduced twenty-
letter alphabet—on the balanced accuracy of the core–surface evolutionary predictor is
Reduced Alphabet   Burial Cutoff   Call Threshold   Balanced Accuracy   Accuracy   Sensitivity   Specificity   MCC
6                  80%             −0.90            0.9142              0.9169     0.8951        0.9332        0.8300
15                 80%             −0.90            0.9132              0.9155     0.8967        0.9297        0.8274
2                  80%             −0.85            0.9122              0.9149     0.8936        0.9308        0.8259
15                 80%             −0.85            0.9114              0.9135     0.8967        0.9261        0.8233
6                  80%             −0.95            0.9110              0.9142     0.8889        0.9332        0.8245
6                  80%             −0.95            0.9110              0.9142     0.8889        0.9332        0.8245
6                  80%             −1.05            0.9110              0.9155     0.8795        0.9426        0.8272
2                  80%             −0.80            0.9108              0.9129     0.8967        0.9250        0.8220
6                  80%             −0.85            0.9106              0.9129     0.8951        0.9261        0.8219
2                  80%             −0.90            0.9104              0.9142     0.8842        0.9367        0.8244

Table 2.7: The top ten parameter sets of reduced alphabets, burial cutoffs, and call thresholds for the core–surface evolutionary classifier in EPPIC. These are the results obtained when using the more-stringent requirements for putative homologues: a minimum of thirty, each having at least 60% sequence identity to the query. While many measures of classifier performance are shown, the balanced accuracy is used for sorting purposes.
shown for various call thresholds. No matter the call threshold, the trends remain quite
consistent. The twenty-letter alphabet does quite poorly, which is to be expected, since
this maximal level of detail is likely to bring more noise than signal into the balance
during the entropy calculations. The four-letter alphabet, however, does particularly
badly, especially in comparison to its two- and six-letter neighbours, while the two-,
six-, and fifteen-letter alphabets perform best.
It is puzzling, or at least unintuitive, that discarding arguably 90% of
the information in the MSA—going from twenty amino acids to two—is so beneficial to
the predictions made by the evolutionary classifiers in EPPIC. With a view to elucidate
this observation, the topic of reduced amino-acid alphabets forms much of the remainder
of this thesis. For the sake of conciseness, a terminological adjustment should be made:
henceforth, let an “n-alphabet” refer to an n-letter reduced amino-acid alphabet, where
n can be any natural number from 2 to 20.
2.4 Conclusion
All three of EPPIC’s predictors—geometric, core–rim evolutionary, and core–surface
evolutionary—have a large number of parameters that can be varied. While the default
values for these parameters were chosen based on an optimisation carried out just prior to
the original release of EPPIC in 2012 [24], this optimisation was not an exhaustive search
and was performed using a small dataset. Two new datasets, BioMany and XtalMany,
compiled computationally by Baskaran et al. [65], allowed for a more thorough analysis of
[Figure: balanced accuracy (0.86–0.92) plotted against alphabet identifier (2, 4, 6, 8, 10, 15, 20) for core–surface call thresholds from −1.05 to −0.75.]

Figure 2.1: The effects of various reduced amino-acid alphabets on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance.
the effects of parameter values on EPPIC’s predictions. Through this, it was found that
the geometric predictor performs best with a burial cutoff of 90% and a call threshold
of 8, as shown in Table 2.3; that the core–rim evolutionary predictor performs best with
the 6-alphabet proposed by Mirny and Shakhnovich [66], a burial cutoff of 80%, and
a call threshold of 0.90, as shown in Table 2.5; and that the core–surface evolutionary
predictor performs best with the 6-alphabet proposed by Mirny and Shakhnovich [66], a
burial cutoff of 80%, and a call threshold of −0.90, as shown in Table 2.7. Additionally,
2-alphabets are surprisingly prominent in the highest-ranking parameter sets for the
evolutionary predictors, a fact that drove much of the remainder of this thesis toward
further investigation of the importance of reduced amino-acid alphabets on EPPIC’s
predictive performance.
Chapter 3
Reduced Amino-Acid Alphabets
3.1 Rationale
In Section 2.4, Chapter 2 was concluded with some interesting findings: the 2-alphabet
proposed by Murphy et al. [60] seems to come up surprisingly often in the highest-
ranking parameter sets for the evolutionary predictors in EPPIC, and the 6-alphabet
proposed by Mirny and Shakhnovich [66] is present in the best parameter sets. Clearly,
further investigation into the effects that different amino-acid alphabets have on EPPIC’s
performance was needed. EPPIC, however, had its seven amino-acid alphabets, those
defined in Table 2.1, hardcoded in its source code. This meant that some changes to
EPPIC had to first be made.
3.2 Published Alphabets
The first step was to collect reduced amino-acid alphabets that had already been pro-
posed by the scientific community. The VisCoSe tool, published by Spitzer et al. [69],
allows users to select from a list of compiled amino-acid alphabets and includes descrip-
tions and citations for each. These alphabets, as well as the ones from Chapter 2, are
listed in Table 3.1. Due to the style in which these alphabets were described by the
VisCoSe tool, it was not immediately obvious that Alphabets 25 and 29 were, in fact,
the same.
3.2.1 Methods
The eleven additional alphabets in Table 3.1, when compared to Table 2.1, were manually
added into EPPIC’s source code, and EPPIC was run on the BioMany and XtalMany
Alphabet                                                             Size of
Identifier   Source   Amino-Acid Alphabet                           Alphabet
 2           [60]     ACFGILMPSTVWY:DEHKNQR                             2
 4           [60]     AGPST:CILMV:DEHKNQR:FWY                           4
 6           [66]     ACILMV:DE:FHWY:GP:KR:NQST                         6
 8           [60]     AG:CILMV:DENQ:FWY:H:KR:P:ST                       8
10           [60]     A:C:DENQ:FWY:G:H:ILMV:KR:P:ST                    10
15           [60]     A:C:D:E:FY:G:H:ILMV:KR:N:P:Q:S:T:W               15
20           —        A:C:D:E:F:G:H:I:K:L:M:N:P:Q:R:S:T:V:W:Y          20
21           [63]     AGPST:CFILMVWY:DEHKNQR                            3
22           [63]     AGPST:CFWY:DEHKNQR:ILMV                           4
23           [63]     APST:CFWY:DEHKNQR:G:ILMV                          5
24           [63]     AST:C:DEQ:FWY:G:HN:IV:KR:LM:P                    10
25           [59]     ADEGHKNPQRST:CFILMVWY                             2
26           [59]     AGHPRT:CFILMVWY:DEKNQS                            3
27           [59]     AHT:CFILMVWY:DE:GP:KNQRS                          5
28           [59]     AGST:CFIM:DENQ:HKPR:LVWY                          5
29           [69]     ADEGHKNPQRST:CFILMVWY                             2
30           [69]     ACGS:DEKR:FHWY:ILV:MNPQT                          5
31           [69]     ACGS:DE:FHWY:ILV:KR:MNPQT                         6

Table 3.1: Sixteen amino-acid alphabets proposed by the scientific community. For each alphabet, the alphabet identifier—no longer necessarily the number of letters in the alphabet—is shown, as well as the authors who proposed it, the alphabet itself, and the number of groups in the alphabet. Alphabet 20 is an unreduced alphabet and, therefore, has no source. Note that Alphabets 25 and 29 are identical.
datasets once per alphabet. The burial cutoff used for core-residue assignment was 80%.
When the authors of a PDB entry collect new data or rerefine a structure, the PDB
entry is made obsolete and superseded by a new PDB entry. When this occurs to PDB
entries in the BioMany or XtalMany datasets, the entry must be removed, since it is no
longer considered a valid structure. The new, superseding structure does not necessarily
conform to the standards applied when generating the datasets and, hence, cannot be
added to either dataset. When this experiment was carried out, four PDB entries had
been superseded: 2H7I, 2HCY, 3DMA, and 4M7N. The more-stringent criteria for putative
homologues, a minimum of thirty, each having at least 60% sequence identity to the
query, were imposed. Additionally, at this point, it was decided that, for these and
future experiments, there is little use investigating both the core–rim and core–surface
evolutionary predictors. The core–surface one offers a much more robust measure and
will, thus, represent both predictors. Computations were carried out using the Merlin 4
HPCC at the Paul Scherrer Institute.
3.2.2 Results
Figure 3.1 shows the results of running the EPPIC core–surface evolutionary predictor on
the BioMany and XtalMany datasets for each of the alphabets in Table 3.1. The results
are shown for seven different core–surface call thresholds. No matter the threshold,
Alphabet 2 clearly comes out on top, with Alphabet 15 following close behind, while
Alphabet 25, identical to Alphabet 29, is the worst by a significant margin.
In Figure 3.1, it is also interesting that the alphabets contributing to high performance
tend to have less variation for different call thresholds. Ideally, there would be a measure
that could summarise not only the balanced accuracy, which, it might be helpful to recall,
is the arithmetic mean of the sensitivity and specificity, but also the variation across call
thresholds. Such a measure is described below in Subsection 3.2.3, and the remainder
of the results from this chapter then follow in Subsection 3.2.4.
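As a quick reference, the balanced accuracy recalled above can be computed from confusion-matrix counts in a few lines. This is a generic sketch, not EPPIC's code; the count names are illustrative:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Arithmetic mean of the sensitivity (true-positive rate)
    and the specificity (true-negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Example: 90 of 100 bio interfaces and 80 of 100 xtal interfaces called correctly.
balanced_accuracy(tp=90, fn=10, tn=80, fp=20)  # ≈ 0.85
```

Because it averages the two per-class rates, the measure is insensitive to how many bio versus xtal interfaces the benchmark happens to contain.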
[Figure: balanced accuracy (0.65–0.90) plotted against alphabet identifier (2–31, as in Table 3.1) for core–surface call thresholds from −1.05 to −0.75.]

Figure 3.1: The effects of published reduced amino-acid alphabets on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance and the alphabet identifiers are those defined in Table 3.1.
3.2.3 Receiver Operating Characteristic
The receiver operating characteristic (ROC) curve is a method of plotting particular
aspects of the performance of a binary classifier against one another, all while varying the
call threshold. Specifically, the true-positive rate, which is equivalent to the sensitivity,
is plotted as a function of the false-positive rate, which is equivalent to unity minus the
specificity, at various threshold settings. The ROC curve necessarily passes through two
points: (0, 0), since designating all interfaces as xtal correctly identifies all negatives
but also labels all positives as false negatives, and (1, 1), since designating all interfaces
as bio correctly identifies all positives but also labels all negatives as false positives.
The diagonal of the plot, stretching between the two aforementioned points, is the line
below which the performance of the binary classifier is mathematically worse than that
of random classification.
The best result for a binary classifier would be to have an ROC curve that rises sharply
from (0, 0) to (0 + ε, 1), stays flat across y = 1, and ends at (1, 1). Thus, in general, an
ROC curve that tends to be higher or more left-lying is considered to be demonstrative
of a better classifier. The area under the ROC curve (AUROC) is a measure of such
characteristics: integrating the curve over x ∈ [0, 1] results in a number from 0 to 1, which
represents the performance of the classifier irrespective of the call threshold chosen. A
binary classifier with a higher AUROC value can be said to be a better classifier; however,
when two binary classifiers have similar AUROC values but dissimilar ROC curves, one
must consider the prevalence of positives versus the prevalence of negatives, as well
as the cost of false positives versus the cost of false negatives, before deciding which
classifier is more appropriate [70].
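The ROC construction described above can be sketched directly: sweep the call threshold over the observed scores, record a (false-positive rate, true-positive rate) point at each threshold, and integrate with the trapezoidal rule. This is a generic sketch, not EPPIC's implementation, and it assumes that an interface is called bio when its score lies at or below the threshold (in keeping with the negative core–surface call thresholds used above):

```python
def roc_auc(scores, labels):
    """ROC points and trapezoidal AUROC for a score where lower means
    more bio-like. labels: 1 = bio (positive), 0 = xtal (negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # call everything xtal: no true or false positives
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    points.append((1.0, 1.0))  # call everything bio
    points.sort()
    auroc = sum((x2 - x1) * (y1 + y2) / 2  # trapezoidal rule over x in [0, 1]
                for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auroc

# A perfectly separating score yields an AUROC of 1.0:
_, auroc = roc_auc([-1.8, -1.2, 0.3, 0.9], [1, 1, 0, 0])  # auroc == 1.0
```

The two fixed endpoints (0, 0) and (1, 1) correspond exactly to the all-xtal and all-bio extremes discussed above.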
3.2.4 Further Results
In Figure 3.2, the ROC curves for the eighteen amino-acid alphabets from Table 3.1 are
shown. The integration of each ROC curve has also been calculated, and the AUROC
values are displayed in Table 3.2. With the highest AUROC, 0.9536, Alphabet 2 is
confirmed as the best alphabet for the core–surface evolutionary predictor, and the
identical Alphabets 25 and 29, with AUROCs of 0.7602 and 0.7601, respectively, are
confirmed to be the worst. Note that the AUROC values for the two identical
alphabets are not exactly the same. This is due to some stochasticity in the algorithm of
the core–surface evolutionary predictor and will be addressed in more detail in Chapter 4.
Arguably, the most intriguing result of this section is that two different 2-alphabets
manage to achieve both first- and last-place rankings. It would be promising to delve
[Figure: ROC curves, true-positive rate against false-positive rate, one curve per alphabet identifier from Table 3.1.]

Figure 3.2: The ROC curves for EPPIC's core–surface evolutionary predictor for the published alphabets being evaluated in Section 3.2. The alphabet identifiers are those defined in Table 3.1.
Alphabet
Identifier   AUROC
 2           0.9536
 4           0.9091
 6           0.9360
 8           0.9295
10           0.9217
15           0.9357
20           0.9045
21           0.9262
22           0.9048
23           0.9021
24           0.9126
25           0.7602
26           0.9127
27           0.9280
28           0.8921
29           0.7601
30           0.9167
31           0.9263

Table 3.2: The AUROC values for EPPIC's core–surface evolutionary predictor for the published alphabets being evaluated in Section 3.2. The alphabet identifiers are those defined in Table 3.1.
further into the topic of 2-alphabets by seeing how randomly-generated 2-alphabets fare
in comparison.
3.3 Random 2-Alphabets
In order to test the effects of randomly-generated 2-alphabets, EPPIC had to be re-
designed once again. The hardcoded amino-acid alphabets were kept for convenience,
but the necessity of their use was eliminated by allowing for custom alphabets to be de-
fined in EPPIC’s configuration file. Once this was accomplished, it became much easier
to benchmark the software using random 2-alphabets.
3.3.1 Methods
The process begins by randomly generating ten 2-alphabets. These alphabets are listed
in Table 3.3. Note that the alphabets are all fairly balanced: no fewer than five (and,
hence, no more than fifteen) amino acids are in any one category.
Alphabet
Identifier   2-Alphabet
0 ACDEFGIKLPSTVW:HMNQRY
1 ACDEFGIKMNPRST:HLQVWY
2 ACDEFGKLMPQRTVW:HINSY
3 ACDEFGKNRS:HILMPQTVWY
4 ACDEFHIMNPWY:GKLQRSTV
5 ACDEFHKLNPQRSVW:GIMTY
6 ACDEFHKLPQSWY:GIMNRTV
7 ACDEHIKLMPRSTV:FGNQWY
8 ACDELMNPTWY:FGHIKQRSV
9 ACDFGHILQRSWY:EKMNPTV
Table 3.3: Ten randomly-generated 2-alphabets. For each alphabet, the alphabet identifier—which has been reset and now starts from 0—is shown, as well as the alphabet itself.
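Balanced random splits like those in Table 3.3 can be produced by rejection sampling; whether this is how the alphabets above were actually generated is not stated, so the sketch below is only illustrative:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_2alphabet(rng, min_group_size=5):
    """Draw a random two-group split of the twenty amino acids, rejecting
    splits that leave either group with fewer than min_group_size members."""
    while True:
        first = {aa for aa in AMINO_ACIDS if rng.random() < 0.5}
        if min_group_size <= len(first) <= len(AMINO_ACIDS) - min_group_size:
            second = set(AMINO_ACIDS) - first
            return "".join(sorted(first)) + ":" + "".join(sorted(second))

alphabet = random_2alphabet(random.Random(1))
```

The rejection step enforces the balance property noted above: no fewer than five, and hence no more than fifteen, amino acids in either category.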
For each alphabet in Table 3.3, the EPPIC core–surface evolutionary predictor was run
on the BioMany and XtalMany datasets. The burial cutoff used for core-residue as-
signment was 80%. The more-stringent criteria for putative homologues, a minimum of
thirty, each having at least 60% sequence identity to the query, were imposed. Compu-
tations were carried out using the Merlin 4 HPCC at the Paul Scherrer Institute.
3.3.2 Results
Figure 3.3 shows the results of running the EPPIC core–surface evolutionary predictor on
the BioMany and XtalMany datasets for each of the alphabets in Table 3.3. The results
are shown for seven different core–surface call thresholds. No matter the threshold, it
is fairly clear that there is not much variability in the balanced accuracy as a function
of the random 2-alphabet used. Alphabet 4 ranks best, Alphabet 9 follows by a hair’s
breadth, and Alphabet 7 is worst by a fair amount. Just as for the previous section,
the ROC curves for these alphabets can be generated and their AUROC values used to
better summarise the performance.
[Figure: balanced accuracy (0.72–0.86) plotted against alphabet identifier (0–9) for core–surface call thresholds from −1.05 to −0.75.]

Figure 3.3: The effects of random 2-alphabets on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance and the alphabet identifiers are those defined in Table 3.3.
In Figure 3.4, the ROC curves for the ten randomly-generated amino-acid alphabets
from Table 3.3 are shown. The integration of each ROC curve has also been calculated,
and the AUROC values are displayed in Table 3.4. With the highest AUROC, 0.9070,
Alphabet 4 is confirmed as the best of the alphabets in Table 3.3, and Alphabet 7, with
an AUROC of 0.7861, is confirmed to be the worst alphabet.
Although only ten random 2-alphabets were looked at, they were all well-balanced, and
there does not appear to be much variability amongst them, using either the balanced
accuracy or AUROC value as a measure of performance. The average AUROC value
for the random 2-alphabets in this section is 0.8600 ± 0.0119. Why, then, does the 2-
alphabet proposed by Murphy et al. [60] perform so much better than these (0.9536),
and why does the 2-alphabet proposed by Wang and Wang [59] perform so much worse
(0.7602)?
[Figure: ROC curves, true-positive rate against false-positive rate, one curve per alphabet identifier (0–9).]

Figure 3.4: The ROC curves for EPPIC's core–surface evolutionary predictor for the random 2-alphabets being evaluated in Section 3.3. The alphabet identifiers are those defined in Table 3.3.
Alphabet
Identifier   AUROC
0            0.8541
1            0.8706
2            0.8168
3            0.8818
4            0.9070
5            0.8373
6            0.8741
7            0.7861
8            0.8700
9            0.9026

Table 3.4: The AUROC values for EPPIC's core–surface evolutionary predictor for the random 2-alphabets being evaluated in Section 3.3. The alphabet identifiers are those defined in Table 3.3.
3.4 Transitioning 2-Alphabets
In Sections 3.2 and 3.3, it was found that the 2-alphabet proposed by Murphy et al. [60]
performs about 10% better than random 2-alphabets, whereas the 2-alphabet proposed
by Wang and Wang [59] performs about 10% worse than random 2-alphabets. How is it
that two alphabets published by the scientific community could span such a range? In
fact, there are only five differently-classified amino acids between the two alphabets—
alanine, glycine, proline, serine, and threonine. Thus, to mutate one alphabet into
the other requires five one-letter recategorisations. Every alphabet between these two
proposed 2-alphabets—in other words, the power set of these five changes—can be tested
with EPPIC’s core–surface evolutionary predictor.
3.4.1 Methods
All intermediary 2-alphabets can be thought of as a presence or absence of the shift for
each of the five amino acids whose categorisations differ between the alphabets. This
allows for 2⁵ = 32 possible 2-alphabets, which have been listed in Table 3.5. Note that
Alphabet 0 is defined as the 2-alphabet proposed by Murphy et al. [60], while Alphabet V
is defined as the 2-alphabet proposed by Wang and Wang [59]. Each alphabet between
Alphabets 0 and V corresponds to a systematically-interpolated alphabet. For example,
Alphabets 1 through 5 correspond to taking Alphabet 0 and moving one of alanine,
glycine, proline, serine, or threonine, in that order, to the other group.
Alphabet
Identifier   2-Alphabet
0            ACFGILMPSTVWY:DEHKNQR
1            ADEHKNQR:CFGILMPSTVWY
2            ACFILMPSTVWY:DEGHKNQR
3            ACFGILMSTVWY:DEHKNPQR
4            ACFGILMPTVWY:DEHKNQRS
5            ACFGILMPSVWY:DEHKNQRT
6            ADEGHKNQR:CFILMPSTVWY
7            ADEHKNPQR:CFGILMSTVWY
8            ADEHKNQRS:CFGILMPTVWY
9            ADEHKNQRT:CFGILMPSVWY
A            ACFILMSTVWY:DEGHKNPQR
B            ACFILMPTVWY:DEGHKNQRS
C            ACFILMPSVWY:DEGHKNQRT
D            ACFGILMTVWY:DEHKNPQRS
E            ACFGILMSVWY:DEHKNPQRT
F            ACFGILMPVWY:DEHKNQRST
G            ADEGHKNPQR:CFILMSTVWY
H            ADEGHKNQRS:CFILMPTVWY
I            ADEGHKNQRT:CFILMPSVWY
J            ADEHKNPQRS:CFGILMTVWY
K            ADEHKNPQRT:CFGILMSVWY
L            ADEHKNQRST:CFGILMPVWY
M            ACFILMTVWY:DEGHKNPQRS
N            ACFILMSVWY:DEGHKNPQRT
O            ACFILMPVWY:DEGHKNQRST
P            ACFGILMVWY:DEHKNPQRST
Q            ADEGHKNPQRS:CFILMTVWY
R            ADEGHKNPQRT:CFILMSVWY
S            ADEGHKNQRST:CFILMPVWY
T            ADEHKNPQRST:CFGILMVWY
U            ACFILMVWY:DEGHKNPQRST
V            ADEGHKNPQRST:CFILMVWY

Table 3.5: Thirty-two 2-alphabets, the first and last of which correspond to the 2-alphabets proposed by Murphy et al. [60] and Wang and Wang [59], respectively. Since there are only five differences in the categorisation of amino acids between the two 2-alphabets, alphabets between the first and last represent all possible alphabets "between" the two. For each alphabet, the alphabet identifier—which has, again, been reset—is shown, as well as the alphabet itself.
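The power set of the five recategorisations can be enumerated mechanically. The sketch below regenerates all thirty-two intermediate alphabets; note that the group order within an alphabet may differ from Table 3.5, which lists some alphabets with the two groups swapped:

```python
from itertools import product

MURPHY_2 = "ACFGILMPSTVWY:DEHKNQR"  # Alphabet 0
MOVABLE = "AGPST"  # the five residues categorised differently by Wang and Wang [59]

def interpolate(deltas):
    """Move the residues flagged in deltas (aligned with MOVABLE) from
    the first Murphy group to the second, yielding one intermediate 2-alphabet."""
    first, second = (set(group) for group in MURPHY_2.split(":"))
    for aa, moved in zip(MOVABLE, deltas):
        if moved:
            first.remove(aa)
            second.add(aa)
    return "".join(sorted(first)) + ":" + "".join(sorted(second))

# All 2^5 = 32 presence/absence patterns of the five shifts:
alphabets = [interpolate(d) for d in product((0, 1), repeat=5)]
```

With all five residues moved, `interpolate((1, 1, 1, 1, 1))` yields the Wang and Wang grouping, written here with the hydrophobic group first: CFILMVWY:ADEGHKNPQRST.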
For each alphabet in Table 3.5, the EPPIC core–surface evolutionary predictor was run
on the BioMany and XtalMany datasets. The burial cutoff used for core-residue as-
signment was 80%. The more-stringent criteria for putative homologues, a minimum of
thirty, each having at least 60% sequence identity to the query, were imposed. Compu-
tations were carried out using the Merlin 4 HPCC at the Paul Scherrer Institute.
3.4.2 Results
Figure 3.5 shows the results of running the EPPIC core–surface evolutionary predictor
on the BioMany and XtalMany datasets for each of the alphabets in Table 3.5. The
results are shown for seven different core–surface call thresholds. No matter the thresh-
old, the trend is the same: Alphabet 0 fares best, Alphabet V fares worst, and the
transitioning alphabets between them drop off as more amino acids are recategorised.
Just as for previous sections, the ROC curves for these alphabets can be generated and
their AUROC values used to better summarise the performance.
In Figure 3.6, the ROC curves for the thirty-two transitioning 2-alphabets from Table 3.5
are shown. The integration of each ROC curve has also been calculated, and the AUROC
values are displayed in Table 3.6.
Unlike in the previous sections, however, the noteworthy results here are not immediately
visible from either Figure 3.5 or Figure 3.6. It was obvious that there would be some
sort of decrease in performance as Alphabet 0 transitioned into Alphabet V, but the
real problem was to identify which step in the transitioning was the most detrimental.
[Figure: balanced accuracy (0.70–0.85) plotted against alphabet identifier (0–V) for core–surface call thresholds from −1.05 to −0.75.]

Figure 3.5: All possible 2-alphabets that group amino acids in a manner intermediary to the two 2-alphabets proposed by Murphy et al. [60] and Wang and Wang [59] and their effects on the performance of EPPIC's core–surface evolutionary predictor for seven different call thresholds. The balanced accuracy is used as a measure of performance and the alphabet identifiers are those defined in Table 3.5.
[Figure: ROC curves, true-positive rate against false-positive rate, one curve per transitioning 2-alphabet.]

Figure 3.6: The ROC curves for EPPIC's core–surface evolutionary predictor for the transitioning 2-alphabets being evaluated in Section 3.4. The alphabet identifiers are those defined in Table 3.5.
Alphabet
Identifier   AUROC
0            0.9535
1            0.8960
2            0.9392
3            0.9465
4            0.9240
5            0.9287
6            0.8902
7            0.8891
8            0.8730
9            0.8643
A            0.9332
B            0.8975
C            0.9069
D            0.9169
E            0.9255
F            0.9002
G            0.8787
H            0.8569
I            0.8505
J            0.8597
K            0.8528
L            0.8430
M            0.8856
N            0.9000
O            0.8620
P            0.8908
Q            0.8324
R            0.8305
S            0.8066
T            0.8176
U            0.8440
V            0.7601

Table 3.6: The AUROC values for EPPIC's core–surface evolutionary predictor for the transitioning 2-alphabets being evaluated in Section 3.4. The alphabet identifiers are those defined in Table 3.5.
Looking at Figure 3.5, one could note that going from Alphabet 0 to Alphabet 1, which
corresponds to recategorising the alanine, brings with it a sizeable drop in performance.
But what effect does a recategorisation of the alanine have on alphabets that have already
had some other amino acids recategorised? For example, the only difference between
Alphabet M and Alphabet Q is the alanine, but both alphabets also already had glycine,
proline, and serine recategorised. Looking at the difference between Alphabet M and
Alphabet Q in Figure 3.5 shows, as expected, a decrease in performance. In fact, the
decrease in the AUROC value from Alphabet 0 to Alphabet 1 is 0.0575, and the decrease
in the AUROC value from Alphabet M to Alphabet Q is 0.0532. The proximity of these
values to one another acted as inspiration to model the AUROC as a function of the
presence or absence of each part of the full transition from Alphabet 0 to Alphabet V.
3.4.3 Linear Model
A linear model was fit to the data, such that the AUROC value for EPPIC’s core–
surface evolutionary predictor could be predicted given only the status of each amino
acid having been recategorised. Equation (3.1) shows the linear model after training,
where δX is the status of X having been recategorised in the amino-acid alphabet being
used. For example, Alphabet 0 has δA = δG = δP = δS = δT = 0, while Alphabet V has
δA = δG = δP = δS = δT = 1.
AUROC ≈ 0.9672 − 0.0596δA − 0.0255δG − 0.0143δP − 0.0385δS − 0.0368δT (3.1)
Figure 3.7 shows the result of using the linear model to predict the AUROC values, in
blue, which, when compared to the true AUROC values, in pink, seems to match quite
well. Indeed, the coefficient of determination, R², was 0.96. This is a very good result
for a linear model, since the effects of recategorising individual amino acids are certainly
not independent of one another: recategorising one amino acid changes the composition
of the group that every other amino acid shares, so the contributions need not be
additive.
Further conclusions can be drawn by examining the coefficients in Equation (3.1). The
coefficient of largest magnitude is that for alanine, which means that its categorisation is
the most important of the five differences between Alphabet 0 and Alphabet V. The next
most important is serine, followed closely by threonine, which makes sense considering
their similar properties and prevalences [71]. The coefficient for glycine places fourth,
and, finally, proline follows in the least-important position.
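Because the thirty-two alphabets form a full 2⁵ factorial design, the ordinary-least-squares main effects can be recovered without any matrix algebra: each coefficient is simply the difference between the mean AUROC with and without that recategorisation, and the intercept follows from the grand mean. The sketch below refits the model from the AUROC values in Table 3.6; the closed-form shortcut is an illustrative choice (the thesis presumably used a standard least-squares routine), and the δ-vector encoding of which of A, G, P, S, and T are recategorised relative to Alphabet 0 is reconstructed from Table 3.5:

```python
# AUROC values from Table 3.6, keyed by (dA, dG, dP, dS, dT); 1 means the
# residue is recategorised relative to Alphabet 0 (Murphy et al. [60]).
auroc = {
    (0, 0, 0, 0, 0): 0.9535, (1, 0, 0, 0, 0): 0.8960,  # Alphabets 0, 1
    (0, 1, 0, 0, 0): 0.9392, (0, 0, 1, 0, 0): 0.9465,  # Alphabets 2, 3
    (0, 0, 0, 1, 0): 0.9240, (0, 0, 0, 0, 1): 0.9287,  # Alphabets 4, 5
    (1, 1, 0, 0, 0): 0.8902, (1, 0, 1, 0, 0): 0.8891,  # Alphabets 6, 7
    (1, 0, 0, 1, 0): 0.8730, (1, 0, 0, 0, 1): 0.8643,  # Alphabets 8, 9
    (0, 1, 1, 0, 0): 0.9332, (0, 1, 0, 1, 0): 0.8975,  # Alphabets A, B
    (0, 1, 0, 0, 1): 0.9069, (0, 0, 1, 1, 0): 0.9169,  # Alphabets C, D
    (0, 0, 1, 0, 1): 0.9255, (0, 0, 0, 1, 1): 0.9002,  # Alphabets E, F
    (1, 1, 1, 0, 0): 0.8787, (1, 1, 0, 1, 0): 0.8569,  # Alphabets G, H
    (1, 1, 0, 0, 1): 0.8505, (1, 0, 1, 1, 0): 0.8597,  # Alphabets I, J
    (1, 0, 1, 0, 1): 0.8528, (1, 0, 0, 1, 1): 0.8430,  # Alphabets K, L
    (0, 1, 1, 1, 0): 0.8856, (0, 1, 1, 0, 1): 0.9000,  # Alphabets M, N
    (0, 1, 0, 1, 1): 0.8620, (0, 0, 1, 1, 1): 0.8908,  # Alphabets O, P
    (1, 1, 1, 1, 0): 0.8324, (1, 1, 1, 0, 1): 0.8305,  # Alphabets Q, R
    (1, 1, 0, 1, 1): 0.8066, (1, 0, 1, 1, 1): 0.8176,  # Alphabets S, T
    (0, 1, 1, 1, 1): 0.8440, (1, 1, 1, 1, 1): 0.7601,  # Alphabets U, V
}

mean_y = sum(auroc.values()) / len(auroc)

# In a balanced full factorial, each OLS main effect equals the difference
# of conditional means, and the intercept follows from the grand mean.
coef = {}
for i, aa in enumerate("AGPST"):
    on = [y for d, y in auroc.items() if d[i] == 1]
    off = [y for d, y in auroc.items() if d[i] == 0]
    coef[aa] = sum(on) / len(on) - sum(off) / len(off)
intercept = mean_y - 0.5 * sum(coef.values())

def predict(deltas):
    return intercept + sum(c * d for c, d in zip(coef.values(), deltas))

ss_res = sum((y - predict(d)) ** 2 for d, y in auroc.items())
ss_tot = sum((y - mean_y) ** 2 for y in auroc.values())
r_squared = 1 - ss_res / ss_tot
```

Rounded to four decimals, this reproduces the fitted coefficient magnitudes (0.0596 for alanine, 0.0255 for glycine, 0.0143 for proline, 0.0385 for serine, and 0.0368 for threonine, all with negative sign) and an R² close to the reported 0.96.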
[Figure: AUROC (0.75–1.00) plotted against alphabet identifier (0–V), showing the data alongside the linear model's predictions; R² = 0.96.]

Figure 3.7: A linear model used to deconvolve the effects of recategorising each of the five amino acids. The original data (in pink) and the predictions of the model (in blue) are shown for the transitioning 2-alphabets being evaluated in Section 3.4. The alphabet identifiers are those defined in Table 3.5. The coefficient of determination, R² = 0.96.
3.5 Conclusion
In this chapter, reduced amino-acid alphabets published by the scientific community
were tested by running EPPIC's core–surface evolutionary predictor on the BioMany
and XtalMany datasets, with the AUROC values used as measures of classifier
performance. The performance of random 2-alphabets was also
tested, and an interesting finding was that, while random 2-alphabets exhibit little
variability (0.8600 ± 0.0119), the 2-alphabet proposed by Murphy et al. [60] performs much
better (0.9536), and the 2-alphabet proposed by Wang and Wang [59] performs much
worse (0.7602). Despite this, the only difference between the two aforementioned al-
phabets is the categorisation of five amino acids: alanine, glycine, proline, serine, and
threonine. All thirty-two intermediate alphabets were tested, the performance
was shown to depend more-or-less linearly on the amino-acid categorisations,
and alanine was identified as the amino acid whose categorisation was of greatest
significance.
Chapter 4
Surface-Residue Sampling
4.1 Rationale
This chapter deviates from the investigative flow of the thesis but is a necessary explana-
tory tangent, since the following chapter will be predicated on the results from this one.
EPPIC’s core–surface evolutionary predictor was introduced in Subsection 1.4.2, but
was not described in detail.
EPPIC determines the solvent accessibility of residues before and after the formation of
the protein–protein interface, and the ratio—the BSA-to-ASA ratio—is used to classify
them as either “core” (residues fully buried upon interface formation), “rim” (residues
partially buried upon interface formation), “surface” (surface residues not involved in an
interface), or “other” (any non-interface, non-surface residues; for example, residues in
a subunit’s core). The burial cutoffs used in classification are variable parameters. Once
the categories for every residue have been assigned, EPPIC proceeds by searching for ho-
mologues. A minimum number of homologues, each having a minimum sequence identity
to the query, are required for the evolutionary predictor to continue. The homologues
are then input into a clustering algorithm for the purpose of reducing redundancy. This
is done to ensure that each cluster of similar homologues is fairly characterised by one
representative cluster member. These representatives are then aligned with one another
in a large MSA. The information entropy for each column in the MSA is calculated,
making sure to consider the (reduced) amino-acid alphabet of choice. The entropies of
the core residues must now be compared to the entropies of the surface residues, and
this forms the topic for this chapter.
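The per-column entropy calculation just described can be sketched as follows. The default grouping here is the 6-alphabet proposed by Mirny and Shakhnovich [66]; the base-2 logarithm and the skipping of gap characters are assumptions for illustration rather than EPPIC's exact choices:

```python
import math
from collections import Counter

def column_entropy(column, alphabet="ACILMV:DE:FHWY:GP:KR:NQST"):
    """Shannon entropy of one MSA column after mapping each residue to its
    group in a (reduced) amino-acid alphabet."""
    group_of = {aa: i for i, group in enumerate(alphabet.split(":"))
                for aa in group}
    groups = [group_of[aa] for aa in column if aa in group_of]  # skip gaps
    counts = Counter(groups)
    total = len(groups)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

column_entropy("IIVLL-")  # 0.0: every residue falls in one hydrophobic group
column_entropy("DDDKKK")  # 1.0: an even split between the DE and KR groups
```

Reducing the alphabet before this step is precisely what lets conservative substitutions (for example, isoleucine for leucine) contribute no entropy at all.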
4.2 Strategies for Surface-Residue Sampling
The first problem faced when comparing core-residue entropies to surface-residue en-
tropies is that there tend to be far fewer core residues than surface residues. To handle
this, the surface residues have to be sampled at random. The exact method by which
this is carried out is not trivial, a fact which has led to the inclusion of two different
strategies in EPPIC. Subsections 4.2.1 and 4.2.2 describe these two strategies, and their
main difference is summarised in Table 4.1.
                      Residues Added to Pool?
Strategy    Core   Rim   Surface   Other   Interfaces
Classic      ✗      ✗       ✓        ✗         ✗
Novel        ✓      ✓       ✓        ✗         ✓

Table 4.1: A comparison of the two strategies for surface-residue sampling. In this table, "core" and "rim" residues are core and rim residues of the interface of interest, "surface" residues are residues on the surface of the protein, "other" residues are residues that do not fit in any other category (for example, residues in the protein's hydrophobic core), and "interface" residues are residues on the surface that are involved in other interfaces of a minimum surface area. For each strategy and residue type, a ✓ or ✗ is shown to represent whether that strategy is permitted to add residues of that type to the surface-residue pool.
4.2.1 Classic Strategy
The classic strategy was the first to be implemented, and its algorithm for calculating the
core–surface score is as follows. First, a non-zero core size is checked for: if the interface
has no core residues, no prediction is returned. The average entropy of all core residues
is computed. Then, surface residues are added to a pool as long as they are not involved
in an interface (of at least some minimum surface area). It is ensured that a sufficient
number of surface residues exist in the structure: if there are fewer than 1.5 times as
many non-interface surface residues as core residues, the algorithm does not proceed. A
number of surface residues equal to the number of existing core residues are then drawn
from the pool, without replacement, and this sampling is carried out 10 000 times. The
arithmetic mean of each sample is recorded, providing a “distribution” of 10 000 values.
The mean and standard deviation of this distribution are used to compare the surface
residues to the core residues through the Z-score test statistic.
Equation (4.1) defines the Z-score test statistic: the population mean of the core-residue
entropies minus the sample mean of the surface-residue entropies, all divided by the
sample standard deviation of the non-interface surface-residue entropies. Note that the
sample mean and sample standard deviation here are not the usual single-sample
statistics: they are taken over the distribution of 10 000 sample means, which
characterises the surface-residue pool far more thoroughly than any single sample would.
Z = (µ_core − x̄_surface) / s_surface (4.1)
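The sampling scheme just described can be sketched as follows. This is a simplified illustration rather than EPPIC's Java implementation; the 10 000-sample default and the 1.5× pool-size check follow the description above:

```python
import random
import statistics

def core_surface_z(core_entropies, surface_pool, n_samples=10_000, rng=None):
    """Z-score comparing the mean core entropy to a sampling distribution
    built by repeatedly drawing |core| surface entropies without replacement."""
    rng = rng or random.Random()
    n_core = len(core_entropies)
    if n_core == 0 or len(surface_pool) < 1.5 * n_core:
        return None  # no core, or too few surface residues: no prediction
    sample_means = [statistics.fmean(rng.sample(surface_pool, n_core))
                    for _ in range(n_samples)]
    mu_core = statistics.fmean(core_entropies)
    return (mu_core - statistics.fmean(sample_means)) / statistics.stdev(sample_means)

# A core markedly more conserved (lower entropy) than the surface gives a
# strongly negative score, i.e. a bio-like signal.
surface = [1.1, 0.9, 1.3, 1.0, 0.8, 1.2, 1.05, 0.95]
core_surface_z([0.2, 0.3, 0.25], surface, n_samples=1000, rng=random.Random(7))
```

Drawing each sample without replacement, as `random.sample` does, mirrors the description above: every sample is a plausible alternative "core-sized" patch of the surface.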
This classic strategy for surface-residue sampling and score calculation has a few down-
sides. Firstly, it places a lot of weight on the parameters that define what constitutes a
non-interface surface residue. While it is true that the surface itself will generally contain
other interfaces or binding sites, it would be presumptuous to exclude surface residues
simply because they lie in a region that may or may not be part of a biological interface.
Secondly, since this strategy needs to refer to what is considered an interface-related
residue for all other interfaces of the protein, all these objects need to have persistent
references in the data model. This is not an issue when running an instance of EPPIC
from start to finish, but it quickly becomes problematic when working directly with the
database, which will be the case in the next chapter.
4.2.2 Novel Strategy
The novel strategy is vastly simpler than the classic one, but arguably contradicts the
name of the predictor. Instead of meddling with the surface residues in the pool, the
novel strategy leaves them as they are. In fact, it also considers the core residues and
the rim residues to be part of the surface, and the only residues not part of the pool are
those labelled as “other” residues. Of course, the definition of core residues is that they
become buried upon interface formation, which means that they have to be exposed prior
to interface formation and, hence, must also be on the surface. This is the interpretation
of “surface” for this strategy, which is a superset of the meaning of “surface” in residue
classification.
Other than this major difference, things are mostly as they are with the classic strategy.
A non-zero core size is checked for, the core-residue entropies are averaged, all non-other
residues are added to a pool, and, since there are no longer any removals from the pool,
it is ensured that at least 2.0 times (rather than 1.5) as many surface residues are in
the pool as there are core residues in the interface. The sampling is then performed
as before, and Equation (4.1) is used to calculate a test statistic. The test statistic is now closer to a true Z-score: it quantifies how likely it is that a subset of values drawn from the population—in this case, the core of the interface compared to the entire surface of the protein—has a mean that differs statistically from that of the population as a whole.
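The novel sampling strategy described above can be sketched as follows. This is a simplified illustration in Python, not EPPIC's actual Java implementation; the function name, the number of samples, and the exact handling of the sampling distribution are assumptions, while the pool-size check (2.0 times as many surface residues as core residues) and the score of Equation (4.1) are taken from the text.

```python
import random
import statistics

def core_surface_z(core_entropies, surface_entropies, n_samples=1000, seed=0):
    """Score an interface by comparing its core entropies against
    repeated samples from the surface pool, as in Equation (4.1).

    Hypothetical sketch: each sample draws as many surface entropies
    (without replacement) as there are core residues; the means of
    these samples form the reference distribution.
    """
    k = len(core_entropies)
    if k == 0:
        raise ValueError("interface has no core residues")
    # The novel strategy requires at least 2.0 times as many surface
    # residues in the pool as there are core residues in the interface.
    if len(surface_entropies) < 2 * k:
        raise ValueError("too few surface residues to sample from")
    rng = random.Random(seed)
    mu_core = statistics.mean(core_entropies)
    sample_means = [statistics.mean(rng.sample(surface_entropies, k))
                    for _ in range(n_samples)]
    x_bar = statistics.mean(sample_means)
    s = statistics.stdev(sample_means)
    return (mu_core - x_bar) / s
```

A strongly negative score indicates a core that is more conserved (lower entropy) than the surface as a whole, consistent with a biological interface.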
4.3 Methods
For both of the strategies detailed in Section 4.2, the EPPIC core–surface evolutionary
predictor was run on the BioMany and XtalMany datasets. The burial cutoff used for
core-residue assignment was 80%, and the reduced amino-acid alphabet used was the
6-alphabet proposed by Mirny and Shakhnovich [66]. The more-stringent criteria for
putative homologues, a minimum of thirty, each having at least 60% sequence identity
to the query, were imposed. Computations were carried out using the Merlin 4 HPCC
at the Paul Scherrer Institute.
4.4 Results
Presenting the performance of the classifier using the balanced accuracy as a measure
no longer makes sense, since different scoring systems cannot be compared using the
same call thresholds. For this reason, the AUROC values must be used as a measure of
performance, which ensures that the potential of each classifier is being tested, without
bias, across all thresholds.
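As a sketch of how an AUROC value can be computed without committing to any particular call threshold, the rank-based (Mann–Whitney) identity is convenient. This is a generic illustration, not the evaluation code used in this thesis; the function name and label convention are assumptions.

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney identity: the probability that a
    randomly chosen positive outscores a randomly chosen negative,
    with ties counting one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the AUROC averages over all possible thresholds, it allows scoring systems with different score scales to be compared fairly.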
In Figure 4.1, the ROC curves for each surface-sampling strategy in Section 4.2 are
shown. The integration of each ROC curve has also been calculated, and these AUROC
values are displayed in the figure’s legend. The classic strategy for sampling surface
residues resulted in an AUROC value of 0.9338, performing slightly better than the
novel strategy, which obtained an AUROC value of 0.9265. This is an acceptable result:
the slight decrease in performance, which may well be within the noise of the data,
is more than made up for by the simplicity of the strategy, its reduced computational
complexity, and the possibility of tackling cases where the classic strategy has failed
due to an insufficient number of surface residues (for example, in coiled-coil structures).
For this reason, the next chapter, which will involve working directly with the database
behind EPPIC, will adopt the novel strategy for the sampling of surface residues.
4.5 Conclusion
Due to both the sheer number and the variability in character of the residues on a
protein’s surface, EPPIC’s core–surface evolutionary predictor requires some statistical
[Figure: ROC curves; x-axis: False-Positive Rate, y-axis: True-Positive Rate; legend (Surface-Sampling Method): Classic (AUROC = 0.9338), Novel (AUROC = 0.9265).]

Figure 4.1: The ROC curves for EPPIC's core–surface evolutionary predictor for the two different strategies of surface-residue sampling.
sampling to take place before the sequence entropies for the surface residues can be
compared to those of the core residues. The classic strategy for this was to exclude
regions of the surface that could be involved in binding sites or interfaces before sampling,
whereas the novel strategy takes the entire surface of the protein, even the core and rim
residues in the interface of interest, into account. A comparison of these two strategies
in this chapter showed that the classic one outperforms the novel one by a very narrow
margin of 0.78%, which is an acceptable loss to incur when weighed against the gains:
increased simplicity, reduced computational complexity, and handling structures with
few surface residues. Thus, the novel method is adopted in the following chapter of this
thesis.
Chapter 5
2-Alphabets
5.1 Rationale
In Chapter 2, the parameters in EPPIC were optimised based on a maximisation of
performance for the BioMany and XtalMany datasets. Of EPPIC’s built-in amino-acid
alphabets, shown in Table 2.1, the 6-alphabet proposed by Mirny and Shakhnovich [66]
performed best. In Chapter 3, more reduced amino-acid alphabets, which had been pro-
posed by the scientific community, were tested and compared to randomised 2-alphabets.
The results showed that the 2-alphabet proposed by Murphy et al. [60] performs far
better than random 2-alphabets, but the 2-alphabet proposed by Wang and Wang [59]
performs much worse.
The linear model from Subsection 3.4.3 set out to explain this by deconvolving the five
differentially-grouped amino acids—alanine, glycine, proline, serine, and threonine—
between the two 2-alphabets, and it was found that alanine was the amino acid whose
categorisation had the greatest effect on the classifier’s performance. However, this lin-
ear model failed to take into consideration the interdependence of amino-acid categori-
sations: it is the other amino acids with which a category is shared that characterises an
alphabet, not the individual categorisation of each amino acid. For this reason, this final
chapter of the thesis focusses on generating and benchmarking all possible 2-alphabets
and fitting a model of greater precision to the data.
5.2 Methods
Unfortunately, even given the modified version of EPPIC that allows for custom alpha-
bets, any brute-force approach that tests all conceivable amino-acid alphabets would be
an insurmountable task. Therefore, the first step was to engineer a faster way of evalu-
ating 2-alphabets. Working directly with the database that contains the computations
for all current PDB entries would make the procedure much faster, since EPPIC’s pre-
liminary calculations (finding interfaces, calculating ASAs, searching for homologues,
creating MSAs, etc.), normally repeated for each run, would be bypassed. Using an
application programming interface (API) called the Java Persistence API (JPA), a con-
nection to the relational database management system (RDBMS) is made, a list of all
Protein Data Bank identifiers (PDB IDs) in the BioMany and XtalMany datasets is
passed to the database as a set of queries, and the referenced objects are returned.
For each PDB ID in the datasets, three stages of computation need to be carried out:
firstly, acquiring the aligned homologues from the database, adding them to an MSA
structure, and calculating the sequence entropy (with the selected amino-acid alphabet)
for each column; secondly, appropriately mapping the PDB ID’s associated interface
from the dataset to the database, classifying the residues geometrically by their burial
ratio; and, finally, pooling and sampling the residues to calculate the final score for the
interface. Once these scores are available, a bio or xtal call is obtained by comparing
the score to the call threshold. Everything but the sequence-entropy calculation and
sampling is done only once: the MSAs and residue classifications for both entire datasets,
as well as any other required data, are serialised and saved to file. Further amino-acid
alphabet evaluations begin by reading in this serialised file to save time.
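The sequence-entropy calculation over a reduced alphabet can be sketched as follows. The helper names are hypothetical and this is not EPPIC's Java code; the colon-separated alphabet notation and the Murphy et al. 2-alphabet string are taken from this chapter's results (rank 98 in Table 5.1).

```python
import math
from collections import Counter

# Murphy et al. 2-alphabet in the colon-separated notation used in
# this thesis (as listed in Table 5.1).
MURPHY_2 = "ACFGILMPSTVWY:DEHKNQR"

def group_map(alphabet):
    """Map each one-letter residue code to its group index in a
    colon-separated reduced alphabet."""
    return {aa: i for i, grp in enumerate(alphabet.split(":")) for aa in grp}

def column_entropy(column, alphabet):
    """Shannon entropy (in nats) of one MSA column after translating
    residues into the reduced alphabet; gaps and unknown characters
    are skipped."""
    mapping = group_map(alphabet)
    groups = [mapping[aa] for aa in column.upper() if aa in mapping]
    counts = Counter(groups)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Under a 2-alphabet, a column whose residues all fall in one group has zero entropy, while an even split between the two groups gives the maximum of ln 2.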
Due to the time that had elapsed between this experiment and previous experiments,
some PDB IDs had to be removed from the BioMany and XtalMany datasets. 2ELF,
2FRM, and 2O37 were removed because they had been superseded, while 2PYE, 2VLM,
3QH3, 3RFN, 3RY9, 3VVU, 4FQ2, 4L4V, and 4MAY were removed because one or more of
the chains referred to by the datasets had no UniProt sequence mapping. For all other
PDB IDs, the computations were carried out using the Euler HPCC at ETH Zurich.
5.2.1 Stirling Numbers of the Second Kind
Prior to beginning computations for this chapter, even listing all of the potential amino-
acid alphabets proved difficult. To illustrate the complexity of this, even once the number
of categories for the alphabet is chosen, the question of how many amino acids to add
to each category remains, not to mention determining which particular amino acids to
add.
James Stirling, an eighteenth-century Scottish mathematician, discovered the so-called
“Stirling numbers of the second kind”, which define the number of ways to partition a
set of n objects into k non-empty subsets. The explicit formula for their calculation is
shown in Equation (5.1) [72]. These numbers, as well as the formula to generate them,
are very useful for the purposes of this experiment, since amino-acid alphabets are, in
fact, partitions of a twenty-element set into k non-empty subsets, where 2 ≤ k ≤ 20 [72].
\[
S(n, k) = \left\{ {n \atop k} \right\} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^{n}
\tag{5.1}
\]
Figure 5.1 shows all possible partitions of a set of four elements. The topmost partition,
in red, is not applicable to amino-acid alphabets, because k = 1. The green partitions
correspond to partitions where k = 2 with three elements in one subset and one element
in the other subset. The orange partitions also correspond to partitions where k = 2,
but with two elements in each subset. The six purple partitions are partitions where
k = 3, which can only be distributed by placing two elements in one subset and one
element in each of the other two subsets. Finally, the grey partition at the bottom of
the figure is a partition with k = n = 4, which has only one possible distribution: one
element in each subset.
It is clear that, for increasing n, these partitions become increasingly complex. Using
the formula in Equation (5.1), it can be shown that just the 2-alphabets, which must
Figure 5.1: The fifteen partitions of a four-element set (n = 4), ordered in a Hasse diagram. The red, green, orange, purple, and grey partitions correspond to partitions where k = 1, k = 2, k = 2, k = 3, and k = 4, respectively. This figure was adapted from a similar one by Tilman Piesk [73].
have n = 20 and k = 2, amount to 524 287 possibilities [74]. Extending the scope to all
possibilities produces an incredible 51 724 158 235 372 amino-acid alphabets [74]. Clearly,
it would be infeasible to test all of these, and, for this reason, the scope of this chapter
is restricted to the 2-alphabets.
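The counting argument above can be verified with a short sketch: Equation (5.1) computed directly, plus a bitmask enumeration of the 2-alphabets. The enumeration scheme is an assumption for illustration (the thesis does not describe how EPPIC listed them), and the function names are hypothetical.

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind via the explicit formula in
    Equation (5.1); the alternating sum is an exact multiple of k!."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n
               for j in range(k + 1)) // factorial(k)

def two_alphabets(residues="ACDEFGHIKLMNPQRSTVWY"):
    """Yield every 2-alphabet as a colon-separated string.

    Fixing the first residue in group 0 avoids counting each partition
    twice, and mask 0 (an empty second group) is skipped, giving
    2**(n-1) - 1 partitions, i.e. S(n, 2).
    """
    n = len(residues)
    for mask in range(1, 2 ** (n - 1)):
        g0, g1 = [residues[0]], []
        for i in range(1, n):
            (g1 if (mask >> (i - 1)) & 1 else g0).append(residues[i])
        yield "".join(g0) + ":" + "".join(g1)
```

For twenty amino acids this reproduces the 524 287 2-alphabets quoted above, and summing S(20, k) over all k reproduces the quoted total of 51 724 158 235 372 partitions.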
5.3 Results
After writing the program described in Section 5.2, it became possible to evaluate the performance of one amino-acid alphabet in approximately one second (on one
core). After about one week of compute time, all 2-alphabets had had their balanced
accuracies calculated, and the results were quite surprising. The 2-alphabet proposed
by Murphy et al. [60] fares relatively well against all possible 2-alphabets, achieving a
rank of 98. This 2-alphabet, as well as the one proposed by Wang and Wang [59], which
achieved a rank of 513 180, are shown alongside the ten best and ten worst 2-alphabets
in Table 5.1.
5.4 Linear Model
Much like with the results from Section 3.4, a linear model can be fit to the results
of this chapter. As mentioned in Section 5.1, amino-acid categorisations in amino-acid
alphabets are not independent of one another: it is the other amino acids with which
a category is shared that characterises an alphabet, not the individual categorisation
of each amino acid. Therefore, a more elaborate form of linear model was used in this
section. Instead of looking for the presence or absence of a particular amino acid, the
presence or absence of all possible duos of amino acids were considered. Let a duo of
amino acids be defined as two amino acids, and let the presence of any such duo indicate
that the two component amino acids share the same categorisation in the amino-acid
alphabet. For example, the presence of the ST duo—always written in alphabetical
order—would mean that serine and threonine are grouped together in the alphabet in
question.
A linear model was fit to the data, such that the balanced accuracy for EPPIC’s core–
surface evolutionary predictor could be predicted given only the status of each amino-
acid duo’s presence. Equation (5.2) shows the general structure of the linear model,
where αXY is the coefficient of the duo XY after training and δXY its status: 0 if absent,
and 1 if present. α0 is the bias of the model, referring to the theoretical balanced accuracy
that would be present if all amino acids were segregated into their own categories in the
alphabet. Terms in the model whose coefficients had magnitudes of less than 2 × 10⁻³ were removed. Even after these simplifications, the coefficient of determination, R², was 0.96.
\[
\text{Balanced Accuracy} \approx \sum_{X<Y} \alpha_{XY} \delta_{XY} + \alpha_{0}
\tag{5.2}
\]
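A minimal sketch of the duo featurisation and the prediction rule of Equation (5.2). The helper names are hypothetical, and the example coefficients in the usage note are invented for illustration only (the fitted values are those shown in Figure 5.2).

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def duo_features(alphabet):
    """Indicator delta_XY for each of the C(20, 2) = 190 amino-acid
    duos: 1 if X and Y share a group in the reduced alphabet, else 0.
    Duo keys are written in alphabetical order, as in the text."""
    groups = alphabet.split(":")
    return {x + y: int(any(x in g and y in g for g in groups))
            for x, y in combinations(sorted(AMINO_ACIDS), 2)}

def predict_balanced_accuracy(alphabet, coeffs, bias):
    """Equation (5.2): the bias alpha_0 plus the sum of fitted
    coefficients alpha_XY over all duos present in the alphabet."""
    feats = duo_features(alphabet)
    return bias + sum(a * feats[duo] for duo, a in coeffs.items())
```

For a 2-alphabet with groups of sizes 13 and 7, exactly C(13, 2) + C(7, 2) = 99 of the 190 duo indicators are 1, so the model effectively scores which pairs of amino acids an alphabet chooses to merge.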
After training, the coefficients of this model were examined. Figure 5.2 shows the ten
coefficients of highest magnitude on each side of the spectrum, with the duos whose
presences are beneficial in green and the duos whose presences are detrimental in red. It
makes sense that duos composed of similar amino acids—such as isoleucine and valine,
isoleucine and leucine, or leucine and valine—would be beneficial to group together, given
Rank      2-Alphabet               Balanced Accuracy
1         ACDFGILMPSTVWY:EHKNQR    0.9014
2         ACDFGHILMPSTVWY:EKNQR    0.9001
3         ACDFGHIKLMPQRSTVWY:EN    0.8995
4         ACFGIKLMPQRSTVWY:DEHN    0.8988
5         ACDFGHIKLMPQRSTVY:ENW    0.8986
6         ACDFGILMPSTVY:EHKNQRW    0.8985
7         ACDFGHILMNPSTVWY:EKQR    0.8985
8         ACDFGHILMNPSTVY:EKQRW    0.8982
9         ACDFGILMNPSTVWY:EHKQR    0.8982
10        ACDFGHILMPSTVY:EKNQRW    0.8979
...
98        ACFGILMPSTVWY:DEHKNQR    0.8938
...
513 180   ADEGHKNPQRST:CFILMVWY    0.7277
...
524 278   ACDEFGHKLMNPQRSTVY:IW    0.5680
524 279   ACDEGHKLNPQRSTVWY:FIM    0.5664
524 280   ACDEFGHKLNPQRSTVY:IMW    0.5648
524 281   ADEFGHKLMNPQRSTVY:CIW    0.5645
524 282   ACDEGHKLMNPQRSTVWY:FI    0.5642
524 283   ACDEFGHKLMNPQRSTVWY:I    0.5632
524 284   ADEFGHKLNPQRSTVY:CIMW    0.5630
524 285   ADEFGHKLMNPQRSTVWY:CI    0.5629
524 286   ADEFGHKLNPQRSTVWY:CIM    0.5618
524 287   ACDEFGHKLNPQRSTVWY:IM    0.5605

Table 5.1: The ten best and ten worst 2-alphabets, as well as two 2-alphabets proposed by the scientific community for comparison. The rank of each alphabet and the balanced accuracy (resulting from its use in EPPIC's core–surface predictor) are shown alongside the 2-alphabet itself.
[Figure: bar chart; x-axis: Amino-Acid Duo, y-axis: Value of Coefficient (−0.03 to 0.04); beneficial duos: IV, IL, LM, LV, AS, IM, FL, AV, AG, FY; detrimental duos: AK, QR, AD, DN, EK, KQ, EQ, AE, KR, DE.]

Figure 5.2: The ten best (in green) and ten worst (in red) duos of amino acids. For each of these duos, the presence (for the ones in green) or absence (for the ones in red) of the duo—in other words, having the duo's component amino acids in the same or in different groups in the 2-alphabet, respectively—is beneficial to the performance of EPPIC's core–surface predictor across the BioMany and XtalMany datasets.
their functional similarity. It also makes sense that largely-dissimilar amino acids—such
as alanine and lysine, glutamine and arginine, or glutamic acid and lysine—when present
as a duo, are detrimental to the classifier.
The two most detrimental duos, however, ring discordant with the rest. The two
negatively-charged amino acids, aspartic acid and glutamic acid, as well as the two
positively-charged amino acids, lysine and arginine, somehow negatively influence the
classifier when considered together. This finding is counterintuitive, since one would
assume that both of these duos would rank among the most beneficial, considering the
similarities between their constituent amino acids.
Although anecdotal, this result can indeed be demonstrated for the 2-alphabet proposed by Murphy et al. [60], which serves as a known baseline for applying the finding. The proposed alphabet, unchanged, achieves a balanced
accuracy of 0.8938. Moving aspartic acid or glutamic acid to the other group increases
the result to 0.9014 or 0.8959, respectively. Moving both amino acids to the other group,
however, results in a balanced accuracy of 0.8910, similar to the original value.
5.5 Conclusion
In this chapter, the algorithm behind EPPIC was reimplemented using a direct channel
to communicate with the precomputed database, all 524 287 possible 2-alphabets were
found and benchmarked using the core–surface predictor across the BioMany and Xtal-
Many datasets, and a linear model that considered duos of amino acids being present
in the same group was trained. As expected, it was found that, when similar amino
acids were grouped together, the predictions were improved. By the same token, it was
found that grouping dissimilar amino acids together caused detriment to the predictions.
Surprisingly, the DE and KR duos were the most harmful to the predictions, despite the
undeniable similarity between the members of each of them. Future work should address
the basis for this observation.
Chapter 6
Conclusion
6.1 Summary
All three of EPPIC’s predictors—geometric, core–rim evolutionary, and core–surface
evolutionary—have a large number of variable parameters. While the default values for
these parameters were chosen based on an optimisation carried out just prior to the
original release of EPPIC in 2012 [24], this optimisation was not an exhaustive search
and was performed using a small dataset. Two new datasets, BioMany and XtalMany,
compiled computationally by Baskaran et al. [65], allowed for a more thorough analysis of
the effects of parameter values on EPPIC’s predictions. Through this, it was found that
the geometric predictor performs best with a burial cutoff of 90% and a call threshold of
8; that the core–rim evolutionary predictor performs best with the 6-alphabet proposed
by Mirny and Shakhnovich [66], a burial cutoff of 80%, and a call threshold of 0.90; and
that the core–surface evolutionary predictor performs best with the 6-alphabet proposed
by Mirny and Shakhnovich [66], a burial cutoff of 80%, and a call threshold of −0.90.
Additionally, 2-alphabets are surprisingly prominent in the highest-ranking parameter
sets for the evolutionary predictors, a fact that drove much of the remainder of this
thesis toward further investigation of the importance of reduced amino-acid alphabets
on EPPIC’s predictive performance.
Next, reduced amino-acid alphabets published by the scientific community were tested
by running EPPIC’s core–surface evolutionary predictor on the BioMany and XtalMany
datasets and using the AUROC values as measures of classifier performance. The perfor-
mance of random 2-alphabets was also tested, and an interesting finding was that, while
random 2-alphabets exhibit little variability (0.8600± 0.0119), the 2-alphabet proposed
by Murphy et al. [60] performs much better (0.9536), and the 2-alphabet proposed by
Wang and Wang [59] performs much worse (0.7602). Despite this, the only difference
between the two aforementioned alphabets is the categorisation of five amino acids: ala-
nine, glycine, proline, serine, and threonine. All possible “transitioning” 2-alphabets
between these alphabets were tested: the performance was shown to be more-or-less lin-
early dependent on each of the five amino-acid categorisations, and alanine was identified
as the amino acid whose categorisation was of the most significance.
As a bit of a tangent, the thesis then sought to focus on simplifying the method by
which surface residues were sampled for the scoring component of EPPIC’s core–surface
predictor. Due to both the sheer number and the variability in character of the residues
on a protein’s surface, statistical sampling is required before the sequence entropies for
the surface residues can be compared to those of the core residues. In the classic strategy,
regions of the surface that could be involved in binding sites or interfaces were excluded
before sampling, whereas the novel strategy takes the entire surface of the protein into
consideration, even the core and rim residues in the interface of interest. A comparison of
these two strategies in this chapter shows that the classic one outperforms the novel one
by a very narrow margin of 0.78%, which is an acceptable loss to incur when weighed
against the upsides: increased simplicity, reduced computational complexity, and the
possibility of tackling cases where the classic strategy has failed due to an insufficient
number of surface residues (for example, in coiled-coil structures). Thus, the novel
method was adopted for Chapter 5.
In the final chapter, the algorithm behind EPPIC was reimplemented using a direct chan-
nel of communication with the precomputed database. All 524 287 possible 2-alphabets
were then found and benchmarked using the EPPIC core–surface predictor across the
BioMany and XtalMany datasets, and a linear model that considered duos of amino
acids being present in the same group was trained. Confirming expectations, it was
found that grouping similar amino acids together benefited predictions, while grouping
dissimilar amino acids together was decidedly harmful to the predictions. Surprisingly,
the DE and KR duos were the most detrimental to the predictions, despite the undeniable
similarity between the members of each of them.
Appendix A
Hydrogen-Bond Detection
A.1 Hydrogen Bonding
A hydrogen bond is "an attractive interaction between a hydrogen atom from a molecule or a molecular fragment D-H, in which D is more electronegative than H, and an atom or a group of atoms in the same or a different molecule, in which there is evidence of bond formation" [75]. Typically, a hydrogen bond is denoted as D-H···A-AA, where the three dots represent the bond [75]. The name "hydrogen bond" is a bit misleading:
hydrogen bonds are dipole–dipole interactions more than anything else and should not
be confused with true bonds such as covalent bonds. The heavy atom to which the
hydrogen involved in the bond is connected is called the donor atom and is represented
by D. The heavy atom to which the hydrogen is paired—via the hydrogen bond—is
called the acceptor atom and is represented by A. The parent heavy atoms of the donor
and acceptor are represented by DD and AA, respectively.
Hydrogen bonding is a very important phenomenon in protein structure, for it is because
of these interactions that a subunit of a protein remains folded. The primary structure
of the protein—its amino-acid sequence—dictates the secondary structures that form [1].
An alpha helix’s formation is stabilised by hydrogen bonding between the N-H group of
an amino acid and the O=C group of the amino acid four residues earlier [76]. Similarly,
parallel and antiparallel beta strands are held together to form beta sheets by hydrogen
bonding between the N-H group of an amino acid and the O=C group of another on
the adjacent strand [77]. Protein tertiary structure and quaternary structure are also
impacted by hydrogen bonding: energetic minimisation drives proteins to their most
stable conformation in space, given the secondary structures that first formed, and
different subunits form protein–protein interfaces due to interaction partly caused by
hydrogen bonding across the interface [78].
A.2 Rationale
Since hydrogen bonding is so key to protein structure, hydrogen-bond detection would
be very useful in the prediction of biological interfaces. Proteins, Interfaces, Structures
and Assemblies (PISA) [35] already, albeit indirectly, takes hydrogen bonding into con-
sideration, since their algorithm to predict whether an interface is stable in solution
considers the estimated free energy of the interaction. A lower energy supports the des-
ignation of the interface as biological, but the lower energy is, at least in part, caused
by the stability that hydrogen bonds across the interface impart. The Evolutionary
Protein–Protein Interface Classifier (EPPIC) [24] could also use hydrogen bonding to
help in its geometric prediction. If one can detect hydrogen bonds across an interface, it
would be a good indication that the interface is biological. For this reason, this appendix
documents the creation of a hydrogen-bond detector for EPPIC.
A.3 Definition
The first step was to determine what defines a hydrogen bond, which proved to be particularly challenging, since there is so much variation in the definitions of hydrogen
bonding. Arunan et al., in 2011, provided a list of specific criteria to define what consti-
tutes and characterises a hydrogen bond [75]. The downside, however, was that many
of these rules, such as ones based on attributes unobtainable from the PDB file format,
were not possible to implement in EPPIC. In 1984, Baker and Hubbard documented
hydrogen bonding in secondary structure and showed that the main chain of a polypep-
tide can form hydrogen bonds through the N-H and O=C groups, with the O=C group able
to accept two hydrogen bonds [79]. They also noted that “all polar sidechains, with the
exception of tryptophan, can form multiple hydrogen bonds” [79]. But these definitions
still did not address the required geometry for hydrogen bonding. In 2010, Hubbard and
Haider provided some such information: the H···O=C angle is usually between 130° and 170°, with a clear preference at approximately 150°; the H···O distance is always between 1.6 Å and 2.5 Å; and the N-H···O angle is nearly always between 120° and 180° [80].
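A purely geometric screen based on the three ranges quoted above can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not EPPIC's Java code, and, as Section A.4 notes, these ranges alone are too permissive in practice; the HBPLUS criteria add further rules.

```python
import math

def _angle_deg(a, b, c):
    """Angle a-b-c in degrees for three 3-D points (vertex at b)."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def is_candidate_hbond(d, h, a, aa):
    """Screen a putative D-H...A-AA hydrogen bond with the geometric
    ranges quoted in Section A.3: H...A distance in [1.6, 2.5] Angstrom,
    D-H...A angle in [120, 180] degrees, and the angle at the acceptor
    (H...A-AA) in [130, 170] degrees."""
    if not 1.6 <= math.dist(h, a) <= 2.5:
        return False
    if not 120.0 <= _angle_deg(d, h, a) <= 180.0:
        return False
    return 130.0 <= _angle_deg(h, a, aa) <= 170.0
```

Running such a screen over all donor–acceptor pairs spanning an interface would count candidate bonds from one side to the other, which is the quantity of interest for the geometric predictor.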
A.4 Implementation
Using only the three ranges of geometric values from Section A.3 to detect hydro-
gen bonds resulted in far too many being detected. At this point, another approach
was taken. HBPLUS is a software package developed by McDonald and Thornton
in 1994 [81]. They undertook an empirical approach to hydrogen-bond detection by
looking at protein structures in the PDB and creating particular criteria specific to hy-
drogen bonding in proteins. The rules listed in the documentation for HBPLUS [82]
were added to EPPIC, such that, for each interface calculation, hydrogen bonds from
one side of the interface to the other are identified and counted. These are attributed
to the residues involved in the hydrogen bond; these data can be used in the future in EPPIC's geometric predictor and possibly incorporated into a model that combines them with the core-residue count for the interface (and, eventually, even the insight from core-residue decay presented in Appendix B).
Upon validation, however, it appeared that the EPPIC implementation of the HBPLUS
rules did not achieve the same results. The EPPIC implementation output some hydro-
gen bonds that HBPLUS did not find, but HBPLUS tended to find quite a few more than
EPPIC. At this point, another direction was pursued, and it was decided that HBPLUS
should simply be called by EPPIC. Some of EPPIC was recoded to allow the program
to run HBPLUS for the PDB file. HBPLUS’s output is parsed by EPPIC’s internal
code, and the hydrogen bonds are added to the component residues appropriately.
An issue with this implementation is how to ensure that the user of EPPIC also has
HBPLUS installed. Due to licensing restrictions, HBPLUS cannot be packaged with
EPPIC; thus, HBPLUS was implemented as follows: EPPIC checks for HBPLUS (whose
path must be added to the EPPIC parameter file), and, if present, it is run and its output
is parsed. If HBPLUS is not available for use, the EPPIC implementation of HBPLUS’s
hydrogen-bonding criteria is used.
A.5 Conclusion
This appendix’s purpose was to document the steps taken to implement a hydrogen-
bonding detector into EPPIC’s geometric predictor. In the end, EPPIC tries to run
HBPLUS, but, if HBPLUS is unavailable, an alternative implementation of HBPLUS’s
hydrogen-bonding criteria is used within EPPIC. The results with the non-HBPLUS
implementation do not overlap completely (data not shown). More investigation would
be required to determine the cause of the differences between the EPPIC implementation
and the HBPLUS implementation. EPPIC could then be trained to use the hydrogen-
bond information as a criterion for classification. This would likely be built alongside
the core-size discriminator in the geometric classifier (and, eventually, even the insight
from core-residue decay presented in Appendix B).
Appendix B
Core-Residue Decay
B.1 Theory
While experiments were being carried out for Chapter 2, a brief investigation into the
potential for core-residue decay to act as a criterion for interface classification in EPPIC
was undertaken. The theory is simple and goes as follows. The geometric classifier in
EPPIC only considers the number of core residues when predicting whether an interface
is bio or xtal. While performing the parameter optimisation in Subsection 2.3.4, in-
formation was available for core-residue assignment across a spectrum of burial cutoffs.
Recall that core residues are assigned based on their BSA-to-ASA ratio, the proportion
of solvent-accessible surface area that becomes buried upon interface formation. Record-
ing the number of core residues as a function of the burial cutoff allows for a gradient
to be calculated, which, it was theorised, could be telling of how likely the interface
is to be biologically relevant. For example, it could be that, since biologically-relevant
interfaces are conserved by evolution, they have more highly-buried residues involved
in the interface and fewer modestly-buried residues, whereas a crystal contact may well
exhibit a more linear or stochastic dropoff.
Every interface in the BioMany and XtalMany datasets had its number of core residues
computed for each of the following burial cutoffs: 50%, 55%, 60%, 65%, 70%, 75%,
80%, 85%, 90%, 95%, and 100%. Linear models, with the number of core residues as
a function of the burial cutoff, were then fit, one per datapoint. If the coefficient of
determination, R2, was above 0.96, the slope and y-intercept of the model were noted
down. Let the slopes and y-intercepts of these models be termed the slopes and intercepts
of core-residue decay.
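The fitting procedure above can be sketched as follows. The cutoff grid and the R² > 0.96 acceptance threshold are taken from the text; the function names are hypothetical and the least-squares routine is a generic stand-in for whatever fitting code was actually used.

```python
# Burial cutoffs at which the core size was recorded (Appendix B).
CUTOFFS = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75,
           0.80, 0.85, 0.90, 0.95, 1.00]

def linear_fit(xs, ys):
    """Ordinary least squares for y = m*x + b, returning (m, b, r2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    m = sxy / sxx
    b = my - m * mx
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return m, b, 1.0 - ss_res / ss_tot

def decay_features(core_counts, r2_min=0.96):
    """Slope and intercept of core size versus burial cutoff, kept
    only when the fit clears the R^2 threshold; None otherwise."""
    m, b, r2 = linear_fit(CUTOFFS, core_counts)
    return (m, b) if r2 > r2_min else None
```

Interfaces whose core-size decay is strongly non-linear are thus excluded, which is why the decay histograms in Section B.2 contain fewer datapoints than the core-size histogram.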
B.2 Results
Figure B.1 is a histogram that has been included in this appendix for the sake of com-
parison. It shows the distribution of the number of core residues in an interface for a
burial cutoff of 95%, which is the criterion used (with a call threshold of 6) in EPPIC’s
geometric classifier. Note that the datapoints from the XtalMany dataset, shown in red,
are from a visibly-different distribution than the datapoints from the BioMany dataset,
shown in green. After choosing a criterion, the less overlap there is between these two
distributions, the more powerful the classifier can be.
Figure B.2 is a histogram showing the distribution of decay slopes, that is, the distri-
bution of the slopes from the models with R2 > 0.96. Figure B.3 is a similar histogram
showing the distribution of decay intercepts. Note that Figures B.2 and B.3 do not have
as many datapoints as Figure B.1, because only datapoints whose linear model was fit
with an R2 of at least 0.96 had their slopes and y-intercepts added to the histogram.
Upon visual inspection of Figures B.2 and B.3, it is plain to see that there is more
overlap between bio and xtal distributions than there is in Figure B.1. This likely
means that a simple linear discriminator based on either of these criteria would not
[Figure: histogram; x-axis: Number of Core Residues (0 to 40), y-axis: Frequency; bio in green, xtal in red.]

Figure B.1: The distribution of the number of core residues at a burial cutoff of 95% for biological interfaces from the BioMany dataset, shown in green, and crystal contacts from the XtalMany dataset, shown in red.
[Figure: histogram; x-axis: Slope of Linear Fit (−160 to 0), y-axis: Frequency; bio in green, xtal in red.]

Figure B.2: The distribution of the slopes from the linear models of the core size as a function of the burial cutoff for biological interfaces from the BioMany dataset, shown in green, and crystal contacts from the XtalMany dataset, shown in red. Note that the slopes are all negative because increasing the burial cutoff results in a more stringent criterion for core-residue assignment, which decreases the core size.
be able to distinguish between biological interfaces and crystal contacts in the range of
values where the result is not clear. Indeed, Table B.1 enumerates the non-overlapping
bios (as a fraction of all bios), the non-overlapping xtals (as a fraction of all
xtals), and the overlapping bios and xtals (as a fraction of all datapoints) in each
of the histograms. The simple count of core residues has a lower overlap than the other
criteria and has more balance between non-overlapping categories. Both of these are
properties that would help with classification. Unfortunately, there was insufficient time
available to continue exploring options, but it is very possible that, using an approach
Criterion          bio      xtal     bio + xtal
Core Size          0.7052   0.7156   0.2895
Decay Slopes       0.5698   0.6650   0.3767
Decay Intercepts   0.6199   0.7040   0.3329

Table B.1: For three different criteria (the core size, the decay slope, and the
decay intercept), the non-overlapping bios (as a fraction of all bios), non-overlapping
xtals (as a fraction of all xtals), and overlapping bios and xtals (as a fraction of all
datapoints) are listed.
[Figure B.3 — histogram; x-axis: y-Intercept of Linear Fit, y-axis: Frequency]

Figure B.3: The distribution of the y-intercepts from the linear models of the core
size as a function of the burial cutoff for biological interfaces from the BioMany dataset,
shown in green, and crystal contacts from the XtalMany dataset, shown in red.
rooted in machine learning of higher complexity, a combination of these characteristics
could achieve predictive power greater than that of any individual one.
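The overlap bookkeeping summarized in Table B.1 can be sketched as below. This is one plausible reading of the criterion (bin both distributions, then call a datapoint overlapping when its bin contains datapoints of both classes); the function name, the binning scheme, and the bin width are assumptions, not the thesis's actual procedure.

```python
from collections import Counter

def overlap_fractions(bio, xtal, bin_width=1.0):
    """Return (non-overlapping bios / all bios,
               non-overlapping xtals / all xtals,
               overlapping datapoints / all datapoints)."""
    def to_bin(v):
        return int(v // bin_width)

    bio_bins = Counter(to_bin(v) for v in bio)
    xtal_bins = Counter(to_bin(v) for v in xtal)
    shared = set(bio_bins) & set(xtal_bins)  # bins containing both classes
    bio_only = sum(c for b, c in bio_bins.items() if b not in shared)
    xtal_only = sum(c for b, c in xtal_bins.items() if b not in shared)
    both = len(bio) + len(xtal) - bio_only - xtal_only
    return (bio_only / len(bio),
            xtal_only / len(xtal),
            both / (len(bio) + len(xtal)))
```

For instance, bio values [8, 9, 10, 11] against xtal values [1, 2, 3, 9] share only the bin at 9, giving non-overlapping fractions of 0.75 for each class and an overlapping fraction of 0.25.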
B.3 Conclusion
Although the histograms of the decay slopes and intercepts do not show clear linear
separability between the groups, the results remain inconclusive. Further experimen-
tation, perhaps with a larger set of characteristics and with better classifiers,
may yet improve the geometric classifier in EPPIC.
List of Figures
1.1 Composition of RNA and DNA . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Types of Protein–Protein Interactions . . . . . . . . . . . . . . . . . . . . 3
1.3 The Phase Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Crystallographic Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 The Interface-Classification Problem . . . . . . . . . . . . . . . . . . . . . 8
2.1 Effects of Reduced Amino-Acid Alphabets . . . . . . . . . . . . . . . . . . 25
3.1 Published Reduced Amino-Acid Alphabets . . . . . . . . . . . . . . . . . 29
3.2 ROC Curves for Published Alphabets . . . . . . . . . . . . . . . . . . . . 31
3.3 Random Two-Letter Amino-Acid Alphabets . . . . . . . . . . . . . . . . . 33
3.4 ROC Curves for Random 2-Alphabets . . . . . . . . . . . . . . . . . . . . 34
3.5 Transitioning Two-Letter Amino-Acid Alphabets . . . . . . . . . . . . . . 36
3.6 ROC Curves for Transitioning 2-Alphabets . . . . . . . . . . . . . . . . . 37
3.7 Linear Model for Transitioning 2-Alphabets . . . . . . . . . . . . . . . . . 39
4.1 ROC Curves for Surface-Residue Sampling . . . . . . . . . . . . . . . . . . 45
5.1 Hasse Diagram of Set Partitioning . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Ten Best and Ten Worst Amino-Acid Duos . . . . . . . . . . . . . . . . . 52
B.1 Distribution of Core Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
B.2 Distribution of Decay Slopes . . . . . . . . . . . . . . . . . . . . . . . . . 63
B.3 Distribution of Decay Intercepts . . . . . . . . . . . . . . . . . . . . . . . 64
List of Tables
2.1 The Seven Amino-Acid Alphabets of EPPIC . . . . . . . . . . . . . . . . 17
2.2 The Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Top Parameters for Geometric Classifier . . . . . . . . . . . . . . . . . . . 21
2.4 Top Low-Stringency Parameters for Core–Rim Evolutionary Classifier . . 22
2.5 Top High-Stringency Parameters for Core–Rim Evolutionary Classifier . . 22
2.6 Top Low-Stringency Parameters for Core–Surface Evolutionary Classifier 23
2.7 Top High-Stringency Parameters for Core–Surface Evolutionary Classifier 24
3.1 The Sixteen Published Amino-Acid Alphabets . . . . . . . . . . . . . . . . 28
3.2 AUROC Values for Published Alphabets . . . . . . . . . . . . . . . . . . . 31
3.3 The Ten Random Two-Letter Amino-Acid Alphabets . . . . . . . . . . . . 32
3.4 AUROC Values for Random 2-Alphabets . . . . . . . . . . . . . . . . . . 34
3.5 The Thirty-Two Transitioning Two-Letter Amino-Acid Alphabets . . . . 35
3.6 AUROC Values for Transitioning 2-Alphabets . . . . . . . . . . . . . . . . 37
4.1 Comparison of Surface-Sampling Strategies . . . . . . . . . . . . . . . . . 42
5.1 Ten Best and Ten Worst 2-Alphabets . . . . . . . . . . . . . . . . . . . . . 51
B.1 Separability by Geometric Criterion . . . . . . . . . . . . . . . . . . . . . 63
Acknowledgements
First and foremost, I would like to thank Guido, for being such a caring and understand-
ing supervisor, and Jose, for always helpfully answering any semblance of a question I
had for him. Thanks also to Spencer, to Kumaran, to Aleix, and to all my other LBR
colleagues, for making my stay at PSI an enjoyable time. Next, I would like to extend
my thanks to Amedeo, for agreeing to be my internal supervisor; to my father, mother,
and brother, for their unwavering support; and to Deena, for her constant words of en-
couragement and help with editing. Finally, I would like to thank Martin, Petra, Dania,
and Aral, for always being around for those much-needed beers. Merci à tous.